7.3. Text Classification Application: Fake News Detection

  • Author: Johannes Maucher

  • Last update: 24.11.2020

In this notebook conventional Machine Learning algorithms are applied to learn a discriminative model for distinguishing fake from non-fake news.

What you will learn:

  • Access text from .csv files

  • Preprocess text for classification

  • Calculate BoW matrix

  • Apply conventional machine learning algorithms for fake news detection

  • Evaluate classifiers

7.3.1. Access Data

In this notebook a fake-news corpus from Kaggle is used for training and testing Machine Learning algorithms. Download the 3 files and save them in a directory. The path of this directory shall be assigned to the variable path in the following code-cell:

import pandas as pd
path = "/Users/johannes/DataSets/fake-news/"  # directory containing train.csv, test.csv and submit.csv
train = pd.read_csv(path + 'train.csv', index_col=0)
test = pd.read_csv(path + 'test.csv', index_col=0)
test_labels = pd.read_csv(path + 'submit.csv', index_col=0)

The data in the dataframe train is used for training. The dataframe test contains the texts for testing the model, and the dataframe test_labels contains the true labels of the test-texts.

print("Number of texts in train-dataframe: \t",train.shape[0])
print("Number of columns in train-dataframe: \t",train.shape[1])
train.head()
Number of texts in train-dataframe: 	 20800
Number of columns in train-dataframe: 	 4
title author text label
id
0 House Dem Aide: We Didn’t Even See Comey’s Let... Darrell Lucus House Dem Aide: We Didn’t Even See Comey’s Let... 1
1 FLYNN: Hillary Clinton, Big Woman on Campus - ... Daniel J. Flynn Ever get the feeling your life circles the rou... 0
2 Why the Truth Might Get You Fired Consortiumnews.com Why the Truth Might Get You Fired October 29, ... 1
3 15 Civilians Killed In Single US Airstrike Hav... Jessica Purkiss Videos 15 Civilians Killed In Single US Airstr... 1
4 Iranian woman jailed for fictional unpublished... Howard Portnoy Print \nAn Iranian woman has been sentenced to... 1

Extend the test-dataframe with the labels, which are contained in the dataframe test_labels.

test["label"]=test_labels["label"]
print("Number of texts in test-dataframe: \t",test.shape[0])
print("Number of columns in test-dataframe: \t",test.shape[1])
test.head()
Number of texts in test-dataframe: 	 5200
Number of columns in test-dataframe: 	 4
title author text label
id
20800 Specter of Trump Loosens Tongues, if Not Purse... David Streitfeld PALO ALTO, Calif. — After years of scorning... 0
20801 Russian warships ready to strike terrorists ne... NaN Russian warships ready to strike terrorists ne... 1
20802 #NoDAPL: Native American Leaders Vow to Stay A... Common Dreams Videos #NoDAPL: Native American Leaders Vow to... 0
20803 Tim Tebow Will Attempt Another Comeback, This ... Daniel Victor If at first you don’t succeed, try a different... 1
20804 Keiser Report: Meme Wars (E995) Truth Broadcast Network 42 mins ago 1 Views 0 Comments 0 Likes 'For th... 1

7.3.2. Data Selection

In the following code cells, first the number of missing-data fields is determined. Then the information in the columns author, title and text is concatenated into a single string, which is saved in the column total. After this step only the columns total and label are required; all other columns can be removed from the train- and the test-dataframe.

train.isnull().sum(axis=0)
title      558
author    1957
text        39
label        0
dtype: int64
test.isnull().sum(axis=0)
title     122
author    503
text        7
label       0
dtype: int64
train = train.fillna(' ')  # replace missing fields by a blank, so that the string concatenation below works
train['total'] = train['title'] + ' ' + train['author'] + ' ' + train['text']
train = train[['total', 'label']]
train.head()
total label
id
0 House Dem Aide: We Didn’t Even See Comey’s Let... 1
1 FLYNN: Hillary Clinton, Big Woman on Campus - ... 0
2 Why the Truth Might Get You Fired Consortiumne... 1
3 15 Civilians Killed In Single US Airstrike Hav... 1
4 Iranian woman jailed for fictional unpublished... 1
test = test.fillna(' ')
test['total'] = test['title'] + ' ' + test['author'] + ' ' + test['text']
test = test[['total', 'label']]

7.3.3. Preprocessing

The input texts in column total shall be preprocessed as follows:

  • stopwords shall be removed

  • all characters which are neither alpha-numeric nor whitespace shall be removed

  • all characters shall be represented in lower-case.

  • each word shall be replaced by its lemma (base-form)

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re
# Required NLTK resources (download once): nltk.download('punkt'),
# nltk.download('stopwords'), nltk.download('wordnet')
stop_words = stopwords.words('english')
lemmatizer = WordNetLemmatizer()
for index in train.index:
    sentence = train.loc[index,'total']
    # Remove all characters which are neither alpha-numeric nor whitespace
    sentence = re.sub(r'[^\w\s]', '', sentence)
    # Tokenization
    words = nltk.word_tokenize(sentence)
    # Lemmatize, lowercase and remove stopwords (note: the stopword check is applied
    # before lowercasing, so capitalized stopwords such as 'The' are kept)
    words = [lemmatizer.lemmatize(w).lower() for w in words if w not in stop_words]
    filter_sentence = " ".join(words)
    train.loc[index, 'total'] = filter_sentence

First 5 cleaned texts in the training-dataframe:

train.head()
total label
id
0 house dem aide we didnt even see comeys letter... 1
1 flynn hillary clinton big woman campus breitba... 0
2 why truth might get you fired consortiumnewsco... 1
3 15 civilians killed in single us airstrike hav... 1
4 iranian woman jailed fictional unpublished sto... 1

Clean data in the test-dataframe in the same way as done for the training-dataframe above:

for index in test.index:
    sentence = test.loc[index,'total']
    # Remove all characters which are neither alpha-numeric nor whitespace
    sentence = re.sub(r'[^\w\s]', '', sentence)
    # Tokenization
    words = nltk.word_tokenize(sentence)
    # Lemmatize, lowercase and remove stopwords (as above)
    words = [lemmatizer.lemmatize(w).lower() for w in words if w not in stop_words]
    filter_sentence = " ".join(words)
    test.loc[index, 'total'] = filter_sentence

First 5 cleaned texts in the test-dataframe:

test.head()
total label
id
20800 specter trump loosens tongues not purse string... 0
20801 russian warship ready strike terrorist near al... 1
20802 nodapl native american leaders vow stay all wi... 0
20803 tim tebow will attempt another comeback this t... 1
20804 keiser report meme wars e995 truth broadcast n... 1

7.3.4. Determine Bag-of-Word Matrix for Training- and Test-Data

In the code-cells below two different types of Bag-of-Word matrices are calculated. The first type contains the term-frequencies, i.e. the entry in row \(i\), column \(j\) is the frequency of word \(j\) in document \(i\). In the second type, the matrix-entries are not the term-frequencies, but the tf-idf-values.
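With the settings used below (a TfidfTransformer with norm="l2" and the default smooth_idf=True), the tf-idf-value of word \(j\) in document \(i\) is \(tf(i,j) \cdot idf(j)\), where \(idf(j) = \ln\frac{1+n}{1+df(j)} + 1\), \(n\) is the number of training documents and \(df(j)\) is the number of training documents which contain word \(j\). Each row of the resulting matrix is then normalized to unit L2-norm.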

Note that for a given type (term-frequency or tf-idf) a separate matrix must be calculated for training and testing. Since we always pretend that only training-data is known in advance, the matrix-structure, i.e. the columns (= words), depends only on the training-data. This matrix structure is calculated in the line:

count_vectorizer.fit(X_train)

and

tfidf.fit(freq_term_matrix_train),

respectively. An important parameter of the CountVectorizer class is min_df. If an integer value is assigned to this parameter, it is the minimum number of documents in which a word must occur in order to be included in the BoW-matrix; words which occur in fewer documents are disregarded.

The training data is then mapped to this structure by

count_vectorizer.transform(X_train)

and

tfidf.transform(X_train),

respectively.

For the test-data, however, no new matrix-structure is calculated. Instead, the test-data is transformed to the matrix-structure defined by the training data; words which do not occur in the training vocabulary are simply ignored.
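A small toy example (with a hypothetical mini-corpus, not part of the fake-news dataset) illustrates this behaviour as well as the min_df parameter:

from sklearn.feature_extraction.text import CountVectorizer
# hypothetical mini-corpus, only for illustration
train_docs = ["fake news about the election",
              "real news about the economy",
              "fake claims about the economy"]
test_docs = ["fake news about the moon landing"]
cv = CountVectorizer(min_df=2)            # keep only words occurring in at least 2 training documents
cv.fit(train_docs)                        # the vocabulary is defined by the training data only
print(sorted(cv.vocabulary_))             # ['about', 'economy', 'fake', 'news', 'the']
print(cv.transform(test_docs).toarray())  # 'moon' and 'landing' are ignored: [[1 0 1 1 1]]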

X_train = train['total'].values
y_train = train['label'].values
X_test = test['total'].values
y_test = test['label'].values
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

Train BoW-models and transform training-data to BoW-matrix:

count_vectorizer = CountVectorizer(min_df=4)
count_vectorizer.fit(X_train)
freq_term_matrix_train = count_vectorizer.transform(X_train)
tfidf = TfidfTransformer(norm = "l2")
tfidf.fit(freq_term_matrix_train)
tf_idf_matrix_train = tfidf.transform(freq_term_matrix_train)
freq_term_matrix_train.shape   # shape of the sparse matrix; no need to convert it to a dense array
(20800, 55055)
tf_idf_matrix_train.shape
(20800, 55055)

Transform test-data to BoW-matrix:

freq_term_matrix_test = count_vectorizer.transform(X_test)
tf_idf_matrix_test = tfidf.transform(freq_term_matrix_test)

7.3.5. Train a Linear Classifier

Below, a Logistic Regression model is trained. This is just a linear classifier with a sigmoid activation function (softmax in the multi-class case).
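In the binary case at hand the model computes \(P(y=1 \mid x) = \sigma(w^T x + b)\), where \(x\) is the tf-idf vector of a document, \(w\) and \(b\) are the learned weights and bias, and \(\sigma(z)=\frac{1}{1+e^{-z}}\) is the sigmoid function. A document is assigned to class 1 if this probability exceeds 0.5, i.e. if \(w^T x + b > 0\).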

X_train=tf_idf_matrix_train
X_test=tf_idf_matrix_test
#X_train=freq_term_matrix_train
#X_test=freq_term_matrix_test
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

7.3.6. Evaluate the Trained Model

First, the trained model is applied to predict the class of the training-samples:

y_pred_train = logreg.predict(X_train)
y_pred_train
array([1, 1, 1, ..., 0, 1, 1])
from sklearn.metrics import confusion_matrix, classification_report
confusion_matrix(y_train,y_pred_train)
array([[10200,   187],
       [  148, 10265]])

The model’s predictions are compared with the true classes of the training-samples. In the confusion matrix above, the rows correspond to the true classes and the columns to the predicted classes. The classification-report contains the common metrics for evaluating classifiers:

print(classification_report(y_train,y_pred_train))
              precision    recall  f1-score   support

           0       0.99      0.98      0.98     10387
           1       0.98      0.99      0.98     10413

    accuracy                           0.98     20800
   macro avg       0.98      0.98      0.98     20800
weighted avg       0.98      0.98      0.98     20800
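In this report, the precision of a class is the fraction of samples predicted as that class which actually belong to it, the recall is the fraction of samples of that class which are correctly detected, the f1-score is the harmonic mean of precision and recall, and support is the number of samples of that class.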

The output of the classification report shows that the model fits the training data well, since it predicts the training samples with an accuracy of 98%.

However, accuracy on the training-data provides no information on the model’s capability to classify new data. Therefore, the model’s predictions on the test-dataset are calculated below:

y_pred_test = logreg.predict(X_test)
confusion_matrix(y_test,y_pred_test)
array([[1524,  815],
       [1061, 1800]])
print(classification_report(y_test,y_pred_test))
              precision    recall  f1-score   support

           0       0.59      0.65      0.62      2339
           1       0.69      0.63      0.66      2861

    accuracy                           0.64      5200
   macro avg       0.64      0.64      0.64      5200
weighted avg       0.64      0.64      0.64      5200

The model’s accuracy on the test-data is weak: the model is overfitted to the training-data, and it seems that the distribution of the test-data differs significantly from the distribution of the training-data. This hypothesis can be verified by ignoring the data from test.csv and instead splitting the data from train.csv into a train- and a test-partition, as sketched below. In this modified experiment the performance on the test-data is much better, because all texts within train.csv originate from the same distribution.
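A minimal sketch of this modified experiment (assuming the cleaned train dataframe from above is still available; the concrete scores depend on the random split and are not part of the original notebook):

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
# Split the cleaned train.csv texts into a train- and a test-partition (80% / 20%)
X_tr, X_te, y_tr, y_te = train_test_split(
    train['total'].values, train['label'].values, test_size=0.2, random_state=42)
# Fit the BoW- and tf-idf-models on the training partition only
count_vec = CountVectorizer(min_df=4).fit(X_tr)
tfidf_tr = TfidfTransformer(norm="l2").fit(count_vec.transform(X_tr))
# Train the linear classifier and evaluate it on the held-out partition
logreg_cv = LogisticRegression().fit(tfidf_tr.transform(count_vec.transform(X_tr)), y_tr)
print(classification_report(y_te, logreg_cv.predict(tfidf_tr.transform(count_vec.transform(X_te)))))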