CHAPTER 4
Challenging Problems of Fake News Detection
In previous chapters, we introduced how to extract features and build machine learning models
from news content and social context to detect fake news, generally in the standard setting of
binary classification. Because fake news detection is a critical real-world problem, it raises
specific challenges that need to be considered. In addition, recent advances in machine learning
methods such as deep neural networks, tensor factorization, and probabilistic models allow us to
better capture effective features of news from auxiliary information and to deal with specialized
settings of fake news detection.
In this chapter, we discuss several challenging problems of fake news detection. Specifically,
there is a need to detect fake news at an early stage to prevent its further propagation on
social media. Since obtaining the ground truth of fake news is labor intensive and time
consuming, it is important to study fake news detection in a weakly supervised setting, i.e.,
with limited or no labels for training. It is also necessary to understand why a particular piece
of news is classified as fake by machine learning models, since the derived explanation can
provide new insights and knowledge not obvious to practitioners.
4.1 FAKE NEWS EARLY DETECTION
Fake news early detection aims to give early alerts of fake news during the dissemination process
so that actions can be taken to help prevent its further propagation on social media.
4.1.1 A USER-RESPONSE GENERATION APPROACH
We have seen that rich social context information provides effective auxiliary signals for
fake news detection on social media. However, social context information such as user comments
only becomes available after people have already engaged in propagating the fake news.
Therefore, a more practical setting for early fake news detection is to assume that the news
content is the only available information. In addition, we can assume that we have historical
data containing both news contents and user responses, and we can leverage this historical data
to enhance early detection performance on newly emerging news articles that have no user
responses yet.
Let $A = \{a_1, a_2, \ldots, a_N\}$ denote the news corpus, where each document $a_i$ is a
vector of terms in a dictionary of size $d$, and let $C = \{c_1, c_2, \ldots, c_n\}$ represent
the set of user responses. The detection task can be defined as follows: given a news article
$a$, the goal is to predict whether it is fake or not without assuming that the corresponding
user responses exist.
The framework mainly consists of two major components (see Figure 4.1): (1) a convolutional
neural network component to learn the news representation; and (2) a user response generator
that generates auxiliary signals to help detect fake news.
[Figure 4.1 diagram. Labels: Article, TCNN, Convolutional Layer, Average Pooling, Fully
Connected Layer, URG, Latent Distribution, Generative Network, Generated Response, Fully
Connected Layer, Label.]
Figure 4.1: The user-response generation framework for early fake news detection. It consists
primarily of two stages: neural news representation learning and deep user response generator.
Based on [112].
Neural Representation Learning for News
To extract semantic information and learn the representation for news, we can use a two-level
convolutional neural network (TCNN) structure: sentence level and document level. We have
introduced similar feature learning techniques in Section 2.1.3. At the sentence level, we
first derive the sentence representation as the average of the word embeddings of the words in
the sentence. Each sentence in a news article can be represented as a binary indicator vector
$s \in \{0, 1\}^{|T|}$ indicating which words from the vocabulary are present in the sentence.
The sentence representation is then defined by average pooling of the embedding vectors of the
words in the sentence:

$$ v(s) = \frac{W s}{\sum_k s_k}, \tag{4.1} $$

where $W$ is the embedding matrix for all words, and the embedding of each word is obtained
from a skip-gram model [90] pre-trained on all news articles. At the document level, the
news representation is derived by concatenating ($\oplus$) the sentence representations. For a
news piece $a_i$ containing $L$ sentences $S = \{s_1, \ldots, s_L\}$, the news representation
$a_i$ is given by:

$$ a_i = v(s_1) \oplus v(s_2) \oplus \cdots \oplus v(s_L). \tag{4.2} $$

After the news representation is obtained, we can use convolutional neural networks to learn
the representations as in Section 2.1.3.
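As a minimal sketch of Equations (4.1) and (4.2), the snippet below averages the embeddings of the distinct words in each sentence and concatenates the resulting sentence vectors. The function names and the toy embedding table are illustrative assumptions, not trained skip-gram vectors, and plain Python lists are used to keep the example dependency-free.

```python
from typing import Dict, List

Vector = List[float]


def sentence_vector(words: List[str], emb: Dict[str, Vector]) -> Vector:
    """Eq. (4.1): average the embeddings of the distinct words present in a sentence."""
    dim = len(next(iter(emb.values())))
    # dict.fromkeys keeps each word once, mirroring the binary indicator vector s.
    present = [w for w in dict.fromkeys(words) if w in emb]
    if not present:
        return [0.0] * dim  # assumption: a sentence with no known words maps to zeros
    return [sum(emb[w][k] for w in present) / len(present) for k in range(dim)]


def article_vector(sentences: List[List[str]], emb: Dict[str, Vector]) -> Vector:
    """Eq. (4.2): concatenate the sentence vectors in order."""
    out: Vector = []
    for words in sentences:
        out.extend(sentence_vector(words, emb))
    return out


# Toy 2-dimensional embeddings (hypothetical values).
toy_emb = {"fake": [1.0, 0.0], "news": [0.0, 1.0], "spreads": [1.0, 1.0]}
article = [["fake", "news"], ["news", "spreads"]]
a_i = article_vector(article, toy_emb)
print(a_i)  # [0.5, 0.5, 0.5, 1.0]
```

In practice the concatenated vector would be padded or truncated to a fixed number of sentences before being passed to the convolutional layers; that step is omitted here.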
User Response Generator
The goal of the user response generator is to generate user responses that help learn more
effective representations for predicting fake news. We can use a conditional variational
auto-encoder (CVAE) [144] to learn a distribution over user responses conditioned on the
article, which can then be used to generate varying responses sampled from the learned
distribution. Specifically, the CVAE takes the user responses
$C^{(i)} = \{c_{i1}, \ldots, c_{in}\}$ and the news article $a_i$ as input, and aims to
reconstruct each input response $c_{ij}$ conditioned on $a_i$ by learning a latent
representation $z$. The objective is as follows:

$$ \mathbb{E}_{z \sim q_\phi(z \mid c_{ij}, a_i)}\big[-\log p_\theta(c_{ij} \mid z, a_i)\big] + D_{KL}\big(q_\phi(z \mid c_{ij}, a_i) \,\|\, p_\theta(z)\big). \tag{4.3} $$

The first term is the reconstruction error, designed as the negative log-likelihood of the data
reconstructed from the latent variable $z$ under the influence of article $a_i$. The second
term is a regularizer that minimizes the divergence between the encoder distribution
$q_\phi(z \mid c, a)$ and the prior distribution $p_\theta(z)$.
We use the learned representation $a$ from the TCNN as the condition and feed it into the user
response generator (URG) to generate synthetic responses. The responses generated by the URG
are passed through a nonlinear neural network and then combined with the text features
extracted by the TCNN. The final feature vector is then fed into a feed-forward softmax
classifier for classification, as shown in Figure 4.1.
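To make the two terms of Equation (4.3) concrete, the following sketch evaluates them for a single sampled $z$. It rests on assumptions the text does not state: the encoder is taken to be a diagonal Gaussian, the prior a standard normal (so the KL term has a closed form), and the decoder is summarized by the probabilities it assigns to the observed response tokens. All function names are hypothetical.

```python
import math
from typing import List


def gaussian_kl(mu: List[float], log_var: List[float]) -> float:
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ) (assumed prior)."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def reconstruction_nll(token_probs: List[float]) -> float:
    """Negative log-likelihood of the observed response tokens under the decoder."""
    return -sum(math.log(p) for p in token_probs)


def cvae_loss(token_probs: List[float], mu: List[float], log_var: List[float]) -> float:
    """One-sample estimate of Eq. (4.3): reconstruction term plus KL regularizer."""
    return reconstruction_nll(token_probs) + gaussian_kl(mu, log_var)


# Decoder assigns probabilities 0.5 and 0.25 to the two observed tokens;
# the encoder outputs mean (0, 1) with unit variance.
loss = cvae_loss(token_probs=[0.5, 0.25], mu=[0.0, 1.0], log_var=[0.0, 0.0])
```

The KL term vanishes exactly when the encoder output matches the assumed prior, which is why it acts as a regularizer on the latent space.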
4.1.2 AN EVENT-INVARIANT ADVERSARIAL APPROACH
Most existing fake news detection methods perform supervised learning on historical data
collected from different news events. In practice, these methods tend to capture many
event-specific features that are not shared among different news events [165]. Such
event-specific features, though able to help classify posts on verified events, provide limited
help for, or may even hurt, detection on newly emerged events. Therefore, it is important to
learn event-invariant features that are discriminative for detecting fake news on unverified
events. The goal is to design an effective model that removes the nontransferable
event-specific features and preserves the shared event-invariant features among all events to
improve fake news detection performance.
In [165], an event adversarial neural network (EANN) model is proposed to extract transferable,
event-invariant multi-modal feature representations for fake news detection (see Figure 4.2).
EANN mainly consists of three components: (1) the multi-modal feature extractor; (2) the fake
news detector; and (3) the event discriminator. The multi-modal feature extractor cooperates
with the fake news detector to carry out the main task of identifying false news.
Simultaneously, the multi-modal feature extractor tries to fool the event discriminator so that
it learns event-invariant representations.
[Figure 4.2 diagram. Example input text: "Reddit has found a much clearer photo". Labels:
Word Embedding, Text-CNN, Text Feature ($R_T$), VGG-19, vis-fc, Visual Feature ($R_V$),
Concatenation, Multimodal Feature ($R_F$), Multimodal Feature Extractor, Fake News Detector
(pred-fc), Event Discriminator (reversal, adv-fc1, adv-fc2).]
Figure 4.2: The illustration of the event adversarial neural networks (EANN). It consists of
three parts: a multi-modal feature extractor, an event discriminator, and a fake news detector.
Based on [165].
Multi-Modal Feature Extractor
The multi-modal feature extractor aims to extract feature representations from news text and
images using neural networks. We introduced representative techniques for neural textual and
visual feature learning in Sections 2.1.3 and 2.2.3. In [165], CNNs are used to obtain the
textual feature $R_{cnn}$, and the VGG19 neural network is adopted to obtain the image feature
$R_{vgg}$. To enforce a common feature representation for both text and image, we can add
another dense layer ("vis-fc") to map the learned feature representations to the same
dimension:

$$ R_T = \sigma(W_t \cdot R_{cnn}), \qquad R_V = \sigma(W_v \cdot R_{vgg}), \tag{4.4} $$

where $\sigma$ denotes an activation function.
The textual features $R_T$ and the visual features $R_V$ are concatenated to form the
multi-modal feature representation, denoted as $R_F = R_T \oplus R_V$, which is the output of
the multi-modal feature extractor.
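The projection in Equation (4.4) and the subsequent concatenation can be sketched as follows. This is an illustration only: the activation is assumed to be ReLU (the text does not fix it), the weight matrices are toy values, and `dense` and `fuse` are hypothetical helper names.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def dense(W: Matrix, x: Vector) -> Vector:
    """One dense layer without bias: activation(W x), with ReLU as the assumed activation."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W]


def fuse(r_cnn: Vector, r_vgg: Vector, W_t: Matrix, W_v: Matrix) -> Vector:
    """Eq. (4.4) followed by concatenation: R_F = R_T (+) R_V."""
    R_T = dense(W_t, r_cnn)  # text features projected to the shared dimension
    R_V = dense(W_v, r_vgg)  # visual features projected to the same dimension
    return R_T + R_V         # list concatenation


# Toy inputs: a 3-dimensional text feature and a 2-dimensional visual feature,
# both mapped to 2 dimensions.
W_t = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
W_v = [[0.5, 0.5], [1.0, -1.0]]
R_F = fuse([1.0, -2.0, 3.0], [2.0, 4.0], W_t, W_v)
print(R_F)  # [1.0, 0.0, 3.0, 0.0]
```

Projecting both modalities to the same dimension keeps either one from dominating the concatenated vector purely through its size.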
Fake News Detector
The fake news detector deploys a fully connected layer ("pred-fc") with a softmax function to
predict whether a news post is fake or real. It takes the learned multi-modal feature
representation $R_F$ as input. The objective of the fake news detector is to minimize the
cross-entropy loss:

$$ \min L_d(\theta_f, \theta_d) := \min -\mathbb{E}\big[y \log(P_\theta(a)) + (1 - y)\log(1 - P_\theta(a))\big], \tag{4.5} $$

where $a$ is a news post, and $\theta_f$ and $\theta_d$ are the parameters of the multi-modal
feature extractor and the fake news detector, respectively. However, directly optimizing the
loss function in Equation (4.5) only helps detect fake news on events that are already included
in the training data, so the model does not generalize well to new events. Thus, we need to
enable the model to learn more general feature representations that capture the common features
among all events. Such representations should be event-invariant and should not include any
event-specific features.
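As a small worked example of the cross-entropy objective in Equation (4.5), the sketch below averages the loss over a batch of posts. It assumes that $y = 1$ marks a fake post and that the detector outputs the probability of the fake class; the clipping constant `eps` is an added numerical safeguard rather than part of the equation.

```python
import math
from typing import List


def detection_loss(labels: List[int], probs: List[float], eps: float = 1e-12) -> float:
    """Batch estimate of Eq. (4.5): mean binary cross-entropy."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


labels = [1, 0]                                   # one fake post, one real post
good = detection_loss(labels, [0.9, 0.1])         # confident and correct
unsure = detection_loss(labels, [0.5, 0.5])       # equals log(2)
bad = detection_loss(labels, [0.1, 0.9])          # confident and wrong
```

The ordering `good < unsure < bad` reflects how the loss penalizes confident mistakes most heavily.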
Event Discriminator
To learn event-invariant feature representations, we need to remove the uniqueness of each
event in the training data and focus on extracting features that are shared among different
events. To this end, we use an event discriminator, a neural network consisting of two dense
layers, to classify each post into one of $E$ events $(1, \ldots, E)$. The event discriminator
is a classifier trained with the cross-entropy loss:

$$ \min L_e(\theta_f, \theta_e) := \min -\mathbb{E}\left[\sum_{k=1}^{E} \mathbb{1}_{[k = y_e]} \log G_e\big(G_f(a; \theta_f); \theta_e\big)\right], \tag{4.6} $$

where $G_f$ denotes the multi-modal feature extractor, $y_e$ denotes the event label, and
$G_e$ represents the event discriminator. Equation (4.6) can be used to estimate the
dissimilarity of different events' distributions. A large loss means that the distributions of
different events' representations are similar and the learned features are event-invariant.
Thus, in order to remove the uniqueness of each event, we need to maximize the discrimination
loss $L_e$ by seeking the optimal parameter $\theta_f$.
The above idea motivates a minimax game between the multi-modal feature extractor and the event
discriminator. On one hand, the multi-modal feature extractor tries to fool the event
discriminator to maximize the discrimination loss; on the other hand, the event discriminator
aims to discover the event-specific information included in the feature representations in
order to recognize the event. The integration of the three components and the final objective
function will be introduced in the next section.
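The link between a large discrimination loss and event-invariant features can be illustrated numerically. The sketch below evaluates Equation (4.6) on a toy batch for two hypothetical discriminator outputs: one that recognizes events confidently and one that can only guess. Event labels are 0-indexed here for convenience, and the probability tables are made-up values.

```python
import math
from typing import List


def event_loss(event_labels: List[int], event_probs: List[List[float]]) -> float:
    """Batch estimate of Eq. (4.6): mean negative log-probability of the true event."""
    picked = [probs[y] for y, probs in zip(event_labels, event_probs)]
    return -sum(math.log(p) for p in picked) / len(picked)


labels = [0, 1, 2]  # three posts from E = 3 different events
confident = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]]
uniform = [[1 / 3] * 3 for _ in labels]

low = event_loss(labels, confident)  # features still carry event-specific cues
high = event_loss(labels, uniform)   # discriminator is guessing: loss equals log(E)
```

When the feature extractor succeeds in hiding event-specific cues, the discriminator is pushed toward the guessing regime, which is the situation the minimax game is designed to produce.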
4.1.3 A PROPAGATION-PATH MODELING APPROACH
The diffusion paths of fake news and real news can be very different on social media. Rather
than relying only on news content, detection can also use auxiliary information from the
spreaders at the early stage, which could be important for fake news early detection. Existing
detection methods mainly explore temporal-linguistic features extracted from user comments, or
temporal-structural features extracted from propagation paths/trees or networks [80]. However,
user characteristics are more available, reliable, and robust in the early stage of news
propagation than the linguistic and structural features widely used by state-of-the-art
approaches.
Consider a corpus of news pieces $A = \{a_1, a_2, \ldots, a_N\}$, where each document $a_i$ is
a vector of terms in a dictionary of size $d$. Let $U = \{u_1, \ldots, u_n\}$ denote the set of
social media users, where each user is associated with a feature vector $u_i \in \mathbb{R}^k$.
The propagation path is defined as a variable-length multivariate time series
$P(a_i) = \langle \ldots, (u_j, t), \ldots \rangle$, where each tuple $(u_j, t)$ denotes that
user $u_j$ tweets/retweets the news story $a_i$ at time $t$. Since the goal is to perform early
detection of fake news, the designed model should be able to make predictions based on only a
partial propagation path, defined as $P(a_i, T) = \langle (u_j, t) : t < T \rangle$.
This framework consists of three major components (see Figure 4.3): (1) building the
propagation path; (2) learning path representations through RNN and CNN; and (3) predicting
fake news based on path representations.
[Figure 4.3 diagram. Labels: Source Tweet, Retweet, User Vectors ($u_1, u_2, \ldots, u_n$),
GRU (outputs $h_1, h_2, \ldots, h_n$), CNN (outputs $f_1, f_2, \ldots, f_n$), Pooling ($s_R$),
Pooling ($s_C$), Concatenation, Hidden, Label.]
Figure 4.3: The propagation-path based framework for early fake news detection.
Building Propagation Path
The first step in building a propagation path is to identify the users who have engaged in the
propagation process. The propagation path $P(a_i)$ for news piece $a_i$ is constructed by
extracting the
user characteristics from the profiles of the users who post or repost the news piece. Once
$P(a_i)$ is obtained, its length will differ from one news piece to another. Therefore, we
transform all propagation paths to a fixed length $n$, denoted as
$S(a_i) = \langle u_1, \ldots, u_n \rangle$. If there are more than $n$ tuples in $P(a_i)$,
then $P(a_i)$ is truncated and only the first $n$ tuples appear in $S(a_i)$; if $P(a_i)$
contains fewer than $n$ tuples, then tuples are over-sampled from $P(a_i)$ to ensure that the
final length of $S(a_i)$ is $n$.
Learning Path Representations
To learn the representation, both RNNs and CNNs are utilized to preserve the global and local
temporal information [80]. We introduced how to use an RNN to learn temporal representations in
Section 3.3.3; a similar technique can be used here. As shown in Figure 4.3, we obtain the
representation $h_t$ at each timestamp $t$, and the overall RNN representation is computed by
mean pooling over the output vectors $\langle h_1, \ldots, h_n \rangle$ of all timestamps,
i.e., $s_R = \frac{1}{n} \sum_{t=1}^{n} h_t$, which encodes the global variation of user
characteristics.
To encode the local temporal features of user characteristics, we first construct the local
propagation path $U_{t:t+h-1} = \langle u_t, \ldots, u_{t+h-1} \rangle$ for each timestamp $t$,
using a moving window of size $h$ over $S(a_i)$. We then apply a CNN with a convolution filter
to $U_{t:t+h-1}$ to get a scalar feature $c_t$:

$$ c_t = \mathrm{ReLU}(W_f \cdot U_{t:t+h-1} + b_f), \tag{4.7} $$

where ReLU denotes the rectified linear unit activation function, $W_f$ is the convolution
filter, and $b_f$ is a bias term. To map $c_t$ to a latent vector, we can utilize a convolution
filter to transform $c_t$ into a $k$-dimensional vector $f_t$. We thus obtain feature vectors
for all timestamps, $\langle f_1, \ldots, f_{n-h+1} \rangle$, to which we apply a mean pooling
operation to get the overall representation $s_C = \frac{1}{n} \sum_{t=1}^{n-h+1} f_t$, which
encodes the local representations of user characteristics.
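The pooling and windowing steps can be illustrated with a short sketch. The RNN itself is not implemented: `global_representation` simply takes the per-timestamp output vectors as given. For the local branch, the sketch interprets the $k$-dimensional vector $f_t$ as the stacked outputs of $k$ filters of the form in Equation (4.7), which is an assumption about the text, and it keeps the $\frac{1}{n}$ normalization exactly as written. All names and toy values are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]
Window = List[Vector]  # h consecutive user vectors


def global_representation(hidden: List[Vector]) -> Vector:
    """s_R: mean of the RNN output vectors h_1..h_n (taken as given here)."""
    n = len(hidden)
    return [sum(h_t[j] for h_t in hidden) / n for j in range(len(hidden[0]))]


def conv_scalar(window: Window, W_f: Window, b_f: float) -> float:
    """Eq. (4.7): c_t = ReLU(W_f . U_{t:t+h-1} + b_f) for one filter."""
    dot = sum(w * u for w_row, u_row in zip(W_f, window) for w, u in zip(w_row, u_row))
    return max(0.0, dot + b_f)


def local_representation(S: List[Vector], filters: List[Window], biases: List[float],
                         h: int) -> Tuple[List[Vector], Vector]:
    """Slide a window of size h over S and pool the per-window features into s_C."""
    n = len(S)
    feats = [[conv_scalar(S[t:t + h], W, b) for W, b in zip(filters, biases)]
             for t in range(n - h + 1)]
    # Normalized by n, following the formula in the text.
    s_C = [sum(f_t[j] for f_t in feats) / n for j in range(len(filters))]
    return feats, s_C


S = [[1.0], [2.0], [3.0], [4.0]]             # n = 4 users, 1 feature each
filters = [[[1.0], [1.0]], [[1.0], [-1.0]]]  # k = 2 filters over windows of h = 2
feats, s_C = local_representation(S, filters, biases=[0.0, 0.0], h=2)
s_R = global_representation([[1.0, 2.0], [3.0, 4.0]])
```

With these toy values the first filter sums each pair of neighbors and the second is zeroed by the ReLU, so `feats` is `[[3.0, 0.0], [5.0, 0.0], [7.0, 0.0]]`.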
Predicting Fake News
After we learn the representations from propagation paths through both the RNN-based and
CNN-based techniques, we concatenate them into a single vector:

$$ a = [s_R, s_C], \tag{4.8} $$

and then $a$ is fed into a multi-layer ($q$-layer) feed-forward neural network that finally
predicts the class label for the corresponding propagation path:

$$ I_j = \mathrm{ReLU}(W_j I_{j-1} + b_j), \quad \forall j \in [1, \ldots, q], \qquad y = \mathrm{softmax}(I_q), \tag{4.9} $$

where $I_0 = a$, $q$ is the number of hidden layers, $I_j$ is the hidden state of the $j$-th
layer, and $y$ is the output over the set of all possible labels of news pieces.
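Equations (4.8) and (4.9) amount to a small feed-forward pass, sketched below in plain Python. The layer weights are toy values chosen so the result is easy to verify by hand, and `predict` is a hypothetical name; the ReLU is applied at every layer, including the last, as the equations are written.

```python
import math
from typing import List, Tuple

Vector = List[float]
Layer = Tuple[List[Vector], Vector]  # (weight matrix W_j, bias vector b_j)


def softmax(x: Vector) -> Vector:
    """Numerically stable softmax."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def predict(s_R: Vector, s_C: Vector, layers: List[Layer]) -> Vector:
    """Eq. (4.8) then Eq. (4.9): concatenate, apply q ReLU layers, then softmax."""
    I = s_R + s_C  # a = [s_R, s_C]
    for W, b in layers:
        I = [max(0.0, sum(w * x for w, x in zip(row, I)) + b_j)
             for row, b_j in zip(W, b)]
    return softmax(I)


layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),    # identity layer
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),  # contrasts the two inputs
]
y = predict(s_R=[1.0], s_C=[2.0], layers=layers)
```

Here the hidden states are $I_1 = (1, 2)$ and $I_2 = (0, 1)$, so `y` is the softmax of $(0, 1)$ and puts more probability on the second label.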