4.1. FAKE NEWS EARLY DETECTION 59
The textual features $R_T$ and the visual features $R_V$ will be concatenated to form the multi-modal feature representation denoted as $R_F = R_T \oplus R_V$, which is the output of the multi-modal feature extractor.
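As a minimal sketch of the concatenation step, assuming the two feature vectors are available as plain Python lists (the function name `concat_features` and the example values are hypothetical, not taken from the text):

```python
def concat_features(r_t, r_v):
    """Concatenate textual and visual feature vectors into one vector R_F."""
    return list(r_t) + list(r_v)

r_t = [0.2, -0.1, 0.7]  # hypothetical textual features R_T
r_v = [0.5, 0.3]        # hypothetical visual features R_V
r_f = concat_features(r_t, r_v)  # multi-modal representation R_F
```

The dimensionality of the result is the sum of the two input dimensionalities, so any layer consuming the multi-modal representation has to be sized accordingly.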
Fake News Detector
The fake news detector deploys a fully connected layer ("pred-fc") with a softmax function to predict whether a news post is fake or real. The fake news detector takes the learned multi-modal feature representation $R_F$ as the input. The objective function of the fake news detector is to minimize the cross entropy loss as follows:
$$\min L_d(\theta_f, \theta_d) := \min -\mathbb{E}\big[\, y \log(P_\theta(a)) + (1 - y) \log(1 - P_\theta(a)) \,\big], \tag{4.5}$$
where $a$ is a news post, and $\theta_f$ and $\theta_d$ are the parameters of the multi-modal feature extractor and the fake news detector, respectively. However, directly optimizing the loss function in Equation (4.5) only helps detect fake news on events that are already included in the training data, so the model does not generalize well to new events. Thus, we need to enable the model to learn more general feature representations that capture the features common to all events. Such representations should be event-invariant and should not include any event-specific features.
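The detector and its loss can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: the label convention (y = 1 for fake, class index 1 for fake), the helper names, the feature values, and the hand-picked weights are all hypothetical, and the expectation in Equation (4.5) is approximated by a batch mean.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pred_fc(r_f, weights, bias):
    """Fully connected layer producing one logit per class (0 = real, 1 = fake)."""
    return [sum(w * x for w, x in zip(row, r_f)) + b
            for row, b in zip(weights, bias)]

def detection_loss(batch, weights, bias, eps=1e-12):
    """Batch-mean estimate of Eq. (4.5), with y = 1 assumed to mean fake."""
    total = 0.0
    for r_f, y in batch:
        p_fake = softmax(pred_fc(r_f, weights, bias))[1]
        total -= y * math.log(p_fake + eps) + (1 - y) * math.log(1 - p_fake + eps)
    return total / len(batch)

# Hypothetical 3-dimensional feature vectors with labels (1 = fake, 0 = real).
batch = [([0.2, -0.1, 0.7], 1), ([0.5, 0.3, -0.4], 0)]

# Untrained layer: all-zero weights give both classes probability 0.5.
zero_w = [[0.0] * 3, [0.0] * 3]
untrained_loss = detection_loss(batch, zero_w, [0.0, 0.0])

# Hand-picked (not learned) weights that separate the two example posts.
fitted_w = [[1.0, 1.0, -2.0], [-1.0, -1.0, 2.0]]
fitted_loss = detection_loss(batch, fitted_w, [0.0, 0.0])
```

With all-zero weights every post receives probability 0.5, so the loss equals log 2. Weights that separate the two example posts drive the loss below that value, which is what minimizing the objective over the parameters aims for.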
Event Discriminator
To learn event-invariant feature representations, we need to remove the uniqueness of each event in the training data and focus on extracting features that are shared among different events. To this end, we use an event discriminator, which is a neural network consisting of two dense layers, to classify each post into one of $E$ events $(1, \dots, E)$. The event discriminator is a classifier and deploys the cross entropy loss function as follows:
$$\min L_e(\theta_f, \theta_e) := \min -\mathbb{E}\left[\, \sum_{k=1}^{E} \mathbb{1}_{[k = y_e]} \log\big(G_e(G_f(a; \theta_f); \theta_e)\big) \right], \tag{4.6}$$
where $G_f$ denotes the multi-modal feature extractor, $y_e$ denotes the predicted event label, and $G_e$ represents the event discriminator. Equation (4.6) can estimate the dissimilarities of different events' distributions. A large loss means the distributions of different events' representations are similar and the learned features are event-invariant. Thus, in order to remove the uniqueness of each event, we need to maximize the discrimination loss $L_e$ by seeking the optimal parameter $\theta_f$.
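To see why a large discrimination loss signals event-invariant features, the loss can be evaluated on two hypothetical sets of discriminator outputs. The sketch below assumes the discriminator's softmax probabilities are already given, indexes events from 0 rather than 1, and approximates the expectation by a batch mean; the function name `event_loss` and all numbers are illustrative only.

```python
import math

def event_loss(event_probs, event_labels, eps=1e-12):
    """Batch-mean estimate of Eq. (4.6) from discriminator output probabilities."""
    total = 0.0
    for probs, y_e in zip(event_probs, event_labels):
        # The indicator 1[k = y_e] keeps only the log-probability of event y_e.
        total -= math.log(probs[y_e] + eps)
    return total / len(event_labels)

labels = [0, 1, 2]  # three posts, one from each of E = 3 events

# Features that still carry event-specific cues: the discriminator is confident.
confident = [[0.98, 0.01, 0.01],
             [0.01, 0.98, 0.01],
             [0.01, 0.01, 0.98]]

# Features with no event-specific cues: the discriminator can only guess.
uniform = [[1 / 3, 1 / 3, 1 / 3] for _ in labels]

low_loss = event_loss(confident, labels)
high_loss = event_loss(uniform, labels)
```

When the discriminator can only guess uniformly among the three events, the loss rises to log 3. This is the regime the feature extractor pushes toward when it maximizes the discrimination loss.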
The above idea motivates a minimax game between the multi-modal feature extractor and the event discriminator. On one hand, the multi-modal feature extractor tries to fool the event discriminator to maximize the discrimination loss, and on the other hand, the event discriminator aims to discover the event-specific information included in the feature representations to recognize the event. The integration process of the three components and the final objective function will be introduced in the next section.