4.2. WEAKLY SUPERVISED FAKE NEWS DETECTION 65
seeks to find a subset of news articles that frequently cluster together in different configurations of the tensor decomposition of the previous step. The intuition is that news articles that tend to frequently appear near each other across different rank configurations, while having the same ranking within their latent factors, are more likely to ultimately belong to the same category. The ranking of a news article with respect to a latent factor is derived by simply sorting the coefficients of that latent factor, which correspond to the clustering membership of the news articles in the latent factor.
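As a concrete illustration, the per-factor ranking described above can be computed by sorting each latent factor's coefficients in descending order. The small factor matrix below is a made-up stand-in for the article-mode factor matrix produced by one tensor decomposition:

```python
import numpy as np

# Hypothetical article-mode factor matrix from one tensor decomposition:
# rows = news articles (N = 5), columns = latent factors (k = 3).
A = np.array([
    [0.9, 0.1, 0.3],
    [0.8, 0.2, 0.1],
    [0.1, 0.7, 0.6],
    [0.2, 0.9, 0.4],
    [0.3, 0.2, 0.8],
])

# Rank the articles within each latent factor by sorting its coefficients
# in descending order; ranks[i, j] is the position of article i in factor j.
order = np.argsort(-A, axis=0)            # article indices, largest coefficient first
ranks = np.empty_like(order)
for j in range(A.shape[1]):
    ranks[order[:, j], j] = np.arange(A.shape[0])

print(ranks)
```

Articles that keep similar rank positions when this is repeated for every decomposition in the ensemble are candidates for the same co-cluster.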
To this end, we can combine the clustering results of each individual tensor decomposition into a collective (news-article by latent-factor) matrix, from which we are going to extract co-clusters of news articles and the corresponding latent factors (coming from the ensemble of decompositions). For example, as shown in Figure 4.5, we can perform the tensor decomposition three times with different ranks 3, 4, and 5, and then construct a collective feature matrix $\mathbf{F}'$. The co-clustering objective with $\ell_1$-norm regularization for the combined matrix $\mathbf{F}'$ [103] is shown as follows:
$$\min_{\mathbf{R}, \mathbf{Q}} \left\| \mathbf{F}' - \mathbf{R}\mathbf{Q}^\top \right\|_F^2 + \left( \|\mathbf{R}\|_1 + \|\mathbf{Q}\|_1 \right), \qquad (4.12)$$

where $\mathbf{R} \in \mathbb{R}^{N \times k}$ is the representation matrix of news articles, $\mathbf{Q} \in \mathbb{R}^{M \times k}$ is the coding matrix, and the term $(\|\mathbf{R}\|_1 + \|\mathbf{Q}\|_1)$ enforces the sparsity constraints.
4.2.3 A PROBABILISTIC GENERATIVE UNSUPERVISED APPROACH
Existing work on fake news detection is mostly based on supervised methods. Although they have shown some promising results, these supervised methods suffer from a critical limitation: they require a reliably pre-annotated dataset to train a classification model. However, obtaining a large number of annotations is time-consuming and labor-intensive, as the process needs careful checking of news contents as well as other additional evidence such as authoritative reports.
The key idea is to extract users' opinions on the news by exploiting the auxiliary information of the users' engagements with the news tweets on social media, and to aggregate their opinions in a well-designed unsupervised way to generate the estimation results [174]. As news propagates, users engage with it through different types of behaviors on social media, such as publishing a news tweet, or liking, forwarding, or replying to a news tweet. This information can, to a certain extent, reflect the users' opinions on the news. For example, Figure 4.6 shows two news tweet examples regarding the aforementioned news. According to the users' tweet contexts, we see that the user in Figure 4.6a disagreed with the authenticity of the news, which may indicate the user's high credibility in identifying fake news. On the other hand, it appears that the user in Figure 4.6b falsely believed the news or intentionally spread the fake news, implying the user's deficiency in the ability to identify fake news. Besides, as for other users who engaged with these tweets, it is likely that the users who liked/retweeted the first tweet also doubted the news, while those who liked/retweeted the second tweet may also have been deceived by the news. The users'