80 A. DATA REPOSITORY
• BS Detector:⁴ This dataset is collected from a browser extension called BS Detector, developed
for checking news veracity. It searches all links on a given web page for references to
unreliable sources by checking against a manually compiled list of domains. The labels are
the outputs of the BS Detector, rather than of human annotators.
• CREDBANK:⁵ This is a large-scale crowd-sourced dataset [91] of around 60 million
tweets that cover 96 days starting from October 2015. The tweets are related to over
1,000 news events. Each event is assessed for credibility by 30 annotators from Amazon
Mechanical Turk.
• BuzzFace:⁶ This dataset [120] is collected by extending the BuzzFeed dataset with comments
related to news articles on Facebook. The dataset contains 2,263 news articles and
1.6 million comments discussing news content.
• FacebookHoax:⁷ This dataset [147] comprises information about posts from Facebook
pages devoted to scientific news (non-hoax) and from conspiracy pages (hoax), collected
using the Facebook Graph API. The dataset contains 15,500 posts from 32 pages (14 conspiracy
and 18 scientific) with more than 2,300,000 likes.
• NELA-GT-2018:⁸ This dataset collects articles published from February 2018 through November
2018 by 194 news and media outlets, including mainstream, hyper-partisan, and
conspiracy sources, resulting in 713k articles. The ground truth labels are integrated from
eight independent assessments.
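The link-checking procedure described for BS Detector in the list above can be illustrated with a minimal sketch. This is not the extension's actual code: the function names, the domain normalization, and the two blocklist entries are hypothetical and stand in for the tool's manually compiled list of domains.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical blocklist for illustration only; the real tool relies on
# its own manually compiled list of unreliable domains.
UNRELIABLE_DOMAINS = {"example-hoax.test", "fake-daily.test"}


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def normalize_domain(url):
    """Return the lowercase host of a URL without a leading 'www.'."""
    host = urlparse(url).hostname or ""  # relative links have no host
    return host[4:] if host.startswith("www.") else host


def flag_unreliable_links(html, blocklist=UNRELIABLE_DOMAINS):
    """Return the links in the page whose domain is on the blocklist."""
    collector = LinkCollector()
    collector.feed(html)
    return [link for link in collector.links
            if normalize_domain(link) in blocklist]


page = (
    '<p><a href="https://www.example-hoax.test/story">a</a> '
    '<a href="https://news.example.org/x">b</a> '
    '<a href="/about">c</a></p>'
)
flagged = flag_unreliable_links(page)
```

Because such labels come from a domain lookup rather than from an assessment of each article, every link to a listed domain is flagged regardless of the content of the linked page.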
From Table A.1, we observe that no existing public dataset provides all of the features
of news content, social context, and spatiotemporal information. Existing datasets have
limitations that we try to address in our data repository. For example, BuzzFeedNews
contains only headlines and text for each news piece and covers news articles from very few
news agencies. The LIAR dataset contains mostly short statements, rather than entire news
articles, with meta attributes. BS Detector data is collected and annotated by a news
veracity checking tool, rather than by human expert annotators. The CREDBANK dataset
was originally collected for evaluating tweet credibility; its tweets are not related to
fake news articles and hence cannot be effectively used for fake news detection. The
BuzzFace dataset has basic news content and social context information, but it does not
capture temporal information. The FacebookHoax dataset consists of very few instances about
conspiracy theories and scientific news.
To address the disadvantages of existing fake news detection datasets, the proposed
FakeNewsNet repository collects multi-dimensional information from news content, social context,
⁴ https://github.com/bs-detector/bs-detector
⁵ http://compsocial.github.io/CREDBANK-data/
⁶ https://github.com/gsantia/BuzzFace
⁷ https://github.com/gabll/some-like-it-hoax
⁸ https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/ULHLCB