Data and R code for the analysis supporting this Oct. 25, 2018 BuzzFeed News post analyzing engagement with English-language tweets from the Kremlin-backed Internet Research Agency. Supporting files are in this GitHub repository.
On Oct. 16, 2018, Twitter released a trove of data on accounts it had flagged as Russian trolls and shut down. Data on most of the 3.26 million English-language tweets from these accounts was released in July through FiveThirtyEight by researchers Darren Linvill and Patrick Warren of Clemson University in South Carolina. They also categorized the trolls into account categories including LeftTroll and RightTroll. Many of the Left Trolls posed as liberal supporters of Black Lives Matter, tweeting about aspects of black culture as well as politics. Right Trolls posed as supporters of Donald Trump.
Download the Twitter data on Internet Research Agency tweets from here, unzip, and place the file ira_tweets_csv_hashed.csv in the data folder. Run the script data_prep.R to download the Clemson data, filter the Twitter data for English-language tweets only, join the Clemson account categories to the Twitter data, and save the result as the file en_ira_tweets.csv in the data folder.
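The preparation steps described above can be sketched on toy data. This is not the repository's data_prep.R, only a minimal illustration of the filter-then-join logic. The toy values, the author column name for the Clemson handles, the case-normalized match, and the join_key helper column are all assumptions made for the example.

```r
library(dplyr)

# Toy stand-in for the Twitter release (values are made up)
ira_tweets <- tibble(
  tweetid = c("1", "2", "3"),
  user_screen_name = c("troll_a", "troll_b", "troll_a"),
  tweet_language = c("en", "ru", "en")
)

# Toy stand-in for the Clemson categories; the column names are assumptions
clemson_categories <- tibble(
  author = c("TROLL_A", "TROLL_B"),
  account_category = c("LeftTroll", "NonEnglish")
)

en_ira_tweets <- ira_tweets %>%
  # keep English-language tweets only
  filter(tweet_language == "en") %>%
  # hypothetical helper: normalize case so handles match across the two sources
  mutate(join_key = toupper(user_screen_name)) %>%
  # attach the account category to each tweet
  left_join(clemson_categories, by = c("join_key" = "author")) %>%
  select(-join_key)

en_ira_tweets
```

A left join keeps every English-language tweet even when its account has no Clemson category, in which case account_category is NA.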
- tweetid: Tweet identification number.
- userid: User identification number (anonymized for users which had fewer than 5,000 followers at the time of suspension).
- user_display_name: Name of the user (encoded as userid for anonymized users).
- user_screen_name: Twitter handle of the user (encoded as userid for anonymized users).
- user_reported_location: User’s self-reported location.[1]
- user_profile_description: User’s profile description.[1]
- user_profile_url: User’s profile URL.[1]
- follower_count: Number of accounts following the user.[1]
- following_count: Number of accounts followed by the user.[1]
- account_creation_date: Date of user account creation.
- account_language: Language of the account as chosen by the user.
- tweet_language: Language of the tweet.
- tweet_text: Text of the tweet (mentions of anonymized accounts have been replaced with anonymized userid).
- tweet_time: Time when the tweet was published (UTC).
- tweet_client_name: Name of the client app used to publish the tweet.
- in_reply_to_tweetid: The tweetid of the original tweet that this tweet is in reply to (for replies only).
- in_reply_to_userid: The userid of the original tweet that this tweet is in reply to (for replies only).
- quoted_tweet_tweetid: The tweetid of the original tweet that this tweet is quoting (for quotes only).
- is_retweet: Is this tweet a retweet (“true” or “false”)?
- retweet_userid: For retweets, userid that authored the original tweet.
- retweet_tweetid: For retweets, tweetid of the original tweet.
- latitude: Geolocated latitude, if available.
- longitude: Geolocated longitude, if available.
- quote_count: Number of tweets quoting this tweet.
- reply_count: Number of tweets replying to this tweet.
- like_count: Number of likes that this tweet received.[2]
- retweet_count: Number of retweets that this tweet received.[2]
- hashtags: List of hashtags used in this tweet.
- urls: List of urls used in this tweet.
- user_mentions: List of userids mentioned in this tweet (includes anonymized userids).
- poll_choices: If a tweet included a poll, this field displays the poll choices separated by “|”.
- account_category: “LeftTroll”, “RightTroll”, “NewsFeed”, “HashtagGamer”, “NonEnglish”, “Fearmonger”, “Unknown”. See Linvill and Warren’s research paper for definitions.

[1] At the time of account suspension.
[2] These engagement counts exclude engagements from users suspended, deleted, or otherwise sanctioned by Twitter at the time of the data release.
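Because the display name and handle of anonymized users are encoded as their userid, anonymized accounts can be flagged by comparing those fields. The following is a minimal sketch on made-up rows; the is_anonymized column is a hypothetical name and the userid strings are placeholders, not real values from the data.

```r
library(dplyr)

# Toy accounts; the "hash" strings stand in for anonymized userids (made up)
accounts <- tibble(
  userid = c("1111111111", "a1b2c3hash", "d4e5f6hash"),
  user_screen_name = c("some_handle", "a1b2c3hash", "d4e5f6hash"),
  follower_count = c(12000, 300, 4100)
)

accounts_flagged <- accounts %>%
  # hypothetical flag: the handle was replaced by the userid
  mutate(is_anonymized = user_screen_name == userid)

# number of anonymized accounts in the toy data
n_anonymized <- sum(accounts_flagged$is_anonymized)
n_anonymized
```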
Load required packages and tweets data; extract date elements from timestamps.
# load required packages
library(readr)
library(dplyr)
library(lubridate)
library(ggplot2)
library(scales)
library(DT)
# Set default timezone for session to UTC
Sys.setenv(TZ = "UTC")
# load data, extract date elements from timestamps
en_ira_tweets <- read_csv("data/en_ira_tweets.csv", col_types = cols(tweetid = col_character())) %>%
  mutate(tweet_date = as.Date(tweet_time),
         tweet_month = month(tweet_time),
         tweet_year = year(tweet_time))
# count tweets per month for each account category
tweets_category_month <- en_ira_tweets %>%
  group_by(tweet_year, tweet_month, account_category) %>%
  count() %>%
  filter(tweet_year >= 2014) %>% # there were very few tweets in earlier years
  mutate(date = as.Date(paste0(tweet_year, "-", tweet_month, "-15"))) # allows plotting in middle of month on date axis
# plot monthly tweet counts by account category, marking the 2016 election
ggplot(tweets_category_month, aes(x = date, y = n, color = account_category)) +
  geom_point() +
  geom_line() +
  xlab("") +
  ylab("Tweets") +
  geom_hline(yintercept = 0, size = 0.3) +
  geom_vline(xintercept = as.numeric(as.Date("2016-11-08")), linetype = "dotted") +
  annotate("text",
           x = as.Date("2016-07-01"),
           y = 160000,
           label = "Election",
           family = "BasierSquare-SemiBold",
           size = 4.5) +
  theme_minimal(base_size = 16, base_family = "BasierSquare-SemiBold") +
  scale_y_continuous(labels = comma) +
  scale_color_brewer(palette = "Set1", name = "") +
  theme(legend.position = "top")
# total and average retweets by account category, all tweets
retweets_category <- en_ira_tweets %>%
  group_by(account_category) %>%
  summarize(tweets = n(),
            retweets = sum(retweet_count, na.rm = TRUE),
            mean_retweets = round(mean(retweet_count, na.rm = TRUE))) %>%
  arrange(-retweets) %>%
  mutate(percent = round(retweets / sum(retweets) * 100, 2),
         retweets = format(retweets, big.mark = ","),
         tweets = format(tweets, big.mark = ","))
datatable(retweets_category, colnames = c("Account category", "Tweets", "Retweet count", "Avg. per tweet", "% of total"))
# retweets by account category for the year before the 2016 election
pre_election <- en_ira_tweets %>%
  filter(tweet_date >= as.Date("2015-11-08") & tweet_date <= as.Date("2016-11-07"))
retweets_category_pre <- pre_election %>%
  group_by(account_category) %>%
  summarize(tweets = n(),
            retweets = sum(retweet_count, na.rm = TRUE),
            mean_retweets = round(mean(retweet_count, na.rm = TRUE))) %>%
  arrange(-retweets) %>%
  mutate(percent = round(retweets / sum(retweets) * 100, 2),
         retweets = format(retweets, big.mark = ","),
         tweets = format(tweets, big.mark = ","))
datatable(retweets_category_pre, colnames = c("Account category", "Tweets", "Retweet count", "Avg. per tweet", "% of total"))
# retweets by account category for the year after the 2016 election
post_election <- en_ira_tweets %>%
  filter(tweet_date >= as.Date("2016-11-09") & tweet_date <= as.Date("2017-11-08"))
retweets_category_post <- post_election %>%
  group_by(account_category) %>%
  summarize(tweets = n(),
            retweets = sum(retweet_count, na.rm = TRUE),
            mean_retweets = round(mean(retweet_count, na.rm = TRUE))) %>%
  arrange(-retweets) %>%
  mutate(percent = round(retweets / sum(retweets) * 100, 2),
         retweets = format(retweets, big.mark = ","),
         tweets = format(tweets, big.mark = ","))
datatable(retweets_category_post, colnames = c("Account category", "Tweets", "Retweet count", "Avg. per tweet", "% of total"))
# total and average retweets by account
retweets_account <- en_ira_tweets %>%
  group_by(userid, user_display_name, user_screen_name, account_category, user_profile_description) %>%
  summarize(followers = max(follower_count, na.rm = TRUE),
            tweets = n(),
            retweets = sum(retweet_count, na.rm = TRUE),
            mean_retweets = round(mean(retweet_count, na.rm = TRUE))) %>%
  arrange(-retweets) %>%
  ungroup() %>%
  mutate(percent = round(retweets / sum(retweets) * 100, 2),
         followers = format(followers, big.mark = ","),
         tweets = format(tweets, big.mark = ","),
         retweets = format(retweets, big.mark = ",")) %>%
  select(-userid)
datatable(retweets_account, colnames = c("Display name", "Screen name", "Account category", "Profile description", "Followers", "Tweets", "Retweet count", "Avg. per tweet", "% of total"))
# the 1,000 most retweeted tweets
top_tweets <- en_ira_tweets %>%
  arrange(-retweet_count) %>%
  head(1000) %>%
  select(user_display_name, user_screen_name, account_category, user_profile_description, tweet_text, tweet_date, retweet_count) %>%
  mutate(retweet_count = format(retweet_count, big.mark = ","))
datatable(top_tweets, colnames = c("Display name", "Screen name", "Account category", "Profile description", "Tweet text", "Date", "Retweet count"))
quantile(en_ira_tweets$retweet_count, c(0.95, 0.99), na.rm = TRUE)
## 95% 99%
## 3 79
Five percent of tweets were retweeted 3 or more times; one percent of tweets were retweeted 79 times or more.
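When many tweets share the same retweet count, the share of tweets at or above a percentile value can be larger than the nominal 5 or 1 percent. The sketch below checks this on made-up counts, not on the real data; toy_retweets and share_at_or_above are hypothetical names.

```r
# Toy retweet counts (made-up values): 90 tweets with no retweets,
# 7 tweets with 3 retweets, and 3 heavily retweeted tweets
toy_retweets <- c(rep(0, 90), rep(3, 7), 50, 80, 200)

# Same quantile() call as in the analysis, applied to the toy data
thresholds <- quantile(toy_retweets, c(0.95, 0.99), na.rm = TRUE)

# Share of tweets at or above each threshold
share_at_or_above <- sapply(thresholds, function(q) mean(toy_retweets >= q, na.rm = TRUE))

thresholds
share_at_or_above
```

In this toy data several tweets tie at the 95th percentile value, so the share at or above it exceeds 5 percent. Running the same check on en_ira_tweets$retweet_count would show whether ties matter for the real figures.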
# tweets in the top 1% by retweet count, by account category
top_pc_category <- en_ira_tweets %>%
  filter(retweet_count >= quantile(retweet_count, 0.99, na.rm = TRUE)) %>%
  group_by(account_category) %>%
  count() %>%
  arrange(-n) %>%
  mutate(n = format(n, big.mark = ","))
datatable(top_pc_category, colnames = c("Account category", "Tweets in top 1%"))
# the 100 accounts with the most tweets in the top 1% by retweet count
top_pc_accounts <- en_ira_tweets %>%
  filter(retweet_count >= quantile(retweet_count, 0.99, na.rm = TRUE)) %>%
  group_by(user_display_name, user_screen_name, account_category, user_profile_description) %>%
  count() %>%
  arrange(-n) %>%
  head(100) %>%
  mutate(n = format(n, big.mark = ","))
datatable(top_pc_accounts, colnames = c("Display name", "Screen name", "Account category", "Profile description", "Tweets in top 1%"))
# count retweets whose original author is also an account in the data
en_ira_tweets %>%
  filter(retweet_userid %in% unique(userid)) %>%
  tally()
## # A tibble: 1 x 1
## n
## <int>
## 1 137653
Just 137,653 of the 3.26 million tweets were retweets of other known Russian trolls.
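As a rough check on scale, the two figures quoted above can be turned into a percentage. The total is the rounded 3.26 million from the text, so the result is approximate; the variable names are hypothetical.

```r
# Figures quoted in the text; the total is rounded, so the share is approximate
troll_retweets <- 137653
total_tweets <- 3.26e6

# retweets of other known trolls as a percentage of all tweets
share_pct <- round(troll_retweets / total_tweets * 100, 1)
share_pct
```

On these figures, retweets of other known trolls make up roughly 4 percent of the tweets.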