Russian Trolls Swarmed The Charlottesville March — Then Twitter Cracked Down

Data and R code to reproduce the analysis and graphics in this Aug. 10, 2018 BuzzFeed News post on the reaction of Twitter trolls operated by Russia’s Internet Research Agency to the violence that erupted in Charlottesville, Virginia, in August 2017. Supporting files are in this GitHub repository.

Data

On Jul. 31, 2018, researchers Darren Linvill and Patrick Warren of Clemson University in South Carolina published with FiveThirtyEight an archive of nearly 3 million tweets from accounts that Twitter, in two lists provided to Congress, identified as linked to the Internet Research Agency. In February 2018, special counsel Robert Mueller indicted the agency and 13 of its employees for interfering in US politics with “a strategic goal to sow discord.”

BuzzFeed News obtained a new copy of the data directly from the researchers. This has been cleaned to address some small problems with the publicly shared data, and filtered for tweets from the two main account types most active in trying to influence US politics, called “Left Trolls” and “Right Trolls” by Linvill and Warren. We also processed the data for easier handling by time and date, and filtered for tweets sent on or after Jun. 19, 2015, the start of the period for which the researchers are confident they have a complete record of tweets from the handles identified by Twitter.

Many of the Left Trolls posed as supporters of Black Lives Matter, tweeting about aspects of black culture as well as politics. They tended to support Bernie Sanders and disparage Hillary Clinton, and were most active before the 2016 presidential election. Right Trolls posed as supporters of Donald Trump, and were most active in the summer of 2017.

The data contains the following fields:

  • author Account handle, in lower case.
  • content Tweet content.
  • region As classified by Social Studio, the software used by Linvill and Warren to compile the tweets.
  • language Language in which the tweet was written.
  • tweet_date tweet_time Date and time, in UTC, that the tweet was posted.
  • year month hour minute Processed from the tweet date and time.
  • following Number of accounts being followed by the author, at the time the tweet was sent.
  • followers Number of accounts following the author, at the time the tweet was sent.
  • post_url URL for the tweet.
  • post_type Null for original content, RETWEET or QUOTE TWEET.
  • retweet 0 for original content, 1 for RETWEET or QUOTE TWEET.
  • tweet_id Unique tweet code, from Twitter.
  • author_id Author ID code from Twitter. Due to a glitch in data processing by the Clemson researchers, for earlier tweets these codes were turned into numbers and rounded, which means they cannot reliably be used to identify accounts. Still, they may be useful for analysis of later tweets.
  • account_category Left Troll or Right Troll.
  • new_june_2018 0 for accounts in the list provided to Congress by Twitter in November 2017, 1 for accounts newly identified in the extended list released in June 2018.
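
As an illustration of how the processed fields relate to the raw ones, here is a minimal sketch on invented rows. It assumes tweet_time arrives as an "HH:MM:SS" string and that post_type is missing for original content; the names toy and toy_processed are hypothetical, and this is not necessarily how the shared data was actually prepared.

```r
library(dplyr)

# toy rows standing in for the real data; all values are invented
toy <- tibble(
  tweet_date = as.Date(c("2017-08-12", "2016-11-08")),
  tweet_time = c("17:45:00", "03:05:00"),
  post_type  = c(NA, "RETWEET")
)

toy_processed <- toy %>%
  mutate(
    # date parts, taken from the Date column
    year    = as.integer(format(tweet_date, "%Y")),
    month   = as.integer(format(tweet_date, "%m")),
    # time parts, assuming an "HH:MM:SS" string
    hour    = as.integer(substr(tweet_time, 1, 2)),
    minute  = as.integer(substr(tweet_time, 4, 5)),
    # 0 for original content, 1 for RETWEET or QUOTE TWEET
    retweet = if_else(is.na(post_type), 0L, 1L)
  )
```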

Setting up

Required packages and regular expressions for processing tweet content; loading data.

# load required packages
library(readr)
library(dplyr)
library(ggplot2)
library(tidytext)
library(tidyr)
library(stringr)
library(scales)
library(DT)

# regexes for parsing tweets
replace_reg <- "https?://[^\\s]+|&amp;|&lt;|&gt;|\\bRT\\b"
unnest_reg <- "([^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@]))"

# load data
# empty tibble to collect the four files (data_frame() is deprecated)
tweets <- tibble()

for (n in c(1:4)) {
   tmp <- read_csv(paste0("data/tweets",n,".csv"), col_types = cols(
     .default = col_character(),
     following = col_integer(),
     followers = col_integer(),
     tweet_date = col_date(),
     tweet_time = col_time(),
     year = col_integer(),
     month = col_integer(),
     hour = col_integer(),
     minute = col_integer()
   ))
   tweets <- bind_rows(tweets,tmp)
}
rm(tmp)
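
The two regular expressions defined above are not used in the code shown in this section. A minimal sketch of how they could be applied, assuming the common tidytext pattern of stripping links, HTML entities, and the RT marker before splitting into words, hashtags, and handles. The sample tweet, the handle @example_user, and the names toy and toy_words are invented for illustration.

```r
library(dplyr)
library(stringr)
library(tidytext)

# same regexes as in the setup code
replace_reg <- "https?://[^\\s]+|&amp;|&lt;|&gt;|\\bRT\\b"
unnest_reg <- "([^A-Za-z_\\d#@']|'(?![A-Za-z_\\d#@]))"

# one invented tweet
toy <- tibble(content = "RT @example_user: Don't miss this &amp; more #news https://t.co/abc123")

toy_words <- toy %>%
  # remove URLs, HTML entities and the RT marker
  mutate(content = str_replace_all(content, replace_reg, "")) %>%
  # split on anything that is not a word character, #, @ or an in-word apostrophe
  unnest_tokens(word, content, token = "regex", pattern = unnest_reg)
```

Splitting with a regex rather than the default word tokenizer keeps the leading # and @, so hashtags and handles can be counted as terms in their own right.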

Tweets from Russian Left and Right Trolls in July and August 2017

Left Trolls were most active before the 2016 election. But the summer of 2017 is when the Right Trolls had their biggest surge.

# tweets per day, by category
tweets_category_day <- tweets %>%
  group_by(tweet_date,account_category) %>%
  count() %>%
  arrange(-n) %>%
  filter(grepl("left|right",account_category,ignore.case = TRUE))

# plot
ggplot(tweets_category_day, aes(x=tweet_date, y=n, color=account_category)) +
  scale_color_brewer(palette = "Set1", direction = -1, name = "") +
  geom_line() +
  geom_point() +
  xlab("") +
  ylab("Tweets") +
  scale_y_continuous(labels = comma) +
  scale_x_date(limits = c(as.Date("2017-07-01"),as.Date("2017-08-31"))) +
  geom_vline(aes(xintercept = as.Date("2017-08-12")), linetype="dotted", size=0.75) +
  geom_hline(aes(yintercept = 0), size = 0.2) +
  annotate("text", 
           x = as.Date("2017-08-05"), 
           y = 16000, 
           label = "Charlottesville",
           family = "ProximaNova-Semibold",
           size = 4.5) +
  theme_minimal(base_size = 16, base_family = "ProximaNova-Semibold") +
  theme(legend.position = "top",
        panel.grid.minor.x = element_blank())
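
To read the peak off as a number rather than from the chart, the daily counts can be reduced to the busiest day per category. A minimal sketch on invented rows; the names toy and toy_peak are hypothetical, and the counts say nothing about the real data.

```r
library(dplyr)

# invented rows: one row per tweet
toy <- tibble(
  tweet_date = as.Date(c("2017-08-11", "2017-08-12", "2017-08-12",
                         "2017-08-12", "2017-08-13")),
  account_category = c("Right Troll", "Right Troll", "Right Troll",
                       "Left Troll", "Right Troll")
)

toy_peak <- toy %>%
  # same window as the chart
  filter(tweet_date >= as.Date("2017-07-01"),
         tweet_date <= as.Date("2017-08-31")) %>%
  # tweets per day, by category
  count(tweet_date, account_category, name = "tweets") %>%
  # keep the single busiest day for each category
  group_by(account_category) %>%
  slice_max(tweets, n = 1, with_ties = FALSE) %>%
  ungroup()
```

Run against tweets instead of toy, the same steps would return one row per account category with its busiest day in July and August 2017.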