Data and R code for the analysis supporting this August 7, 2017 BuzzFeed News post on identifying potential surveillance aircraft. Supporting files are in this GitHub repository.

Data

BuzzFeed News obtained more than four months of aircraft transponder detections from the plane tracking website Flightradar24, covering August 17 to December 31, 2015 (UTC). The data contain everything displayed on the site within a bounding box encompassing the continental United States, Alaska, Hawaii, and Puerto Rico.

Flightradar24 receives data from its network of ground-based receivers, supplemented by a feed from ground radars provided by the Federal Aviation Administration (FAA) with a five-minute delay.

After parsing from the raw files supplied by Flightradar24, the data included the following fields, for each transponder detection:

We also calculated:

Feature engineering

Using the same data, we had previously reported on flights of spy planes operated by the FBI and the Department of Homeland Security (DHS). We reasoned that it should be possible to train a machine learning algorithm to identify other aircraft performing similar surveillance, based on characteristics of the aircraft and their flight patterns.

First, we filtered the data to remove planes registered abroad (based on their adshex code), common commercial airliners (based on their type), and aircraft with fewer than 500 transponder detections.
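The filtering step can be sketched as follows. This is a minimal illustration on made-up data, not the code used for the analysis: the toy `detections` table, the `airliner_types` list, and the rule that US-registered aircraft are those whose hex code starts with "A" (the ICAO address block A00000 to AFFFFF allocated to the United States) are all assumptions, since the exact criteria are not given here.

```r
library(dplyr)

# toy detections table, one row per transponder detection (made-up data)
detections <- data.frame(
  adshex = c(rep("A12345", 600), rep("C0FFEE", 700), rep("A54321", 800), rep("AABBCC", 100)),
  type   = c(rep("C182", 600), rep("C182", 700), rep("B738", 800), rep("C208", 100)),
  stringsAsFactors = FALSE
)

# hypothetical list of airliner type codes to exclude
airliner_types <- c("B738", "A320")

filtered <- detections %>%
  # assumed rule: keep hex codes in the US block, which all begin with "A"
  filter(grepl("^A", adshex, ignore.case = TRUE)) %>%
  # drop common commercial airliners, based on type
  filter(!type %in% airliner_types) %>%
  # drop aircraft with fewer than 500 detections
  group_by(adshex) %>%
  filter(n() >= 500) %>%
  ungroup()

# aircraft that survive all three filters
distinct(filtered, adshex)
```

In this toy example only A12345 remains: C0FFEE is outside the US block, A54321 is an airliner type, and AABBCC has too few detections.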

Then we took a random sample of 500 aircraft and calculated the following for each one:

Finally, we calculated the following variables for each of the aircraft in the larger filtered dataset:

The resulting data for 19,799 aircraft are in the file planes_features.csv.

Machine learning, using random forest algorithm

For the machine learning, we selected the random forest algorithm, popular among data scientists for classification tasks. (See this tutorial for background on running the random forest in R.)

As training data, drawn from planes_features.csv, we used 97 fixed-wing FBI and DHS planes from our previous story, given a class of surveil, and a random sample of 500 other planes, given a class of other.

Data identifying these planes is in the file train.csv.

# load required packages
library(readr)
library(dplyr)
library(randomForest)

# load planes_features data
planes <- read_csv("data/planes_features.csv") 

# convert type to integers, as new variable type2, so it can be used by the random forest algorithm
planes <- planes %>%
  mutate(type2=as.integer(as.factor(type)))

# load training data and join to the planes_features data
train <- read_csv("data/train.csv") %>%
  inner_join(planes, by="adshex")
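To see what the type2 conversion does, here is a small sketch on a made-up vector of type codes. The values are illustrative only. One likely motivation, not stated in the original, is that randomForest does not accept categorical predictors with more than 53 categories, whereas an integer column is treated as an ordinary numeric variable.

```r
# made-up aircraft type codes
types <- c("C182", "B350", "C182", "C208")

# as.factor() assigns levels in sorted order
type_factor <- as.factor(types)
levels(type_factor)      # "B350" "C182" "C208"

# as.integer() returns each value's position among those levels
type2 <- as.integer(type_factor)
type2                    # 2 1 2 3
```

Note that the resulting integers reflect alphabetical order of the type codes, not any meaningful ranking of aircraft, so the model treats type2 as a numeric variable with an arbitrary ordering.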

We then trained the random forest algorithm using this data.

# set seed for reproducibility of model fit
set.seed(415)

# train the random forest
fit <- randomForest(as.factor(class) ~ duration1 + duration2 + duration3 + duration4 + duration5 +
                      boxes1 + boxes2 + boxes3 + boxes4 + boxes5 +
                      speed1 + speed2 + speed3 + speed4 + speed5 +
                      altitude1 + altitude2 + altitude3 + altitude4 + altitude5 +
                      steer1 + steer2 + steer3 + steer4 + steer5 + steer6 + steer7 + steer8 +
                      flights + squawk_1 + observations + type2,
                    data=train,
                    importance=TRUE,
                    ntree=2000)
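A fitted random forest also carries an out-of-bag (OOB) estimate of its own error, which gives a quick check on how well the classes are separated. The sketch below uses a small synthetic dataset rather than train.csv, so the variable values, the toy_fit object, and the rule generating the classes are all invented for illustration. With the real model, print(fit) reports the same information.

```r
library(randomForest)

# synthetic data: class depends only on steer1 (an invented rule)
set.seed(415)
n <- 200
toy <- data.frame(steer1 = runif(n), speed1 = runif(n))
toy$class <- as.factor(ifelse(toy$steer1 > 0.7, "surveil", "other"))

toy_fit <- randomForest(class ~ steer1 + speed1, data = toy, ntree = 500)

# OOB confusion matrix: rows are true classes, last column is per-class error
toy_fit$confusion

# OOB error rate after the final tree
oob_error <- tail(toy_fit$err.rate[, "OOB"], 1)
oob_error
```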

One useful feature of the random forest is that it is possible to see which of the input variables are most important to the trained model.

# look at which variables are important in model
varImpPlot(fit, pch = 19, main = "Importance of variables", color = "blue", cex = 0.7)
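The numbers behind the importance plot can also be pulled out as a table using the importance() function from the randomForest package. The following sketch again uses synthetic data and a hypothetical toy_fit model, in which only steer1 carries any signal; applying the same calls to fit would rank the real input variables.

```r
library(randomForest)

# synthetic data: only steer1 determines the class (an invented rule)
set.seed(415)
n <- 300
toy <- data.frame(steer1 = runif(n), speed1 = runif(n), altitude1 = runif(n))
toy$class <- as.factor(ifelse(toy$steer1 > 0.7, "surveil", "other"))

# importance=TRUE is needed for the permutation-based measure
toy_fit <- randomForest(class ~ ., data = toy, importance = TRUE, ntree = 500)

# type = 1 returns mean decrease in accuracy for each variable
imp <- importance(toy_fit, type = 1)

# sort variables from most to least important
imp_sorted <- imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), , drop = FALSE]
imp_sorted
```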