Data and R code for the analysis supporting this August 7, 2017 BuzzFeed News post on identifying potential surveillance aircraft. Supporting files are in this GitHub repository.
BuzzFeed News obtained more than four months of aircraft transponder detections from the plane tracking website Flightradar24, covering August 17 to December 31, 2015 UTC, containing all data displayed on the site within a bounding box encompassing the continental United States, Alaska, Hawaii, and Puerto Rico.
Flightradar24 receives data from its network of ground-based receivers, supplemented by a feed from ground radars provided by the Federal Aviation Administration (FAA) with a five-minute delay.
After parsing from the raw files supplied by Flightradar24, the data included the following fields, for each transponder detection:
adshex
Unique identifier for each aircraft, corresponding to its “Mode-S” code, in hexademical format.flight_id
Unique identifier for each “flight segment,” in hexadecimal format. A flight segment is a continuous series of transponder detections for one aircraft. There may be more than one segment per flight, if a plane disappears from Flightradar24’s coverage for a period — for example when flying over rural areas with sparse receiver coverage. While being tracked by Fightradar24, planes were typically detected several times per minute.latitude
, longitude
Geographic location in digital degrees.altitude
Altitude in feet.speed
Ground speed in knots.squawk
Four-digit code transmitted by the transponder.type
Aircraft manufacter and model, if identified.timestamp
Full UTC timestamp.track
Compass bearing in degrees, with 0 corresponding to north.We also calculated:
steer
Change in compass bearing from the previous transponder detection for that aircraft; negative values indicate a turn to the left, positive values a turn to the right.Using the same data, we had previously reported on flights of spy planes operated by the FBI and the Department of Homeland Security (DHS), and reasoned that it should be possible to train a machine learning algorthim to identify other aircraft performing similar surveillance, based on characteristics of the aircraft and their flight patterns.
First we filtered the data to remove planes registered abroad, based on their adshex
code, common commercial airliners, based on their type
, and aircraft with fewer than 500 transponder detections.
Then we took a random sample of 500 aircraft and calculated the following for each one:
duration
of each flight segment recorded by Flightradar24, in minutes.boxes
Area of a rectangular bounding box drawn around each flight segment, in square kilometers.Finally, we calculated the following variables for each of the aircraft in the larger filtered dataset:
duration1
,duration2
,duration3
,duration4
,duration5
Proportion of flight segment durations for each plane falling into each of five quantiles calculated from duration
for the sample of 500 planes. The proportions for each aircraft must add up to 1; if the durations of flight segments for a plane closely matched those for a typical plane from the sample, these numbers would all approximate to 0.2; a plane that mostly flew very long flights would have large decimal fraction for duration5
.boxes1
,boxes2
,boxes3
,boxes4
,boxes5
Proportion of bounding box areas for each plane falling into each of five quantiles calculated from boxes
for the sample of 500 planes.speed1
,speed2
,speed3
,speed4
,speed5
Proportion of speed
values recorded for the aircraft falling into each of five quantiles recorded for speed
for the sample of 500 planes.altitude1
,altitude2
,altitude3
,altitude4
,altitude5
Proportion of altitude
values recorded for the aircraft falling into each of five quantiles recorded for altitude
for the sample of 500 planes.steer1
,steer2
,steer3
,steer4
,steer5
,steer6
,steer7
,steer8
Proportion of steer
values for each aircraft falling into bins set manually, after observing the distribution for the sample of 500 planes, using the breaks: -180, -25, -10, -1, 0, 1, 22, 45, 180.flights
Total number of flight segments for each plane.squawk_1
Squawk code used most commonly by the aircraft.observations
Total number of transponder detections for each plane.type
Aircraft manufacter and model, if identified, else unknown
.The resulting data for 19,799 aircraft are in the file planes_features.csv
.
For the machine learning, we selected the random forest algorithm, popular among data scientists for classification tasks. (See this tutorial for background on running the random forest in R.)
As training data, drawn from planes_features.csv
, we used 97 fixed-wing FBI and DHS planes from our previous story, given a class
of surveil
, and a random sample of 500 other planes, given a class
of other
.
Data identifying these planes is in the file train.csv
.
# load required packages
library(readr)
library(dplyr)
library(randomForest)
# load planes_features data
planes <- read_csv("data/planes_features.csv")
# convert type to integers, as new variable type2, so it can be used by the random forest algorithm
planes <- planes %>%
mutate(type2=as.integer(as.factor(type)))
# load training data and join to the planes_features data
train <- read_csv("data/train.csv") %>%
inner_join(planes, by="adshex")
We then trained the random forest algorithm using this data.
# set seed for reproducibility of model fit
set.seed(415)
# train the random forest
fit <- randomForest(as.factor(class) ~ duration1 + duration2 + duration3 + duration4 + duration5 + boxes1 + boxes2 + boxes3 + boxes4 + boxes5 + speed1 + speed2 + speed3 + speed4 + speed5 + altitude1 + altitude2 + altitude3 + altitude4 + altitude5 + steer1 + steer2 + steer3 + steer4 + steer5 + steer6 + steer7 + steer8 + flights + squawk_1 + observations + type2,
data=train,
importance=TRUE,
ntree=2000)
One useful feature of the random forest is that it is possible to see which of the input variables are most important to the trained model.
# look at which variables are important in model
varImpPlot(fit, pch = 19, main = "Importance of variables", color = "blue", cex = 0.7)
MeanDecreaseAccuracy
measures the overall decrease in accurancy of the model if each variable is removed. MeanDecreaseGini
measures the extent to which each variable plays a role in partitioning the data into the defined classes.
So these two charts show that the steer1
and steer2
variables, quantifying the frequency of turning hard to the left, and squawk_1
, the most common squawk code broadcast by a plane’s transponder, were the most important to the model.
We then examined the model’s performance, from its estimated errors in classifying the training data.
# look at estimated model performance
print(fit)
##
## Call:
## randomForest(formula = as.factor(class) ~ duration1 + duration2 + duration3 + duration4 + duration5 + boxes1 + boxes2 + boxes3 + boxes4 + boxes5 + speed1 + speed2 + speed3 + speed4 + speed5 + altitude1 + altitude2 + altitude3 + altitude4 + altitude5 + steer1 + steer2 + steer3 + steer4 + steer5 + steer6 + steer7 + steer8 + flights + squawk_1 + observations + type2, data = train, importance = TRUE, ntree = 2000)
## Type of random forest: classification
## Number of trees: 2000
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 3.69%
## Confusion matrix:
## other surveil class.error
## other 498 2 0.0040000
## surveil 20 77 0.2061856
This output shows that overall, the estimated classification error rate was 3.7%. For the target surveil
class, representing likely surveillance aircraft, the estimated error rate was 20.6%. We experimented with simplifying the model, removing some of the less important variables, but that did not appreciably improve its performance.
(A more thorough approach to testing and refining a model’s performance would involve splitting the training dataset, and keeping a portion to one side to evaluate the trained model. However, for our goal of providing an initial screen for potential surveillance aircraft, to inform subsequent reporting, the initial model was adequate.)
We then used the model to classify all of the planes in the data, after first removing the planes used in training data and a list of known federal law enforcement aircraft, which are identified in the file feds.csv
.
# load data on known federal planes
feds <- read_csv("data/feds.csv")
# create dataset to classify
classify <- anti_join(planes, train) %>%
anti_join(feds)
# run the random forest model on that dataset
prediction <- predict(fit, classify)
# create data frame with predictions from random forest, by plane
classified <- data.frame(adshex = classify$adshex, class = prediction)
# summarize to see how many planes classified as candidate surveillance aircraft
classified_summary <- classified %>%
group_by(class) %>%
summarize(count=n())
print(classified_summary)
## # A tibble: 2 x 2
## class count
## <fctr> <int>
## 1 other 19091
## 2 surveil 69
The output shows that the model classified 69 planes as likely surveillance aircraft. Given the likelihood of some misclassification, we ran the analysis again to calculate a probability that each plane matched the surveil
and other
classes. We sorted them in descending order of probability of membership of the surveil
class, and selected the top 100 for further investigation, including the variable squawk_1
in the data, as some specific codes contain useful information — for example those restricted for use by federal law enforcement.
# run the random forest model again to get probabilities of membership of each class, rather than binary classification
prediction_probs <- as.data.frame(predict(fit, classify, type = "prob"))
# create data frame with predictions from random forest, by plane
classified_probs <- bind_cols(as.data.frame(classify$adshex), prediction_probs) %>%
mutate(adshex = classify$adshex) %>%
select(2:4) %>%
arrange(desc(surveil)) %>%
inner_join(planes) %>%
select(1:3,squawk_1)
# select the top 100 candidates
candidates <- head(classified_probs, 100)
Before exporting, we joined to data from the FAA aircraft registration database, to include where available the planes’ registration numbers and the organizations they are registered to.
# load and process FAA registration data (here using download from Jul 25, 2017, to reflect current registration)
faa <- read_csv("data/faa-registration.csv") %>%
select(1,7,34)
names(faa) <- c("n_number","name","adshex")
faa <- faa %>%
mutate(reg = paste0("N",n_number)) %>%
select(2:4)
# join to FAA registration data to obtain information on each plane's registration numbers and registered owners/operators
candidates <- left_join(candidates,faa, by="adshex")
# export data
write_csv(candidates, "data/candidates.csv", na="")
The file candidates_annotated.csv
includes some notes on these 100 aircraft, following further research and reporting.