Data and R code to reproduce the analysis and charts underlying this May 26, 2021 BuzzFeed News article on the excess deaths caused by the February 2021 winter storm and power outages in Texas.

Data and setting up

The CDC maintains counts of weekly deaths, for all and certain selected causes, at the national and state level (also for New York City, Washington, DC, and Puerto Rico), for 2014 to 2019, and for 2020 and 2021. This analysis uses a snapshot of that data captured on May 23, 2021, saved in a single CSV file, covering deaths recorded from January 1, 2015, to March 27, 2021.

# load required packages
library(tidyverse)
library(scales)
library(splines)
library(lubridate)

# set system date to UTC
Sys.setenv(TZ="UTC")

# load weekly deaths data
mortality <- read.csv("data/mortality.csv", stringsAsFactors = F) %>%
  mutate(weekendingdate = as.Date(weekendingdate))

Calculating expected deaths

To estimate expected deaths for each jurisdiction and week, we trained models on the 2015-2019 weekly deaths data, accounting for long-term demographic changes using a linear component and using a smoothing spline to account for seasonal variation. This is the same approach used by the New York Times in its accounting for excess deaths in the US through the COVID-19 pandemic

# data frame to hold observed and expected deaths data
mortality_working <- tibble()

# fit models and predict expected deaths
for (j in unique(mortality$jurisdiction)) {
  tmp1 <- mortality %>%
    filter(jurisdiction == j & weekendingdate < "2020-01-01")
  model <- lm(allcause ~ weekendingdate + bs(mmwrweek, knots = quantile(mmwrweek, seq(0, 1, 0.1))), data = tmp1)
  tmp2 <- mortality %>%
    filter(jurisdiction == j & weekendingdate >= "2020-01-01")
  tmp <- bind_rows(tmp1,tmp2)
  tmp3 <- as_tibble(predict(model,tmp, interval = "prediction"))
  tmp <- bind_cols(tmp,tmp3)
  mortality_working <- bind_rows(mortality_working,tmp)
}

# clean up environment
rm(tmp1,tmp2,tmp3,tmp,model)

We then made plots showing actual recorded weekly deaths (points) and the predictions from the model (line) for each jurisdiction.

# plots of actual and predicted deaths by jurisdiction
for (j in unique(mortality_working$jurisdiction)) {
  tmp <- mortality_working %>%
    filter(jurisdiction == j)
  plot <- ggplot(tmp) + 
    geom_point(aes(y = allcause, x = weekendingdate), alpha = 0.6, size = 0.5) +
    geom_line(aes(y = fit, x = weekendingdate), color = "#ef3b2c") +
    scale_y_continuous(labels = comma) +
    xlab("") + ylab("Deaths") +
    ggtitle(j) +
    theme_minimal(base_family = "Basier Square SemiBold", base_size = 16) +
    theme(panel.background = element_rect(fill = "#f1f1f2", size = 0),
          plot.background = element_rect(fill = "#f1f1f2", size = 0))
  ggsave(paste0("predictions/",j,".png"), plot, width = 20, height = 12, units = "cm")
}
#clean up environment
rm(j,tmp,plot)

Here is the chart for Texas:

There are two large peaks in deaths since the start of 2020 in Texas, which correspond to the summer and fall/winter surges in COVID-19 in the state. Charts for other jurisdictions are in the predictions folder.

Calculating weekly death anomalies

We calculated overall weekly death anomalies and anomalies minus deaths for which COVID-19 was given as the underlying cause.

# calculate death anomalies; all causes and excluding COVID-19 deaths
mortality_working <- mortality_working %>%
  mutate(deaths_anomaly = allcause - fit,
         deaths_anomaly_upr = allcause - lwr,
         deaths_anomaly_lwr = allcause - upr,
         non_covid_deaths_anomaly = deaths_anomaly - covid_19_u071_underlying_cause_of_death,
         non_covid_deaths_anomaly_upr = deaths_anomaly_upr - covid_19_u071_underlying_cause_of_death,
         non_covid_deaths_anomaly_lwr = deaths_anomaly_lwr - covid_19_u071_underlying_cause_of_death)

We then plotted timelines of the total death anomalies and non-COVID death anomalies for each jurisdiction.

# plots of weekly death anomalies by jurisdiction
for (j in unique(mortality_working$jurisdiction)) {
  tmp <- mortality_working %>%
    filter(jurisdiction == j)
  plot <- ggplot(tmp, aes(x = weekendingdate)) + 
    geom_hline(yintercept = 0, size = 0.2) +
    geom_area(aes(y = deaths_anomaly, x = weekendingdate),fill = "#ef3b2c", alpha = 0.3) +
    geom_line(aes(y = non_covid_deaths_anomaly), color = "#ef3b2c", size = 0.3) +
    scale_y_continuous(labels = comma) +
    xlab("") + ylab("Deaths relative to expected") +
    ggtitle(j) +
    theme_minimal(base_family = "Basier Square SemiBold", base_size = 16) +
    theme(panel.background = element_rect(fill = "#f1f1f2", size = 0),
          plot.background = element_rect(fill = "#f1f1f2", size = 0))
  ggsave(paste0("anomalies/",j,".png"), plot, width = 20, height = 12, units = "cm")
}
# clean up environment
rm(tmp,plot)

The chart for Texas is below. The shaded area represents weekly death anomalies for all causes. The red line shows anomalies for deaths in which COVID-19 was not identified as the underlying cause. It seems that the under-reporting of deaths associated with COVID-19 was a problem in the early stages of the pandemic, but reduced significantly in more recent months.