Complex Survey Analysis in R

Dec 15, 2016 00:00 · 1114 words · 6 minutes read R complex survey

Introduction

Public health is a broad and sometimes overreaching field. Consequently, I am asked, “what is public health?” more so than “what school did you go to (in Hawaii)?” With that said, I would like to start off with a quote by C.E. Taylor:

“The focus of public health is the community. The ‘patient’ is a whole population unit. The shift in professional orientation which occurs as the unit of attention moves from the individual to the group must be clearly recognized and explicitly stated because it has led to many misunderstandings in the past.”

Once it is clearly understood that public health is not brain surgery and according to the CDC more about zombie invasions, the next question asked is “so what does public health do if it doesn’t cure diabetes?” Such a loaded question. In a nutshell, the four core functions of public health are as follows:

  • Assessment
  • Policy development
  • Assurance
  • Communication

In other words, public health professionals rely on studies and surveys, such as the Behavioral Risk Factors Surveillance System (BRFSS), to understand health trends. Consequently, campaigns and/or policies are developed to break trends when needed.

Method

Statistical Software

SAS-callable SUDAAN is the usual software of choice when analyzing complex surveys as it can account for the sampling design and weighting; however, we can also do this with the R software environment for statistical computing. Proprietary statistical programs will always be public health’s frenemies in the area of research and complex survey analysis, but let’s not write off R too soon as it is a powerful statistical programming language.

# importing data
library(RCurl)
library(foreign)
library(downloader)

# data wrangling
library(data.table)
library(stringr)

# complex survey analysis
library(survey)

# eda
library(knitr)
library(VIM)
library(ggplot2)
library(scales)

Data

In laymen’s terms, many public health organizations and professionals cite the BRFSS when interested in health risk behaviors, health access and chronic disease prevalence. Introductions and examples have already been created on BRFSS for R, and have been featured on R-bloggers. Anthony Damico maintains a GitHub repository on downloading and storing complex surveys. I recommend his solution for those who are new to R and/or complex survey analysis and because there is a significant speed increase when working on analyzing the BRFSS alongside MonetDB.

In this example, I analyzed overweight or obese risk in the 2015 BRFSS series. The question I wanted to answer was if Hawaii should be written off as the healthiest state in the nation.

The BRFSS data set contains over 300 variables. In this case, we are only interested in looking at respondents’ reported BMI category and preferred race. We also need to extract additional variables such as weight, residency and strata.

# load(file="input/brfss15.Rda")
data_1 <- setDT(brfss15)[, .(X_PSU, X_LLCPWT, X_STSTR, X_STATE, X_RFBMI5, X_PRACE1)]

kable(head(data_1, 10), format="markdown")
X_PSU X_LLCPWT X_STSTR X_STATE X_RFBMI5 X_PRACE1
2.015e+09 341.38485 11011 1 2 1
2.015e+09 108.06090 11011 1 2 1
2.015e+09 255.26480 11011 1 1 1
2.015e+09 341.38485 11011 1 2 1
2.015e+09 258.68222 11011 1 1 1
2.015e+09 256.51859 11011 1 2 1
2.015e+09 85.65975 11011 1 1 1
2.015e+09 545.78210 11011 1 2 1
2.015e+09 211.21030 11011 1 1 1
2.015e+09 215.47286 11011 1 2 1

Preprocessing

The variables require some data wrangling. Both BMI category and preferred race are numerically encoded and may be difficult to interpret without a data dictionary. Discretizing variable levels with descriptive labels may also help with interpretation.

# source("01 Preprocess/Preprocess Features.R")

# subset data
data_1 <- setDT(brfss15)[, .(X_PSU, X_LLCPWT, X_STSTR, X_STATE, X_RFBMI5, X_PRACE1)]

# discretize BMI category
data_1 <- data_1[X_RFBMI5==2, X_RFBMI5_2:="Obese/Overweight"]
data_1 <- data_1[X_RFBMI5==1, X_RFBMI5_2:="Not Obese/Overweight"]
data_1 <- data_1[X_RFBMI5==9, X_RFBMI5_2:="Not Sure"]

data_1 <- data_1[X_RFBMI5_2!="Not Sure",]

data_1 <- data_1[, X_RFBMI5_2:=as.factor(X_RFBMI5_2)]

# discretize Race 
data_1 <- data_1[X_PRACE1==1, X_PRACE1_2:="White"]
data_1 <- data_1[X_PRACE1==2, X_PRACE1_2:="Black"]
data_1 <- data_1[X_PRACE1==3, X_PRACE1_2:="AIAN"]
data_1 <- data_1[X_PRACE1==4, X_PRACE1_2:="Asian"]
data_1 <- data_1[X_PRACE1==5, X_PRACE1_2:="NHOPI"]
data_1 <- data_1[X_PRACE1==6, X_PRACE1_2:="Other"]
data_1 <- data_1[X_PRACE1==7, X_PRACE1_2:="No Preferred"]
data_1 <- data_1[X_PRACE1==8, X_PRACE1_2:="Multiracial"]
data_1 <- data_1[X_PRACE1==77, X_PRACE1_2:="Not Sure"]
data_1 <- data_1[X_PRACE1==99, X_PRACE1_2:="Refused"]

data_1 <- data_1[X_PRACE1_2!="Not Sure",]

data_1 <- data_1[, X_PRACE1_2:=as.factor(X_PRACE1_2)]

Results

The survey design can now be specified with both the survey and preprocessed variables. This step may take some time given the size of this data set, and so the survey object is subset on the state of Hawaii for the rest of the analysis.

# source("02 Modeling/Create Survey Object.R")
options(survey.lonely.psu="certainty")

dsurvey <- svydesign(id        = ~X_PSU,
                     strata    = ~X_STSTR, 
                     weights   = ~X_LLCPWT,
                     nest      = TRUE,
                     data      = data_1)

dsurvey_2 <- subset(dsurvey, X_STATE==15)

Statewide Estimates

# calculate estimates
hawaii_est <- data.frame(
  svytotal(~X_RFBMI5_2, dsurvey_2, na.rm=TRUE),
  svymean(~X_RFBMI5_2, dsurvey_2, na.rm=TRUE, level=0.95),
  confint(svymean(~X_RFBMI5_2, dsurvey_2, na.rm=TRUE, level=0.95))
)

# create group column
hawaii_est$group <- str_replace(row.names(hawaii_est), "X_RFBMI5_2", "")
colnames(hawaii_est) <- c("N", "N (SE)", "%", "% (SE)", "% Lower CI (95%)", "% Upper CI (95%)", "Group")

# plot results
ggplot(data=hawaii_est, aes(x=Group, y=`%`, group=Group, fill=Group, ymax=`% Upper CI (95%)`, ymin=`% Lower CI (95%)`)) +
  geom_bar(stat="identity", position="dodge") +
  geom_errorbar(position=position_dodge(width=0.9), width=0.1) +
  scale_y_continuous(labels=percent) +
  geom_text(aes(y=0, label=comma(round(N, 0))), position=position_dodge(width=0.9), vjust=-1) +
  labs(title="Overweight or Obese, Statewide", 
       subtitle="State of Hawaii BRFSS 2015",
       x="Group",
       y="Point Estimate (%)") + 
  theme_minimal()

Stratified By Preferred Race

hawaii_race_est <- merge(
  svyby(~X_RFBMI5_2, ~X_PRACE1_2, svymean, design=dsurvey_2, na.rm=TRUE, vartype=c('ci')),
  svyby(~X_RFBMI5_2, ~X_PRACE1_2, svytotal, design=dsurvey_2, na.rm=TRUE),
  by="X_PRACE1_2")

colnames(hawaii_race_est) <- c("Group", 
                               "Not Obese/Overweight (%)",
                               "Obese/Overweight (%)", 
                               "Not Obese/Overweight (%) Lower CI (95%)",
                               "Obese/Overweight (%) Lower CI (95%)",
                               "Not Obese/Overweight (%) Upper CI (95%)",
                               "Obese/Overweight (%) Upper CI (95%)",
                               "Not Obese/Overweight (N)",
                               "Obese/Overweight (N)", 
                               "Not Obese/Overweight (SE)",
                               "Obese/Overweight (SE)")

ggplot(data=hawaii_race_est, aes(x=reorder(Group, `Obese/Overweight (%)`), y=`Obese/Overweight (%)`, group=Group, fill=Group,
                                 ymax=`Obese/Overweight (%) Upper CI (95%)`, ymin=`Obese/Overweight (%) Lower CI (95%)`)) +
  geom_bar(stat="identity", position="dodge") +
  geom_errorbar(position=position_dodge(width=0.9), width=0.1) +
  scale_y_continuous(labels=percent) +
  geom_text(aes(y=0, label=comma(round(`Obese/Overweight (N)`, 0))), position=position_dodge(width=0.9), size=3, vjust=-1) +
  labs(title="Overweight or Obese, by Preferred Race", 
       subtitle="State of Hawaii BRFSS 2015",
       x="Group",
       y="Point Estimate (%)") + 
  geom_hline(yintercept=hawaii_est$`%`[2], col="black", lty=2, size=1) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle=90, hjust=1, vjust=0))

Although more than half of residents reported to be overweight or obese, Hawaii still has one of the lowest rates in the nation (0.57% vs. 64.6%). Yet when stratifying by preferred race, Hawaii actually has some of the highest prevalence rates for overweight and obesity risk factors in the nation. In 2015, Native Hawaiians and Other Pacific Islanders (NHOPI) overweight or obese risk was at 79.17% (95% confidence interval [CI]: 71.88-86.46%).

Conclusion

So why was this exercise important? The two charts tell a compelling story. If we were to accept only the first cut of the data, then policies and initiatives may overlook Hawaii. However, the second chart reveals that NHOPIs are at extreme risk for negative health outcomes.

Public health attempts to understand trends and/or driving factors that cause morbidity or mortality. These analyses then help to drive interventions and policies that may potentially make the greatest impact. Tools like R can assist with this process as it democratizes data and analysis so that different perspectives can be added to the overall discussion.