Survey raking with {anesrake}

{anesrake} R Package Tutorial

We are often faced with a representativeness issue when using samples in surveys. Our sample might be different in important ways from the true population. To adjust for these errors, we can use raking adjustment. Raking allows us to select variables in our sample that will be adjusted based on true population parameters.

The following tutorial shows how to use {anesrake}, an R package that computes the weights for you!

The goal is to identify variables (often demographic) to weight on. The statistical software will compare the variables in your sample to the population to compute the weights.
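To make the idea concrete, here is a minimal base-R sketch of raking (iterative proportional fitting). It is not the {anesrake} implementation, which also handles variable selection, weight capping and convergence checks. The helper name `rake_sketch()`, the toy data and the toy targets are all hypothetical and exist only for illustration.

```r
# Minimal raking (iterative proportional fitting) sketch in base R.
# rake_sketch(), toy and toy_targets are hypothetical illustrations,
# not part of {anesrake}.
rake_sketch <- function(data, targets, max_iter = 1000, tol = 1e-10) {
  w <- rep(1, nrow(data))                        # start from equal weights
  for (i in seq_len(max_iter)) {
    w_old <- w
    for (v in names(targets)) {
      groups  <- as.character(data[[v]])
      # current weighted share of each category of variable v
      current <- tapply(w, groups, sum) / sum(w)
      # rescale each weight by (target share / current share) of its category
      w <- w * as.numeric(targets[[v]][groups]) / as.numeric(current[groups])
    }
    if (max(abs(w - w_old)) < tol) break         # stop once the weights settle
  }
  w / mean(w)                                    # normalise to a mean of 1
}

toy <- data.frame(
  income    = c("20k", "20k", "50k", "50k", "50k", "100k", "100k", "20k"),
  education = c("highschool", "college", "highschool", "college",
                "graduate", "college", "graduate", "highschool")
)
toy_targets <- list(
  income    = c("20k" = 0.35, "50k" = 0.50, "100k" = 0.15),
  education = c("highschool" = 0.376, "college" = 0.497, "graduate" = 0.127)
)

w <- rake_sketch(toy, toy_targets)
round(tapply(w, toy$income, sum) / sum(w), 3)     # weighted income shares
round(tapply(w, toy$education, sum) / sum(w), 3)  # weighted education shares
```

Each pass adjusts the weights so that one variable matches its target, which slightly disturbs the other variable; repeating the passes brings both margins to their targets at once.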

For more information on the package, please refer to the following document: https://electionstudies.org/wp-content/uploads/2018/04/nes012427.pdf

require(pacman)
p_load(tidyverse, anesrake, weights)
dat <- read_csv('/Users/bijeanghafouri/Dropbox/My Mac (Bijean’s MacBook Pro)/Documents/website/website/content/blog/Tutorials/anesrake/donations.csv')
dat <- as.data.frame(dat)

# Set target variables as factors (important!)
dat$income <- as.factor(dat$income)
dat$education <- as.factor(dat$education)

Data simulation

First, we need to find our population-level estimates. In some cases, you will have access to population-level data from which you can draw your point estimates. However, you are likely to not have direct access to these data. You will need to find these statistics from other sources.

In this example, I use data from the United States census to find the proportions of each category in my variables. I find population-level education levels here, and income levels here.

From census data, we can find population-level marginal proportions for the variables we will weight on. We will be weighting on two variables: income and education. However, you can (perhaps ideally) weight on more variables, including sex, ethnicity, etc. Also, note that these categories are somewhat arbitrary. In a real survey, you are most likely to categorize income and education in other ways.

income <- c('20k', '50k', '100k')
income_prop <- c(.35, .50, .15)
education <- c('highschool', 'college', 'graduate')
education_prop <- c(.376, .497, .127)
# tibble() replaces the deprecated data_frame()
population <- tibble(income, education, income_prop, education_prop)

We identify the list of variables we want to weight on by creating a list of ‘targets’. It is important to make sure that the variable names in the dataset and in the population are the same.

target <- with(population, list(
  income = weights::wpct(income, income_prop),
  education  = weights::wpct(education, education_prop)
))
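Because mismatched names are an easy mistake to make, a quick check before raking can help. The sketch below uses only base R; the helper `check_targets()` and the `demo_*` objects are hypothetical and not part of {anesrake} or {weights}. It assumes each target is a named numeric vector of proportions.

```r
# Hypothetical helper: confirm that every target names a column in the data,
# that the category labels line up, and that the proportions sum to 1.
check_targets <- function(targets, data) {
  for (v in names(targets)) {
    if (!v %in% names(data)) {
      stop("no column named '", v, "' in the data")
    }
    data_levels   <- sort(unique(as.character(data[[v]])))
    target_levels <- sort(names(targets[[v]]))
    if (!identical(data_levels, target_levels)) {
      stop("category labels differ for '", v, "'")
    }
    if (abs(sum(targets[[v]]) - 1) > 1e-8) {
      stop("target proportions for '", v, "' do not sum to 1")
    }
  }
  invisible(TRUE)
}

# Toy stand-ins for the survey data and the target list
demo_dat <- data.frame(
  income    = factor(c("20k", "50k", "100k")),
  education = factor(c("highschool", "college", "graduate"))
)
demo_target <- list(
  income    = c("100k" = 0.15, "20k" = 0.35, "50k" = 0.50),
  education = c(college = 0.497, graduate = 0.127, highschool = 0.376)
)

check_targets(demo_target, demo_dat)  # silent when everything lines up
```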

Now that the population-level proportions are dealt with, we can take a look at our survey results. Using the weights::wpct function, let’s estimate the proportions of our target variables in the survey.

weights::wpct(dat$income)
##      100k       20k       50k 
## 0.2058824 0.4325260 0.3615917
weights::wpct(dat$education)
##    college   graduate highschool 
##  0.5432526  0.3460208  0.1107266

How do our survey proportions compare to our population proportions?

target$income
## 100k  20k  50k 
## 0.15 0.35 0.50
target$education
##    college   graduate highschool 
##      0.497      0.127      0.376

It seems we have an over-representation of respondents in the 20k and 100k income categories (and too few in the 50k category), as well as an over-representation of respondents with college and graduate education (and too few with a high school education). Well, this is what raking is for – let’s fix this!

The {anesrake} R package uses the ANES weighting algorithm to provide weights to any sample. We computed all the necessary inputs above. All we need to do is plug-and-play. There are many function arguments we can specify that I do not include here. The weightvec argument allows us to supply a vector of weights if we are using a dataset that already offers weights. We could also use the filter argument to supply a binary vector indicating which observations to include/exclude for weighting. For example, we may want to exclude observations for which the respondent did not provide an answer to the outcome question.

raking <- anesrake(target,                        # target list identified above
                    dat,                          # survey dataset 
                    dat$caseid,                   # unique identifier for each respondent (1:nrow(dat))
                    cap = 5,                      # Maximum value for any given weight
                    choosemethod = "total",       # How are parameters compared for selection (other options include 'average' and 'max')
                    type = "pctlim",              # What targets should be used to weight 
                    pctlim = 0.05                 # Threshold for deviation
                    )
## [1] "Raking converged in 27 iterations"
raking_summary <- summary(raking)
raking_summary
## $convergence
## [1] "Complete convergence was achieved after 27 iterations"
## 
## $base.weights
## [1] "No Base Weights Were Used"
## 
## $raking.variables
## [1] "income"    "education"
## 
## $weight.summary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2557  0.5272  0.6540  1.0000  1.3487  5.0001 
## 
## $selection.method
## [1] "variable selection conducted using _pctlim_ - discrepancies selected using _total_."
## 
## $general.design.effect
## [1] 1.966204
## 
## $income
##       Target Unweighted N Unweighted %  Wtd N Wtd % Change in % Resid. Disc.
## 100k    0.15          238    0.2058824  173.4  0.15 -0.05588235 2.775558e-17
## 20k     0.35          500    0.4325260  404.6  0.35 -0.08252595 0.000000e+00
## 50k     0.50          418    0.3615917  578.0  0.50  0.13840830 0.000000e+00
## Total   1.00         1156    1.0000000 1156.0  1.00  0.27681661 2.775558e-17
##       Orig. Disc.
## 100k  -0.05588235
## 20k   -0.08252595
## 50k    0.13840830
## Total  0.27681661
## 
## $education
##            Target Unweighted N Unweighted %    Wtd N Wtd % Change in %
## college     0.497          628    0.5432526  574.532 0.497  -0.0462526
## graduate    0.127          400    0.3460208  146.812 0.127  -0.2190208
## highschool  0.376          128    0.1107266  434.656 0.376   0.2652734
## Total       1.000         1156    1.0000000 1156.000 1.000   0.5305467
##             Resid. Disc. Orig. Disc.
## college    -5.551115e-17  -0.0462526
## graduate    0.000000e+00  -0.2190208
## highschool  0.000000e+00   0.2652734
## Total       5.551115e-17   0.5305467

We’ve now got our raking results. Let’s look at the specifics for each target variable.

raking_summary$income
##       Target Unweighted N Unweighted %  Wtd N Wtd % Change in % Resid. Disc.
## 100k    0.15          238    0.2058824  173.4  0.15 -0.05588235 2.775558e-17
## 20k     0.35          500    0.4325260  404.6  0.35 -0.08252595 0.000000e+00
## 50k     0.50          418    0.3615917  578.0  0.50  0.13840830 0.000000e+00
## Total   1.00         1156    1.0000000 1156.0  1.00  0.27681661 2.775558e-17
##       Orig. Disc.
## 100k  -0.05588235
## 20k   -0.08252595
## 50k    0.13840830
## Total  0.27681661
raking_summary$education
##            Target Unweighted N Unweighted %    Wtd N Wtd % Change in %
## college     0.497          628    0.5432526  574.532 0.497  -0.0462526
## graduate    0.127          400    0.3460208  146.812 0.127  -0.2190208
## highschool  0.376          128    0.1107266  434.656 0.376   0.2652734
## Total       1.000         1156    1.0000000 1156.000 1.000   0.5305467
##             Resid. Disc. Orig. Disc.
## college    -5.551115e-17  -0.0462526
## graduate    0.000000e+00  -0.2190208
## highschool  0.000000e+00   0.2652734
## Total       5.551115e-17   0.5305467

We can also look at the general design effect of the weighting. It is about 1.97, which means our weighted estimates have roughly 96.6% more variance than they would under a simple random sample of the same size.

raking_summary$general.design.effect
## [1] 1.966204
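A design effect like this is commonly computed with Kish's approximation, n * sum(w^2) / sum(w)^2. I am assuming that is what the reported value corresponds to; the sketch below does not reproduce the package's internals. The helper `kish_deff()` and the toy weights are hypothetical.

```r
# Kish's approximate design effect from a vector of weights.
# kish_deff() and toy_w are hypothetical illustrations.
kish_deff <- function(w) {
  length(w) * sum(w^2) / sum(w)^2
}

toy_w <- c(0.5, 0.5, 1, 1, 2)

deff  <- kish_deff(toy_w)       # 5 * 6.5 / 25 = 1.3
n_eff <- length(toy_w) / deff   # effective sample size, about 3.85

deff
n_eff
```

Equal weights give a design effect of exactly 1, and the more the weights vary, the larger it gets. Dividing the sample size by the design effect gives an effective sample size: with the numbers in this tutorial, 1156 / 1.966204 is roughly 588.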

We can now append the weights vector to our survey data.

# Create weights vector and attach to dataset 
dat$weights <- raking$weightvec

What is the effect of weighting on our outcome? (In this example, the outcome refers to donations. 1 = respondent would donate to a political party, 0 = respondent would not donate to a political party)

wpct(dat$yhat)
##         0         1 
## 0.8132635 0.1867365
wpct(dat$yhat, dat$weights)
##         0         1 
## 0.8623287 0.1376713
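To see what a weighted proportion means, here is a small base-R sketch that computes one by hand: the share of each category is the sum of its weights divided by the sum of all weights. The helper `weighted_share()` and the toy vectors are hypothetical, and the sketch assumes there are no missing values.

```r
# Hypothetical helper: weighted share of each category of x.
# With the default (equal) weights this is the ordinary proportion.
weighted_share <- function(x, w = rep(1, length(x))) {
  tapply(w, x, sum) / sum(w)
}

toy_y <- c(0, 0, 1, 1, 0)       # toy binary outcome
toy_w <- c(0.5, 0.5, 1, 2, 1)   # toy weights

weighted_share(toy_y)           # unweighted: 0 -> 0.6, 1 -> 0.4
weighted_share(toy_y, toy_w)    # weighted:   0 -> 0.4, 1 -> 0.6
```

In the toy data, the respondents with an outcome of 1 carry larger weights, so their weighted share rises from 0.4 to 0.6.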

If we compare the weighted result with the unweighted one, we see that our raw sample overestimates the proportion of individuals who would donate to a political party (about 18.7% unweighted versus 13.8% weighted). This is likely caused by an oversample of respondents with graduate education and high-income earners.

There we go! We now know how to weight our survey samples with the {anesrake} R package.

Bijean Ghafouri
Doctoral Student

USC