[R] Survey Design / Rake questions

Farley, Robert FarleyR at metro.net
Wed Aug 20 18:56:33 CEST 2008

Thank you for your help.  Yes, my problem is one of non-response.  We
try to hand a survey form to everyone that boards at each stop, but
we're getting only ~10% usable responses.  One reason is that the "full
survey" is long, and requires geo-locating 2 points - Trip Origin and

My hope is to perform a second survey to establish the temporal
distribution.  Unfortunately, it appears that this will need to be
nearly as extensive (expensive) as the original survey.  

If this (raking "time on board") can be shown to work, we can then
generalize the process to other variables.  

Thank you.

PS  Why is 'ByEBOn' a list and not a DataFrame?

> OnLabels    <- c("Warner Center", "De Soto", "Pierce College", "Tampa",
+                  "Reseda", "Balboa", "Woodley", "Sepulveda", "Van Nuys",
+                  "Woodman", "Valley College", "Laurel Canyon",
+                  "North Hollywood")
> EBOnNewTots <- c(1000, 600, 1200, 500, 1000, 500, 200, 250, 1000, 300,
+                  100, 50, 73.65)
> EBNumStn    <- c(673.65, 800, 1000, 1000, 800, 700, 600, 500, 400, 200,
+                  50, 50)
> ByEBOn      <- data.frame(OnLabels, EBOnNewTots)
> ByEBNum     <- data.frame(c(1:12), EBNumStn)
> RakedEBSurvey <- rake(EBDesign, list(~ByEBOn, ~ByEBNum),
+                       list(EBOnNewTots, EBNumStn))
Error in model.frame.default(margin, data = design$variables) : 
  invalid type (list) for variable 'ByEBOn'
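
The error appears to arise because each formula in rake()'s
sample.margins must name a variable stored in the design itself, whereas
ByEBOn is a data frame in the workspace (and a data frame is a list
internally, which is what "invalid type (list)" reports). The population
margins are then supplied as data frames holding the same variable plus
a Freq column. A minimal sketch of that call shape on a toy design
follows; the column names OnStation and NumStn, the toy data, and the
totals are all assumptions for illustration, not the real survey.

```r
library(survey)

# Hypothetical toy sample: 3 boarding stations x 3 ride lengths, 5 riders each.
# OnStation and NumStn are assumed column names, not the real survey's.
OnLabels <- c("Warner Center", "De Soto", "Pierce College")
toy <- expand.grid(OnStation = OnLabels,
                   NumStn    = c("1", "2", "3"),
                   rep       = 1:5)
toy$wt <- 1

# The design must contain the variables that the margin formulas name.
des <- svydesign(ids = ~1, weights = ~wt, data = toy)

# Population margins: one data frame per margin, with a Freq column.
# Both sets of (made-up) totals sum to 2800.
pop_on  <- data.frame(OnStation = OnLabels,
                      Freq      = c(1000, 600, 1200))
pop_num <- data.frame(NumStn    = c("1", "2", "3"),
                      Freq      = c(1400, 900, 500))

raked <- rake(des,
              sample.margins     = list(~OnStation, ~NumStn),
              population.margins = list(pop_on, pop_num))

# Weighted totals by boarding station after raking
svytotal(~OnStation, raked)
```

In the real data the two margin variables would be whatever columns of
EBDesign record the boarding station and the number of stations ridden.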

Robert Farley

-----Original Message-----
From: Stas Kolenikov [mailto:skolenik at gmail.com] 
Sent: Wednesday, August 20, 2008 07:13
To: Farley, Robert
Cc: r-help at r-project.org
Subject: Re: [R] Survey Design / Rake questions

On Mon, Aug 18, 2008 at 6:18 PM, Farley, Robert <FarleyR at metro.net> wrote:
> My motivation is to try to correct for a "time on board" bias we see in
> our surveys.  Not surprisingly, riders who are only on board a short
> time don't attempt/finish our survey forms.  We're able to weight our
> survey to the "bus stop-on by bus run" level.

So is it the problem of catching the short rides in your sample, or
the problem of having those short rides complete the survey? If the
former, then all you have to do is to weight by inverse probability of
selection (Horvitz-Thompson estimator). This probability is probably
roughly proportional to time on bus, which in turn might be
proportional to the number of stops in their ride. You may not need
any raking for that, just do some algebra computing those
probabilities of selection.
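
A minimal sketch of that algebra in base R, assuming (purely for
illustration) that the selection probability is exactly proportional to
the number of stops ridden. The data frame `rides` and its columns are
hypothetical.

```r
# Hypothetical sampled riders: number of stops ridden and a survey answer y
rides <- data.frame(stops = c(1, 2, 4, 8),
                    y     = c(1, 0, 1, 1))

# Assumption: selection probability proportional to stops ridden,
# scaled so that the longest ride is selected with probability 1.
rides$p_sel <- rides$stops / max(rides$stops)

# Horvitz-Thompson weight = inverse probability of selection
rides$w <- 1 / rides$p_sel

# Horvitz-Thompson estimate of the population total of y
ht_total <- sum(rides$w * rides$y)
```

Short rides get the largest weights, which offsets their lower chance of
being caught in the sample.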

If the latter is the problem, then it is the problem of non-response.
If you think that the only thing that matters in whether a person
chooses to respond or not is the length of the ride, then your data
are "missing at random" (MAR), one of several standard concepts in
missing-data statistics
(http://www.citeulike.org/user/ctacmo/article/553290). You can correct
for that; in survey statistics this is, again, done with weights.
Here, you would need to boost the weight by the inverse fraction of
those who did complete the survey.
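
As a rough sketch of that adjustment in base R: within each ride-length
class, divide the base weight by the class response rate, then keep only
the respondents. The data frame `handed`, its columns, and the two
classes are hypothetical.

```r
# Hypothetical riders who were handed a form
handed <- data.frame(
  len_class = c("short", "short", "short", "short", "long", "long"),
  responded = c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE),
  base_w    = 10
)

# Response rate within each ride-length class
rr <- tapply(handed$responded, handed$len_class, mean)

# Boost each weight by the inverse response rate of its class
handed$adj_w <- as.numeric(handed$base_w / rr[as.character(handed$len_class)])

# Only respondents enter the analysis; their adjusted weights
# add up to the base-weight total of everyone handed a form.
resp <- handed[handed$responded, ]
```

This assumes response depends only on ride length (the MAR condition
described above); with more predictors the classes would be finer.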

In a more difficult situation, your response probability might depend
on other factors, say demographics of the passengers, time of the day,
etc. I would imagine you would still have MAR data, unless you have
some weird questions like "Do you carry firearms on the bus?" to which
the people who did have guns at the time of their ride would probably
decline to answer, making the data informatively missing/not missing
at random (NMAR).

Stas Kolenikov, also found at http://stas.kolenikov.name
Small print: I use this email account for mailing lists only.
