[R] Survey Design / Rake questions

Thu Aug 21 22:54:54 CEST 2008

On Tue, 19 Aug 2008, Farley, Robert wrote:

> While I'm trying to catch up on the statistical basis of my task, could
> someone point me to how I should fix my R error?

The variables in the formula in rake() need to be the raw variables in the 
design object, not summary tables.

       -thomas

>
> Thanks
>
>
> ########################################################################
> ####
>> library(survey)
>> SurveyData <- read.spss("C:/Data/R/orange_delivery.sav",
> use.value.labels=TRUE, max.value.labels=Inf, to.data.frame=TRUE)
>>
> #=======================================================================
> ========
>> temp <- sub(' +$', '', SurveyData$direction_)
>> SurveyData$direction_ <- temp
>>
> #=======================================================================
> ========
>>
> SurveyData$NumStn=abs(as.numeric(SurveyData$lineon)-as.numeric(SurveyDat
> a$lineoff))
>> EBSurvey <- subset(SurveyData, direction_ == "EASTBOUND" )
>> XTTable <- xtabs(~direction_ , EBSurvey)
>> XTTable
> direction_
> EASTBOUND
>      345
>> WBSurvey <- subset(SurveyData, direction_ == "WESTBOUND" )
>> XTTable <- xtabs(~direction_ , WBSurvey)
>> XTTable
> direction_
> WESTBOUND
>      307
>> #
>> EBDesign <- svydesign(id=~sampn, weights=~expwgt, data=EBSurvey)
>> #   svytable(~lineon+lineoff, EBDesign)
>> OnLabels    <- c( "Warner Center", "De Soto", "Pierce College",
> "Tampa", "Reseda", "Balboa", "Woodley", "Sepulveda", "Van Nuys",
> "Woodman", "Valley College", "Laurel Canyon", "North Hollywood")
>> EBOnNewTots <- c(            1000,       600,             1200,
> 500,     1000,      500,       200,         250,       1000,       300,
> 100,              50,            73.65 )
>> EBNumStn <- c(673.65,     800, 1000, 1000,  800,  700,  600, 500, 400,
> 200,  50, 50 )
>> ByEBOn <- data.frame(OnLabels,EBOnNewTots)
>> ByEBNum <- data.frame(c(1:12),EBNumStn)
>> RakedEBSurvey <- rake(EBDesign, list(~ByEBOn, ~ByEBNum),
> list(EBOnNewTots, EBNumStn ) )
> Error in model.frame.default(margin, data = design$variables) :
>  invalid type (list) for variable 'ByEBOn'
>>
>
> ###################################################################
> sessionInfo()
> R version 2.7.1 (2008-06-23)
> i386-pc-mingw32
>
> locale:
> LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
> States.1252;LC_MONETARY=English_United
> States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
>
> attached base packages:
> [1] graphics  grDevices utils     datasets  stats     methods   base
>
>
> other attached packages:
> [1] survey_3.8     fortunes_1.3-5 moonsun_0.1    prettyR_1.3-2
> foreign_0.8-28
>>
> ####################################################################
>
> Robert Farley
> Metro
> www.Metro.net
>
> -----Original Message-----
> From: Farley, Robert
> Sent: Monday, August 18, 2008 16:18
> To: 'r-help at r-project.org'
> Subject: RE: [R] Survey Design / Rake questions
>
> Thank you for the list of references.  Do you know of any "free"
> references available online?  I'll have to find my library card :-)
>
>
> My motivation is to try to correct for a "time on board" bias we see in
> our surveys.  Not surprisingly, riders who are only on board a short
> time don't attempt/finish our survey forms.  We're able to weight our
> survey to the "bus stop-on by bus run" level.  I want to keep that, and
> rake on new (imposed?) marginals, like an estimate of how many minutes
> they were on-board derived from their origin-destination.  In practice,
> we'll have thousands of observations on hundreds of runs.  As I see it,
> my work-plan involves:
>
> 	Running rake successfully on test data
> 	Preparing "bus stop-on by run" marginals automatically
> 		Plus any other "pre-existing" marginals to be kept.
> 	Appending "time on bus" estimates
> 	Determining the "time on bus" distribution (second survey?)
> 	Implementing the raking adjustment for a production (large)
> dataset
>
>
> As of yet, I cannot get the first step to work  :-(
>
>
>
> I hope there are no "fatal flaws" in this concept....
>
>
>
>
>
> Robert Farley
> Metro
> www.Metro.net
>
> -----Original Message-----
> From: Stas Kolenikov [mailto:skolenik at gmail.com]
> Sent: Monday, August 18, 2008 10:32
> To: Farley, Robert
> Cc: r-help at r-project.org
> Subject: Re: [R] Survey Design / Rake questions
>
> Your reading, in increasing order of difficulty/mathematical details,
> might be Lohr's "Sampling"
> (http://www.citeulike.org/user/ctacmo/article/1068825), Korn &
> Graubard's "Health Surveys"
> (http://www.citeulike.org/user/ctacmo/article/553280), and Sarndal et.
> al. Survey Math Bible
> (http://www.citeulike.org/user/ctacmo/article/716032). You certainly
> should try to get a hold of the primary concepts before collecting
> your data (or rather before designing your survey... so it might
> already be too late!). Post-stratification is not that huge topic, for
> some reason; a review of mathematical details is given by Valliant
> (1993) (http://www.citeulike.org/user/ctacmo/article/1036976). On
> raking, the paper on top of Google Scholar search by Deville, Sarndal
> and Sautory (1993)
> (http://www.citeulike.org/user/ctacmo/article/3134001) is certainly
> coming from the best people in the field.
>
> I am not aware of general treatment of transportation survey sampling,
> although I suspect such references do exist in transportation
> research. There might be particular twists as the same subject/bus
> usage episode might be sampled at different locations.
>
> As far as rake() procedure is concerned, you need to have your data
> set up as sampled observations with two classifications across which
> you will be raking, probably the directions "E"/"W" and the stations.
> Those are not different data.frames, as you are trying to set them up,
> but a single data.frame with several columns. In other words, your
> sampled data will have labels "E"/"W" in one of the columns, and
> station names in another column, and (the names of) those columns will
> be the imputs of rake().
>
> On 8/18/08, Farley, Robert <FarleyR at metro.net> wrote:
>> I'm trying to learn how to calibrate/postStratify/rake survey data in
>>  preparation for a large survey effort we're about to embark upon.  As
> a
>>  working example, I have results from a small survey of ~650
> respondents,
>>  ~90 response fields each.  I'm trying to learn how to (properly?)
> apply
>>  the aforementioned functions.
>>
>>  My data are from a bus on board survey.  The expansion in the dataset
> is
>>  derived from three elements:
>>
>>   Response rates by bus stop for a sampled run
>>
>>   Total runs/samples runs
>>
>>   Normalized to (separately derived) daily line boarding
>>
>>  In order to get to the point of raking the data, I need to learn more
>>  about the survey package and nomenclature.  For instance, given how
> I've
>>  described the survey/weighting, is my call to svydesign correct?  I'm
>>  not sure I understand just what a "survey design" is.  Where can I
> read
>>  up on this?  What's a good reference for such things as "PSUs",
> "cluster
>>  sampling", and so on.
>
> -- 
> Stas Kolenikov, also found at http://stas.kolenikov.name
> Small print: I use this email account for mailing lists only.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle