[R] Survey Design / Rake questions

Fri Aug 22 19:07:00 CEST 2008

I *think* I'm making progress, but I'm still failing at the same step.  My rake call fails with:
Error in postStratify.survey.design(design, strata[[i]], population.margins[[i]],  : 
  Stratifying variables don't match

To my naïve eyes, it seems that my factors are "in the wrong order".  If so, how do I "assert" an ordering in my survey dataframe, or copy an "image" from the survey dataframe to my marginals dataframes?  I'd prefer to "pull" the original marginals dataframe(s) from the survey dataframe so that I can automate that in production.

If that's not my problem, where might I look for enlightenment?  Neither "?why" nor ?whatamimissing return citations.  :-)

**************************************************************************
**** How I fail Now ******************************************************
SurveyData <- read.spss("C:/Data/R/orange_delivery.sav", use.value.labels=TRUE, max.value.labels=Inf, to.data.frame=TRUE)
> #===============================================================================
> temp <- sub(' +$', '', SurveyData$direction_) 
> SurveyData$direction_ <- temp
> #===============================================================================
> SurveyData$NumStn=abs(as.numeric(SurveyData$lineon)-as.numeric(SurveyData$lineoff))
> mean(SurveyData$NumStn)
[1] 6.785276
> ### Kludge
> SurveyData$NumStn <- pmax(1,SurveyData$NumStn)
> mean(SurveyData$NumStn)
[1] 6.789877
> SurveyData$NumStn <- as.factor(SurveyData$NumStn)
> ###
> EBSurvey <- subset(SurveyData, direction_ == "EASTBOUND" )
> XTTable <- xtabs(~direction_ , EBSurvey)
> XTTable
direction_
EASTBOUND 
      345 
> WBSurvey <- subset(SurveyData, direction_ == "WESTBOUND" )
> XTTable <- xtabs(~direction_ , WBSurvey)
> XTTable
direction_
WESTBOUND 
      307 
> #
> EBDesign <- svydesign(id=~sampn, weights=~expwgt, data=EBSurvey)
> #   svytable(~lineon+lineoff, EBDesign)
> StnName     <- c( "Warner Center", "De Soto", "Pierce College", "Tampa", "Reseda", "Balboa", "Woodley", "Sepulveda", "Van Nuys", "Woodman", "Valley College", "Laurel Canyon", "North Hollywood")
> EBOnNewTots <- c(            1000,       600,             1200,     500,     1000,      500,       200,         250,       1000,       300,              100,          123.65,                0 )
> StnTraveld  <- c(as.character(1:12))
> EBNumStn    <- c(673.65,     800, 1000, 1000,  800,  700,  600, 500, 400, 200,  50, 50 )
> ByEBOn  <- data.frame(StnName,   Freq=EBOnNewTots)
> ByEBNum <- data.frame(StnTraveld, Freq=EBNumStn)
> RakedEBSurvey <- rake(EBDesign, list(~lineon, ~NumStn), list(ByEBOn, ByEBNum) )
Error in postStratify.survey.design(design, strata[[i]], population.margins[[i]],  : 
  Stratifying variables don't match
> 
> str(EBSurvey$lineon)
 Factor w/ 13 levels "Warner Center",..: 3 1 1 1 2 13 1 5 1 5 ...
> EBSurvey$lineon[1:5]
[1] Pierce College Warner Center  Warner Center  Warner Center  De Soto       
Levels: Warner Center De Soto Pierce College Tampa Reseda Balboa Woodley Sepulveda Van Nuys Woodman Valley College Laurel Canyon North Hollywood
> str(ByEBOn$StnName)
 Factor w/ 13 levels "Balboa","De Soto",..: 11 2 5 8 6 1 12 7 10 13 ...
> ByEBOn$StnName[1:5]
[1] Warner Center  De Soto        Pierce College Tampa          Reseda        
Levels: Balboa De Soto Laurel Canyon North Hollywood Pierce College Reseda Sepulveda Tampa Valley College Van Nuys Warner Center Woodley Woodman
> 
> str(EBSurvey$NumStn)
 Factor w/ 12 levels "1","2","3","4",..: 10 12 4 12 8 1 8 8 12 4 ...
> EBSurvey$NumStn[1:5]
[1] 10 12 4  12 8 
Levels: 1 2 3 4 5 6 7 8 9 10 11 12
> str(ByEBNum$StnTraveld)
 Factor w/ 12 levels "1","10","11",..: 1 5 6 7 8 9 10 11 12 2 ...
> ByEBNum$StnTraveld[1:5]
[1] 1 2 3 4 5
Levels: 1 10 11 12 2 3 4 5 6 7 8 9
>

**************************************************************************
**************************************************************************

Robert Farley
Metro
www.Metro.net 

-----Original Message-----
From: Thomas Lumley [mailto:tlumley at u.washington.edu] 
Sent: Thursday, August 21, 2008 13:55
To: Farley, Robert
Cc: r-help at r-project.org
Subject: Re: [R] Survey Design / Rake questions

On Tue, 19 Aug 2008, Farley, Robert wrote:

> While I'm trying to catch up on the statistical basis of my task, could
> someone point me to how I should fix my R error?

The variables in the formula in rake() need to be the raw variables in the 
design object, not summary tables.

       -thomas

>
> Thanks
>
>
> ########################################################################
> ####
>> library(survey)
>> SurveyData <- read.spss("C:/Data/R/orange_delivery.sav",
> use.value.labels=TRUE, max.value.labels=Inf, to.data.frame=TRUE)
>>
> #=======================================================================
> ========
>> temp <- sub(' +$', '', SurveyData$direction_)
>> SurveyData$direction_ <- temp
>>
> #=======================================================================
> ========
>>
> SurveyData$NumStn=abs(as.numeric(SurveyData$lineon)-as.numeric(SurveyDat
> a$lineoff))
>> EBSurvey <- subset(SurveyData, direction_ == "EASTBOUND" )
>> XTTable <- xtabs(~direction_ , EBSurvey)
>> XTTable
> direction_
> EASTBOUND
>      345
>> WBSurvey <- subset(SurveyData, direction_ == "WESTBOUND" )
>> XTTable <- xtabs(~direction_ , WBSurvey)
>> XTTable
> direction_
> WESTBOUND
>      307
>> #
>> EBDesign <- svydesign(id=~sampn, weights=~expwgt, data=EBSurvey)
>> #   svytable(~lineon+lineoff, EBDesign)
>> OnLabels    <- c( "Warner Center", "De Soto", "Pierce College",
> "Tampa", "Reseda", "Balboa", "Woodley", "Sepulveda", "Van Nuys",
> "Woodman", "Valley College", "Laurel Canyon", "North Hollywood")
>> EBOnNewTots <- c(            1000,       600,             1200,
> 500,     1000,      500,       200,         250,       1000,       300,
> 100,              50,            73.65 )
>> EBNumStn <- c(673.65,     800, 1000, 1000,  800,  700,  600, 500, 400,
> 200,  50, 50 )
>> ByEBOn <- data.frame(OnLabels,EBOnNewTots)
>> ByEBNum <- data.frame(c(1:12),EBNumStn)
>> RakedEBSurvey <- rake(EBDesign, list(~ByEBOn, ~ByEBNum),
> list(EBOnNewTots, EBNumStn ) )
> Error in model.frame.default(margin, data = design$variables) :
>  invalid type (list) for variable 'ByEBOn'
>>
>
> ###################################################################
> sessionInfo()
> R version 2.7.1 (2008-06-23)
> i386-pc-mingw32
>
> locale:
> LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
> States.1252;LC_MONETARY=English_United
> States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
>
> attached base packages:
> [1] graphics  grDevices utils     datasets  stats     methods   base
>
>
> other attached packages:
> [1] survey_3.8     fortunes_1.3-5 moonsun_0.1    prettyR_1.3-2
> foreign_0.8-28
>>
> ####################################################################
>
> Robert Farley
> Metro
> www.Metro.net
>
> -----Original Message-----
> From: Farley, Robert
> Sent: Monday, August 18, 2008 16:18
> To: 'r-help at r-project.org'
> Subject: RE: [R] Survey Design / Rake questions
>
> Thank you for the list of references.  Do you know of any "free"
> references available online?  I'll have to find my library card :-)
>
>
> My motivation is to try to correct for a "time on board" bias we see in
> our surveys.  Not surprisingly, riders who are only on board a short
> time don't attempt/finish our survey forms.  We're able to weight our
> survey to the "bus stop-on by bus run" level.  I want to keep that, and
> rake on new (imposed?) marginals, like an estimate of how many minutes
> they were on-board derived from their origin-destination.  In practice,
> we'll have thousands of observations on hundreds of runs.  As I see it,
> my work-plan involves:
>
> 	Running rake successfully on test data
> 	Preparing "bus stop-on by run" marginals automatically
> 		Plus any other "pre-existing" marginals to be kept.
> 	Appending "time on bus" estimates
> 	Determining the "time on bus" distribution (second survey?)
> 	Implementing the raking adjustment for a production (large)
> dataset
>
>
> As of yet, I cannot get the first step to work  :-(
>
>
>
> I hope there are no "fatal flaws" in this concept....
>
>
>
>
>
> Robert Farley
> Metro
> www.Metro.net
>
> -----Original Message-----
> From: Stas Kolenikov [mailto:skolenik at gmail.com]
> Sent: Monday, August 18, 2008 10:32
> To: Farley, Robert
> Cc: r-help at r-project.org
> Subject: Re: [R] Survey Design / Rake questions
>
> Your reading, in increasing order of difficulty/mathematical details,
> might be Lohr's "Sampling"
> (http://www.citeulike.org/user/ctacmo/article/1068825), Korn &
> Graubard's "Health Surveys"
> (http://www.citeulike.org/user/ctacmo/article/553280), and Sarndal et.
> al. Survey Math Bible
> (http://www.citeulike.org/user/ctacmo/article/716032). You certainly
> should try to get a hold of the primary concepts before collecting
> your data (or rather before designing your survey... so it might
> already be too late!). Post-stratification is not that huge topic, for
> some reason; a review of mathematical details is given by Valliant
> (1993) (http://www.citeulike.org/user/ctacmo/article/1036976). On
> raking, the paper on top of Google Scholar search by Deville, Sarndal
> and Sautory (1993)
> (http://www.citeulike.org/user/ctacmo/article/3134001) is certainly
> coming from the best people in the field.
>
> I am not aware of general treatment of transportation survey sampling,
> although I suspect such references do exist in transportation
> research. There might be particular twists as the same subject/bus
> usage episode might be sampled at different locations.
>
> As far as rake() procedure is concerned, you need to have your data
> set up as sampled observations with two classifications across which
> you will be raking, probably the directions "E"/"W" and the stations.
> Those are not different data.frames, as you are trying to set them up,
> but a single data.frame with several columns. In other words, your
> sampled data will have labels "E"/"W" in one of the columns, and
> station names in another column, and (the names of) those columns will
> be the imputs of rake().
>
> On 8/18/08, Farley, Robert <FarleyR at metro.net> wrote:
>> I'm trying to learn how to calibrate/postStratify/rake survey data in
>>  preparation for a large survey effort we're about to embark upon.  As
> a
>>  working example, I have results from a small survey of ~650
> respondents,
>>  ~90 response fields each.  I'm trying to learn how to (properly?)
> apply
>>  the aforementioned functions.
>>
>>  My data are from a bus on board survey.  The expansion in the dataset
> is
>>  derived from three elements:
>>
>>   Response rates by bus stop for a sampled run
>>
>>   Total runs/samples runs
>>
>>   Normalized to (separately derived) daily line boarding
>>
>>  In order to get to the point of raking the data, I need to learn more
>>  about the survey package and nomenclature.  For instance, given how
> I've
>>  described the survey/weighting, is my call to svydesign correct?  I'm
>>  not sure I understand just what a "survey design" is.  Where can I
> read
>>  up on this?  What's a good reference for such things as "PSUs",
> "cluster
>>  sampling", and so on.
>
> -- 
> Stas Kolenikov, also found at http://stas.kolenikov.name
> Small print: I use this email account for mailing lists only.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle