[R] Survey Design / Rake questions
Thomas Lumley
tlumley at u.washington.edu
Thu Aug 21 22:54:54 CEST 2008
On Tue, 19 Aug 2008, Farley, Robert wrote:
> While I'm trying to catch up on the statistical basis of my task, could
> someone point me to how I should fix my R error?
The variables in the formula in rake() need to be the raw variables in the
design object, not summary tables.
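
A minimal sketch of what that might look like, on made-up toy data rather
than the poster's file. The assumptions: the raking variables are columns
of the design's data (here lineon and NumStn, names borrowed from the post),
each population margin is a data frame with that same column name plus a
Freq column, and every total below is invented for illustration.

```r
library(survey)

# Toy stand-in for EBSurvey; all values here are hypothetical.
set.seed(1)
stops <- c("Warner Center", "De Soto", "Reseda")
toy <- data.frame(
  sampn  = 1:200,
  expwgt = rep(10, 200),
  lineon = factor(sample(stops, 200, replace = TRUE), levels = stops),
  NumStn = factor(sample(1:3, 200, replace = TRUE))
)
design <- svydesign(id = ~sampn, weights = ~expwgt, data = toy)

# Population margins: one data frame per raking variable. The first column
# carries the name of the design variable, the second must be called Freq.
# Both margins add up to the same grand total (2000).
pop.on  <- data.frame(lineon = stops,         Freq = c(1000, 600, 400))
pop.num <- data.frame(NumStn = factor(1:3),   Freq = c(800, 700, 500))

# The formulas refer to columns of the design, not to the margin tables.
raked <- rake(design, list(~lineon, ~NumStn), list(pop.on, pop.num))

svytable(~lineon, raked)   # close to the lineon targets
svytable(~NumStn, raked)   # matches the NumStn targets
```

In the original call, ~ByEBOn and ~ByEBNum name data frames in the
workspace, which is why model.frame() complains about a variable of type
list; the tables belong in the third argument instead.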
-thomas
>
> Thanks
>
>
> ############################################################################
>> library(survey)
>> SurveyData <- read.spss("C:/Data/R/orange_delivery.sav",
>>     use.value.labels=TRUE, max.value.labels=Inf, to.data.frame=TRUE)
>> #===============================================================================
>> temp <- sub(' +$', '', SurveyData$direction_)
>> SurveyData$direction_ <- temp
>> #===============================================================================
>> SurveyData$NumStn=abs(as.numeric(SurveyData$lineon)-as.numeric(SurveyData$lineoff))
>> EBSurvey <- subset(SurveyData, direction_ == "EASTBOUND" )
>> XTTable <- xtabs(~direction_ , EBSurvey)
>> XTTable
> direction_
> EASTBOUND
> 345
>> WBSurvey <- subset(SurveyData, direction_ == "WESTBOUND" )
>> XTTable <- xtabs(~direction_ , WBSurvey)
>> XTTable
> direction_
> WESTBOUND
> 307
>> #
>> EBDesign <- svydesign(id=~sampn, weights=~expwgt, data=EBSurvey)
>> # svytable(~lineon+lineoff, EBDesign)
>> OnLabels <- c( "Warner Center", "De Soto", "Pierce College",
>>     "Tampa", "Reseda", "Balboa", "Woodley", "Sepulveda", "Van Nuys",
>>     "Woodman", "Valley College", "Laurel Canyon", "North Hollywood")
>> EBOnNewTots <- c( 1000, 600, 1200,
>>     500, 1000, 500, 200, 250, 1000, 300,
>>     100, 50, 73.65 )
>> EBNumStn <- c(673.65, 800, 1000, 1000, 800, 700, 600, 500, 400,
>>     200, 50, 50 )
>> ByEBOn <- data.frame(OnLabels,EBOnNewTots)
>> ByEBNum <- data.frame(c(1:12),EBNumStn)
>> RakedEBSurvey <- rake(EBDesign, list(~ByEBOn, ~ByEBNum),
>>     list(EBOnNewTots, EBNumStn ) )
> Error in model.frame.default(margin, data = design$variables) :
> invalid type (list) for variable 'ByEBOn'
>>
>
> ###################################################################
> sessionInfo()
> R version 2.7.1 (2008-06-23)
> i386-pc-mingw32
>
> locale:
> LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
> States.1252;LC_MONETARY=English_United
> States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
>
> attached base packages:
> [1] graphics grDevices utils datasets stats methods base
>
>
> other attached packages:
> [1] survey_3.8 fortunes_1.3-5 moonsun_0.1 prettyR_1.3-2
> foreign_0.8-28
>>
> ####################################################################
>
> Robert Farley
> Metro
> www.Metro.net
>
> -----Original Message-----
> From: Farley, Robert
> Sent: Monday, August 18, 2008 16:18
> To: 'r-help at r-project.org'
> Subject: RE: [R] Survey Design / Rake questions
>
> Thank you for the list of references. Do you know of any "free"
> references available online? I'll have to find my library card :-)
>
>
> My motivation is to try to correct for a "time on board" bias we see in
> our surveys. Not surprisingly, riders who are only on board a short
> time don't attempt/finish our survey forms. We're able to weight our
> survey to the "bus stop-on by bus run" level. I want to keep that, and
> rake on new (imposed?) marginals, like an estimate of how many minutes
> they were on-board derived from their origin-destination. In practice,
> we'll have thousands of observations on hundreds of runs. As I see it,
> my work-plan involves:
>
> Running rake successfully on test data
> Preparing "bus stop-on by run" marginals automatically
> Plus any other "pre-existing" marginals to be kept.
> Appending "time on bus" estimates
> Determining the "time on bus" distribution (second survey?)
> Implementing the raking adjustment for a production (large)
> dataset
>
>
> As of yet, I cannot get the first step to work :-(
>
>
>
> I hope there are no "fatal flaws" in this concept....
>
>
>
>
>
> Robert Farley
> Metro
> www.Metro.net
>
> -----Original Message-----
> From: Stas Kolenikov [mailto:skolenik at gmail.com]
> Sent: Monday, August 18, 2008 10:32
> To: Farley, Robert
> Cc: r-help at r-project.org
> Subject: Re: [R] Survey Design / Rake questions
>
> Your reading, in increasing order of difficulty/mathematical details,
> might be Lohr's "Sampling"
> (http://www.citeulike.org/user/ctacmo/article/1068825), Korn &
> Graubard's "Health Surveys"
> (http://www.citeulike.org/user/ctacmo/article/553280), and Sarndal et
> al.'s Survey Math Bible
> (http://www.citeulike.org/user/ctacmo/article/716032). You certainly
> should try to get a hold of the primary concepts before collecting
> your data (or rather before designing your survey... so it might
> already be too late!). Post-stratification is not that huge a topic, for
> some reason; a review of mathematical details is given by Valliant
> (1993) (http://www.citeulike.org/user/ctacmo/article/1036976). On
> raking, the paper on top of Google Scholar search by Deville, Sarndal
> and Sautory (1993)
> (http://www.citeulike.org/user/ctacmo/article/3134001) is certainly
> coming from the best people in the field.
>
> I am not aware of a general treatment of transportation survey sampling,
> although I suspect such references do exist in transportation
> research. There might be particular twists as the same subject/bus
> usage episode might be sampled at different locations.
>
> As far as the rake() procedure is concerned, you need to have your data
> set up as sampled observations with two classifications across which
> you will be raking, probably the directions "E"/"W" and the stations.
> Those are not different data.frames, as you are trying to set them up,
> but a single data.frame with several columns. In other words, your
> sampled data will have labels "E"/"W" in one of the columns, and
> station names in another column, and (the names of) those columns will
> be the inputs of rake().
>
> On 8/18/08, Farley, Robert <FarleyR at metro.net> wrote:
>> I'm trying to learn how to calibrate/postStratify/rake survey data in
>> preparation for a large survey effort we're about to embark upon. As a
>> working example, I have results from a small survey of ~650 respondents,
>> ~90 response fields each. I'm trying to learn how to (properly?) apply
>> the aforementioned functions.
>>
>> My data are from a bus on board survey. The expansion in the dataset is
>> derived from three elements:
>>
>> Response rates by bus stop for a sampled run
>>
>> Total runs/sampled runs
>>
>> Normalized to (separately derived) daily line boarding
>>
>> In order to get to the point of raking the data, I need to learn more
>> about the survey package and nomenclature. For instance, given how I've
>> described the survey/weighting, is my call to svydesign correct? I'm
>> not sure I understand just what a "survey design" is. Where can I read
>> up on this? What's a good reference for such things as "PSUs", "cluster
>> sampling", and so on.
>
> --
> Stas Kolenikov, also found at http://stas.kolenikov.name
> Small print: I use this email account for mailing lists only.
>
Thomas Lumley                   Assoc. Professor, Biostatistics
tlumley at u.washington.edu     University of Washington, Seattle