[R] Stratified Random Sampling Proportional to Size

William Dunlap wdunlap at tibco.com
Mon Apr 29 17:49:18 CEST 2013


This problem in sampling::strata() comes from calling cbind on a zero-row data.frame
with a scalar number.

  > library(sampling)
  > strata(mtcars[,c("mpg","hp","gear")], strat="gear", size=c(5,5,0))
  Error in data.frame(..., check.names = FALSE) :
    arguments imply differing number of rows: 0, 1
  In addition: Warning message:
  In strata(mtcars[, c("mpg", "hp", "gear")], strat = "gear", size = c(5,  :
    the method is not specified; by default, the method is srswor
  > traceback()
  5: stop("arguments imply differing number of rows: ", paste(unique(nrows),
         collapse = ", "))
  4: data.frame(..., check.names = FALSE)
  3: cbind(deparse.level, ...)
  2: cbind(r, i)
  1: strata(mtcars[, c("mpg", "hp", "gear")], strat = "gear", size = c(5,
         5, 0))

Changing that cbind call from cbind(r, i) to cbind(r, rep(i, length.out=nrow(r)))
would fix it up.
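
A minimal base-R sketch of why that change helps. Here `r` and `i` are only stand-ins for the variables of the same names inside strata(); their shapes are an assumption for illustration, and the function's actual internals are not reproduced.

```r
# Stand-ins for strata()'s internal variables (assumed shapes, illustration only)
r <- mtcars[0, c("mpg", "hp")]   # zero-row data.frame: nothing sampled from this stratum
i <- 3                           # scalar stratum index

# Original form: the scalar has length 1 but r has 0 rows, so data.frame() stops
res_old <- try(cbind(r, i), silent = TRUE)

# Proposed form: rep() trims the scalar to length 0 so it matches nrow(r)
res_new <- cbind(r, rep(i, length.out = nrow(r)))

inherits(res_old, "try-error")   # whether the unpatched call failed
dim(res_new)                     # dimensions of the patched result
```

When nrow(r) is positive, rep(i, length.out=nrow(r)) simply repeats the index once per row, so non-empty strata are unaffected.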

cbind is inconsistent in how it treats a 0-row rectangular input combined with
a scalar.

With a matrix you get a 0-row result and a warning:
  > m <- matrix(numeric(), nrow=0, ncol=3, dimnames=list(NULL,paste0("Col",1:3)))
  > str(cbind(m, 666))
   num[0 , 1:4] 
   - attr(*, "dimnames")=List of 2
    ..$ : NULL
    ..$ : chr [1:4] "Col1" "Col2" "Col3" ""
  Warning message:
  In cbind(m, 666) :
    number of rows of result is not a multiple of vector length (arg 2)

With a data.frame you get an error:
  > str(cbind(data.frame(m), 666))
  Error in data.frame(..., check.names = FALSE) : 
    arguments imply differing number of rows: 0, 1
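
Until strata() handles this case, one way around it is to avoid the zero-row cbind altogether. The following is a base-R sketch, not the sampling package's algorithm: `sample_strata` is a hypothetical helper that draws a simple random sample without replacement within each stratum and accepts a size of 0.

```r
# Hypothetical helper (not part of the sampling package).
# data:  a data.frame
# strat: name of the single stratification column
# sizes: named vector of sample sizes; names are stratum labels
sample_strata <- function(data, strat, sizes) {
  # Row indices of each stratum, keyed by stratum label
  groups <- split(seq_len(nrow(data)), data[[strat]])
  picked <- lapply(names(sizes), function(g) {
    idx <- groups[[g]]
    # sample.int(n, 0) returns integer(0), so a size of 0 selects no rows;
    # a size larger than the stratum is an error, as expected without replacement
    idx[sample.int(length(idx), sizes[[g]])]
  })
  data[sort(unlist(picked)), , drop = FALSE]
}

set.seed(1)
s <- sample_strata(mtcars[, c("mpg", "hp", "gear")], "gear",
                   c("3" = 5, "4" = 5, "5" = 0))
table(s$gear)   # counts per sampled stratum; gear 5 contributes no rows
```

Naming the sizes by stratum label sidesteps any question of which order the strata appear in the data.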

Bill Dunlap
Spotfire, TIBCO Software
wdunlap at tibco.com


> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
> Of Thomas Lumley
> Sent: Sunday, April 28, 2013 1:31 PM
> To: Jeff Newmiller
> Cc: R help (r-help at r-project.org)
> Subject: Re: [R] Stratified Random Sampling Proportional to Size
> 
> It looks as though you can't sample zero observations from a stratum.  If
> you take the example on the help page and change one of the sample sizes to
> zero you get exactly the same error.
> 
> From the fact that there isn't a more explicit error message, I would guess
> that the author just never considered the possibility that someone would
> have a population stratum and not sample from it.
> 
>     -thomas
> 
> 
> On Sun, Apr 28, 2013 at 7:14 PM, Jeff Newmiller <jdnewmil at dcn.davis.ca.us> wrote:
> 
> > a) Please post plain text
> >
> > b) Please make reproducible examples (e.g. telling us how you accessed a
> > database that we have no access to is not helpful). See ?head, ?dput and [1]
> >
> > c) I don't know anything about the sampling package or the strata
> > function, but I would recommend eliminating the rows that have zeros from
> > the input data. E.g.:
> >
> > stratum_cp <- stratum_cp[ 0<stratum_cp$stratp, ]
> >
> > [1] http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
> >
> > On Fri, 26 Apr 2013, Lopez, Dan wrote:
> >
> >> Hello R Experts,
> >>
> >> I kindly request your assistance in figuring out how to get a stratified
> >> random sample of 100 records, allocated proportionally to stratum size.
> >>
> >> Below is my R code showing what I did and the error I'm getting with
> >> sampling::strata
> >>
> >> # FIRST I summarized the count of records by the two variables I want to use
> >> # as strata
> >>
> >> library(RODBC)
> >> library(sqldf)
> >> library(sampling)
> >> # After establishing the connection (ch), I query the data, sort it by the strata
> >> # APPT_TYP_CD_LL and EMPL_TYPE, and store it in a data frame
> >> CURRPOP<-sqlQuery(ch,"SELECT APPT_TYP_CD_LL,EMPL_TYPE,ASOFDATE,EMPLID,
> >>   NAME,DEPTID,JOBCODE,JOBTITLE,SAL_ADMIN_PLAN,RET_TYP_CD_LL FROM
> >>   PS_EMPLOYEES_LL WHERE EMPL_STATUS NOT IN('R','T') ORDER BY
> >>   APPT_TYP_CD_LL,EMPL_TYPE")
> >> # ROWID is a dummy ID I added and repositioned after the strata columns for
> >> # later use
> >> CURRPOP$ROWID<-seq(nrow(CURRPOP))
> >> CURRPOP<-CURRPOP[,c(1:2,11,3:10)]
> >>
> >> # My strata. stratp is how many I want to sample from each stratum. NOTE:
> >> # THERE ARE SOME 0's, which just means I won't sample from that group.
> >> stratum_cp<-sqldf("SELECT APPT_TYP_CD_LL,EMPL_TYPE, count(*) HC FROM
> >>   CURRPOP GROUP BY APPT_TYP_CD_LL,EMPL_TYPE")
> >> stratum_cp$stratp<-round(stratum_cp$HC/nrow(CURRPOP)*100)
> >>
> >> > stratum_cp
> >>   APPT_TYP_CD_LL EMPL_TYPE   HC stratp
> >> 1              FA         S    1      0
> >> 2              FC         S    5      0
> >> 3              FP         S  173      3
> >> 4              FR         H  170      3
> >> 5              FX         H   49      1
> >> 6              FX         S   57      1
> >> 7              IN         H 1589     25
> >> 8              IN         S 3987     63
> >> 9              IP         H    7      0
> >> 10             IP         S   53      1
> >> 11             SA         H    8      0
> >> 12             SE         S   43      1
> >> 13             SF         H   14      0
> >> 14             SF         S    1      0
> >> 15             SG         S   10      0
> >> 16             ST         H  107      2
> >> 17             ST         S    6      0
> >>
> >> # THEN I attempted to use sampling::strata using the instructions in that
> >> # package and got an error.
> >> # I use stratum_cp$stratp for my sizes.
> >>
> >> > s<-strata(CURRPOP,c("APPT_TYP_CD_LL","EMPL_TYPE"),size=stratum_cp$stratp,method="srswor")
> >> Error in data.frame(..., check.names = FALSE) :
> >>   arguments imply differing number of rows: 0, 1
> >> > traceback()
> >> 5: stop("arguments imply differing number of rows: ", paste(unique(nrows),
> >>        collapse = ", "))
> >> 4: data.frame(..., check.names = FALSE)
> >> 3: cbind(deparse.level, ...)
> >> 2: cbind(r, i)
> >> 1: strata(CURRPOP, c("APPT_TYP_CD_LL", "EMPL_TYPE"), size = stratum_cp$stratp,
> >>        method = "srswor")
> >>
> >> #In lieu of a reproducible sample here is some info regarding most of my
> >> data
> >> dim(CURRPOP)
> >> [1] 6280   11
> >> #Cols w/ personal info have been removed in this output
> >>
> >> > str(CURRPOP[,c(1:3,7:11)])
> >> 'data.frame':  6280 obs. of  8 variables:
> >>  $ APPT_TYP_CD_LL: Factor w/ 12 levels "FA","FC","FP",..: 1 2 2 2 2 2 3 3 3 3 ...
> >>  $ EMPL_TYPE     : Factor w/ 2 levels "H","S": 2 2 2 2 2 2 2 2 2 2 ...
> >>  $ ROWID         : int  1 2 3 4 5 6 7 8 9 10 ...
> >>  $ DEPTID        : int  9825 9613 9613 9852 9772 9852 9853 9853 9853 9854 ...
> >>  $ JOBCODE       : Factor w/ 325 levels "055.2","055.3",..: 311 112 112 112 112 112 298 299 299 300 ...
> >>  $ JOBTITLE      : Factor w/ 325 levels "Accounting Assistant",..: 227 192 192 192 192 192 190 191 191 153 ...
> >>  $ SAL_ADMIN_PLAN: Factor w/ 40 levels "ADE","AME","ASE",..: 36 38 38 38 38 38 31 31 31 31 ...
> >>  $ RET_TYP_CD_LL : Factor w/ 2 levels "TCP1","TCP2": 2 2 2 2 2 2 2 2 2 2 ...
> >>
> >> Daniel Lopez
> >> Workforce Analyst
> >> HRIM - Workforce Analytics & Metrics
> >> Strategic Human Resources Management
> >> wf-analytics-metrics at lists.llnl.gov
> >> (925) 422-0814
> >>
> >>
> >>         [[alternative HTML version deleted]]
> >>
> >> ______________________________________________
> >> R-help at r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
> >>
> >>
> > ---------------------------------------------------------------------------
> > Jeff Newmiller                        The     .....       .....  Go Live...
> > DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
> >                                       Live:   OO#.. Dead: OO#..  Playing
> > Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
> > /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
> >
> >
> 
> 
> 
> --
> Thomas Lumley
> Professor of Biostatistics
> University of Auckland
> 
> 	[[alternative HTML version deleted]]
> 


