[R] Stratified Random Sampling Proportional to Size

Lopez, Dan lopez235 at llnl.gov
Mon Apr 29 19:06:31 CEST 2013


Hi Jeff,
a & b) points taken. Thanks for the reference too.
c) taking the zeros out did the trick.

Dan

-----Original Message-----
From: Jeff Newmiller [mailto:jdnewmil at dcn.davis.ca.us] 
Sent: Sunday, April 28, 2013 12:15 AM
To: Lopez, Dan
Cc: R help (r-help at r-project.org)
Subject: Re: [R] Stratified Random Sampling Proportional to Size

a) Please post plain text

b) Please make reproducible examples (e.g. telling us how you accessed a database that we have no access to is not helpful). See ?head, ?dput and [1]

c) I don't know anything about the sampling package or the strata function, but I would recommend eliminating the rows that have zeros from the input data. E.g.:

stratum_cp <- stratum_cp[ 0<stratum_cp$stratp, ]
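
As a rough illustration of the same idea without the sampling package, here is a minimal base-R sketch on made-up data. The data frame pop, its columns id and group, and the target of 100 are all assumptions for the example and are not taken from the original post. It allocates the sample proportionally to stratum size, drops the strata whose allocation rounds to zero, and then draws without replacement within each remaining stratum.

```r
# Toy population; all names and sizes here are hypothetical.
set.seed(42)
pop <- data.frame(
  id    = 1:1000,
  group = rep(c("A", "B", "C", "D"), times = c(600, 350, 47, 3)),
  stringsAsFactors = FALSE
)

# Stratum sizes and an allocation proportional to size for a target of 100.
counts <- table(pop$group)
alloc  <- round(as.vector(counts) / nrow(pop) * 100)
names(alloc) <- names(counts)

# Drop strata whose allocation rounded to zero.
alloc <- alloc[alloc > 0]

# Simple random sampling without replacement within each remaining stratum.
# Indexing with sample.int() avoids the surprise sample() gives when a
# stratum contains a single row.
picked <- unlist(lapply(names(alloc), function(g) {
  rows <- which(pop$group == g)
  rows[sample.int(length(rows), alloc[[g]])]
}))
smp <- pop[picked, ]

# Sampled rows per stratum.
table(smp$group)
```

Note that rounded allocations do not always add up to the target, so it is worth checking sum(alloc) before sampling.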

[1]
http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
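
For point b), a small sketch of how dput() can stand in for data that readers cannot access. The data frame df below is hypothetical; in practice it would be a few rows of the real data with any personal columns removed.

```r
# Hypothetical stand-in for a table pulled from a private database.
df <- data.frame(
  grp = c("A", "A", "B"),
  n   = c(10L, 0L, 5L),
  stringsAsFactors = FALSE
)

# dput() prints R code that recreates the object; paste that output
# into the post so others can rebuild the data with a single assignment.
dput(head(df, 3))

# Round trip: deparse() yields the same text that dput() prints, and
# evaluating that text rebuilds an equivalent data frame.
txt <- paste(deparse(head(df, 3)), collapse = "\n")
df2 <- eval(parse(text = txt))
isTRUE(all.equal(df2, df))
```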

On Fri, 26 Apr 2013, Lopez, Dan wrote:

> Hello R Experts,
>
> I kindly request your assistance on figuring out how to get a 
> stratified random sampling proportional to 100.
>
> Below is my r code showing what I did and the error I'm getting with 
> sampling::strata
>
> # First I summarized the count of records by the two variables I want to
> # use as strata
>
> library(RODBC)
> library(sqldf)
> library(sampling)
> # After establishing the connection, I query the data, sort it by the strata
> # APPT_TYP_CD_LL and EMPL_TYPE, and store it in a data frame.
> CURRPOP<-sqlQuery(ch,"SELECT APPT_TYP_CD_LL, 
> EMPL_TYPE,ASOFDATE,EMPLID,NAME,DEPTID,JOBCODE,JOBTITLE,SAL_ADMIN_PLAN,
> RET_TYP_CD_LL FROM PS_EMPLOYEES_LL WHERE EMPL_STATUS NOT IN('R','T') 
> ORDER BY APPT_TYP_CD_LL, EMPL_TYPE")
> # ROWID is a dummy ID I added and repositioned after the strata columns
> # for later use.
> CURRPOP$ROWID<-seq(nrow(CURRPOP))
> CURRPOP<-CURRPOP[,c(1:2,11,3:10)]
>
> # My strata. stratp is how many records I want to sample from each stratum. NOTE: there are some 0s, which just means I won't sample from that group.
> stratum_cp<-sqldf("SELECT APPT_TYP_CD_LL,EMPL_TYPE, count(*) HC FROM 
> CURRPOP GROUP BY APPT_TYP_CD_LL,EMPL_TYPE")
> stratum_cp$stratp<-round(stratum_cp$HC/nrow(CURRPOP)*100)
>
>> stratum_cp
>   APPT_TYP_CD_LL EMPL_TYPE   HC stratp
> 1              FA         S    1      0
> 2              FC         S    5      0
> 3              FP         S  173      3
> 4              FR         H  170      3
> 5              FX         H   49      1
> 6              FX         S   57      1
> 7              IN         H 1589     25
> 8              IN         S 3987     63
> 9              IP         H    7      0
> 10             IP         S   53      1
> 11             SA         H    8      0
> 12             SE         S   43      1
> 13             SF         H   14      0
> 14             SF         S    1      0
> 15             SG         S   10      0
> 16             ST         H  107      2
> 17             ST         S    6      0
>
> #THEN I attempted to use sampling::strata using the instructions in 
> that package and got an error
>
>
> #I use stratum_cp$stratp for my sizes.
>
>
>
>> s<-strata(CURRPOP,c("APPT_TYP_CD_LL","EMPL_TYPE"),size=stratum_cp$stratp,method="srswor")
>
> Error in data.frame(..., check.names = FALSE) :
>
>  arguments imply differing number of rows: 0, 1
>
>> traceback()
>
> 5: stop("arguments imply differing number of rows: ", 
> paste(unique(nrows),
>
>       collapse = ", "))
>
> 4: data.frame(..., check.names = FALSE)
>
> 3: cbind(deparse.level, ...)
>
> 2: cbind(r, i)
>
> 1: strata(CURRPOP, c("APPT_TYP_CD_LL", "EMPL_TYPE"), size = 
> stratum_cp$stratp,
>
>       method = "srswor")
>
>
>
> # In lieu of a reproducible sample, here is some info regarding most of
> # my data
> dim(CURRPOP)
> [1] 6280   11
> #Cols w/ personal info have been removed in this output
>
>> str(CURRPOP[,c(1:3,7:11)])
>
> 'data.frame':  6280 obs. of  8 variables:
>
> $ APPT_TYP_CD_LL: Factor w/ 12 levels "FA","FC","FP",..: 1 2 2 2 2 2 3 3 3 3 ...
>
> $ EMPL_TYPE     : Factor w/ 2 levels "H","S": 2 2 2 2 2 2 2 2 2 2 ...
>
> $ ROWID         : int  1 2 3 4 5 6 7 8 9 10 ...
>
> $ DEPTID        : int  9825 9613 9613 9852 9772 9852 9853 9853 9853 9854 ...
>
> $ JOBCODE       : Factor w/ 325 levels "055.2","055.3",..: 311 112 112 112 112 112 298 299 299 300 ...
>
> $ JOBTITLE      : Factor w/ 325 levels "Accounting Assistant",..: 227 192 192 192 192 192 190 191 191 153 ...
>
> $ SAL_ADMIN_PLAN: Factor w/ 40 levels "ADE","AME","ASE",..: 36 38 38 38 38 38 31 31 31 31 ...
>
> $ RET_TYP_CD_LL : Factor w/ 2 levels "TCP1","TCP2": 2 2 2 2 2 2 2 2 2 2 ...
>
> Daniel Lopez
> Workforce Analyst
> HRIM - Workforce Analytics & Metrics
> Strategic Human Resources Management
> wf-analytics-metrics at lists.llnl.gov
> (925) 422-0814
>
>
> 	[[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                       Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k


