[R] Random sample from a data frame where ID column values don't match the values in an ID column in a second data frame
David Winsemius
dwinsemius at comcast.net
Fri Mar 30 15:59:34 CEST 2012
On Mar 30, 2012, at 8:17 AM, inkhorn wrote:
> Okay, here's some sample code:
>
> ID = c(1,2,3,"A1",5,6,"A2",8,9,"A3")
> fakedata = rnorm(10, 5, .5)
> main.df = data.frame(ID,fakedata)
>
> results for my data frame:
>> main.df
> ID fakedata
> 1 1 5.024332
> 2 2 4.752943
> 3 3 5.408618
> 4 A1 5.362838
> 5 5 5.158660
> 6 6 4.658235
> 7 A2 5.389601
> 8 8 4.998249
> 9 9 5.248517
> 10 A3 4.159490
>
> sample1.df = main.df[sample(nrow(main.df), 4), ]
>> sample1.df
> ID fakedata
> 5 5 5.158660
> 9 9 5.248517
> 4 A1 5.362838
> 8 8 4.998249
>
> Here's what happens when I put a comma before the variable ID:
>
>> sample2.df = main.df[sample(nrow(main.df[! main.df[,"ID"] %in%
>> sample1.df[,"ID"]]), 5),]
> Error in `[.data.frame`(main.df, !main.df[, "ID"] %in% sample1.df[,
> "ID"]) :
> undefined columns selected
That was not the code I offered, which had no error:
> sample2.df <- main.df[ ! main.df[, "ID"] %in% sample1.df[, "ID"] , ]
> sample2.df
ID fakedata
2 2 5.225752
4 A1 4.788752
5 5 3.973376
6 6 5.565669
8 8 5.369974
9 9 5.954552
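
A likely reason for the error quoted above: inside nrow(), the inner subset has no trailing comma, so "[.data.frame" treats the 10-element logical vector as a column selector, and the data frame has only 2 columns. The following is a minimal sketch of the difference using made-up data of the same shape. The names demo.df, first.df, keep, no.comma and complement are illustrative and do not come from the thread.

```r
set.seed(42)  # arbitrary seed, only so the sketch is repeatable
demo.df  <- data.frame(ID = c(1, 2, 3, "A1", 5, 6, "A2", 8, 9, "A3"),
                       fakedata = rnorm(10, 5, 0.5))
first.df <- demo.df[sample(nrow(demo.df), 4), ]

# TRUE for rows whose ID is not in the first sample
keep <- !demo.df[, "ID"] %in% first.df[, "ID"]

# No comma: 'keep' is used to select columns, which fails here;
# tryCatch() stores the error message instead of stopping
no.comma <- tryCatch(demo.df[keep], error = function(e) conditionMessage(e))

# With the comma: 'keep' selects rows, giving the complement
complement <- demo.df[keep, ]
```

With the comma in place, the complement holds exactly the rows whose IDs were not drawn the first time, and nrow() of it can be passed to sample().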
If you want to sub-sample further from the complement I offered
(which _was_ a random sample from the main dataset, albeit not the
particular sample you wanted), you can sample from it directly:
> sample2.df[ sample(nrow(sample2.df), 3), ]
ID fakedata
2 2 5.225752
8 8 5.369974
6 6 5.565669
>
> Here's what happens when I exclude the comma:
>
> sample2.df = main.df[sample(nrow(main.df[! main.df["ID"] %in%
> sample1.df["ID"]]), 5),]
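
A probable explanation of why this version runs without excluding anything: main.df["ID"] is a one-column data frame, that is, a list of length one, so %in% appears to compare the whole column as a single element and return a single FALSE. Negated, that becomes TRUE, which selects every column, so nrow() still reports all 10 rows. A sketch with made-up data (demo.df, first.df, flag and per.row are illustrative names):

```r
set.seed(1)  # arbitrary seed, only so the sketch is repeatable
demo.df  <- data.frame(ID = c(1, 2, 3, "A1", 5, 6, "A2", 8, 9, "A3"),
                       fakedata = rnorm(10, 5, 0.5))
first.df <- demo.df[sample(nrow(demo.df), 4), ]

# List-style comparison: one result for the whole column, not one per row
flag <- demo.df["ID"] %in% first.df["ID"]
length(flag)

# !flag is recycled across the columns, so no rows are dropped
nrow(demo.df[!flag])

# Vector-style comparison: one result per row
per.row <- demo.df[["ID"]] %in% first.df[["ID"]]
length(per.row)
```

Extracting the column as a vector, with [["ID"]], $ID, or [, "ID"], is what makes the comparison row-wise.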
You cannot do both steps in one line using that exact strategy, but
you can "chain" uses of "[". You could, for instance, have constructed
indexes ("indices" seems to be disappearing from the English language):
idx <- sample(nrow(main.df), 4)
subset1 <- main.df[ idx, ]
subset2 <- main.df[-idx, ][sample(nrow(main.df)-nrow(subset1), 3), ]
> subset2
ID fakedata
6 6 5.565669
5 5 3.973376
2 2 5.225752
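
If a single expression is still wanted, one alternative is to sample from the row numbers that which() reports as unused. This is a different strategy from the one attempted in the question and is offered only as a sketch. The names demo.df, first.df and second.df are illustrative:

```r
set.seed(123)  # arbitrary seed, only so the sketch is repeatable
demo.df  <- data.frame(ID = c(1, 2, 3, "A1", 5, 6, "A2", 8, 9, "A3"),
                       fakedata = rnorm(10, 5, 0.5))
first.df <- demo.df[sample(nrow(demo.df), 4), ]

# which() gives the row numbers whose ID is absent from the first sample;
# sample() then draws 3 of those row numbers without replacement
second.df <- demo.df[sample(which(!demo.df$ID %in% first.df$ID), 3), ]

# The two samples share no IDs, so this returns an empty vector
intersect(first.df$ID, second.df$ID)
```

One caveat: sample() treats a single number n as 1:n, so this shortcut needs care if only one row could remain after the first draw.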
--
David.
>> sample2.df
> ID fakedata
> 8 8 4.998249
> 1 1 5.024332
> 3 3 5.408618
> 5 5 5.158660
> 10 A3 4.159490
>
> As you can see, one way I get nothing other than an error, the other way I
> get a sample that doesn't exclude rows that were already included in the 1st
> sample.
>
> Thanks,
> Matt Dubins
David Winsemius, MD
West Hartford, CT