[R] r-data partitioning considering two variables (character and numeric)
Bert Gunter
bgunter@4567 @end|ng |rom gm@||@com
Tue Aug 28 01:09:03 CEST 2018
Just partition the unique stand_ID values and select on them using %in%, say:
id <- unique(dataGenotype$stand_ID)      ## distinct stand IDs
tst <- sample(id, floor(length(id)/2))   ## randomly pick half of them for testing
wh <- dataGenotype$stand_ID %in% tst     ## logical vector: TRUE for rows in a test stand
test <- dataGenotype[wh, ]
train <- dataGenotype[!wh, ]
There are a million variations on this theme I'm sure.
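One such variation, sketched here under stated assumptions, draws the test stands separately within each Genotype, so that every genotype contributes stands to the split while all rows of a given stand stay together. The toy data frame reproduces only the Genotype and stand_ID columns from the question; the seed, the one-half test fraction, and the helper names (key, groups, test_keys) are illustrative choices, not part of the original exchange.

```r
# Genotype / stand_ID columns copied from the data posted in the question
dataGenotype <- data.frame(
  Genotype = rep(c("H13", "H15", "U21"), times = c(8, 7, 5)),
  stand_ID = c(7, 7, 7, 18, 18, 18, 21, 21,
               9, 9, 9, 20, 20, 20, 32,
               3, 3, 3, 67, 67)
)

set.seed(1)  # assumed seed, only so the split is reproducible

# Build one key per row identifying its (Genotype, stand_ID) pair; this
# keeps the split correct even if a stand_ID recurs across genotypes.
key <- paste(dataGenotype$Genotype, dataGenotype$stand_ID, sep = "_")

# Group the keys by genotype, then sample about half of the unique
# stands in each group for the test set.
groups <- split(key, dataGenotype$Genotype)
test_keys <- unlist(lapply(groups, function(k) {
  u <- unique(k)
  # Index with sample.int() rather than sample(u, ...) to avoid the
  # surprise behaviour of sample() on length-one vectors.
  u[sample.int(length(u), floor(length(u) / 2))]
}), use.names = FALSE)

wh <- key %in% test_keys      # TRUE for rows belonging to a test stand
test <- dataGenotype[wh, ]
train <- dataGenotype[!wh, ]
```

With floor(n/2), a genotype that has only one stand sends it to training; change the fraction if a different balance is wanted.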
-- Bert
Bert Gunter
"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
On Mon, Aug 27, 2018 at 3:54 PM Ahmed Attia <ahmedatia80 using gmail.com> wrote:
> I would like to partition the following dataset (dataGenotype) based
> on two variables, Genotype and stand_ID. For example, for Genotype
> H13, stand_ID number 7 may go to training and stand_ID numbers 18 and
> 21 may go to testing.
>
> Genotype stand_ID Inventory_date stemC mheight
> H13 7 5/18/2006 1940.1075 11.33995
> H13 7 11/1/2008 10898.9597 23.20395
> H13 7 4/14/2009 12830.1284 23.77395
> H13 18 11/3/2005 2726.42 13.4432
> H13 18 6/30/2008 12226.1554 24.091967
> H13 18 4/14/2009 14141.68 25.0922
> H13 21 5/18/2006 4981.7158 15.7173
> H13 21 4/14/2009 20327.0667 27.9155
> H15 9 3/31/2006 3570.06 14.7898
> H15 9 11/1/2008 15138.8383 26.2088
> H15 9 4/14/2009 17035.4688 26.8778
> H15 20 1/18/2005 3016.881 14.1886
> H15 20 10/4/2006 8330.4688 20.19425
> H15 20 6/30/2008 13576.5 25.4774
> H15 32 2/1/2006 3426.2525 14.31815
> U21 3 1/9/2006 3660.416 15.09925
> U21 3 6/30/2008 13236.29 24.27634
> U21 3 4/14/2009 16124.192 25.79562
> U21 67 11/4/2005 2812.8425 13.60485
> U21 67 4/14/2009 13468.455 24.6203
>
> The desired output is the following:
>
> A-training
>
> Genotype stand_ID Inventory_date stemC mheight
> H13 7 5/18/2006 1940.1075 11.33995
> H13 7 11/1/2008 10898.9597 23.20395
> H13 7 4/14/2009 12830.1284 23.77395
> H15 9 3/31/2006 3570.06 14.7898
> H15 9 11/1/2008 15138.8383 26.2088
> H15 9 4/14/2009 17035.4688 26.8778
> U21 67 11/4/2005 2812.8425 13.60485
> U21 67 4/14/2009 13468.455 24.6203
>
> B-testing
>
> Genotype stand_ID Inventory_date stemC mheight
> H13 18 11/3/2005 2726.42 13.4432
> H13 18 6/30/2008 12226.1554 24.091967
> H13 18 4/14/2009 14141.68 25.0922
> H13 21 5/18/2006 4981.7158 15.7173
> H13 21 4/14/2009 20327.0667 27.9155
> H15 20 1/18/2005 3016.881 14.1886
> H15 20 10/4/2006 8330.4688 20.19425
> H15 20 6/30/2008 13576.5 25.4774
> H15 32 2/1/2006 3426.2525 14.31815
> U21 3 1/9/2006 3660.416 15.09925
> U21 3 6/30/2008 13236.29 24.27634
> U21 3 4/14/2009 16124.192 25.79562
>
> I tried the following code:
>
> library(caret)
> dataPartitioning <-
> createDataPartition(dataGenotype$stand_ID,1,list=F,p=0.2)
> train = dataGenotype[dataPartitioning,]
> test = dataGenotype[-dataPartitioning,]
>
> I also tried:
>
> createDataPartition(unique(dataGenotype$stand_ID),1,list=F,p=0.2)
>
> Neither produced the desired output; the data are partitioned within
> each stand_ID. For example, one row of stand_ID 7 goes to training and
> two rows of stand_ID 7 go to testing. How can I partition the data by
> Genotype and stand_ID together?
>
>
>
> Ahmed Attia
>