[R] r-data partitioning considering two variables (character and numeric)

Tue Aug 28 01:09:03 CEST 2018

Just partition the unique stand_ID values and select rows on them using %in%, say:

id <- unique(dataGenotype$stand_ID)        ## one entry per stand
tst <- sample(id, floor(length(id)/2))     ## draw half of the stands for testing
wh <- dataGenotype$stand_ID %in% tst       ## logical vector: TRUE for rows of test stands
test <- dataGenotype[wh, ]
train <- dataGenotype[!wh, ]

There are a million variations on this theme, I'm sure.
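
One such variation, as a rough sketch only: if every Genotype should show up in both sets, the same idea can be applied within each genotype. The small data frame below is a stand-in built from the Genotype and stand_ID columns in the post, the seed is arbitrary, and pick() is a helper name made up here. It assumes a stand_ID never occurs under two genotypes, which holds for the posted data.

```r
## Toy stand-in for dataGenotype (Genotype and stand_ID columns from the post)
dataGenotype <- data.frame(
  Genotype = rep(c("H13", "H15", "U21"), times = c(8, 7, 5)),
  stand_ID = c(7, 7, 7, 18, 18, 18, 21, 21,
               9, 9, 9, 20, 20, 20, 32,
               3, 3, 3, 67, 67)
)

set.seed(1)  ## arbitrary seed, only so the split is repeatable

## One row per stand, so each stand is drawn exactly once
stands <- unique(dataGenotype[c("Genotype", "stand_ID")])

## Hypothetical helper: take about half of the IDs (at least one).
## Indexing via sample.int() avoids the sample(x) surprise when x has length 1.
pick <- function(ids) ids[sample.int(length(ids), max(1, floor(length(ids) / 2)))]

## Draw the test stands separately within each genotype
tst <- unlist(lapply(split(stands$stand_ID, stands$Genotype), pick))

wh <- dataGenotype$stand_ID %in% tst   ## TRUE for rows of test stands
test <- dataGenotype[wh, ]
train <- dataGenotype[!wh, ]
```

All rows of a stand stay together, and each genotype contributes stands to both sets as long as it has at least two stands; a genotype with a single stand would land entirely in test. Change the fraction inside pick() to taste.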

-- Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Mon, Aug 27, 2018 at 3:54 PM Ahmed Attia <ahmedatia80 using gmail.com> wrote:

> I would like to partition the following dataset (dataGenotype) based
> on two variables, Genotype and stand_ID. For example, for Genotype
> H13, stand_ID 7 may go to training and stand_IDs 18 and
> 21 may go to testing.
>
> Genotype  stand_ID  Inventory_date  stemC       mheight
> H13       7         5/18/2006       1940.1075   11.33995
> H13       7         11/1/2008       10898.9597  23.20395
> H13       7         4/14/2009       12830.1284  23.77395
> H13       18        11/3/2005       2726.42     13.4432
> H13       18        6/30/2008       12226.1554  24.091967
> H13       18        4/14/2009       14141.68    25.0922
> H13       21        5/18/2006       4981.7158   15.7173
> H13       21        4/14/2009       20327.0667  27.9155
> H15       9         3/31/2006       3570.06     14.7898
> H15       9         11/1/2008       15138.8383  26.2088
> H15       9         4/14/2009       17035.4688  26.8778
> H15       20        1/18/2005       3016.881    14.1886
> H15       20        10/4/2006       8330.4688   20.19425
> H15       20        6/30/2008       13576.5     25.4774
> H15       32        2/1/2006        3426.2525   14.31815
> U21       3         1/9/2006        3660.416    15.09925
> U21       3         6/30/2008       13236.29    24.27634
> U21       3         4/14/2009       16124.192   25.79562
> U21       67        11/4/2005       2812.8425   13.60485
> U21       67        4/14/2009       13468.455   24.6203
>
> The desired output is the following:
>
> A-training
>
> Genotype  stand_ID  Inventory_date  stemC       mheight
> H13       7         5/18/2006       1940.1075   11.33995
> H13       7         11/1/2008       10898.9597  23.20395
> H13       7         4/14/2009       12830.1284  23.77395
> H15       9         3/31/2006       3570.06     14.7898
> H15       9         11/1/2008       15138.8383  26.2088
> H15       9         4/14/2009       17035.4688  26.8778
> U21       67        11/4/2005       2812.8425   13.60485
> U21       67        4/14/2009       13468.455   24.6203
>
> B-testing
>
> Genotype  stand_ID  Inventory_date  stemC       mheight
> H13       18        11/3/2005       2726.42     13.4432
> H13       18        6/30/2008       12226.1554  24.091967
> H13       18        4/14/2009       14141.68    25.0922
> H13       21        5/18/2006       4981.7158   15.7173
> H13       21        4/14/2009       20327.0667  27.9155
> H15       20        1/18/2005       3016.881    14.1886
> H15       20        10/4/2006       8330.4688   20.19425
> H15       20        6/30/2008       13576.5     25.4774
> H15       32        2/1/2006        3426.2525   14.31815
> U21       3         1/9/2006        3660.416    15.09925
> U21       3         6/30/2008       13236.29    24.27634
> U21       3         4/14/2009       16124.192   25.79562
>
> I tried the following code:
>
> library(caret)
> dataPartitioning <-
> createDataPartition(dataGenotype$stand_ID,1,list=F,p=0.2)
> train = dataGenotype[dataPartitioning,]
> test = dataGenotype[-dataPartitioning,]
>
> I also tried:
>
> createDataPartition(unique(dataGenotype$stand_ID),1,list=F,p=0.2)
>
> Neither produced the desired output: the data are partitioned within
> each stand_ID. For example, one row of stand_ID 7 goes to training and
> two rows of stand_ID 7 go to testing. How can I partition the data by
> Genotype and stand_ID together?
>
>
>
> Ahmed Attia
>
