[R] r-data partitioning considering two variables (character and numeric)
Ahmed Attia
@hmed@t|@80 @end|ng |rom gm@||@com
Tue Aug 28 02:46:33 CEST 2018
Thanks Bert, that worked nicely. Yes, genotypes with only one stand_ID will be
eliminated before partitioning the data.
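
A minimal sketch of how those two steps could fit together, in case it helps someone reading the archive. It assumes stand_ID values are unique across genotypes (as in the data quoted below), uses only a few of the quoted rows, and adds one clearly hypothetical genotype "Z01" with a single stand so the filtering step has something to drop. The names n_ids, keep, dat and ids_by_geno are my own, and sampling within each genotype is a variation on Bert's suggestion, not something he posted.

```r
set.seed(1)  # only so the sketch is reproducible

# A few rows from the quoted data, plus one HYPOTHETICAL genotype (Z01)
# that has a single stand_ID; its values are made up for illustration.
dataGenotype <- data.frame(
  Genotype = c("H13", "H13", "H13", "H13", "H13",
               "H15", "H15", "H15", "U21", "U21", "Z01"),
  stand_ID = c(7, 7, 18, 18, 21, 9, 20, 32, 3, 67, 99),
  stemC    = c(1940.1075, 10898.9597, 2726.42, 12226.1554, 4981.7158,
               3570.06, 3016.881, 3426.2525, 3660.416, 2812.8425, 1000),
  stringsAsFactors = FALSE
)

# Count distinct stand_IDs per genotype and keep genotypes with at least two
n_ids <- tapply(dataGenotype$stand_ID, dataGenotype$Genotype,
                function(x) length(unique(x)))
keep  <- names(n_ids)[n_ids > 1]
dat   <- dataGenotype[dataGenotype$Genotype %in% keep, ]

# Within each genotype, draw about half of its stand_IDs for testing.
# Indexing via sample.int() avoids sample()'s special case for a
# single number.
ids_by_geno <- split(dat$stand_ID, dat$Genotype)
tst <- unlist(lapply(ids_by_geno, function(x) {
  id <- unique(x)
  id[sample.int(length(id), floor(length(id) / 2))]
}))

wh    <- dat$stand_ID %in% tst   # logical vector, as in Bert's code
test  <- dat[wh, ]
train <- dat[!wh, ]
```

Because the sampling is done per genotype after the single-stand genotypes are removed, every remaining genotype should end up with at least one stand in each of train and test, and all rows of a given stand_ID stay together.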
Best regards
Ahmed Attia
On Mon, Aug 27, 2018 at 8:09 PM, Bert Gunter <bgunter.4567 using gmail.com> wrote:
> Just partition the unique stand_IDs and select on them using %in%, say:
>
> id <- unique(dataGenotype$stand_ID)
> tst <- sample(id, floor(length(id)/2))
> wh <- dataGenotype$stand_ID %in% tst ## logical vector
> test <- dataGenotype[wh,]
> train <- dataGenotype[!wh,]
>
> There are a million variations on this theme I'm sure.
>
> -- Bert
>
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along and
> sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
> On Mon, Aug 27, 2018 at 3:54 PM Ahmed Attia <ahmedatia80 using gmail.com> wrote:
>>
>> I would like to partition the following dataset (dataGenotype) based
>> on two variables, Genotype and stand_ID. For example, for Genotype
>> H13, stand_ID 7 may go to training and stand_IDs 18 and
>> 21 may go to testing.
>>
>> Genotype stand_ID Inventory_date stemC mheight
>> H13 7 5/18/2006 1940.1075 11.33995
>> H13 7 11/1/2008 10898.9597 23.20395
>> H13 7 4/14/2009 12830.1284 23.77395
>> H13 18 11/3/2005 2726.42 13.4432
>> H13 18 6/30/2008 12226.1554 24.091967
>> H13 18 4/14/2009 14141.68 25.0922
>> H13 21 5/18/2006 4981.7158 15.7173
>> H13 21 4/14/2009 20327.0667 27.9155
>> H15 9 3/31/2006 3570.06 14.7898
>> H15 9 11/1/2008 15138.8383 26.2088
>> H15 9 4/14/2009 17035.4688 26.8778
>> H15 20 1/18/2005 3016.881 14.1886
>> H15 20 10/4/2006 8330.4688 20.19425
>> H15 20 6/30/2008 13576.5 25.4774
>> H15 32 2/1/2006 3426.2525 14.31815
>> U21 3 1/9/2006 3660.416 15.09925
>> U21 3 6/30/2008 13236.29 24.27634
>> U21 3 4/14/2009 16124.192 25.79562
>> U21 67 11/4/2005 2812.8425 13.60485
>> U21 67 4/14/2009 13468.455 24.6203
>>
>> The desired output is the following:
>>
>> A-training
>>
>> Genotype stand_ID Inventory_date stemC mheight
>> H13 7 5/18/2006 1940.1075 11.33995
>> H13 7 11/1/2008 10898.9597 23.20395
>> H13 7 4/14/2009 12830.1284 23.77395
>> H15 9 3/31/2006 3570.06 14.7898
>> H15 9 11/1/2008 15138.8383 26.2088
>> H15 9 4/14/2009 17035.4688 26.8778
>> U21 67 11/4/2005 2812.8425 13.60485
>> U21 67 4/14/2009 13468.455 24.6203
>>
>> B-testing
>>
>> Genotype stand_ID Inventory_date stemC mheight
>> H13 18 11/3/2005 2726.42 13.4432
>> H13 18 6/30/2008 12226.1554 24.091967
>> H13 18 4/14/2009 14141.68 25.0922
>> H13 21 5/18/2006 4981.7158 15.7173
>> H13 21 4/14/2009 20327.0667 27.9155
>> H15 20 1/18/2005 3016.881 14.1886
>> H15 20 10/4/2006 8330.4688 20.19425
>> H15 20 6/30/2008 13576.5 25.4774
>> H15 32 2/1/2006 3426.2525 14.31815
>> U21 3 1/9/2006 3660.416 15.09925
>> U21 3 6/30/2008 13236.29 24.27634
>> U21 3 4/14/2009 16124.192 25.79562
>>
>> I tried the following code:
>>
>> library(caret)
>> dataPartitioning <-
>> createDataPartition(dataGenotype$stand_ID,1,list=F,p=0.2)
>> train = dataGenotype[dataPartitioning,]
>> test = dataGenotype[-dataPartitioning,]
>>
>> Also tried
>>
>> createDataPartition(unique(dataGenotype$stand_ID),1,list=F,p=0.2)
>>
>> Neither produced the desired output: the data are partitioned within
>> each stand_ID. For example, one row of stand_ID 7 goes to training and
>> two rows of stand_ID 7 go to testing. How can I partition the data by
>> Genotype and stand_ID together?
>>
>>
>>
>> Ahmed Attia
>>
>> ______________________________________________
>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.