[R] splitting dataset based on variable and re-combining

Tue Dec 11 01:02:52 CET 2012

The plyr package is designed for this sort of thing, but the base functions
split() and unsplit() work as well. This example uses a simple lm() model
in place of the SVMs:

> data(iris)
> iris <- iris[(iris$Species=="setosa" | iris$Species=="versicolor"),]
> set.seed(42)
> irisindex <- sample(1:nrow(iris), nrow(iris))
> iris <- iris[irisindex,]
> iris$Species <- factor(iris$Species) # Eliminate empty level virginica
> iris2 <- split(iris, iris$Species)   # List with two data.frames
> results <- lapply(iris2, function(x) lm(Sepal.Length ~ Sepal.Width + 
+     Petal.Length + Petal.Width, x))
> fit <- lapply(results, predict)
> iris3 <- lapply(names(iris2), function(x) data.frame(iris2[[x]],
+     fitted=fit[[x]]))
> iris4 <- unsplit(iris3, iris$Species)
> head(iris4)
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species   fitted
92          6.1         3.0          4.6         1.4 versicolor 6.283549
93          5.8         2.6          4.0         1.2 versicolor 5.719649
29          5.2         3.4          1.4         0.2     setosa 4.961338
81          5.5         2.4          3.8         1.1 versicolor 5.528532
62          5.9         3.0          4.2         1.5 versicolor 5.852292
50          5.0         3.3          1.4         0.2     setosa 4.895855
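
If all you need is a plain vector of predictions in the original row order
(rather than a data frame with an added column), the same idea can be
sketched as below. This is only an illustration under some assumptions:
lm() again stands in for the SVM models, and the names dat, groups, models,
preds, pred1 and pred2 are made up for the example. Option 1 relies on
unsplit() to put the pieces back in order; Option 2 is the "use an index"
approach, filling a preallocated vector by logical subscript.

```r
# Sketch: recombine per-group predictions into one vector in original row order.
data(iris)
dat <- droplevels(iris[iris$Species %in% c("setosa", "versicolor"), ])
set.seed(42)
dat <- dat[sample(nrow(dat)), ]          # shuffle the rows

# One model per Species level (lm() is a stand-in for any model with predict())
groups <- split(dat, dat$Species)
models <- lapply(groups, function(d)
    lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = d))

# Option 1: predict within each group, then let unsplit() restore the order
preds <- Map(function(m, d) unname(predict(m, newdata = d)), models, groups)
pred1 <- unsplit(preds, dat$Species)

# Option 2: preallocate the result and fill it one factor level at a time
pred2 <- numeric(nrow(dat))
for (lev in levels(dat$Species)) {
    rows <- dat$Species == lev           # logical index into the original rows
    pred2[rows] <- predict(models[[lev]], newdata = dat[rows, ])
}

all.equal(pred1, pred2)                  # compare the two approaches
```

Either vector lines up row for row with dat, so it can be attached with
something like dat$pred <- pred1. Option 2 may be the easier one to adapt
when the models are fitted elsewhere and only predict() is called here.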

----------------------------------------------
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77843-4352

> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of Brian Feeny
> Sent: Monday, December 10, 2012 4:41 PM
> To: r-help at r-project.org
> Subject: [R] splitting dataset based on variable and re-combining
> 
> 
> I have a dataset and I wish to use two different models to predict.
> Both models are SVM.  The reason for two different models is based
> on the sex of the observation.  I wish to be able to make predictions
> and have the results be in the same order as my original dataset.  To
> illustrate I will use iris:
> 
> # Take iris and create a data frame of just two Species, setosa and
> # versicolor, then shuffle the rows
> data(iris)
> iris <- iris[(iris$Species=="setosa" | iris$Species=="versicolor"),]
> irisindex <- sample(1:nrow(iris), nrow(iris))
> iris <- iris[irisindex,]
> 
> # Make predictions on setosa using the mySetosaModel model, and on
> # versicolor using the myVersicolorModel model:
> 
> predict(mySetosaModel, iris[iris$Species=="setosa",])
> predict(myVersicolorModel, iris[iris$Species=="versicolor",])
> 
> The problem is this will give me a vector of just the setosa results,
> and then one of just the versicolor results.
> 
> I wish to take the results and have them be in the same order as the
> original dataset.  So if the original dataset had:
> 
> 
> Species
> setosa
> setosa
> versicolor
> setosa
> versicolor
> setosa
> 
> I wish for my results to have:
> <prediction for setosa>
> <prediction for setosa>
> <prediction for versicolor>
> <prediction for setosa>
> <prediction for versicolor>
> <prediction for setosa>
> 
> But instead, what I am ending up with is two result sets, and no way I
> can think of to combine them.  I am sure this comes up a lot where you
> have a factor you wish to split your models on, say sex (male vs.
> female), and you need to present the results back so they match the
> order of the original dataset.
> 
> I have tried to think of ways to use an index, to try to keep things in
> order, but I can't figure it out.
> 
> Any help is greatly appreciated.
> 
> Brian
> 