[R] simple loop(?) analysing subsets

Joshua Wiley jwiley.psych at gmail.com
Mon Jul 19 07:23:01 CEST 2010


Not to hijack the thread, but for my edification, what are the
advantages/disadvantages of split() + lapply() compared to by()?
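
For concreteness, here is a minimal sketch of the comparison I have in mind. It assumes the data frame d from Dennis's dput() output below, rebuilt here with data.frame() so the snippet is self-contained; the names fit, v_split, and v_by are just mine for illustration.

```r
# Rebuild the example data from the dput() output quoted below
d <- data.frame(
  species = rep(c(1L, 2L, 3L, 5L), times = c(5, 5, 4, 4)),
  o2con = c(0.5, 0.6, 0.4, 0.4, 0.5, 0.3, 0.4, 0.5, 0.7, 0.9,
            0.3, 0.7, 0.4, 0.3, 0.3, 0.6, 0.9, 0.2),
  bm = c(5L, 2L, 4L, 2L, 3L, 7L, 8L, 3L, 4L, 2L,
         6L, 2L, 1L, 7L, 2L, 1L, 7L, 5L)
)

# Same model for every group
fit <- function(df) lm(o2con ~ bm, data = df)

# Approach 1: split the data frame into a list, then lapply() over it
v_split <- lapply(split(d, d$species), fit)

# Approach 2: let by() do the splitting and the applying in one call
v_by <- by(d, d$species, fit)

# Collect the coefficients from each approach into a matrix
coefs_split <- do.call(rbind, lapply(v_split, coef))
coefs_by <- do.call(rbind, lapply(v_by, coef))
```

As far as I can tell both give the same fits. by() returns an object of class "by" that prints with group headers, while split() + lapply() returns a plain named list.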

Josh

On Sun, Jul 18, 2010 at 9:50 PM, Dennis Murphy <djmuser at gmail.com> wrote:
> Hi:
>
> Time to jack up your level of R knowledge, courtesy of the apply family.
>
> The 'R way' to do what you want is to split the data by species into list
> components, run lm() on each component and save the resulting lm objects in
> a list. The next trick is to figure out how to extract what you want, which
> may require a bit more ingenuity in delving into aRcana :)
>
> -----
> Aside:
> To reinforce Joshua's point, leaving variable names that contain spaces
> unquoted is bad practice, especially when someone who wants to help tries to
> copy and paste your data into his/her R session:
>
> Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
>   line 1 did not have 4 elements
>
> The header "species id" is read as two names, so R expected four columns of
> data, but each row supplied only three. In the future, it's a good idea to
> include your data example with dput(), which outputs
>
> dput(d)
> structure(list(species = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
> 2L, 3L, 3L, 3L, 3L, 5L, 5L, 5L, 5L), o2con = c(0.5, 0.6, 0.4,
> 0.4, 0.5, 0.3, 0.4, 0.5, 0.7, 0.9, 0.3, 0.7, 0.4, 0.3, 0.3, 0.6,
> 0.9, 0.2), bm = c(5L, 2L, 4L, 2L, 3L, 7L, 8L, 3L, 4L, 2L, 6L,
> 2L, 1L, 7L, 2L, 1L, 7L, 5L)), .Names = c("species", "o2con",
> "bm"), class = "data.frame", row.names = c(NA, -18L))
>
> This is easily copied and pasted into anyone's R session....but I digress.
> ------
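
A small sketch of why dput() output is convenient to share: the printed text can be evaluated to rebuild the object. The three rows are the first three from the sample data; the names d_small, txt, and d_back are only illustrative.

```r
# A few rows of the example data
d_small <- data.frame(
  species = c(1L, 1L, 1L),
  o2con = c(0.5, 0.6, 0.4),
  bm = c(5L, 2L, 4L)
)

# Capture what dput() would print, as a character vector of code lines
txt <- capture.output(dput(d_small))

# Evaluating that text (as a reader would by pasting it) rebuilds the object
d_back <- eval(parse(text = txt))

# Compare the rebuilt data frame with the original
same <- all.equal(d_small, d_back)
```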
>
> Calling your data frame d, here's how to run the same regression model on
> all species:
>
> # Create a function to perform the modeling, taking a data frame df as input
> f <- function(df) lm(o2con ~ bm, data = df)
>
> # Use lapply() to apply the function to each 'split' of the data,
> # by species:
> v <- lapply(split(d, d$species), f)
>
> # v is a list object, where each component of the list is an lm object,
> # which itself is a list. In other words, it's a list of lists. do.call()
> # is a very useful function that calls a function with the components of
> # a list as its arguments. rbind and cbind are commonly used with it to
> # slurp together common elements from each component of a list.
>
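
To make the do.call() step concrete, here is a toy sketch (the list parts and its contents are made up purely for illustration). Calling do.call(rbind, parts) should behave like writing out rbind() with each list component as a named argument.

```r
# A toy list of named numeric vectors (hypothetical values)
parts <- list(a = c(x = 1, y = 2), b = c(x = 3, y = 4))

# do.call() passes the list components to rbind() as its arguments
m1 <- do.call(rbind, parts)

# The same call written out by hand
m2 <- rbind(a = parts$a, b = parts$b)

# Compare the two matrices
same <- all.equal(m1, m2)
```

The advantage of the do.call() form is that it works without knowing in advance how many components the list has.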
> # Pulling out the coefficients from each model:
>> do.call(rbind, lapply(v, coef))
>  (Intercept)          bm
> 1   0.5176471 -0.01176471
> 2   0.9253731 -0.07611940
> 3   0.5942308 -0.04230769
> 5   0.3351648  0.04395604
>
> # Extract the r-squared values from each model:
> g <- function(m) summary(m)$r.squared
>> do.call(rbind, lapply(v, g))
>        [,1]
> 1 0.03361345
> 2 0.66932578
> 3 0.43291592
> 5 0.14652015
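
A slightly shorter variant, offered as a sketch rather than a correction: since each model yields a single number, sapply() should return a named vector directly instead of a one-column matrix. The data are rebuilt from the dput() output above so the snippet runs on its own; r2 is just an illustrative name.

```r
# Rebuild the example data and the per-species models
d <- data.frame(
  species = rep(c(1L, 2L, 3L, 5L), times = c(5, 5, 4, 4)),
  o2con = c(0.5, 0.6, 0.4, 0.4, 0.5, 0.3, 0.4, 0.5, 0.7, 0.9,
            0.3, 0.7, 0.4, 0.3, 0.3, 0.6, 0.9, 0.2),
  bm = c(5L, 2L, 4L, 2L, 3L, 7L, 8L, 3L, 4L, 2L,
         6L, 2L, 1L, 7L, 2L, 1L, 7L, 5L)
)
v <- lapply(split(d, d$species), function(df) lm(o2con ~ bm, data = df))

# One r-squared per model, simplified to a vector named by species
r2 <- sapply(v, function(m) summary(m)$r.squared)
```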
>
> # But you have to be careful, e.g., since you have unequal sample sizes
> # per species, cbind() recycles the shorter residual vectors:
>> do.call(cbind, lapply(v, resid))
>            1           2            3          5
> 1  0.04117647 -0.09253731 -0.040384615 -0.1230769
> 2  0.10588235  0.08358209  0.190384615  0.2208791
> 3 -0.07058824 -0.19701493 -0.151923077  0.2571429
> 4 -0.09411765  0.07910448  0.001923077 -0.3549451
> 5  0.01764706  0.12686567 -0.040384615 -0.1230769
> Warning message:
> In function (..., deparse.level = 1)  :
>  number of rows of result is not a multiple of vector length (arg 3)
>
> Groups 3 and 5 have only four observations each, so notice how their first
> residual is recycled into row 5. That's a potential gotcha.
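
One possible way around the recycling, sketched under the assumption that a rectangular table of residuals is what is wanted: pad the shorter vectors with NA before binding. The names res, n_max, and padded are mine; the data are rebuilt from the dput() output above.

```r
# Rebuild the example data and the per-species models
d <- data.frame(
  species = rep(c(1L, 2L, 3L, 5L), times = c(5, 5, 4, 4)),
  o2con = c(0.5, 0.6, 0.4, 0.4, 0.5, 0.3, 0.4, 0.5, 0.7, 0.9,
            0.3, 0.7, 0.4, 0.3, 0.3, 0.6, 0.9, 0.2),
  bm = c(5L, 2L, 4L, 2L, 3L, 7L, 8L, 3L, 4L, 2L,
         6L, 2L, 1L, 7L, 2L, 1L, 7L, 5L)
)
v <- lapply(split(d, d$species), function(df) lm(o2con ~ bm, data = df))

# Residuals as a list: the components keep their own lengths
res <- lapply(v, resid)

# Pad each vector with NA up to the longest length, then bind as columns
n_max <- max(lengths(res))
padded <- sapply(res, function(r) c(unname(r), rep(NA, n_max - length(r))))
```

Keeping the residuals in the list res avoids the issue altogether; the padded matrix is only needed if a table is more convenient.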
>
> This gives you a small glimpse into the power that R can deliver in data
> analysis.
>
> HTH,
> Dennis
>
> On Sun, Jul 18, 2010 at 2:29 PM, karmakiller <roisinmoriarty at gmail.com> wrote:
>
>>
>> Hi All,
>>
>> I have a large data set with many columns of data. One of these columns is a
>> species identifier and the remainder are variables such as temperature or
>> mass. Currently I am carrying out a single regression on subsets of the data
>> set, e.g. separated data sets with only the data from one species at a time.
>> I have been searching for a thread that will help me to understand how best
>> to repeat this process for each different species identifier in that
>> variable column. I can’t seem to find one that is similar to what I am
>> trying to do. It might be the case that I am not looking for the right thing
>> or that I do not fully understand the process.
>>
>> How do I run a simple loop that produces a regression for each species as
>> identified in the variable species id, which is one column in the large data
>> set that I am using?
>>
>> Simple regression that I wish to repeat
>>
>> data <- read.table("…/STUDY.txt", header = TRUE)
>> names(data)
>> model <- lm(o2con ~ bm, data = data)
>> summary(model)
>>
>>
>> sample data set
>>
>> species id      o2con       bm
>> 1               0.5         5
>> 1               0.6         2
>> 1               0.4         4
>> 1               0.4         2
>> 1               0.5         3
>> 2               0.3         7
>> 2               0.4         8
>> 2               0.5         3
>> 2               0.7         4
>> 2               0.9         2
>> 3               0.3         6
>> 3               0.7         2
>> 3               0.4         1
>> 3               0.3         7
>> 5               0.3         2
>> 5               0.6         1
>> 5               0.9         7
>> 5               0.2         5
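
If the file really has the header "species id" with a space, one workaround (a sketch only, assuming the layout shown above) is to skip the header line and supply the column names by hand. Only the first six rows are used here to keep it short; the column name species is my substitute for "species id".

```r
# First six rows of the sample data, including the problematic header line
lines <- c(
  "species id      o2con       bm",
  "1               0.5         5",
  "1               0.6         2",
  "1               0.4         4",
  "1               0.4         2",
  "1               0.5         3",
  "2               0.3         7"
)

# Skip the header and name the three columns explicitly
d6 <- read.table(text = paste(lines, collapse = "\n"),
                 skip = 1,
                 col.names = c("species", "o2con", "bm"))
```

Renaming the column in the file itself (for example to species_id) would avoid the need for skip and col.names.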
>>
>> I would be very grateful for some help with this. I really like using R and
>> I can usually figure out what I want to do but I have been trying to figure
>> this out for a while now and I am getting nowhere.
>>
>> Thank you.
>>
>> --
>> View this message in context:
>> http://r.789695.n4.nabble.com/simple-loop-analysing-subsets-tp2293383p2293383.html
>> Sent from the R help mailing list archive at Nabble.com.
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>



-- 
Joshua Wiley
Ph.D. Student, Health Psychology
University of California, Los Angeles
http://www.joshuawiley.com/


