[R] Looping through different groups of variables in models

Thu Sep 1 13:01:26 CEST 2016

Thanks so much everybody, especially to Dennis. I didn't really occur to me
that I could put the models into a list. I have used dplyr for simple data
transformations and will definitely look into it.

On Thu, Sep 1, 2016 at 8:10 AM, Dennis Murphy <djmuser at gmail.com> wrote:

> Hi:
>
> See inline.
>
> On Wed, Aug 31, 2016 at 2:58 PM, Kai Mx <govokai at gmail.com> wrote:
> > Hi all,
> >
> > I am having trouble wrapping my head around a probably simple issue:
> >
> > After using the reshape package, I have a melted dataframe with the
> columns
> > group (factor), time (int), condition (factor), value(int).
> >
> > These are experimental data. The data were obtained from different
> > treatment groups (group) under different conditions at different time
> > points.
> >
> > I would now like to perform ANOVA, boxplots and calculate means to
> compare
> > groups for all combinations of conditions and time points with something
> > like
> >
> > fit <- lm(value~group, data=[subset of data with combination of
> > condition/timepoint])
> > summary (fit)
> > p <- ggplot([subset of data with combination of condition/timepoint],
> > aes(x= group, y=value)) + geom_boxplot ()
> > print (p)
> > tapply ([subset of data with combination of condition/timepoint]$value,
> > subset of data with combination of condition/timepoint]$group, mean)
>
> There is a traditional approach to this class of problem and an
> evolving 'modern' approach. The traditional approach is to use
> lapply() to produce a list of model objects (one per subgroup); the
> more recent approach is to take advantage of the pipeline operations
> enabled by the magrittr package, e.g.,via the dplyr and tidyr
> packages.  Related packages are purrr and broom; the former applies
> functional programming ideas to map a function recursively across a
> list, while the latter focuses on converting information from model
> objects into data frames that are more compatible with dplyr and
> friends.A package released to CRAN within the past couple of days,
> modelr, adds a few bells and whistles (e.g., bootstrap regression,
> adding columns of predicted values or residuals to a data frame), but
> you don't need it for your immediate purposes.
>
> Below is a generic approach to solving the types of problems you
> described above, which is the best one can do in the absence of a
> reproducible example. Therefore, if this doesn't work out of the box,
> you'll have to fix your data. (I won't do it for you, sorry.)
>
> You could do something like what you have in mind in plyr as follows,
> where md is a surrogate for the name of your melted data frame:
>
> library(plyr)
> L <- dlply(md, .(condition, time), function(d) lm(value ~ group, data = d))
>
> This would produce a list of models, one per condition * time
> combination. You could then use do.call() or another plyr function to
> extract elements of interest from the list, which generally would
> require that you write one or more (anonymous) functions to extract
> the information of interest. A similar approach can be used to
> generate a list of ggplots. It's cleaner if you put your code into
> functions and have it return the output you want, but you have to be
> careful about the form of the input and output - for dlply(), you want
> a function that takes a data frame as input. If you just want the
> plots printed, you could write a function to do that for a single plot
> (again with a data frame as input) and then use the d_ply() function
> in plyr to print them en masse, but it would generally make more sense
> to write them to files, so you'd probably be better off writing a
> function that ends with a ggsave() call and call d_ply(). [Note: the _
> is used when your function creates a side effect, such as printing or
> saving a plot object - it returns nothing to the R console.]
>
> As for the numeric summaries,
>
> ddply(md, .(condition, time, group), function(d) mean(d$value, na.rm =
> TRUE))
>
> would work. The advantage of plyr (and its successor, dplyr) is that
> you can pass arbitrary functions as the third argument as long as the
> input is a data frame and the output is a data frame (or something
> that can be coerced to a data frame). This is more robust than
> tapply().
>
> Comment: plyr/reshape2 is a good starter package combination as it
> teaches you the value of the split-apply-combine approach to data
> analysis, but it can be (very) slow. The dplyr/tidyr package
> combination is faster, more computationally efficient version of
> plyr::ddply() and reshape2 and is recommended for use, although you
> have to learn a somewhat different approach to R programming in the
> process. If you're fairly new to R, that shouldn't matter much.
>
> There has been a lot of work in the last year or two to improve the
> flow of programming for tasks such as recursive plotting or model
> fitting. The dplyr and tidyr packages are meant to be replacements for
> plyr and reshape[2], respectively; both are written by Hadley Wickham.
> These packages extensively use operators created by Stephan Bache in
> the magrittr package, particularly %>%, the pipeline operator, and .,
> the current data object operator. %>% is read as "then": its purpose
> is to take a data frame object as input and return the result of an
> operation on the data (i.e., a function call) as a data frame.
> Repeated application results in a chain of operations that establishes
> a programming flow and has the nice side effect of producing more
> readable, better organized code.
>
> There are some beautiful advantages to this approach to programming,
> but there are also some challenges, particularly when attempting to
> write functions in dplyr. At this point, you don't need to worry about
> it. The analogy to plyr::dlply() is dplyr::do(), which applies a
> function to each subgroup of data defined by the grouping variables.
> Related packages for modeling include broom (highly recommended) and
> perhaps modelr, although the latter is probably not relevant for your
> particular problem. OTOH, broom is certainly relevant.
>
> The most important functions in dplyr are "verbs" that apply a
> specific action to the input data:
>
> * group_by()   defines grouping variables, applies the "split" aspect
> of split-apply-combine
> * filter()     selects rows either by index or by logical expression
> * select()  selects variables (columns) either by column number or name
> * arrange()    sorts rows by value with respect to one or more
> grouping variables
> * mutate()     defines new variables or modifies existing ones
> * summarise()   applies a one-number summary function to a variable
>
> These are referred to as "one-table verbs". Functions that are
> "two-table" verbs are typically used to produce joins of two tables.
>
> The broom package is designed to summarize the output from model
> objects into data frames. The two primary functions are tidy(), which
> returns a summary data frame of the model coefficients, and glance(),
> which returns a vector of summary statistics (e.g, r^2, RMSE)
> associated with the model. The advantage is that the functions are
> designed to work in a dplyr data pipeline and can be used to combine
> the results from a list of model fits.
>
> Here's a template to pull off the model fitting exercise in dplyr,
> tidyr and broom. (Note: dplyr has many of the same functions as plyr,
> so if you need to load both, load plyr first and then load dplyr.)
>
> library(dplyr)
> library(tidyr)
> library(broom)
>
> ## Return an object that is effectively a list of models
> mods <- md %>%
>               group_by(condition, time) %>%
>               do(mod = lm(value ~ group, data = .)    # mod is an
> object of class lm
>
> # Note:   . is a placeholder for the current data
>
> # Now access tidy summaries of the models using broom - see its
> documentation
> # for details
> mods %>% tidy(mod, conf.int = TRUE)
> mods %>% glance(mod)
>
>
> For the ggplots, something like the following would apply:
>
> md %>% group_by(condition, time) %>%    # splits the data into subgroups
>                ggplot(., aes(x = group, y = value)) +
>                    geom_boxplot() +
>                    ggsave(paste(paste(group, value, sep = "_"), "png",
> sep = "."))
>
>
> The equivalent of tapply() in dplyr is a summarise() function:
>
> md %>% group_by(condition, time, group) %>%
>                summarise(mean = mean(value, na.rm = TRUE))
>
> Several variations of summarise() are present in dplyr which give it a
> great deal of flexibility; e.g., summarise_each() applies the same
> function to a set of variables or allows multiple functions to be
> applied to one or more variables:
>
> md %>% group_by(condition, time, group) %>%
>               summarise_at(vars(value), funs(mean, sd, min, max),  na.rm =
> TRUE)
>
>
> If you [intend to] use R on a regular basis and commonly encounter the
> types of problems you've described here, I'd strongly suggest that you
> look carefully into dplyr and friends. If you have the time, I'd also
> suggest looking into the data.table package, which is usually a bit
> faster than dplyr. It does many of the same types of tasks; recently,
> Hadley recently released the dtplyr package which allows data.table
> objects to be passed through the pipeline and allows use of data.table
> code in the pipeline. The combination of the two is powerful.
>
> Comment: Bert Gunter's advice about modeling the data en masse is
> important, because it allows one to investigate potential interactions
> and relationships that you cannot do with data subsets. His cautions
> should not be ignored. OTOH, sometimes data is intentionally
> segregated and listwise modeling/summarization is appropriate. I don't
> know which of these situations applies to you, but you need to
> consider the implications of sub-analyses carefully.
>
> Sorry for the length of this missive, but I hope it is of some help to you.
>
> Dennis
> >
> > How can I loop through these combinations and output the data in an
> elegant
> > way?
> >
> > Thanks so much!
> >
> > Best,
> >
> > Kai
> >
> >         [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/
> posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>

	[[alternative HTML version deleted]]