[R] Using a by() function to process several regression (lm()) functions

Fri Nov 6 04:13:15 CET 2009

On Thu, Nov 5, 2009 at 11:12 AM, Marc Los Huertos <mloshuertos at csumb.edu> wrote:
> Hello,
>
> Thank you very much for looking at this. I have a "seasonal" user for R. I
> teach my undergrads and graduates students statistics using R and often find
> myself trying to solve problems to process student collected data in an
> efficient way.
>
> In this case, I have a data.frame with multiple observations. These are gas
> concentrations in a chamber and are used to measure into rates, usually (not
> not always), four or five observations per chamber.
>
> Data are from several sites and multiple dates. I'd like to run a regression
> to calculate the rate of emissions, for each date and site.
>
> The regression is using concentration ~ Sample_interval (which is in
> seconds).
>
> lm(Concentration ~ Sample_interval, data=df)
>
> The data.frame looks something like this:

  # The structure of your data got mangled in the email, so I'm
reprinting it here in a
  # more stable form incase anyone else wants to give this problem a try.
  experimentData <-  structure(list(Date = structure(c(-715808,
-715808, -715808,
-715808, -715808, -715808, -715808, -715808, -715533, -715533,
-715533, -715533, -715533, -715533, -715533, -715533), class = "Date"),
    Time = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
    1L, 1L, 1L, 1L, 1L, 1L), .Label = "5:00", class = "factor"),
    Site = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L,
    1L, 1L, 2L, 2L, 2L, 2L), .Label = c("Site 1", "Site 2"), class = "factor"),
    Sample_interval = c(0L, 566L, 1005L, 1244L, 0L, 466L, 1005L,
    1349L, 0L, 506L, 1065L, 1184L, 0L, 396L, 1022L, 1304L),
Concentration = c(334L,
    355L, 367L, 384L, 434L, 455L, 437L, 474L, 354L, 359L, 377L,
    394L, 424L, 495L, 497L, 574L)), .Names = c("Date", "Time",
"Site", "Sample_interval", "Concentration"), row.names = c(NA,
-16L), class = "data.frame")

> I tried this:
>
> bylist.lm <- by(df, list(df$Date, df$Site),
>      function(x) lm(Concentration~Sample_interval, data=x))
>
> Then, I get a list of stuff...which seems hard very exact much...
>
> I tried this...
>
> rate <- (sapply(bylist.lm, coef)[2,])
>
> and got, the rates, but I can't figure out how to line them up with the "by"
> variables.
>
> So... In a perfect world, I would like to get a data frame with the
> intercept, slope, r.square, p.value for each date, site, so the question
> involves extracting from a "by" class, and mode "list". Granted, I have been
> confused for a few years on how to think about classes and modes...so some
> insight there would also be quite appreciated...

Well, the by() function returns what is basically a glorified list of
objects.  In your first case, you got a bunch of "lm" linear model
objects since that is what the function you provided to by() returns.
If you want to return a table of data, you will need to provide a more
sophisticated function that fits the linear model and then extracts
the statistics of interest and returns a vector/list/data.frame
containing the results:

  modelResults <- by( experimentData, list( experimentData[[ 'Date'
]], experimentData[[ 'Site' ]] ), function( dataSlice ){

    linMod <- lm( Concentration ~ Sample_interval, data = dataSlice )

    # Slope and intercept may be recovered from the output of the
coef() function:
    intercept <- coef( linMod )[1]
    slope <- coef( linMod )[2]

    # The R-Squared value is returned by the summary() function:
    rsq <- summary( linMod )[[ 'r.squared' ]]

    # The summary function also provides statistics for the F-distribution,
    # extract them, reformat as a list, rename and feed to pf() using do.call()
    # in order to get the p-value:
    fStats <- as.list( summary( linMod )[[ 'fstatistic' ]] )
    names( fStats ) <- c( 'q', 'df1', 'df2' )
    fStats[[ 'lower.tail' ]] <- FALSE

    pVal <- do.call( pf, fStats )

    return(
      data.frame( slope, intercept, rsq, pVal )
    )

  })

So, now modelResults is a list returned from by(), each component of
which contains a row of the statistics of interest. To collapse it
into a data.frame, use rbind() inside do.call():

  modelResults <- do.call( rbind, modelResults )

The approach outlined above gets the job done-- however some
information is lost or mangled in the translation such as which Site
and Date were used to fit a given model.  A more elegant solution is
provided by the __ply() functions in Hadley Wickham's plyr package.
These do the same job as by(), but eliminate the need to post process
the data with do.call() and preserve information about the parameters
used to subset the data.  Here we use ddply() since we are inputting a
data.frame and expecting a data.frame as output:

  require( plyr )

  modelResults <- ddply( experimentData, c( 'Date', 'Site' ),
function( dataSlice ){

    linMod <- lm( Concentration ~ Sample_interval, data = dataSlice )

    intercept <- coef( linMod )[1]
    slope <- coef( linMod )[2]

    rsq <- summary( linMod )[[ 'r.squared' ]]

    fStats <- as.list( summary( linMod )[[ 'fstatistic' ]] )
    names( fStats ) <- c( 'q', 'df1', 'df2' )
    fStats[[ 'lower.tail' ]] <- FALSE

    pVal <- do.call( pf, fStats )

    return(
      data.frame( slope, intercept, rsq, pVal )
    )

  })

Hadley's ggplot2 package also provides a nice way to efficiently
subset your data and plot it along with the linear model fits:

  require( ggplot2 )

  resultPlot <- qplot( Sample_interval, Concentration, data = experimentData ) +
    # facet_grid() breaks your data into sub-groups based on the values of
    # Date and Site in experimentData.
    facet_grid( Date ~ Site ) +
    # geom_smooth can be used to fit a variety of models, in this case we want a
    # linear model instead of the default LOESS.
    geom_smooth( method = 'lm' ) +
    # theme_bw() changes the default color scheme.
    theme_bw()

  print( resultPlot )

I hope this helps!

-Charlie