[R] meaning of formula in aggregate function

Den d.kazakiewicz at gmail.com
Sun Jan 23 03:58:43 CET 2011

Dear Dennis
Thank you very much for your comprehensive reply and for the time you've
spent dealing with my e-mail.
Your kind explanation made things clearer for me.
After your explanation it looks simple.
lapply with the chosen options takes a small part of cycle<n> with the same id
(e.g. df[df$id==3,"cycle2"]) and makes from it just a bunch of
The only thing I still don't get is how this code gets rid of the
NAs, but this is a rather minor technical issue. My main question was
about the formula. You helped me indeed.
Thank you again
Have a nice day
From bending but not broken Belarus
On Sat, 22/01/2011 at 17:55 -0800, Dennis Murphy wrote:
> Hi:
> I wouldn't pretend to speak for Henrique, but I'll give it a shot.
> On Sat, Jan 22, 2011 at 4:44 AM, Den <d.kazakiewicz at gmail.com> wrote:
>         Dear R community
>         Recently, dear Henrique Dallazuanna literally saved me by solving
>         a data transformation problem, which is as follows:
>         (n_, _n, j_, k_ signify numbers)
>         SOURCE DATA:
>         id      cycle1  cycle2  cycle3  …       cycle_n
>         1       c       c       c               c
>         1       m       m       m               m
>         1       f       f       f               f
>         2       m       m       m               NA
>         2       f       f       f               NA
>         2       c       c       c               NA
>         3       a       a       NA              NA
>         3       c       c       c               NA
>         3       f       f       f               NA
>         3       NA      NA      m               NA
>         ...........................................
>         Q: How to transform source data to:
>         RESULT DATA:
>         id      cyc1    cyc2    cyc3    …       cyc_n
>         1       cfm     cfm     cfm             cfm
>         2       cfm     cfm     cfm
>         3       acf     acf     cfm
>         ...........................................
>         Henrique's solution is:
>         aggregate(. ~ id, lapply(df, as.character),
>                   FUN = function(x) paste(sort(x), collapse = ''),
>                   na.action = na.pass)
> The first part, . ~ id, is the formula. It's using every available
> variable in the input data on the left hand side of the formula except
> for id, which is the grouping variable.
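To see the call in action, here is a minimal sketch on a made-up data frame. The name `df` follows the thread, but the values are invented to mimic ids 1 and 3 of the SOURCE DATA, and `stringsAsFactors = FALSE` is an assumption added so the toy data is character in any R version.

```r
# Toy data mimicking ids 1 and 3 of the SOURCE DATA (values are made up)
df <- data.frame(
  id     = c(1, 1, 1, 3, 3, 3, 3),
  cycle1 = c("c", "m", "f", "a", "c", "f", NA),
  cycle2 = c("c", "m", "f", "a", "c", "f", NA),
  cycle3 = c("c", "m", "f", NA, "c", "f", "m"),
  stringsAsFactors = FALSE
)

# Henrique's call: every non-id column is aggregated within each id
res <- aggregate(. ~ id, lapply(df, as.character),
                 FUN = function(x) paste(sort(x), collapse = ""),
                 na.action = na.pass)
print(res)
```

For id 3 this should give acf, acf, cfm, which matches the RESULT DATA in the question.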
> The data object is lapply(df, as.character), which is a list object
> that translates every element to character. I'm guessing that each
> element of the list is a character string or list of character
> strings, but I'm not sure. It looks like the individual characters of
> each cycle comprise a list component within id. (??)  [My guess: the
> result of lapply() is a list of lists. The top-level list components
> correspond to the id's, while the second-level components are the
> cycle variables, whose elements are the characters in each cycle
> variable for each row with the same id.]
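The guess above can be checked with str(). In this sketch (toy `df` with made-up values), lapply() returns a flat list with one character vector per column, id included, rather than a list nested by id; the grouping by id seems to be left entirely to aggregate().

```r
# Toy data frame; values are invented for illustration
df <- data.frame(
  id     = c(1, 1, 3, 3),
  cycle1 = c("c", "m", "a", NA),
  stringsAsFactors = FALSE
)

# Convert every column to character; the result is a plain list of vectors
chars <- lapply(df, as.character)
str(chars)
```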
> The function to be applied to each id is described in FUN. As Peter
> mentioned, it's an 'anonymous' function, which means it is defined
> in-line. In this case, a generic input object x has its elements
> sorted in increasing order and then combined into a single string
> (the purpose of collapse = ); NA values are skipped
> over. Thus, if my hypothesis about the structure of the list is
> correct, the three characters in each cycle/id combination are first
> sorted and then combined into a single string, which is then output as
> the result. Given the way Henrique used the formula, aggregate()
> will march through each cycle variable within id and execute
> the function, and then iterate the process over all id's.
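The anonymous function can be tried on a single slice outside aggregate(). A small sketch, where `f` and `piece` are hypothetical names and `piece` stands in for what something like df[df$id == 3, "cycle3"] might hold:

```r
# The same anonymous function, given a name so it can be called directly
f <- function(x) paste(sort(x), collapse = "")

# One id's values for one cycle variable (made-up values)
piece <- c(NA, "c", "f", "m")

# sort() drops NA by default (its na.last argument defaults to NA)
sorted <- sort(piece)

# paste() then glues the sorted, NA-free values into one string
out <- f(piece)
print(sorted)
print(out)
```

So the "skipping" of NA values happens in sort(), before paste() ever sees them.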
>         Could somebody EXPLAIN HOW IT WORKS?
>         I mean Henrique saved my investigation indeed.
>         However, considering that I am about to perform an investigation
>         of cancer chemotherapy in 500 patients, it would be nice to know
>         what I am actually doing.
> Henrique's R knowledge is on a different level from most of us, so I
> understand your question :) 
>         1. All the help says about the LHS in formulas like '.~id' is
>         that its name is "dot notation", and not a single word more.
>         Thus, I have no clue what the dot in that formula really means.
> . is shorthand for 'everything not otherwise specified in the model
> formula'. In this case, it represents the entire set of cycle
> variables.
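One way to convince yourself of what the dot stands for is to spell the left-hand side out by hand. A minimal sketch, assuming a toy `df` with just two cycle columns (made-up values); `f`, `res_dot` and `res_named` are hypothetical names:

```r
# Toy data; values are invented for illustration
df <- data.frame(
  id     = c(1, 1, 1, 3, 3, 3, 3),
  cycle1 = c("c", "m", "f", "a", "c", "f", NA),
  cycle2 = c("c", "m", "f", NA, "c", "f", "m"),
  stringsAsFactors = FALSE
)
f <- function(x) paste(sort(x), collapse = "")
chars <- lapply(df, as.character)

# Dot on the left-hand side: "all variables except id"
res_dot <- aggregate(. ~ id, chars, FUN = f, na.action = na.pass)

# The same request with the cycle variables named explicitly
res_named <- aggregate(cbind(cycle1, cycle2) ~ id, chars,
                       FUN = f, na.action = na.pass)

print(res_dot)
print(res_named)
```

Both calls are expected to print the same table, since the dot is expanded to the variables that are not on the right-hand side.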
>         2. help says:
>          Note that ‘paste()’ coerces ‘NA_character_’, the character
>         missing value, to ‘"NA"’
>         And at the same time:
>          ‘na.pass’ returns the object unchanged.
>         I am happy that I don't have NAs in my data. I just don't
>         understand how it happened.
>         3. I can't see the real difference between 'FUN = function(x)
>         paste(x)' and 'FUN = paste'. However, the former works perfectly
>         while the latter simply does not.
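On question 3, the difference probably lies less in the anonymous wrapper than in what it adds: the sort() and the collapse argument. A small sketch with a made-up vector `x`; the names `a`, `b` and `f` are hypothetical:

```r
x <- c("m", "c", "f")

# Bare paste(): no collapse, so it returns one string per element
a <- paste(x)

# The anonymous function from the thread, given a name here
f <- function(x) paste(sort(x), collapse = "")

# Sorted, then collapsed into a single string per group
b <- f(x)

print(length(a))
print(b)
```

aggregate() needs one value per group and column to build a tidy result, which is what the collapse argument provides.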
>         All I can follow from the code above is that R breaks the data
>         into groups with the same id, then tears each little 'cycle'
>         piece into separate characters, sorts them, and puts these
>         characters together within the same id for each 'cycle'. I
>         don't see how R puts all this mess back together into a nice
>         data frame in long format. The NAs are also a question, as I
>         said before.
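On the NA question, two separate things seem to be going on: na.action = na.pass keeps rows containing NA in the data, and sort() then drops the NA values inside each group. A minimal sketch with made-up values; `keep` and `dropped` are hypothetical names, and the second call relies on the formula method's default na.action = na.omit:

```r
x <- c("a", "c", "f", NA)
with_na    <- paste(x, collapse = "")        # NA is coerced to the text "NA"
without_na <- paste(sort(x), collapse = "")  # sort() removes NA first

# Toy data for a single id (values are invented)
df <- data.frame(
  id     = c(3, 3, 3, 3),
  cycle1 = c("a", "c", "f", NA),
  cycle2 = c(NA, "c", "f", "m"),
  stringsAsFactors = FALSE
)
f <- function(x) paste(sort(x), collapse = "")

# na.pass: rows with NA stay in, so "a" and "m" survive
keep <- aggregate(. ~ id, lapply(df, as.character), FUN = f,
                  na.action = na.pass)

# Default na.omit: any row with an NA is removed before grouping
dropped <- aggregate(. ~ id, lapply(df, as.character), FUN = f)

print(keep)
print(dropped)
```

Without na.pass, the whole row containing an NA is discarded, so valid values in the other cycle columns of that row are lost as well.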
> By default, aggregate() will try to return a data frame. For each id,
> it will output the id and the result of the function applied to each
> cycle variable, so there should be one row for each id, and n + 1
> columns for the n cycle variables + id.
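How the pieces end up back in one data frame can be imitated by hand. The sketch below is not how aggregate() is implemented internally; it is only a hand-rolled split/apply/combine that should give the same shape, using a toy `df` with made-up values and hypothetical names `pieces`, `rows` and `manual`:

```r
# Toy data; values are invented for illustration
df <- data.frame(
  id     = c(1, 1, 1, 3, 3, 3, 3),
  cycle1 = c("c", "m", "f", "a", "c", "f", NA),
  cycle2 = c("c", "m", "f", NA, "c", "f", "m"),
  stringsAsFactors = FALSE
)
f <- function(x) paste(sort(x), collapse = "")

# 1. Split the cycle columns into one small data frame per id
pieces <- split(df[c("cycle1", "cycle2")], df$id)

# 2. Apply f to every column of every piece: one string per column
rows <- lapply(pieces, function(piece) vapply(piece, f, character(1)))

# 3. Stack the per-id results and put the id back in front
manual <- data.frame(id = names(pieces), do.call(rbind, rows),
                     stringsAsFactors = FALSE)
print(manual)
```

With two ids and two cycle variables this gives 2 rows and 2 + 1 columns, in line with the description above.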
> Does that help?
> Cheers,
> Dennis 
>         Could you please shed some light on it, if you don't mind
>         answering these naive questions?
>         ______________________________________________
>         R-help at r-project.org mailing list
>         https://stat.ethz.ch/mailman/listinfo/r-help
>         PLEASE do read the posting guide
>         http://www.R-project.org/posting-guide.html
>         and provide commented, minimal, self-contained, reproducible
>         code.
