[R] meaning of formula in aggregate function

Sun Jan 23 16:33:18 CET 2011

Dear Peter
Thank you
Lo and behold
Now I've got it

In code
aggregate(.~ id, lapply(df, as.character), FUN =
function(x)paste(sort(x), collapse = ''), na.action = na.pass)

there are no contradictions with NAs.

na.action = na.pass overrides the default of aggregate, which is na.omit, so rows containing NAs are kept.
Afterwards those NAs are removed by sort(), whose default na.last = NA drops them.
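
A small check of this, as a sketch only: the data frame df below is made up to mimic the layout of the example in the original question (it is not the real data), and the names res and res_omit are just for illustration. The columns are stored as factors on purpose, which is what data.frame() did by default at the time and is why the call converts them with as.character first.

```r
# Toy data (hypothetical), shaped like the example in the original question
df <- data.frame(
  id     = c(1, 1, 1, 3, 3, 3, 3),
  cycle1 = c("c", "m", "f", "a", "c", "f", NA),
  cycle2 = c("c", "m", "f", NA, "c", "f", "m"),
  stringsAsFactors = TRUE
)

# sort() drops NAs by default (na.last = NA)
sort(c("f", NA, "a", "c"))   # "a" "c" "f"

# na.pass keeps the rows with NAs; sort() then removes the NAs inside each group
res <- aggregate(. ~ id, lapply(df, as.character),
                 FUN = function(x) paste(sort(x), collapse = ""),
                 na.action = na.pass)
res

# For comparison: with the default na.action = na.omit, rows containing
# an NA are dropped before FUN is ever called
res_omit <- aggregate(. ~ id, lapply(df, as.character),
                      FUN = function(x) paste(sort(x), collapse = ""))
res_omit
```

In res, id 1 gives cfm in both columns and id 3 gives acf and cfm, with no "NA" pasted into the strings.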

It is a lot easier for me to deal with data when I know what I am doing.
Thank you again for your help. Sorry for the annoying, naive questions.

With best regards
Denis Kazakiewicz
Belarus 

 On Sun, 23/01/2011 at 05:38 -0800, P Ehlers wrote:
> Den wrote:
> > Dear Dennis
> > Thank you very much for your comprehensive reply and for the time you've
> > spent dealing with my e-mail.
> > Your kind explanation made things clearer for me.
> > After your explanation it looks simple.
> > lapply with the chosen options takes a small part of cycle<n> with the same id
> > (e.g. df[df$id==3,"cycle2"]) and makes from it just a bunch of
> > characters.
> > The only thing I still don't get is how this code gets rid of the
> > NAs, but this is a rather minor technical issue. The main question for me was
> > about the formula. You helped me indeed.
> 
> Okay, now I see what you're asking regarding the NAs.
> I should have realized it before. Anyway, the answer
> is in the function sort(). Have a look at its help
> page and note what sort does when 'na.last=NA', the
> default. You'll see where the NAs went.
> 
> Peter Ehlers
> 
> > Thank you again
> > Have a nice day
> > Denis
> > From bending but not broken Belarus
> > On Sat, 22/01/2011 at 17:55 -0800, Dennis Murphy wrote:
> >> Hi:
> >>
> >> I wouldn't pretend to speak for Henrique, but I'll give it a shot.
> >>
> >> On Sat, Jan 22, 2011 at 4:44 AM, Den <d.kazakiewicz at gmail.com> wrote:
> >>         Dear R community
> >>         Recently, dear Henrique Dallazuanna literally saved me by solving
> >>         the following data transformation problem:
> >>         
> >>         (n_, _n, j_, k_ signify numbers)
> >>         
> >>         SOURCE DATA:
> >>         id      cycle1  cycle2  cycle3  …       cycle_n
> >>         1       c       c       c               c
> >>         1       m       m       m               m
> >>         1       f       f       f               f
> >>         2       m       m       m               NA
> >>         2       f       f       f               NA
> >>         2       c       c       c               NA
> >>         3       a       a       NA              NA
> >>         3       c       c       c               NA
> >>         3       f       f       f               NA
> >>         3       NA      NA      m               NA
> >>         ...........................................
> >>         
> >>         
> >>         Q: How to transform source data to:
> >>         RESULT DATA:
> >>         id      cyc1    cyc2    cyc3    …       cyc_n
> >>         1       cfm     cfm     cfm             cfm
> >>         2       cfm     cfm     cfm
> >>         3       acf     acf     cfm
> >>         ...........................................
> >>         
> >>         
> >>         
> >>         Henrique's solution is:
> >>         
> >>         aggregate(.~ id, lapply(df, as.character), FUN =
> >>         function(x)paste(sort(x), collapse = ''), na.action = na.pass)
> >>
> >> The first part, . ~ id, is the formula. It's using every available
> >> variable in the input data on the left hand side of the formula except
> >> for id, which is the grouping variable.
> >>
> >> The data object is lapply(df, as.character), which is a list object
> >> that translates every element to character. I'm guessing that each
> >> element of the list is a character string or list of character
> >> strings, but I'm not sure. It looks like the individual characters of
> >> each cycle comprise a list component within id. (??)  [My guess: the
> >> result of lapply() is a list of lists. The top-level list components
> >> correspond to the id's, while the second-level components are the
> >> cycle variables, whose elements are the characters in each cycle
> >> variable for each row with the same id.]
> >>
> >> The function to be applied to each id is described in FUN. As Peter
> >> mentioned, it's an 'anonymous' function, which means it is defined
> >> in-line. In this case, a generic input object x has its elements
> >> sorted in increasing order and then combined into a
> >> single string (the purpose of collapse = ); NA values are skipped
> >> over. Thus, if my hypothesis about the structure of the list is
> >> correct, the three characters in each cycle/id combination are first
> >> sorted and then combined into a single string, which is then output as
> >> the result. By the way that Henrique used the formula, the aggregate()
> >> function will march through each cycle variable within id and execute
> >> the function, and then iterate the process over all id's. 
> >>
> >>         
> >>         
> >>         Could somebody EXPLAIN HOW IT WORKS?
> >>         I mean, Henrique saved my investigation indeed.
> >>         However, considering that I am about to perform an investigation
> >>         of cancer chemotherapy in 500 patients, it would be nice to know
> >>         what I am actually doing.
> >>
> >> Henrique's R knowledge is on a different level from most of us, so I
> >> understand your question :) 
> >>
> >>         
> >>         1. All the help says about the LHS in formulas like '.~id' is that
> >>         its name is "dot notation", and not a single word more. Thus, I
> >>         have no clue what the dot in that formula really means.
> >>
> >> . is shorthand for 'everything not otherwise specified in the model
> >> formula'. In this case, it represents the entire set of cycle
> >> variables.
> >>  
> >>
> >>         2. The help says:
> >>          Note that ‘paste()’ coerces ‘NA_character_’, the character
> >>         missing value, to ‘"NA"’
> >>         And at the same time:
> >>          ‘na.pass’ returns the object unchanged.
> >>         I am happy that I don't have NAs in my data. I just don't
> >>         understand how it happened.
> >>         3. I can't see the real difference between 'FUN = function(x)
> >>         paste(x)' and 'FUN = paste'. However, the former works perfectly
> >>         while the latter simply does not.
> >>         
> >>         
> >>         All I can follow from the code above is that R breaks the data into
> >>         groups with the same id, then tears each little 'cycle' piece into
> >>         separate characters, then sorts them and puts these characters
> >>         together within the same id for each 'cycle'. I miss how R puts all
> >>         this mess back together into a nice data frame in long format.
> >>         The NAs are also a question, as I said before.
> >>
> >> By default, aggregate() will try to return a data frame. For each id,
> >> it will output the id and the result of the function applied to each
> >> cycle variable, so there should be one row for each id, and n + 1
> >> columns for the n cycle variables + id.
> >>
> >> Does that help?
> >>
> >> Cheers,
> >> Dennis 
> >>
> >>         
> >>         Could you please shed some light on this, if you don't mind
> >>         answering these naive questions?
> >>         
> >>         ______________________________________________
> >>         R-help at r-project.org mailing list
> >>         https://stat.ethz.ch/mailman/listinfo/r-help
> >>         PLEASE do read the posting guide
> >>         http://www.R-project.org/posting-guide.html
> >>         and provide commented, minimal, self-contained, reproducible
> >>         code.
> >>
> > 

-- 
Den <d.kazakiewicz at gmail.com>