[R] meaning of formula in aggregate function

Sat Jan 22 16:36:53 CET 2011

Den wrote:
> Dear R community
> Recently, dear Henrique Dallazuanna literally saved me solving one
> problem on data transformation which follows:
> 
> (n_, _n, j_, k_ signify numbers)
> 
> SOURCE DATA:   
> id      cycle1  cycle2  cycle3  …       cycle_n
> 1       c       c       c               c
> 1       m       m       m               m
> 1       f       f       f               f
> 2       m       m       m               NA
> 2       f       f       f               NA
> 2       c       c       c               NA
> 3       a       a       NA              NA
> 3       c       c       c               NA
> 3       f       f       f               NA
> 3       NA      NA      m               NA
> ...........................................
> 
> 
> Q: How to transform source data to:
> RESULT DATA:
> id      cyc1    cyc2    cyc3    …       cyc_n
> 1       cfm     cfm     cfm             cfm
> 2       cfm     cfm     cfm             
> 3       acf     acf     cfm             
> ...........................................
> 
>  
> 
> Henrique's solution is:
> 
> aggregate(. ~ id, lapply(df, as.character),
>           FUN = function(x) paste(sort(x), collapse = ''),
>           na.action = na.pass)
> 
> 
> Could somebody EXPLAIN HOW IT WORKS?
> I mean Henrique saved my investigation indeed.
> However, considering the fact, that I am about to perform investigation
> of cancer chemotherapy in 500 patients, it would be nice to know what 
> I am actually doing.
> 
> 1. All the help says about the LHS in formulas like '.~id' is that its
> name is "dot notation". And not a single word more. Thus, I have no
> clue what the dot in that formula really means.

Well, ?aggregate does (rather gently) point you to the
help page for _formula_, where you will find quite a few
words about the use of '.' in the Details section.
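
As a rough illustration of what that help page describes, here is a
minimal sketch on made-up toy data (the data frame and its values are
assumptions for illustration, not Den's data). In a formula used with a
data argument, '.' stands for all columns not otherwise named in the
formula, so the two calls below should give the same result:

```r
# Hypothetical toy data: two numeric "cycle" columns and a grouping id.
df <- data.frame(id     = c(1, 1, 2, 2),
                 cycle1 = c(1, 2, 3, 4),
                 cycle2 = c(10, 20, 30, 40))

# '.' on the left-hand side: every column except 'id' is aggregated.
res_dot <- aggregate(. ~ id, data = df, FUN = sum)

# The same request with the left-hand side spelled out explicitly.
res_explicit <- aggregate(cbind(cycle1, cycle2) ~ id, data = df, FUN = sum)

print(res_dot)
print(isTRUE(all.equal(res_dot, res_explicit)))
```

So '. ~ id' can be read as "aggregate everything else, grouped by id".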

> 2. help says:
>  Note that ‘paste()’ coerces ‘NA_character_’, the character missing
> value, to ‘"NA"’
> And at the same time:
>  ‘na.pass’ returns the object unchanged.
> I am happy, that I don't have NAs in mydata.  I just don't understand
> how it happened.

I don't understand what you're asking.
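
If the question is why no "NA" strings show up in the result, one
possible reading (an assumption about what is being asked) is sketched
below on a made-up vector. With its default na.last = NA, sort() drops
missing values, so paste() never sees them; na.action = na.pass only
makes sure that rows containing NA are handed on to FUN rather than
removed beforehand (the formula method's default is na.omit).

```r
# Hypothetical values for one id within one cycle column.
x <- c("m", NA, "c", "f")

# sort() drops NA by default (na.last = NA).
sort(x)                                         # "c" "f" "m"

# With sort(): the NA is gone before paste() runs.
with_sort <- paste(sort(x), collapse = "")      # "cfm"

# Without sort(): paste() turns the missing value into the text "NA".
without_sort <- paste(x, collapse = "")         # "mNAcf"

# A group that is entirely missing collapses to an empty string.
all_missing <- paste(sort(c(NA_character_, NA_character_)), collapse = "")

print(c(with_sort, without_sort, all_missing))
```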

> 3. Can't see the real difference between 'FUN = function(x) paste(x)'
> and 'FUN = paste'. However, the former works perfectly while the latter
> simply does not.

That's not quite true. Henrique's solution uses
paste(sort(x), collapse = ''), not just paste(x). And that's
precisely the point: when the function you need is not 'simple',
you have to define it. Henrique defines it 'on the fly' (as an
anonymous function); you could also define it separately before
the aggregate() call and then use it like this:

myfun <- function(x) paste(sort(x), collapse='')
aggregate(...., FUN = myfun, ....)
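
For a version that runs as is, here is a small sketch on made-up data
(the data frame and the single column cycle1 are assumptions for
illustration; the real call would use '. ~ id' on the full data). It
also shows what happens with a bare FUN = paste, where collapse can be
passed along but nothing gets sorted:

```r
# Hypothetical toy data: one cycle column, two patients.
df <- data.frame(id     = c(1, 1, 1, 3, 3, 3),
                 cycle1 = c("c", "m", "f", "a", "c", "f"),
                 stringsAsFactors = FALSE)

# Named function, defined before the aggregate() call.
myfun <- function(x) paste(sort(x), collapse = "")
res_named <- aggregate(cycle1 ~ id, data = df, FUN = myfun)

# The same function written 'on the fly' gives the same result.
res_inline <- aggregate(cycle1 ~ id, data = df,
                        FUN = function(x) paste(sort(x), collapse = ""))

# Bare paste: extra arguments such as collapse are passed through to FUN,
# but the letters stay in their original row order.
res_paste <- aggregate(cycle1 ~ id, data = df, FUN = paste, collapse = "")

print(res_named)
print(res_paste)
```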

Peter Ehlers

> 
> 
> All I can follow from the code above is that R breaks the data into
> groups with the same id, then tears each little 'cycle' piece into
> separate characters, then sorts them and puts these characters back
> together within the same id for each 'cycle'. I miss how R puts all this
> mess back together into a nice data frame in long format. NAs are also a
> question, as I said before.
> 
> Could you please shed some light on it, if you don't mind answering
> these naive questions.
> 