[R] meaning of formula in aggregate function

Den d.kazakiewicz at gmail.com
Sat Jan 22 13:44:59 CET 2011


Dear R community,
Recently Henrique Dallazuanna literally saved me by solving a data
transformation problem, which is as follows:

(n_, _n, j_, k_ signify numbers)

SOURCE DATA:   
id      cycle1  cycle2  cycle3  …       cycle_n
1       c       c       c               c
1       m       m       m               m
1       f       f       f               f
2       m       m       m               NA
2       f       f       f               NA
2       c       c       c               NA
3       a       a       NA              NA
3       c       c       c               NA
3       f       f       f               NA
3       NA      NA      m               NA
...........................................


Q: How to transform source data to:
RESULT DATA:
id      cyc1    cyc2    cyc3    …       cyc_n
1       cfm     cfm     cfm             cfm
2       cfm     cfm     cfm             
3       acf     acf     cfm             
...........................................

 

Henrique's solution is:

aggregate(. ~ id, lapply(df, as.character),
          FUN = function(x) paste(sort(x), collapse = ''),
          na.action = na.pass)
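
For anyone who wants to try this, here is a minimal sketch on a toy version of the source data (three cycles only, values copied from the table above). It does not run the formula call itself. Instead it spells the same grouping out with the data-frame method of aggregate(), on the assumption that the dot simply stands for every column other than id. The names value_cols and collapse_sorted are made up for the illustration.

```r
# Toy copy of the SOURCE DATA table, limited to three cycles.
df <- data.frame(
  id     = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  cycle1 = c("c", "m", "f", "m", "f", "c", "a", "c", "f", NA),
  cycle2 = c("c", "m", "f", "m", "f", "c", "a", "c", "f", NA),
  cycle3 = c("c", "m", "f", "m", "f", "c", NA, "c", "f", "m"),
  stringsAsFactors = FALSE
)

# Sort the entries of one group (sort() drops NA), then glue them together.
collapse_sorted <- function(x) paste(sort(x), collapse = "")

# Assumption: '.' on the left of '. ~ id' means "all columns except id".
value_cols <- setdiff(names(df), "id")

# Data-frame method of aggregate(): group the value columns by id and
# apply collapse_sorted to each column within each group.
result <- aggregate(df[value_cols], by = df["id"], FUN = collapse_sorted)
print(result)
```

On this toy input the result has one row per id and one collapsed string per cycle column, which is the shape of the RESULT DATA table.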


Could somebody EXPLAIN HOW IT WORKS?
Henrique really did save my investigation.
However, since I am about to study cancer chemotherapy in 500 patients,
it would be nice to know what I am actually doing.

1. All the help says about the LHS in formulas like '. ~ id' is that it
is called "dot notation", and not a single word more. Thus I have no
clue what the dot in that formula really means.
2. The help says:
 Note that ‘paste()’ coerces ‘NA_character_’, the character missing
value, to ‘"NA"’
And at the same time:
 ‘na.pass’ returns the object unchanged.
I am happy that I don't end up with NAs in my data. I just don't
understand how that happened.
3. I can't see the real difference between 'FUN = function(x) paste(x)'
and 'FUN = paste'. However, the former works perfectly while the latter
simply does not.
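
The pieces of questions 2 and 3 can be tried on a single vector, outside aggregate(). This is only a minimal sketch, and x is a made-up example vector. According to the help for sort(), its default na.last = NA removes missing values, which would explain why paste() never sees an NA in the solution above.

```r
# One group's worth of one 'cycle' column, with a missing value.
x <- c("m", NA, "c", "f")

# sort() defaults to na.last = NA, which removes missing values.
sorted <- sort(x)

# With collapse, paste() returns ONE string for the whole group.
with_sort <- paste(sorted, collapse = "")

# Without sort(), the NA is coerced to the string "NA".
without_sort <- paste(x, collapse = "")

# Without collapse, paste() returns one string per element, so a bare
# FUN = paste hands back several values per group instead of one.
no_collapse <- paste(x)

print(sorted)        # "c" "f" "m"
print(with_sort)     # "cfm"
print(without_sort)  # "mNAcf"
print(no_collapse)   # "m" "NA" "c" "f"
```

So the difference in question 3 seems to lie less in the wrapper function itself than in what it adds: sort() and collapse = ''. Since aggregate() passes extra arguments on to FUN, 'FUN = paste, collapse = ""' should also give one string per group, but it would neither sort the letters nor drop the NAs.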


All I can gather from the code above is that R breaks the data into groups
with the same id, then tears each little 'cycle' piece into separate
characters, sorts them, and puts these characters back together within the
same id for each 'cycle'. I don't see how R puts all this mess back together
into a nice data frame in long format. The NAs are also a question, as I
said before.
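
The split, apply and combine steps described above can be written out by hand. This is a sketch of the general idea under stated assumptions, not the actual internals of aggregate(); pieces, rows and collapse_sorted are made-up names, and the toy data covers only ids 2 and 3 from the table. Note that nothing has to be torn into characters here: each cell is already a single element of its column, so sort() just reorders the elements within a group.

```r
# Toy data: two ids, three cycles (values follow the SOURCE DATA table).
df <- data.frame(
  id     = c(2, 2, 2, 3, 3, 3, 3),
  cycle1 = c("m", "f", "c", "a", "c", "f", NA),
  cycle2 = c("m", "f", "c", "a", "c", "f", NA),
  cycle3 = c("m", "f", "c", NA, "c", "f", "m"),
  stringsAsFactors = FALSE
)

collapse_sorted <- function(x) paste(sort(x), collapse = "")

# 1. Split: one small data frame per id, without the id column.
pieces <- split(df[-1], df$id)

# 2. Apply: within each piece, reduce every cycle column to one string.
rows <- lapply(pieces, function(piece) sapply(piece, collapse_sorted))

# 3. Combine: stack the per-id vectors and put the id back in front.
result <- data.frame(id = names(pieces), do.call(rbind, rows),
                     stringsAsFactors = FALSE)
print(result)
```

Because every group is reduced to exactly one string per column, stacking the groups gives a rectangular table with one row per id.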

Could you please shed some light on this, if you don't mind answering
these naive questions?


