[R] meaning of formula in aggregate function
P Ehlers
ehlers at ucalgary.ca
Sun Jan 23 14:38:54 CET 2011
Den wrote:
> Dear Dennis
> Thank you very much for your comprehensive reply and for the time you've
> spent dealing with my e-mail.
> Your kind explanation made things clearer for me.
> After your explanation it looks simple.
> lapply with the chosen options takes the small part of cycle<n> with the
> same id (e.g. df[df$id==3,"cycle2"]) and makes from it just a bunch of
> characters.
> The only thing I still don't get is how this code gets rid of the
> NAs, but this is a rather minor technical issue. My main question was
> about the formula. You helped me indeed.
Okay, now I see what you're asking regarding the NAs.
I should have realized it before. Anyway, the answer
is in the function sort(). Have a look at its help
page and note what sort does when 'na.last=NA', the
default. You'll see where the NAs went.
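
A minimal sketch of that behaviour, using a made-up vector x (hypothetical
values, not taken from the original data):

```r
# Hypothetical cycle values for one id, with one missing entry
x <- c("m", NA, "c")

# With the default na.last = NA, sort() removes the NAs
sorted <- sort(x)

# so paste() never sees an NA that it would coerce to "NA"
with_sort    <- paste(sorted, collapse = "")
without_sort <- paste(x, collapse = "")

# A group that is entirely NA sorts to an empty vector,
# which paste() collapses to the empty string
all_na <- paste(sort(c(NA_character_, NA_character_)), collapse = "")

print(with_sort)     # "cm"
print(without_sort)  # "mNAc"
print(all_na)        # ""
```

The last case would account for the empty cells in the result table: a
cycle that is all NA for an id ends up as "" rather than "NA".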
Peter Ehlers
> Thank you again
> Have a nice day
> Denis
> From bending but not broken Belarus
> On Sat, 22/01/2011 at 17:55 -0800, Dennis Murphy wrote:
>> Hi:
>>
>> I wouldn't pretend to speak for Henrique, but I'll give it a shot.
>>
>> On Sat, Jan 22, 2011 at 4:44 AM, Den <d.kazakiewicz at gmail.com> wrote:
>> Dear R community
>> Recently, dear Henrique Dallazuanna literally saved me by solving one
>> problem on data transformation, which follows:
>>
>> (n_, _n, j_, k_ signify numbers)
>>
>> SOURCE DATA:
>> id  cycle1  cycle2  cycle3  …  cycle_n
>> 1   c       c       c          c
>> 1   m       m       m          m
>> 1   f       f       f          f
>> 2   m       m       m          NA
>> 2   f       f       f          NA
>> 2   c       c       c          NA
>> 3   a       a       NA         NA
>> 3   c       c       c          NA
>> 3   f       f       f          NA
>> 3   NA      NA      m          NA
>> ...........................................
>>
>>
>> Q: How to transform source data to:
>> RESULT DATA:
>> id  cyc1  cyc2  cyc3  …  cyc_n
>> 1   cfm   cfm   cfm      cfm
>> 2   cfm   cfm   cfm
>> 3   acf   acf   cfm
>> ...........................................
>>
>>
>>
>> Henrique's solution is:
>>
>> aggregate(. ~ id, lapply(df, as.character),
>>           FUN = function(x) paste(sort(x), collapse = ''),
>>           na.action = na.pass)
>>
>> The first part, . ~ id, is the formula. It's using every available
>> variable in the input data on the left hand side of the formula except
>> for id, which is the grouping variable.
>>
>> The data object is lapply(df, as.character), which returns a list with
>> one component per column of df (id and each cycle variable), each
>> converted to a character vector. The list is not nested by id; the
>> grouping by id is done by aggregate() itself, whose formula method
>> accepts a list as well as a data frame for its data argument.
>>
>> The function to be applied to each id is described in FUN. As Peter
>> mentioned, it's an 'anonymous' function, which means it is defined
>> in-line. In this case, the elements of a generic input object x are
>> sorted in increasing order and then combined into a single string
>> (the purpose of collapse = ); NA values are dropped by sort() before
>> paste() sees them. Thus, the characters in each cycle/id combination
>> are first sorted and then combined into a single string, which is then
>> output as the result. By the way that Henrique used the formula, the
>> aggregate() function will march through each cycle variable within id
>> and execute the function, and then iterate the process over all id's.
>>
>>
>>
>> Could somebody EXPLAIN HOW IT WORKS?
>> I mean, Henrique saved my investigation indeed.
>> However, considering the fact that I am about to perform an
>> investigation of cancer chemotherapy in 500 patients, it would be nice
>> to know what I am actually doing.
>>
>> Henrique's R knowledge is on a different level from most of us, so I
>> understand your question :)
>>
>>
>> 1. All the help says about the LHS in formulas like '.~id' is that its
>> name is "dot notation", and not a single word more. Thus, I have no
>> clue what the dot in that formula really means.
>>
>> . is shorthand for 'everything not otherwise specified in the model
>> formula'. In this case, it represents the entire set of cycle
>> variables.
>>
>>
>> 2. help says:
>> Note that ‘paste()’ coerces ‘NA_character_’, the character missing
>> value, to ‘"NA"’
>> And at the same time:
>> ‘na.pass’ returns the object unchanged.
>> I am happy that I don't have NAs in mydata. I just don't understand
>> how it happened.
>> 3. Can't see the real difference between 'FUN = function(x) paste(x)'
>> and 'FUN = paste'. However, the former works perfectly while the
>> latter simply does not.
>>
>>
>> All I can follow from the code above is that R breaks the data into
>> groups with the same id, then it tears each little 'cycle' piece into
>> separate characters, then sorts them and puts these characters
>> together within the same id on each 'cycle'. I miss how R puts all
>> this mess back together into a nice data frame of long format. The NAs
>> are also a question, as I said before.
>>
>> By default, aggregate() will try to return a data frame. For each id,
>> it will output the id and the result of the function applied to each
>> cycle variable, so there should be one row for each id, and n + 1
>> columns for the n cycle variables + id.
>>
>> Does that help?
>>
>> Cheers,
>> Dennis
>>
>>
>> Could you please shed some light on it, if you don't mind answering
>> those naive questions.
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible
>> code.
>>