[R] meaning of formula in aggregate function
Den
d.kazakiewicz at gmail.com
Sun Jan 23 16:33:18 CET 2011
Dear Peter
Thank you
Lo and behold
Now I've got it
In code
aggregate(. ~ id, lapply(df, as.character),
          FUN = function(x) paste(sort(x), collapse = ''), na.action = na.pass)
there are no contradictions with NAs.
na.action = na.pass overrides aggregate's default, na.omit, so the NAs are passed through to FUN unchanged.
Afterwards those NAs are removed by sort(), whose default is na.last = NA.
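
The NA handling described above can be checked on a small data frame. This is only a minimal sketch: the data frame `df` and the helper name `collapse_sorted` are made up for illustration (a small subset of the example in this thread), and the cycle columns are already stored as character, so the lapply() step is skipped.

```r
# Toy data (assumption: a small subset of the example in this thread)
df <- data.frame(
  id     = c(1, 1, 1, 3, 3, 3, 3),
  cycle1 = c("c", "m", "f", "a", "c", "f", NA),
  cycle2 = c("c", "m", "f", NA, "c", "f", "m"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: sort the values, then glue them into one string
collapse_sorted <- function(x) paste(sort(x), collapse = "")

# na.pass hands the rows with NAs on to FUN; sort() then drops the NAs
res <- aggregate(. ~ id, df, FUN = collapse_sorted, na.action = na.pass)
print(res)

# With the default na.action (na.omit), every row that contains an NA
# is removed before FUN is called, so whole rows are lost for id 3
res_default <- aggregate(. ~ id, df, FUN = collapse_sorted)
print(res_default)
```

Comparing the two printed results for id 3 shows why na.pass is needed here: it is sort(), not aggregate(), that should discard the missing values.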
It is a lot easier for me to deal with data when I know what I am doing.
Thank you again for your help. Sorry for the annoying, naive questions.
With best regards
Denis Kazakiewicz
Belarus
On Sun, 23/01/2011 at 05:38 -0800, P Ehlers wrote:
> Den wrote:
> > Dear Dennis
> > Thank you very much for your comprehensive reply and for the time you've
> > spent dealing with my e-mail.
> > Your kind explanation made things clearer for me.
> > After your explanation it looks simple.
> > lapply with the chosen options takes the small part of cycle<n> with the same id
> > (e.g. df[df$id==3,"cycle2"]) and turns it into just a bunch of
> > characters.
> > The only thing I still don't get is how this code gets rid of the
> > NAs, but this is a rather minor technical issue. My main question was
> > about the formula. You helped me indeed.
>
> Okay, now I see what you're asking regarding the NAs.
> I should have realized it before. Anyway, the answer
> is in the function sort(). Have a look at its help
> page and note what sort does when 'na.last=NA', the
> default. You'll see where the NAs went.
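
A short sketch of the sort() behaviour referred to above, using a made-up vector `x`; it also shows what paste() does when an NA is left in place.

```r
x <- c("f", NA, "a", "c")

# Default na.last = NA: the NA is removed from the result
sort(x)

# na.last = TRUE keeps the NA and puts it at the end
sort(x, na.last = TRUE)

# paste() turns a surviving NA into the string "NA"
paste(sort(x), collapse = "")                   # "acf"
paste(sort(x, na.last = TRUE), collapse = "")   # "acfNA"
```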
>
> Peter Ehlers
>
> > Thank you again
> > Have a nice day
> > Denis
> > From bending but not broken Belarus
> > On Sat, 22/01/2011 at 17:55 -0800, Dennis Murphy wrote:
> >> Hi:
> >>
> >> I wouldn't pretend to speak for Henrique, but I'll give it a shot.
> >>
> >> On Sat, Jan 22, 2011 at 4:44 AM, Den <d.kazakiewicz at gmail.com> wrote:
> >> Dear R community
> >> Recently, dear Henrique Dallazuanna literally saved me by solving
> >> a data transformation problem, which follows:
> >>
> >> (n_, _n, j_, k_ signify numbers)
> >>
> >> SOURCE DATA:
> >> id  cycle1  cycle2  cycle3  …  cycle_n
> >> 1   c       c       c          c
> >> 1   m       m       m          m
> >> 1   f       f       f          f
> >> 2   m       m       m          NA
> >> 2   f       f       f          NA
> >> 2   c       c       c          NA
> >> 3   a       a       NA         NA
> >> 3   c       c       c          NA
> >> 3   f       f       f          NA
> >> 3   NA      NA      m          NA
> >> ...........................................
> >>
> >>
> >> Q: How to transform source data to:
> >> RESULT DATA:
> >> id  cyc1  cyc2  cyc3  …  cyc_n
> >> 1   cfm   cfm   cfm      cfm
> >> 2   cfm   cfm   cfm
> >> 3   acf   acf   cfm
> >> ...........................................
> >>
> >>
> >>
> >> Henrique's solution is:
> >>
> >> aggregate(.~ id, lapply(df, as.character), FUN =
> >> function(x)paste(sort(x), collapse = ''), na.action = na.pass)
> >>
> >> The first part, . ~ id, is the formula. It's using every available
> >> variable in the input data on the left hand side of the formula except
> >> for id, which is the grouping variable.
> >>
> >> The data object is lapply(df, as.character), which is a list object
> >> that translates every element to character. I'm guessing that each
> >> element of the list is a character string or list of character
> >> strings, but I'm not sure. It looks like the individual characters of
> >> each cycle comprise a list component within id. (??) [My guess: the
> >> result of lapply() is a list of lists. The top-level list components
> >> correspond to the id's, while the second-level components are the
> >> cycle variables, whose elements are the characters in each cycle
> >> variable for each row with the same id.]
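
The structure of the data argument can be inspected directly. In the sketch below (the data frame `df` is a made-up example, with factor columns as an assumption about how the data were read in), lapply() iterates over the columns of the data frame, so the result is a flat list with one character vector per column, id included, rather than a nested list. The grouping by id is done afterwards by aggregate().

```r
# Toy data frame (assumption: cycle columns stored as factors)
df <- data.frame(
  id     = c(1, 1, 2),
  cycle1 = factor(c("c", "m", "f")),
  cycle2 = factor(c("c", "m", NA))
)

# One list component per column, each converted to character
chars <- lapply(df, as.character)
str(chars)

# Every component has one element per row of df
lengths(chars)
```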
> >>
> >> The function to be applied to each id is described in FUN. As Peter
> >> mentioned, it's an 'anonymous' function, which means it is defined
> >> in-line. In this case, a generic input object x has its elements
> >> sorted in increasing order and then combined into a
> >> single string (the purpose of collapse = ); NA values are skipped
> >> over. Thus, if my hypothesis about the structure of the list is
> >> correct, the three characters in each cycle/id combination are first
> >> sorted and then combined into a single string, which is then output as
> >> the result. Because of the way Henrique wrote the formula, the aggregate()
> >> function will march through each cycle variable within id and execute
> >> the function, and then iterate the process over all id's.
> >>
> >>
> >>
> >> Could somebody EXPLAIN HOW IT WORKS?
> >> I mean, Henrique saved my investigation indeed.
> >> However, considering that I am about to perform an investigation
> >> of cancer chemotherapy in 500 patients, it would be nice to know what
> >> I am actually doing.
> >>
> >> Henrique's R knowledge is on a different level from most of us, so I
> >> understand your question :)
> >>
> >>
> >> 1. All the help says about the LHS in formulas like '.~id' is that its
> >> name is "dot notation". And not a single word more. Thus, I have no
> >> clue what the dot in that formula really means.
> >>
> >> . is shorthand for 'everything not otherwise specified in the model
> >> formula'. In this case, it represents the entire set of cycle
> >> variables.
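
One way to see what the dot stands for is to write the left-hand side out by hand. This is a minimal sketch with a made-up data frame `df` and a hypothetical helper `collapse_sorted`; with two cycle columns, `. ~ id` should give the same result as `cbind(cycle1, cycle2) ~ id`.

```r
df <- data.frame(
  id     = c(1, 1, 2, 2),
  cycle1 = c("m", "c", "f", "a"),
  cycle2 = c("f", "c", "m", "c"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: sort the values, then glue them into one string
collapse_sorted <- function(x) paste(sort(x), collapse = "")

# Dot on the left-hand side: every column of df except id
with_dot <- aggregate(. ~ id, df, FUN = collapse_sorted)

# The same columns named explicitly
with_names <- aggregate(cbind(cycle1, cycle2) ~ id, df, FUN = collapse_sorted)

print(with_dot)
all.equal(with_dot, with_names)
```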
> >>
> >>
> >> 2. help says:
> >> Note that ‘paste()’ coerces ‘NA_character_’, the character missing
> >> value, to ‘"NA"’
> >> And at the same time:
> >> ‘na.pass’ returns the object unchanged.
> >> I am happy that I don't have NAs in mydata. I just don't understand
> >> how it happened.
> >> 3. Can't see the real difference between 'FUN = function(x) paste(x)'
> >> and 'FUN = paste'. However, the former works perfectly while the latter
> >> simply does not.
> >>
> >>
> >> All I can follow from the code above is that R breaks the data into groups
> >> with the same id, then tears each little 'cycle' piece into separate
> >> characters, then sorts them and puts these characters together within the
> >> same id for each 'cycle'. I don't see how R puts all this mess back together
> >> into a nice data frame of long format. The NAs are also a question, as I said
> >> before.
> >>
> >> By default, aggregate() will try to return a data frame. For each id,
> >> it will output the id and the result of the function applied to each
> >> cycle variable, so there should be one row for each id, and n + 1
> >> columns for the n cycle variables + id.
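
The shape of the result can be confirmed on a toy example. The data frame `df` below is made up for illustration, with n = 3 cycle variables and two ids; it also shows that a group whose values are all NA ends up as an empty string, which matches the blank cells in the RESULT DATA table.

```r
df <- data.frame(
  id     = c(1, 1, 2, 2),
  cycle1 = c("m", "c", "f", "c"),
  cycle2 = c("m", "c", "f", NA),
  cycle3 = c("m", "c", NA, NA),
  stringsAsFactors = FALSE
)

res <- aggregate(. ~ id, df,
                 FUN = function(x) paste(sort(x), collapse = ""),
                 na.action = na.pass)

is.data.frame(res)  # aggregate() returns a data frame
dim(res)            # one row per id, n + 1 columns
names(res)
print(res)
```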
> >>
> >> Does that help?
> >>
> >> Cheers,
> >> Dennis
> >>
> >>
> >> Could you please shed some light on it, if you don't mind answering those
> >> naive questions?
> >>
> >> ______________________________________________
> >> R-help at r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide
> >> http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible
> >> code.
> >>
> >
--
Den <d.kazakiewicz at gmail.com>