[R] by function ??
Matthew Dowle
mdowle at mdowle.plus.com
Tue Jan 5 11:53:26 CET 2010
I wrote :
> (some may return vectors, others may return vectors)
It's been pointed out that there was a typo, and that it wasn't very clear
anyway. It should read '(some may return vectors, others may return scalars)'.
I've been asked for further explanation, so here goes ...
The point I was trying to make is that the following expression is very
natural to write, although it takes a bit of getting used to. Here is a
reminder of the 2 column Dataset (containing a group of 4 rows and a group
of 3 rows), then the R expression, and then the output :
 LEAID     ratio
  6307 0.7200000
  6307 0.7623810
  6307 0.8600000
  6307 0.9200000
  8300 0.5678462
  8300 0.7700000
  8300 0.8300000
the syntax :
Dataset = data.table(Dataset)
Dataset[,DT(ratio,scaled=abs(ratio-median(ratio)),sum=sum(ratio)),by="LEAID"]
and the 4 column output :
 LEAID     ratio    scaled      sum
  6307 0.7200000 0.0911905 3.262381
  6307 0.7623810 0.0488095 3.262381
  6307 0.8600000 0.0488095 3.262381
  6307 0.9200000 0.1088095 3.262381
  8300 0.5678462 0.2021538 2.167846
  8300 0.7700000 0.0000000 2.167846
  8300 0.8300000 0.0600000 2.167846
The 2nd argument (the call to DT()) contains 3 expressions, which are
executed for each subset of the Dataset grouped by LEAID. The row order is
maintained for each subset, and these expressions operate on ordered vectors
as usual in R. We can use column names as variable names directly (like an
implicit ?with). Note that Dataset doesn't have to be ordered by LEAID, but
it just happens to be in this example.
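For comparison, the per-group evaluation described above can be imitated in base R. This is only a rough sketch of the idea, not how data.table does it internally: split() and with() stand in for the grouping and for the implicit use of column names, and the shuffled row order is made up to show that the input need not be sorted by LEAID.

```r
# Hypothetical copy of the example data, deliberately shuffled so that
# the LEAID groups are not contiguous.
Dataset <- data.frame(
  LEAID = c(8300, 6307, 6307, 8300, 6307, 8300, 6307),
  ratio = c(0.5678462, 0.72, 0.762381, 0.77, 0.86, 0.83, 0.92)
)

# split() forms one subset per LEAID and keeps the rows of each subset
# in their original relative order.
groups <- split(Dataset, Dataset$LEAID)

# with() lets the expression refer to the column 'ratio' by name,
# much as the expressions inside Dataset[...] do.
scaled <- lapply(groups, function(g) with(g, abs(ratio - median(ratio))))
print(scaled)
```

Each element of scaled is an ordinary vector computed on one group, with the elements in the same order as that group's rows.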
A comment on each of the 3 expressions (the 3 arguments passed to DT()
above) is perhaps useful :
ratio : just repeats the ratio vector as is. You don't have to include
this but I wanted to keep the input data in the output to demonstrate.
abs(ratio-median(ratio)) : median() returns a scalar, which is subtracted from
each element of ratio, giving a vector. abs() takes a vector and returns a
vector. Standard, basic R. Any R expression can be used, so it's more
powerful than SQL in that sense, because SQL is restricted to a small set of
aggregate functions (avg, min, max, etc). That has been said before and has
been true of R for a long time. It's the overall syntax of the single
'query' that I'm trying to demonstrate.
sum(ratio) : returns a scalar aggregate of the vector input. That's what I
meant by "others may return scalars". Notice that the value of sum(ratio) is
repeated in the final column of the output. The reason is that at least
one of the other expressions returns a vector, so standard R silent
repetition rules come into play inside DT().
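The repetition of the scalar can be seen in isolation with base R's data.frame(), which also recycles length 1 arguments. This is a sketch using data.frame() as a stand-in; that DT() behaves the same way in every corner case is an assumption.

```r
# The 4 ratio values of the first group (LEAID 6307).
ratio <- c(0.72, 0.762381, 0.86, 0.92)

# Two vector-valued columns and one scalar: the scalar sum(ratio) is
# silently repeated to match the 4 rows.
one_group <- data.frame(
  ratio  = ratio,
  scaled = abs(ratio - median(ratio)),
  sum    = sum(ratio)
)
print(one_group)

# If every expression returns a scalar, the group collapses to one row.
summary_only <- data.frame(median = median(ratio), sum = sum(ratio))
print(summary_only)
```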
Then the 2 data.tables (one for each of the 2 groups) are combined and a
single data.table is returned. This is very similar to SQL and to some other
ways of aggregating in R, but more compact, more natural, easier and more
convenient (and therefore quicker) to write, debug and maintain.
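To make the comparison with other ways of aggregating in R concrete, here is one possible base R version of the whole query: split, evaluate per group, then combine. It is a sketch for comparison only; the names per_group and result are made up for illustration.

```r
Dataset <- data.frame(
  LEAID = c(6307, 6307, 6307, 6307, 8300, 8300, 8300),
  ratio = c(0.72, 0.762381, 0.86, 0.92, 0.5678462, 0.77, 0.83)
)

# One data.frame per group; sum(ratio) is a scalar and is recycled.
per_group <- lapply(split(Dataset, Dataset$LEAID), function(g) {
  data.frame(
    LEAID  = g$LEAID,
    ratio  = g$ratio,
    scaled = abs(g$ratio - median(g$ratio)),
    sum    = sum(g$ratio)
  )
})

# Stack the per-group results into a single table.
result <- do.call(rbind, per_group)
rownames(result) <- NULL
print(result)
```

It needs split(), lapply(), function() and do.call(rbind, ...) to say what the single Dataset[...] expression above says in one line.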
"Matthew Dowle" <mdowle at mdowle.plus.com> wrote in message
news:hgnjev$3hk$1 at ger.gmane.org...
> or if Dataset is a data.table :
>
>> Dataset = data.table(Dataset)
>> Dataset[,abs(ratio-median(ratio)),by="LEAID"]
> LEAID V1
> [1,] 6307 0.0911905
> [2,] 6307 0.0488095
> [3,] 6307 0.0488095
> [4,] 6307 0.1088095
> [5,] 8300 0.2021538
> [6,] 8300 0.0000000
> [7,] 8300 0.0600000
> rather than :
>> Dataset$abs <- with(Dataset, ave(ratio, LEAID,
>> FUN=function(x)abs(x-median(x))))
>
> This is less code and more natural (to me anyway) e.g. it doesn't require
> use of function() or ave(). data.table knows that if the j expression
> returns a vector it should silently repeat the groups to match the length
> of the j result (which it is doing here). If the j expression returns a
> scalar you would just get 2 rows in this example. Note that the 'by'
> expression must evaluate to integer, or a list of integer vectors, so
> in this case LEAID must either be integer already or coerced to integer
> using by="as.integer(LEAID)".
>
> To give the aggregate expression a name, just wrap with the DT function.
> This is also how to return multiple aggregate functions from each subset
> (some may return vectors, others may return vectors) by listing them
> inside DT() :
>
>> Dataset[,DT(ratio,scaled=abs(ratio-median(ratio)),sum=sum(ratio)),by="LEAID"]
> LEAID ratio scaled sum
> [1,] 6307 0.7200000 0.0911905 3.262381
> [2,] 6307 0.7623810 0.0488095 3.262381
> [3,] 6307 0.8600000 0.0488095 3.262381
> [4,] 6307 0.9200000 0.1088095 3.262381
> [5,] 8300 0.5678462 0.2021538 2.167846
> [6,] 8300 0.7700000 0.0000000 2.167846
> [7,] 8300 0.8300000 0.0600000 2.167846
>
>
> "William Dunlap" <wdunlap at tibco.com> wrote in message
> news:77EB52C6DD32BA4D87471DCD70C8D7000243CBA1 at NA-PA-VBE03.na.tibco.com...
>> -----Original Message-----
>> From: r-help-bounces at r-project.org
>> [mailto:r-help-bounces at r-project.org] On Behalf Of L.A.
>> Sent: Saturday, December 12, 2009 12:39 PM
>> To: r-help at r-project.org
>> Subject: Re: [R] by function ??
>>
>>
>>
>> Thanks for all the help. They all worked, but I'm stuck again.
>> I've tried searching, but I'm not sure how to word my search, as
>> nothing came up.
>> Here is my new hurdle: my data has 7 observations and my
>> results have 2 answers:
>>
>>
>> Here is my data
>>
>> LEAID ratio
>> 3 6307 0.7200000
>> 1 6307 0.7623810
>> 2 6307 0.8600000
>> 4 6307 0.9200000
>> 5 8300 0.5678462
>> 7 8300 0.7700000
>> 6 8300 0.8300000
>>
>>
>> > median<-summaryBy(ratio ~ LEAID, data = Dataset, FUN = median)
>>
>> > print(median)
>> LEAID ratio.median
>> 1 6307 0.8111905
>> 2 8300 0.7700000
>>
>> Now what I want is a way to compute
>> abs(ratio- median)by LEAID for each observation to produce
>> something like
>> this
>>
>> LEAID ratio abs
>> 3 6307 0.7200000 .0912
>> 1 6307 0.7623810 .0488
>> 2 6307 0.8600000 .0488
>> 4 6307 0.9200000 .1088
>> 5 8300 0.5678462 .2022
>> 7 8300 0.7700000 .0000
>> 6 8300 0.8300000 .0600
>
> Try ave(), as in
> > Dataset$abs <- with(Dataset, ave(ratio, LEAID,
> FUN=function(x)abs(x-median(x))))
> > Dataset
> LEAID ratio abs
> 3 6307 0.7200000 0.0911905
> 1 6307 0.7623810 0.0488095
> 2 6307 0.8600000 0.0488095
> 4 6307 0.9200000 0.1088095
> 5 8300 0.5678462 0.2021538
> 7 8300 0.7700000 0.0000000
> 6 8300 0.8300000 0.0600000
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>
>>
>> Thanks,
>> L.A.
>>
>>
>>
>>
>> Ista Zahn wrote:
>> >
>> > Hi,
>> > I think you want
>> >
>> > by(TestData[ , "RATIO"], LEAID, median)
>> >
>> > -Ista
>> >
>> > On Tue, Dec 8, 2009 at 8:36 PM, L.A. <romsa at millect.com> wrote:
>> >>
>> >> I'm just learning and this is probably very simple, but I'm stuck.
>> >> I'm trying to understand the by().
>> >> This works.
>> >> by(TestData, LEAID, summary)
>> >>
>> >> But, This doesn't.
>> >>
>> >> by(TestData, LEAID, median(RATIO))
>> >>
>> >>
>> >> ERROR: could not find function "FUN"
>> >>
>> >> HELP!
>> >> Thanks,
>> >> LA
>> >> --
>> >> View this message in context:
>> >> http://n4.nabble.com/by-function-tp955789p955789.html
>> >> Sent from the R help mailing list archive at Nabble.com.
>> >>
>> >> ______________________________________________
>> >> R-help at r-project.org mailing list
>> >> https://stat.ethz.ch/mailman/listinfo/r-help
>> >> PLEASE do read the posting guide
>> >> http://www.R-project.org/posting-guide.html
>> >> and provide commented, minimal, self-contained, reproducible code.
>> >>
>> >
>> >
>> >
>> > --
>> > Ista Zahn
>> > Graduate student
>> > University of Rochester
>> > Department of Clinical and Social Psychology
>> > http://yourpsyche.org
>> >
>> >
>> >
>>
>> --
>> View this message in context:
>> http://n4.nabble.com/by-function-tp955789p962666.html
>> Sent from the R help mailing list archive at Nabble.com.
>>
>>
>