[R] behavior of "by"
Jeff Laake
Jeff.Laake at noaa.gov
Wed Oct 29 02:04:57 CET 2008
Any insight into the behavior of "by" in the following case would be
appreciated. There is a note in the Details section of the help for "by"
about behavior documented since v2.7, but I don't entirely understand what
it is saying. I'm using R 2.7.2 on Windows. I'd like to know whether the
following behavior is a change or whether it has always worked this way. I
searched with RSiteSearch and read through the version changes but found
nothing.
Take a data frame as follows:
> samples
   Region.Label  Area Sample.Label Effort Label
1             1 10000            1    100    11
2             1 10000            2    100    12
3             1 10000            3    100    13
4             1 10000            4    100    14
5             1 10000            5    100    15
6             1 10000            6    100    16
7             1 10000            7    100    17
8             1 10000            8    100    18
9             1 10000            9    100    19
10            1 10000           10    100   110
Use "by" to tally the number of entries with particular values of
Region.Label (in this case there is only 1 value of Region.Label):
> by(samples$Effort,samples$Region.Label,length)
INDICES: 1
[1] 1
I expected to get 10 instead of 1. I stepped through by.data.frame in the
debugger and can see that it subsets with drop=FALSE, so length is applied
to a one-column data frame and returns the number of columns, which is 1.
But if I do any of the following, I get the 10 I expect.
> by(rep(1,10),samples$Region.Label,length)
samples$Region.Label: 1
[1] 10
> by(samples$Label,samples$Region.Label,length)
samples$Region.Label: 1
[1] 10
Also, if I use "tapply" with samples$Effort instead of "by", I get the 10
I expect.
> tapply(samples$Effort,samples$Region.Label,length)
 1
10
I do not understand why I'm getting these differences, but I can see that
I'm going to use tapply from now on.
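
For anyone who wants to try this, here is a minimal sketch of one possible
explanation. It rests on an assumption I have not verified against my data:
that samples$Effort carries a dim attribute (for example, it is a one-column
matrix), while samples$Label is a plain vector. The help for "by" says an
object with dimensions is coerced to a data frame and handed to the data
frame method, so FUN would see a one-column data frame and length would
count columns. The objects effort_vec, effort_mat and region below are
made-up stand-ins, not my actual data.

```r
# Hypothetical stand-ins for the columns in 'samples' (not the real data).
region     <- rep(1, 10)                        # one region, ten samples
effort_vec <- rep(100, 10)                      # plain numeric vector
effort_mat <- matrix(100, nrow = 10, ncol = 1)  # same values, but with dims

# A dim attribute is what distinguishes the two inputs.
is.null(dim(effort_vec))   # TRUE: no dimensions
dim(effort_mat)            # 10 1

# by() on the plain vector: FUN receives the subsetted values.
n_vec <- unclass(by(effort_vec, region, length))[[1]]

# by() on the one-column matrix: FUN receives a one-column data frame,
# and length() of a data frame is its number of columns.
n_mat <- unclass(by(effort_mat, region, length))[[1]]

# tapply() treats either input as a vector of ten values.
n_tap <- tapply(effort_mat, region, length)[[1]]

print(c(vector = n_vec, matrix = n_mat, tapply = n_tap))
```

If that assumption holds, checking dim(samples$Effort) or str(samples)
should show the difference, and counting rows rather than elements, for
example by(samples, samples$Region.Label, nrow), would avoid the ambiguity.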