[R] inconsistent behavior of summary function
Brian Diggs
diggsb at ohsu.edu
Tue Oct 4 23:15:33 CEST 2011
I'm going to put on my fire suit and wade in (see inline).
On 10/4/2011 8:11 AM, Bert Gunter wrote:
> On Tue, Oct 4, 2011 at 7:42 AM, Jeanne M. Spicer<xn8spicer at gmail.com>wrote:
>
>> I'm not sure how returning an incorrect result is ever a 'positive' feature
>
> It is **not** "incorrect"; perhaps unexpected, but that is not the same.
>
"You are technically correct -- the best kind of correct" -- Futurama
The results (using the built-in data set rock)
> summary(rock["area"])
      area
 Min.   : 1016
 1st Qu.: 5305
 Median : 7487
 Mean   : 7188
 3rd Qu.: 8870
 Max.   :12212
> summary(rock[["area"]])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1016    5305    7487    7188    8870   12210
differ for exactly the reason you say (dispatching to different methods
of summary), and the different values of max are both correct given the
documentation. However, let's walk through what it takes to show that.
In the help page for summary, an option digits is described, which has
the default value max(3, getOption("digits")-3). Executing this (or
getOption("digits") alone and doing the math) results in the default
value of digits being 4 (at least for me; and I do not believe that I
have changed the option).
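For anyone who wants to check that arithmetic, here is a minimal sketch. The helper name summary_default_digits is mine (hypothetical, not part of R), and it assumes the digits option is still at its factory value of 7.

```r
# Hypothetical helper (not part of R): reproduces the default expression
# quoted from ?summary, max(3, getOption("digits") - 3).
summary_default_digits <- function(d = getOption("digits")) {
  max(3, d - 3)
}

summary_default_digits(7)  # with the factory setting of 7 this gives 4
summary_default_digits()   # whatever the current session would use
```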
So what is this option used for? In the documentation, it says:
"integer, used for number formatting with signif() (for summary.default)
or format() (for summary.data.frame)." Let's assume that we realize
that rock["area"] is a data frame, which would be handled by
summary.data.frame, and rock[["area"]] is a vector, and further
determine that summary.default is what will handle it (having not found
summary.vector or summary.integer).
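A small sketch of how one might confirm which objects are involved and that no integer-specific method exists. It assumes a plain R session in which no attached package registers a summary method for integers.

```r
# rock is a built-in data set; "area" is one of its columns.
df_col  <- rock["area"]     # single-column data frame
vec_col <- rock[["area"]]   # bare integer vector

class(df_col)    # expected to be "data.frame"
class(vec_col)   # expected to be "integer"

# With optional = TRUE, getS3method() returns NULL when no method is found.
# Assumption: nothing in the session defines summary.integer.
no_integer_method <- is.null(getS3method("summary", "integer", optional = TRUE))
no_integer_method
```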
Let's dive into the help pages for signif and format, since they are
listed as relevant to the use of digits in the two different cases.
signif tells us that digits is "integer indicating the number of ...
significant digits (signif) to be used." Looking at "Details", the last
sentence says "Each element of the vector is rounded individually,
unlike printing." So in the case of a vector, each value is separately
rounded to 4 significant digits (max of 12212 is rounded to 12210).
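To see the element-by-element rounding in isolation, here is a sketch that applies signif directly to the six summary values. The numbers are typed in from the output in this message, and the vector names are mine.

```r
# The six summary statistics for rock$area, copied from the output in
# this message (the mean is truncated to the precision shown there).
stats <- c(Min = 1016, Q1 = 5305.25, Median = 7487,
           Mean = 7187.72916667, Q3 = 8869.5, Max = 12212)

# signif() rounds each element on its own to 4 significant digits,
# so only the five-digit maximum loses its last digit.
rounded <- signif(stats, digits = 4)
rounded
```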
format tells us that digits is "how many significant digits are to be
used for numeric and complex x. ... This is a suggestion: enough decimal
places will be used so that the smallest (in magnitude) number has this
many significant digits, and also to satisfy nsmall."
So the difference is that if it is a vector, each part (min, quartiles,
mean, and max) is rounded to 4 significant digits individually, while if
it is a column of a data frame, the set is collectively rounded so that
the smallest has 4 significant digits and the rest are carried out to
the same decimal place.
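The collective behaviour of format can be sketched the same way. This calls format directly on the six values rather than going through summary.data.frame, so it only illustrates the rule quoted from the help page; it is not a trace of what summary does internally.

```r
# Same six values as reported in this message.
stats <- c(1016, 5305.25, 7487, 7187.72916667, 8869.5, 12212)

# format() picks one number of decimal places for the whole set: enough
# for the smallest value (1016) to show 4 significant digits, i.e. none.
formatted <- format(stats, digits = 4)
formatted

# Compare the largest value under the two approaches.
as.numeric(formatted[6])   # all five digits are kept
signif(stats[6], 4)        # individually rounded to 4 significant digits
```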
Some points:
1) Both of these functions are in base, so I would expect the same
behavior using the same (default) arguments. Yes, the key word is
"expect." Hopefully I have demonstrated that I understand why they
differ. I would not anticipate rounding, and when only one value has
only one digit rounded, it is not really obvious that it happened. (As
compared to say, summary(11111*rock$area), if I knew the data was not
all rounded to the nearest 10,000). So this is not just a matter of
realizing that different methods are being dispatched, but reading
through three different help pages (at least three, assuming I started
at the right place and realized which other two were the relevant ones)
to see that the end results are presented differently WHICH I WOULD NOT
REALIZE THAT I EVEN NEED TO DO.
2) rock$area is an integer vector, so even if I realize that rounding
would be done on floating point numbers, I would not expect (yes, again,
"expect") that integers would need to be rounded to some lesser number
of significant digits.
3) The documentation for summary is actually wrong about digits for the
case of summary.data.frame. Consider:
> summary(rock["area"], digits=17)
             area
 Min.   : 1016.0000000000000
 1st Qu.: 5305.2500000000000
 Median : 7487.0000000000000
 Mean   : 7187.7291666700003
 3rd Qu.: 8869.5000000000000
 Max.   :12212.0000000000000
In particular, note the mean. It is wrong (mathematically incorrect AND
not consistent with the documentation).
> dput(mean(rock["area"]))
structure(7187.72916666667, .Names = "area")
Why? Internally, summary.data.frame calls summary.default on
rock[["area"]] with a hard-coded digits value of 12. It then takes the
result and formats it with 17 digits of precision, as requested. That's
why there are four zeros in the middle (the last digit being
numerical imprecision due to the binary representation of floating point
values).
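The two-step effect can be imitated by hand. This is only a sketch of the mechanism described above (round to 12 significant digits, then format to 17); it does not call summary.data.frame, and I have not checked whether other versions of R take the same internal route.

```r
m <- mean(rock[["area"]])

# Step 1: what a hard-coded digits = 12 would leave behind.
m12 <- signif(m, digits = 12)

# Step 2: formatting the already-rounded value with 17 digits cannot bring
# back what step 1 discarded; it only exposes the rounded value.
format(m12, digits = 17)

# For comparison, the unrounded mean formatted the same way.
format(m, digits = 17)
```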
4) summary.default does not necessarily honor the number of significant
digits either:
> for(i in 1:9) print(summary(rock[["area"]], digits=i))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1000    5000    7000    7000    9000   10000
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1000    5300    7500    7200    8900   12000
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1020    5310    7490    7190    8870   12200
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1016    5305    7487    7188    8870   12210
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 1016.0  5305.2  7487.0  7187.7  8869.5 12212.0
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
 1016.00  5305.25  7487.00  7187.73  8869.50 12212.00
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
 1016.000  5305.250  7487.000  7187.729  8869.500 12212.000
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
 1016.000  5305.250  7487.000  7187.729  8869.500 12212.000
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
 1016.000  5305.250  7487.000  7187.729  8869.500 12212.000
Beyond 7, no additional significant digits are printed, despite the
value of digits. This is the behavior of signif
> signif(mean(rock[["area"]]), digits=9)
[1] 7187.729
but is not consistent with documentation (which says digits can be as
large as 22).
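It may be worth separating what signif returns from what gets printed. In the sketch below, which is a suggestion and not something I have traced through the summary code, the value returned by signif appears to keep nine significant digits, and the seven-digit cap seems to come from the print step, which follows getOption("digits") unless told otherwise.

```r
m  <- mean(rock[["area"]])
s9 <- signif(m, digits = 9)

print(s9)               # displayed using getOption("digits"), normally 7
print(s9, digits = 9)   # same stored value, more digits displayed

# The stored values differ, so the extra digits were not thrown away.
s9 == signif(m, digits = 7)
```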
>> but at least the documentation could more clearly warn users that this
>> method behaves differently in these cases -- summary(rock[,1]) vs
>> summary(rock[,1:2]) -- and that the method can and *does* return incorrect
>> results without any warning messages.
>>
>
> What is (in)adequate in documentation is often in the mind of the beholder.
>
> Note:
>> class(rock[,1])
> [1] "integer"
>
>> class(rock[,1:2])
> [1] "data.frame"
>
> This means that different methods are dispatched, leading to the different
> results. Moreover,
>> summary(rock[,1,drop=FALSE])
>       area
>  Min.   : 1016
>  1st Qu.: 5305
>  Median : 7487
>  Mean   : 7188
>  3rd Qu.: 8870
>  Max.   :12212
>
> ... and that is because
>> class(rock[,1,drop=FALSE])
> [1] "data.frame"
>
> So the relevant Help file is ?"[.data.frame"
That certainly explains the reasoning for the different dispatches, but
is only the start of understanding what is going on. The data frame
method does rather what you would expect (since format tends to be less
surprising from an output point of view). Consider another example:
> summary(11111*rock["area"])
       area
 Min.   : 11288776
 1st Qu.: 58946633
 Median : 83188057
 Mean   : 79862859
 3rd Qu.: 98549015
 Max.   :135687532
> summary(11111*rock[["area"]])
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
 11290000  58950000  83190000  79860000  98550000 135700000
Both of these have digits value of 4 (the default), but the data frame
one "ignores" it (or, more accurately, format takes it as a
recommendation but prints all values down to the 1's place despite only
4 significant digits being requested, probably due to nsmall being 0).
The default method dutifully rounds each value to the requested default
4 significant digits.
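The same contrast can be reproduced on the maximum alone, calling signif and format directly rather than going through summary. This is a sketch of the two rounding rules, not of the summary methods themselves.

```r
big <- 11111 * rock[["area"]]
big_max <- max(big)

big_max                      # the exact maximum
signif(big_max, digits = 4)  # rounded to 4 significant digits
format(big_max, digits = 4)  # digits is only a suggestion to format()
```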
>> I would encourage anyone teaching introductory R to look at the 'epicalc'
>> package. The re-vamped function 'summ' in that package returns correct
>> results regardless - summ(rock), summ(rock$area). In addition, when you
>> only ask for one column you not only get the correct results, you also get a
>> bonus distribution plot.
>>
>> I'd like all of our students to use R, but little things like this
>> are huge stumbling blocks for them.
>>
>
> I have no doubt that this is true. R is powerful, flexible and, as an
> inevitable result, complex. To master it, honest effort is required,
> probably a somewhat scarce commodity in introductory classes, especially for
> non-statisticians. For that reason, there are numerous learning resources
> available, to be found on CRAN. Have you looked at them? Moreover, there are
> several R GUI's that attempt to shield the beginner from the initial shock,
> to be found in the R-GUIs link under "Other Projects." Have you considered
> those?
>
> So I think something more than righteous indignation is called for here.
> Nevertheless, the bottom line is that you get what you pay for: R **IS**
> hard -- but for many serious data analysts of all stripes, worth the effort.
I saw it more as exasperation at inconsistencies than righteous
indignation. There is much power in R, and there are many subtle points
(to which the existence of the R Inferno attests). Certainly, the more
complicated the task undertaken, the more subtleties are to be
expected. But having to track subtle rounding issues for a simple
summary of a set of numbers (depending on how exactly the summary is
requested) was where I thought the frustration was coming from.
> Cheers,
> Bert
>
>> -jeanne
>>
--
Brian S. Diggs, PhD
Senior Research Associate, Department of Surgery
Oregon Health & Science University