[R] inconsistent behavior of summary function

Tue Oct 4 23:15:33 CEST 2011

I'm going to put on my fire suit and wade in (see inline).

On 10/4/2011 8:11 AM, Bert Gunter wrote:
> On Tue, Oct 4, 2011 at 7:42 AM, Jeanne M. Spicer <xn8spicer at gmail.com> wrote:
>
>> I'm not sure how returning an incorrect result is ever a 'positive' feature
>
> It is **not** "incorrect"; perhaps unexpected, but that is not the same.
>

"You are technically correct -- the best kind of correct" -- Futurama

The results (using the built-in data set rock)

 > summary(rock["area"])
       area
  Min.   : 1016
  1st Qu.: 5305
  Median : 7487
  Mean   : 7188
  3rd Qu.: 8870
  Max.   :12212
 > summary(rock[["area"]])
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    1016    5305    7487    7188    8870   12210

differ for exactly the reason you say (dispatching to different methods 
of summary), and the different values of max are both correct given the 
documentation.  However, let's walk through what it takes to show that.

In the help page for summary, an option digits is described, which has 
the default value max(3, getOption("digits")-3).  Executing this (or 
getOption("digits") alone and doing the math) results in the default 
value of digits being 4 (at least for me; and I do not believe that I 
have changed the option).
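
For anyone who wants to check this in their own session, the default can be computed directly. This is only a sketch; the name default_digits is mine, and the result is 4 only if the digits option is still at its usual value of 7.

```r
# Current value of the global "digits" option (7 unless it has been changed)
d <- getOption("digits")

# Default for the digits argument as given on the help page for summary
default_digits <- max(3, d - 3)
print(c(option = d, summary_default = default_digits))
```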

So what is this option used for?  In the documentation, it says: 
"integer, used for number formatting with signif() (for summary.default) 
or format() (for summary.data.frame)."  Let's assume that we realize 
that rock["area"] is a data frame, which would be handled by 
summary.data.frame, and rock[["area"]] is a vector, and further 
determine that summary.default is what will handle it (having not found 
summary.vector or summary.integer).
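
A quick way to confirm which kind of object each form of indexing returns, and to look for more specific summary methods, might be the following. The variable names are illustrative, and what the last two lines report depends on the R version and on which packages are loaded.

```r
# rock is one of the built-in data sets (package datasets)
df_col  <- rock["area"]     # single bracket keeps the data frame
vec_col <- rock[["area"]]   # double bracket extracts the column itself

class(df_col)    # "data.frame"
class(vec_col)   # "integer"

# Look for more specific methods; optional = TRUE returns NULL
# instead of signalling an error when no such method is found
is.null(getS3method("summary", "vector", optional = TRUE))
is.null(getS3method("summary", "integer", optional = TRUE))
```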

Let's dive into the help page for signif and format, since they are 
listed as relevant to the use of digits in the two different cases.

signif tells us that digits is "integer indicating the number of ... 
significant digits (signif) to be used."  Looking at "Details", the last 
sentence says "Each element of the vector is rounded individually, 
unlike printing."  So in the case of a vector, each value is separately 
rounded to 4 significant digits (max of 12212 is rounded to 12210).
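
The element-by-element rounding can be reproduced outside of summary. This sketch rebuilds the six statistics by hand (the names vals and rounded are mine) and applies signif with 4 digits; it is not the internal code of summary.default.

```r
a <- rock[["area"]]
q <- quantile(a, names = FALSE)          # min, quartiles, max
vals <- c(Min = q[1], Q1 = q[2], Median = q[3],
          Mean = mean(a), Q3 = q[4], Max = q[5])

# Each element is rounded on its own to 4 significant digits
rounded <- signif(vals, 4)
print(rounded)
```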

format tells us that digits is "how many significant digits are to be 
used for numeric and complex x. ... This is a suggestion: enough decimal 
places will be used so that the smallest (in magnitude) number has this 
many significant digits, and also to satisfy nsmall."

So the difference is that if it is a vector, each part (min, quartiles, 
mean, and max) is rounded to 4 significant digits individually, while if 
it is a column of a data frame, the set is collectively rounded so that 
the smallest has 4 significant digits and the rest are carried out to 
the same decimal place.
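
To see the two treatments side by side, the same six numbers can be passed to signif and to format by hand. This is a sketch with illustrative variable names, not the code that summary itself runs.

```r
a <- rock[["area"]]
q <- quantile(a, names = FALSE)
vals <- c(q[1:3], mean(a), q[4:5])     # same order as summary()

as_signif <- signif(vals, 4)           # numeric, rounded element by element
as_format <- format(vals, digits = 4)  # character, one common layout

print(as_signif)
print(as_format, quote = FALSE)
```

Only the maximum differs between the two: signif turns 12212 into 12210, while format keeps the ones place for every element.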

Some points:

1) Both of these functions are in base, so I would expect the same 
behavior using the same (default) arguments.  Yes, the key word is 
"expect."  Hopefully I have demonstrated that I understand why they 
differ.  I would not anticipate rounding, and when only one value has 
only one digit rounded, it is not really obvious that it happened.  (As 
compared to say, summary(11111*rock$area), if I knew the data was not 
all rounded to the nearest 10,000).  So this is not just a matter of 
realizing that different methods are being dispatched, but reading 
through three different help pages (at least three, assuming I started 
at the right place and realized which other two were the relevant ones) 
to see that the end results are presented differently WHICH I WOULD NOT 
REALIZE THAT I EVEN NEED TO DO.

2) rock$area is an integer vector, so even if I realize that rounding 
would be done on floating point numbers, I would not expect (yes, again, 
"expect") that integers would need to be rounded to some lesser number 
of significant digits.

3) The documentation for summary is actually wrong about digits for the 
case of summary.data.frame.  Consider:
 > summary(rock["area"], digits=17)
       area
  Min.   : 1016.0000000000000
  1st Qu.: 5305.2500000000000
  Median : 7487.0000000000000
  Mean   : 7187.7291666700003
  3rd Qu.: 8869.5000000000000
  Max.   :12212.0000000000000

In particular, note the mean.  It is wrong (mathematically incorrect AND 
not consistent with the documentation).
 > dput(mean(rock["area"]))
structure(7187.72916666667, .Names = "area")

Why?  Internally, summary.data.frame calls summary.default on 
rock[["area"]] with a hard-coded digits value of 12.  It then takes the 
resulting value and formats it with 17 digits of precision, as 
requested.  That's why there are four zeros in the middle (the last 
digit being numerical imprecision due to the binary representation of 
floating point values).
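
The two steps can be imitated by hand. This sketch does not call summary.data.frame; it only assumes the sequence described above (round to 12 significant digits, then format with 17), and the names m and m12 are mine.

```r
m <- mean(rock[["area"]])

# Step 1: round to 12 significant digits, as the hard-coded value would
m12 <- signif(m, 12)

# Step 2: format the already rounded value with 17 digits
format(m12, digits = 17)

# For comparison: the unrounded mean formatted with 17 digits
format(m, digits = 17)
```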

4) summary.default does not necessarily honor the number of significant 
digits either:

 > for(i in 1:9) print(summary(rock[["area"]], digits=i))
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    1000    5000    7000    7000    9000   10000
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    1000    5300    7500    7200    8900   12000
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    1020    5310    7490    7190    8870   12200
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    1016    5305    7487    7188    8870   12210
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1016.0  5305.2  7487.0  7187.7  8869.5 12212.0
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
  1016.00  5305.25  7487.00  7187.73  8869.50 12212.00
      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
  1016.000  5305.250  7487.000  7187.729  8869.500 12212.000
      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
  1016.000  5305.250  7487.000  7187.729  8869.500 12212.000
      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
  1016.000  5305.250  7487.000  7187.729  8869.500 12212.000

Beyond 7, no additional significant digits are printed, despite the 
value of digits.  Printing the result of signif directly shows the same 
thing
 > signif(mean(rock[["area"]]), digits=9)
[1] 7187.729
(printing is itself limited to getOption("digits") significant digits, 
7 by default), but this is not consistent with the documentation (which 
says digits can be as large as 22).
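
A small sketch that separates rounding from printing, assuming the digits option is at its default of 7 (the names m and r9 are mine):

```r
m <- mean(rock[["area"]])
r9 <- signif(m, digits = 9)

print(r9)               # shown with getOption("digits") significant digits
print(r9, digits = 12)  # asks print for more digits than the default

# The stored value does differ from the 7-digit rounding
r9 == signif(m, digits = 7)
```

The value returned by signif does carry 9 significant digits; what is displayed depends on how many digits the print step is asked for.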

>> but at least the documentation could more clearly warn users that this
>> method behaves differently in these cases -- summary(rock[,1]) vs
>> summary(rock[,1:2]) -- and that the method can and *does* return incorrect
>> results without any warning messages.
>>
>
> What is (in)adequate in documentation is often in the mind of the beholder.
>
> Note:
>> class(rock[,1])
> [1] "integer"
>
>> class(rock[,1:2])
> [1] "data.frame"
>
> This means that different methods are dispatched, leading to the different
> results. Moreover,
>> summary(rock[,1,drop=FALSE])
>        area
>   Min.   : 1016
>   1st Qu.: 5305
>   Median : 7487
>   Mean   : 7188
>   3rd Qu.: 8870
>   Max.   :12212
>
> ... and that is because
>> class(rock[,1,drop=FALSE])
> [1] "data.frame"
>
> So the relevant Help file is ?"[.data.frame"

That certainly explains the reasoning for the different dispatches, but 
is only the start of understanding what is going on.  The data frame 
method does roughly what you would expect (since format tends to be 
less surprising from an output point of view).  Consider another example:

 > summary(11111*rock["area"])
       area
  Min.   : 11288776
  1st Qu.: 58946633
  Median : 83188057
  Mean   : 79862859
  3rd Qu.: 98549015
  Max.   :135687532
 > summary(11111*rock[["area"]])
      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
  11290000  58950000  83190000  79860000  98550000 135700000

Both of these have digits value of 4 (the default), but the data frame 
one "ignores" it (or, more accurately, format takes it as a 
recommendation but prints all values down to the 1's place despite only 
4 significant digits being requested, probably due to nsmall being 0). 
The default method dutifully rounds each value to the default of 4 
significant digits.
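
The same contrast shows up when signif and format are applied by hand to the scaled minimum and maximum. This is a sketch; big is just an illustrative name.

```r
big <- 11111 * c(1016, 12212)   # scaled min and max of rock$area

signif(big, 4)                  # each value cut to 4 significant digits
format(big, digits = 4)         # character output from format
```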

>> I would encourage anyone teaching introductory R to look at the 'epicalc'
>> package.  The re-vamped function 'summ' in that package returns correct
>> results regardless - summ(rock), summ(rock$area).  In addition, when you
>> only ask for one column you not only get the correct results, you also get a
>> bonus distribution plot.
>>
>> I would like all of our students to use R, but little things like this
>> are huge stumbling blocks for them.
>>
>
> I have no doubt that this is true. R is powerful, flexible and, as an
> inevitable result, complex. To master it, honest effort is required,
> probably a somewhat scarce commodity in introductory classes, especially for
> non-statisticians. For that reason, there are numerous learning resources
> available, to be found on CRAN. Have you looked at them? Moreover, there are
> several R GUI's that attempt to shield the beginner from the initial shock,
> to be found in the R-GUIs link under "Other Projects." Have you considered
> those?
>
> So I think something more than righteous indignation is called for here.
> Nevertheless, the bottom line is that you get what you pay for: R **IS**
> hard -- but for many serious data analysts of all stripes, worth the effort.

I saw it as more exasperation at inconsistencies rather than righteous 
indignation.  There is much power in R, and there are many subtle points 
(to which the existence of the R Inferno attests).  Certainly the more 
complicated the task undertaken, the more subtleties are to be 
expected.  But to have to track subtle rounding issues for a simple 
summary of a set of numbers (depending on how exactly the summary is 
requested) was where I thought the frustration was coming from.

> Cheers,
> Bert
>
>> -jeanne
>>

-- 
Brian S. Diggs, PhD
Senior Research Associate, Department of Surgery
Oregon Health & Science University