[R] strange behaviour of median

Peter Ehlers ehlers at ucalgary.ca
Thu Feb 4 12:52:45 CET 2010


Hi Petr, a couple of comments inserted below.

Petr PIKAL wrote:
> Hi
> 
> r-help-bounces at r-project.org wrote on 04.02.2010 11:31:51:
> 
>> Petr PIKAL wrote:
>>> Hi
>>>
>>> so do you think I should file a bug report? I think I would rather
>>> wait to see if there is some reaction from others. Maybe there is
>>> some reason behind such behaviour. Those simple statistics tend to
>>> behave differently when operating on data.frames, so median is not
>>> such a huge surprise.
>>>
>>> see
>>>
>>> sd(df1), var(df1), mean(df1), max(df1), min(df1), range(df1)
>>>
>>> The results produced are usually clearly documented; however, for a
>>> novice it is rather mysterious why using those functions on a vector
>>> produces easily understandable results, while using them on a
>>> data.frame (which is the most common data structure) is far from
>>> consistent and intuitive.
>>>
>>> But I agree with you that mean and median should, in the best case,
>>> give results with a similar structure.
>>>
>>> Regards
>>> Petr
>> Well, I don't think that it's a bug since the documentation
>> for median() does not indicate that median should work for
>> dataframes, whereas for mean() it clearly says that a method
>> exists. methods('mean') and methods('median') as well as
>> mean.default(df1) are informative.
> 
> It depends on whom it is informative for. Here is a snippet from the
> median help page:
> 
> This is a generic function for which methods can be written. However, the 
> default method makes use of sort and mean, both of which are generic, and 
> so the default method will work for most classes 
>                               ^^^^^^^^^

I agree that this is at the very least misleading. The default method
certainly does not work for data.frames and I wouldn't consider those
to be an unusual class. Still, at this point, I think we're talking
more about a wishlist item than a bug.
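
For anyone who wants to check this for themselves, here is a minimal
sketch of how one might inspect which methods are available. It assumes
a reasonably recent R; the variable name has_df_method is mine and purely
illustrative, and the exact list of methods printed will depend on the R
version and on which packages are loaded.

```r
# Which S3 methods are visible for each generic?
print(utils::methods("mean"))
print(utils::methods("median"))

# Look up a data.frame method for median() without raising an error:
# optional = TRUE makes getS3method() return NULL when none is found.
has_df_method <- !is.null(
  utils::getS3method("median", "data.frame", optional = TRUE)
)
has_df_method

# A data.frame is a list of columns rather than a numeric vector, which
# is why a method written with plain vectors in mind cannot be expected
# to handle it sensibly.
df1 <- data.frame(matrix(1:16, 4, 4))
is.list(df1)     # TRUE
is.numeric(df1)  # FALSE
```

If has_df_method is FALSE, a call to median() on a data.frame falls
through to median.default.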

I must admit, I've never run across this situation. Good of you
to spot it.
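
In case it helps anyone who does run into this, the *apply workaround
and the one-line method from the quoted text can be sketched as follows.
This is only an illustration under my own assumptions, not a proposed
patch: the explicit na.rm argument is an addition of mine, and
median.data.frame is defined here in the user's workspace, not in base R.

```r
mat <- matrix(1:32, 4, 8)
df1 <- data.frame(mat)

# Workaround: apply median() to each column in turn.
col_medians <- sapply(df1, median)
col_medians

# The same numbers computed from the matrix, column by column.
apply(mat, 2, median)

# Hypothetical method along the lines of the one-liner in the thread;
# na.rm is spelled out so that it is passed on to each column's median().
median.data.frame <- function(x, na.rm = FALSE, ...) {
  sapply(x, median, na.rm = na.rm, ...)
}

# With the method defined, median() dispatches on the data.frame class.
median(df1)
```

Since each column here holds four consecutive integers, the column
medians coincide with the column means shown in the quoted output
(2.5, 6.5, ..., 30.5).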

  -Peter Ehlers

> If you consider data.frame an unusual class I could accept your point,
> but if the help page tells me that a function works for most classes,
> I would not expect that the data.frame class should be avoided,
> especially if the workaround is so simple (for an experienced user).
> As I said, if I encountered this in the real world I could easily make
> it work with *apply.
> 
> I tried to show my audience that a matrix is different from a
> data.frame with respect to such simple statistical functions. But how
> do you explain that using mean on a matrix produces one number, while
> using it on a data.frame produces the mean separately for each column?
> I wanted to show that it is similar for median but, being such a candid
> moron, I luckily tried it before I presented it. :-)
> 
>> It seems to me to be a simple fix so I wonder what I'm
>> missing. Paraphrasing mean.data.frame:
>>
>> median.data.frame <- function(x, ...) sapply(x, median, ...)
>>
>> I think that it would be desirable to have similar behaviour
>> for both functions or at least a warning if median.default
>> is incorrectly applied to a data.frame object.
> 
> Agreed. For the benefit of novices I would vote for changing the
> behaviour for data.frames to get mean-like behaviour.
> 
> Regards
> Petr
> 
>>   -Peter Ehlers
>>
>>> r-help-bounces at r-project.org wrote on 04.02.2010 10:28:16:
>>>
>>>> Well, I get the same as Petr with  R version 2.10.0 (2009-10-26)
>>>> on Linux.
>>>>
>>>> To me, this suggests that median is broken! Any user would,
>>>> a priori, expect that median() should operate in exactly
>>>> the same way as mean(). To extend Petr's example:
>>>>
>>>>   mat <- matrix(1:32, 4,8)
>>>>   df1 <- data.frame(mat)
>>>>   mean(df1)
>>>>   #   X1   X2   X3   X4   X5   X6   X7   X8 
>>>>   #  2.5  6.5 10.5 14.5 18.5 22.5 26.5 30.5 
>>>>   median(df1)
>>>>   # [1] 14.5 18.5
>>>>
>>>> so (as in Petr's original example, but more clearly) median()
>>>> returns the medians of the two "central" columns X4 and X5 of df1.
>>>>
>>>> But that is with an even number of columns. Now look at what
>>>> happens with an odd number:
>>>>
>>>>   mat <- matrix(1:28, 4,7)
>>>>   df1 <- data.frame(mat)
>>>>   mean(df1)
>>>>   #   X1   X2   X3   X4   X5   X6   X7 
>>>>   #  2.5  6.5 10.5 14.5 18.5 22.5 26.5 
>>>>   median(df1)
>>>>   #   structure(c("13", "14", "15", "16"), class = "AsIs")
>>>>   # 1                                                   13
>>>>   # 2                                                   14
>>>>   # 3                                                   15
>>>>   # 4                                                   16
>>>>
>>>> Wow!!!!!!!!!!
>>>>
>>>> This does suggest a tie-in with Petr's observation about "As.Is",
>>>> and there is no doubt at all that the above result is rubbish.
>>>> It is certainly not what a user would expect, and in the context
>>>> of Petr's intention to present R lessons to a class, I could
>>>> foresee students turning their backs on R if they came up with
>>>> such a result in their early encounters!
>>>>
>>>> Ted.
>>>>
>>>> On 04-Feb-10 08:59:59, Mario Valle wrote:
>>>>> R 2.9.0 on Linux gives:
>>>>>
>>>>>> median(df1)
>>>>> [1] 34
>>>>>
>>>>> Even stranger...
>>>>>               mario
>>>>>
>>>>> Petr PIKAL wrote:
>>>>>> During some experimentation in preparing R lessons I encountered
>>>>>> this behaviour, which I cannot fully explain:
>>>>>>
>>>>>> mat <- matrix(1:16, 4,4)
>>>>>> df1 <- data.frame(mat)
>>>>>>
>>>>>>> mean(df1)
>>>>>>   X1   X2   X3   X4 
>>>>>>  2.5  6.5 10.5 14.5 
>>>>>>
>>>>>> Expected, documented
>>>>>>
>>>>>>> median(df1)
>>>>>> [1]  6.5 10.5
>>>>>>
>>>>>> Rather weird. AFAIK there should not be an issue with data frames;
>>>>>> at least I did not find any in the help page. I tracked it down,
>>>>>> probably, to an As.Is operation on the object and subsequent sorting
>>>>>> in median.default.
>>>>>>
>>>>>> I know other (*apply) ways to compute the median for data frames, so
>>>>>> I would just like to hear an opinion about this behaviour from more
>>>>>> experienced people.
>>>>>>
>>>>>> Thank you
>>>>>> Best regards
>>>>>>
>>>>>> Petr
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide
>>>>>> http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>> -- 
>>>>> Ing. Mario Valle
>>>>> Data Analysis and Visualization Group            | http://www.cscs.ch/~mvalle
>>>>> Swiss National Supercomputing Centre (CSCS)      | Tel:  +41 (91) 610.82.60
>>>>> v. Cantonale Galleria 2, 6928 Manno, Switzerland | Fax:  +41 (91) 610.82.82
>>>>>
>>>> --------------------------------------------------------------------
>>>> E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
>>>> Fax-to-email: +44 (0)870 094 0861
>>>> Date: 04-Feb-10                                       Time: 09:28:13
>>>> ------------------------------ XFMail ------------------------------
>>>>
>>>
>>>
>> -- 
>> Peter Ehlers
>> University of Calgary
>>
> 
> 

-- 
Peter Ehlers
University of Calgary


