[R] strange behaviour of median

Petr PIKAL petr.pikal at precheza.cz
Thu Feb 4 12:00:55 CET 2010


Hi

r-help-bounces at r-project.org wrote on 04.02.2010 11:31:51:

> Petr PIKAL wrote:
> > Hi
> > 
> > so do you think I should file a bug report? I think I would rather wait
> > to see if there is some reaction from others. Maybe there is some reason
> > behind such behaviour. Those simple statistics tend to behave differently
> > when operating on data.frames, so median is not such a huge surprise.
> > 
> > see
> > 
> > sd(df1), var(df1), mean(df1), max(df1), min(df1), range(df1)
> > 
> > The results produced are usually clearly documented; however, for a novice
> > it is rather mysterious why using those functions on a vector produces
> > easily understandable results, while using them on a data.frame (which is
> > the most common data structure) is far from consistent and intuitive.
> > 
> > But I agree with you that mean and median should, at best, give results
> > with a similar structure.
> > 
> > Regards
> > Petr
> 
> Well, I don't think that it's a bug since the documentation
> for median() does not indicate that median should work for
> dataframes, whereas for mean() it clearly says that a method
> exists. methods('mean') and methods('median') as well as
> mean.default(df1) are informative.

It depends on who finds them informative. Here is a snippet from the median help page:

This is a generic function for which methods can be written. However, the 
default method makes use of sort and mean, both of which are generic, and 
so the default method will work for most classes 
                             ^^^^^^^^^
If you consider data.frame an unusual class I could accept your point, but 
if the help page tells me that a function works for most classes, I would not 
expect that the data.frame class should be avoided, especially when the 
workaround is so simple (for an experienced user). As I said, if I encountered 
this in the real world I could easily make it work with *apply. 
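
For illustration, a minimal sketch of the *apply workaround, using the same toy data as in the quoted examples further down. The name col_medians is only illustrative, and the sketch assumes every column is numeric.

```r
# Same toy data as in the quoted examples
mat <- matrix(1:16, 4, 4)
df1 <- data.frame(mat)

# sapply() visits each column as a plain numeric vector,
# so median() is applied column by column
col_medians <- sapply(df1, median)
print(col_medians)
```

This gives one median per column, in the same named-vector shape that mean(df1) returns in the examples quoted in this thread.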

I tried to show my audience that a matrix is different from a data.frame 
with respect to such simple statistical functions. But how do you explain 
that using mean on a matrix produces one number, while using it on a 
data.frame produces a mean separately for each column? I wanted to show 
that it is similar for median, but being such a candid moron I luckily tried 
it before I presented it. :-)
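
One way to sidestep the matrix versus data.frame difference in a lesson is to ask for per-column summaries explicitly. The sketch below is only an illustration with hypothetical variable names; it deliberately avoids calling mean() or median() directly on the data.frame, since that behaviour is what this thread is about and may depend on the R version.

```r
mat <- matrix(1:16, 4, 4)
df1 <- data.frame(mat)

# On a matrix, mean() and median() treat all cells as one long vector
overall_mean   <- mean(mat)
overall_median <- median(mat)

# Explicit per-column summaries for the matrix ...
col_means_mat <- apply(mat, 2, mean)
col_meds_mat  <- apply(mat, 2, median)

# ... and for the data.frame, where each column is a vector
col_means_df <- sapply(df1, mean)
col_meds_df  <- sapply(df1, median)
```

Written this way, both structures give the same per-column numbers, and the single overall value is requested only on the matrix.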

> 
> It seems to me to be a simple fix so I wonder what I'm
> missing. Paraphrasing mean.data.frame:
> 
> median.data.frame <- function(x, ...) sapply(x, median, ...)
> 
> I think that it would be desirable to have similar behaviour
> for both functions or at least a warning if median.default
> is incorrectly applied to a data.frame object.

Agreed. For the benefit of novices I would vote for changing the behaviour 
for data.frames to get mean-like behaviour.
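
As a rough sketch of what such mean-like behaviour could look like, the method proposed in the quoted text can be defined in a user session. This is an assumption-laden illustration, not the implementation in R itself: the explicit na.rm argument is my addition, and the method assumes all columns are numeric.

```r
# User-level S3 method for median() on data.frames (illustrative only)
median.data.frame <- function(x, na.rm = FALSE, ...) {
  # Apply median() to each column and return a named vector
  sapply(x, median, na.rm = na.rm, ...)
}

# The odd-number-of-columns example from the quoted messages
df7 <- data.frame(matrix(1:28, 4, 7))
med7 <- median(df7)   # dispatches to median.data.frame
print(med7)
```

With the method in place, the seven-column example returns one median per column instead of the AsIs output quoted below.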

Regards
Petr

> 
>   -Peter Ehlers
> 
> > 
> > r-help-bounces at r-project.org wrote on 04.02.2010 10:28:16:
> > 
> >> Well, I get the same as Petr with  R version 2.10.0 (2009-10-26)
> >> on Linux.
> >>
> >> To me, this suggests that median is broken! Any user would,
> >> a priori, expect that median() should operate in exactly
> >> the same way as mean(). To extend Petr's example:
> >>
> >>   mat <- matrix(1:32, 4,8)
> >>   df1 <- data.frame(mat)
> >>   mean(df1)
> >>   #   X1   X2   X3   X4   X5   X6   X7   X8 
> >>   #  2.5  6.5 10.5 14.5 18.5 22.5 26.5 30.5 
> >>   median(df1)
> >>   # [1] 14.5 18.5
> >>
> >> so (as in Petr's original example, but more clearly) median()
> >> returns the medians of the two "central" columns X4 and X5 of df1.
> >>
> >> But that is with an even number of columns. Now look at what
> >> happens with an odd number:
> >>
> >>   mat <- matrix(1:28, 4,7)
> >>   df1 <- data.frame(mat)
> >>   mean(df1)
> >>   #   X1   X2   X3   X4   X5   X6   X7 
> >>   #  2.5  6.5 10.5 14.5 18.5 22.5 26.5 
> >>   median(df1)
> >>   #   structure(c("13", "14", "15", "16"), class = "AsIs")
> >>   # 1                                                   13
> >>   # 2                                                   14
> >>   # 3                                                   15
> >>   # 4                                                   16
> >>
> >> Wow!!!!!!!!!!
> >>
> >> This does suggest a tie-in with Petr's observation about "As.Is",
> >> and there is no doubt at all that the above result is rubbish.
> >> It is certainly not what a user would expect, and in the context
> >> of Petr's intention to present R lessons to a class, I could
> >> foresee students turning their backs on R if they came up with
> >> such a result in their early encounters!
> >>
> >> Ted.
> >>
> >> On 04-Feb-10 08:59:59, Mario Valle wrote:
> >>> Linux 2.9.0 gives:
> >>>
> >>>> median(df1)
> >>> [1] 34
> >>>
> >>> Ever stranger...
> >>>               mario
> >>>
> >>> Petr PIKAL wrote:
> >>>> During some experimentation in preparing R lessons I encountered this
> >>>> behaviour which I can not explain fully
> >>>>
> >>>> mat <- matrix(1:16, 4,4)
> >>>> df1 <- data.frame(mat)
> >>>>
> >>>>> mean(df1)
> >>>>   X1   X2   X3   X4 
> >>>>  2.5  6.5 10.5 14.5 
> >>>>
> >>>> Expected, documented
> >>>>
> >>>>> median(df1)
> >>>> [1]  6.5 10.5
> >>>>
> >>>> Rather weird. AFAIK there should not be an issue with a data frame; at
> >>>> least I did not find any in the help page. I tracked it down, probably,
> >>>> to an As.Is operation on the object and subsequent sorting in
> >>>> median.default.
> >>>>
> >>>> I know other (*apply) ways to compute the median for data frames, so I
> >>>> would just like to hear an opinion about this behaviour from more
> >>>> experienced people.
> >>>>
> >>>> Thank you
> >>>> Best regards
> >>>>
> >>>> Petr
> >>>>
> >>>> ______________________________________________
> >>>> R-help at r-project.org mailing list
> >>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>> PLEASE do read the posting guide
> >>>> http://www.R-project.org/posting-guide.html
> >>>> and provide commented, minimal, self-contained, reproducible code.
> >>> -- 
> >>> Ing. Mario Valle
> >>> Data Analysis and Visualization Group            | http://www.cscs.ch/~mvalle
> >>> Swiss National Supercomputing Centre (CSCS)      | Tel:  +41 (91) 610.82.60
> >>> v. Cantonale Galleria 2, 6928 Manno, Switzerland | Fax:  +41 (91) 610.82.82
> >>>
> >> --------------------------------------------------------------------
> >> E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
> >> Fax-to-email: +44 (0)870 094 0861
> >> Date: 04-Feb-10                                       Time: 09:28:13
> >> ------------------------------ XFMail ------------------------------
> >>
> 
> -- 
> Peter Ehlers
> University of Calgary
> 