[Rd] On the median

Bill.Venables at csiro.au
Wed Sep 22 07:29:41 CEST 2010


I have recently become aware of some curious behaviour of median() which I think could be usefully corrected.  I am sure this must have come up before, but I'm raising it again.

The phenomenon is best shown by a simple example.

> d <- matrix(runif(4*4), 4, 4)
> d
          [,1]       [,2]       [,3]      [,4]
[1,] 0.1388592 0.08478220 0.02012404 0.7733054
[2,] 0.1718332 0.06370432 0.66167219 0.2521809
[3,] 0.3190116 0.08616569 0.23107320 0.6278422
[4,] 0.9185233 0.29218144 0.99193823 0.6306847
> apply(d, 1, median)
[1] 0.1118207 0.2120070 0.2750424 0.7746040

So far, so good. But what happens when you turn it into a data frame?

> d <- data.frame(d)
> apply(d, 1, median)
[1] 0.1118207 0.2120070 0.2750424 0.7746040

No problem there, yet.  But if you just look at one row:

> median(d[1, ])
[1] 0.0847822 0.1388592

without warning you get a vector of length two as the result, viz. the two values that enclose the middle.  I thought this was simply because one row of a data frame is a list, but that can't be the whole story, e.g.

> median(d[,1])
[1] 0.2454224
> median(as.list(d[,1]))
Error in sort.list(x, partial = half + 0L:1L) : 
  'x' must be atomic for 'sort.list'
Have you called 'sort' on a list?
> 

(Well yes, Brian, I did...)  
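
As a minimal sketch of how to sidestep the one-row case (assuming every column is numeric; the name row1 is introduced here purely for illustration), the row can be flattened to an atomic vector before median() is called:

```r
d <- data.frame(matrix(runif(4 * 4), 4, 4))

# unlist() flattens the one-row data frame into a named numeric vector,
# so median() sees four numbers rather than a list of four columns.
row1 <- unlist(d[1, ])
median(row1)

# The same value via the matrix route used in the first example.
median(as.matrix(d)[1, ])
```

Either form should agree with the first element of apply(d, 1, median), since apply() also converts the data frame to a matrix first.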

The function mean() has a nice property when you call it on a data frame, e.g.

> mean(d)
       X1        X2        X3        X4 
0.3870568 0.1317084 0.4762019 0.5710033 

and just to complicate the issue even further, 

> mean(d[1, ])
        X1         X2         X3         X4 
0.13885916 0.08478220 0.02012404 0.77330535 

On the other hand, median(), whose behaviour should, I would suggest, be similar, just fails when handed a data frame argument.

> median(d)
[1] NA NA
Warning messages:
1: In mean.default(X[[1L]], ...) :
  argument is not numeric or logical: returning NA
2: In mean.default(X[[2L]], ...) :
  argument is not numeric or logical: returning NA
> 
_________________

I suggest that there should be some consistency here, and that median() be given a data.frame method that would allow it to respond much the same as mean() does.  The way it responds to data frame arguments now is quirky, at best.
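
One way such a method might look is sketched below.  This is a hypothetical median.data.frame, not something in base R or the stats package, and it assumes every column is numeric (a factor or character column would make the inner call fail):

```r
# Hypothetical method: column-wise medians, mirroring the per-column
# result that mean() gives for a data frame in the session above.
median.data.frame <- function(x, na.rm = FALSE, ...) {
  # Each column of a data frame is an atomic vector, so the existing
  # methods can handle the columns one at a time.
  vapply(x, median, numeric(1), na.rm = na.rm, ...)
}

d <- data.frame(matrix(runif(4 * 4), 4, 4))
median(d)        # one median per column, named X1 to X4
median(d[1, ])   # one-row data frame: each column's single value
```

With this definition a one-row data frame returns its four values unchanged, which matches what mean(d[1, ]) does in the transcript above; whether that is the most useful answer for a single row is a separate design question.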

Currently median(), though generic, has only the default method.

> methods("mean")
[1] mean.data.frame mean.Date       mean.default    mean.difftime   mean.POSIXct   
[6] mean.POSIXlt   

> methods("median")
[1] median.default
> 

Perhaps quantile() should also have a data.frame method for the same reason.  To me it seems curious, too, that quantile has a POSIXt method (in the stats package) whereas median currently does not.  (mean.POSIX*t are in the base package.)

> methods("quantile")
[1] quantile.default quantile.POSIXt*

   Non-visible functions are asterisked
> 
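
A corresponding sketch for quantile() follows.  Again this is a hypothetical quantile.data.frame rather than anything provided by the stats package, and it assumes all columns are numeric:

```r
# Hypothetical method: apply quantile() to each column in turn.
quantile.data.frame <- function(x, probs = seq(0, 1, 0.25), ...) {
  # With more than one probability, sapply() binds the per-column
  # results into a matrix: one row per probability, one column per variable.
  sapply(x, quantile, probs = probs, ...)
}

d <- data.frame(matrix(runif(4 * 4), 4, 4))
quantile(d, probs = c(0.25, 0.5, 0.75))
```

Note that with a single probability sapply() simplifies to a vector rather than a one-row matrix, so a real method would need to decide on one return shape.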

How do people respond to this?

(I see there have been hints of this in the past, see http://tolstoy.newcastle.edu.au/R/e2/help/06/12/7692.html
but I could only find hints.)

Bill Venables
CSIRO/CMIS, Cleveland Labs.
 

