[Rd] quantile(), IQR() and median() for factors

Greg Snow Greg.Snow at imail.org
Fri Mar 6 22:14:30 CET 2009


Yes, I have discussed right continuous, left continuous, etc. definitions for the median in numeric data.  I was just curious what the discussion was in texts that cover quantiles/medians of ordered categorical data in detail.

I do not expect Low.5 as computer output for the median (but Low.Medium does make sense in a way).  Back in my theory classes, when we actually needed a firm definition, I remember mainly using the left continuous version (Low for the example).  I don't remember why we chose that over the right continuous version; it was probably just the teacher's or book's preference (I do remember it made things simpler than using the average of the middle 2 when n was even).
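
For concreteness, the two conventions could be sketched in R roughly as below.  This is only an illustration under my own assumptions: median_ordered and its 'high' argument are made-up names (loosely modelled on the low/high arguments of mad()), not anything that exists in base R.

```r
# Hypothetical helper, not part of base R: lower (left continuous) or
# upper (right continuous) median of an ordered factor.
median_ordered <- function(x, high = FALSE) {
  stopifnot(is.ordered(x))
  x <- x[!is.na(x)]                      # drop missing values
  n <- length(x)
  if (n == 0L) return(x[NA_integer_])    # NA with the same levels
  codes <- sort(as.integer(x))           # integer codes follow level order
  # index of the lower or upper middle observation (identical when n is odd)
  k <- if (high) n %/% 2L + 1L else (n + 1L) %/% 2L
  factor(levels(x)[codes[k]], levels = levels(x), ordered = TRUE)
}

x <- factor(c("Low", "Low", "Medium", "High"),
            levels = c("Low", "Medium", "High"), ordered = TRUE)
median_ordered(x)               # lower median: Low
median_ordered(x, high = TRUE)  # upper median: Medium
```

Returning one of the observed levels avoids having to invent a value such as Low.5 when n is even.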

-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111


> -----Original Message-----
> From: Simone Giannerini [mailto:sgiannerini at gmail.com]
> Sent: Friday, March 06, 2009 2:08 PM
> To: Prof Brian Ripley
> Cc: Greg Snow; R-devel
> Subject: Re: [Rd] quantile(), IQR() and median() for factors
> 
> Dear Greg,
> 
> thank you for your comments,
> as Prof. Ripley pointed out, in the case of even sample size the
> median is not unique and is formed by the two central observations or
> a function of them, if that makes sense.
> 
> 
> 
> Dear Prof. Ripley,
> 
> thank you for your concern,
> 
> may I point out that (in the case of non-negative data) one can get the
> median from mad() with center=0,constant=1
> 
> 
> > mad(1:10,center=0,constant=1)
> [1] 5.5
> > mad(1:10,center=0,constant=1,high=TRUE)
> [1] 6
> > mad(1:10,center=0,constant=1,low=TRUE)
> [1] 5
> 
> so that it seems that part of the code of mad() might be a starting
> point, at least for median().
> I confirm my availability to work on the matter if requested.
> 
> Kind regards,
> 
> Simone
> 
> 
> On Fri, Mar 6, 2009 at 6:36 PM, Prof Brian Ripley
> <ripley at stats.ox.ac.uk> wrote:
> > On Fri, 6 Mar 2009, Greg Snow wrote:
> >
> >> I like the idea of median and friends working on ordered factors.  Just a
> >> couple of thoughts on possible implementations.
> >>
> >> Adding extra checks and functionality will slow down the function.  For a
> >> single evaluation on a given dataset this slowdown will not be noticeable,
> >> but inside a simulation, bootstrap, or other high-iteration technique it
> >> could matter.  I would suggest creating a core function that does just the
> >> calculations (median, quantile, IQR), assuming that the data passed in is
> >> correct, without doing any checks or anything fancy.  Then the user-callable
> >> function (median et al.) would do the checks, dispatch to other functions
> >> for anything fancy, etc., then call the core function with the clean data.
> >> The common user would not really notice a difference, but someone
> >> programming a high-iteration technique could clean the data themselves, then
> >> call the core function directly, bypassing the checks/branches.
> >
> > Since median and quantile are already generic, adding an 'ordered' method
> > would be zero cost to other uses.  And the factor check at the head of
> > median.default could be replaced by median.factor if someone could show a
> > convincing performance difference.
> >
> >> Just out of curiosity (from someone who only learned from English
> >> (Americanized at that) and not Italian texts), what would the median of
> >> [Low, Low, Medium, High] be?
> >
> > I don't think it is 'the' median but 'a' median.  (Even English Wikipedia
> > says the median is not unique for even numbers of inputs.)
> >
> >>
> >> --
> >> Gregory (Greg) L. Snow Ph.D.
> >> Statistical Data Center
> >> Intermountain Healthcare
> >> greg.snow at imail.org
> >> 801.408.8111
> >>
> >>
> >>> -----Original Message-----
> >>> From: r-devel-bounces at r-project.org [mailto:r-devel-bounces at r-project.org] On Behalf Of Simone Giannerini
> >>> Sent: Thursday, March 05, 2009 4:49 PM
> >>> To: R-devel
> >>> Subject: [Rd] quantile(), IQR() and median() for factors
> >>>
> >>> Dear all,
> >>>
> >>> from the help page of quantile:
> >>>
> >>> "x     numeric vectors whose sample quantiles are wanted. Missing
> >>> values are ignored."
> >>>
> >>> from the help page of IQR:
> >>>
> >>> "x     a numeric vector."
> >>>
> >>> In fact, it seems that neither quantile() nor IQR() checks that its
> >>> input is numeric.
> >>> See the following:
> >>>
> >>> set.seed(11)
> >>> x <- rbinom(n=11,size=2,prob=.5)
> >>> x <- factor(x,ordered=TRUE)
> >>> x
> >>>  [1] 1 0 1 0 0 2 0 1 2 0 0
> >>> Levels: 0 < 1 < 2
> >>>
> >>>> quantile(x)
> >>>
> >>>   0%  25%  50%  75% 100%
> >>>    0 <NA>    0 <NA>    2
> >>> Levels: 0 < 1 < 2
> >>> Warning messages:
> >>> 1: In Ops.ordered((1 - h), qs[i]) :
> >>>   '*' is not meaningful for ordered factors
> >>> 2: In Ops.ordered(h, x[hi[i]]) : '*' is not meaningful for ordered
> >>> factors
> >>>
> >>>> IQR(x)
> >>>
> >>> [1] 1
> >>>
> >>> whereas median has the check:
> >>>
> >>>> median(x)
> >>>
> >>> Error in median.default(x) : need numeric data
> >>>
> >>> I also take the opportunity to ask for your comments on the following
> >>> related subject:
> >>>
> >>> In my opinion it would be convenient if median() and the like
> >>> (quantile(), IQR()) were implemented for ordered factors, for which
> >>> they can in fact be well defined. For instance, functions like
> >>> apply(x,FUN=median,...) could then be used without further processing
> >>> on data frames that contain both numeric variables and ordered factors.
> >>> To my limited knowledge, English introductory statistics textbooks
> >>> mention only marginally that the median is well defined for ordered
> >>> categorical variables. In the Italian statistics literature, on the
> >>> other hand, this is often discussed in detail, and this could mislead
> >>> students and practitioners who might expect median() to work for
> >>> ordered factors.
> >>>
> >>> In this message
> >>>
> >>> https://stat.ethz.ch/pipermail/r-help/2003-November/042684.html
> >>>
> >>> Martin Maechler considers the possibility of doing such a job by
> >>> allowing for extra arguments "low" and "high", as is done for mad().
> >>> I am willing to contribute if requested, and comments are welcome.
> >>>
> >>> Thank you for the attention,
> >>>
> >>> kind regards,
> >>>
> >>> Simone
> >>>
> >>>> R.version
> >>>
> >>>                _
> >>> platform       i386-pc-mingw32
> >>> arch           i386
> >>> os             mingw32
> >>> system         i386, mingw32
> >>> status
> >>> major          2
> >>> minor          8.1
> >>> year           2008
> >>> month          12
> >>> day            22
> >>> svn rev        47281
> >>> language       R
> >>> version.string R version 2.8.1 (2008-12-22)
> >>>
> >>>
> >>> LC_COLLATE=Italian_Italy.1252;LC_CTYPE=Italian_Italy.1252;LC_MONETARY=Italian_Italy.1252;LC_NUMERIC=C;LC_TIME=Italian_Italy.1252
> >>>
> >>> --
> >>> ______________________________________________________
> >>>
> >>> Simone Giannerini
> >>> Dipartimento di Scienze Statistiche "Paolo Fortunati"
> >>> Universita' di Bologna
> >>> Via delle belle arti 41 - 40126  Bologna,  ITALY
> >>> Tel: +39 051 2098262  Fax: +39 051 232153
> >>> http://www2.stat.unibo.it/giannerini/
> >>>
> >>> ______________________________________________
> >>> R-devel at r-project.org mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-devel
> >>
> >>
> >
> > --
> > Brian D. Ripley,                  ripley at stats.ox.ac.uk
> > Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> > University of Oxford,             Tel:  +44 1865 272861 (self)
> > 1 South Parks Road,                     +44 1865 272866 (PA)
> > Oxford OX1 3TG, UK                Fax:  +44 1865 272595
> 
> 
> 
> --
> ______________________________________________________
> 
> Simone Giannerini
> Dipartimento di Scienze Statistiche "Paolo Fortunati"
> Universita' di Bologna
> Via delle belle arti 41 - 40126  Bologna,  ITALY
> Tel: +39 051 2098262  Fax: +39 051 232153
> http://www2.stat.unibo.it/giannerini/
> ______________________________________________________


