[R] Weird problem with median on a factor

Tony Plate tplate at blackmesacapital.com
Fri Oct 31 21:56:16 CET 2003


median() has this test in it:
     if (mode(x) != "numeric")
         stop("need numeric data")

Note the following:

 > is.numeric(factor(letters))
[1] FALSE
 > mode(factor(letters))
[1] "numeric"

It seems as though median() is using the wrong test: mode(x) reports
"numeric" for a factor, whereas is.numeric(x) does not.
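
As a rough illustration of the difference between the two tests, here is a
minimal sketch. The helper name check_numeric is hypothetical (it is not part
of base R, and this is not the actual body of median()); it only shows how a
guard based on is.numeric() would reject a factor where one based on mode()
lets it through.

```r
# Hypothetical helper (not part of base R): reject anything that
# is.numeric() does not accept, which includes factors.
check_numeric <- function(x) {
    if (!is.numeric(x))
        stop("need numeric data")
    invisible(TRUE)
}

f <- factor(letters)
mode(f)         # "numeric": mode() reflects the underlying integer codes
is.numeric(f)   # FALSE: is.numeric() returns FALSE for factors

# A plain numeric vector passes the guard ...
ok_num <- check_numeric(c(1, 2, 10))

# ... while the factor is stopped before any arithmetic is attempted.
err_fac <- tryCatch(check_numeric(f),
                    error = function(e) conditionMessage(e))
```

With a guard of this kind the factor fails immediately with "need numeric
data", regardless of whether the number of observations is odd or even.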

-- Tony Plate

At Friday 03:37 PM 10/31/2003 -0500, RBaskin at ahrq.gov wrote:
>Beating a dead horse...
>
>I am an R beginner trying to understand this factor business.  While the
>entire business of finding the median of factor may be silly from a
>practical point of view, this email chain has helped me understand
>something.
>
>I have looked at the median function and it tests to see if what is passed
>to it is numeric.  If I were building a function, if I tested for mode
>numeric, and if something told me it was numeric then like the median
>function I would naively assume that I could do arithmetic on it:
> > saywhut<-as.factor(c(NA,"1","1","1","1","2","10"))
> > mode(saywhut)
>[1] "numeric"
>
>It appears to me that when the median function tests for numeric, it
>doesn't have the desired result with an object of class factor (and maybe
>other classes?), as was shown by the example.
>
>I have a suspicion that something of class factor has at least two pieces,
>one of which is the levels which can possibly be character or something else
>and the other piece is the ordering of the levels which is of storage.mode
>integer.  Is it this ordering that determines the mode of the factor??
>
>But if the mode of factor is truly numeric, why doesn't the median function
>use the numeric piece for finding the median (like it did with odd n - not
>that anyone would ever really want the median of a factor:)??  I think that
>Simon Fear hit on the right idea, because the definition of median that is
>used for an even number of observations takes the sum of the ordered middle
>two observations.  It is the sum (called by the median function) that chokes
>on a factor.
>
> > sum(saywhut,na.rm=T)
>Error in Summary.factor(..., na.rm = na.rm) :
>         "sum" not meaningful for factors
>
>It appears that whoever built the sum function built in a test for factor
>(Simon Fear's first suggestion for median).
>
>
>On the other hand:
> > sd(saywhut,na.rm=T)
>[1] 3.614784
>(Simon Fear's second suggestion for median)
>
>By the way, mean treats a factor in a different way:
> > mean(saywhut)
>[1] NA
>Warning message:
>argument is not numeric or logical: returning NA in: mean.default(saywhut).
>
>
>There is an R-FAQ that tells one how to convert a factor to 'numeric' but if
>I had tested for something being numeric to begin with I never would have
>guessed that I needed to convert it to numeric.  I think what this
>conversion is really doing is getting rid of the machinery associated with
>the class factor:
> > #from the R-FAQ
> > test<-as.numeric(as.character(saywhut))
> > mode(test)
>[1] "numeric"
> > median(test,na.rm=T)
>[1] 1
>
>and by the way:
> > not.a.factor<-c(NA,"1","1","2","10")
> > mode(not.a.factor)
>[1] "character"
> > median(not.a.factor,na.rm=T)
>Error in median(not.a.factor, na.rm = T) :
>         need numeric data
>
>
><Simon Fear: It seems to me the best way to deal with this "bug" would
>be to make calling median with a factor argument be an immediate error.>
>Do you think that all base functions (sum, sd, mean, median,...) should deal
>with this in a consistent way (This might be much more work.)?  Another
>thing that would make things consistent would be to take the stop-work
>behavior out of sum:)
>
>I don't think there is any real problem in the current behavior of factor as
>long as the interaction between functions and classes produces this
>stop-work behavior - preferably with a warning - and not unexpected side
>effects. I am curious if there are other classes of mode numeric which
>median-mean-sum-sd-etc might choke on.
>
><tongue-in-cheek on>
>Of course, R would produce a median for factors by using the "correct"
>definition of a median of samples i.e., one that agrees with the definition
>of median on a CDF, even though this concept gives most people apoplexy.
><off>
>Thanks
>Bob
>Usual disclaimers....
>
>
>-----Original Message-----
>From: Simon Fear [mailto:Simon.Fear at synequanon.com]
>Sent: Friday, October 31, 2003 6:18 AM
>To: Christoph Bier
>Cc: r-help at stat.math.ethz.ch
>Subject: RE: [R] Weird problem with median on a factor
>
>Final guess as to observed behaviour: in the first case after
>removal of NAs there were an odd number of observations
>(so that sum was not called within the code for median).
>In your second call I suspect that even though you got
>an integer answer, it was found as sum(2,2)/2.
>
>It seems to me the best way to deal with this "bug" would
>be to make calling median with a factor argument be an
>immediate error. Or just trust users never to attempt such
>a thing ...
>
>Simon Fear
>Senior Statistician
>Syne qua non Ltd
>Tel: +44 (0) 1379 644449
>Fax: +44 (0) 1379 644445
>email: Simon.Fear at synequanon.com
>web: http://www.synequanon.com
>
>______________________________________________
>R-help at stat.math.ethz.ch mailing list
>https://www.stat.math.ethz.ch/mailman/listinfo/r-help



