[Rd] (PR#7976) split() dropping levels (was "boxplot by factor")
Martin Maechler
maechler at stat.math.ethz.ch
Fri Jul 1 18:36:54 CEST 2005
>>>>> "PD" == Peter Dalgaard <p.dalgaard at biostat.ku.dk>
>>>>> on 28 Jun 2005 14:57:42 +0200 writes:
PD> "Liaw, Andy" <andy_liaw at merck.com> writes:
>> The issue is not with boxplot, but with split. boxplot.formula()
>> calls boxplot(split(mf[[response]], mf[-response]), ...),
>> but look at what split() returns when there are empty levels in
>> the factor:
>>
>> > f <- factor(gl(3, 6), levels=1:5)
>> > y <- rnorm(f)
>> > split(y, f)
>> $"1"
>> [1] 0.4832124 1.1924811 0.3657797 1.7400198 0.5577356 0.9889520
>>
>> $"2"
>> [1] -1.1296642 -0.4808355 -0.2789933 0.1220718 0.1287742 -0.7573801
>>
>> $"3"
>> [1] 1.2320902 0.5090700 -1.5508074 2.1373780 1.1681297 -0.7151561
>>
>> The "culprit" is the following in split.default():
>>
>> f <- factor(f)
>>
>> which drops empty levels in f, if there are any. BTW, ?split doesn't
>> mention what it does in such a situation. Perhaps it should?
>>
>> If this is to be "fixed", I suppose an additional argument, e.g.,
>> drop=TRUE, can be added, and the corresponding line mentioned
>> above changed to something like:
>>
>> if (drop || !is.factor(f)) f <- factor(f)
>>
>> Then this additional argument can be passed on from boxplot.formula() to
>> split().
PD> Alternatively, I suspect that the intention was as.factor() rather
PD> than factor().
At first I thought Peter was right, but the real source of
split.default contains a comment (!) and that line is
    f <- factor(f) # drop extraneous levels
so it seems this was done there very much on purpose.
OTOH, S(-plus) has implemented it quite a bit differently, and actually
does keep the empty levels in the example
f <- factor(rep(1:3, each=6), levels=1:5); y <- rnorm(f); split(y, f)
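The factor() versus as.factor() distinction that Peter raises can be checked directly. The following is only a small sketch using the same f as in the example above; it shows why factor(f) loses the empty levels while as.factor(f) would keep them:

```r
# Same factor as in the example: levels 4 and 5 are empty
f <- factor(rep(1:3, each = 6), levels = 1:5)

# factor() rebuilds the factor from the values present,
# so the unused levels disappear
lev_factor <- levels(factor(f))        # "1" "2" "3"

# as.factor() returns an existing factor unchanged,
# so the unused levels survive
lev_as_factor <- levels(as.factor(f))  # "1" "2" "3" "4" "5"

print(lev_factor)
print(lev_as_factor)
```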
PD> It does require a bit of care to fix it that way,
PD> though. There could be problems with empty levels popping up in
PD> unexpected places.
Indeed!
Given the new facts, I think we want to go in Andy's direction
with a new argument, 'drop'.
As Peter mentioned, the real question is about its default.
"drop = TRUE" would be fully compatible with previous versions of R.
"drop = FALSE" would be compatible with S and S-plus.
I'm going to implement it, and try to see if 'drop = FALSE'
gives changes for R and its standard packages; if 'yes', that
would be an indication that such an R-back-compatibility breaking
change was not a good idea. If 'no', I could commit it and see
if it has an effect on the CRAN packages....
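To make the two candidate defaults concrete, here is a minimal sketch of how a 'drop' argument could behave. It is not the split.default source: split_sketch is a hypothetical stand-in written only for illustration, built around the one-line change Andy suggested.

```r
# Hypothetical stand-in for split.default, for illustration only
split_sketch <- function(x, f, drop = FALSE) {
    # Andy's proposed line: only rebuild (and thereby prune) the
    # factor when asked to, or when f is not a factor yet
    if (drop || !is.factor(f)) f <- factor(f)
    out <- lapply(levels(f), function(lev) x[!is.na(f) & f == lev])
    names(out) <- levels(f)
    out
}

f <- factor(rep(1:3, each = 6), levels = 1:5)
y <- seq_along(f)

kept    <- split_sketch(y, f, drop = FALSE)  # S / S-plus behaviour
dropped <- split_sketch(y, f, drop = TRUE)   # previous R behaviour

names(kept)     # "1" "2" "3" "4" "5"
names(dropped)  # "1" "2" "3"
```

With drop = FALSE the components "4" and "5" are present but have length zero, which is what a caller such as boxplot() would need in order to show the empty groups.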
Of course, since split() and `split<-`() are S3 generics, and
since there's also unsplit(), this entails a whole slew of
changes {adding a "drop = FALSE" argument everywhere!}
and I presume will break the code of everyone who has written their own
split.foobar methods....
great...
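What this means for method authors can be sketched as follows. Everything here is hypothetical: split2 is a made-up generic standing in for split(), and "foobar" is the placeholder class name from the text. The point is only that a user-written method would have to grow a 'drop' argument and pass it along.

```r
# Hypothetical generic standing in for split(), illustration only
split2 <- function(x, f, drop = FALSE, ...) UseMethod("split2")

split2.default <- function(x, f, drop = FALSE, ...) {
    if (drop || !is.factor(f)) f <- factor(f)
    out <- lapply(levels(f), function(lev) x[!is.na(f) & f == lev])
    names(out) <- levels(f)
    out
}

# A user-written method now has to accept 'drop' and forward it
split2.foobar <- function(x, f, drop = FALSE, ...) {
    pieces <- split2.default(unclass(x), f, drop = drop, ...)
    lapply(pieces, structure, class = "foobar")
}

x <- structure(1:6, class = "foobar")
f <- factor(c(1, 1, 2, 2, 3, 3), levels = 1:4)

res_keep <- split2(x, f)               # four components, "4" is empty
res_drop <- split2(x, f, drop = TRUE)  # three components
```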
Martin