[Rd] (PR#7976) split() dropping levels (was "boxplot by factor")
Martin Maechler
maechler at stat.math.ethz.ch
Mon Jul 4 09:15:59 CEST 2005
[ Hmm, is everyone who is interested in changes inside R "sleeping",
uninterested, ... ? ]
>>>>> "MM" == Martin Maechler <maechler at stat.math.ethz.ch>
>>>>> on Fri, 1 Jul 2005 18:36:54 +0200 writes:
>>>>> "PD" == Peter Dalgaard <p.dalgaard at biostat.ku.dk>
>>>>> on 28 Jun 2005 14:57:42 +0200 writes:
PD> "Liaw, Andy" <andy_liaw at merck.com> writes:
>>> The issue is not with boxplot, but with split. boxplot.formula()
>>> calls boxplot(split(mf[[response]], mf[-response]), ...),
>>> but look at what split() returns when there are empty levels in
>>> the factor:
>>>
>>> > f <- factor(gl(3, 6), levels=1:5)
>>> > y <- rnorm(f)
>>> > split(y, f)
>>> $"1"
>>> [1] 0.4832124 1.1924811 0.3657797 1.7400198 0.5577356 0.9889520
>>>
>>> $"2"
>>> [1] -1.1296642 -0.4808355 -0.2789933 0.1220718 0.1287742 -0.7573801
>>>
>>> $"3"
>>> [1] 1.2320902 0.5090700 -1.5508074 2.1373780 1.1681297 -0.7151561
>>>
>>> The "culprit" is the following in split.default():
>>>
>>> f <- factor(f)
>>>
>>> which drops empty levels in f, if there are any. BTW, ?split doesn't
>>> mention what it does in such situation. Perhaps it should?
>>>
>>> If this is to be "fixed", I suppose an additional argument, e.g.,
>>> drop=TRUE, can be added, and the corresponding line mentioned
>>> above changed to something like:
>>>
>>> if (drop || !is.factor(f)) f <- factor(f)
>>>
>>> Then this additional argument can be passed on from boxplot.formula() to
>>> split().
PD> Alternatively, I suspect that the intention was as.factor() rather
PD> than factor().
MM> at first I thought Peter was right; but the real source of
MM> split.default contains a comment (!) and that line is
MM> f <- factor(f) # drop extraneous levels
MM> so it seems, this was done there very much on purpose.
MM> OTOH, S(-plus) has implemented it quite a bit differently, and actually
MM> does keep the empty levels in the example
MM> f <- factor(rep(1:3, each=6), levels=1:5); y <- rnorm(f); split(y, f)
PD> It does require a bit of care to fix it that way,
PD> though. There could be problems with empty levels popping up in
PD> unexpected places.
MM> Indeed!
MM> Given the new facts, I think we want to go in Andy's direction
MM> with a new argument, 'drop'
MM> As Peter mentioned, the real question is about its default.
MM> "drop = TRUE" would be fully compatible with previous versions of R.
MM> "drop = FALSE" would be compatible with S and S-plus.
MM> I'm going to implement it, and try to see if 'drop = FALSE'
MM> gives changes for R and its standard packages; if 'yes', that
MM> would be an indication that such an R-back-compatibility breaking
MM> change was not a good idea. If 'no', I could commit it and see
MM> if it has an effect on the CRAN packages....
MM> Of course, since split() and split()<- are S3 generics, and
MM> since there's also unsplit(), this entails a whole slew of
MM> changes {adding a "drop = FALSE" argument everywhere!}
MM> and I presume will break the code of everyone who has written their own
MM> split.foobar methods....
MM> great...
MM> Martin
The change doesn't seem to affect the "standard" packages at all
which is good. On CRAN, there seem to be only two packages that
have split() or split()<- methods, namely 'spatstat' and 'compositions'.
If we introduced the extra argument 'drop',
these and all other user code defining split methods would
have to be updated to be compatible with the changed (S3)
generic having an extra argument 'drop'.
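
To illustrate the compatibility concern, here is a minimal sketch using a
hypothetical generic 'mysplit' and a hypothetical class "foobar" (neither
exists in R or on CRAN; they only stand in for split() and a user-written
method). It shows that a method written without 'drop' still dispatches,
but fails as soon as a caller passes the new argument:

```r
# Hypothetical generic that has gained an extra 'drop' argument
mysplit <- function(x, f, drop = FALSE, ...) UseMethod("mysplit")

# Hypothetical method written before 'drop' existed: no 'drop', no '...'
mysplit.foobar <- function(x, f) "old method called"

obj <- structure(list(), class = "foobar")

# Calls that do not mention 'drop' still dispatch to the old method
ok <- mysplit(obj, 1)

# Passing 'drop' is forwarded to the method, which cannot accept it
res <- tryCatch(mysplit(obj, 1, drop = TRUE),
                error = function(e) e)
inherits(res, "error")   # the old method rejects the unused argument
```

So old methods would not all break immediately, but they would no longer
match the generic's argument list and would fail for any call using 'drop'.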
With this in mind, after more thought, I think that Peter's
initial proposal ---just replacing 'factor()' by 'as.factor()'
inside split--- seems to be nicer than introducing 'drop' and
*changing* the default behavior to 'drop = FALSE', for the
following reasons:
1) people who rely on the current behavior would have to change
their calls to split() anyway;
2) instead of calling
split(x, f, drop=TRUE)
they can as well go for
split(x, factor(f))
which has identical effect but does not introduce an extra
argument 'drop'.
3) advantage of slightly higher compatibility with S
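
The difference behind these points can be sketched as follows. This is only
an illustration of factor() versus as.factor() on a factor with empty
levels, not the source of split.default(); the integer vector 'y' replaces
the rnorm() data of the earlier example so that the result is reproducible:

```r
# Factor with two empty levels ("4" and "5"), as in the example above
f <- factor(rep(1:3, each = 6), levels = 1:5)
y <- seq_along(f)

# factor() rebuilds the levels from the values present: empty ones vanish
lev_factor <- levels(factor(f))

# as.factor() returns an existing factor unchanged: all levels are kept
lev_as_factor <- levels(as.factor(f))

# Dropping empty levels explicitly at the call site, as in reason 2)
s <- split(y, factor(f))
names(s)
```

Here 'lev_factor' holds the three levels in use, 'lev_as_factor' holds all
five, and split(y, factor(f)) returns three components whichever of the two
functions split() uses internally.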
---
I intend to change this in R-devel
{with appropriate notes in NEWS !} during this week, unless
someone finds good reasons for a different (or no) change.
Martin