[Rd] (PR#7976) split() dropping levels (was "boxplot by factor")
Martin Maechler
maechler at stat.math.ethz.ch
Wed Jul 13 18:08:11 CEST 2005
I have now committed the new split(x, f, drop = FALSE) to R-devel,
together with the corresponding changes to the split(), "split<-"
and unsplit() functions and methods. The new behavior is not
backward compatible, but it is consistent with factor indexing
(and with S-plus).
This automatically fixes the original poster's "boxplot by
factor" bug.
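
As a small illustration of the new default (a sketch only; the seed and
the variable names are arbitrary choices, not part of the commit), empty
levels now show up as zero-length components:

```r
set.seed(1)                                    # arbitrary seed, for reproducibility only
f <- factor(rep(1:3, each = 6), levels = 1:5)  # levels "4" and "5" are unused
y <- rnorm(length(f))

s <- split(y, f)     # drop = FALSE is the default: empty levels are kept
names(s)             # "1" "2" "3" "4" "5"
sapply(s, length)    # 6 6 6 0 0
```

Since boxplot.formula() goes through split(), the groups for the unused
levels are now passed along as empty components instead of disappearing.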
>>>>> "MM" == Martin Maechler <maechler at stat.math.ethz.ch>
>>>>> on Mon, 4 Jul 2005 09:15:59 +0200 writes:
>>>>> "PD" == Peter Dalgaard <p.dalgaard at biostat.ku.dk>
>>>>> on 28 Jun 2005 14:57:42 +0200 writes:
PD> "Liaw, Andy" <andy_liaw at merck.com> writes:
>>>> The issue is not with boxplot, but with split. boxplot.formula()
>>>> calls boxplot(split(mf[[response]], mf[-response]), ...),
>>>> but look at what split() returns when there are empty levels in
>>>> the factor:
>>>>
>>>> > f <- factor(gl(3, 6), levels=1:5)
>>>> > y <- rnorm(f)
>>>> > split(y, f)
>>>> $"1"
>>>> [1] 0.4832124 1.1924811 0.3657797 1.7400198 0.5577356 0.9889520
>>>>
>>>> $"2"
>>>> [1] -1.1296642 -0.4808355 -0.2789933 0.1220718 0.1287742 -0.7573801
>>>>
>>>> $"3"
>>>> [1] 1.2320902 0.5090700 -1.5508074 2.1373780 1.1681297 -0.7151561
>>>>
>>>> The "culprit" is the following in split.default():
>>>>
>>>> f <- factor(f)
>>>>
>>>> which drops empty levels in f, if there are any. BTW, ?split doesn't
>>>> mention what it does in such a situation. Perhaps it should?
>>>>
>>>> If this is to be "fixed", I suppose an additional argument, e.g.,
>>>> drop=TRUE, can be added, and the corresponding line mentioned
>>>> above changed to something like:
>>>>
>>>> if (drop || !is.factor(f)) f <- factor(f)
>>>>
>>>> Then this additional argument can be passed on from boxplot.formula() to
>>>> split().
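
To make the suggestion concrete, here is a stand-alone sketch of it.
split_sketch() is a hypothetical name used only for illustration; it is
not the code in split.default(), and it handles only atomic vectors:

```r
# Hypothetical helper: mimics the proposed 'drop' logic for plain vectors.
split_sketch <- function(x, f, drop = TRUE) {
    # Re-create the factor (dropping unused levels) only when asked to,
    # or when f is not a factor yet.
    if (drop || !is.factor(f)) f <- factor(f)
    out <- lapply(levels(f), function(l) x[which(f == l)])
    names(out) <- levels(f)
    out
}

f <- factor(rep(1:3, each = 6), levels = 1:5)
y <- seq_along(f)
length(split_sketch(y, f))                 # 3
length(split_sketch(y, f, drop = FALSE))   # 5
```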
PD> Alternatively, I suspect that the intention was as.factor() rather
PD> than factor().
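
The difference between the two calls is easy to check on a factor with
an unused level (a minimal example; the data are made up):

```r
f <- factor(c("a", "b", "a"), levels = c("a", "b", "c"))  # "c" is unused

levels(factor(f))      # "a" "b"      : factor() rebuilds and drops "c"
levels(as.factor(f))   # "a" "b" "c"  : as.factor() returns f unchanged
```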
MM> at first I thought Peter was right; but the real source of
MM> split.default contains a comment (!) and that line is
MM> f <- factor(f) # drop extraneous levels
MM> so it seems, this was done there very much on purpose.
MM> OTOH, S(-plus) has implemented it quite a bit differently, and actually
MM> does keep the empty levels in the example
MM> f <- factor(rep(1:3, each=6), levels=1:5); y <- rnorm(f); split(y, f)
PD> It does require a bit of care to fix it that way,
PD> though. There could be problems with empty levels popping up in
PD> unexpected places.
MM> Indeed!
MM> Given the new facts, I think we want to go in Andy's direction
MM> with a new argument, 'drop'
MM> As Peter mentioned, the real question is about its default.
MM> "drop = TRUE" would be fully compatible with previous versions of R.
MM> "drop = FALSE" would be compatible with S and S-plus.
MM> I'm going to implement it, and try to see if 'drop = FALSE'
MM> gives changes for R and its standard packages; if 'yes', that
MM> would be an indication that such an R-back-compatibility breaking
MM> change was not a good idea. If 'no', I could commit it and see
MM> if it has an effect on the CRAN packages....
MM> Of course, since split() and split()<- are S3 generics, and
MM> since there's also unsplit(), this entails a whole slew of
MM> changes {adding a "drop = FALSE" argument everywhere!}
MM> and I presume will break the code of everyone who has written
MM> their own split.foobar methods....
MM> great...
MM> Martin
MM> The change doesn't seem to affect the "standard" packages at all
MM> which is good. On CRAN, it seems there are two packages only that
MM> have split() or split()<- methods, namely 'spatstat' and 'compositions'.
MM> If we introduced the extra argument 'drop',
MM> these and every other user code defining split methods would
MM> have to be updated to be compatible with the changed (S3)
MM> generic having an extra argument 'drop'.
MM> With this in mind, after more thought, I think that Peter's
MM> initial proposal ---just replacing 'factor()' by 'as.factor()'
MM> inside split--- seems to be nicer than introducing 'drop' and
MM> *changing* the default behavior to 'drop = FALSE' for the
MM> following reasons :
MM> 1) people who rely on the current behavior would have to change
MM> their calls to split() anyway;
MM> 2) instead of calling
MM> split(x, f, drop=TRUE)
MM> they can as well go for
MM> split(x, factor(f))
MM> which has identical effect but does not introduce an extra
MM> argument 'drop'.
MM> 3) advantage of slightly higher compatibility with S
MM> ---
MM> I intend to change this in R-devel
MM> {with appropriate notes in NEWS !} during this week, unless
MM> someone finds good reasons for a different (or no) change.
MM> Martin
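
For code that relied on the old behavior, the two spellings mentioned in
point 2) can be compared directly. This is a sketch assuming a split()
that has the 'drop' argument, as in the version committed to R-devel;
the data are made up:

```r
f <- factor(rep(1:3, each = 6), levels = 1:5)  # levels "4" and "5" are unused
y <- seq_along(f)

a <- split(y, f, drop = TRUE)   # drop the unused levels via the new argument
b <- split(y, factor(f))        # drop them beforehand with factor()

identical(a, b)   # TRUE
names(a)          # "1" "2" "3"
```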