[Rd] (PR#7976) split() dropping levels (was "boxplot by factor")

Wed Jul 13 18:08:11 CEST 2005

I have now committed the new     split(x, f, drop = FALSE)
to R-devel --- entailing non-backward compatible behavior,
but consistency with factor indexing (and with S-plus) ---
split() and "split<-" and unsplit() functions and methods to
R-devel.

This does automatically fix the original posters "boxplot by
factor" bug.

>>>>> "MM" == Martin Maechler <maechler at stat.math.ethz.ch>
>>>>>     on Mon, 4 Jul 2005 09:15:59 +0200 writes:

>>>>> "PD" == Peter Dalgaard <p.dalgaard at biostat.ku.dk>
>>>>>     on 28 Jun 2005 14:57:42 +0200 writes:

    PD> "Liaw, Andy" <andy_liaw at merck.com> writes:
    >>>> The issue is not with boxplot, but with split.  boxplot.formula() 
    >>>> calls boxplot(split(split(mf[[response]], mf[-response]), ...), 
    >>>> but look at what split() returns when there are empty levels in
    >>>> the factor:
    >>>> 
    >>>> > f <- factor(gl(3, 6), levels=1:5)
    >>>> > y <- rnorm(f)
    >>>> > split(y, f)
    >>>> $"1"
    >>>> [1] 0.4832124 1.1924811 0.3657797 1.7400198 0.5577356 0.9889520
    >>>> 
    >>>> $"2"
    >>>> [1] -1.1296642 -0.4808355 -0.2789933  0.1220718  0.1287742 -0.7573801
    >>>> 
    >>>> $"3"
    >>>> [1]  1.2320902  0.5090700 -1.5508074  2.1373780  1.1681297 -0.7151561
    >>>> 
    >>>> The "culprit" is the following in split.default():
    >>>> 
    >>>> f <- factor(f)
    >>>> 
    >>>> which drops empty levels in f, if there are any.  BTW, ?split doesn't
    >>>> mention what it does in such situation.  Perhaps it should?
    >>>> 
    >>>> If this is to be "fixed", I suppose an additional argument, e.g.,
    >>>> drop=TRUE, can be added, and the corresponding line mentioned
    >>>> above changed to something like:
    >>>> 
    >>>> if (drop || !is.factor(f)) f <- factor(f)
    >>>> 
    >>>> Then this additional argument can be pass on from boxplot.formula() to 
    >>>> split().

    PD> Alternatively, I suspect that the intention was as.factor() rather
    PD> than factor(). 

    MM> at first I thought Peter was right; but the real source of
    MM> split.default contains a comment (!) and that line is

    MM> f <- factor(f) # drop extraneous levels

    MM> so it seems, this was done there very much on purpose.    
    MM> OTOH, S(-plus) has implemented it quite a bit differently, and actually
    MM> does keep the empty levels in the example

    MM> f <- factor(rep(1:3, each=6), levels=1:5); y <- rnorm(f); split(y, f)

    PD> It does require a bit of care to fix it that way,
    PD> though. There could be problems with empty levels popping up in
    PD> unexpected places. 

    MM> Indeed!
    MM> Given the new facts, I think we want to go in Andy's direction
    MM> with a new argument, 'drop'

    MM> A Peter mentioned, the real question is about its default.
    MM> "drop = TRUE"   would be fully compatible with previous versions of R.
    MM> "drop = FALSE"  would be compatible with S and S-plus.

    MM> I'm going to implement it, and try to see if 'drop = FALSE'
    MM> gives changes for R and its standard packages;  if 'yes', that
    MM> would be an indication that such a R-back-compatibility breaking
    MM> change was not a good idea.  If 'no', I could commit it and see
    MM> if it has an effect on the CRAN packages....

    MM> Of course, since split() and split()<- are S3 generics, and
    MM> since there's also unsplit(),  this entails a whole slew of
    MM> changes {adding a "drop = FALSE" argument everywhere!}
    MM> and I presume will break everyone's code who has written own
    MM> split.foobar methods....

    MM> great...

    MM> Martin

    MM> The change doesn't seem to affect the "standard" packages at all
    MM> which is good.  On CRAN, it seems there are two packages only that
    MM> have  split() or split()<-  methods,  namely 'spatstat' and 'compositions'.

    MM> If we introduced the extra argument 'drop', 
    MM> these and every other user code defining split methods would
    MM> have to be updated to be compatible with the changed (S3)
    MM> generic having an extra argument 'drop'.

    MM> With this in mind, after more thought, I think that Peter's
    MM> initial proposal ---just replacing 'factor()' by 'as.factor()'
    MM> inside split--- seems to be nicer than introducing 'drop' and
    MM> *change* the default behavior to  'drop = FALSE' for the
    MM> following reasons : 

    MM> 1) people who rely on the current behavior would have to change
    MM> their calls to split() anyway;

    MM> 2) instead of calling  
    MM> split(x, f, drop=TRUE)
    MM> they can as well go for
    MM> split(x, factor(f)) 
    MM> which has identical effect but does not introduce an extra
    MM> argument 'drop'.

    MM> 3) advantage of slightly higher compatibility with S

    MM> ---

    MM> I intend to change this in R-devel
    MM> {with appropriate notes in NEWS !} during this week, unless
    MM> someone finds good reasons for a different (or no) change.

    MM> Martin

    MM> ______________________________________________
    MM> R-devel at r-project.org mailing list
    MM> https://stat.ethz.ch/mailman/listinfo/r-devel

    MM> !DSPAM:42c8e272288132092019954!