[R] by inconsistently strips class - with fix

Thu Apr 17 12:54:11 CEST 2008

On Thu, 17 Apr 2008, Alex Brown wrote:

> Adding a simplify argument to by would suit me fine.
>
> In my (limited) experience in using R, the automatic simplification that R 
> does in various situations is one of its most troublesome features.  It 
> means that I cannot expect a program to work even if I give it data of the 
> same types as I always have before; any time a dimension is reduced to 1 bad 
> things happen.
>
> Is there a master switch I can set so dropping never happens automatically?

No, and you would break a lot of code with such a switch.  That is why we 
are very much against having global options.

> Can you please have an option that by reads so I can indicate that by should 
> never drop/simplify?

No, as it will break lots of other people's code.  You can have your own 
version, and then namespaces will protect other code from your changes.
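
As a rough illustration of "your own version", something along these lines 
might do for the vector case.  The name by_nosimplify is made up for this 
sketch and is not part of R; it assumes that passing simplify = FALSE to 
tapply() is all that is wanted, and it does not cover data frames.

```r
# Hypothetical helper (not in base R): a by()-like wrapper for vectors that
# never simplifies, so each group's result keeps its class.
by_nosimplify <- function(data, INDICES, FUN, ...) {
    FUN <- match.fun(FUN)
    # Name the grouping so the result carries named dimnames, as by() does.
    if (!is.list(INDICES)) INDICES <- list(INDICES = INDICES)
    # simplify = FALSE makes tapply() return a list array instead of
    # calling unlist() when every group yields a length-one result.
    ans <- tapply(seq_along(data), INDICES,
                  function(i) FUN(data[i], ...), simplify = FALSE)
    class(ans) <- "by"
    ans
}

mytimes <- data.frame(date = 1:3 + Sys.time(), set = c(1, 1, 2))
res <- by_nosimplify(mytimes$date[1], mytimes$set[1], function(x) x)
inherits(res[[1]], "POSIXct")   # TRUE: the class survives with one group
```

Because the function lives in your own workspace or package, code that 
calls base::by() is unaffected.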

>
> -Alex
>
> On 17 Apr 2008, at 07:03, Prof Brian Ripley wrote:
>
>> Unfortunately your proposed change changes the type of the output: 
>> simplification is intended in many applications of by().
>> 
>> Before:
>> 
>>> str(by(mytimes$date[1], mytimes$set[1], function(x)x))
>> by [, 1] 1.21e+09
>> - attr(*, "dimnames")=List of 1
>> ..$ mytimes$set[1]: chr "1"
>> - attr(*, "call")= language by.default(data = mytimes$date[1], INDICES = 
>> mytimes$set[1],      FUN = function(x) x)
>> 
>> After:
>> 
>>> str(by(mytimes$date[1], mytimes$set[1], function(x)x))
>> List of 1
>> $ 1: POSIXct[1:1], format: "2008-04-17 06:53:31"
>> - attr(*, "dim")= int 1
>> - attr(*, "dimnames")=List of 1
>> ..$ mytimes$set[1]: chr "1"
>> - attr(*, "call")= language by.default(data = mytimes$date[1], INDICES = 
>> mytimes$set[1],      FUN = function(x) x)
>> - attr(*, "class")= chr "by"
>> 
>> c() does not do the same thing as unlist() in general, and it is untrue 
>> that 'c does not strip class'.  What happens in your example is that there 
>> is a c() method for your class (and not many others).
>> 
>> What we could do is to add a 'simplify' argument to by() so you can 
>> control the simplification.
>> 
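
The difference between c() and unlist() described above can be checked 
with a small sketch.  It assumes a date-time example like the one in the 
original post; "myclass" is an invented class used only to show what 
happens when no c() method exists.

```r
x <- list(Sys.time(), Sys.time() + 1)

# unlist() drops the attributes, so only the underlying numbers remain.
class(unlist(x))        # "numeric"

# c() dispatches on its first argument and finds the c() method for
# date-times, which rebuilds the class.
class(do.call(c, x))    # "POSIXct" "POSIXt"

# With no c() method for the class, c() strips the class as well.
y <- structure(1, class = "myclass")
class(c(y, y))          # "numeric"
```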
>> 
>> On Tue, 15 Apr 2008, Alex Brown wrote:
>> 
>>> summary:
>>> 
>>> The function 'by' inconsistently strips class from the data to which
>>> it is applied.
>>> 
>>> quick reason:
>>> 
>>> tapply strips class when simplify is set to TRUE (the default) due to
>>> the class stripping behaviour of unlist.
>>> 
>>> quick answer:
>>> 
>>> This can be fixed by invoking tapply with simplify=FALSE, or changing
>>> tapply to use do.call(c instead of unlist
>>> 
>>> executable example:
>>> 
>>> mytimes=data.frame(date = 1:3 + Sys.time(), set = c(1,1,2))
>>> 
>>> by(mytimes$date, mytimes$set, function(x)x)
>>> 
>>> INDICES: 1
>>> [1] "2008-04-15 11:41:38 BST" "2008-04-15 11:41:39 BST"
>>> ----------------------------------------------------------------------------------------
>>> INDICES: 2
>>> [1] "2008-04-15 11:41:40 BST"
>>> 
>>> by(mytimes[1,]$date, mytimes[1,]$set, function(x)x)
>>> 
>>> INDICES: 1
>>> [1] 1208256099
>>> 
>>> why this is a problem:
>>> 
>>> This is a problem when you are feeding the output of this by into a
>>> function which expects the class to be maintained.  I see this problem
>>> when constructing
>>> 
>>> reason:
>>> 
>>> tapply strips class when simplify is set to TRUE (the default) due to
>>> the behaviour of unlist:
>>> 
>>> "Where possible the list elements are coerced to a common mode during
>>> the unlisting, and so the result often ends up as a character vector.
>>> Vectors will be coerced to the highest type of the components in the
>>> hierarchy NULL < raw < logical < integer < real < complex < character
>>> < list < expression: pairlists are treated as lists."
>>> 
>>> solution:
>>> 
>>> This problem can be fixed in the function by.data.frame by modifying
>>> the call to tapply in the function "by":
>>> 
>>> by.data.frame = function (data, INDICES, FUN, ...)
>>> {
>>> if (!is.list(INDICES)) {
>>>    IND <- vector("list", 1)
>>>    IND[[1]] <- INDICES
>>>    names(IND) <- deparse(substitute(INDICES))[1]
>>> }
>>> else IND <- INDICES
>>> FUNx <- function(x) FUN(data[x, ], ...)
>>> nd <- nrow(data)
>>> <<<<
>>> ans <- eval(substitute(tapply(1:nd, IND, FUNx)), data)
>>> ====
>>> ans <- eval(substitute(tapply(1:nd, IND, FUNx, simplify=FALSE)),
>>> data)
>>>>>>> 
>>> attr(ans, "call") <- match.call()
>>> class(ans) <- "by"
>>> ans
>>> }
>>> 
>>> alternative solution:
>>> 
>>> the call in tapply to unlist(ans, recursive=F) can be replaced by
>>> do.call(c, ans) to fix this issue, since c does not strip
>>> class.
>>> 
>>> However, I haven't taken the time to work out if this will work in all
>>> cases.
>>> 
>>> for example:
>>> 
>>> function (X, INDEX, FUN = NULL, ..., simplify = TRUE)
>>> {
>>> FUN <- if (!is.null(FUN))
>>>    match.fun(FUN)
>>> if (!is.list(INDEX))
>>>    INDEX <- list(INDEX)
>>> nI <- length(INDEX)
>>> namelist <- vector("list", nI)
>>> names(namelist) <- names(INDEX)
>>> extent <- integer(nI)
>>> nx <- length(X)
>>> one <- 1L
>>> group <- rep.int(one, nx)
>>> ngroup <- one
>>> for (i in seq.int(INDEX)) {
>>>    index <- as.factor(INDEX[[i]])
>>>    if (length(index) != nx)
>>>        stop("arguments must have same length")
>>>    namelist[[i]] <- levels(index)
>>>    extent[i] <- nlevels(index)
>>>    group <- group + ngroup * (as.integer(index) - one)
>>>    ngroup <- ngroup * nlevels(index)
>>> }
>>> if (is.null(FUN))
>>>    return(group)
>>> ans <- lapply(split(X, group), FUN, ...)
>>> index <- as.integer(names(ans))
>>> if (simplify && all(unlist(lapply(ans, length)) == 1)) {
>>>    ansmat <- array(dim = extent, dimnames = namelist)
>>> <<<<
>>>    ans <- unlist(ans, recursive = FALSE)
>>> ====
>>>    ans <- do.call(c, ans)
>>>>>>> 
>>> }
>>> else {
>>>    ansmat <- array(vector("list", prod(extent)), dim = extent,
>>>        dimnames = namelist)
>>> }
>>> if (length(index)) {
>>>    names(ans) <- NULL
>>>    ansmat[index] <- ans
>>> }
>>> ansmat
>>> }
>>> 
>>> Alexander Brown
>>> Principal Engineer
>>> Transitive
>>> Maybrook House, 40 Blackfriars Street, Manchester M3 2EG
>>> Phone: +44 (0)161 836 2321    Fax: +44 (0)161 836 2399    Mobile: +44
>>> (0)7980 708 221
>>> www.transitive.com
>>> * The leader in cross-platform virtualization
>>> 

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595