[R] Very slow using S4 classes

Mon Sep 12 18:37:56 CEST 2011

Hi André...

On 09/12/2011 07:20 AM, André Rossi wrote:
> Dear Martin Morgan and Martin Maechler...
>
> Here is an example of the computational time when a slot of a S4 class
> is of another S4 class and when it is just one object. I'm sending you
> the data file.
>
> Thank you!
>
> Best regards,
>
> André Rossi
>
> ############################################################
>
> setClass("SupervisedExample",
>      representation(
>          attr.value = "ANY",
>          target.value = "ANY"
> ))
>
> setClass("StreamBuffer",
>      representation=representation(
>          examples = "list", #SupervisedExample
>          max.length = "integer"
>      ),
>      prototype=list(
>              max.length = as.integer(10000)
>      )
> )
> b <- new("StreamBuffer")
>
> load("~/Dropbox/dataList2.RData")

For a reproducible example, I guess you have something like

   data <- replicate(10000, new("SupervisedExample"))

> b at examples <- data #data is a list of SupervisedExample class.
>
>  > system.time({for (i in 1:100) b at examples[[1]]@attr.value[1] = 2 })

Yes, this is slow. [[<-,S4 is not as clever as [[<-,list and performs 
extra duplication, including those 10,000 S4 objects it contains.

As before, an improvement is to think in terms of vectors, maybe a 
'SupervisedExamples' class to act as a collection of examples

setClass("SupervisedExamples",
          representation=representation(
            attr.value = "list",
            target.value = "list"))

setClass("StreamBuffer",
          representation=representation(
            examples="SupervisedExamples"))

SupervisedExamples <-
     function(attr.value=vector("list", n),
              target.value=vector("list", n), n, ...)
{
     new("SupervisedExamples", attr.value=attr.value,
         target.value=target.value, ...)
}

StreamBuffer <-
     function(examples, ...)
{
     new("StreamBuffer", examples=examples, ...)
}

data <- SupervisedExamples(n=100000)

b <- StreamBuffer(data)

I then have

 > system.time({for (i in 1:100) data at attr.value[[1]] = 2 })
    user  system elapsed
   1.081   0.013   1.094
 > system.time({for (i in 1:100) b at examples@attr.value[[1]] <- 2})
    user  system elapsed
   4.283   0.000   4.295

(note the 10x increase in size); still slower, but this will be 
amortized when the updates are vectorized, e.g.,

 > idx = sample(length(b at examples@attr.value), 100)
 > system.time(b at examples@attr.value[idx] <- list(2))
    user  system elapsed
   0.013   0.000   0.014

A further change might be to recognize 'StreamBuffer' as an abstract 
class that SupervisedExamples extends

setClass("StreamBuffer",
          representation=representation(
            "VIRTUAL", max.len="integer"),
          prototype=prototype(max.len=100000L),
          validity=function(object) {
              if (object at max.len < length(object))
                  "too many elements"
              else TRUE
          })

setMethod(length, "StreamBuffer", function(x) {
     stop("'length' undefined on '", class(x), "'")
})

setClass("SupervisedExamples",
          representation=representation(
            attr.value = "list",
            target.value = "list"),
          contains="StreamBuffer")

setMethod(length, "SupervisedExamples", function(x) {
     length(x at attr.value)
})

SupervisedExamples <-
     function(attr.value=vector("list", n),
              target.value=vector("list", n), n, ...)
{
     new("SupervisedExamples", attr.value=attr.value,
         target.value=target.value, ...)
}

data <- SupervisedExamples(n=100000)

 > system.time({for (i in 1:100) data at attr.value[[1]] = 2 })
    user  system elapsed
   1.043   0.014   1.061

Martin Morgan

>     user  system elapsed
>   16.837   0.108  18.244
>
>  > system.time({for (i in 1:100) data[[1]]@attr.value[1] = 2 })
>     user  system elapsed
>    0.024   0.000   0.026
>
> ############################################################
>
>
> 2011/9/10 Martin Morgan <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>>
>
>     On 09/10/2011 08:08 AM, André Rossi wrote:
>
>         Hi everybody!
>
>         I'm creating an object of a S4 class that has two slots:
>         ListExamples, which
>         is a list, and idx, which is an integer (as the code below).
>
>         Then, I read a data.frame file with 10000 (ten thousands) of
>         lines and 10
>         columns, do some pre-processing and, basically, I store each
>         line as an
>         element of a list in the slot ListExamples of the S4 object.
>         However, many
>         operations after this take a considerable time.
>
>         Can anyone explain me why dois it happen? Is it possible to
>         speed up an
>         script that deals with a big number of data (it might be
>         data.frame or
>         list)?
>
>         Thank you,
>
>         André Rossi
>
>         setClass("Buffer",
>              representation=representation(
>                  Listexamples = "list",
>                  idx = "integer"
>              )
>         )
>
>
>     Hi André,
>
>     Can you provide a simpler and more reproducible example, for instance
>
>      > setClass("Buf", representation=representation(__lst="list"))
>     [1] "Buf"
>      > b=new("Buf", lst=replicate(10000, list(10), simplify=FALSE))
>      > system.time({ b at lst[[1]][[1]] = 2 })
>        user  system elapsed
>       0.005   0.000   0.005
>
>     Generally it sounds like you're modeling the rows as elements of
>     Listofelements, but you're better served by modeling the columns
>     (lst = replicate(10, integer(10000)), if all of your 10 columns were
>     integer-valued, for instance). Also, S4 is providing some measure of
>     type safety, and you're undermining that by having your class
>     contain a 'list'. I'd go after
>
>     setClass("Buffer",
>              representation=representation(
>                col1="integer",
>                col2="character",
>                col3="numeric"
>                ## etc.
>                ),
>              validity=function(object) {
>                  nms <- slotNames(object)
>                  len <- sapply(nms, function(nm) length(slot(object, nm)))
>                  if (1L != length(unique(len)))
>     "slots must all be of same length"
>                  else TRUE
>              })
>
>     Buffer <-
>         function(col1, col2, col3, ...)
>     {
>         new("Buffer", col1=col1, col2=col2, col3=col3, ...)
>     }
>
>     Let's see where the inefficiencies are before deciding that this is
>     an S4 issue.
>
>     Martin
>
>
>
>                 [[alternative HTML version deleted]]
>
>
>
>
>         ________________________________________________
>         R-help at r-project.org <mailto:R-help at r-project.org> mailing list
>         https://stat.ethz.ch/mailman/__listinfo/r-help
>         <https://stat.ethz.ch/mailman/listinfo/r-help>
>         PLEASE do read the posting guide
>         http://www.R-project.org/__posting-guide.html
>         <http://www.R-project.org/posting-guide.html>
>         and provide commented, minimal, self-contained, reproducible code.
>
>
>
>     --
>     Computational Biology
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>
>     Location: M1-B861
>     Telephone: 206 667-2793 <tel:206%20667-2793>
>
>

-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793