[R] Very slow using S4 classes
Martin Morgan
mtmorgan at fhcrc.org
Mon Sep 12 18:37:56 CEST 2011
Hi André...
On 09/12/2011 07:20 AM, André Rossi wrote:
> Dear Martin Morgan and Martin Maechler...
>
> Here is an example of the computational time when a slot of a S4 class
> is of another S4 class and when it is just one object. I'm sending you
> the data file.
>
> Thank you!
>
> Best regards,
>
> André Rossi
>
> ############################################################
>
> setClass("SupervisedExample",
> representation(
> attr.value = "ANY",
> target.value = "ANY"
> ))
>
> setClass("StreamBuffer",
> representation=representation(
> examples = "list", #SupervisedExample
> max.length = "integer"
> ),
> prototype=list(
> max.length = as.integer(10000)
> )
> )
> b <- new("StreamBuffer")
>
> load("~/Dropbox/dataList2.RData")
For a reproducible example, I guess you have something like
data <- replicate(10000, new("SupervisedExample"))
> b at examples <- data #data is a list of SupervisedExample class.
>
> > system.time({for (i in 1:100) b at examples[[1]]@attr.value[1] = 2 })
Yes, this is slow. [[<-,S4 is not as clever as [[<-,list and performs
extra duplication, including those 10,000 S4 objects it contains.
As before, an improvement is to think in terms of vectors, maybe a
'SupervisedExamples' class to act as a collection of examples
setClass("SupervisedExamples",
representation=representation(
attr.value = "list",
target.value = "list"))
setClass("StreamBuffer",
representation=representation(
examples="SupervisedExamples"))
SupervisedExamples <-
function(attr.value=vector("list", n),
target.value=vector("list", n), n, ...)
{
new("SupervisedExamples", attr.value=attr.value,
target.value=target.value, ...)
}
StreamBuffer <-
function(examples, ...)
{
new("StreamBuffer", examples=examples, ...)
}
data <- SupervisedExamples(n=100000)
b <- StreamBuffer(data)
I then have
> system.time({for (i in 1:100) data at attr.value[[1]] = 2 })
user system elapsed
1.081 0.013 1.094
> system.time({for (i in 1:100) b at examples@attr.value[[1]] <- 2})
user system elapsed
4.283 0.000 4.295
(note the 10x increase in size); still slower, but this will be
amortized when the updates are vectorized, e.g.,
> idx = sample(length(b at examples@attr.value), 100)
> system.time(b at examples@attr.value[idx] <- list(2))
user system elapsed
0.013 0.000 0.014
A further change might be to recognize 'StreamBuffer' as an abstract
class that SupervisedExamples extends
setClass("StreamBuffer",
representation=representation(
"VIRTUAL", max.len="integer"),
prototype=prototype(max.len=100000L),
validity=function(object) {
if (object at max.len < length(object))
"too many elements"
else TRUE
})
setMethod(length, "StreamBuffer", function(x) {
stop("'length' undefined on '", class(x), "'")
})
setClass("SupervisedExamples",
representation=representation(
attr.value = "list",
target.value = "list"),
contains="StreamBuffer")
setMethod(length, "SupervisedExamples", function(x) {
length(x at attr.value)
})
SupervisedExamples <-
function(attr.value=vector("list", n),
target.value=vector("list", n), n, ...)
{
new("SupervisedExamples", attr.value=attr.value,
target.value=target.value, ...)
}
data <- SupervisedExamples(n=100000)
> system.time({for (i in 1:100) data at attr.value[[1]] = 2 })
user system elapsed
1.043 0.014 1.061
Martin Morgan
> user system elapsed
> 16.837 0.108 18.244
>
> > system.time({for (i in 1:100) data[[1]]@attr.value[1] = 2 })
> user system elapsed
> 0.024 0.000 0.026
>
> ############################################################
>
>
> 2011/9/10 Martin Morgan <mtmorgan at fhcrc.org <mailto:mtmorgan at fhcrc.org>>
>
> On 09/10/2011 08:08 AM, André Rossi wrote:
>
> Hi everybody!
>
> I'm creating an object of a S4 class that has two slots:
> ListExamples, which
> is a list, and idx, which is an integer (as the code below).
>
> Then, I read a data.frame file with 10000 (ten thousands) of
> lines and 10
> columns, do some pre-processing and, basically, I store each
> line as an
> element of a list in the slot ListExamples of the S4 object.
> However, many
> operations after this take a considerable time.
>
> Can anyone explain me why dois it happen? Is it possible to
> speed up an
> script that deals with a big number of data (it might be
> data.frame or
> list)?
>
> Thank you,
>
> André Rossi
>
> setClass("Buffer",
> representation=representation(
> Listexamples = "list",
> idx = "integer"
> )
> )
>
>
> Hi André,
>
> Can you provide a simpler and more reproducible example, for instance
>
> > setClass("Buf", representation=representation(__lst="list"))
> [1] "Buf"
> > b=new("Buf", lst=replicate(10000, list(10), simplify=FALSE))
> > system.time({ b at lst[[1]][[1]] = 2 })
> user system elapsed
> 0.005 0.000 0.005
>
> Generally it sounds like you're modeling the rows as elements of
> Listofelements, but you're better served by modeling the columns
> (lst = replicate(10, integer(10000)), if all of your 10 columns were
> integer-valued, for instance). Also, S4 is providing some measure of
> type safety, and you're undermining that by having your class
> contain a 'list'. I'd go after
>
> setClass("Buffer",
> representation=representation(
> col1="integer",
> col2="character",
> col3="numeric"
> ## etc.
> ),
> validity=function(object) {
> nms <- slotNames(object)
> len <- sapply(nms, function(nm) length(slot(object, nm)))
> if (1L != length(unique(len)))
> "slots must all be of same length"
> else TRUE
> })
>
> Buffer <-
> function(col1, col2, col3, ...)
> {
> new("Buffer", col1=col1, col2=col2, col3=col3, ...)
> }
>
> Let's see where the inefficiencies are before deciding that this is
> an S4 issue.
>
> Martin
>
>
>
> [[alternative HTML version deleted]]
>
>
>
>
> ________________________________________________
> R-help at r-project.org <mailto:R-help at r-project.org> mailing list
> https://stat.ethz.ch/mailman/__listinfo/r-help
> <https://stat.ethz.ch/mailman/listinfo/r-help>
> PLEASE do read the posting guide
> http://www.R-project.org/__posting-guide.html
> <http://www.R-project.org/posting-guide.html>
> and provide commented, minimal, self-contained, reproducible code.
>
>
>
> --
> Computational Biology
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>
> Location: M1-B861
> Telephone: 206 667-2793 <tel:206%20667-2793>
>
>
--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793
More information about the R-help
mailing list