[R] SLOW split() function

ivo welch ivo.welch at gmail.com
Tue Oct 11 17:56:51 CEST 2011


thanks, josh.  in my posting example, I did not need anything except
the coefficients.  (when this is the case, I usually do not even use
lm.fit; I eliminate all missing obs first and then solve the normal
equations directly, solve(crossprod(cbind(1,x)), crossprod(cbind(1,x), y)).
this is pretty fast.)

alas, I will need to figure out how to get coefficient standard errors
faster in this case.  summary.lm() is really slow.
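
for future googlers, here is a rough sketch of what I have in mind
(untested): compute the usual OLS standard errors by hand from the same
crossproducts, skipping summary.lm() entirely.  the function name is
just mine.

ols.fast <- function(y, X) {
    ## X must already contain the intercept column and have no missing obs
    XtX  <- crossprod(X)
    beta <- solve(XtX, crossprod(X, y))
    res  <- y - X %*% beta
    s2   <- sum(res^2) / (nrow(X) - ncol(X))   # residual variance
    se   <- sqrt(diag(s2 * solve(XtX)))        # usual (non-robust) std errors
    cbind(coef = drop(beta), se = se)
}
## e.g.   ols.fast(d$y[.indx], cbind(1, d$x[.indx]))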

regards,

/iaw
----
Ivo Welch (ivo.welch at gmail.com)
http://www.ivo-welch.info/
J. Fred Weston Professor of Finance
Anderson School at UCLA, C519





On Mon, Oct 10, 2011 at 11:30 PM, Joshua Wiley <jwiley.psych at gmail.com> wrote:
> As another followup, given that you are doing numerous regression
> models and (I presume) working with finance/stock data that is
> strictly numeric (no need for special contrast coding, etc.), you can
> substantially reduce the time spent estimating the coefficients.  A
> simple way is to use lm.fit directly instead of lm.  For lm.fit, you
> pass the y and x (design) matrices directly.  This skips a good deal
> of overhead.  Here is one naive way; I imagine more speedups could be
> gained by incorporating the intercept (a column of 1s) into d instead of
> cbind()ing it.  The catch is that lm.fit requires matrices, not data
> tables, so what you gain may be lost in having to do an extra
> conversion.  In any case, here are the times on my system for the two
> options (note I used N = 1000 * 100 because I am presently on a
> glorified netbook).
>
>> print(system.time(all.2b <- lapply(si, function(.indx) { coef(lm(y ~
> + x, data=d[.indx,])) })))
>   user  system elapsed
>  69.00    0.00   69.56
>
>> print(system.time(all.2c <- lapply(si, function(.indx) { coef(lm.fit(y = d[.indx, y], x = cbind(1, d[.indx, x]))) })))
>   user  system elapsed
>  37.83    0.03   38.36
>
> The column names for the coefficients will not be the same as from lm,
> but the estimates should be identical.  While this is not recommended
> in typical usage, in an application like regressions on rolling time
> windows, where you know the data are not changing, I think it makes
> sense to bypass the clever code that inspects your data and picks the
> best methods, and go straight to passing the design matrix.  Since you
> do not need residuals, variances, etc., it may be possible to speed
> this up even more, perhaps bypassing dqrls altogether.
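>
> One further naive sketch along those lines (untested on your data):
> build the design matrix once, outside the loop, so each call only pays
> for the row subsetting rather than for a data.table-to-matrix conversion:
>
> xmat <- cbind(const = 1, x = d$x)   # design matrix with intercept built in
> yvec <- d$y
> all.2d <- lapply(si, function(.indx) {
>     coef(lm.fit(y = yvec[.indx], x = xmat[.indx, , drop = FALSE]))
> })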
>
> Cheers,
>
> Josh
>
> On Mon, Oct 10, 2011 at 9:56 PM, ivo welch <ivo.welch at gmail.com> wrote:
>> thank you, everyone.  this was very helpful to my specific task and
>> understanding.  for the benefit of future googlers, I thought I would
>> post some experiments and results here.
>>
>> ultimately, I need to do a by() on an irregular matrix, and I now know
>> how to speed up by() on a single core, and then again on a multi-core
>> machine.
>>
>> library(data.table)
>> N <- 1000*1000
>> d <- data.table(data.frame( key= as.integer(runif(N, min=1,
>> max=N/10)), x=rnorm(N), y=rnorm(N) ))  # irregular
>> setkey(d, "key"); gc() ## sort and force a garbage collection
>>
>>
>> cat("N=", N, ".  Size of d=", object.size(d)/1024/1024, "MB\n")
>>
>> cat("\nStandard by() Function:\n")
>> print(system.time( all.1 <- by( d, d$key, function(d) coef(lm(y ~ x, data=d)))))
>>
>>
>> cat("\n\nPreSplit Function [aka Jim H]\n\t(a) Splitting Operation:\n")
>> print(system.time(si <- split(seq(nrow(d)), d$key)))
>> cat("\n\t(b) Regressions:\n")
>> print(system.time(all.2 <- lapply(si, function(.indx) {
>> coef(lm(d$y[.indx] ~ d$x[.indx])) })))
>> print(system.time(all.2b <- lapply(si, function(.indx) { coef(lm(y ~
>> x, data=d[.indx,])) })))
>>
>> cat("\n\nNaive Split Data Frame\n\t(a) Splitting Operation:\n")
>> print(system.time(ds <- split(d, d$key)))
>> cat("\n\t(b) Regressions:\n")
>> print(system.time(all.3a <- lapply(ds, function(ds) { coef(lm(ds$y ~ ds$x)) })))
>> print(system.time(all.3b <- lapply(ds, function(ds) { coef(lm(y ~ x,
>> data=ds)) })))
>>
>> the first and the last ways (all.1 and all.3) are "naive" ways of
>> doing this, and take about 400-500 seconds on a MacBook Air, core i5.
>> Jim's suggestion (all.2) cuts this roughly in half by making the
>> split take almost no time.
>>
>> and now,
>>
>> library(multicore)
>> print(system.time(all.4 <- mclapply(si, function(.indx) { coef(lm(y ~
>> x, data=d[.indx,])) })))
>>
>> on my dual-core (quad-thread) i5, all four pseudo cores become busy,
>> and the time roughly halves again from 230 seconds to 120 seconds.
>>
>>
>> maybe the by() function should use Jim's approach, and multicore
>> should provide mcby().  of course, now that I know how to do this
>> quickly by hand, it is not so important for me.  but it may help some
>> other novices.
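>>
>> a rough, untested sketch of what such an mcby() might look like, using
>> Jim's index-splitting trick (the name and interface are just my guess):
>>
>> mcby <- function(data, INDICES, FUN, ...) {
>>     si <- split(seq(nrow(data)), INDICES)   # split row indices, not the data
>>     mclapply(si, function(.indx) FUN(data[.indx, ], ...))
>> }
>> ## e.g.   all.5 <- mcby(d, d$key, function(dd) coef(lm(y ~ x, data=dd)))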
>>
>> thanks again everybody.
>>
>> regards,
>>
>> /iaw
>>
>> ----
>> Ivo Welch (ivo.welch at gmail.com)
>>
>>
>>
>>
>> On Mon, Oct 10, 2011 at 9:31 PM, William Dunlap <wdunlap at tibco.com> wrote:
>>> The following avoids the overhead of data.frame methods
>>> (and assumes the data.frame doesn't include matrices
>>> or other data.frames) and relies on split(vector,factor)
>>> quickly splitting a vector into a list of vectors.
>>> For a 10^6 row by 10 column data.frame split into 10^5
>>> groups this took 14.1 seconds, while split() took 658.7 seconds.
>>> Both returned the same thing.
>>>
>>> Perhaps something based on this idea would help your
>>> parallelized by().
>>>
>>> mysplit.data.frame <-
>>> function (x, f, drop = FALSE, ...)
>>> {
>>>    f <- as.factor(f)
>>>    ## split each column separately (fast vector splits), and the row names
>>>    tmp <- lapply(x, function(xi) split(xi, f, drop = drop, ...))
>>>    rn <- split(rownames(x), f, drop = drop, ...)
>>>    ## regroup the per-column pieces by group, preserving group order
>>>    tmp <- unlist(unname(tmp), recursive = FALSE)
>>>    tmp <- split(tmp, factor(names(tmp), levels = unique(names(tmp))))
>>>    ## reassemble each group into a data.frame without data.frame methods
>>>    tmp <- lapply(setNames(seq_along(tmp), names(tmp)), function(i) {
>>>        t <- tmp[[i]]
>>>        names(t) <- names(x)
>>>        attr(t, "row.names") <- rn[[i]]
>>>        class(t) <- "data.frame"
>>>        t
>>>    })
>>>    tmp
>>> }
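>>>
>>> For instance (untested here), on a plain data.frame version of the
>>> example posted earlier, something like
>>>
>>> DF <- as.data.frame(d)
>>> system.time(s2 <- mysplit.data.frame(DF, DF$key))
>>>
>>> should give the same list as split(DF, DF$key), in a fraction of the time.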
>>>
>>> Bill Dunlap
>>> Spotfire, TIBCO Software
>>> wdunlap tibco.com
>>>
>>>> -----Original Message-----
>>>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Jim Holtman
>>>> Sent: Monday, October 10, 2011 7:29 PM
>>>> To: ivo welch
>>>> Cc: r-help
>>>> Subject: Re: [R] SLOW split() function
>>>>
>>>> instead of splitting the entire data frame, split the indices and then use these to access your data:
>>>> try
>>>>
>>>> system.time(s <- split(seq(nrow(d)), d$key))
>>>>
>>>> this should be faster and less memory intensive.  you can then use the indices to access the subset:
>>>>
>>>> result <- lapply(s, function(.indx){
>>>>     doSomething <- sum(d$someCol[.indx])
>>>> })
>>>>
>>>> Sent from my iPad
>>>>
>>>> On Oct 10, 2011, at 21:01, ivo welch <ivo.welch at gmail.com> wrote:
>>>>
>>>> > dear R experts:  apologies for all my speed and memory questions.  I
>>>> > have a bet with my coauthors that I can make R reasonably efficient
>>>> > through R-appropriate programming techniques.  this is not just for
>>>> > kicks, but for work.  for benchmarking, my [3 year old] Mac Pro has
>>>> > 2.8GHz Xeons, 16GB of RAM, and R 2.13.1.
>>>> >
>>>> > right now, it seems that 'split()' is why I am losing my bet.  (split
>>>> > is an integral component of *apply() and by(), so I need split() to be
>>>> > fast.  its resulting list can then be fed, e.g., to mclapply().)  I
>>>> > made up an example to illustrate my ills:
>>>> >
>>>> >    library(data.table)
>>>> >    N <- 1000
>>>> >    T <- N*10
>>>> >    d <- data.table(data.frame( key= rep(1:T, rep(N,T)), val=rnorm(N*T) ))
>>>> >    setkey(d, "key"); gc() ## force a garbage collection
>>>> >    cat("N=", N, ".  Size of d=", object.size(d)/1024/1024, "MB\n")
>>>> >    print(system.time( s<-split(d, d$key) ))
>>>> >
>>>> > My ordered input data table (or data frame; doesn't make a difference)
>>>> > is 114MB in size.  it takes about a second to create.  split() only
>>>> > needs to reshape it.  this simple operation takes almost 5 minutes on
>>>> > my computer.
>>>> >
>>>> > with a data set that is larger, this explodes further.
>>>> >
>>>> > am I doing something wrong?  is there an alternative to split()?
>>>> >
>>>> > sincerely,
>>>> >
>>>> > /iaw
>>>> >
>>>> > ----
>>>> > Ivo Welch (ivo.welch at gmail.com)
>>>> >
>>>>
>>>
>>
>>
>
>
>
> --
> Joshua Wiley
> Ph.D. Student, Health Psychology
> Programmer Analyst II, ATS Statistical Consulting Group
> University of California, Los Angeles
> https://joshuawiley.com/
>


