[R] ideas about how to reduce RAM & improve speed in trying to use lapply(strsplit())

Ian Gow iandgow at gmail.com
Mon May 30 02:44:27 CEST 2011


Not a new approach, but here is some benchmark data (adding perl=TRUE
speeds up Jim's suggestion):

> x <- c('18x.6','12x.9','302x.3')
> y <- rep(x,100000)
> system.time(temp <- unlist(lapply(strsplit(y,".",fixed=TRUE),function(x) x[1])))
   user  system elapsed
  1.203   0.018   1.222
> system.time(temp2 <- gsub("^(.*?)\\..*$","\\1",y, perl=TRUE))
   user  system elapsed
  0.176   0.001   0.176
> identical(temp2, temp)
[1] TRUE
> system.time(temp3 <- gsub("^(.*)\\..*", '\\1', y))
   user  system elapsed
  0.292   0.001   0.291
> identical(temp3, temp)
[1] TRUE
> system.time(temp3 <- gsub("^(.*)\\..*", '\\1', y, perl=TRUE))
   user  system elapsed
  0.160   0.001   0.161
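
One caveat on the patterns timed above: the lazy version (.*?) cuts at the
first dot, as strsplit(...)[1] does, while the greedy version (.*) cuts at the
last dot. The two agree here only because each string has exactly one dot.
The sketch below illustrates that and also shows a regex-free alternative
using regexpr() and substr(). It is not part of the benchmark and is untimed;
the names ref, pos, temp4, z, lazy and greedy are introduced here for
illustration only, and the smaller rep() count is an arbitrary choice.

```r
x <- c('18x.6', '12x.9', '302x.3')
y <- rep(x, 1000)

# Reference result: first field from strsplit, as in the original post.
ref <- unlist(lapply(strsplit(y, ".", fixed = TRUE), function(s) s[1]))

# Regex-free alternative: locate the first literal "." and cut just before it.
# as.integer() drops the attributes that regexpr() attaches to its result.
pos <- as.integer(regexpr(".", y, fixed = TRUE))
# Strings without a dot (regexpr returns -1) are kept whole.
pos[pos < 0] <- nchar(y)[pos < 0] + 1L
temp4 <- substr(y, 1L, pos - 1L)
print(identical(temp4, ref))

# Lazy vs. greedy only differ when a string contains more than one dot.
z <- "1x.2.3"
lazy   <- gsub("^(.*?)\\..*$", "\\1", z, perl = TRUE)  # cuts at the first dot
greedy <- gsub("^(.*)\\..*", "\\1", z)                 # cuts at the last dot
print(c(lazy = lazy, greedy = greedy))
```

Whether the substr() version is faster or lighter on memory than gsub() for a
vector of 132e6 strings would need to be measured; it does avoid building the
intermediate list that strsplit() creates.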






On 5/29/11 7:40 PM, "jim holtman" <jholtman at gmail.com> wrote:

>Try this approach:
>
>> x <- c('18x.6','12x.9','302x.3')
>> gsub("^(.*)\\..*", '\\1', x)
>[1] "18x"  "12x"  "302x"
>
>
>On Sun, May 29, 2011 at 8:10 PM, Matthew Keller <mckellercran at gmail.com>
>wrote:
>> hi all,
>>
>> I'm full of questions today :). Thanks in advance for your help!
>>
>> Here's the problem:
>> x <- c('18x.6','12x.9','302x.3')
>>
>> I want to get a vector that is c('18x','12x','302x')
>>
>> This is easily done using this code:
>>
>> unlist(lapply(strsplit(x,".",fixed=TRUE),function(x) x[1]))
>>
>> So far so good. The problem is that x is a vector of length 132e6.
>> When I run the above code, it runs for > 30 minutes and takes > 23
>> GB of RAM (no kidding!).
>>
>> Does anyone have ideas about how to speed up the code above and (more
>> importantly) reduce the RAM footprint? I'd prefer not to change the
>> file on disk using, e.g., awk, but I will do that as a last resort.
>>
>> Best
>>
>> Matt
>>
>> --
>> Matthew C Keller
>> Asst. Professor of Psychology
>> University of Colorado at Boulder
>> www.matthewckeller.com
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>
>
>-- 
>Jim Holtman
>Data Munger Guru
>
>What is the problem that you are trying to solve?
>


