[R] Efficiency challenge: MANY subsets

Tue Jan 20 09:45:14 CET 2009

Many thanks for this example. It doesn't entirely cover my case, since I 
have as many "indexes" entries as "sequences" entries, but it was very 
educational nonetheless, and I used it to come up with something a bit 
faster than what I had before. The main trick, though, was naming all 
entries in "sequences" and "indexes" like so (and likewise for "sequences")
  names(indexes) <- seq(length(indexes))
and then doing an lapply over "names(indexes)", which allows me to access 
both lists easily. What I ended up with is this:

fragments <- lapply(
    names(indexes),
    function(x){
      lapply(
        indexes[[x]],
        function(.range){
          .range <- seq.int(
            .range[1], .range[2]
          )
          unlist(lapply(sequences[x], '[', .range),use.names=FALSE)
        }
      )
    }
  )

Although this is still quite slow, it's much faster than what I had before. 
Any further comments are highly welcome. I can send the real "sequences" and 
"indexes" as exported R objects ...

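For reference, a minimal sketch of the same pairing without the detour via 
names, assuming "sequences" and "indexes" have the same length and every 
range is a c(start, end) pair. Map() walks both lists in parallel. The toy 
data and the name "fragments2" are made up for illustration, and I have not 
timed this on the real objects:

```r
# Toy data (hypothetical): two sequences, each with its own list of ranges.
sequences <- list(
  c("M","G","L","W","I","S"),
  c("N","H","K","L")
)
indexes <- list(
  list(c(1, 3), c(2, 6)),
  list(c(1, 4))
)

# Map() pairs sequences[[i]] with indexes[[i]], so neither names nor an
# outer "for" loop are needed.
fragments2 <- Map(
  function(.seq, .ranges) {
    lapply(.ranges, function(.range) .seq[seq.int(.range[1], .range[2])])
  },
  sequences, indexes
)
```

Here fragments2[[i]] holds one character vector per range in indexes[[i]], 
the same nesting as "fragments" above.
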
Thanks, Joh

jim holtman wrote:

> Try this one;  it is doing a list of 7000 in under 2 seconds:
> 
>>  sequences <- list(
> +   c("M","G","L","W","I","S","F","G","T","P","P","S","Y","T","Y","L","L","I","M",
> +     "N","H","K","L","L","L","I","N","N","N","N","L","T","E","V","H","T","Y","F",
> +     "N","I","N","I","N","I","D","K","M","Y","I","H","*")
> +  )
>>
>>
>>
>>  indexes <- list(
> +   list(
> +     c(1,22),c(22,46),c(46, 51),c(1,46),c(22,51),c(1,51)
> +   )
> +  )
>>
>> indexes <- rep(indexes,10)
>> sequences <- rep(sequences,7000)
>>
>> system.time({
> + fragments <- lapply(indexes, function(.seq){
> +     lapply(.seq, function(.range){
> +         .range <- seq(.range[1], .range[2])  # save since we use several times
> +         lapply(sequences, '[', .range)
> +     })
> + })
> + })
>    user  system elapsed
>    1.24    0.00    1.26
>>
>>
> 
> 
> On Fri, Jan 16, 2009 at 3:16 PM, Johannes Graumann
> <johannes_graumann at web.de> wrote:
>> Thanks. Very elegant, but doesn't solve the problem of the outer "for"
>> loop, since I now would rewrite the code like so:
>>
>> fragments <- list()
>> for(iN in seq(length(sequences))){
>>  cat(paste(iN,"\n"))
>>  fragments[[iN]] <-
>>    lapply(indexes[[1]], function(g)sequences[[1]][do.call(seq,
>>    as.list(g))])
>> }
>>
>> still very slow for length(sequences) ~ 7000.
>>
>> Joh
>>
>> On Friday 16 January 2009 14:23:47 Henrique Dallazuanna wrote:
>>> Try this:
>>>
>>> lapply(indexes[[1]], function(g)sequences[[1]][do.call(seq,
>>> as.list(g))])
>>>
>>> On Fri, Jan 16, 2009 at 11:06 AM, Johannes Graumann
>>> <johannes_graumann at web.de> wrote:
>>> > Hello,
>>> >
>>> > I have a list of character vectors like this:
>>> >
>>> > sequences <- list(
>>> >   c("M","G","L","W","I","S","F","G","T","P","P","S","Y","T","Y","L","L","I","M",
>>> >     "N","H","K","L","L","L","I","N","N","N","N","L","T","E","V","H","T","Y","F",
>>> >     "N","I","N","I","N","I","D","K","M","Y","I","H","*")
>>> > )
>>> >
>>> > and another list of subset ranges like this:
>>> >
>>> > indexes <- list(
>>> >  list(
>>> >    c(1,22),c(22,46),c(46, 51),c(1,46),c(22,51),c(1,51)
>>> >  )
>>> > )
>>> >
>>> > What I now want to do is to subset each entry in "sequences"
>>> > (sequences[[1]]) with all ranges in the corresponding low level list
>>> > in "indexes" (indexes[[1]]). Here is what I came up with.
>>> >
>>> > fragments <- list()
>>> > for(iN in seq(length(sequences))){
>>> >  cat(paste(iN,"\n"))
>>> >  tmpFragments <- sapply(
>>> >    indexes[[iN]],
>>> >    function(x){
>>> >      sequences[[iN]][seq.int(x[1],x[2])]
>>> >    }
>>> >  )
>>> >  fragments[[iN]] <- tmpFragments
>>> > }
>>> >
>>> > This works fine, but "sequences" contains thousands of entries and the
>>> > corresponding "indexes" are sometimes hundreds of ranges long, so this
>>> > whole process is EXTREMELY inefficient.
>>> >
>>> > Does somebody out there take the challenge and show me a way on how to
>>> > speed this up?
>>> >
>>> > Thanks for any hints,
>>> >
>>> > Joh
>>> >
>>> > ______________________________________________
>>> > R-help at r-project.org mailing list
>>> > https://stat.ethz.ch/mailman/listinfo/r-help
>>> > PLEASE do read the posting guide
>>> > http://www.R-project.org/posting-guide.html
>>> > and provide commented, minimal, self-contained, reproducible code.