[R] Efficiency challenge: MANY subsets
Johannes Graumann
johannes_graumann at web.de
Tue Jan 20 09:45:14 CET 2009
Many thanks for this example, which doesn't entirely cover my case, since I
have as many "indexes" entries as "sequences" entries. It was very
educational nonetheless, and I used it to come up with something a bit
faster than what I had before. The main trick I used, though, was naming all
entries in "sequences" and "indexes" like so
names(sequences) <- seq(length(sequences))
names(indexes) <- seq(length(indexes))
and then doing an lapply on "names(indexes)", which allows me to access both
lists easily. What I end up with is this:
fragments <- lapply(
  names(indexes),
  function(x){
    lapply(
      indexes[[x]],
      function(.range){
        .range <- seq.int(
          .range[1], .range[2]
        )
        unlist(lapply(sequences[x], '[', .range), use.names=FALSE)
      }
    )
  }
)
Although this is still quite slow, it's much faster than what I had before.
Any further comments are highly welcome. I can send the real "sequences" and
"indexes" as exported R objects ...
Thanks, Joh
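
For the paired case described above (as many "indexes" entries as
"sequences" entries), one possible sketch is to let Map() walk both lists in
parallel, so that no names are needed. The toy data below are two short
vectors cut from the example sequence in this thread, with made-up ranges;
the helper name extract_fragments is hypothetical and not from the thread.
Whether this is faster on the real data is an assumption that would need to
be checked with system.time().

```r
# Toy data: two sequences, each with its own list of c(start, end) ranges.
# The ranges are made up for illustration only.
sequences <- list(
  c("M","G","L","W","I","S","F","G","T","P"),
  c("N","H","K","L","L","L","I","N")
)
indexes <- list(
  list(c(1, 3), c(2, 5)),
  list(c(4, 8))
)

# Hypothetical helper: cut every range out of a single sequence.
extract_fragments <- function(sequence, ranges) {
  lapply(ranges, function(r) sequence[seq.int(r[1], r[2])])
}

# Map() pairs sequences[[i]] with indexes[[i]] and returns a list of lists.
fragments <- Map(extract_fragments, sequences, indexes)
```

Here fragments[[i]][[j]] holds the j-th range cut from the i-th sequence,
which is the same shape as the nested lapply result above.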
jim holtman wrote:
> Try this one; it is doing a list of 7000 in under 2 seconds:
>
>> sequences <- list(
> +   c("M","G","L","W","I","S","F","G","T","P","P","S","Y","T","Y","L","L","I",
> +     "M","N","H","K","L","L","L","I","N","N","N","N","L","T","E","V","H","T",
> +     "Y","F","N","I","N","I","N","I","D","K","M","Y","I","H","*")
> + )
>>
>>
>>
>> indexes <- list(
> + list(
> + c(1,22),c(22,46),c(46, 51),c(1,46),c(22,51),c(1,51)
> + )
> + )
>>
>> indexes <- rep(indexes,10)
>> sequences <- rep(sequences,7000)
>>
>> system.time({
> + fragments <- lapply(indexes, function(.seq){
> +   lapply(.seq, function(.range){
> +     .range <- seq(.range[1], .range[2])  # save since we use several times
> +     lapply(sequences, '[', .range)
> +   })
> + })
> + })
> user system elapsed
> 1.24 0.00 1.26
>>
>>
>
>
> On Fri, Jan 16, 2009 at 3:16 PM, Johannes Graumann
> <johannes_graumann at web.de> wrote:
>> Thanks. Very elegant, but doesn't solve the problem of the outer "for"
>> loop, since I now would rewrite the code like so:
>>
>> fragments <- list()
>> for(iN in seq(length(sequences))){
>>   cat(paste(iN,"\n"))
>>   fragments[[iN]] <-
>>     lapply(indexes[[1]], function(g) sequences[[1]][do.call(seq, as.list(g))])
>> }
>>
>> still very slow for length(sequences) ~ 7000.
>>
>> Joh
>>
>> On Friday 16 January 2009 14:23:47 Henrique Dallazuanna wrote:
>>> Try this:
>>>
>>> lapply(indexes[[1]], function(g) sequences[[1]][do.call(seq, as.list(g))])
>>>
>>> On Fri, Jan 16, 2009 at 11:06 AM, Johannes Graumann
>>> <johannes_graumann at web.de> wrote:
>>> > Hello,
>>> >
>>> > I have a list of character vectors like this:
>>> >
>>> > sequences <- list(
>>> >   c("M","G","L","W","I","S","F","G","T","P","P","S","Y","T","Y","L","L","I",
>>> >     "M","N","H","K","L","L","L","I","N","N","N","N","L","T","E","V","H","T",
>>> >     "Y","F","N","I","N","I","N","I","D","K","M","Y","I","H","*")
>>> > )
>>> >
>>> > and another list of subset ranges like this:
>>> >
>>> > indexes <- list(
>>> > list(
>>> > c(1,22),c(22,46),c(46, 51),c(1,46),c(22,51),c(1,51)
>>> > )
>>> > )
>>> >
>>> > What I now want to do is to subset each entry in "sequences"
>>> > (sequences[[1]]) with all ranges in the corresponding low level list
>>> > in "indexes" (indexes[[1]]). Here is what I came up with.
>>> >
>>> > fragments <- list()
>>> > for(iN in seq(length(sequences))){
>>> >   cat(paste(iN,"\n"))
>>> >   tmpFragments <- sapply(
>>> >     indexes[[iN]],
>>> >     function(x){
>>> >       sequences[[iN]][seq.int(x[1],x[2])]
>>> >     }
>>> >   )
>>> >   fragments[[iN]] <- tmpFragments
>>> > }
>>> >
>>> > This works fine, but "sequences" contains thousands of entries and the
>>> > corresponding "indexes" are sometimes hundreds of ranges long, so this
>>> > whole process is EXTREMELY inefficient.
>>> >
>>> > Does somebody out there take the challenge and show me a way on how to
>>> > speed this up?
>>> >
>>> > Thanks for any hints,
>>> >
>>> > Joh
>>> >
>>> > ______________________________________________
>>> > R-help at r-project.org mailing list
>>> > https://stat.ethz.ch/mailman/listinfo/r-help
>>> > PLEASE do read the posting guide
>>> > http://www.R-project.org/posting-guide.html
>>> > and provide commented, minimal, self-contained, reproducible code.
>>
>>