[BioC] IRanges: Need help speeding up sliding window analysis

Patrick Aboyoun paboyoun at fhcrc.org
Fri Sep 24 17:51:27 CEST 2010


Michael,
The IRanges package contains a number of built-in running window  
functions (runsum, runmean, runwtsum, runq) that you might want to  
consider for this operation. Also, if I am understanding what you are  
trying to do correctly, something like

width <- 30
halfWidth <- width/2
halfWidthSums <- runsum(rle, halfWidth)
diff(halfWidthSums, halfWidth)/width

should fit the bill.


Cheers,
Patrick


Quoting Martin Morgan <mtmorgan at fhcrc.org>:

> On 09/24/2010 03:11 AM, Michael Dondrup wrote:
>> Hi,
>>
>> I need some help with speeding up a sliding window analysis on an   
>> Rle object of length > 1 million.
>> I am using functions 'successiveViews' with negative gap width and   
>> 'viewApply' vs. viewMeans.
>> My goal is to apply a discrete differential operator that computes   
>> the  difference between the 'left'
>> half of the window and the 'right', aka. a cheap discrete numeric   
>> first order differentiation.
>>
>> What I found is: viewMeans(x) << viewApply(x, mean) << viewApply(x,  
>>  diff.op) in terms of time, example below.
>> Is there a way to pimp this code to make it work on the genome   
>> scale? I appreciate your input, I am confident there
>> is a better way to do it.
>
> maybe
>
>   win <- 30
>   diff(cumsum(rle), win) / win
>
> for numeric (not integer) rle, though there might be rounding problems
> if cumsum gets large. A strategy might be to break the Rle into regions
> separated by islands of at least 'win' 0's (using runLength / runValue
> to identify candidate break points), which allows one to reset the
> cumsum. Some inspiration might come from
> http://www.mail-archive.com/r-help@r-project.org/msg75280.html.
>
> Also the end points might need fiddling (e.g., by padding rle with 'win'
> trailing zeros, which is I think in effect what successiveViews does.
>
> Martin
>
>>
>> Thank you very much
>> Michael
>>
>> Code example:
>>
>> diff.op <- function(x, lrprop=1/2) {
>>   len = length(x)
>>   i = ceiling(len*lrprop)
>>   (sum(x[i:len]) - sum(x[1:i])) / len
>> }
>>
>> sliding.window.apply <-  function(object, width, fun, ...) {
>>   x <- trim(successiveViews(subject=object, width=rep(width,    
>> ceiling(length(object))  ), gap=-width+1))
>>   return (Rle(viewApply(x, fun, ...)))
>> }
>>
>> sliding.window.mean <-  function(object, width) {
>>   x <- trim(successiveViews(subject=object, width=rep(width,    
>> ceiling(length(object))  ), gap=-width+1))
>>   return (viewMeans(x))
>> }
>>
>> rle <- Rle(1:10000)
>>
>>> system.time(sliding.window.mean(rle, 30))
>>    user  system elapsed
>>   0.036   0.004   0.098
>>> system.time(sliding.window.apply(rle,  30, mean))
>>    user  system elapsed
>>   4.380   0.065   6.010
>>> system.time(sliding.window.apply(rle,  30, diff.op))
>>    user  system elapsed
>>  38.857   0.204  39.127
>>
>>> sessionInfo()
>> R version 2.11.1 (2010-05-31)
>> x86_64-apple-darwin9.8.0
>>
>> locale:
>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] IRanges_1.6.6
>>
>> loaded via a namespace (and not attached):
>>  [1] annotate_1.26.1      AnnotationDbi_1.10.2 Biobase_2.8.0         
>>  DBI_0.2-5            DESeq_1.0.4
>>  [6] genefilter_1.30.0    geneplotter_1.26.0   grid_2.11.1           
>>  RColorBrewer_1.0-2   RSQLite_0.9-1
>> [11] splines_2.11.1       survival_2.35-8      xtable_1.5-6
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:   
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:   
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>



More information about the Bioconductor mailing list