[Bioc-devel] appending 2 GappedAlignments using "c" takes long

Thu Jul 11 10:15:33 CEST 2013

Hej Hervé!

---------------------------------------------------------------
Nicolas Delhomme

Genome Biology Computational Support

European Molecular Biology Laboratory

Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------

On Jul 10, 2013, at 8:54 PM, Hervé Pagès wrote:

> Hi Nico,
> 
> On 07/09/2013 08:07 AM, Nicolas Delhomme wrote:
>> Hej Bioc Core!
>> 
>> There was some discussion last year about implementing a BamStreamer (à la FastqStreamer), but I haven't seen anything like it in the current devel. I've implemented the following function that should do the job for me - I have many very large files, and I need to use a cluster with relatively few RAM per node and a restrictive time allocation , so I want to parallelize the reading of the BAM file to manage both. The example below is obviously not affecting the RAM issue but I streamlined it to point out my issue.
>> 
>> ".stream" <- function(bamFile,yieldSize=100000,verbose=FALSE){
>> 
>>  ## create a stream
>>  stopifnot(is(bamFile,"BamFile"))
>> 
>>  ## set the yieldSize if it is not set already
>>  if(is.na(yieldSize(bamFile))){
>>    yieldSize(bamFile) <- yieldSize
>>  }
>> 
>>  ## open it
>>  open(bamFile)
>> 
>>  ## verb
>>  if(verbose){
>>    message(paste("Streaming",basename(path(bamFile))))
>>  }
>> 
>>  ## create the output
>>  out <- GappedAlignments()
>> 
>>  ## process it
>>  while(length(chunk <- readBamGappedAlignments(bamFile))){
>>    if(verbose){
>>      message(paste("Processed",length(chunk),"reads"))
>>    }
>>    out <- c(out,chunk)
>>  }
> 
> Note that regardless the speed of c() on GappedAlignments objects,
> growing an object in a loop is fundamentally inefficient (see Circle 2
> of The R Inferno).
> Also keeping the chunks in memory kind of defeats the purpose of reading
> the file one chunk at a time.

Sure. What this function normally really does is a data reduction - basically getting a named vector back. I just came across the appending issue when preparing the code example above.

> 
>> 
>>  ## close
>>  close(bamFile)
>> 
>>  ## return
>>  return(out)
>> }
>> 
>> In the method above, the first iteration of combining the GappedAlignments:
>> 
>> out <- c(out,chunk) takes:
>> 
>> system.time(append(out,chunk))
>> 
>>   user  system elapsed
>> 123.704   0.060 124.011
> 
> 2 minutes! Whaoo, that's really slow. I can't reproduce this on my
> machine though:
> 

OK, sounds more like a system issue then.

>  library(Rsamtools)
>  library(RNAseqData.HNRNPC.bam.chr14)
>  bamfile <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[1L])
>  yieldSize(bamfile) <- 100000L
>  open(bamfile)
>  out <- GappedAlignments()
> 
> Then:
> 
>  > chunk <- readBamGappedAlignments(bamfile)
>  > system.time(out <- append(out, chunk))
>     user  system elapsed
>    0.284   0.000   0.286
> 
> I wonder what's going on on your system. Are you sure it was not running
> out of memory when you did this?

Yes, that's a fat node with 0.2TB RAM and I was the only one on it at the time.

> Try to check the load with uptime or
> top in another terminal (e.g. start top right before you call append()).
> If the system starts swapping, then your R process will become hundreds
> or thousands times slower!

and there was no memory intensive job running. Could still have been some NFS related issue. I will retry with a fresh session and monitor the I/O as well.

> 
>> 
>> whereas the second iteration (faked here) takes only (still long):
>> 
>> system.time(append(chunk,chunk))
>> 
>>   user  system elapsed
>>  2.708   0.044   2.758
> 
> 2nd, 3rd and 4th iterations for me:
> 
>  > chunk <- readBamGappedAlignments(bamfile)
>  > system.time(out <- append(out, chunk))
>     user  system elapsed
>    0.516   0.004   0.521
> 
>  > chunk <- readBamGappedAlignments(bamfile)
>  > system.time(out <- append(out, chunk))
>     user  system elapsed
>    0.656   0.008   0.663
> 
>  > chunk <- readBamGappedAlignments(bamfile)
>  > system.time(out <- append(out, chunk))
>     user  system elapsed
>    0.796   0.004   0.801
> 
> As expected, the time is growing (this is why the process
> of growing an object in a loop is considered to be quadratic
> in time).

Quadratic! Wow, I knew it was slower but still... Good to know.

> 
>> 
>> I suppose this has to do with the way GenomicRanges:::unlist_list_of_GappedAlignments deals with combining the objects and all the related sanity checks. For the first iteration, the seqlengths are different so I suppose that is what explains the 60X lag compared to the second iteration.
> 
> The seqinfo of the 2 objects to combine need to be merged together
> and set back on each object before the 2 objects can actually
> be combined. This operation is cheap and I wouldn't expect this
> to slow down the first iteration significantly.

Yes, that was very surprising.

> 
>> Due to the implementation of GappedAlignments, I can't set the seqlengths programmatically in GappedAlignments() which I imagine would have reduced the first iteration lag; see the trials below:
>> 
>> out <- GappedAlignments(seqlengths=seqlengths(chunk))
>> 
>> Error in GappedAlignments(seqlengths = seqlengths(chunk)) :
>>  'names(seqlengths)' incompatible with 'levels(seqnames)'
>> 
>> out <- GappedAlignments(seqlengths=seqlengths(chunk),seqnames=seqnames(chunk))
>> 
>> Error in GappedAlignments(seqlengths = seqlengths(chunk), seqnames = seqnames(chunk)) :
>>  'strand' must be specified when 'seqnames' is not empty
>> 
>> out <- GappedAlignments(seqlengths=seqlengths(chunk),seqnames=seqnames(chunk),strand="+")
>> 
>> Error in validObject(.Object) :
>>  invalid class “GappedAlignments” object: 1: invalid object for slot "strand" in class "GappedAlignments": got class "character", should be or extend class "Rle"
>> invalid class “GappedAlignments” object: 2: number of rows in DataTable 'mcols(x)' must match length of 'x'
> 
> The trick is to create an empty GappedAlignments objects
> with non-empty seqlevels so you can put seqlengths on the
> seqlevels.
> 
> Here are 2 ways to create an empty GappedAlignments objects with
> non-empty seqlevels:
> 
>  (1) Pass an empty factor with non-empty levels to the seqnames
>      arg:
> 
>        out <- GappedAlignments(seqnames=factor(levels=seqlevels(chunk)))
> 
>  (2) The recommended way:
> 
>        out <- GappedAlignments()
>        seqinfo(out) <- seqinfo(chunk)
> 
> Note that with (2), 'out' gets all the seqinfo from 'chunk' (including
> its seqlengths), not only its seqlevels.
> 
> (1) could be adapted to also set the seqlengths:
> 
>  out <- GappedAlignments(seqnames=factor(levels=seqlevels(chunk)),
>                          seqlengths=seqlengths(chunk))
> 
> but (2) is really the preferred way.

Thanks for the pointers! 

> 
>> 
>> I completely approve of such sanity checks; it seems that I'm just trying to do something that it was not designed for :-) All I'm really interested in is a way to stream my BAM file and I'm looking forward to any suggestion. I especially don't want to re-invent the wheel if you have already planned something. If you haven't I'd be glad to get some insight how I can walk around that problem.
>> 
>> My sessionInfo:
>> 
>> R version 3.0.1 (2013-05-16)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>> 
>> locale:
>> [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
>> [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
>> [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
>> [7] LC_PAPER=C                 LC_NAME=C
>> [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>> 
>> attached base packages:
>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>> [8] base
>> 
>> other attached packages:
>> [1] BiocInstaller_1.11.3  Rsamtools_1.13.22     Biostrings_2.29.12
>> [4] GenomicRanges_1.13.26 XVector_0.1.0         IRanges_1.19.15
>> [7] BiocGenerics_0.7.2
>> 
>> loaded via a namespace (and not attached):
>> [1] bitops_1.0-5   stats4_3.0.1   zlibbioc_1.7.0
> 
> 
> Looks like you are using Bioc-devel. Did you get all the
> warnings about GappedAlignments, readBamGappedAlignments(),
> and GappedAlignments() being deprecated?

I though I did, but indeed I didn't get the warnings then. This is very strange.

> 
> I thought you were using the release so that's what I used:
> 
> > sessionInfo()
> R version 3.0.0 (2013-04-03)
> Platform: x86_64-unknown-linux-gnu (64-bit)
> 
> locale:
> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=C                 LC_NAME=C
> [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> 
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods
> [8] base
> 
> other attached packages:
> [1] RNAseqData.HNRNPC.bam.chr14_0.1.3 Rsamtools_1.12.3
> [3] Biostrings_2.28.0                 GenomicRanges_1.12.4
> [5] IRanges_1.18.1                    BiocGenerics_0.6.0
> 
> loaded via a namespace (and not attached):
> [1] bitops_1.0-5   stats4_3.0.0   zlibbioc_1.6.0
> 
> 
> The timings I get with Bioc-devel are pretty much the same though.
> 
> Something doesn't seem to be quite right with your cluster.

I agree, I'll check that out.

> What happens
> if you try to rbind() 2 data.frames of 100000 rows each in a fresh
> session?
> 
>  > df <- data.frame(aa=1:100000, bb=100000:1, cc="cc", dd="dd")
>  > system.time(df2 <- rbind(df, df))
>     user  system elapsed
>    0.204   0.000   0.206
> 

Good point. I'll try that out and let you know.

Thanks for the very detailed answer!

Cheers,

Nico

> Thanks,
> H.
> 
>> 
>> Cheers,
>> 
>> Nico
>> 
>> ---------------------------------------------------------------
>> Nicolas Delhomme
>> 
>> Genome Biology Computational Support
>> 
>> European Molecular Biology Laboratory
>> 
>> Tel: +49 6221 387 8310
>> Email: nicolas.delhomme at embl.de
>> Meyerhofstrasse 1 - Postfach 10.2209
>> 69102 Heidelberg, Germany
>> 
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>> 
> 
> -- 
> Hervé Pagès
> 
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
> 
> E-mail: hpages at fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319