[Bioc-devel] appending 2 GappedAlignments using "c" takes long
Nicolas Delhomme
delhomme at embl.de
Thu Jul 11 10:15:33 CEST 2013
Hej Hervé!
---------------------------------------------------------------
Nicolas Delhomme
Genome Biology Computational Support
European Molecular Biology Laboratory
Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------
On Jul 10, 2013, at 8:54 PM, Hervé Pagès wrote:
> Hi Nico,
>
> On 07/09/2013 08:07 AM, Nicolas Delhomme wrote:
>> Hej Bioc Core!
>>
>> There was some discussion last year about implementing a BamStreamer (à la FastqStreamer), but I haven't seen anything like it in the current devel. I've implemented the following function that should do the job for me - I have many very large files, and I need to use a cluster with relatively few RAM per node and a restrictive time allocation , so I want to parallelize the reading of the BAM file to manage both. The example below is obviously not affecting the RAM issue but I streamlined it to point out my issue.
>>
>> ".stream" <- function(bamFile,yieldSize=100000,verbose=FALSE){
>>
>> ## create a stream
>> stopifnot(is(bamFile,"BamFile"))
>>
>> ## set the yieldSize if it is not set already
>> if(is.na(yieldSize(bamFile))){
>> yieldSize(bamFile) <- yieldSize
>> }
>>
>> ## open it
>> open(bamFile)
>>
>> ## verb
>> if(verbose){
>> message(paste("Streaming",basename(path(bamFile))))
>> }
>>
>> ## create the output
>> out <- GappedAlignments()
>>
>> ## process it
>> while(length(chunk <- readBamGappedAlignments(bamFile))){
>> if(verbose){
>> message(paste("Processed",length(chunk),"reads"))
>> }
>> out <- c(out,chunk)
>> }
>
> Note that regardless the speed of c() on GappedAlignments objects,
> growing an object in a loop is fundamentally inefficient (see Circle 2
> of The R Inferno).
> Also keeping the chunks in memory kind of defeats the purpose of reading
> the file one chunk at a time.
Sure. What this function normally really does is a data reduction - basically getting a named vector back. I just came across the appending issue when preparing the code example above.
>
>>
>> ## close
>> close(bamFile)
>>
>> ## return
>> return(out)
>> }
>>
>> In the method above, the first iteration of combining the GappedAlignments:
>>
>> out <- c(out,chunk) takes:
>>
>> system.time(append(out,chunk))
>>
>> user system elapsed
>> 123.704 0.060 124.011
>
> 2 minutes! Whaoo, that's really slow. I can't reproduce this on my
> machine though:
>
OK, sounds more like a system issue then.
> library(Rsamtools)
> library(RNAseqData.HNRNPC.bam.chr14)
> bamfile <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[1L])
> yieldSize(bamfile) <- 100000L
> open(bamfile)
> out <- GappedAlignments()
>
> Then:
>
> > chunk <- readBamGappedAlignments(bamfile)
> > system.time(out <- append(out, chunk))
> user system elapsed
> 0.284 0.000 0.286
>
> I wonder what's going on on your system. Are you sure it was not running
> out of memory when you did this?
Yes, that's a fat node with 0.2TB RAM and I was the only one on it at the time.
> Try to check the load with uptime or
> top in another terminal (e.g. start top right before you call append()).
> If the system starts swapping, then your R process will become hundreds
> or thousands times slower!
and there was no memory intensive job running. Could still have been some NFS related issue. I will retry with a fresh session and monitor the I/O as well.
>
>>
>> whereas the second iteration (faked here) takes only (still long):
>>
>> system.time(append(chunk,chunk))
>>
>> user system elapsed
>> 2.708 0.044 2.758
>
> 2nd, 3rd and 4th iterations for me:
>
> > chunk <- readBamGappedAlignments(bamfile)
> > system.time(out <- append(out, chunk))
> user system elapsed
> 0.516 0.004 0.521
>
> > chunk <- readBamGappedAlignments(bamfile)
> > system.time(out <- append(out, chunk))
> user system elapsed
> 0.656 0.008 0.663
>
> > chunk <- readBamGappedAlignments(bamfile)
> > system.time(out <- append(out, chunk))
> user system elapsed
> 0.796 0.004 0.801
>
> As expected, the time is growing (this is why the process
> of growing an object in a loop is considered to be quadratic
> in time).
Quadratic! Wow, I knew it was slower but still... Good to know.
>
>>
>> I suppose this has to do with the way GenomicRanges:::unlist_list_of_GappedAlignments deals with combining the objects and all the related sanity checks. For the first iteration, the seqlengths are different so I suppose that is what explains the 60X lag compared to the second iteration.
>
> The seqinfo of the 2 objects to combine need to be merged together
> and set back on each object before the 2 objects can actually
> be combined. This operation is cheap and I wouldn't expect this
> to slow down the first iteration significantly.
Yes, that was very surprising.
>
>> Due to the implementation of GappedAlignments, I can't set the seqlengths programmatically in GappedAlignments() which I imagine would have reduced the first iteration lag; see the trials below:
>>
>> out <- GappedAlignments(seqlengths=seqlengths(chunk))
>>
>> Error in GappedAlignments(seqlengths = seqlengths(chunk)) :
>> 'names(seqlengths)' incompatible with 'levels(seqnames)'
>>
>> out <- GappedAlignments(seqlengths=seqlengths(chunk),seqnames=seqnames(chunk))
>>
>> Error in GappedAlignments(seqlengths = seqlengths(chunk), seqnames = seqnames(chunk)) :
>> 'strand' must be specified when 'seqnames' is not empty
>>
>> out <- GappedAlignments(seqlengths=seqlengths(chunk),seqnames=seqnames(chunk),strand="+")
>>
>> Error in validObject(.Object) :
>> invalid class “GappedAlignments” object: 1: invalid object for slot "strand" in class "GappedAlignments": got class "character", should be or extend class "Rle"
>> invalid class “GappedAlignments” object: 2: number of rows in DataTable 'mcols(x)' must match length of 'x'
>
> The trick is to create an empty GappedAlignments objects
> with non-empty seqlevels so you can put seqlengths on the
> seqlevels.
>
> Here are 2 ways to create an empty GappedAlignments objects with
> non-empty seqlevels:
>
> (1) Pass an empty factor with non-empty levels to the seqnames
> arg:
>
> out <- GappedAlignments(seqnames=factor(levels=seqlevels(chunk)))
>
> (2) The recommended way:
>
> out <- GappedAlignments()
> seqinfo(out) <- seqinfo(chunk)
>
> Note that with (2), 'out' gets all the seqinfo from 'chunk' (including
> its seqlengths), not only its seqlevels.
>
> (1) could be adapted to also set the seqlengths:
>
> out <- GappedAlignments(seqnames=factor(levels=seqlevels(chunk)),
> seqlengths=seqlengths(chunk))
>
> but (2) is really the preferred way.
Thanks for the pointers!
>
>>
>> I completely approve of such sanity checks; it seems that I'm just trying to do something that it was not designed for :-) All I'm really interested in is a way to stream my BAM file and I'm looking forward to any suggestion. I especially don't want to re-invent the wheel if you have already planned something. If you haven't I'd be glad to get some insight how I can walk around that problem.
>>
>> My sessionInfo:
>>
>> R version 3.0.1 (2013-05-16)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
>> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
>> [7] LC_PAPER=C LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] parallel stats graphics grDevices utils datasets methods
>> [8] base
>>
>> other attached packages:
>> [1] BiocInstaller_1.11.3 Rsamtools_1.13.22 Biostrings_2.29.12
>> [4] GenomicRanges_1.13.26 XVector_0.1.0 IRanges_1.19.15
>> [7] BiocGenerics_0.7.2
>>
>> loaded via a namespace (and not attached):
>> [1] bitops_1.0-5 stats4_3.0.1 zlibbioc_1.7.0
>
>
> Looks like you are using Bioc-devel. Did you get all the
> warnings about GappedAlignments, readBamGappedAlignments(),
> and GappedAlignments() being deprecated?
I though I did, but indeed I didn't get the warnings then. This is very strange.
>
> I thought you were using the release so that's what I used:
>
> > sessionInfo()
> R version 3.0.0 (2013-04-03)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel stats graphics grDevices utils datasets methods
> [8] base
>
> other attached packages:
> [1] RNAseqData.HNRNPC.bam.chr14_0.1.3 Rsamtools_1.12.3
> [3] Biostrings_2.28.0 GenomicRanges_1.12.4
> [5] IRanges_1.18.1 BiocGenerics_0.6.0
>
> loaded via a namespace (and not attached):
> [1] bitops_1.0-5 stats4_3.0.0 zlibbioc_1.6.0
>
>
> The timings I get with Bioc-devel are pretty much the same though.
>
> Something doesn't seem to be quite right with your cluster.
I agree, I'll check that out.
> What happens
> if you try to rbind() 2 data.frames of 100000 rows each in a fresh
> session?
>
> > df <- data.frame(aa=1:100000, bb=100000:1, cc="cc", dd="dd")
> > system.time(df2 <- rbind(df, df))
> user system elapsed
> 0.204 0.000 0.206
>
Good point. I'll try that out and let you know.
Thanks for the very detailed answer!
Cheers,
Nico
> Thanks,
> H.
>
>>
>> Cheers,
>>
>> Nico
>>
>> ---------------------------------------------------------------
>> Nicolas Delhomme
>>
>> Genome Biology Computational Support
>>
>> European Molecular Biology Laboratory
>>
>> Tel: +49 6221 387 8310
>> Email: nicolas.delhomme at embl.de
>> Meyerhofstrasse 1 - Postfach 10.2209
>> 69102 Heidelberg, Germany
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
More information about the Bioc-devel
mailing list