[Bioc-devel] appending 2 GappedAlignments using "c" takes long
Nicolas Delhomme
delhomme at embl.de
Thu Jul 11 10:04:18 CEST 2013
Hi Martin,
On Jul 10, 2013, at 8:40 PM, Martin Morgan wrote:
>
> ----- Nicolas Delhomme <delhomme at embl.de> wrote:
>> Hej Bioc Core!
>>
>> There was some discussion last year about implementing a BamStreamer (à la FastqStreamer), but I haven't seen anything like it in the current devel. I've implemented the following function that should do the job for me - I have many very large files, and I need to use a cluster with relatively few RAM per node and a restrictive time allocation , so I want to parallelize the reading of the BAM file to manage both. The example below is obviously not affecting the RAM issue but I streamlined it to point out my issue.
>
> Hi Nico --
>
> I had viewed the 'BamStreamer' functionality to be implemented by
>
> bf <- open(BamFile("foo.bam", yieldSize=1234567))
> while (length(x <- readGAlignments(bf))) {}
> close(bf)
>
Good to see I recalled your Cambridge teaching correctly ;-)
> paradigm, without wanting to wrap it further (FastqStreamer isn't any more convenient) because the tough work will be done in {} and one would have to create some sort of complicated rule about function signatures for implementing this as a call-back. I usually implement '{}' as a reduction where all the work is done (as with summarizeOverlaps, where the bam data is reduced to a count vector and then summed across iterations through the loop) -- it doesn't seem like there's any memory management gains to be had by concatenating GappedAlignments (these are renamed GAlignments in devel) together;
I agree and that's what I meant by having created a streamlined version of my function so has to show the paradigm, not the actual data reduction steps.
> usually I'd parallelize over files
>
> bfl <- open(BamFileList(fls, yieldSize=1234567)))
> result <- mclapply(bfl, function(bf, anno, ...) {
> ## initialize, e.g., cnt <- integer(length(anno)))
> while(length(x <- readGAlignments(bf))) {}
> ## return, e.g., anno
> })
> close(bfl)
So do I.
>
> This doesn't address the efficiency of appending GAlignments.
This was really a side effect that appeared when I was preparing the code example. I'll go through Hervé's answer about this, they might indeed have been an issue with R version (devel or not), as I didn't get the GAlignments warnings although the package version numbers seemed to match what's in devel. I'll have a closer look.
Thanks,
Nico
>
> Martin
>
>>
>> ".stream" <- function(bamFile,yieldSize=100000,verbose=FALSE){
>>
>> ## create a stream
>> stopifnot(is(bamFile,"BamFile"))
>>
>> ## set the yieldSize if it is not set already
>> if(is.na(yieldSize(bamFile))){
>> yieldSize(bamFile) <- yieldSize
>> }
>>
>> ## open it
>> open(bamFile)
>>
>> ## verb
>> if(verbose){
>> message(paste("Streaming",basename(path(bamFile))))
>> }
>>
>> ## create the output
>> out <- GappedAlignments()
>>
>> ## process it
>> while(length(chunk <- readBamGappedAlignments(bamFile))){
>> if(verbose){
>> message(paste("Processed",length(chunk),"reads"))
>> }
>> out <- c(out,chunk)
>> }
>>
>> ## close
>> close(bamFile)
>>
>> ## return
>> return(out)
>> }
>>
>> In the method above, the first iteration of combining the GappedAlignments:
>>
>> out <- c(out,chunk) takes:
>>
>> system.time(append(out,chunk))
>>
>> user system elapsed
>> 123.704 0.060 124.011
>>
>> whereas the second iteration (faked here) takes only (still long):
>>
>> system.time(append(chunk,chunk))
>>
>> user system elapsed
>> 2.708 0.044 2.758
>>
>> I suppose this has to do with the way GenomicRanges:::unlist_list_of_GappedAlignments deals with combining the objects and all the related sanity checks. For the first iteration, the seqlengths are different so I suppose that is what explains the 60X lag compared to the second iteration. Due to the implementation of GappedAlignments, I can't set the seqlengths programmatically in GappedAlignments() which I imagine would have reduced the first iteration lag; see the trials below:
>>
>> out <- GappedAlignments(seqlengths=seqlengths(chunk))
>>
>> Error in GappedAlignments(seqlengths = seqlengths(chunk)) :
>> 'names(seqlengths)' incompatible with 'levels(seqnames)'
>>
>> out <- GappedAlignments(seqlengths=seqlengths(chunk),seqnames=seqnames(chunk))
>>
>> Error in GappedAlignments(seqlengths = seqlengths(chunk), seqnames = seqnames(chunk)) :
>> 'strand' must be specified when 'seqnames' is not empty
>>
>> out <- GappedAlignments(seqlengths=seqlengths(chunk),seqnames=seqnames(chunk),strand="+")
>>
>> Error in validObject(.Object) :
>> invalid class “GappedAlignments” object: 1: invalid object for slot "strand" in class "GappedAlignments": got class "character", should be or extend class "Rle"
>> invalid class “GappedAlignments” object: 2: number of rows in DataTable 'mcols(x)' must match length of 'x'
>>
>> I completely approve of such sanity checks; it seems that I'm just trying to do something that it was not designed for :-) All I'm really interested in is a way to stream my BAM file and I'm looking forward to any suggestion. I especially don't want to re-invent the wheel if you have already planned something. If you haven't I'd be glad to get some insight how I can walk around that problem.
>>
>> My sessionInfo:
>>
>> R version 3.0.1 (2013-05-16)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
>> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
>> [7] LC_PAPER=C LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] parallel stats graphics grDevices utils datasets methods
>> [8] base
>>
>> other attached packages:
>> [1] BiocInstaller_1.11.3 Rsamtools_1.13.22 Biostrings_2.29.12
>> [4] GenomicRanges_1.13.26 XVector_0.1.0 IRanges_1.19.15
>> [7] BiocGenerics_0.7.2
>>
>> loaded via a namespace (and not attached):
>> [1] bitops_1.0-5 stats4_3.0.1 zlibbioc_1.7.0
>>
>> Cheers,
>>
>> Nico
>>
>> ---------------------------------------------------------------
>> Nicolas Delhomme
>>
>> Genome Biology Computational Support
>>
>> European Molecular Biology Laboratory
>>
>> Tel: +49 6221 387 8310
>> Email: nicolas.delhomme at embl.de
>> Meyerhofstrasse 1 - Postfach 10.2209
>> 69102 Heidelberg, Germany
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
More information about the Bioc-devel
mailing list