[Bioc-devel] appending 2 GappedAlignments using "c" takes long
Martin Morgan
mtmorgan at fhcrc.org
Wed Jul 10 20:40:00 CEST 2013
----- Nicolas Delhomme <delhomme at embl.de> wrote:
> Hej Bioc Core!
>
> There was some discussion last year about implementing a BamStreamer (à la FastqStreamer), but I haven't seen anything like it in the current devel. I've implemented the following function that should do the job for me - I have many very large files, and I need to use a cluster with relatively few RAM per node and a restrictive time allocation , so I want to parallelize the reading of the BAM file to manage both. The example below is obviously not affecting the RAM issue but I streamlined it to point out my issue.
Hi Nico --
I had viewed the 'BamStreamer' functionality to be implemented by
bf <- open(BamFile("foo.bam", yieldSize=1234567))
while (length(x <- readGAlignments(bf))) {}
close(bf)
paradigm, without wanting to wrap it further (FastqStreamer isn't any more convenient) because the tough work will be done in {} and one would have to create some sort of complicated rule about function signatures for implementing this as a call-back. I usually implement '{}' as a reduction where all the work is done (as with summarizeOverlaps, where the bam data is reduced to a count vector and then summed across iterations through the loop) -- it doesn't seem like there's any memory management gains to be had by concatenating GappedAlignments (these are renamed GAlignments in devel) together; usually I'd parallelize over files
bfl <- open(BamFileList(fls, yieldSize=1234567)))
result <- mclapply(bfl, function(bf, anno, ...) {
## initialize, e.g., cnt <- integer(length(anno)))
while(length(x <- readGAlignments(bf))) {}
## return, e.g., anno
})
close(bfl)
This doesn't address the efficiency of appending GAlignments.
Martin
>
> ".stream" <- function(bamFile,yieldSize=100000,verbose=FALSE){
>
> ## create a stream
> stopifnot(is(bamFile,"BamFile"))
>
> ## set the yieldSize if it is not set already
> if(is.na(yieldSize(bamFile))){
> yieldSize(bamFile) <- yieldSize
> }
>
> ## open it
> open(bamFile)
>
> ## verb
> if(verbose){
> message(paste("Streaming",basename(path(bamFile))))
> }
>
> ## create the output
> out <- GappedAlignments()
>
> ## process it
> while(length(chunk <- readBamGappedAlignments(bamFile))){
> if(verbose){
> message(paste("Processed",length(chunk),"reads"))
> }
> out <- c(out,chunk)
> }
>
> ## close
> close(bamFile)
>
> ## return
> return(out)
> }
>
> In the method above, the first iteration of combining the GappedAlignments:
>
> out <- c(out,chunk) takes:
>
> system.time(append(out,chunk))
>
> user system elapsed
> 123.704 0.060 124.011
>
> whereas the second iteration (faked here) takes only (still long):
>
> system.time(append(chunk,chunk))
>
> user system elapsed
> 2.708 0.044 2.758
>
> I suppose this has to do with the way GenomicRanges:::unlist_list_of_GappedAlignments deals with combining the objects and all the related sanity checks. For the first iteration, the seqlengths are different so I suppose that is what explains the 60X lag compared to the second iteration. Due to the implementation of GappedAlignments, I can't set the seqlengths programmatically in GappedAlignments() which I imagine would have reduced the first iteration lag; see the trials below:
>
> out <- GappedAlignments(seqlengths=seqlengths(chunk))
>
> Error in GappedAlignments(seqlengths = seqlengths(chunk)) :
> 'names(seqlengths)' incompatible with 'levels(seqnames)'
>
> out <- GappedAlignments(seqlengths=seqlengths(chunk),seqnames=seqnames(chunk))
>
> Error in GappedAlignments(seqlengths = seqlengths(chunk), seqnames = seqnames(chunk)) :
> 'strand' must be specified when 'seqnames' is not empty
>
> out <- GappedAlignments(seqlengths=seqlengths(chunk),seqnames=seqnames(chunk),strand="+")
>
> Error in validObject(.Object) :
> invalid class “GappedAlignments” object: 1: invalid object for slot "strand" in class "GappedAlignments": got class "character", should be or extend class "Rle"
> invalid class “GappedAlignments” object: 2: number of rows in DataTable 'mcols(x)' must match length of 'x'
>
> I completely approve of such sanity checks; it seems that I'm just trying to do something that it was not designed for :-) All I'm really interested in is a way to stream my BAM file and I'm looking forward to any suggestion. I especially don't want to re-invent the wheel if you have already planned something. If you haven't I'd be glad to get some insight how I can walk around that problem.
>
> My sessionInfo:
>
> R version 3.0.1 (2013-05-16)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel stats graphics grDevices utils datasets methods
> [8] base
>
> other attached packages:
> [1] BiocInstaller_1.11.3 Rsamtools_1.13.22 Biostrings_2.29.12
> [4] GenomicRanges_1.13.26 XVector_0.1.0 IRanges_1.19.15
> [7] BiocGenerics_0.7.2
>
> loaded via a namespace (and not attached):
> [1] bitops_1.0-5 stats4_3.0.1 zlibbioc_1.7.0
>
> Cheers,
>
> Nico
>
> ---------------------------------------------------------------
> Nicolas Delhomme
>
> Genome Biology Computational Support
>
> European Molecular Biology Laboratory
>
> Tel: +49 6221 387 8310
> Email: nicolas.delhomme at embl.de
> Meyerhofstrasse 1 - Postfach 10.2209
> 69102 Heidelberg, Germany
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
More information about the Bioc-devel
mailing list