[Bioc-devel] appending 2 GappedAlignments using "c" takes long
Nicolas Delhomme
delhomme at embl.de
Tue Jul 9 17:07:41 CEST 2013
Hej Bioc Core!
There was some discussion last year about implementing a BamStreamer (à la FastqStreamer), but I haven't seen anything like it in the current devel. I've implemented the following function that should do the job for me - I have many very large files, and I need to use a cluster with relatively few RAM per node and a restrictive time allocation , so I want to parallelize the reading of the BAM file to manage both. The example below is obviously not affecting the RAM issue but I streamlined it to point out my issue.
".stream" <- function(bamFile,yieldSize=100000,verbose=FALSE){
## create a stream
stopifnot(is(bamFile,"BamFile"))
## set the yieldSize if it is not set already
if(is.na(yieldSize(bamFile))){
yieldSize(bamFile) <- yieldSize
}
## open it
open(bamFile)
## verb
if(verbose){
message(paste("Streaming",basename(path(bamFile))))
}
## create the output
out <- GappedAlignments()
## process it
while(length(chunk <- readBamGappedAlignments(bamFile))){
if(verbose){
message(paste("Processed",length(chunk),"reads"))
}
out <- c(out,chunk)
}
## close
close(bamFile)
## return
return(out)
}
In the method above, the first iteration of combining the GappedAlignments:
out <- c(out,chunk) takes:
system.time(append(out,chunk))
user system elapsed
123.704 0.060 124.011
whereas the second iteration (faked here) takes only (still long):
system.time(append(chunk,chunk))
user system elapsed
2.708 0.044 2.758
I suppose this has to do with the way GenomicRanges:::unlist_list_of_GappedAlignments deals with combining the objects and all the related sanity checks. For the first iteration, the seqlengths are different so I suppose that is what explains the 60X lag compared to the second iteration. Due to the implementation of GappedAlignments, I can't set the seqlengths programmatically in GappedAlignments() which I imagine would have reduced the first iteration lag; see the trials below:
out <- GappedAlignments(seqlengths=seqlengths(chunk))
Error in GappedAlignments(seqlengths = seqlengths(chunk)) :
'names(seqlengths)' incompatible with 'levels(seqnames)'
out <- GappedAlignments(seqlengths=seqlengths(chunk),seqnames=seqnames(chunk))
Error in GappedAlignments(seqlengths = seqlengths(chunk), seqnames = seqnames(chunk)) :
'strand' must be specified when 'seqnames' is not empty
out <- GappedAlignments(seqlengths=seqlengths(chunk),seqnames=seqnames(chunk),strand="+")
Error in validObject(.Object) :
invalid class “GappedAlignments” object: 1: invalid object for slot "strand" in class "GappedAlignments": got class "character", should be or extend class "Rle"
invalid class “GappedAlignments” object: 2: number of rows in DataTable 'mcols(x)' must match length of 'x'
I completely approve of such sanity checks; it seems that I'm just trying to do something that it was not designed for :-) All I'm really interested in is a way to stream my BAM file and I'm looking forward to any suggestion. I especially don't want to re-invent the wheel if you have already planned something. If you haven't I'd be glad to get some insight how I can walk around that problem.
My sessionInfo:
R version 3.0.1 (2013-05-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] BiocInstaller_1.11.3 Rsamtools_1.13.22 Biostrings_2.29.12
[4] GenomicRanges_1.13.26 XVector_0.1.0 IRanges_1.19.15
[7] BiocGenerics_0.7.2
loaded via a namespace (and not attached):
[1] bitops_1.0-5 stats4_3.0.1 zlibbioc_1.7.0
Cheers,
Nico
---------------------------------------------------------------
Nicolas Delhomme
Genome Biology Computational Support
European Molecular Biology Laboratory
Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
More information about the Bioc-devel
mailing list