[Bioc-devel] appending 2 GappedAlignments using "c" takes long

Nicolas Delhomme delhomme at embl.de
Thu Jul 11 10:04:18 CEST 2013


Hi Martin,

On Jul 10, 2013, at 8:40 PM, Martin Morgan wrote:

> 
> ----- Nicolas Delhomme <delhomme at embl.de> wrote:
>> Hej Bioc Core!
>> 
>> There was some discussion last year about implementing a BamStreamer (à la FastqStreamer), but I haven't seen anything like it in the current devel. I've implemented the following function that should do the job for me - I have many very large files, and I need to use a cluster with relatively few RAM per node and a restrictive time allocation , so I want to parallelize the reading of the BAM file to manage both. The example below is obviously not affecting the RAM issue but I streamlined it to point out my issue.
> 
> Hi Nico --
> 
> I had viewed the 'BamStreamer' functionality to be implemented by 
> 
>   bf <- open(BamFile("foo.bam", yieldSize=1234567))
>   while (length(x <- readGAlignments(bf))) {}
>   close(bf)
> 

Good to see I recalled your Cambridge teaching correctly ;-)

> paradigm, without wanting to wrap it further (FastqStreamer isn't any more convenient) because the tough work will be done in {} and one would have to create some sort of complicated rule about function signatures for implementing this as a call-back. I usually implement '{}' as a reduction where all the work is done (as with summarizeOverlaps, where the bam data is reduced to a count vector and then summed across iterations through the loop) -- it doesn't seem like there's any memory management gains to be had by concatenating GappedAlignments (these are renamed GAlignments in devel) together;

I agree and that's what I meant by having created a streamlined version of my function so has to show the paradigm, not the actual data reduction steps.


> usually I'd parallelize over files
> 
>  bfl <- open(BamFileList(fls, yieldSize=1234567)))
>  result <- mclapply(bfl, function(bf, anno, ...) {
>      ## initialize, e.g., cnt <- integer(length(anno)))
>      while(length(x <- readGAlignments(bf))) {}
>      ## return, e.g., anno
>  })
>  close(bfl)

So do I.

> 
> This doesn't address the efficiency of appending GAlignments.

This was really a side effect that appeared when I was preparing the code example. I'll go through Hervé's answer about this, they might indeed have been an issue with  R version (devel or not), as I didn't get the GAlignments warnings although the package version numbers seemed to match what's in devel. I'll have a closer look.

Thanks,

Nico

> 
> Martin
> 
>> 
>> ".stream" <- function(bamFile,yieldSize=100000,verbose=FALSE){
>> 
>> ## create a stream
>> stopifnot(is(bamFile,"BamFile"))
>> 
>> ## set the yieldSize if it is not set already
>> if(is.na(yieldSize(bamFile))){
>>   yieldSize(bamFile) <- yieldSize
>> }
>> 
>> ## open it
>> open(bamFile)
>> 
>> ## verb
>> if(verbose){
>>   message(paste("Streaming",basename(path(bamFile))))
>> }
>> 
>> ## create the output
>> out <- GappedAlignments()
>> 
>> ## process it
>> while(length(chunk <- readBamGappedAlignments(bamFile))){
>>   if(verbose){
>>     message(paste("Processed",length(chunk),"reads"))
>>   }
>>   out <- c(out,chunk)
>> }
>> 
>> ## close
>> close(bamFile)
>> 
>> ## return
>> return(out)
>> }
>> 
>> In the method above, the first iteration of combining the GappedAlignments:
>> 
>> out <- c(out,chunk) takes:
>> 
>> system.time(append(out,chunk))
>> 
>>  user  system elapsed 
>> 123.704   0.060 124.011 
>> 
>> whereas the second iteration (faked here) takes only (still long):
>> 
>> system.time(append(chunk,chunk))
>> 
>>  user  system elapsed 
>> 2.708   0.044   2.758
>> 
>> I suppose this has to do with the way GenomicRanges:::unlist_list_of_GappedAlignments deals with combining the objects and all the related sanity checks. For the first iteration, the seqlengths are different so I suppose that is what explains the 60X lag compared to the second iteration. Due to the implementation of GappedAlignments, I can't set the seqlengths programmatically in GappedAlignments() which I imagine would have reduced the first iteration lag; see the trials below:
>> 
>> out <- GappedAlignments(seqlengths=seqlengths(chunk))
>> 
>> Error in GappedAlignments(seqlengths = seqlengths(chunk)) : 
>> 'names(seqlengths)' incompatible with 'levels(seqnames)'
>> 
>> out <- GappedAlignments(seqlengths=seqlengths(chunk),seqnames=seqnames(chunk))
>> 
>> Error in GappedAlignments(seqlengths = seqlengths(chunk), seqnames = seqnames(chunk)) : 
>> 'strand' must be specified when 'seqnames' is not empty
>> 
>> out <- GappedAlignments(seqlengths=seqlengths(chunk),seqnames=seqnames(chunk),strand="+")
>> 
>> Error in validObject(.Object) : 
>> invalid class “GappedAlignments” object: 1: invalid object for slot "strand" in class "GappedAlignments": got class "character", should be or extend class "Rle"
>> invalid class “GappedAlignments” object: 2: number of rows in DataTable 'mcols(x)' must match length of 'x'
>> 
>> I completely approve of such sanity checks; it seems that I'm just trying to do something that it was not designed for :-) All I'm really interested in is a way to stream my BAM file and I'm looking forward to any suggestion. I especially don't want to re-invent the wheel if you have already planned something. If you haven't I'd be glad to get some insight how I can walk around that problem.
>> 
>> My sessionInfo:
>> 
>> R version 3.0.1 (2013-05-16)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>> 
>> locale:
>> [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
>> [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
>> [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
>> [7] LC_PAPER=C                 LC_NAME=C                 
>> [9] LC_ADDRESS=C               LC_TELEPHONE=C            
>> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
>> 
>> attached base packages:
>> [1] parallel  stats     graphics  grDevices utils     datasets  methods  
>> [8] base     
>> 
>> other attached packages:
>> [1] BiocInstaller_1.11.3  Rsamtools_1.13.22     Biostrings_2.29.12   
>> [4] GenomicRanges_1.13.26 XVector_0.1.0         IRanges_1.19.15      
>> [7] BiocGenerics_0.7.2   
>> 
>> loaded via a namespace (and not attached):
>> [1] bitops_1.0-5   stats4_3.0.1   zlibbioc_1.7.0
>> 
>> Cheers,
>> 
>> Nico
>> 
>> ---------------------------------------------------------------
>> Nicolas Delhomme
>> 
>> Genome Biology Computational Support
>> 
>> European Molecular Biology Laboratory
>> 
>> Tel: +49 6221 387 8310
>> Email: nicolas.delhomme at embl.de
>> Meyerhofstrasse 1 - Postfach 10.2209
>> 69102 Heidelberg, Germany
>> 
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> 



More information about the Bioc-devel mailing list