[Bioc-devel] appending 2 GappedAlignments using "c" takes long

Martin Morgan mtmorgan at fhcrc.org
Wed Jul 10 20:40:00 CEST 2013


----- Nicolas Delhomme <delhomme at embl.de> wrote:
> Hej Bioc Core!
> 
> There was some discussion last year about implementing a BamStreamer (à la FastqStreamer), but I haven't seen anything like it in the current devel. I've implemented the following function that should do the job for me - I have many very large files, and I need to use a cluster with relatively few RAM per node and a restrictive time allocation , so I want to parallelize the reading of the BAM file to manage both. The example below is obviously not affecting the RAM issue but I streamlined it to point out my issue.

Hi Nico --

I had viewed the 'BamStreamer' functionality to be implemented by 

   bf <- open(BamFile("foo.bam", yieldSize=1234567))
   while (length(x <- readGAlignments(bf))) {}
   close(bf)

paradigm, without wanting to wrap it further (FastqStreamer isn't any more convenient) because the tough work will be done in {} and one would have to create some sort of complicated rule about function signatures for implementing this as a call-back. I usually implement '{}' as a reduction where all the work is done (as with summarizeOverlaps, where the bam data is reduced to a count vector and then summed across iterations through the loop) -- it doesn't seem like there's any memory management gains to be had by concatenating GappedAlignments (these are renamed GAlignments in devel) together; usually I'd parallelize over files

  bfl <- open(BamFileList(fls, yieldSize=1234567)))
  result <- mclapply(bfl, function(bf, anno, ...) {
      ## initialize, e.g., cnt <- integer(length(anno)))
      while(length(x <- readGAlignments(bf))) {}
      ## return, e.g., anno
  })
  close(bfl)

This doesn't address the efficiency of appending GAlignments.

Martin

> 
> ".stream" <- function(bamFile,yieldSize=100000,verbose=FALSE){
> 
>  ## create a stream
>  stopifnot(is(bamFile,"BamFile"))
> 
>  ## set the yieldSize if it is not set already
>  if(is.na(yieldSize(bamFile))){
>    yieldSize(bamFile) <- yieldSize
>  }
> 
>  ## open it
>  open(bamFile)
> 
>  ## verb
>  if(verbose){
>    message(paste("Streaming",basename(path(bamFile))))
>  }
> 
>  ## create the output
>  out <- GappedAlignments()
> 
>  ## process it
>  while(length(chunk <- readBamGappedAlignments(bamFile))){
>    if(verbose){
>      message(paste("Processed",length(chunk),"reads"))
>    }
>    out <- c(out,chunk)
>  }
> 
>  ## close
>  close(bamFile)
> 
>  ## return
>  return(out)
> }
> 
> In the method above, the first iteration of combining the GappedAlignments:
> 
> out <- c(out,chunk) takes:
> 
> system.time(append(out,chunk))
> 
>   user  system elapsed 
> 123.704   0.060 124.011 
> 
> whereas the second iteration (faked here) takes only (still long):
> 
> system.time(append(chunk,chunk))
> 
>   user  system elapsed 
>  2.708   0.044   2.758
> 
> I suppose this has to do with the way GenomicRanges:::unlist_list_of_GappedAlignments deals with combining the objects and all the related sanity checks. For the first iteration, the seqlengths are different so I suppose that is what explains the 60X lag compared to the second iteration. Due to the implementation of GappedAlignments, I can't set the seqlengths programmatically in GappedAlignments() which I imagine would have reduced the first iteration lag; see the trials below:
> 
> out <- GappedAlignments(seqlengths=seqlengths(chunk))
> 
> Error in GappedAlignments(seqlengths = seqlengths(chunk)) : 
>  'names(seqlengths)' incompatible with 'levels(seqnames)'
> 
> out <- GappedAlignments(seqlengths=seqlengths(chunk),seqnames=seqnames(chunk))
> 
> Error in GappedAlignments(seqlengths = seqlengths(chunk), seqnames = seqnames(chunk)) : 
>  'strand' must be specified when 'seqnames' is not empty
> 
> out <- GappedAlignments(seqlengths=seqlengths(chunk),seqnames=seqnames(chunk),strand="+")
> 
> Error in validObject(.Object) : 
>  invalid class “GappedAlignments” object: 1: invalid object for slot "strand" in class "GappedAlignments": got class "character", should be or extend class "Rle"
> invalid class “GappedAlignments” object: 2: number of rows in DataTable 'mcols(x)' must match length of 'x'
> 
> I completely approve of such sanity checks; it seems that I'm just trying to do something that it was not designed for :-) All I'm really interested in is a way to stream my BAM file and I'm looking forward to any suggestion. I especially don't want to re-invent the wheel if you have already planned something. If you haven't I'd be glad to get some insight how I can walk around that problem.
> 
> My sessionInfo:
> 
> R version 3.0.1 (2013-05-16)
> Platform: x86_64-unknown-linux-gnu (64-bit)
> 
> locale:
> [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
> [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
> [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
> [7] LC_PAPER=C                 LC_NAME=C                 
> [9] LC_ADDRESS=C               LC_TELEPHONE=C            
> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
> 
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods  
> [8] base     
> 
> other attached packages:
> [1] BiocInstaller_1.11.3  Rsamtools_1.13.22     Biostrings_2.29.12   
> [4] GenomicRanges_1.13.26 XVector_0.1.0         IRanges_1.19.15      
> [7] BiocGenerics_0.7.2   
> 
> loaded via a namespace (and not attached):
> [1] bitops_1.0-5   stats4_3.0.1   zlibbioc_1.7.0
> 
> Cheers,
> 
> Nico
> 
> ---------------------------------------------------------------
> Nicolas Delhomme
> 
> Genome Biology Computational Support
> 
> European Molecular Biology Laboratory
> 
> Tel: +49 6221 387 8310
> Email: nicolas.delhomme at embl.de
> Meyerhofstrasse 1 - Postfach 10.2209
> 69102 Heidelberg, Germany
> 
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel



More information about the Bioc-devel mailing list