[Bioc-devel] appending 2 GappedAlignments using "c" takes long
Hervé Pagès
hpages at fhcrc.org
Fri Jul 12 21:05:20 CEST 2013
Hi Nico,
On 07/11/2013 02:11 AM, Nicolas Delhomme wrote:
> That particular machine has been up for 40 days although all the parameters are in the green right now.
>
> Doing an rbind is as quick as what you reported:
>> system.time(df2 <- rbind(df, df))
> user system elapsed
> 0.104 0.000 0.104
>
>
> And now I do get the GAlignments warning:
>
>> GappedAlignments()
> GAlignments with 0 alignments and 0 metadata columns:
> seqnames strand cigar qwidth start end width
> <Rle> <Rle> <character> <integer> <integer> <integer> <integer>
> ngap
> <integer>
> ---
> seqlengths:
>
>
> Warning message:
> The GappedAlignments class, the GappedAlignments()
> constructor, and the readGappedAlignments() function, have been
> renamed: GAlignments, GAlignments(), and readGAlignments(),
> respectively. The old names are deprecated. Please use the new
> names instead.
>
> And the appending works as for you:
>
>> library(Rsamtools)
>> library(RNAseqData.HNRNPC.bam.chr14)
>> bamfile <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[1L])
>> yieldSize(bamfile) <- 100000L
>> open(bamfile)
>> out <- GappedAlignments()
> Warning message:
> The GappedAlignments class, the GappedAlignments()
> constructor, and the readGappedAlignments() function, have been
> renamed: GAlignments, GAlignments(), and readGAlignments(),
> respectively. The old names are deprecated. Please use the new
> names instead.
>> chunk <- readBamGappedAlignments(bamfile)
> Warning message:
> 'readBamGappedAlignments' is deprecated.
> Use 'readGAlignmentsFromBam' instead.
> See help("Deprecated")
>
>> system.time(out <- append(out, chunk))
> user system elapsed
> 0.092 0.000 0.091
>
>> chunk <- readBamGappedAlignments(bamfile)
> Warning message:
> 'readBamGappedAlignments' is deprecated.
> Use 'readGAlignmentsFromBam' instead.
> See help("Deprecated")
>
>> system.time(out <- append(out, chunk))
> user system elapsed
> 0.372 0.000 0.369
>
>> chunk <- readBamGappedAlignments(bamfile)
> Warning message:
> 'readBamGappedAlignments' is deprecated.
> Use 'readGAlignmentsFromBam' instead.
> See help("Deprecated")
>
>> system.time(out <- append(out, chunk))
> user system elapsed
> 0.896 0.012 0.909
>
>
> And the sessionInfo are as before:
>
>> sessionInfo()
> R version 3.0.1 (2013-05-16)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel stats graphics grDevices utils datasets methods
> [8] base
>
> other attached packages:
> [1] GenomicRanges_1.13.26 Biostrings_2.29.12 XVector_0.1.0
> [4] IRanges_1.19.15 BiocGenerics_0.7.2
>
> loaded via a namespace (and not attached):
> [1] stats4_3.0.1
>
> So I'm not sure what happened; so far, I can only imagine an NFS / RAID related issue.
>
> Doing it with my own data gives the same results as above.
>
> Sorry for bothering you with that and many thanks for the help.
No problem. Maybe method dispatch was having a particularly bad day
that day. Glad everything looks ok now :-)
Cheers,
H.
>
> Cheers,
>
> Nico
>
> ---------------------------------------------------------------
> Nicolas Delhomme
>
> Genome Biology Computational Support
>
> European Molecular Biology Laboratory
>
> Tel: +49 6221 387 8310
> Email: nicolas.delhomme at embl.de
> Meyerhofstrasse 1 - Postfach 10.2209
> 69102 Heidelberg, Germany
> ---------------------------------------------------------------
>
>
>
>
>
> On Jul 11, 2013, at 10:15 AM, Nicolas Delhomme wrote:
>
>> Hej Hervé!
>>
>> ---------------------------------------------------------------
>> Nicolas Delhomme
>>
>> Genome Biology Computational Support
>>
>> European Molecular Biology Laboratory
>>
>> Tel: +49 6221 387 8310
>> Email: nicolas.delhomme at embl.de
>> Meyerhofstrasse 1 - Postfach 10.2209
>> 69102 Heidelberg, Germany
>> ---------------------------------------------------------------
>>
>> On Jul 10, 2013, at 8:54 PM, Hervé Pagès wrote:
>>
>>> Hi Nico,
>>>
>>> On 07/09/2013 08:07 AM, Nicolas Delhomme wrote:
>>>> Hej Bioc Core!
>>>>
>>>> There was some discussion last year about implementing a BamStreamer (à la FastqStreamer), but I haven't seen anything like it in the current devel. I've implemented the following function that should do the job for me - I have many very large files, and I need to use a cluster with relatively few RAM per node and a restrictive time allocation , so I want to parallelize the reading of the BAM file to manage both. The example below is obviously not affecting the RAM issue but I streamlined it to point out my issue.
>>>>
>>>> ".stream" <- function(bamFile,yieldSize=100000,verbose=FALSE){
>>>>
>>>> ## create a stream
>>>> stopifnot(is(bamFile,"BamFile"))
>>>>
>>>> ## set the yieldSize if it is not set already
>>>> if(is.na(yieldSize(bamFile))){
>>>> yieldSize(bamFile) <- yieldSize
>>>> }
>>>>
>>>> ## open it
>>>> open(bamFile)
>>>>
>>>> ## verb
>>>> if(verbose){
>>>> message(paste("Streaming",basename(path(bamFile))))
>>>> }
>>>>
>>>> ## create the output
>>>> out <- GappedAlignments()
>>>>
>>>> ## process it
>>>> while(length(chunk <- readBamGappedAlignments(bamFile))){
>>>> if(verbose){
>>>> message(paste("Processed",length(chunk),"reads"))
>>>> }
>>>> out <- c(out,chunk)
>>>> }
>>>
>>> Note that regardless the speed of c() on GappedAlignments objects,
>>> growing an object in a loop is fundamentally inefficient (see Circle 2
>>> of The R Inferno).
>>> Also keeping the chunks in memory kind of defeats the purpose of reading
>>> the file one chunk at a time.
>>
>> Sure. What this function normally really does is a data reduction - basically getting a named vector back. I just came across the appending issue when preparing the code example above.
>>
>>>
>>>>
>>>> ## close
>>>> close(bamFile)
>>>>
>>>> ## return
>>>> return(out)
>>>> }
>>>>
>>>> In the method above, the first iteration of combining the GappedAlignments:
>>>>
>>>> out <- c(out,chunk) takes:
>>>>
>>>> system.time(append(out,chunk))
>>>>
>>>> user system elapsed
>>>> 123.704 0.060 124.011
>>>
>>> 2 minutes! Whaoo, that's really slow. I can't reproduce this on my
>>> machine though:
>>>
>>
>> OK, sounds more like a system issue then.
>>
>>> library(Rsamtools)
>>> library(RNAseqData.HNRNPC.bam.chr14)
>>> bamfile <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[1L])
>>> yieldSize(bamfile) <- 100000L
>>> open(bamfile)
>>> out <- GappedAlignments()
>>>
>>> Then:
>>>
>>>> chunk <- readBamGappedAlignments(bamfile)
>>>> system.time(out <- append(out, chunk))
>>> user system elapsed
>>> 0.284 0.000 0.286
>>>
>>> I wonder what's going on on your system. Are you sure it was not running
>>> out of memory when you did this?
>>
>> Yes, that's a fat node with 0.2TB RAM and I was the only one on it at the time.
>>
>>> Try to check the load with uptime or
>>> top in another terminal (e.g. start top right before you call append()).
>>> If the system starts swapping, then your R process will become hundreds
>>> or thousands times slower!
>>
>> and there was no memory intensive job running. Could still have been some NFS related issue. I will retry with a fresh session and monitor the I/O as well.
>>
>>>
>>>>
>>>> whereas the second iteration (faked here) takes only (still long):
>>>>
>>>> system.time(append(chunk,chunk))
>>>>
>>>> user system elapsed
>>>> 2.708 0.044 2.758
>>>
>>> 2nd, 3rd and 4th iterations for me:
>>>
>>>> chunk <- readBamGappedAlignments(bamfile)
>>>> system.time(out <- append(out, chunk))
>>> user system elapsed
>>> 0.516 0.004 0.521
>>>
>>>> chunk <- readBamGappedAlignments(bamfile)
>>>> system.time(out <- append(out, chunk))
>>> user system elapsed
>>> 0.656 0.008 0.663
>>>
>>>> chunk <- readBamGappedAlignments(bamfile)
>>>> system.time(out <- append(out, chunk))
>>> user system elapsed
>>> 0.796 0.004 0.801
>>>
>>> As expected, the time is growing (this is why the process
>>> of growing an object in a loop is considered to be quadratic
>>> in time).
>>
>> Quadratic! Wow, I knew it was slower but still... Good to know.
>>
>>>
>>>>
>>>> I suppose this has to do with the way GenomicRanges:::unlist_list_of_GappedAlignments deals with combining the objects and all the related sanity checks. For the first iteration, the seqlengths are different so I suppose that is what explains the 60X lag compared to the second iteration.
>>>
>>> The seqinfo of the 2 objects to combine need to be merged together
>>> and set back on each object before the 2 objects can actually
>>> be combined. This operation is cheap and I wouldn't expect this
>>> to slow down the first iteration significantly.
>>
>> Yes, that was very surprising.
>>
>>>
>>>> Due to the implementation of GappedAlignments, I can't set the seqlengths programmatically in GappedAlignments() which I imagine would have reduced the first iteration lag; see the trials below:
>>>>
>>>> out <- GappedAlignments(seqlengths=seqlengths(chunk))
>>>>
>>>> Error in GappedAlignments(seqlengths = seqlengths(chunk)) :
>>>> 'names(seqlengths)' incompatible with 'levels(seqnames)'
>>>>
>>>> out <- GappedAlignments(seqlengths=seqlengths(chunk),seqnames=seqnames(chunk))
>>>>
>>>> Error in GappedAlignments(seqlengths = seqlengths(chunk), seqnames = seqnames(chunk)) :
>>>> 'strand' must be specified when 'seqnames' is not empty
>>>>
>>>> out <- GappedAlignments(seqlengths=seqlengths(chunk),seqnames=seqnames(chunk),strand="+")
>>>>
>>>> Error in validObject(.Object) :
>>>> invalid class “GappedAlignments” object: 1: invalid object for slot "strand" in class "GappedAlignments": got class "character", should be or extend class "Rle"
>>>> invalid class “GappedAlignments” object: 2: number of rows in DataTable 'mcols(x)' must match length of 'x'
>>>
>>> The trick is to create an empty GappedAlignments objects
>>> with non-empty seqlevels so you can put seqlengths on the
>>> seqlevels.
>>>
>>> Here are 2 ways to create an empty GappedAlignments objects with
>>> non-empty seqlevels:
>>>
>>> (1) Pass an empty factor with non-empty levels to the seqnames
>>> arg:
>>>
>>> out <- GappedAlignments(seqnames=factor(levels=seqlevels(chunk)))
>>>
>>> (2) The recommended way:
>>>
>>> out <- GappedAlignments()
>>> seqinfo(out) <- seqinfo(chunk)
>>>
>>> Note that with (2), 'out' gets all the seqinfo from 'chunk' (including
>>> its seqlengths), not only its seqlevels.
>>>
>>> (1) could be adapted to also set the seqlengths:
>>>
>>> out <- GappedAlignments(seqnames=factor(levels=seqlevels(chunk)),
>>> seqlengths=seqlengths(chunk))
>>>
>>> but (2) is really the preferred way.
>>
>> Thanks for the pointers!
>>
>>>
>>>>
>>>> I completely approve of such sanity checks; it seems that I'm just trying to do something that it was not designed for :-) All I'm really interested in is a way to stream my BAM file and I'm looking forward to any suggestion. I especially don't want to re-invent the wheel if you have already planned something. If you haven't I'd be glad to get some insight how I can walk around that problem.
>>>>
>>>> My sessionInfo:
>>>>
>>>> R version 3.0.1 (2013-05-16)
>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>
>>>> locale:
>>>> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
>>>> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
>>>> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
>>>> [7] LC_PAPER=C LC_NAME=C
>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>>> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] parallel stats graphics grDevices utils datasets methods
>>>> [8] base
>>>>
>>>> other attached packages:
>>>> [1] BiocInstaller_1.11.3 Rsamtools_1.13.22 Biostrings_2.29.12
>>>> [4] GenomicRanges_1.13.26 XVector_0.1.0 IRanges_1.19.15
>>>> [7] BiocGenerics_0.7.2
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] bitops_1.0-5 stats4_3.0.1 zlibbioc_1.7.0
>>>
>>>
>>> Looks like you are using Bioc-devel. Did you get all the
>>> warnings about GappedAlignments, readBamGappedAlignments(),
>>> and GappedAlignments() being deprecated?
>>
>> I though I did, but indeed I didn't get the warnings then. This is very strange.
>>
>>>
>>> I thought you were using the release so that's what I used:
>>>
>>>> sessionInfo()
>>> R version 3.0.0 (2013-04-03)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>>> [7] LC_PAPER=C LC_NAME=C
>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] parallel stats graphics grDevices utils datasets methods
>>> [8] base
>>>
>>> other attached packages:
>>> [1] RNAseqData.HNRNPC.bam.chr14_0.1.3 Rsamtools_1.12.3
>>> [3] Biostrings_2.28.0 GenomicRanges_1.12.4
>>> [5] IRanges_1.18.1 BiocGenerics_0.6.0
>>>
>>> loaded via a namespace (and not attached):
>>> [1] bitops_1.0-5 stats4_3.0.0 zlibbioc_1.6.0
>>>
>>>
>>> The timings I get with Bioc-devel are pretty much the same though.
>>>
>>> Something doesn't seem to be quite right with your cluster.
>>
>> I agree, I'll check that out.
>>
>>> What happens
>>> if you try to rbind() 2 data.frames of 100000 rows each in a fresh
>>> session?
>>>
>>>> df <- data.frame(aa=1:100000, bb=100000:1, cc="cc", dd="dd")
>>>> system.time(df2 <- rbind(df, df))
>>> user system elapsed
>>> 0.204 0.000 0.206
>>>
>>
>> Good point. I'll try that out and let you know.
>>
>> Thanks for the very detailed answer!
>>
>> Cheers,
>>
>> Nico
>>
>>
>>> Thanks,
>>> H.
>>>
>>>>
>>>> Cheers,
>>>>
>>>> Nico
>>>>
>>>> ---------------------------------------------------------------
>>>> Nicolas Delhomme
>>>>
>>>> Genome Biology Computational Support
>>>>
>>>> European Molecular Biology Laboratory
>>>>
>>>> Tel: +49 6221 387 8310
>>>> Email: nicolas.delhomme at embl.de
>>>> Meyerhofstrasse 1 - Postfach 10.2209
>>>> 69102 Heidelberg, Germany
>>>>
>>>> _______________________________________________
>>>> Bioc-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>
>>>
>>> --
>>> Hervé Pagès
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org
>>> Phone: (206) 667-5791
>>> Fax: (206) 667-1319
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioc-devel
mailing list