[Bioc-sig-seq] understanding memory size of AlignedRead objects

Martin Morgan mtmorgan at fhcrc.org
Wed May 11 01:25:10 CEST 2011


On 05/10/2011 02:47 PM, Janet Young wrote:
> Hi, (probably hello to you, Martin)

Hi (or is that hello?) Janet --

> I'm looking at some Illumina seq data, and trying to be more rigorous
> than I have been in the past about memory usage and tidying up unused
> variables. I'm a little mystified by something - I wonder if you can
> help me understand?
>
> I'm starting with a big AlignedRead object (one full lane of seq
> data) and then I've been using [] on AlignedRead objects to take
> various subsets of the data (and then looking at quality scores, map
> positions, etc).   I'm also taking some very small subsets (e.g. just
> the first 100 reads) to test and optimize some functions I'm
> writing.
>
> My confusion comes because even though I'm cutting down the number of
> seq reads by a lot (e.g. from 18 million to just 100 reads), the new
> AlignedRead object still takes up a lot of memory.

XStringSet (including DNAStringSet and BStringSet, which are used to 
hold AlignedRead sequence and quality information) are actually 'views' 
(start and end coordinates) on an underlying XString; when you subset a 
DNAStringSet, you subset the view but not the underlying DNAString. This 
might sound bad, but actually you have just one instance of the 
DNAString, and two light-weight views (the full view, and the subset 
view). If you're done with the full DNAString, you can force it to be 
compacated with compact(), e.g., after your first example

 > object.size(s1)
50840 bytes
 > object.size(s1[1:10]) # mostly smaller 'view'
38984 bytes
 > object.size(s1[1]) # a little smaller 'view', same big DNAString
38864 bytes
 > object.size(compact(s1[1]))
3912 bytes

or

 > object.size(aln)
175968 bytes
 > object.size(aln[1])
91488 bytes
 > object.size(compact(aln[1]))
21584 bytes

(I think Herve is working on your weighted coverage question)

Martin

>
> Two examples are given below - in both cases the small object takes
> about half as much memory as the original, even though the number of
> reads is now very much smaller.
>
> Do you have any suggestions as to how I might reduce the memory
> footprint of the subsetted AlignedRead object?  Is this an expected
> behavior?
>
> thanks very much,
>
> Janet
>
>
> library(ShortRead)
>
> exptPath<- system.file("extdata", package = "ShortRead") sp<-
> SolexaPath(exptPath) aln<- readAligned(sp, "s_2_export.txt")
>
> aln  ## aln has 1000 reads aln_small<- aln[1:2]   ### aln 2 has 2
> reads
>
> object.size(aln) # 165156 bytes object.size(aln_small) # 82220 bytes
>
> as.numeric(object.size(aln_small)) / as.numeric(object.size(aln))
> #### [1] 0.4978324
>
> read2Dir<-
> "data/solexa/110317_SN367_0148_A81NVUABXX/Data/Intensities/BaseCalls/GERALD_24-03-2011_solexa.2"
>
>
my_reads<- readAligned(read2Dir, pattern="s_1_export.txt", 
type="SolexaExport")
> my_reads_verysmall<- my_reads[1:100]
>
> length(my_reads) # [1] 17894091 length(my_reads_verysmall) # [1] 100
>
> object.size(my_reads) # 3190125528 bytes
> object.size(my_reads_verysmall) # 1753653496 bytes
>
> as.numeric(object.size(my_reads_verysmall)) /
> as.numeric(object.size(my_reads)) # [1] 0.549713
>
>
>
> sessionInfo()
>
> R version 2.13.0 (2011-04-13) Platform: i386-apple-darwin9.8.0/i386
> (32-bit)
>
> locale: [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages: [1] stats     graphics  grDevices utils
> datasets  methods   base
>
> other attached packages: [1] ShortRead_1.10.0    Rsamtools_1.4.1
> lattice_0.19-26     Biostrings_2.20.0 [5] GenomicRanges_1.4.3
> IRanges_1.10.0
>
> loaded via a namespace (and not attached): [1] Biobase_2.12.1
> grid_2.13.0    hwriter_1.3
>
> _______________________________________________ Bioc-sig-sequencing
> mailing list Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing


-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793



More information about the Bioc-sig-sequencing mailing list