[Bioc-sig-seq] understanding memory size of AlignedRead objects

Tue May 10 23:47:01 CEST 2011

Hi, (probably hello to you, Martin)

I'm looking at some Illumina sequence data, and trying to be more rigorous than I have been in the past about memory usage and tidying up unused variables. I'm a little mystified by something, and I wonder if you can help me understand it.

I'm starting with a big AlignedRead object (one full lane of sequence data) and then using [] on it to take various subsets of the data (and then looking at quality scores, map positions, etc). I'm also taking some very small subsets (e.g. just the first 100 reads) to test and optimize some functions I'm writing.

My confusion is that even though I'm cutting down the number of reads by a lot (e.g. from 18 million to just 100), the new AlignedRead object still takes up a lot of memory.

Two examples are given below. In both cases object.size() reports the small object at about half the size of the original, even though it contains far fewer reads.

Do you have any suggestions as to how I might reduce the memory footprint of the subsetted AlignedRead object? Is this expected behavior?

thanks very much,

Janet

library(ShortRead)

exptPath <- system.file("extdata", package = "ShortRead")
sp <- SolexaPath(exptPath)
aln <- readAligned(sp, "s_2_export.txt")

aln  ## aln has 1000 reads
aln_small <- aln[1:2]   ## aln_small has 2 reads

object.size(aln)
# 165156 bytes
object.size(aln_small)
# 82220 bytes

as.numeric(object.size(aln_small)) / as.numeric(object.size(aln))
# [1] 0.4978324
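
One way I could imagine narrowing this down is to look at object.size() slot by slot, to see which parts of the AlignedRead object fail to shrink after subsetting. This is only a sketch: slot_sizes is a helper name I made up for illustration (it is not part of ShortRead), and it relies only on slotNames() and slot() from the methods package.

```r
library(ShortRead)

# Hypothetical helper (not part of ShortRead):
# object.size() of each slot of an S4 object, in bytes
slot_sizes <- function(obj) {
  sapply(slotNames(obj), function(nm) as.numeric(object.size(slot(obj, nm))))
}

# Same example data as above
sp <- SolexaPath(system.file("extdata", package = "ShortRead"))
aln <- readAligned(sp, "s_2_export.txt")
aln_small <- aln[1:2]

sizes_full <- slot_sizes(aln)
sizes_small <- slot_sizes(aln_small)

# Side-by-side comparison, plus the per-slot ratio small/full
cbind(full = sizes_full, small = sizes_small,
      ratio = round(sizes_small / sizes_full, 3))
```

Slots whose ratio stays close to 1 would be the ones to look at more closely.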

read2Dir <- "data/solexa/110317_SN367_0148_A81NVUABXX/Data/Intensities/BaseCalls/GERALD_24-03-2011_solexa.2"
my_reads <- readAligned(read2Dir, pattern="s_1_export.txt", type="SolexaExport")    
my_reads_verysmall <- my_reads[1:100]

length(my_reads)
# [1] 17894091
length(my_reads_verysmall)
# [1] 100

object.size(my_reads)
# 3190125528 bytes
object.size(my_reads_verysmall)
# 1753653496 bytes

as.numeric(object.size(my_reads_verysmall)) / as.numeric(object.size(my_reads))
# [1] 0.549713

sessionInfo()

R version 2.13.0 (2011-04-13)
Platform: i386-apple-darwin9.8.0/i386 (32-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ShortRead_1.10.0    Rsamtools_1.4.1     lattice_0.19-26     Biostrings_2.20.0  
[5] GenomicRanges_1.4.3 IRanges_1.10.0     

loaded via a namespace (and not attached):
[1] Biobase_2.12.1 grid_2.13.0    hwriter_1.3