[Bioc-sig-seq] understanding memory size of AlignedRead objects

Wed May 11 02:16:42 CEST 2011

Hi again,

I've been looking into this question a bit more, now using compact(). Things seemed to go well with the example dataset provided with ShortRead (as far as I can tell). But when I use our own much larger dataset the compaction factor is not very good (my small object is the first 100 reads, the full set is ~18 million reads, but the small object is 40% the size in memory of the full object).    I think it looks like quality and sreads haven't really been compacted nearly as much as I would expect. The code is below.  

Martin - if you would like access to this dataset I can email you about that privately, but I'm guessing the same would happen with any big dataset?

Janet

library(ShortRead)

########## a big dataset (our data)

read2Dir <- "solexa/110317_SN367_0148_A81NVUABXX/Data/Intensities/BaseCalls/GERALD_24-03-2011_solexa.2"

my_reads <- readAligned(read2Dir, pattern="s_1_export.txt", type="SolexaExport")    
my_reads_verysmall <- my_reads[1:100]
my_reads_verysmall_compact <- compact(my_reads[1:100])

length(my_reads)
# [1] 17894091
length(my_reads_verysmall)
# [1] 100
length(my_reads_verysmall_compact)
# [1] 100

object.size(my_reads)
# 3190125528 bytes
object.size(my_reads_verysmall)
# 1753653496 bytes
object.size(my_reads_verysmall_compact)
# 1253663384 bytes

as.numeric(object.size(my_reads_verysmall)) / as.numeric(object.size(my_reads))
# [1] 0.549713

as.numeric(object.size(my_reads_verysmall_compact)) / as.numeric(object.size(my_reads))
# [1] 0.3929825

##### where is most of that memory?
object.size(position(my_reads_verysmall_compact))
# 440 bytes
object.size(chromosome(my_reads_verysmall_compact))
# 3040 bytes
object.size(sread(my_reads_verysmall_compact))
# 626820976 bytes
object.size(quality(my_reads_verysmall_compact))
# 626821568 bytes

myDNA2 <-  DNAStringSet(as.character(sread(my_reads_verysmall_compact)))
object.size(myDNA2)
# 9944 bytes

###### and ShortRead's example dataset, 1000 reads

exptPath <- system.file("extdata", package = "ShortRead")
sp <- SolexaPath(exptPath)
aln <- readAligned(sp, "s_2_export.txt")

aln  ## aln has 1000 reads
aln_small <- aln[1:2]   ### aln 2 has 2 reads
aln_small_compact <- compact(aln[1:2])

object.size(aln)
# 175968 bytes
object.size(aln_small)
# 91488 bytes
object.size(aln_small_compact)
# 21744 bytes

as.numeric(object.size(aln_small)) / as.numeric(object.size(aln))
# [1] 0.5199127

as.numeric(object.size(aln_small_compact)) / as.numeric(object.size(aln))
# [1] 0.1235679

myDNA <- DNAStringSet( as.character(sread(aln_small_compact)) )
identical(myDNA,sread(aln_small_compact))
[1] TRUE

aln_medium_compact <- compact(aln[1:300])
myDNA_medium <- DNAStringSet( as.character(sread(aln_medium_compact)) )
identical(myDNA,sread(aln_medium_compact))
# [1] FALSE

aln_10_compact <- compact(aln[1:10])
myDNA_10 <- DNAStringSet( as.character(sread(aln_10_compact)) )
identical(myDNA,sread(aln_10_compact))
# [1] FALSE

as.numeric(object.size(aln_10_compact)) / as.numeric(object.size(aln))
# [1] 0.1317512

#########

sessionInfo()

R version 2.13.0 (2011-04-13)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ShortRead_1.10.0    Rsamtools_1.4.0     lattice_0.19-23    
[4] Biostrings_2.20.0   GenomicRanges_1.4.0 IRanges_1.10.0     

loaded via a namespace (and not attached):
[1] Biobase_2.12.0 grid_2.13.0    hwriter_1.3   

On May 10, 2011, at 4:25 PM, Martin Morgan wrote:

> On 05/10/2011 02:47 PM, Janet Young wrote:
>> Hi, (probably hello to you, Martin)
> 
> Hi (or is that hello?) Janet --
> 
>> I'm looking at some Illumina seq data, and trying to be more rigorous
>> than I have been in the past about memory usage and tidying up unused
>> variables. I'm a little mystified by something - I wonder if you can
>> help me understand?
>> 
>> I'm starting with a big AlignedRead object (one full lane of seq
>> data) and then I've been using [] on AlignedRead objects to take
>> various subsets of the data (and then looking at quality scores, map
>> positions, etc).   I'm also taking some very small subsets (e.g. just
>> the first 100 reads) to test and optimize some functions I'm
>> writing.
>> 
>> My confusion comes because even though I'm cutting down the number of
>> seq reads by a lot (e.g. from 18 million to just 100 reads), the new
>> AlignedRead object still takes up a lot of memory.
> 
> XStringSet (including DNAStringSet and BStringSet, which are used to hold AlignedRead sequence and quality information) are actually 'views' (start and end coordinates) on an underlying XString; when you subset a DNAStringSet, you subset the view but not the underlying DNAString. This might sound bad, but actually you have just one instance of the DNAString, and two light-weight views (the full view, and the subset view). If you're done with the full DNAString, you can force it to be compacated with compact(), e.g., after your first example
> 
> > object.size(s1)
> 50840 bytes
> > object.size(s1[1:10]) # mostly smaller 'view'
> 38984 bytes
> > object.size(s1[1]) # a little smaller 'view', same big DNAString
> 38864 bytes
> > object.size(compact(s1[1]))
> 3912 bytes
> 
> or
> 
> > object.size(aln)
> 175968 bytes
> > object.size(aln[1])
> 91488 bytes
> > object.size(compact(aln[1]))
> 21584 bytes
> 
> (I think Herve is working on your weighted coverage question)
> 
> Martin
> 
>> 
>> Two examples are given below - in both cases the small object takes
>> about half as much memory as the original, even though the number of
>> reads is now very much smaller.
>> 
>> Do you have any suggestions as to how I might reduce the memory
>> footprint of the subsetted AlignedRead object?  Is this an expected
>> behavior?
>> 
>> thanks very much,
>> 
>> Janet
>> 
>> 
>> library(ShortRead)
>> 
>> exptPath<- system.file("extdata", package = "ShortRead") sp<-
>> SolexaPath(exptPath) aln<- readAligned(sp, "s_2_export.txt")
>> 
>> aln  ## aln has 1000 reads aln_small<- aln[1:2]   ### aln 2 has 2
>> reads
>> 
>> object.size(aln) # 165156 bytes object.size(aln_small) # 82220 bytes
>> 
>> as.numeric(object.size(aln_small)) / as.numeric(object.size(aln))
>> #### [1] 0.4978324
>> 
>> read2Dir<-
>> "data/solexa/110317_SN367_0148_A81NVUABXX/Data/Intensities/BaseCalls/GERALD_24-03-2011_solexa.2"
>> 
>> 
> my_reads<- readAligned(read2Dir, pattern="s_1_export.txt", type="SolexaExport")
>> my_reads_verysmall<- my_reads[1:100]
>> 
>> length(my_reads) # [1] 17894091 length(my_reads_verysmall) # [1] 100
>> 
>> object.size(my_reads) # 3190125528 bytes
>> object.size(my_reads_verysmall) # 1753653496 bytes
>> 
>> as.numeric(object.size(my_reads_verysmall)) /
>> as.numeric(object.size(my_reads)) # [1] 0.549713
>> 
>> 
>> 
>> sessionInfo()
>> 
>> R version 2.13.0 (2011-04-13) Platform: i386-apple-darwin9.8.0/i386
>> (32-bit)
>> 
>> locale: [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>> 
>> attached base packages: [1] stats     graphics  grDevices utils
>> datasets  methods   base
>> 
>> other attached packages: [1] ShortRead_1.10.0    Rsamtools_1.4.1
>> lattice_0.19-26     Biostrings_2.20.0 [5] GenomicRanges_1.4.3
>> IRanges_1.10.0
>> 
>> loaded via a namespace (and not attached): [1] Biobase_2.12.1
>> grid_2.13.0    hwriter_1.3
>> 
>> _______________________________________________ Bioc-sig-sequencing
>> mailing list Bioc-sig-sequencing at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
> 
> 
> -- 
> Computational Biology
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
> 
> Location: M1-B861
> Telephone: 206 667-2793