[Bioc-sig-seq] understanding memory size of AlignedRead objects

Hervé Pagès hpages at fhcrc.org
Thu May 12 00:11:24 CEST 2011


Hi Janet,

You might have found an issue with the "compact" method for AlignedRead
objects (not sure, I'll look more into this), however I have a long
story for you about why you might not even need to use compact().

First thing to understand is that seeing a big "object size" is not
necessarily a problem. This is because object.size() is not doing a
very good job on objects that contain external pointers (e.g.
DNAString, DNAStringSet) or environments.

Here is an example:

   library(BSgenome.Celegans.UCSC.ce2)
   chrI <- Celegans$chrI

Then:

   > chrI
     15080483-letter "DNAString" instance
   seq: 
GCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAA...TTAGGCTTAGGCTTAGGCTTAGGTTTAGGCTTAGGC

   > object.size(chrI)
   15082800 bytes

   > gc()
             used (Mb) gc trigger (Mb) max used (Mb)
   Ncells 1129166 60.4    1590760 85.0  1368491 73.1
   Vcells 2386831 18.3    2968425 22.7  2397912 18.3

   > object.size(list(chrI, chrI))
   30165656 bytes

   > x <- rep.int(list(chrI), 100000)

   > object.size(x)
   1508280800040 bytes

Hey, how is this possible? I don't have that amount of memory on my
laptop!

Also:

   > gc()
             used (Mb) gc trigger (Mb) max used (Mb)
   Ncells 1129383 60.4    1710298 91.4  1590760 85.0
   Vcells 2486879 19.0    3196846 24.4  2948256 22.5

Clearly object.size() is not telling the truth.

What happens is that, strictly speaking, a DNAString object doesn't
"contain" the sequence data, but it contains a pointer to a memory
location where the sequence data is stored. This allows "sharing"
the sequence data by several objects. For example, other DNAString
instances can point to the same data as chrI:

   > subject <- chrI

   > subject at shared
   SharedRaw of length 15080483 (data starting at address 0x7fd30293e038)

   > chrI at shared
   SharedRaw of length 15080483 (data starting at address 0x7fd30293e038)

   > exon <- subseq(chrI, start=1001, width=200)

   > exon
     200-letter "DNAString" instance
   seq: 
TTTTTCGGGTTTTTTGAAATGAATATCGTAGCTACA...CAAAATTTTTGGAAATTAGTTTAAAAATCTCACATT

   > exon at shared
   SharedRaw of length 15080483 (data starting at address 0x7fd30293e038)

Note that even if 'exon' is only "looking" at a small portion of the
shared data, it's still pointing to the same memory location (i.e. same
starting address). The bookkeeping of what part of the whole shared data
needs to be looked at exactly is handled via the "offset" and "length"
slots:

   > str(chrI)
   Formal class 'DNAString' [package "Biostrings"] with 5 slots
     ..@ shared         :Formal class 'SharedRaw' [package "IRanges"] 
with 2 slots
     .. .. ..@ xp                    :<externalptr>
     .. .. ..@ .link_to_cached_object:<environment: 0x5da1de8>
     ..@ offset         : int 0
     ..@ length         : int 15080483
     ..@ elementMetadata: NULL
     ..@ metadata       : list()

   > str(exon)
   Formal class 'DNAString' [package "Biostrings"] with 5 slots
     ..@ shared         :Formal class 'SharedRaw' [package "IRanges"] 
with 2 slots
     .. .. ..@ xp                    :<externalptr>
     .. .. ..@ .link_to_cached_object:<environment: 0x5da1de8>
     ..@ offset         : int 1000
     ..@ length         : int 200
     ..@ elementMetadata: NULL
     ..@ metadata       : list()

The sequence data can also be shared by other types of objects. For example:

   > rna <- RNAString(exon)

   > rna
     200-letter "RNAString" instance
   seq: 
UUUUUCGGGUUUUUUGAAAUGAAUAUCGUAGCUACA...CAAAAUUUUUGGAAAUUAGUUUAAAAAUCUCACAUU

   > rna at shared
   SharedRaw of length 15080483 (data starting at address 0x7fd30293e038)

   > m <- matchPattern("UA", rna)

   > m
     Views on a 200-letter RNAString subject
   subject: 
UUUUUCGGGUUUUUUGAAAUGAAUAUCGUAGCUA...AAAUUUUUGGAAAUUAGUUUAAAAAUCUCACAUU
   views:
       start end width
   [1]    24  25     2 [UA]
   [2]    29  30     2 [UA]
   [3]    33  34     2 [UA]
   [4]   151 152     2 [UA]
   [5]   154 155     2 [UA]
   [6]   181 182     2 [UA]
   [7]   186 187     2 [UA]

   > subject(m)@shared
   SharedRaw of length 15080483 (data starting at address 0x7fd30293e038)

   > m[[4]]
     2-letter "RNAString" instance
   seq: UA

   > m[[4]]@shared
   SharedRaw of length 15080483 (data starting at address 0x7fd30293e038)

   > dnaset <- DNAStringSet(m)

   > dnaset
     A DNAStringSet instance of length 7
       width seq
   [1]     2 TA
   [2]     2 TA
   [3]     2 TA
   [4]     2 TA
   [5]     2 TA
   [6]     2 TA
   [7]     2 TA

   > dnaset at pool
   SharedRaw_Pool of length 1
   1: SharedRaw of length 15080483 (data starting at address 0x7fd30293e038)

As you can see, all these operations happen without copying the original
sequence data (chromosome I). Hence they are very fast, and, most of the
times (but not always, more on this below) very memory efficient too,
despite what's reported by object.size(). If the original sequence data
needs to be modified, then it will be copied before modification:

   > subseq(rna, start=7, width=3) <- RNAString("AA")

   > rna
     199-letter "RNAString" instance
   seq: 
UUUUUCAAUUUUUUGAAAUGAAUAUCGUAGCUACAG...CAAAAUUUUUGGAAAUUAGUUUAAAAAUCUCACAUU

   > rna at shared
   SharedRaw of length 199 (data starting at address 0x33a3b58)

Concatenation too will trigger a copy:

   > uaua <- c(m[[4]], m[[1]])

   > uaua
     4-letter "RNAString" instance
   seq: UAUA

   > uaua at shared
   SharedRaw of length 4 (data starting at address 0x53a1120)

For DNAStringSet objects (and XStringSet objects in general) the
mechanism used for sharing sequence data is a little bit more
complicated than for XString objects: instead of a single pointer,
they use a list of pointers (stored in the "pool" slot) so the
sequence data can be fragmented into several memory chunks (each
chunk can be shared with other objects). For example:

   > dnaset2 <- c(DNAStringSet(rna), DNAStringSet(uaua))

   > dnaset2
     A DNAStringSet instance of length 2
       width seq
   [1]   199 
TTTTTCAATTTTTTGAAATGAATATCGTAGCTAC...AATTTTTGGAAATTAGTTTAAAAATCTCACATT
   [2]     4 TATA

   > dnaset2 at pool
   SharedRaw_Pool of length 2
   1: SharedRaw of length 199 (data starting at address 0x33a3b58)
   2: SharedRaw of length 4 (data starting at address 0x53a1120)

To summarize, all the XString, XStringSet, XStringViews, MaskedXString
objects use SharedRaw objects internally to store the sequence data
(the SharedRaw class is an internal class defined in IRanges).
You can think of those SharedRaw objects as the basic storage units
for storing sequence data i.e. each of them is a chunk of memory
containing sequence data. It can be shared (i.e. more than 1 object
is "using it") or not (only 1 object is "using it"). If no object
is using it, then it's a candidate for garbbage collection.

For example, at this point, the SharedRaw object containing
chromosome I is used by the following objects: 'chrI', 'x',
'subject', 'exon', 'm' and 'dnaset'. So all of them would need
to be removed (with rm()) for the SharedRaw object to be garbbage
collected.

Note that IRanges/Biostrings don't contain any code for handling
this, it just relies on the standard behaviour of external pointers
in R (a SharedRaw object is basically an R external pointer pointing
to a raw vector) and how the garbbage collector handles them.

OK, now back to your original question. When you subset your
AlignedRead object, the "small" AlignedRead object is still
pointing to the original sequence data:

   > aln0
   class: AlignedRead
   length: 1000 reads; width: 35 cycles
   chromosome: NM NM ... chr5.fa 29:255:255
   position: NA NA ... 71805980 NA
   strand: NA NA ... + NA
   alignQuality: NumericQuality
   alignData varLabels: run lane ... filtering contig

   > sread(aln0)@pool
   SharedRaw_Pool of length 1
   1: SharedRaw of length 35000 (data starting at address 0x810f418)

   > sread(aln0[1:10])@pool
   SharedRaw_Pool of length 1
   1: SharedRaw of length 35000 (data starting at address 0x810f418)

Hence the strange result reported by object.size():

   > object.size(aln0)
   175968 bytes

   > object.size(aln0[1:10])
   92480 bytes

'aln0[1:10]' has a smaller size, but not as small as one would expect.

Does it matter? It depends.

If you want to save the small object to a file (serialization), then
yes (the whole SharedRaw object will be saved, which is not good).
In that case, you want to "compact" the object first. compact() was
specifically introduced for this use case: it creates a "full" copy
of the original object ("full" here means that even the sequence data
are copied), but only the sequence data that are effectively "being
looked at" are copied. The resulting object is semantically equivalent
to the original but its internal representation and size are different:

   > dnaset
     A DNAStringSet instance of length 7
       width seq
   [1]     2 TA
   [2]     2 TA
   [3]     2 TA
   [4]     2 TA
   [5]     2 TA
   [6]     2 TA
   [7]     2 TA

   > dnaset at pool
   SharedRaw_Pool of length 1
   1: SharedRaw of length 15080483 (data starting at address 0x7fd30293e038)

   > object.size(dnaset)
   15084424 bytes

   > compact(dnaset)
     A DNAStringSet instance of length 7
       width seq
   [1]     2 TA
   [2]     2 TA
   [3]     2 TA
   [4]     2 TA
   [5]     2 TA
   [6]     2 TA
   [7]     2 TA

   > compact(dnaset)@pool
   SharedRaw_Pool of length 1
   1: SharedRaw of length 14 (data starting at address 0x80ed740)

   > object.size(compact(dnaset))
   3952 bytes

Note that, in this particular situation, compact() could take
advantage of the fact that the elements in 'dnaset' are repeated
(from a sequence content point of view, not from a memory location
point of view) to achieve greater size reduction... but that's
another story.

Also if you work one chromosome at a time in a loop, and if the
objects you create within the loop are pointing to the chromosome
sequence, and if those references are going to persist after you
are done with the chromosome, then yes, you want to compact those
objects. By compacting them you remove the references to the
chromosome sequence and therefore allow it to be garbbage collected.
Otherwise you will end up with all the chromosome sequences loaded
into memory!

For example, you wouldn't want to do something like this:

   lapply(seqnames(Celegans),
          function(seqname)
              matchPattern("CAACTTAC", Celegans[[seqname]]))

because the result would be a list of XStringViews objects, one
per chromosome, each of them pointing to the chromosome sequence.
That will work for Celegans but for Hsapiens you will end up with
all the chromosomes in memory!

For your use case, the benefit of using compact() on the subsetted
AlignedRead objects depends on what you will do with it. If those
small objects are just temporary (e.g. you will do coverage() on
them, and you won't care about the subsetted AlignedRead objects
anymore), then I would recommend that you do not use compact(): it
will slow down things and you will end up using more memory than
you need (because compact() creates a copy). But if you want to keep
them around, but also want to be able to remove the big AlignedRead
object in order to save memory, then you might want to compact them.

One thing to remember is that object.size() should tell the truth
on a compacted object. And also summing the individual object.size()
for a collection of compacted objects tells the truth about the
memory effectively used by the collection. But summing the
individual object.size() in general does not tell the truth:
it might overestimate the amount of memory effectively used,
because it might count the memory used by a SharedRaw objects
several times (as many times as the SharedRaw object is used).

Hope the story was helpful and not too long ;-)

H.


On 11-05-10 05:16 PM, Janet Young wrote:
> Hi again,
>
> I've been looking into this question a bit more, now using compact(). Things seemed to go well with the example dataset provided with ShortRead (as far as I can tell). But when I use our own much larger dataset the compaction factor is not very good (my small object is the first 100 reads, the full set is ~18 million reads, but the small object is 40% the size in memory of the full object).    I think it looks like quality and sreads haven't really been compacted nearly as much as I would expect. The code is below.


>
> Martin - if you would like access to this dataset I can email you about that privately, but I'm guessing the same would happen with any big dataset?
>
> Janet
>
>
>
> library(ShortRead)
>
> ########## a big dataset (our data)
>
> read2Dir<- "solexa/110317_SN367_0148_A81NVUABXX/Data/Intensities/BaseCalls/GERALD_24-03-2011_solexa.2"
>
> my_reads<- readAligned(read2Dir, pattern="s_1_export.txt", type="SolexaExport")
> my_reads_verysmall<- my_reads[1:100]
> my_reads_verysmall_compact<- compact(my_reads[1:100])
>
> length(my_reads)
> # [1] 17894091
> length(my_reads_verysmall)
> # [1] 100
> length(my_reads_verysmall_compact)
> # [1] 100
>
> object.size(my_reads)
> # 3190125528 bytes
> object.size(my_reads_verysmall)
> # 1753653496 bytes
> object.size(my_reads_verysmall_compact)
> # 1253663384 bytes
>
> as.numeric(object.size(my_reads_verysmall)) / as.numeric(object.size(my_reads))
> # [1] 0.549713
>
> as.numeric(object.size(my_reads_verysmall_compact)) / as.numeric(object.size(my_reads))
> # [1] 0.3929825
>
> ##### where is most of that memory?
> object.size(position(my_reads_verysmall_compact))
> # 440 bytes
> object.size(chromosome(my_reads_verysmall_compact))
> # 3040 bytes
> object.size(sread(my_reads_verysmall_compact))
> # 626820976 bytes
> object.size(quality(my_reads_verysmall_compact))
> # 626821568 bytes
>
> myDNA2<-  DNAStringSet(as.character(sread(my_reads_verysmall_compact)))
> object.size(myDNA2)
> # 9944 bytes
>
>
> ###### and ShortRead's example dataset, 1000 reads
>
> exptPath<- system.file("extdata", package = "ShortRead")
> sp<- SolexaPath(exptPath)
> aln<- readAligned(sp, "s_2_export.txt")
>
> aln  ## aln has 1000 reads
> aln_small<- aln[1:2]   ### aln 2 has 2 reads
> aln_small_compact<- compact(aln[1:2])
>
> object.size(aln)
> # 175968 bytes
> object.size(aln_small)
> # 91488 bytes
> object.size(aln_small_compact)
> # 21744 bytes
>
> as.numeric(object.size(aln_small)) / as.numeric(object.size(aln))
> # [1] 0.5199127
>
> as.numeric(object.size(aln_small_compact)) / as.numeric(object.size(aln))
> # [1] 0.1235679
>
> myDNA<- DNAStringSet( as.character(sread(aln_small_compact)) )
> identical(myDNA,sread(aln_small_compact))
> [1] TRUE
>
> aln_medium_compact<- compact(aln[1:300])
> myDNA_medium<- DNAStringSet( as.character(sread(aln_medium_compact)) )
> identical(myDNA,sread(aln_medium_compact))
> # [1] FALSE
>
> aln_10_compact<- compact(aln[1:10])
> myDNA_10<- DNAStringSet( as.character(sread(aln_10_compact)) )
> identical(myDNA,sread(aln_10_compact))
> # [1] FALSE
>
> as.numeric(object.size(aln_10_compact)) / as.numeric(object.size(aln))
> # [1] 0.1317512
>
> #########
>
>
> sessionInfo()
>
> R version 2.13.0 (2011-04-13)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>   [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] ShortRead_1.10.0    Rsamtools_1.4.0     lattice_0.19-23
> [4] Biostrings_2.20.0   GenomicRanges_1.4.0 IRanges_1.10.0
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.12.0 grid_2.13.0    hwriter_1.3
>
>
>
>
> On May 10, 2011, at 4:25 PM, Martin Morgan wrote:
>
>> On 05/10/2011 02:47 PM, Janet Young wrote:
>>> Hi, (probably hello to you, Martin)
>>
>> Hi (or is that hello?) Janet --
>>
>>> I'm looking at some Illumina seq data, and trying to be more rigorous
>>> than I have been in the past about memory usage and tidying up unused
>>> variables. I'm a little mystified by something - I wonder if you can
>>> help me understand?
>>>
>>> I'm starting with a big AlignedRead object (one full lane of seq
>>> data) and then I've been using [] on AlignedRead objects to take
>>> various subsets of the data (and then looking at quality scores, map
>>> positions, etc).   I'm also taking some very small subsets (e.g. just
>>> the first 100 reads) to test and optimize some functions I'm
>>> writing.
>>>
>>> My confusion comes because even though I'm cutting down the number of
>>> seq reads by a lot (e.g. from 18 million to just 100 reads), the new
>>> AlignedRead object still takes up a lot of memory.
>>
>> XStringSet (including DNAStringSet and BStringSet, which are used to hold AlignedRead sequence and quality information) are actually 'views' (start and end coordinates) on an underlying XString; when you subset a DNAStringSet, you subset the view but not the underlying DNAString. This might sound bad, but actually you have just one instance of the DNAString, and two light-weight views (the full view, and the subset view). If you're done with the full DNAString, you can force it to be compacated with compact(), e.g., after your first example
>>
>>> object.size(s1)
>> 50840 bytes
>>> object.size(s1[1:10]) # mostly smaller 'view'
>> 38984 bytes
>>> object.size(s1[1]) # a little smaller 'view', same big DNAString
>> 38864 bytes
>>> object.size(compact(s1[1]))
>> 3912 bytes
>>
>> or
>>
>>> object.size(aln)
>> 175968 bytes
>>> object.size(aln[1])
>> 91488 bytes
>>> object.size(compact(aln[1]))
>> 21584 bytes
>>
>> (I think Herve is working on your weighted coverage question)
>>
>> Martin
>>
>>>
>>> Two examples are given below - in both cases the small object takes
>>> about half as much memory as the original, even though the number of
>>> reads is now very much smaller.
>>>
>>> Do you have any suggestions as to how I might reduce the memory
>>> footprint of the subsetted AlignedRead object?  Is this an expected
>>> behavior?
>>>
>>> thanks very much,
>>>
>>> Janet
>>>
>>>
>>> library(ShortRead)
>>>
>>> exptPath<- system.file("extdata", package = "ShortRead") sp<-
>>> SolexaPath(exptPath) aln<- readAligned(sp, "s_2_export.txt")
>>>
>>> aln  ## aln has 1000 reads aln_small<- aln[1:2]   ### aln 2 has 2
>>> reads
>>>
>>> object.size(aln) # 165156 bytes object.size(aln_small) # 82220 bytes
>>>
>>> as.numeric(object.size(aln_small)) / as.numeric(object.size(aln))
>>> #### [1] 0.4978324
>>>
>>> read2Dir<-
>>> "data/solexa/110317_SN367_0148_A81NVUABXX/Data/Intensities/BaseCalls/GERALD_24-03-2011_solexa.2"
>>>
>>>
>> my_reads<- readAligned(read2Dir, pattern="s_1_export.txt", type="SolexaExport")
>>> my_reads_verysmall<- my_reads[1:100]
>>>
>>> length(my_reads) # [1] 17894091 length(my_reads_verysmall) # [1] 100
>>>
>>> object.size(my_reads) # 3190125528 bytes
>>> object.size(my_reads_verysmall) # 1753653496 bytes
>>>
>>> as.numeric(object.size(my_reads_verysmall)) /
>>> as.numeric(object.size(my_reads)) # [1] 0.549713
>>>
>>>
>>>
>>> sessionInfo()
>>>
>>> R version 2.13.0 (2011-04-13) Platform: i386-apple-darwin9.8.0/i386
>>> (32-bit)
>>>
>>> locale: [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>
>>> attached base packages: [1] stats     graphics  grDevices utils
>>> datasets  methods   base
>>>
>>> other attached packages: [1] ShortRead_1.10.0    Rsamtools_1.4.1
>>> lattice_0.19-26     Biostrings_2.20.0 [5] GenomicRanges_1.4.3
>>> IRanges_1.10.0
>>>
>>> loaded via a namespace (and not attached): [1] Biobase_2.12.1
>>> grid_2.13.0    hwriter_1.3
>>>
>>> _______________________________________________ Bioc-sig-sequencing
>>> mailing list Bioc-sig-sequencing at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>>
>> --
>> Computational Biology
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>>
>> Location: M1-B861
>> Telephone: 206 667-2793
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioc-sig-sequencing mailing list