[BioC] BSgenome and R memory use

Thu Jul 5 01:34:58 CEST 2007

Hi Paul, Martin,

Martin Morgan wrote:
> Hi Paul --
> 
> See class?BSgenome
> 
> I think what happens is that
> 
>> use.chromo <- Mmusculus[[chr.search[j]]]
> 
> causes the data to be loaded, and a 'view' to be created.

To be more precise, use.chromo <- Mmusculus[[chr.search[j]]] creates a new reference
to the sequence data. (You could call this a "view" too, but that might be confusing
because Biostrings already uses the term "view" for something slightly different.)
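
As a rough analogy in plain R (only an illustration of reference semantics, not a
description of how Biostrings stores sequences), environments behave in a similar way:
assigning one to a new name copies nothing, and removing one name does not release the
data while another name still points to it. The names holder, bases and x below are
made up for this sketch.

```r
# Environments are reference objects in R, so they make a convenient stand-in
# for "an object that only points to the data". All names here are hypothetical.
holder <- new.env()
holder$bases <- paste(rep("ACGT", 1000), collapse = "")  # the "sequence data"

x <- holder                           # a second reference; the data is not copied
same_object <- identical(x, holder)   # both names point to the same environment

rm(x)                                 # drops one reference only
still_there <- exists("bases", envir = holder, inherits = FALSE)
seq_length  <- nchar(holder$bases)    # the data is still reachable through holder

# Only once the last reference is gone can the garbage collector reclaim the data:
# rm(holder); gc()
```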

To illustrate this, here is what happens to the chr1 sequence data during a typical
workflow (#ref_to_chr1 is the number of references to the memory address of this sequence,
i.e. the number of existing objects in your current session that point to this sequence):

  > library(BSgenome.Mmusculus.UCSC.mm8)      # This doesn't load the chromosome data
                                              # into memory -> #ref_to_chr1 = 0

  > gc()["Vcells", "(Mb)"]                    #
  [1] 1.6                                     # We start with only 1.6 Mb of data in memory

  > Mmusculus$chr1                            # Loads chr1 seq into memory (hence takes
                                              # a long time) + creates a reference to it
                                              # -> #ref_to_chr1 = 1

  > gc()["Vcells", "(Mb)"]                    #
  [1] 189.6                                   # 190 Mb of data in memory!

  > Mmusculus$chr1                            # Doesn't do anything -> #ref_to_chr1 = 1

  > x <- Mmusculus$chr1                       # This is very fast because a BString object
                                              # doesn't contain the sequence data, only
                                              # a pointer to the sequence data, hence
                                              # chr1 seq is not duplicated in memory.
                                              # But we now have 2 BString objects pointing
                                              # to the same place in memory -> #ref_to_chr1 = 2

  > y <- substr(x, 10, 100)                   # y is a new object that points into the
                                              # same sequence data -> #ref_to_chr1 = 3

You must remove all references to chr1 seq if you want the 190 Mb of memory used by this
seq to be freed (it can be hard to keep track of all the references to a given sequence).
IMPORTANT: The 1st reference to chr1 seq should be removed last. This is achieved with unload().
All other references are removed by just removing the referencing object.

  > rm(x)                                     # -> #ref_to_chr1 = 2
  > rm(y)                                     # -> #ref_to_chr1 = 1
  > unload(Mmusculus, "chr1")                 # -> #ref_to_chr1 = 0

  > gc()["Vcells", "(Mb)"]
  [1] 1.6
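
Applied to Paul's loop, the same pattern could look like the sketch below. This is only
a sketch under assumptions: process_chromosomes, work and unload_fun are hypothetical
names introduced here, and the only BSgenome calls assumed are the ones already shown in
this thread (genome[[name]] and unload(genome, name)). The function passed as work should
return an ordinary R value (a number, a character vector, ...) and not an object that
still points to the sequence, otherwise a reference survives the loop.

```r
# Hypothetical helper: visit each chromosome, compute something, then release it.
# 'unload_fun' is passed in as an argument so the sketch does not depend on any
# particular package version; with BSgenome loaded you would pass 'unload'.
process_chromosomes <- function(genome, chr_names, work, unload_fun) {
  results <- vector("list", length(chr_names))
  names(results) <- chr_names
  for (nm in chr_names) {
    chromo <- genome[[nm]]               # loads the sequence and creates a reference
    results[nm] <- list(work(chromo))    # keep only the computed value
    rm(chromo)                           # remove our own reference first ...
    unload_fun(genome, nm)               # ... then remove the first reference last
  }
  results
}
```

With the package attached, a call might look like
process_chromosomes(Mmusculus, chr.search, work = my_summary, unload_fun = unload),
where my_summary is whatever function you want to apply to each chromosome.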

Hope this helps.

> 
>> rm(use.chromo)
> 
> removes the view, but does not unload the data. So you'll need to also
> 
>> unload(Mmusculus, chr.search[j])
> 
> I've found these packages very useful, thanks Herve!

I'm glad you like them. Thanks!

Cheers,
H.

> 
> Martin
> 
> "Paul Leo" <p.leo at uq.edu.au> writes:
> 
>> I have a bit of a problem with R running out of memory with BSgenome. I
>> have distilled it down to the bare bones. Basically I am just calling up
>> different mouse chromosomes and putting them into an object
>> (use.chromo). I then immediately remove it with the simplistic idea that
>> this will free up the space that this object required. I always use the
>> same object and I do nothing with it. 
>>
>> The memory is rapidly depleted. I would love to know what tricks are out
>> there for cleaning up after removed objects. And in general what the
>> origin of this behavior is... and ideas on how to avoid it.
>>
>> Until the loop below is started I have enough memory to load any single
>> mouse chromosome.
>>
>> Thanks 
>> Paul
>>
>>  
>> ### set up the test
>> library(BSgenome.Mmusculus.UCSC.mm8)
>> chromos<-c(1:19,"X","Y")
>> chr.search<-paste("chr",chromos,sep="")
>> #> chr.search
>> # [1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"
>> #     "chr9"  "chr10" "chr11" "chr12" "chr13"
>> #[14] "chr14" "chr15" "chr16" "chr17" "chr18" "chr19" "chrX"  "chrY"
>>
>> ##### run the test
>> k <- 0
>> for (i in 1:10) {
>>   for (j in seq_along(chr.search)) {
>>     use.chromo <- Mmusculus[[chr.search[j]]]
>>     rm(use.chromo)
>>     k <- k + 1   # k is typically between 6 and 8 when this fails
>>   }
>> }
>> Error: cannot allocate vector of size 138.4 Mb
>>
>> ## Note: same behavior with R 2.5 and an earlier version of BSgenome.
>> ## I am using the standard memory limit for Windows (1.5GB); I don't
>> ## think increasing this will help much.
>>
>> If you replace
>> use.chromo <- Mmusculus[[chr.search[j]]]
>> in the above loop with
>> p <- getSeq(Mmusculus, chr.search[j], 100, 1000)
>> a similar failure occurs.
>>
>>
>> sessionInfo()
>> R version 2.6.0 Under development (unstable) (2007-06-26 r42066) 
>> i386-pc-mingw32 
>>
>> locale:
>> LC_COLLATE=English_Australia.1252;LC_CTYPE=English_Australia.1252;
>> LC_MONETARY=English_Australia.1252;LC_NUMERIC=C;LC_TIME=English_Australia.1252
>>
>> attached base packages:
>> [1] tools     stats     graphics  grDevices utils     datasets  methods
>> [8] base
>>
>> other attached packages:
>> [1] BSgenome.Mmusculus.UCSC.mm8_1.3.0 BSgenome_1.5.0                   
>> [3] Biobase_1.15.17                   Biostrings_2.5.11                
>>