[BioC] Cleaning up after getSeq(BSgenome, GRanges)

Hervé Pagès hpages at fhcrc.org
Sat Jun 30 09:42:02 CEST 2012


Hi Steve,

The intention was that the DNAStringSet object returned by getSeq()
would hold no reference to the chromosomes that getSeq() loads into
the cache during the extraction, so that everything gets uncached
automatically at the first gc() after getSeq() returns.
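
To illustrate why a lingering reference matters, here is a toy model
in base R. It is only a sketch and not how BSgenome implements its
cache: the environment toy_chr, the finalizer report_uncaching() and
the events vector are all made up for the example. The finalizer can
only run once nothing points at the object any more:

```r
# Toy model in base R; this is NOT BSgenome's actual cache implementation.
events <- character(0)

# Hypothetical finalizer: records an "uncaching" event when the object dies.
report_uncaching <- function(e) {
    events <<- c(events, paste("uncaching", e$name))
}

toy_chr <- new.env()
toy_chr$name <- "chrToy"
toy_chr$bytes <- raw(1e6)          # stand-in for the sequence data
reg.finalizer(toy_chr, report_uncaching)

lingering_ref <- toy_chr           # e.g. a result that still points at the data
rm(toy_chr)
invisible(gc())                    # object is still reachable via lingering_ref

rm(lingering_ref)
invisible(gc())                    # no reference left, so the finalizer can run
events
```

As long as lingering_ref exists, gc() has nothing to reclaim, which is
the situation the returned DNAStringSet was accidentally creating.
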
Unfortunately this behavior was broken by an issue in a low-level
helper in IRanges (the "xvcopy" method for XRawList objects, to be
precise). The problem is fixed in IRanges 1.15.16 (I'll apply the
fix to release too):

 > library(BSgenome.Hsapiens.UCSC.hg19)

 > gc()
           used (Mb) gc trigger (Mb) max used (Mb)
Ncells 1265019 67.6    1710298 91.4  1476915 78.9
Vcells  585626  4.5    1162592  8.9   901241  6.9

 > options(verbose=TRUE)  # so uncaching events will be reported

## Extracting the first 10 nucleotides from each chromosome:
 > first10 <- getSeq(Hsapiens, end=10)
uncaching chr1
uncaching chr10
uncaching chr11_gl000202_random
uncaching chr11
uncaching chr12
uncaching chr13
uncaching chr15
uncaching chr14
uncaching chr16
uncaching chr17_gl000203_random
uncaching chr17_gl000206_random
uncaching chr19
uncaching chr19_gl000208_random
uncaching chr18_gl000207_random
uncaching chr18
uncaching chr17_gl000205_random
uncaching chr17_gl000204_random
uncaching chr17_ctg5_hap1
uncaching chr1_gl000192_random
uncaching chr1_gl000191_random
uncaching chr19_gl000209_random
uncaching chr17
uncaching chr2
uncaching chr21_gl000210_random
uncaching chr21
uncaching chr20
uncaching chr22
uncaching chr3
uncaching chr4_gl000193_random
uncaching chr4_ctg9_hap1
uncaching chr4_gl000194_random
uncaching chr4
uncaching chr5
uncaching chr6_cox_hap2
uncaching chr6_dbb_hap3
uncaching chr6_apd_hap1
uncaching chr6_mcf_hap5
uncaching chr6_mann_hap4
uncaching chr6
uncaching chr7
uncaching chr7_gl000195_random
uncaching chr6_ssto_hap7
uncaching chr6_qbl_hap6
uncaching chr8_gl000197_random
uncaching chr8_gl000196_random
uncaching chr8
uncaching chr9_gl000199_random
uncaching chrM
uncaching chrUn_gl000213
uncaching chrUn_gl000214
uncaching chrUn_gl000212
uncaching chrUn_gl000211
uncaching chr9_gl000201_random
uncaching chr9_gl000200_random
uncaching chr9_gl000198_random
uncaching chrUn_gl000217
uncaching chrUn_gl000220
uncaching chrUn_gl000223
uncaching chrUn_gl000227
uncaching chrUn_gl000230
uncaching chrUn_gl000234
uncaching chrUn_gl000238
uncaching chrUn_gl000242
uncaching chrUn_gl000243
uncaching chrUn_gl000241
uncaching chrUn_gl000240
uncaching chrUn_gl000239
uncaching chrUn_gl000237
uncaching chrUn_gl000236
uncaching chrUn_gl000235
uncaching chrUn_gl000233
uncaching chrUn_gl000232
uncaching chrUn_gl000231
uncaching chrUn_gl000229
uncaching chrUn_gl000228
uncaching chrUn_gl000226
uncaching chrUn_gl000225
uncaching chrUn_gl000224
uncaching chrUn_gl000222
uncaching chrUn_gl000221
uncaching chrUn_gl000219
uncaching chrUn_gl000218
uncaching chrUn_gl000216
uncaching chrUn_gl000215
uncaching chrUn_gl000246
uncaching chrUn_gl000249
uncaching chrUn_gl000248
uncaching chrUn_gl000247
uncaching chrUn_gl000245
uncaching chrUn_gl000244
uncaching chrX
uncaching chr9

 > first10
   A DNAStringSet instance of length 93
      width seq
  [1]    10 NNNNNNNNNN
  [2]    10 NNNNNNNNNN
  [3]    10 NNNNNNNNNN
  [4]    10 NNNNNNNNNN
  [5]    10 NNNNNNNNNN
  [6]    10 NNNNNNNNNN
  [7]    10 NNNNNNNNNN
  [8]    10 NNNNNNNNNN
  [9]    10 NNNNNNNNNN
  ...   ... ...
[85]    10 GATCTGAAGA
[86]    10 GATCATGCCT
[87]    10 GATCTTCAGG
[88]    10 GATCTGCGCA
[89]    10 GATCAGATAG
[90]    10 GATCTTAAGC
[91]    10 GATCTAAGTT
[92]    10 GATCTGTCAT
[93]    10 GATCACCAAG

 > ls(Hsapiens@.seqs_cache)
[1] "chrY"

 > gc()
Garbage collection 177 = 120+21+36 (level 2) ...
69.6 Mbytes of cons cells used (66%)
61.8 Mbytes of vectors used (17%)
uncaching chrY
           used (Mb) gc trigger  (Mb) max used  (Mb)
Ncells 1301932 69.6    1967602 105.1  1967602 105.1
Vcells 8094983 61.8   48876866 373.0 58058596 443.0

 > ls(Hsapiens@.seqs_cache)
character(0)

 > gc()
Garbage collection 178 = 120+21+37 (level 2) ...
69.5 Mbytes of cons cells used (66%)
4.6 Mbytes of vectors used (2%)
           used (Mb) gc trigger  (Mb) max used  (Mb)
Ncells 1300073 69.5    1967602 105.1  1967602 105.1
Vcells  600775  4.6   39101492 298.4 58058596 443.0

Memory used is almost the same as before getSeq() was called.
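
If you are stuck on an IRanges version without the fix, the manual
cleanup you proposed should still work. Below is a minimal sketch of
that idea, using a plain environment as a stand-in for the genome's
sequence cache. clear_cache() is a hypothetical helper, not a function
in BSgenome, and the toy "chromosomes" are just raw vectors:

```r
# Stand-in for the sequence cache: a plain environment (assumption).
cache <- new.env(parent = emptyenv())
assign("chr1", raw(1e6), envir = cache)
assign("chr2", raw(1e6), envir = cache)

# Hypothetical helper, not part of BSgenome: drop every cached entry,
# then trigger a garbage collection so the memory can be reclaimed.
clear_cache <- function(env) {
    rm(list = ls(env, all.names = TRUE), envir = env)
    invisible(gc())
}

clear_cache(cache)
ls(cache)   # the cache environment should now be empty
```

With a real genome the equivalent call would presumably be
clear_cache(Hsapiens@.seqs_cache), but .seqs_cache is an internal slot
and may change without notice.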

Thanks for reporting the issue!

H.


On 06/27/2012 10:20 AM, Steve Lianoglou wrote:
> Howdy,
>
> Say I'd like to fetch muchos sequences from hg19 that are defined in a
> GRanges object that spans all hg19 chromosomes.
>
> I can make my life easy and do:
>
> R> library(BSgenome.Hsapiens.UCSC.hg19)
> R> seqs <- getSeq(Hsapiens, my.GRanges)
>
> But while my life has been made easy, life for my CPU has been made
> harder as I (think that I) have now all of the Hsapiens chromosomes
> loaded up into (I think) the Hsapiens@.seqs_cache.
>
> I reckon I can do something like:
>
> R> rm(list=ls(Hsapiens@.seqs_cache), envir=Hsapiens@.seqs_cache)
> R> gc()
>
> to try to remedy the situation myself, but I wonder if I'm missing
> something else?
>
> Perhaps having a clearCache,BSgenome method to do some cleanup might be handy?
>
> Thanks,
> -steve
>


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


