[BioC] singe_sequences.fa.gz file in Bsgenome.Hsapiens.NCBI.GRCh38 is too big

Hervé Pagès hpages at fhcrc.org
Wed Jun 18 20:34:21 CEST 2014


Hi Sean,

On 04/15/2014 11:30 PM, Sean Li [guest] wrote:
>
> The singe_sequences.fa.gz file in BSgenome.Hsapiens.NCBI.GRCh38 is too big to load. Why can't you separate it into several files, as BSgenome.Hsapiens.UCSC.hg19 does?

How are you trying to access the genome sequences in
BSgenome.Hsapiens.NCBI.GRCh38?

Note that the singe_sequences.fa.gz file is an internal detail of the
package, and you should avoid accessing it directly. The "normal" way
to access the genome sequences is via [[ or getSeq().
Use [[ to load a given chromosome:

   library(BSgenome.Hsapiens.NCBI.GRCh38)
   genome <- BSgenome.Hsapiens.NCBI.GRCh38
   genome[["1"]]

Use getSeq() to extract a set of regions (typically specified via
a GRanges object).
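
For example, here is a minimal sketch of the getSeq() route. The two
regions are made up for illustration only, and the sequence names follow
the NCBI style used by this package ("1", "2") rather than the UCSC style
("chr1", "chr2"):

```r
library(BSgenome.Hsapiens.NCBI.GRCh38)
library(GenomicRanges)

genome <- BSgenome.Hsapiens.NCBI.GRCh38

# Two hypothetical 50-bp regions, chosen only for illustration.
regions <- GRanges(
    seqnames = c("1", "2"),
    ranges   = IRanges(start = c(1000001, 2000001), width = 50),
    strand   = c("+", "-")
)

# Returns a DNAStringSet with one sequence per region. Regions on the
# minus strand come back reverse-complemented.
seqs <- getSeq(genome, regions)
seqs
```

Only the requested regions end up in the result, so this should stay well
within the memory available on a 32-bit system.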

Loading the entire genome requires R to allocate more than 3 GB of RAM,
which I don't think is possible on your platform (32-bit Windows).
That's simply the size of the human genome once in memory (i.e. in a
DNAStringSet object); the format used to store it on disk (a single
file or one file per chromosome) won't change that.
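
As a rough back-of-the-envelope check of that figure, assuming about
3.1 billion bases in the assembly (an approximate number used here only
for illustration) and one byte per letter, which is how DNAString and
DNAStringSet objects store their sequences:

```r
# Approximate number of bases in the human genome (assumption, not an
# exact count for this assembly).
n_bases <- 3.1e9

# DNAString / DNAStringSet objects use one byte per nucleotide.
bytes_per_base <- 1

total_bytes <- n_bases * bytes_per_base
total_gb    <- total_bytes / 1e9      # decimal gigabytes
total_gib   <- total_bytes / 1024^3   # binary gibibytes

total_gb
total_gib
```

Either way the total is close to or above what a 32-bit process can
address in practice, regardless of how the sequences are laid out on disk.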

Anyway, because of other issues with singe_sequences.fa.gz, today
BSgenome.Hsapiens.NCBI.GRCh38 will be updated with a new version that
uses one file per chromosome.

Cheers,
H.

>
>   -- output of sessionInfo():
>
> R version 3.1.0 (2014-04-10)
> Platform: i386-w64-mingw32/i386 (32-bit)
>
> locale:
> [1] LC_COLLATE=Chinese_People's Republic of China.936
> [2] LC_CTYPE=Chinese_People's Republic of China.936
> [3] LC_MONETARY=Chinese_People's Republic of China.936
> [4] LC_NUMERIC=C
> [5] LC_TIME=Chinese_People's Republic of China.936
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


