[Bioc-devel] new 2bit BSgenome data packages

Hervé Pagès hpages at fhcrc.org
Wed May 14 03:43:06 CEST 2014


Hi,

Most BSgenome data packages have been regenerated to use the UCSC 2bit
format to store the sequences on disk. The new packages are currently
being pushed to the BioC devel repo and should become available in the
next hour or so (they'll have version 1.4.0).

Some basic testing indicates that this new storage outperforms the old
storage format (1 .rda file per chromosome) and the more recent storage
format (1 big RAzip'ed compressed FASTA file for all chromosomes) in
every aspect: for random access with getSeq(), for working one
chromosome at a time (e.g. with [[, $, or bsapply), and also for the
size of the package tarball. Many thanks to Michael for supporting the
2bit format in rtracklayer.
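
To illustrate why a 2-bit packing can help with both random access and
package size, here is a minimal Python sketch. It is not the rtracklayer or
BSgenome implementation; the function names pack() and fetch() are
hypothetical, and the code table (T=00, C=01, A=10, G=11, first base in the
high bits of each byte) is an assumption taken from the UCSC 2bit layout.

```python
# Assumed code table, following the UCSC 2bit layout: T=00, C=01, A=10, G=11.
CODE = {"T": 0, "C": 1, "A": 2, "G": 3}
BASES = "TCAG"


def pack(seq):
    """Pack an A/C/G/T string into bytes: 4 bases per byte, first base in the high bits."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODE[base] << (6 - 2 * (i % 4))
    return bytes(out)


def fetch(packed, start, end):
    """Decode bases start..end (0-based, end-exclusive), touching only the bytes that hold them."""
    return "".join(
        BASES[(packed[i // 4] >> (6 - 2 * (i % 4))) & 3]
        for i in range(start, end)
    )


packed = pack("ACGTTGCAAC")
print(len(packed))          # 3 bytes for 10 bases
print(fetch(packed, 3, 7))  # TTGC
```

Because the byte that holds base i is simply i // 4, a subsequence can be
located without reading or decompressing anything that precedes it.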

For genomes that contain letters other than A, C, G, T, or N
(e.g. hg17, hg18, GRCh38, Ecoli, TAIR.04232008, and TAIR.TAIR9),
the 2bit format cannot be used out of the box (it is not impossible,
but it would require some workarounds). So for these genomes, I regenerated
the BSgenome data packages using the old storage format (1 .rda file
per chromosome). They are also currently being pushed to the BioC devel
repo (they'll have version 1.3.1000).
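
As a rough illustration of the constraint, the Python sketch below scans a
sequence for letters that a 2-bit code cannot represent. It assumes that runs
of N can be recorded separately as (start, length) blocks, as the UCSC 2bit
layout does, so only other letters (e.g. IUPAC ambiguity codes such as R or Y)
are a problem. The helper names are hypothetical and this is not how the
packages were actually screened.

```python
from collections import Counter

# Letters assumed to be storable: 4 bases in 2 bits each, plus N kept out of band.
SUPPORTED = set("ACGTN")


def unsupported_letters(seq):
    """Count letters that are neither A, C, G, T nor N (case-insensitive)."""
    return Counter(ch for ch in seq.upper() if ch not in SUPPORTED)


def n_blocks(seq):
    """Return a (start, length) pair, 0-based, for each run of N in the sequence."""
    blocks, start = [], None
    for i, ch in enumerate(seq.upper()):
        if ch == "N":
            if start is None:
                start = i
        elif start is not None:
            blocks.append((start, i - start))
            start = None
    if start is not None:
        blocks.append((start, len(seq) - start))
    return blocks


seq = "ACGNNNTRYA"
print(dict(unsupported_letters(seq)))  # {'R': 1, 'Y': 1}
print(n_blocks(seq))                   # [(3, 3)]
```

A genome for which unsupported_letters() comes back empty would fit the
scheme directly; any other letter needs a workaround of some kind.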

Note that, after being deprecated in BioC 2.14, the upstream sequences
(i.e. the sequences 1000/2000/5000 bases upstream of annotated
transcription starts) are not included in these new packages. Most
packages now contain a man page showing how to extract the upstream
sequences from the full genome sequences using a gene model.
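
For readers who want the idea without opening the man page, here is a
language-neutral sketch in Python of how upstream sequences can be derived
from a full chromosome sequence plus a transcription start position and
strand. It is not the code from the man pages; the function names, the
1-based inclusive coordinates, and the clipping at chromosome ends are
assumptions made for the illustration.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def upstream_range(tss, strand, width, chrom_length):
    """Return the 1-based inclusive (start, end) of the bases upstream of tss, or None."""
    if strand == "+":
        start, end = tss - width, tss - 1
    elif strand == "-":
        # On the minus strand, upstream means higher coordinates.
        start, end = tss + 1, tss + width
    else:
        raise ValueError("strand must be '+' or '-'")
    start, end = max(start, 1), min(end, chrom_length)  # clip to the chromosome
    return (start, end) if start <= end else None


def upstream_seq(chrom_seq, tss, strand, width):
    """Extract the upstream sequence, reverse-complemented for minus-strand transcripts."""
    rng = upstream_range(tss, strand, width, len(chrom_seq))
    if rng is None:
        return ""
    start, end = rng
    seq = chrom_seq[start - 1:end]
    return seq.translate(COMPLEMENT)[::-1] if strand == "-" else seq


chrom = "AACCGTTAGC"
print(upstream_seq(chrom, 5, "+", 3))  # ACC
print(upstream_seq(chrom, 5, "-", 3))  # TAA
```

With a real gene model, the tss and strand values would come from the
annotated transcripts, and width would be 1000, 2000, or 5000.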

Please let me know if you have questions or concerns about this.

Thanks,
H.

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319
