[Bioc-devel] [devteam-bioc] suggestion about bioconductor package-- BSgenome.Hsapiens.NCBI.GRCh38

Thu Apr 17 00:15:27 CEST 2014

Hi Michael,

On 04/16/2014 02:33 PM, Michael Lawrence wrote:
> Another interesting aspect of 2bit is that it supports simple masking.
> I'm all for the 2bit direction. But then BSgenome would depend on
> rtracklayer?

Any reason I should not do that? ;-)

> Or have you reimplemented it?

Nope.

H.

>
>
> On Wed, Apr 16, 2014 at 12:59 PM, Hervé Pagès <hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>> wrote:
>
>     Hi Sean
>
>     [hope you don't mind if I cc Bioc-devel]
>
>     On 04/15/2014 11:47 PM, Maintainer wrote:
>
>         Hi The Bioconductor Dev Team,
>         A new package called BSgenome.Hsapiens.NCBI.GRCh38 has been
>         available
>         for one week. But the single-sequence.fa.gz file in this package
>         is too
>         big. The sequences of all the chromosomes are put together. It is
>         difficult to load.  Why do you remake this
>         package as BSgenome.Hsapiens.UCSC.hg19 do? In
>         BSgenome.Hsapiens.UCSC.hg19, sequence files for each chromosome are
>         separated.
>
>
>     Yeah I know... sigh!!  ;-)
>
>     All the BSgenome data packages use this new on disk storage format, not
>     just GRCh38. All the chromosomes are now put together in a single
>     FASTA file compressed in the RAZip format (extension is rz, not gz).
>     The file is indexed so it makes direct random access faster. So for
>     example, extracting only some small portions of the genome with
>     getSeq() is much faster than with the previous on disk storage format
>     where the chromosomes were serialized into individual files.
>     Some people have manifested interest in having getSeq() do fast
>     random access to the genome sequences for a while. See for example
>     this post from Michael in 2011:
>
>     https://stat.ethz.ch/__pipermail/bioc-devel/2011-May/__002601.html
>     <https://stat.ethz.ch/pipermail/bioc-devel/2011-May/002601.html>
>
>     I'm aware that the new on disk storage format slows down operations
>     where you actually need to compute on the entire genome, which is
>     typically done in a loop where one chromosome is loaded at a time.
>     It's hard to beat the speed (and simplicity) of just loading the
>     individual serialized DNAString objects in that case. This is
>     unfortunate and I was a little bit worried about this when I pushed
>     the new BSgenome packages to the public repos because I think this
>     is a common use case.
>
>     Note that the switch to this new format happened in devel in January
>     and was announced on the Bioc-devel mailing list:
>
>     https://stat.ethz.ch/__pipermail/bioc-devel/2014-__January/005150.html
>     <https://stat.ethz.ch/pipermail/bioc-devel/2014-January/005150.html>
>
>     I was hoping people would test this and maybe start a discussion about
>     whether this change was worth it or not.
>
>     The good news is that I think we can have the best of both worlds i.e.
>     fast direct random access and fast loading of the full sequences.
>     I did some testing with the 2bit format and it looks very promising:
>
>        - Random access is much faster than with current RAZip'ed FASTA
>          format,
>
>        - Loading a full chromosome into memory is also very fast because
>          there isn't the overhead of decompressing the data. It could
>          even be that it's faster than loading the chromosome stored in
>          a serialized DNAString object, or maybe not, but at least it
>          should be very close.
>
>        - The sequences take less space on disk and the resulting BSgenome
>          package will also be slightly smaller.
>
>     It's something that I was planning to work on early in this new devel
>     cycle. Seems like I should start even earlier and maybe backport to
>     BioC release?
>
>     Thanks,
>     H.
>
>
>         Best,
>         Sean
>
>         2014-04-16
>         ------------------------------__------------------------------__------------
>
>
>         ____________________________________________________________________________
>         devteam-bioc mailing list
>         To unsubscribe from this mailing list send a blank email to
>         devteam-bioc-leave at lists.__fhcrc.org
>         <mailto:devteam-bioc-leave at lists.fhcrc.org>
>         You can also unsubscribe or change your personal options at
>         https://lists.fhcrc.org/__mailman/listinfo/devteam-bioc
>         <https://lists.fhcrc.org/mailman/listinfo/devteam-bioc>
>
>
>     --
>     Hervé Pagès
>
>     Program in Computational Biology
>     Division of Public Health Sciences
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N, M1-B514
>     P.O. Box 19024
>     Seattle, WA 98109-1024
>
>     E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>     Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>     Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>     _________________________________________________
>     Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org> mailing list
>     https://stat.ethz.ch/mailman/__listinfo/bioc-devel
>     <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319