[Bioc-devel] [devteam-bioc] suggestion about bioconductor package-- BSgenome.Hsapiens.NCBI.GRCh38
Hervé Pagès
hpages at fhcrc.org
Thu Apr 17 00:15:27 CEST 2014
Hi Michael,
On 04/16/2014 02:33 PM, Michael Lawrence wrote:
> Another interesting aspect of 2bit is that it supports simple masking.
> I'm all for the 2bit direction. But then BSgenome would depend on
> rtracklayer?
Any reason I should not do that? ;-)
> Or have you reimplemented it?
Nope.
H.
>
>
> On Wed, Apr 16, 2014 at 12:59 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
>
> Hi Sean
>
> [hope you don't mind if I cc Bioc-devel]
>
> On 04/15/2014 11:47 PM, Maintainer wrote:
>
> Hi The Bioconductor Dev Team,
> A new package called BSgenome.Hsapiens.NCBI.GRCh38 has been available
> for one week, but the single-sequence.fa.gz file in this package is too
> big: the sequences of all the chromosomes are put together, which makes
> it difficult to load. Why don't you remake this package the way
> BSgenome.Hsapiens.UCSC.hg19 is made? In BSgenome.Hsapiens.UCSC.hg19,
> the sequence files for each chromosome are separated.
>
>
> Yeah I know... sigh!! ;-)
>
> All the BSgenome data packages use this new on-disk storage format, not
> just GRCh38. All the chromosomes are now put together in a single
> FASTA file compressed in the RAZip format (the extension is rz, not gz).
> The file is indexed, which makes direct random access faster. For
> example, extracting only a few small portions of the genome with
> getSeq() is much faster than with the previous on-disk storage format,
> where the chromosomes were serialized into individual files.
> For a while now, some people have expressed interest in having getSeq()
> do fast random access to the genome sequences. See for example
> this post from Michael in 2011:
>
> https://stat.ethz.ch/pipermail/bioc-devel/2011-May/002601.html
>
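To illustrate why an indexed single-file layout allows fast random access, here is a minimal sketch in Python. It is not the BSgenome or RAZip implementation: it assumes a plain uncompressed FASTA file with fixed-width lines, and the helper names build_index() and fetch() are hypothetical. The index records where each sequence starts and how its lines are laid out, so a small region can be read with a single seek instead of a scan of the whole file.

```python
import os
import tempfile


def build_index(path):
    """Map name -> (length, offset_of_first_base, bases_per_line, bytes_per_line)."""
    index = {}
    name = None
    with open(path, "rb") as fh:
        while True:
            line = fh.readline()
            if not line:
                break
            if line.startswith(b">"):
                name = line[1:].split()[0].decode()
                # Sequence data starts right after the header line.
                index[name] = [0, fh.tell(), 0, 0]
            elif name is not None:
                entry = index[name]
                bases = len(line.rstrip(b"\r\n"))
                if entry[2] == 0:  # the first data line fixes the line geometry
                    entry[2], entry[3] = bases, len(line)
                entry[0] += bases
    return {k: tuple(v) for k, v in index.items()}


def fetch(path, index, name, start, end):
    """Return bases [start, end) (0-based) of one sequence by seeking, not scanning."""
    length, offset, bases_per_line, bytes_per_line = index[name]
    start, end = max(0, start), min(end, length)
    if start >= end:
        return ""
    newline = bytes_per_line - bases_per_line
    first = offset + (start // bases_per_line) * bytes_per_line + start % bases_per_line
    # Requested bases plus the line terminators that fall between them.
    n_breaks = (end - 1) // bases_per_line - start // bases_per_line
    with open(path, "rb") as fh:
        fh.seek(first)
        raw = fh.read((end - start) + n_breaks * newline)
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode()


# Tiny made-up two-sequence file, just for the demonstration.
with tempfile.NamedTemporaryFile("wb", suffix=".fa", delete=False) as tmp:
    tmp.write(b">chr1 demo\nACGT\nTTGA\nCC\n>chr2\nGGGG\nAA\n")
demo_index = build_index(tmp.name)
print(fetch(tmp.name, demo_index, "chr1", 2, 7))  # GTTTG
os.remove(tmp.name)
```

The cost of fetch() depends on the size of the requested region, not on the size of the file. The flip side is the one discussed in this thread: reading a whole chromosome through such a layout still means stripping line breaks (and, with a compressed file, decompressing) rather than loading a ready-made object.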
> I'm aware that the new on-disk storage format slows down operations
> where you actually need to compute on the entire genome, which is
> typically done in a loop where one chromosome is loaded at a time.
> It's hard to beat the speed (and simplicity) of just loading the
> individual serialized DNAString objects in that case. This is
> unfortunate and I was a little bit worried about this when I pushed
> the new BSgenome packages to the public repos because I think this
> is a common use case.
>
> Note that the switch to this new format happened in devel in January
> and was announced on the Bioc-devel mailing list:
>
> https://stat.ethz.ch/pipermail/bioc-devel/2014-January/005150.html
>
> I was hoping people would test this and maybe start a discussion about
> whether this change was worth it or not.
>
> The good news is that I think we can have the best of both worlds i.e.
> fast direct random access and fast loading of the full sequences.
> I did some testing with the 2bit format and it looks very promising:
>
> - Random access is much faster than with the current RAZip'ed
> FASTA format,
>
> - Loading a full chromosome into memory is also very fast because
> there isn't the overhead of decompressing the data. It could
> even be that it's faster than loading the chromosome stored in
> a serialized DNAString object, or maybe not, but at least it
> should be very close.
>
> - The sequences take less space on disk and the resulting BSgenome
> package will also be slightly smaller.
>
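The points above follow from how 2-bit packing works, which can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not rtracklayer's or BSgenome's code and not a reader for real .2bit files: it uses the UCSC base codes (T=0, C=1, A=2, G=3, first base in the high bits of each byte) but leaves out the file header, the sequence index, and the separate block lists that the real format uses for N runs and soft-masked regions. The function names pack() and unpack() are hypothetical.

```python
CODES = {"T": 0, "C": 1, "A": 2, "G": 3}
BASES = "TCAG"


def pack(seq):
    """Pack an ACGT string into bytes, 4 bases per byte, first base in the high bits."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq.upper()):
        out[i // 4] |= CODES[base] << (6 - 2 * (i % 4))
    return bytes(out)


def unpack(packed, start, end):
    """Decode bases [start, end) (0-based) directly from the packed bytes."""
    return "".join(
        BASES[(packed[i // 4] >> (6 - 2 * (i % 4))) & 3] for i in range(start, end)
    )


packed = pack("ACGTTTGACC")
print(len(packed))            # 3 bytes for 10 bases
print(unpack(packed, 2, 7))   # GTTTG
```

Base i always lives in byte i // 4, so any region can be located by arithmetic alone, and there is no compression layer to undo when loading a full chromosome. Four bases per byte is also where the disk-space saving comes from.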
> It's something that I was planning to work on early in this new devel
> cycle. Seems like I should start even earlier and maybe backport to
> BioC release?
>
> Thanks,
> H.
>
>
> Best,
> Sean
>
> 2014-04-16
> ------------------------------------------------------------------------
>
>
> ____________________________________________________________________________
> devteam-bioc mailing list
> To unsubscribe from this mailing list send a blank email to
> devteam-bioc-leave at lists.fhcrc.org
> You can also unsubscribe or change your personal options at
> https://lists.fhcrc.org/mailman/listinfo/devteam-bioc
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
>
> _________________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319