[Bioc-devel] Fwd: Re: [devteam-bioc] suggestion about bioconductorpackage-- BSgenome.Hsapiens.NCBI.GRCh38

Sun Apr 20 19:41:06 CEST 2014

What's the sessionInfo() after getSeq? Herve patched Rsamtools during the last 
release cycle to address issues on Windows, perhaps the original report is using 
a non-patched version of Rsamtools?

Martin

On 04/20/2014 07:39 AM, Michael Lawrence wrote:
> Looks like a fairly serious bug with BSgenome:
>
> ---------- Forwarded message ----------
> From: liyangbjmu <liyangbjmu at bjmu.edu.cn>
> Date: Sun, Apr 20, 2014 at 6:31 AM
> Subject: Re: Re: [Bioc-devel] [devteam-bioc] suggestion about
> bioconductorpackage-- BSgenome.Hsapiens.NCBI.GRCh38
> To: Michael Lawrence <lawrence.michael at gene.com>
>
>
>   Hi Michael,
>
> There is another problem with BSgeome.Hsapiens.NCBI.GRCh38. It can not
> always get the same sequence with getSeq command.
>
>   >library(BSgenome.Hsapiens.NCBI.GRCh38)
>> genome<-BSgenome.Hsapiens.NCBI.GRCh38
>   > getSeq(genome,*'6',start=30000,width=50*)
>    50-letter "DNAString" instance
> seq: *GATGGCATCCAAGAAAGGGATGAGAATGTGAGATCCAGAAGGAAAAGCAG*
>> getSeq(genome,*'6',start=30000,width=50*)
>    50-letter "DNAString" instance
> seq: *TTGGAAGAAACAGGAAAACAGACCCTCAGAGACACAAAGGATGCTGAGAG*
>> getSeq(genome,*'6',start=30000,width=50*)
>    50-letter "DNAString" instance
> seq: *AGTGGCAGAGAGAAGAGTTGAAGGGGAGAAGTTGCTAGAACCTTGCTGCC*
>> getSeq(genome,'6',start=30000,width=50)
>    50-letter "DNAString" instance
> seq: CTCCTGCCCCCCTACCCTCACCTGGGTACCAGCCCAGGGGCCTCGGTCTG
>> getSeq(genome,'6',start=30000,width=50)
>    50-letter "DNAString" instance
> seq: CAAGGTCTGTAGTCCCTGCTGGATCTGCAGCAATGCCTGCATGGCTCGGG
>> getSeq(genome,'6',start=30000,width=50)
>    50-letter "DNAString" instance
> seq: GATTGGTAAGGATGGAGAGTGACTCTGGGTTCTGCATCTGGTGGGAAATA
>
>
>   > sessionInfo()
> R version 3.1.0 (2014-04-10)
> Platform: i386-w64-mingw32/i386 (32-bit)
>   locale:
> [1] LC_COLLATE=Chinese_People's Republic of China.936
> [2] LC_CTYPE=Chinese_People's Republic of China.936
> [3] LC_MONETARY=Chinese_People's Republic of China.936
> [4] LC_NUMERIC=C
> [5] LC_TIME=Chinese_People's Republic of China.936
>   attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> Best,
>
> Sean
> 2014-04-20
> ------------------------------
>
> *Yang Li, PhD*
> Department of Occupational and Environmental Health Science School of
> Public Health, Peking University 38 Xueyuan Road, Beijing 100191, China
> ------------------------------
>   *·¢¼þÈË£º* Michael Lawrence
> *·¢ËÍÊ±¼ä£º* 2014-04-17  05:15:10
> *ÊÕ¼þÈË£º* Hervé_Pagès
> *³ËÍ£º* liyangbjmu; maintainer; bioc-devel at r-project.org
> *Ö÷Ìâ£º* Re: [Bioc-devel] [devteam-bioc] suggestion about
> bioconductorpackage-- BSgenome.Hsapiens.NCBI.GRCh38
>    Another interesting aspect of 2bit is that it supports simple masking.
> I'm all for the 2bit direction. But then BSgenome would depend on
> rtracklayer? Or have you reimplemented it?
>
>
> On Wed, Apr 16, 2014 at 12:59 PM, Herv¨¦ Pag¨¨s <hpages at fhcrc.org> wrote:
>
>> Hi Sean
>>
>> [hope you don't mind if I cc Bioc-devel]
>>
>> On 04/15/2014 11:47 PM, Maintainer wrote:
>>
>>> Hi The Bioconductor Dev Team,
>>> A new package called BSgenome.Hsapiens.NCBI.GRCh38 has been available
>>> for one week. But the single-sequence.fa.gz file in this package is too
>>> big. The sequences of all the chromosomes are put together. It is
>>> difficult to load.  Why do you remake this
>>> package as BSgenome.Hsapiens.UCSC.hg19 do? In
>>> BSgenome.Hsapiens.UCSC.hg19, sequence files for each chromosome are
>>> separated.
>>>
>>
>> Yeah I know... sigh!!  ;-)
>>
>> All the BSgenome data packages use this new on disk storage format, not
>> just GRCh38. All the chromosomes are now put together in a single
>> FASTA file compressed in the RAZip format (extension is rz, not gz).
>> The file is indexed so it makes direct random access faster. So for
>> example, extracting only some small portions of the genome with
>> getSeq() is much faster than with the previous on disk storage format
>> where the chromosomes were serialized into individual files.
>> Some people have manifested interest in having getSeq() do fast
>> random access to the genome sequences for a while. See for example
>> this post from Michael in 2011:
>>
>>    https://stat.ethz.ch/pipermail/bioc-devel/2011-May/002601.html
>>
>> I'm aware that the new on disk storage format slows down operations
>> where you actually need to compute on the entire genome, which is
>> typically done in a loop where one chromosome is loaded at a time.
>> It's hard to beat the speed (and simplicity) of just loading the
>> individual serialized DNAString objects in that case. This is
>> unfortunate and I was a little bit worried about this when I pushed
>> the new BSgenome packages to the public repos because I think this
>> is a common use case.
>>
>> Note that the switch to this new format happened in devel in January
>> and was announced on the Bioc-devel mailing list:
>>
>>    https://stat.ethz.ch/pipermail/bioc-devel/2014-January/005150.html
>>
>> I was hoping people would test this and maybe start a discussion about
>> whether this change was worth it or not.
>>
>> The good news is that I think we can have the best of both worlds i.e.
>> fast direct random access and fast loading of the full sequences.
>> I did some testing with the 2bit format and it looks very promising:
>>
>>    - Random access is much faster than with current RAZip'ed FASTA
>>      format,
>>
>>    - Loading a full chromosome into memory is also very fast because
>>      there isn't the overhead of decompressing the data. It could
>>      even be that it's faster than loading the chromosome stored in
>>      a serialized DNAString object, or maybe not, but at least it
>>      should be very close.
>>
>>    - The sequences take less space on disk and the resulting BSgenome
>>      package will also be slightly smaller.
>>
>> It's something that I was planning to work on early in this new devel
>> cycle. Seems like I should start even earlier and maybe backport to
>> BioC release?
>>
>> Thanks,
>> H.
>>
>>
>> Best,
>>> Sean
>>>
>>> 2014-04-16
>>> ------------------------------------------------------------------------
>>>
>>>
>>> ________________________________________________________________________
>>> devteam-bioc mailing list
>>> To unsubscribe from this mailing list send a blank email to
>>> devteam-bioc-leave at lists.fhcrc.org
>>> You can also unsubscribe or change your personal options at
>>> https://lists.fhcrc.org/mailman/listinfo/devteam-bioc
>>>
>>>
>> --
>> Herv¨¦ Pag¨¨s
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319 <%28206%29%20667-1319>
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>
> 	[[alternative HTML version deleted]]
>
>
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>

-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793