[Bioc-devel] Fwd: Re: [devteam-bioc] suggestion about bioconductorpackage-- BSgenome.Hsapiens.NCBI.GRCh38

Martin Morgan mtmorgan at fhcrc.org
Sun Apr 20 19:49:54 CEST 2014


On 04/20/2014 10:41 AM, Martin Morgan wrote:
> What's the sessionInfo() after getSeq? Herve patched Rsamtools during the last
> release cycle to address issues on Windows, perhaps the original report is using
> a non-patched version of Rsamtools?

I'm always forgetting to follow my own advice; I see consistent behavior with 
the following (as well as on my regular Linux machine across multiple versions of R:

 > getSeq(genome,'6',start=30000,width=50)
   50-letter "DNAString" instance
seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
 > sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: i386-w64-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] BSgenome.Hsapiens.NCBI.GRCh38_1.3.999 BSgenome_1.32.0
[3] Biostrings_2.32.0                     XVector_0.4.0
[5] GenomicRanges_1.16.1                  GenomeInfoDb_1.0.2
[7] IRanges_1.22.3                        BiocGenerics_0.10.0
[9] BiocInstaller_1.14.1

loaded via a namespace (and not attached):
[1] bitops_1.0-6     Rsamtools_1.16.0 stats4_3.1.0     tools_3.1.0 
zlibbioc_1.10.0
 >

>
> Martin
>
> On 04/20/2014 07:39 AM, Michael Lawrence wrote:
>> Looks like a fairly serious bug with BSgenome:
>>
>> ---------- Forwarded message ----------
>> From: liyangbjmu <liyangbjmu at bjmu.edu.cn>
>> Date: Sun, Apr 20, 2014 at 6:31 AM
>> Subject: Re: Re: [Bioc-devel] [devteam-bioc] suggestion about
>> bioconductorpackage-- BSgenome.Hsapiens.NCBI.GRCh38
>> To: Michael Lawrence <lawrence.michael at gene.com>
>>
>>
>>   Hi Michael,
>>
>> There is another problem with BSgeome.Hsapiens.NCBI.GRCh38. It can not
>> always get the same sequence with getSeq command.
>>
>>   >library(BSgenome.Hsapiens.NCBI.GRCh38)
>>> genome<-BSgenome.Hsapiens.NCBI.GRCh38
>>   > getSeq(genome,*'6',start=30000,width=50*)
>>    50-letter "DNAString" instance
>> seq: *GATGGCATCCAAGAAAGGGATGAGAATGTGAGATCCAGAAGGAAAAGCAG*
>>> getSeq(genome,*'6',start=30000,width=50*)
>>    50-letter "DNAString" instance
>> seq: *TTGGAAGAAACAGGAAAACAGACCCTCAGAGACACAAAGGATGCTGAGAG*
>>> getSeq(genome,*'6',start=30000,width=50*)
>>    50-letter "DNAString" instance
>> seq: *AGTGGCAGAGAGAAGAGTTGAAGGGGAGAAGTTGCTAGAACCTTGCTGCC*
>>> getSeq(genome,'6',start=30000,width=50)
>>    50-letter "DNAString" instance
>> seq: CTCCTGCCCCCCTACCCTCACCTGGGTACCAGCCCAGGGGCCTCGGTCTG
>>> getSeq(genome,'6',start=30000,width=50)
>>    50-letter "DNAString" instance
>> seq: CAAGGTCTGTAGTCCCTGCTGGATCTGCAGCAATGCCTGCATGGCTCGGG
>>> getSeq(genome,'6',start=30000,width=50)
>>    50-letter "DNAString" instance
>> seq: GATTGGTAAGGATGGAGAGTGACTCTGGGTTCTGCATCTGGTGGGAAATA
>>
>>
>>   > sessionInfo()
>> R version 3.1.0 (2014-04-10)
>> Platform: i386-w64-mingw32/i386 (32-bit)
>>   locale:
>> [1] LC_COLLATE=Chinese_People's Republic of China.936
>> [2] LC_CTYPE=Chinese_People's Republic of China.936
>> [3] LC_MONETARY=Chinese_People's Republic of China.936
>> [4] LC_NUMERIC=C
>> [5] LC_TIME=Chinese_People's Republic of China.936
>>   attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> Best,
>>
>> Sean
>> 2014-04-20
>> ------------------------------
>>
>> *Yang Li, PhD*
>> Department of Occupational and Environmental Health Science School of
>> Public Health, Peking University 38 Xueyuan Road, Beijing 100191, China
>> ------------------------------
>>   *·¢¼þÈË£º* Michael Lawrence
>> *·¢ËÍʱ¼ä£º* 2014-04-17  05:15:10
>> *ÊÕ¼þÈË£º* Hervé_Pagès
>> *³­ËÍ£º* liyangbjmu; maintainer; bioc-devel at r-project.org
>> *Ö÷Ì⣺* Re: [Bioc-devel] [devteam-bioc] suggestion about
>> bioconductorpackage-- BSgenome.Hsapiens.NCBI.GRCh38
>>    Another interesting aspect of 2bit is that it supports simple masking.
>> I'm all for the 2bit direction. But then BSgenome would depend on
>> rtracklayer? Or have you reimplemented it?
>>
>>
>> On Wed, Apr 16, 2014 at 12:59 PM, Herv¨¦ Pag¨¨s <hpages at fhcrc.org> wrote:
>>
>>> Hi Sean
>>>
>>> [hope you don't mind if I cc Bioc-devel]
>>>
>>> On 04/15/2014 11:47 PM, Maintainer wrote:
>>>
>>>> Hi The Bioconductor Dev Team,
>>>> A new package called BSgenome.Hsapiens.NCBI.GRCh38 has been available
>>>> for one week. But the single-sequence.fa.gz file in this package is too
>>>> big. The sequences of all the chromosomes are put together. It is
>>>> difficult to load.  Why do you remake this
>>>> package as BSgenome.Hsapiens.UCSC.hg19 do? In
>>>> BSgenome.Hsapiens.UCSC.hg19, sequence files for each chromosome are
>>>> separated.
>>>>
>>>
>>> Yeah I know... sigh!!  ;-)
>>>
>>> All the BSgenome data packages use this new on disk storage format, not
>>> just GRCh38. All the chromosomes are now put together in a single
>>> FASTA file compressed in the RAZip format (extension is rz, not gz).
>>> The file is indexed so it makes direct random access faster. So for
>>> example, extracting only some small portions of the genome with
>>> getSeq() is much faster than with the previous on disk storage format
>>> where the chromosomes were serialized into individual files.
>>> Some people have manifested interest in having getSeq() do fast
>>> random access to the genome sequences for a while. See for example
>>> this post from Michael in 2011:
>>>
>>>    https://stat.ethz.ch/pipermail/bioc-devel/2011-May/002601.html
>>>
>>> I'm aware that the new on disk storage format slows down operations
>>> where you actually need to compute on the entire genome, which is
>>> typically done in a loop where one chromosome is loaded at a time.
>>> It's hard to beat the speed (and simplicity) of just loading the
>>> individual serialized DNAString objects in that case. This is
>>> unfortunate and I was a little bit worried about this when I pushed
>>> the new BSgenome packages to the public repos because I think this
>>> is a common use case.
>>>
>>> Note that the switch to this new format happened in devel in January
>>> and was announced on the Bioc-devel mailing list:
>>>
>>>    https://stat.ethz.ch/pipermail/bioc-devel/2014-January/005150.html
>>>
>>> I was hoping people would test this and maybe start a discussion about
>>> whether this change was worth it or not.
>>>
>>> The good news is that I think we can have the best of both worlds i.e.
>>> fast direct random access and fast loading of the full sequences.
>>> I did some testing with the 2bit format and it looks very promising:
>>>
>>>    - Random access is much faster than with current RAZip'ed FASTA
>>>      format,
>>>
>>>    - Loading a full chromosome into memory is also very fast because
>>>      there isn't the overhead of decompressing the data. It could
>>>      even be that it's faster than loading the chromosome stored in
>>>      a serialized DNAString object, or maybe not, but at least it
>>>      should be very close.
>>>
>>>    - The sequences take less space on disk and the resulting BSgenome
>>>      package will also be slightly smaller.
>>>
>>> It's something that I was planning to work on early in this new devel
>>> cycle. Seems like I should start even earlier and maybe backport to
>>> BioC release?
>>>
>>> Thanks,
>>> H.
>>>
>>>
>>> Best,
>>>> Sean
>>>>
>>>> 2014-04-16
>>>> ------------------------------------------------------------------------
>>>>
>>>>
>>>> ________________________________________________________________________
>>>> devteam-bioc mailing list
>>>> To unsubscribe from this mailing list send a blank email to
>>>> devteam-bioc-leave at lists.fhcrc.org
>>>> You can also unsubscribe or change your personal options at
>>>> https://lists.fhcrc.org/mailman/listinfo/devteam-bioc
>>>>
>>>>
>>> --
>>> Herv¨¦ Pag¨¨s
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org
>>> Phone:  (206) 667-5791
>>> Fax:    (206) 667-1319 <%28206%29%20667-1319>
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>
>>     [[alternative HTML version deleted]]
>>
>>
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>
>


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the Bioc-devel mailing list