[Bioc-devel] Fwd: Re: [devteam-bioc] suggestion about bioconductorpackage-- BSgenome.Hsapiens.NCBI.GRCh38

Hervé Pagès hpages at fhcrc.org
Sun Apr 20 21:28:04 CEST 2014


Hi Tim,

On 04/20/2014 12:04 PM, Tim Triche, Jr. wrote:
> Would withdrawing or regressing the Windows version of the package for a time thus be sufficient to not lose the existing progress for *nix users?

That means we would distribute binary packages that were not built
from the source packages that we also distribute. I'd rather not open
that can of worms.

The roll-back is not such a big deal anyway. Even though it will
probably be permanent for the current release, my plan is to have the
2bit-based BSgenome packages ready ASAP in devel.

Cheers,
H.

>
> --t
>
>> On Apr 20, 2014, at 10:49 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>>
>>> On 04/20/2014 10:41 AM, Martin Morgan wrote:
>>> What's the sessionInfo() after getSeq? Herve patched Rsamtools during the last
>>> release cycle to address issues on Windows, perhaps the original report is using
>>> a non-patched version of Rsamtools?
>>
>> I'm always forgetting to follow my own advice; I see consistent behavior with the following (as well as on my regular Linux machine across multiple versions of R):
>>
>>> getSeq(genome,'6',start=30000,width=50)
>>   50-letter "DNAString" instance
>> seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>>> sessionInfo()
>> R version 3.1.0 (2014-04-10)
>> Platform: i386-w64-mingw32/i386 (32-bit)
>>
>> locale:
>> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
>> [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
>> [5] LC_TIME=English_United States.1252
>>
>> attached base packages:
>> [1] parallel  stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] BSgenome.Hsapiens.NCBI.GRCh38_1.3.999 BSgenome_1.32.0
>> [3] Biostrings_2.32.0                     XVector_0.4.0
>> [5] GenomicRanges_1.16.1                  GenomeInfoDb_1.0.2
>> [7] IRanges_1.22.3                        BiocGenerics_0.10.0
>> [9] BiocInstaller_1.14.1
>>
>> loaded via a namespace (and not attached):
>> [1] bitops_1.0-6     Rsamtools_1.16.0 stats4_3.1.0     tools_3.1.0 zlibbioc_1.10.0
>>>
>>
>>>
>>> Martin
>>>
>>>> On 04/20/2014 07:39 AM, Michael Lawrence wrote:
>>>> Looks like a fairly serious bug with BSgenome:
>>>>
>>>> ---------- Forwarded message ----------
>>>> From: liyangbjmu <liyangbjmu at bjmu.edu.cn>
>>>> Date: Sun, Apr 20, 2014 at 6:31 AM
>>>> Subject: Re: Re: [Bioc-devel] [devteam-bioc] suggestion about
>>>> bioconductorpackage-- BSgenome.Hsapiens.NCBI.GRCh38
>>>> To: Michael Lawrence <lawrence.michael at gene.com>
>>>>
>>>>
>>>>   Hi Michael,
>>>>
>>>> There is another problem with BSgenome.Hsapiens.NCBI.GRCh38: getSeq does
>>>> not always return the same sequence for the same command.
>>>>
>>>> > library(BSgenome.Hsapiens.NCBI.GRCh38)
>>>> > genome<-BSgenome.Hsapiens.NCBI.GRCh38
>>>> > getSeq(genome,'6',start=30000,width=50)
>>>>    50-letter "DNAString" instance
>>>> seq: GATGGCATCCAAGAAAGGGATGAGAATGTGAGATCCAGAAGGAAAAGCAG
>>>> > getSeq(genome,'6',start=30000,width=50)
>>>>    50-letter "DNAString" instance
>>>> seq: TTGGAAGAAACAGGAAAACAGACCCTCAGAGACACAAAGGATGCTGAGAG
>>>> > getSeq(genome,'6',start=30000,width=50)
>>>>    50-letter "DNAString" instance
>>>> seq: AGTGGCAGAGAGAAGAGTTGAAGGGGAGAAGTTGCTAGAACCTTGCTGCC
>>>> > getSeq(genome,'6',start=30000,width=50)
>>>>    50-letter "DNAString" instance
>>>> seq: CTCCTGCCCCCCTACCCTCACCTGGGTACCAGCCCAGGGGCCTCGGTCTG
>>>> > getSeq(genome,'6',start=30000,width=50)
>>>>    50-letter "DNAString" instance
>>>> seq: CAAGGTCTGTAGTCCCTGCTGGATCTGCAGCAATGCCTGCATGGCTCGGG
>>>> > getSeq(genome,'6',start=30000,width=50)
>>>>    50-letter "DNAString" instance
>>>> seq: GATTGGTAAGGATGGAGAGTGACTCTGGGTTCTGCATCTGGTGGGAAATA
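
For anyone who wants to check whether they are affected, the report above boils down to "the same call returns different results". A minimal sketch of such a check follows. The helper name is_reproducible is hypothetical (it is not part of BSgenome), and the stand-in functions at the end only exist so the snippet runs without the data package installed.

```r
# Hypothetical helper (not part of BSgenome): call `f` n times and report
# whether every result is identical to the first one.
is_reproducible <- function(f, n = 10L) {
  first <- f()
  for (i in seq_len(n - 1L)) {
    if (!identical(f(), first)) return(FALSE)
  }
  TRUE
}

# Intended use with the objects from the report (requires the data package):
#   library(BSgenome.Hsapiens.NCBI.GRCh38)
#   genome <- BSgenome.Hsapiens.NCBI.GRCh38
#   is_reproducible(function()
#     as.character(getSeq(genome, "6", start = 30000, width = 50)))

# Self-contained demonstration with stand-in functions:
stable   <- function() "NNNNNNNNNN"
unstable <- local({ k <- 0L; function() { k <<- k + 1L; paste0("read", k) } })
is_reproducible(stable)    # TRUE
is_reproducible(unstable)  # FALSE
```

A FALSE on the real getSeq() call, together with sessionInfo(), would probably be the most useful thing to attach to a report.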
>>>>
>>>>
>>>>   > sessionInfo()
>>>> R version 3.1.0 (2014-04-10)
>>>> Platform: i386-w64-mingw32/i386 (32-bit)
>>>>   locale:
>>>> [1] LC_COLLATE=Chinese_People's Republic of China.936
>>>> [2] LC_CTYPE=Chinese_People's Republic of China.936
>>>> [3] LC_MONETARY=Chinese_People's Republic of China.936
>>>> [4] LC_NUMERIC=C
>>>> [5] LC_TIME=Chinese_People's Republic of China.936
>>>>   attached base packages:
>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>
>>>> Best,
>>>>
>>>> Sean
>>>> 2014-04-20
>>>> ------------------------------
>>>>
>>>> *Yang Li, PhD*
>>>> Department of Occupational and Environmental Health Science School of
>>>> Public Health, Peking University 38 Xueyuan Road, Beijing 100191, China
>>>> ------------------------------
>>>>   *From:* Michael Lawrence
>>>> *Sent:* 2014-04-17  05:15:10
>>>> *To:* Hervé Pagès
>>>> *Cc:* liyangbjmu; maintainer; bioc-devel at r-project.org
>>>> *Subject:* Re: [Bioc-devel] [devteam-bioc] suggestion about
>>>> bioconductorpackage-- BSgenome.Hsapiens.NCBI.GRCh38
>>>>    Another interesting aspect of 2bit is that it supports simple masking.
>>>> I'm all for the 2bit direction. But then BSgenome would depend on
>>>> rtracklayer? Or have you reimplemented it?
>>>>
>>>>
>>>>> On Wed, Apr 16, 2014 at 12:59 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
>>>>>
>>>>> Hi Sean
>>>>>
>>>>> [hope you don't mind if I cc Bioc-devel]
>>>>>
>>>>>> On 04/15/2014 11:47 PM, Maintainer wrote:
>>>>>>
>>>>>> Hi The Bioconductor Dev Team,
>>>>>> A new package called BSgenome.Hsapiens.NCBI.GRCh38 has been available
>>>>>> for one week, but the single-sequence.fa.gz file in this package is too
>>>>>> big. The sequences of all the chromosomes are put together, which makes
>>>>>> it difficult to load. Why not remake this package the way
>>>>>> BSgenome.Hsapiens.UCSC.hg19 is built? In BSgenome.Hsapiens.UCSC.hg19,
>>>>>> the sequence files for each chromosome are separate.
>>>>>>
>>>>>
>>>>> Yeah I know... sigh!!  ;-)
>>>>>
>>>>> All the BSgenome data packages use this new on-disk storage format, not
>>>>> just GRCh38. All the chromosomes are now put together in a single
>>>>> FASTA file compressed in the RAZip format (the extension is rz, not gz).
>>>>> The file is indexed, which makes direct random access faster. So, for
>>>>> example, extracting only some small portions of the genome with
>>>>> getSeq() is much faster than with the previous on-disk storage format,
>>>>> where the chromosomes were serialized into individual files.
>>>>> Some people have been interested for a while in having getSeq() do
>>>>> fast random access to the genome sequences. See for example
>>>>> this post from Michael in 2011:
>>>>>
>>>>>    https://stat.ethz.ch/pipermail/bioc-devel/2011-May/002601.html
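
To illustrate why an index makes random access cheap, here is a rough base-R sketch of the idea on a plain, uncompressed FASTA file. This is not the Rsamtools/RAZip code path that the BSgenome packages use; build_index and fetch are hypothetical names, and the sketch assumes "\n" line endings, single-byte characters, and a fixed line width within each record. The index stores, for each sequence, the byte offset of its first base and the number of bases per line, so the byte position of any base can be computed and the file can be read with a single seek.

```r
# Write a tiny two-sequence FASTA file, 10 bases per full line, "\n" endings.
fa <- tempfile(fileext = ".fa")
con <- file(fa, "wb")
writeLines(c(">chrA", "ACGTACGTAC", "GTTTGGCCAA", "TT",
             ">chrB", "GGGGCCCCAA", "AATT"), con, sep = "\n")
close(con)

# Hypothetical helper: one index row per record.
build_index <- function(path) {
  lines  <- readLines(path)
  nbytes <- nchar(lines, type = "bytes") + 1L       # +1 for the "\n"
  starts <- cumsum(c(0, nbytes))[seq_along(lines)]  # byte offset of each line
  hdr    <- which(startsWith(lines, ">"))
  ends   <- c(hdr[-1] - 1L, length(lines))          # last line of each record
  data.frame(
    name      = sub("^>", "", lines[hdr]),
    length    = mapply(function(h, e) sum(nchar(lines[(h + 1L):e])), hdr, ends),
    offset    = starts[hdr + 1L],                   # byte offset of first base
    linebases = nchar(lines[hdr + 1L]),             # bases per full line
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: read `width` bases starting at 1-based position `start`.
fetch <- function(path, idx, name, start, width) {
  rec <- idx[idx$name == name, ]
  stopifnot(nrow(rec) == 1L, start >= 1, start + width - 1 <= rec$length)
  linewidth <- rec$linebases + 1                    # bases plus the newline
  byte_of <- function(pos0)                         # pos0 is 0-based
    rec$offset + (pos0 %/% rec$linebases) * linewidth + pos0 %% rec$linebases
  first <- byte_of(start - 1)
  last  <- byte_of(start + width - 2)
  con <- file(path, "rb")
  on.exit(close(con))
  seek(con, where = first)
  bytes <- readBin(con, what = "raw", n = last - first + 1)
  rawToChar(bytes[bytes != as.raw(10L)])            # drop newlines in the span
}

idx <- build_index(fa)
fetch(fa, idx, "chrA", start = 9, width = 6)   # "ACGTTT"
fetch(fa, idx, "chrB", start = 8, width = 5)   # "CAAAA"
```

Only the bytes covering the requested range are read. With a compressed file the same arithmetic still applies, but a second index is needed to map uncompressed offsets to compressed blocks.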
>>>>>
>>>>> I'm aware that the new on-disk storage format slows down operations
>>>>> where you actually need to compute on the entire genome, which is
>>>>> typically done in a loop where one chromosome is loaded at a time.
>>>>> It's hard to beat the speed (and simplicity) of just loading the
>>>>> individual serialized DNAString objects in that case. This is
>>>>> unfortunate and I was a little bit worried about this when I pushed
>>>>> the new BSgenome packages to the public repos because I think this
>>>>> is a common use case.
>>>>>
>>>>> Note that the switch to this new format happened in devel in January
>>>>> and was announced on the Bioc-devel mailing list:
>>>>>
>>>>>    https://stat.ethz.ch/pipermail/bioc-devel/2014-January/005150.html
>>>>>
>>>>> I was hoping people would test this and maybe start a discussion about
>>>>> whether this change was worth it or not.
>>>>>
>>>>> The good news is that I think we can have the best of both worlds i.e.
>>>>> fast direct random access and fast loading of the full sequences.
>>>>> I did some testing with the 2bit format and it looks very promising:
>>>>>
>>>>>    - Random access is much faster than with the current RAZip'ed FASTA
>>>>>      format,
>>>>>
>>>>>    - Loading a full chromosome into memory is also very fast because
>>>>>      there isn't the overhead of decompressing the data. It could
>>>>>      even be that it's faster than loading the chromosome stored in
>>>>>      a serialized DNAString object, or maybe not, but at least it
>>>>>      should be very close.
>>>>>
>>>>>    - The sequences take less space on disk and the resulting BSgenome
>>>>>      package will also be slightly smaller.
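
For readers unfamiliar with 2bit, the points above follow from the encoding itself: each base takes two bits (T = 0, C = 1, A = 2, G = 3 in the UCSC convention), four bases fit in one byte, and no decompression step is involved. The base-R sketch below shows only that packing. It is not a reader for real .2bit files, which also carry a header plus N-block and mask-block tables, and pack_2bit and base_at are hypothetical names used for illustration.

```r
# Two-bit codes in the UCSC 2bit convention.
codes <- c(T = 0L, C = 1L, A = 2L, G = 3L)

# Hypothetical helper: pack a DNA string (T/C/A/G only) into raw bytes,
# four bases per byte, first base in the two highest bits.
pack_2bit <- function(dna) {
  v <- codes[strsplit(dna, "")[[1]]]
  stopifnot(!anyNA(v))                         # N and other letters not handled
  v <- c(v, rep(0L, (-length(v)) %% 4L))       # pad the last byte with zeros
  m <- matrix(v, nrow = 4L)                    # one column per output byte
  as.raw(colSums(m * c(64L, 16L, 4L, 1L)))
}

# Hypothetical helper: return base number i (1-based) without unpacking
# anything else.
base_at <- function(packed, i) {
  byte <- as.integer(packed[(i - 1) %/% 4 + 1])  # which byte holds base i
  slot <- 3 - (i - 1) %% 4                       # 3 = high bits ... 0 = low bits
  names(codes)[(byte %/% 4^slot) %% 4 + 1]
}

packed <- pack_2bit("TCAGGA")
packed               # 1b e0
length(packed)       # 2 bytes for 6 bases
base_at(packed, 4)   # "G"
base_at(packed, 6)   # "A"
```

Locating base i is just integer arithmetic on i, which is why random access and full-chromosome loading can both be fast with this kind of layout.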
>>>>>
>>>>> It's something that I was planning to work on early in this new devel
>>>>> cycle. Seems like I should start even earlier and maybe backport to
>>>>> BioC release?
>>>>>
>>>>> Thanks,
>>>>> H.
>>>>>
>>>>>
>>>>>> Best,
>>>>>> Sean
>>>>>>
>>>>>> 2014-04-16
>>>>>> ------------------------------------------------------------------------
>>>>>>
>>>>>>
>>>>>> ________________________________________________________________________
>>>>>> devteam-bioc mailing list
>>>>>> To unsubscribe from this mailing list send a blank email to
>>>>>> devteam-bioc-leave at lists.fhcrc.org
>>>>>> You can also unsubscribe or change your personal options at
>>>>>> https://lists.fhcrc.org/mailman/listinfo/devteam-bioc
>>>>>>
>>>>>>
>>>>> --
>>>>> Hervé Pagès
>>>>>
>>>>> Program in Computational Biology
>>>>> Division of Public Health Sciences
>>>>> Fred Hutchinson Cancer Research Center
>>>>> 1100 Fairview Ave. N, M1-B514
>>>>> P.O. Box 19024
>>>>> Seattle, WA 98109-1024
>>>>>
>>>>> E-mail: hpages at fhcrc.org
>>>>> Phone:  (206) 667-5791
>>>>> Fax:    (206) 667-1319
>>>>>
>>>>> _______________________________________________
>>>>> Bioc-devel at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>> --
>> Computational Biology / Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N.
>> PO Box 19024 Seattle, WA 98109
>>
>> Location: Arnold Building M1 B861
>> Phone: (206) 667-2793
>>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


