[Bioc-devel] Fwd: Re: [devteam-bioc] suggestion about bioconductorpackage-- BSgenome.Hsapiens.NCBI.GRCh38

Sun Apr 20 21:04:11 CEST 2014

Would withdrawing or regressing the Windows version of the package for a time thus be sufficient to not lose the existing progress for *nix users?

--t

> On Apr 20, 2014, at 10:49 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> 
>> On 04/20/2014 10:41 AM, Martin Morgan wrote:
>> What's the sessionInfo() after getSeq? Herve patched Rsamtools during the last
>> release cycle to address issues on Windows, perhaps the original report is using
>> a non-patched version of Rsamtools?
> 
> I'm always forgetting to follow my own advice; I see consistent behavior with the following (as well as on my regular Linux machine across multiple versions of R:
> 
> > getSeq(genome,'6',start=30000,width=50)
>  50-letter "DNAString" instance
> seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
> > sessionInfo()
> R version 3.1.0 (2014-04-10)
> Platform: i386-w64-mingw32/i386 (32-bit)
> 
> locale:
> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
> [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
> 
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods   base
> 
> other attached packages:
> [1] BSgenome.Hsapiens.NCBI.GRCh38_1.3.999 BSgenome_1.32.0
> [3] Biostrings_2.32.0                     XVector_0.4.0
> [5] GenomicRanges_1.16.1                  GenomeInfoDb_1.0.2
> [7] IRanges_1.22.3                        BiocGenerics_0.10.0
> [9] BiocInstaller_1.14.1
> 
> loaded via a namespace (and not attached):
> [1] bitops_1.0-6     Rsamtools_1.16.0 stats4_3.1.0     tools_3.1.0 zlibbioc_1.10.0
> >
> 
>> 
>> Martin
>> 
>>> On 04/20/2014 07:39 AM, Michael Lawrence wrote:
>>> Looks like a fairly serious bug with BSgenome:
>>> 
>>> ---------- Forwarded message ----------
>>> From: liyangbjmu <liyangbjmu at bjmu.edu.cn>
>>> Date: Sun, Apr 20, 2014 at 6:31 AM
>>> Subject: Re: Re: [Bioc-devel] [devteam-bioc] suggestion about
>>> bioconductorpackage-- BSgenome.Hsapiens.NCBI.GRCh38
>>> To: Michael Lawrence <lawrence.michael at gene.com>
>>> 
>>> 
>>>  Hi Michael,
>>> 
>>> There is another problem with BSgeome.Hsapiens.NCBI.GRCh38. It can not
>>> always get the same sequence with getSeq command.
>>> 
>>>  >library(BSgenome.Hsapiens.NCBI.GRCh38)
>>>> genome<-BSgenome.Hsapiens.NCBI.GRCh38
>>>  > getSeq(genome,*'6',start=30000,width=50*)
>>>   50-letter "DNAString" instance
>>> seq: *GATGGCATCCAAGAAAGGGATGAGAATGTGAGATCCAGAAGGAAAAGCAG*
>>>> getSeq(genome,*'6',start=30000,width=50*)
>>>   50-letter "DNAString" instance
>>> seq: *TTGGAAGAAACAGGAAAACAGACCCTCAGAGACACAAAGGATGCTGAGAG*
>>>> getSeq(genome,*'6',start=30000,width=50*)
>>>   50-letter "DNAString" instance
>>> seq: *AGTGGCAGAGAGAAGAGTTGAAGGGGAGAAGTTGCTAGAACCTTGCTGCC*
>>>> getSeq(genome,'6',start=30000,width=50)
>>>   50-letter "DNAString" instance
>>> seq: CTCCTGCCCCCCTACCCTCACCTGGGTACCAGCCCAGGGGCCTCGGTCTG
>>>> getSeq(genome,'6',start=30000,width=50)
>>>   50-letter "DNAString" instance
>>> seq: CAAGGTCTGTAGTCCCTGCTGGATCTGCAGCAATGCCTGCATGGCTCGGG
>>>> getSeq(genome,'6',start=30000,width=50)
>>>   50-letter "DNAString" instance
>>> seq: GATTGGTAAGGATGGAGAGTGACTCTGGGTTCTGCATCTGGTGGGAAATA
>>> 
>>> 
>>>  > sessionInfo()
>>> R version 3.1.0 (2014-04-10)
>>> Platform: i386-w64-mingw32/i386 (32-bit)
>>>  locale:
>>> [1] LC_COLLATE=Chinese_People's Republic of China.936
>>> [2] LC_CTYPE=Chinese_People's Republic of China.936
>>> [3] LC_MONETARY=Chinese_People's Republic of China.936
>>> [4] LC_NUMERIC=C
>>> [5] LC_TIME=Chinese_People's Republic of China.936
>>>  attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>> 
>>> Best,
>>> 
>>> Sean
>>> 2014-04-20
>>> ------------------------------
>>> 
>>> *Yang Li, PhD*
>>> Department of Occupational and Environmental Health Science School of
>>> Public Health, Peking University 38 Xueyuan Road, Beijing 100191, China
>>> ------------------------------
>>>  *·¢¼þÈË£º* Michael Lawrence
>>> *·¢ËÍÊ±¼ä£º* 2014-04-17  05:15:10
>>> *ÊÕ¼þÈË£º* Hervé_Pagès
>>> *³ËÍ£º* liyangbjmu; maintainer; bioc-devel at r-project.org
>>> *Ö÷Ìâ£º* Re: [Bioc-devel] [devteam-bioc] suggestion about
>>> bioconductorpackage-- BSgenome.Hsapiens.NCBI.GRCh38
>>>   Another interesting aspect of 2bit is that it supports simple masking.
>>> I'm all for the 2bit direction. But then BSgenome would depend on
>>> rtracklayer? Or have you reimplemented it?
>>> 
>>> 
>>>> On Wed, Apr 16, 2014 at 12:59 PM, Herv¨¦ Pag¨¨s <hpages at fhcrc.org> wrote:
>>>> 
>>>> Hi Sean
>>>> 
>>>> [hope you don't mind if I cc Bioc-devel]
>>>> 
>>>>> On 04/15/2014 11:47 PM, Maintainer wrote:
>>>>> 
>>>>> Hi The Bioconductor Dev Team,
>>>>> A new package called BSgenome.Hsapiens.NCBI.GRCh38 has been available
>>>>> for one week. But the single-sequence.fa.gz file in this package is too
>>>>> big. The sequences of all the chromosomes are put together. It is
>>>>> difficult to load.  Why do you remake this
>>>>> package as BSgenome.Hsapiens.UCSC.hg19 do? In
>>>>> BSgenome.Hsapiens.UCSC.hg19, sequence files for each chromosome are
>>>>> separated.
>>>>> 
>>>> 
>>>> Yeah I know... sigh!!  ;-)
>>>> 
>>>> All the BSgenome data packages use this new on disk storage format, not
>>>> just GRCh38. All the chromosomes are now put together in a single
>>>> FASTA file compressed in the RAZip format (extension is rz, not gz).
>>>> The file is indexed so it makes direct random access faster. So for
>>>> example, extracting only some small portions of the genome with
>>>> getSeq() is much faster than with the previous on disk storage format
>>>> where the chromosomes were serialized into individual files.
>>>> Some people have manifested interest in having getSeq() do fast
>>>> random access to the genome sequences for a while. See for example
>>>> this post from Michael in 2011:
>>>> 
>>>>   https://stat.ethz.ch/pipermail/bioc-devel/2011-May/002601.html
>>>> 
>>>> I'm aware that the new on disk storage format slows down operations
>>>> where you actually need to compute on the entire genome, which is
>>>> typically done in a loop where one chromosome is loaded at a time.
>>>> It's hard to beat the speed (and simplicity) of just loading the
>>>> individual serialized DNAString objects in that case. This is
>>>> unfortunate and I was a little bit worried about this when I pushed
>>>> the new BSgenome packages to the public repos because I think this
>>>> is a common use case.
>>>> 
>>>> Note that the switch to this new format happened in devel in January
>>>> and was announced on the Bioc-devel mailing list:
>>>> 
>>>>   https://stat.ethz.ch/pipermail/bioc-devel/2014-January/005150.html
>>>> 
>>>> I was hoping people would test this and maybe start a discussion about
>>>> whether this change was worth it or not.
>>>> 
>>>> The good news is that I think we can have the best of both worlds i.e.
>>>> fast direct random access and fast loading of the full sequences.
>>>> I did some testing with the 2bit format and it looks very promising:
>>>> 
>>>>   - Random access is much faster than with current RAZip'ed FASTA
>>>>     format,
>>>> 
>>>>   - Loading a full chromosome into memory is also very fast because
>>>>     there isn't the overhead of decompressing the data. It could
>>>>     even be that it's faster than loading the chromosome stored in
>>>>     a serialized DNAString object, or maybe not, but at least it
>>>>     should be very close.
>>>> 
>>>>   - The sequences take less space on disk and the resulting BSgenome
>>>>     package will also be slightly smaller.
>>>> 
>>>> It's something that I was planning to work on early in this new devel
>>>> cycle. Seems like I should start even earlier and maybe backport to
>>>> BioC release?
>>>> 
>>>> Thanks,
>>>> H.
>>>> 
>>>> 
>>>> Best,
>>>>> Sean
>>>>> 
>>>>> 2014-04-16
>>>>> ------------------------------------------------------------------------
>>>>> 
>>>>> 
>>>>> ________________________________________________________________________
>>>>> devteam-bioc mailing list
>>>>> To unsubscribe from this mailing list send a blank email to
>>>>> devteam-bioc-leave at lists.fhcrc.org
>>>>> You can also unsubscribe or change your personal options at
>>>>> https://lists.fhcrc.org/mailman/listinfo/devteam-bioc
>>>>> 
>>>>> 
>>>> --
>>>> Herv¨¦ Pag¨¨s
>>>> 
>>>> Program in Computational Biology
>>>> Division of Public Health Sciences
>>>> Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N, M1-B514
>>>> P.O. Box 19024
>>>> Seattle, WA 98109-1024
>>>> 
>>>> E-mail: hpages at fhcrc.org
>>>> Phone:  (206) 667-5791
>>>> Fax:    (206) 667-1319 <%28206%29%20667-1319>
>>>> 
>>>> _______________________________________________
>>>> Bioc-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>> 
>>> 
>>>    [[alternative HTML version deleted]]
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>> 
>> 
>> 
> 
> 
> -- 
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
> 
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
> 
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel