[Bioc-devel] Fwd: Re: [devteam-bioc] suggestion about bioconductorpackage-- BSgenome.Hsapiens.NCBI.GRCh38
Tim Triche, Jr.
tim.triche at gmail.com
Sun Apr 20 21:04:11 CEST 2014
Would withdrawing or regressing the Windows version of the package for a time thus be sufficient to not lose the existing progress for *nix users?
--t
> On Apr 20, 2014, at 10:49 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>
>> On 04/20/2014 10:41 AM, Martin Morgan wrote:
>> What's the sessionInfo() after getSeq? Herve patched Rsamtools during the last
>> release cycle to address issues on Windows, perhaps the original report is using
>> a non-patched version of Rsamtools?
>
> I'm always forgetting to follow my own advice; I see consistent behavior with the following (as well as on my regular Linux machine across multiple versions of R:
>
> > getSeq(genome,'6',start=30000,width=50)
> 50-letter "DNAString" instance
> seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
> > sessionInfo()
> R version 3.1.0 (2014-04-10)
> Platform: i386-w64-mingw32/i386 (32-bit)
>
> locale:
> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
> [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] parallel stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] BSgenome.Hsapiens.NCBI.GRCh38_1.3.999 BSgenome_1.32.0
> [3] Biostrings_2.32.0 XVector_0.4.0
> [5] GenomicRanges_1.16.1 GenomeInfoDb_1.0.2
> [7] IRanges_1.22.3 BiocGenerics_0.10.0
> [9] BiocInstaller_1.14.1
>
> loaded via a namespace (and not attached):
> [1] bitops_1.0-6 Rsamtools_1.16.0 stats4_3.1.0 tools_3.1.0 zlibbioc_1.10.0
> >
>
>>
>> Martin
>>
>>> On 04/20/2014 07:39 AM, Michael Lawrence wrote:
>>> Looks like a fairly serious bug with BSgenome:
>>>
>>> ---------- Forwarded message ----------
>>> From: liyangbjmu <liyangbjmu at bjmu.edu.cn>
>>> Date: Sun, Apr 20, 2014 at 6:31 AM
>>> Subject: Re: Re: [Bioc-devel] [devteam-bioc] suggestion about
>>> bioconductorpackage-- BSgenome.Hsapiens.NCBI.GRCh38
>>> To: Michael Lawrence <lawrence.michael at gene.com>
>>>
>>>
>>> Hi Michael,
>>>
>>> There is another problem with BSgeome.Hsapiens.NCBI.GRCh38. It can not
>>> always get the same sequence with getSeq command.
>>>
>>> >library(BSgenome.Hsapiens.NCBI.GRCh38)
>>>> genome<-BSgenome.Hsapiens.NCBI.GRCh38
>>> > getSeq(genome,*'6',start=30000,width=50*)
>>> 50-letter "DNAString" instance
>>> seq: *GATGGCATCCAAGAAAGGGATGAGAATGTGAGATCCAGAAGGAAAAGCAG*
>>>> getSeq(genome,*'6',start=30000,width=50*)
>>> 50-letter "DNAString" instance
>>> seq: *TTGGAAGAAACAGGAAAACAGACCCTCAGAGACACAAAGGATGCTGAGAG*
>>>> getSeq(genome,*'6',start=30000,width=50*)
>>> 50-letter "DNAString" instance
>>> seq: *AGTGGCAGAGAGAAGAGTTGAAGGGGAGAAGTTGCTAGAACCTTGCTGCC*
>>>> getSeq(genome,'6',start=30000,width=50)
>>> 50-letter "DNAString" instance
>>> seq: CTCCTGCCCCCCTACCCTCACCTGGGTACCAGCCCAGGGGCCTCGGTCTG
>>>> getSeq(genome,'6',start=30000,width=50)
>>> 50-letter "DNAString" instance
>>> seq: CAAGGTCTGTAGTCCCTGCTGGATCTGCAGCAATGCCTGCATGGCTCGGG
>>>> getSeq(genome,'6',start=30000,width=50)
>>> 50-letter "DNAString" instance
>>> seq: GATTGGTAAGGATGGAGAGTGACTCTGGGTTCTGCATCTGGTGGGAAATA
>>>
>>>
>>> > sessionInfo()
>>> R version 3.1.0 (2014-04-10)
>>> Platform: i386-w64-mingw32/i386 (32-bit)
>>> locale:
>>> [1] LC_COLLATE=Chinese_People's Republic of China.936
>>> [2] LC_CTYPE=Chinese_People's Republic of China.936
>>> [3] LC_MONETARY=Chinese_People's Republic of China.936
>>> [4] LC_NUMERIC=C
>>> [5] LC_TIME=Chinese_People's Republic of China.936
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> Best,
>>>
>>> Sean
>>> 2014-04-20
>>> ------------------------------
>>>
>>> *Yang Li, PhD*
>>> Department of Occupational and Environmental Health Science School of
>>> Public Health, Peking University 38 Xueyuan Road, Beijing 100191, China
>>> ------------------------------
>>> *·¢¼þÈË£º* Michael Lawrence
>>> *·¢ËÍʱ¼ä£º* 2014-04-17 05:15:10
>>> *ÊÕ¼þÈË£º* Hervé_Pagès
>>> *³ËÍ£º* liyangbjmu; maintainer; bioc-devel at r-project.org
>>> *Ö÷Ì⣺* Re: [Bioc-devel] [devteam-bioc] suggestion about
>>> bioconductorpackage-- BSgenome.Hsapiens.NCBI.GRCh38
>>> Another interesting aspect of 2bit is that it supports simple masking.
>>> I'm all for the 2bit direction. But then BSgenome would depend on
>>> rtracklayer? Or have you reimplemented it?
>>>
>>>
>>>> On Wed, Apr 16, 2014 at 12:59 PM, Herv¨¦ Pag¨¨s <hpages at fhcrc.org> wrote:
>>>>
>>>> Hi Sean
>>>>
>>>> [hope you don't mind if I cc Bioc-devel]
>>>>
>>>>> On 04/15/2014 11:47 PM, Maintainer wrote:
>>>>>
>>>>> Hi The Bioconductor Dev Team,
>>>>> A new package called BSgenome.Hsapiens.NCBI.GRCh38 has been available
>>>>> for one week. But the single-sequence.fa.gz file in this package is too
>>>>> big. The sequences of all the chromosomes are put together. It is
>>>>> difficult to load. Why do you remake this
>>>>> package as BSgenome.Hsapiens.UCSC.hg19 do? In
>>>>> BSgenome.Hsapiens.UCSC.hg19, sequence files for each chromosome are
>>>>> separated.
>>>>>
>>>>
>>>> Yeah I know... sigh!! ;-)
>>>>
>>>> All the BSgenome data packages use this new on disk storage format, not
>>>> just GRCh38. All the chromosomes are now put together in a single
>>>> FASTA file compressed in the RAZip format (extension is rz, not gz).
>>>> The file is indexed so it makes direct random access faster. So for
>>>> example, extracting only some small portions of the genome with
>>>> getSeq() is much faster than with the previous on disk storage format
>>>> where the chromosomes were serialized into individual files.
>>>> Some people have manifested interest in having getSeq() do fast
>>>> random access to the genome sequences for a while. See for example
>>>> this post from Michael in 2011:
>>>>
>>>> https://stat.ethz.ch/pipermail/bioc-devel/2011-May/002601.html
>>>>
>>>> I'm aware that the new on disk storage format slows down operations
>>>> where you actually need to compute on the entire genome, which is
>>>> typically done in a loop where one chromosome is loaded at a time.
>>>> It's hard to beat the speed (and simplicity) of just loading the
>>>> individual serialized DNAString objects in that case. This is
>>>> unfortunate and I was a little bit worried about this when I pushed
>>>> the new BSgenome packages to the public repos because I think this
>>>> is a common use case.
>>>>
>>>> Note that the switch to this new format happened in devel in January
>>>> and was announced on the Bioc-devel mailing list:
>>>>
>>>> https://stat.ethz.ch/pipermail/bioc-devel/2014-January/005150.html
>>>>
>>>> I was hoping people would test this and maybe start a discussion about
>>>> whether this change was worth it or not.
>>>>
>>>> The good news is that I think we can have the best of both worlds i.e.
>>>> fast direct random access and fast loading of the full sequences.
>>>> I did some testing with the 2bit format and it looks very promising:
>>>>
>>>> - Random access is much faster than with current RAZip'ed FASTA
>>>> format,
>>>>
>>>> - Loading a full chromosome into memory is also very fast because
>>>> there isn't the overhead of decompressing the data. It could
>>>> even be that it's faster than loading the chromosome stored in
>>>> a serialized DNAString object, or maybe not, but at least it
>>>> should be very close.
>>>>
>>>> - The sequences take less space on disk and the resulting BSgenome
>>>> package will also be slightly smaller.
>>>>
>>>> It's something that I was planning to work on early in this new devel
>>>> cycle. Seems like I should start even earlier and maybe backport to
>>>> BioC release?
>>>>
>>>> Thanks,
>>>> H.
>>>>
>>>>
>>>> Best,
>>>>> Sean
>>>>>
>>>>> 2014-04-16
>>>>> ------------------------------------------------------------------------
>>>>>
>>>>>
>>>>> ________________________________________________________________________
>>>>> devteam-bioc mailing list
>>>>> To unsubscribe from this mailing list send a blank email to
>>>>> devteam-bioc-leave at lists.fhcrc.org
>>>>> You can also unsubscribe or change your personal options at
>>>>> https://lists.fhcrc.org/mailman/listinfo/devteam-bioc
>>>>>
>>>>>
>>>> --
>>>> Herv¨¦ Pag¨¨s
>>>>
>>>> Program in Computational Biology
>>>> Division of Public Health Sciences
>>>> Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N, M1-B514
>>>> P.O. Box 19024
>>>> Seattle, WA 98109-1024
>>>>
>>>> E-mail: hpages at fhcrc.org
>>>> Phone: (206) 667-5791
>>>> Fax: (206) 667-1319 <%28206%29%20667-1319>
>>>>
>>>> _______________________________________________
>>>> Bioc-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>
>>>
>>> [[alternative HTML version deleted]]
>>>
>>>
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>
>>
>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
More information about the Bioc-devel
mailing list