Would temporarily withdrawing or rolling back the Windows version of the package be enough to preserve the existing progress for *nix users?

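To help decide which platforms are actually affected, the check from the original report can be automated. This is only a minimal sketch: `same_every_time` is a hypothetical helper (not part of BSgenome), and the `getSeq()` call is the one quoted in the report below.

```r
# Hypothetical helper (not a BSgenome function): call f() n times and
# report whether every call returned the same string.
same_every_time <- function(f, n = 10L) {
  results <- vapply(seq_len(n), function(i) as.character(f()), character(1))
  length(unique(results)) == 1L
}

# Guarded so the sketch also runs where the genome package is not installed.
if (requireNamespace("BSgenome.Hsapiens.NCBI.GRCh38", quietly = TRUE)) {
  library(BSgenome.Hsapiens.NCBI.GRCh38)
  genome <- BSgenome.Hsapiens.NCBI.GRCh38
  # Same region as in the report: chromosome 6, 50 bases from position 30000.
  print(same_every_time(function() getSeq(genome, "6", start = 30000, width = 50)))
  print(sessionInfo())
}
```

A FALSE result, together with the sessionInfo() output, would show whether the inconsistency is tied to a particular platform or Rsamtools version.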

> On Apr 20, 2014, at 10:49 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
>> On 04/20/2014 10:41 AM, Martin Morgan wrote:
>> What's the sessionInfo() after getSeq? Herve patched Rsamtools during the last
>> release cycle to address issues on Windows, perhaps the original report is using
>> a non-patched version of Rsamtools?
> I'm always forgetting to follow my own advice; I see consistent behavior with the following (as well as on my regular Linux machine across multiple versions of R):
> > getSeq(genome,'6',start=30000,width=50)
>  50-letter "DNAString" instance
> > sessionInfo()
> R version 3.1.0 (2014-04-10)
> Platform: i386-w64-mingw32/i386 (32-bit)
> locale:
> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
> [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods   base
> other attached packages:
> [1] BSgenome.Hsapiens.NCBI.GRCh38_1.3.999 BSgenome_1.32.0
> [3] Biostrings_2.32.0                     XVector_0.4.0
> [5] GenomicRanges_1.16.1                  GenomeInfoDb_1.0.2
> [7] IRanges_1.22.3                        BiocGenerics_0.10.0
> [9] BiocInstaller_1.14.1
> loaded via a namespace (and not attached):
> [1] bitops_1.0-6     Rsamtools_1.16.0 stats4_3.1.0     tools_3.1.0 zlibbioc_1.10.0
> >
>> Martin
>>> On 04/20/2014 07:39 AM, Michael Lawrence wrote:
>>> Looks like a fairly serious bug with BSgenome:
>>> ---------- Forwarded message ----------
>>> From: liyangbjmu <liyangbjmu at bjmu.edu.cn>
>>> Date: Sun, Apr 20, 2014 at 6:31 AM
>>> Subject: Re: Re: [Bioc-devel] [devteam-bioc] suggestion about
>>> bioconductorpackage-- BSgenome.Hsapiens.NCBI.GRCh38
>>> To: Michael Lawrence <lawrence.michael at gene.com>
>>>  Hi Michael,
>>> There is another problem with BSgenome.Hsapiens.NCBI.GRCh38. The getSeq
>>> command does not always return the same sequence.
>>> > library(BSgenome.Hsapiens.NCBI.GRCh38)
>>> > genome<-BSgenome.Hsapiens.NCBI.GRCh38
>>> > getSeq(genome,'6',start=30000,width=50)
>>>   50-letter "DNAString" instance
>>> > getSeq(genome,'6',start=30000,width=50)
>>>   50-letter "DNAString" instance
>>> > getSeq(genome,'6',start=30000,width=50)
>>>   50-letter "DNAString" instance
>>> > getSeq(genome,'6',start=30000,width=50)
>>>   50-letter "DNAString" instance
>>> > getSeq(genome,'6',start=30000,width=50)
>>>   50-letter "DNAString" instance
>>> > getSeq(genome,'6',start=30000,width=50)
>>>   50-letter "DNAString" instance
>>> > sessionInfo()
>>> R version 3.1.0 (2014-04-10)
>>> Platform: i386-w64-mingw32/i386 (32-bit)
>>>  locale:
>>> [1] LC_COLLATE=Chinese_People's Republic of China.936
>>> [2] LC_CTYPE=Chinese_People's Republic of China.936
>>> [3] LC_MONETARY=Chinese_People's Republic of China.936
>>> [4] LC_NUMERIC=C
>>> [5] LC_TIME=Chinese_People's Republic of China.936
>>>  attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>> Best,
>>> Sean
>>>   Another interesting aspect of 2bit is that it supports simple masking.
>>> I'm all for the 2bit direction. But then BSgenome would depend on
>>> rtracklayer? Or have you reimplemented it?
>>>> On Wed, Apr 16, 2014 at 12:59 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
>>>> Hi Sean
>>>> [hope you don't mind if I cc Bioc-devel]
>>>>> On 04/15/2014 11:47 PM, Maintainer wrote:
>>>>> Hi The Bioconductor Dev Team,
>>>>> A new package called BSgenome.Hsapiens.NCBI.GRCh38 has been available
>>>>> for one week, but the single-sequence.fa.gz file in this package is too
>>>>> big. The sequences of all the chromosomes are put together, which makes
>>>>> it difficult to load. Why don't you rebuild this package the way
>>>>> BSgenome.Hsapiens.UCSC.hg19 is built? In
>>>>> BSgenome.Hsapiens.UCSC.hg19, the sequence files for each chromosome are
>>>>> separate.
>>>> Yeah I know... sigh!!  ;-)
>>>> All the BSgenome data packages use this new on-disk storage format, not
>>>> just GRCh38. All the chromosomes are now put together in a single
>>>> FASTA file compressed in the RAZip format (extension is rz, not gz).
>>>> The file is indexed, which makes direct random access faster. So for
>>>> example, extracting only some small portions of the genome with
>>>> getSeq() is much faster than with the previous on-disk storage format
>>>> where the chromosomes were serialized into individual files.
>>>> Some people have been interested for a while in having getSeq() do
>>>> fast random access to the genome sequences. See for example
>>>> this post from Michael in 2011:
>>>>   https://stat.ethz.ch/pipermail/bioc-devel/2011-May/002601.html
>>>> I'm aware that the new on-disk storage format slows down operations
>>>> where you actually need to compute on the entire genome, which is
>>>> typically done in a loop where one chromosome is loaded at a time.
>>>> It's hard to beat the speed (and simplicity) of just loading the
>>>> individual serialized DNAString objects in that case. This is
>>>> unfortunate and I was a little bit worried about this when I pushed
>>>> the new BSgenome packages to the public repos because I think this
>>>> is a common use case.
>>>> Note that the switch to this new format happened in devel in January
>>>> and was announced on the Bioc-devel mailing list:
>>>>   https://stat.ethz.ch/pipermail/bioc-devel/2014-January/005150.html
>>>> I was hoping people would test this and maybe start a discussion about
>>>> whether this change was worth it or not.
>>>> The good news is that I think we can have the best of both worlds i.e.
>>>> fast direct random access and fast loading of the full sequences.
>>>> I did some testing with the 2bit format and it looks very promising:
>>>>   - Random access is much faster than with the current RAZip'ed FASTA
>>>>     format,
>>>>   - Loading a full chromosome into memory is also very fast because
>>>>     there isn't the overhead of decompressing the data. It could
>>>>     even be that it's faster than loading the chromosome stored in
>>>>     a serialized DNAString object, or maybe not, but at least it
>>>>     should be very close.
>>>>   - The sequences take less space on disk and the resulting BSgenome
>>>>     package will also be slightly smaller.
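The points above can be illustrated with a toy sketch of 2-bit packing, showing why any subsequence can be fetched by index arithmetic alone, with no decompression step. This is not the real .2bit file layout (which also has headers, N blocks, and mask blocks) and not rtracklayer's implementation; `pack2bit` and `fetch2bit` are hypothetical names, and only the code assignment T=0, C=1, A=2, G=3 follows the UCSC convention.

```r
# Toy 2-bit packing: four bases per byte, first base in the high bits.
codes <- c(T = 0L, C = 1L, A = 2L, G = 3L)

pack2bit <- function(dna) {
  b <- unname(codes[strsplit(dna, "")[[1]]])
  stopifnot(!anyNA(b))                        # only A/C/G/T in this sketch
  pad <- (4L - length(b) %% 4L) %% 4L
  m <- matrix(c(b, rep(0L, pad)), nrow = 4)   # one column per packed byte
  as.integer(m[1, ] * 64L + m[2, ] * 16L + m[3, ] * 4L + m[4, ])
}

fetch2bit <- function(packed, start, width) {
  pos <- as.integer(start) - 1L + seq_len(width) - 1L  # 0-based base offsets
  byte <- packed[pos %/% 4L + 1L]                      # byte holding each base
  shift <- 2L * (3L - pos %% 4L)                       # bit offset inside the byte
  code <- bitwAnd(bitwShiftR(byte, shift), 3L)
  paste(names(codes)[code + 1L], collapse = "")
}

s <- "ACGTACGTAC"
packed <- pack2bit(s)            # 10 bases stored in 3 bytes
fetch2bit(packed, start = 3, width = 5)
```

The fetch touches only the bytes that hold the requested bases, which is the property that makes both random access and whole-chromosome loading cheap.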
>>>> It's something that I was planning to work on early in this new devel
>>>> cycle. Seems like I should start even earlier and maybe backport to
>>>> BioC release?
>>>> Thanks,
>>>> H.
>>>> Best,
>>>>> Sean
