[Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

Fri Jun 5 03:29:45 CEST 2015

Maybe this could eventually support setting the seqinfo with:

genome(gr) <- "hg19"

Or is that being too clever?

On Thu, Jun 4, 2015 at 4:28 PM, Hervé Pagès <hpages at fredhutch.org> wrote:
> Hi,
>
> FWIW I started to work on supporting quick generation of a standalone
> Seqinfo object via Seqinfo(genome="hg38") in GenomeInfoDb.
>
> It already supports hg38, hg19, hg18, panTro4, panTro3, panTro2,
> bosTau8, bosTau7, bosTau6, canFam3, canFam2, canFam1, musFur1, mm10,
> mm9, mm8, susScr3, susScr2, rn6, rheMac3, rheMac2, galGal4, galGal3,
> gasAcu1, danRer7, apiMel2, dm6, dm3, ce10, ce6, ce4, ce2, sacCer3,
> and sacCer2. I'll add more.
>
> See ?Seqinfo for some examples.
>
> Right now it fetches the information from internet every time you
> call it but maybe we should just store that information in the
> GenomeInfoDb package as Tim suggested?
>
> H.
>
>
> On 06/03/2015 12:54 PM, Tim Triche, Jr. wrote:
>>
>> That would be perfect actually.  And it would radically reduce &
>> modularize maintenance.  Maybe that's the best way to go after all.  Quite
>> sensible.
>>
>> --t
>>
>>> On Jun 3, 2015, at 12:46 PM, Vincent Carey <stvjc at channing.harvard.edu>
>>> wrote:
>>>
>>> It really isn't hard to have multiple OrganismDb packages in place -- the
>>> process of making new ones is documented and was given as an exercise in
>>> the EdX course.  I don't know if we want to institutionalize it and
>>> distribute such -- I think we might, so that there would be Hs19, Hs38,
>>> mm9, etc. packages.  They have very little content, they just coordinate
>>> interactions with packages that you'll already have.
>>>
>>> On Wed, Jun 3, 2015 at 3:26 PM, Tim Triche, Jr. <tim.triche at gmail.com>
>>> wrote:
>>>
>>>> Right, I typically do that too, and if you're working on human data it
>>>> isn't a big deal.  What makes things a lot more of a drag is when you
>>>> work
>>>> on e.g. mouse data (mm9 vs mm10, aka GRCm37 vs GRCm38) where
>>>> Mus.musculus
>>>> is essentially a "build ahead" of Homo.sapiens.
>>>>
>>>> R> seqinfo(Homo.sapiens)
>>>> Seqinfo object with 93 sequences (1 circular) from hg19 genome
>>>>
>>>> R> seqinfo(Mus.musculus)
>>>> Seqinfo object with 66 sequences (1 circular) from mm10 genome:
>>>>
>>>> It's not as explicit as directly assigning the seqinfo from a genome
>>>> that
>>>> corresponds to that of your annotations/results/whatever.  I know we
>>>> could
>>>> all use crossmap or liftOver or whatever, but that's not really the
>>>> same,
>>>> and it takes time, whereas assigning the proper seqinfo for
>>>> relationships
>>>> is very fast.
>>>>
>>>> That's all I was getting at...
>>>>
>>>>
>>>> Statistics is the grammar of science.
>>>> Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
>>>>
>>>> On Wed, Jun 3, 2015 at 12:17 PM, Vincent Carey
>>>> <stvjc at channing.harvard.edu
>>>>>
>>>>> wrote:
>>>>
>>>>
>>>>> I typically get this info from Homo.sapiens.  The result is parasitic
>>>>> on
>>>>> the TxDb that is in there.  I don't know how easy it is to swap
>>>>> alternate
>>>>> TxDb in to get a different build.  I think it would make sense to
>>>>> regard
>>>>> the OrganismDb instances as foundational for this sort of structural
>>>>> data.
>>>>>
>>>>> On Wed, Jun 3, 2015 at 3:12 PM, Kasper Daniel Hansen <
>>>>> kasperdanielhansen at gmail.com> wrote:
>>>>>
>>>>>> Let me rephrase this slightly.  From one POV the purpose of
>>>>>> GenomeInfoDb
>>>>>> is
>>>>>> clean up the seqinfo slot.  Currently it does most of the cleaning,
>>>>>> but
>>>>>> it
>>>>>> does not add seqlengths.
>>>>>>
>>>>>> It is clear that seqlengths depends on the version of the genome, but
>>>>>> I
>>>>>> will argue so does the seqnames.  Of course, for human, chr22 will not
>>>>>> change.  But what about the names of all the random contigs?  Or for
>>>>>> other
>>>>>> organisms, what about going from a draft genome with 10k contigs to a
>>>>>> more
>>>>>> completely genome assembled into fewer, larger chromosomes.
>>>>>>
>>>>>> I acknowledge that this information is present in the BSgenome
>>>>>> packages,
>>>>>> but it seems (to me) to be very appropriate to have them around for
>>>>>> cleaning up the seqinfo slot.  For some situations it is not great to
>>>>>> depend on 1 GB> download for something that is a few bytes.
>>>>>>
>>>>>> Best,
>>>>>> Kasper
>>>>>>
>>>>>> On Wed, Jun 3, 2015 at 3:00 PM, Tim Triche, Jr. <tim.triche at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> It would be nice (for a number of reasons) to have chromosome lengths
>>>>>>> readily available in a foundational package like GenomeInfoDb, so
>>>>>>> that,
>>>>>>> say,
>>>>>>>
>>>>>>> data(seqinfo.hg19)
>>>>>>> seqinfo(myResults) <- seqinfo.hg19[ seqlevels(myResults) ]
>>>>>>>
>>>>>>> would work without issues.  Is there any particular reason this
>>>>>>
>>>>>> couldn't
>>>>>>>
>>>>>>> happen for the supported/available BSgenomes?  It would seem like a
>>>>>>
>>>>>> simple
>>>>>>>
>>>>>>> matter to do
>>>>>>>
>>>>>>> R> library(BSgenome.Hsapiens.UCSC.hg19)
>>>>>>> R> seqinfo.hg19 <- seqinfo(Hsapiens)
>>>>>>> R> save(seqinfo.hg19,
>>>>>>> file="~/bioc-devel/GenomeInfoDb/data/seqinfo.hg19.rda")
>>>>>>>
>>>>>>> and be done with it until (say) the next release or next released
>>>>>>> BSgenome.  I considered looping through the following BSgenomes
>>>>>>
>>>>>> myself...
>>>>>>>
>>>>>>> and if it isn't strongly opposed by (everyone) I may still do exactly
>>>>>>> that.  Seems useful, no?
>>>>>>>
>>>>>>> e.g. for the following 42 builds,
>>>>>>>
>>>>>>> grep("(UCSC|NCBI)", unique(gsub(".masked", "", available.genomes())),
>>>>>>> value=TRUE)
>>>>>>> [1] "BSgenome.Amellifera.UCSC.apiMel2"
>>>>>>
>>>>>> "BSgenome.Btaurus.UCSC.bosTau3"
>>>>>>>
>>>>>>>
>>>>>>> [3] "BSgenome.Btaurus.UCSC.bosTau4"
>>>>>>
>>>>>> "BSgenome.Btaurus.UCSC.bosTau6"
>>>>>>>
>>>>>>>
>>>>>>> [5] "BSgenome.Btaurus.UCSC.bosTau8"
>>>>>>> "BSgenome.Celegans.UCSC.ce10"
>>>>>>>
>>>>>>> [7] "BSgenome.Celegans.UCSC.ce2"         "BSgenome.Celegans.UCSC.ce6"
>>>>>>>
>>>>>>> [9] "BSgenome.Cfamiliaris.UCSC.canFam2"
>>>>>>> "BSgenome.Cfamiliaris.UCSC.canFam3"
>>>>>>> [11] "BSgenome.Dmelanogaster.UCSC.dm2"
>>>>>>> "BSgenome.Dmelanogaster.UCSC.dm3"
>>>>>>> [13] "BSgenome.Dmelanogaster.UCSC.dm6"
>>>>>>
>>>>>> "BSgenome.Drerio.UCSC.danRer5"
>>>>>>>
>>>>>>>
>>>>>>> [15] "BSgenome.Drerio.UCSC.danRer6"
>>>>>>
>>>>>> "BSgenome.Drerio.UCSC.danRer7"
>>>>>>>
>>>>>>>
>>>>>>> [17] "BSgenome.Ecoli.NCBI.20080805"
>>>>>>> "BSgenome.Gaculeatus.UCSC.gasAcu1"
>>>>>>> [19] "BSgenome.Ggallus.UCSC.galGal3"
>>>>>>
>>>>>> "BSgenome.Ggallus.UCSC.galGal4"
>>>>>>>
>>>>>>>
>>>>>>> [21] "BSgenome.Hsapiens.NCBI.GRCh38"
>>>>>>> "BSgenome.Hsapiens.UCSC.hg17"
>>>>>>>
>>>>>>> [23] "BSgenome.Hsapiens.UCSC.hg18"
>>>>>>> "BSgenome.Hsapiens.UCSC.hg19"
>>>>>>>
>>>>>>> [25] "BSgenome.Hsapiens.UCSC.hg38"
>>>>>>> "BSgenome.Mfascicularis.NCBI.5.0"
>>>>>>> [27] "BSgenome.Mfuro.UCSC.musFur1"
>>>>>>
>>>>>> "BSgenome.Mmulatta.UCSC.rheMac2"
>>>>>>>
>>>>>>>
>>>>>>> [29] "BSgenome.Mmulatta.UCSC.rheMac3"
>>>>>>
>>>>>> "BSgenome.Mmusculus.UCSC.mm10"
>>>>>>>
>>>>>>>
>>>>>>> [31] "BSgenome.Mmusculus.UCSC.mm8"
>>>>>>> "BSgenome.Mmusculus.UCSC.mm9"
>>>>>>>
>>>>>>> [33] "BSgenome.Ptroglodytes.UCSC.panTro2"
>>>>>>> "BSgenome.Ptroglodytes.UCSC.panTro3"
>>>>>>> [35] "BSgenome.Rnorvegicus.UCSC.rn4"
>>>>>>
>>>>>> "BSgenome.Rnorvegicus.UCSC.rn5"
>>>>>>>
>>>>>>>
>>>>>>> [37] "BSgenome.Rnorvegicus.UCSC.rn6"
>>>>>>> "BSgenome.Scerevisiae.UCSC.sacCer1"
>>>>>>> [39] "BSgenome.Scerevisiae.UCSC.sacCer2"
>>>>>>> "BSgenome.Scerevisiae.UCSC.sacCer3"
>>>>>>> [41] "BSgenome.Sscrofa.UCSC.susScr3"
>>>>>>
>>>>>> "BSgenome.Tguttata.UCSC.taeGut1"
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Am I insane for suggesting this?  It would make things a little
>>>>>>> easier
>>>>>>
>>>>>> for
>>>>>>>
>>>>>>> rtracklayer, most SummarizedExperiment and SE-derived objects, blah,
>>>>>>
>>>>>> blah,
>>>>>>>
>>>>>>> blah...
>>>>>>>
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> --t
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Statistics is the grammar of science.
>>>>>>> Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
>>>>>>
>>>>>>
>>>>>>         [[alternative HTML version deleted]]
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioc-devel at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>>
>>>     [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fredhutch.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel