[Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?
Tim Triche, Jr.
tim.triche at gmail.com
Wed Jun 3 21:54:01 CEST 2015
That would be perfect actually. And it would radically reduce & modularize maintenance. Maybe that's the best way to go after all. Quite sensible.
--t
> On Jun 3, 2015, at 12:46 PM, Vincent Carey <stvjc at channing.harvard.edu> wrote:
>
> It really isn't hard to have multiple OrganismDb packages in place -- the
> process of making new ones is documented and was given as an exercise in
> the EdX course. I don't know if we want to institutionalize it and
> distribute such -- I think we might, so that there would be Hs19, Hs38,
> mm9, etc. packages. They have very little content, they just coordinate
> interactions with packages that you'll already have.
>
> On Wed, Jun 3, 2015 at 3:26 PM, Tim Triche, Jr. <tim.triche at gmail.com>
> wrote:
>
>> Right, I typically do that too, and if you're working on human data it
>> isn't a big deal. What makes things a lot more of a drag is when you work
>> on e.g. mouse data (mm9 vs mm10, aka NCBI37 vs GRCm38) where Mus.musculus
>> is essentially a "build ahead" of Homo.sapiens.
>>
>> R> seqinfo(Homo.sapiens)
>> Seqinfo object with 93 sequences (1 circular) from hg19 genome:
>>
>> R> seqinfo(Mus.musculus)
>> Seqinfo object with 66 sequences (1 circular) from mm10 genome:
>>
>> It's not as explicit as directly assigning the seqinfo from a genome that
>> corresponds to that of your annotations/results/whatever. I know we could
>> all use crossmap or liftOver or whatever, but that's not really the same,
>> and it takes time, whereas assigning the proper seqinfo for relationships
>> is very fast.
>>
>> That's all I was getting at...
>>
>>
>> Statistics is the grammar of science.
>> Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
>>
>> On Wed, Jun 3, 2015 at 12:17 PM, Vincent Carey <stvjc at channing.harvard.edu>
>> wrote:
>>
>>> I typically get this info from Homo.sapiens. The result is parasitic on
>>> the TxDb that is in there. I don't know how easy it is to swap in an
>>> alternate TxDb to get a different build. I think it would make sense to
>>> regard the OrganismDb instances as foundational for this sort of
>>> structural data.
>>>
>>> On Wed, Jun 3, 2015 at 3:12 PM, Kasper Daniel Hansen <
>>> kasperdanielhansen at gmail.com> wrote:
>>>
>>>> Let me rephrase this slightly. From one POV the purpose of GenomeInfoDb
>>>> is to clean up the seqinfo slot. Currently it does most of the cleaning,
>>>> but it does not add seqlengths.
>>>>
>>>> It is clear that seqlengths depend on the version of the genome, but I
>>>> will argue that so do the seqnames. Of course, for human, chr22 will not
>>>> change. But what about the names of all the random contigs? Or, for other
>>>> organisms, what about going from a draft genome with 10k contigs to a
>>>> more complete genome assembled into fewer, larger chromosomes?
>>>>
>>>> I acknowledge that this information is present in the BSgenome packages,
>>>> but it seems (to me) to be very appropriate to have them around for
>>>> cleaning up the seqinfo slot. For some situations it is not great to
>>>> depend on a 1 GB download for something that is a few bytes.
>>>>
>>>> Best,
>>>> Kasper
>>>>
>>>> On Wed, Jun 3, 2015 at 3:00 PM, Tim Triche, Jr. <tim.triche at gmail.com>
>>>> wrote:
>>>>
>>>>> It would be nice (for a number of reasons) to have chromosome lengths
>>>>> readily available in a foundational package like GenomeInfoDb, so that,
>>>>> say,
>>>>>
>>>>> data(seqinfo.hg19)
>>>>> seqinfo(myResults) <- seqinfo.hg19[ seqlevels(myResults) ]
>>>>>
>>>>> would work without issues. Is there any particular reason this couldn't
>>>>> happen for the supported/available BSgenomes? It would seem like a simple
>>>>> matter to do
>>>>>
>>>>> R> library(BSgenome.Hsapiens.UCSC.hg19)
>>>>> R> seqinfo.hg19 <- seqinfo(Hsapiens)
>>>>> R> save(seqinfo.hg19,
>>>>> file="~/bioc-devel/GenomeInfoDb/data/seqinfo.hg19.rda")
>>>>>
>>>>> and be done with it until (say) the next release or next released
>>>>> BSgenome. I considered looping through the following BSgenomes myself...
>>>>> and if it isn't strongly opposed by (everyone) I may still do exactly
>>>>> that. Seems useful, no?
>>>>>
>>>>> e.g. for the following 42 builds,
>>>>>
>>>>> grep("(UCSC|NCBI)", unique(gsub(".masked", "", available.genomes())),
>>>>>      value=TRUE)
>>>>>  [1] "BSgenome.Amellifera.UCSC.apiMel2"   "BSgenome.Btaurus.UCSC.bosTau3"
>>>>>  [3] "BSgenome.Btaurus.UCSC.bosTau4"      "BSgenome.Btaurus.UCSC.bosTau6"
>>>>>  [5] "BSgenome.Btaurus.UCSC.bosTau8"      "BSgenome.Celegans.UCSC.ce10"
>>>>>  [7] "BSgenome.Celegans.UCSC.ce2"         "BSgenome.Celegans.UCSC.ce6"
>>>>>  [9] "BSgenome.Cfamiliaris.UCSC.canFam2"  "BSgenome.Cfamiliaris.UCSC.canFam3"
>>>>> [11] "BSgenome.Dmelanogaster.UCSC.dm2"    "BSgenome.Dmelanogaster.UCSC.dm3"
>>>>> [13] "BSgenome.Dmelanogaster.UCSC.dm6"    "BSgenome.Drerio.UCSC.danRer5"
>>>>> [15] "BSgenome.Drerio.UCSC.danRer6"       "BSgenome.Drerio.UCSC.danRer7"
>>>>> [17] "BSgenome.Ecoli.NCBI.20080805"       "BSgenome.Gaculeatus.UCSC.gasAcu1"
>>>>> [19] "BSgenome.Ggallus.UCSC.galGal3"      "BSgenome.Ggallus.UCSC.galGal4"
>>>>> [21] "BSgenome.Hsapiens.NCBI.GRCh38"      "BSgenome.Hsapiens.UCSC.hg17"
>>>>> [23] "BSgenome.Hsapiens.UCSC.hg18"        "BSgenome.Hsapiens.UCSC.hg19"
>>>>> [25] "BSgenome.Hsapiens.UCSC.hg38"        "BSgenome.Mfascicularis.NCBI.5.0"
>>>>> [27] "BSgenome.Mfuro.UCSC.musFur1"        "BSgenome.Mmulatta.UCSC.rheMac2"
>>>>> [29] "BSgenome.Mmulatta.UCSC.rheMac3"     "BSgenome.Mmusculus.UCSC.mm10"
>>>>> [31] "BSgenome.Mmusculus.UCSC.mm8"        "BSgenome.Mmusculus.UCSC.mm9"
>>>>> [33] "BSgenome.Ptroglodytes.UCSC.panTro2" "BSgenome.Ptroglodytes.UCSC.panTro3"
>>>>> [35] "BSgenome.Rnorvegicus.UCSC.rn4"      "BSgenome.Rnorvegicus.UCSC.rn5"
>>>>> [37] "BSgenome.Rnorvegicus.UCSC.rn6"      "BSgenome.Scerevisiae.UCSC.sacCer1"
>>>>> [39] "BSgenome.Scerevisiae.UCSC.sacCer2"  "BSgenome.Scerevisiae.UCSC.sacCer3"
>>>>> [41] "BSgenome.Sscrofa.UCSC.susScr3"      "BSgenome.Tguttata.UCSC.taeGut1"
>>>>>
>>>>> Am I insane for suggesting this? It would make things a little easier for
>>>>> rtracklayer, most SummarizedExperiment and SE-derived objects, blah, blah,
>>>>> blah...
>>>>>
>>>>>
>>>>> Best,
>>>>>
>>>>> --t
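
For readers following along, the assignment idiom proposed in this message can be sketched roughly as below. This is only an illustration under stated assumptions: no `seqinfo.hg19` dataset ships with GenomeInfoDb as far as this thread establishes, so the object is built by hand here for three hg19 sequences, and `myResults` is a made-up GRanges standing in for real results.

```r
# Sketch only: a hand-built stand-in for the proposed `seqinfo.hg19` dataset.
library(GenomeInfoDb)   # Seqinfo(), seqinfo<-, seqlevels(), seqlengths(), genome()
library(GenomicRanges)  # GRanges()
library(IRanges)        # IRanges()

# Three UCSC hg19 sequences with their lengths; chrM is circular.
seqinfo.hg19 <- Seqinfo(
  seqnames   = c("chr1", "chr22", "chrM"),
  seqlengths = c(249250621, 51304566, 16571),
  isCircular = c(FALSE, FALSE, TRUE),
  genome     = "hg19"
)

# Hypothetical results that carry seqnames but no lengths or genome tag.
myResults <- GRanges(
  seqnames = c("chr1", "chr22"),
  ranges   = IRanges(start = c(100, 200), width = 50)
)

# Subset the stored Seqinfo to the seqlevels actually in use, then assign it.
seqinfo(myResults) <- seqinfo.hg19[seqlevels(myResults)]
```

After the assignment, `seqlengths(myResults)` and `genome(myResults)` reflect the hg19 build, with no BSgenome package involved.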
>>>>
>>>> [[alternative HTML version deleted]]
>>>>
>>>> _______________________________________________
>>>> Bioc-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
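
The "loop through the BSgenomes" idea from the original message could look something like the following. It is a minimal sketch, not an existing GenomeInfoDb or BSgenome function: `harvest_seqinfo` is a hypothetical helper name, and it only reads packages that are already installed rather than downloading anything.

```r
# Hypothetical helper (not part of GenomeInfoDb or BSgenome): collect the
# Seqinfo of each BSgenome data package that is already installed.
harvest_seqinfo <- function(pkgs) {
  out <- list()
  for (pkg in pkgs) {
    # Skip packages that are not installed instead of fetching ~1 GB each.
    if (!requireNamespace(pkg, quietly = TRUE)) next
    # getBSgenome() accepts the full name of a BSgenome data package.
    genome <- BSgenome::getBSgenome(pkg)
    out[[pkg]] <- GenomeInfoDb::seqinfo(genome)
  }
  out
}

# Usage, assuming BSgenome.Hsapiens.UCSC.hg19 is installed:
# si <- harvest_seqinfo("BSgenome.Hsapiens.UCSC.hg19")
# seqinfo.hg19 <- si[["BSgenome.Hsapiens.UCSC.hg19"]]
# save(seqinfo.hg19, file = "seqinfo.hg19.rda")
```

Returning a named list keeps the per-build objects separate, so each one can be saved to its own small `.rda` file as suggested in the thread.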