[Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

Tim Triche, Jr. tim.triche at gmail.com
Wed Jun 3 21:54:01 CEST 2015


That would be perfect actually.  And it would radically reduce & modularize maintenance.  Maybe that's the best way to go after all.  Quite sensible. 

--t
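
A minimal sketch of the seqinfo assignment proposed in the original post (quoted below), for readers who want to try the idea without a BSgenome package installed. Everything here is an assumption made for illustration: seqinfo.hg19 is built by hand rather than loaded with data(), since the thread only proposes shipping such an object; myResults is a made-up GRanges; and only three hg19 sequences are included.

```r
library(GenomicRanges)  # provides GRanges(), Seqinfo() and the seqinfo<- setter

# Hand-built stand-in for the proposed seqinfo.hg19 object (hypothetical:
# the thread proposes shipping this, it is not loaded from any package here).
seqinfo.hg19 <- Seqinfo(
  seqnames   = c("chr21", "chr22", "chrM"),
  seqlengths = c(48129895L, 51304566L, 16571L),
  isCircular = c(FALSE, FALSE, TRUE),
  genome     = "hg19"
)

# Made-up results that only touch chr22 and carry no lengths or genome yet.
myResults <- GRanges("chr22", IRanges(start = c(100, 2000), width = 50))

# Subset the reference seqinfo to the seqlevels actually in use, then assign.
seqinfo(myResults) <- seqinfo.hg19[seqlevels(myResults)]

seqlengths(myResults)  # chr22 now has a length
genome(myResults)      # and a genome build label
```

Subsetting by seqlevels(myResults) before assigning keeps the assignment from adding sequences the object never used, which appears to be the intent of the one-liner in the original post.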

> On Jun 3, 2015, at 12:46 PM, Vincent Carey <stvjc at channing.harvard.edu> wrote:
> 
> It really isn't hard to have multiple OrganismDb packages in place -- the
> process of making new ones is documented and was given as an exercise in
> the EdX course.  I don't know if we want to institutionalize it and
> distribute such -- I think we might, so that there would be Hs19, Hs38,
> mm9, etc. packages.  They have very little content, they just coordinate
> interactions with packages that you'll already have.
> 
> On Wed, Jun 3, 2015 at 3:26 PM, Tim Triche, Jr. <tim.triche at gmail.com>
> wrote:
> 
>> Right, I typically do that too, and if you're working on human data it
>> isn't a big deal.  What makes things a lot more of a drag is when you work
>> on e.g. mouse data (mm9 vs mm10, aka GRCm37 vs GRCm38) where Mus.musculus
>> is essentially a "build ahead" of Homo.sapiens.
>> 
>> R> seqinfo(Homo.sapiens)
>> Seqinfo object with 93 sequences (1 circular) from hg19 genome
>> 
>> R> seqinfo(Mus.musculus)
>> Seqinfo object with 66 sequences (1 circular) from mm10 genome:
>> 
>> It's not as explicit as directly assigning the seqinfo from a genome that
>> corresponds to that of your annotations/results/whatever.  I know we could
>> all use crossmap or liftOver or whatever, but that's not really the same,
>> and it takes time, whereas assigning the proper seqinfo for relationships
>> is very fast.
>> 
>> That's all I was getting at...
>> 
>> 
>> Statistics is the grammar of science.
>> Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
>> 
>> On Wed, Jun 3, 2015 at 12:17 PM, Vincent Carey <stvjc at channing.harvard.edu> wrote:
>> 
>>> I typically get this info from Homo.sapiens.  The result is parasitic on
>>> the TxDb that is in there.  I don't know how easy it is to swap an alternate
>>> TxDb in to get a different build.  I think it would make sense to regard
>>> the OrganismDb instances as foundational for this sort of structural data.
>>> 
>>> On Wed, Jun 3, 2015 at 3:12 PM, Kasper Daniel Hansen <kasperdanielhansen at gmail.com> wrote:
>>> 
>>>> Let me rephrase this slightly.  From one POV the purpose of GenomeInfoDb
>>>> is to clean up the seqinfo slot.  Currently it does most of the cleaning,
>>>> but it does not add seqlengths.
>>>> 
>>>> It is clear that seqlengths depend on the version of the genome, but I
>>>> will argue so do the seqnames.  Of course, for human, chr22 will not
>>>> change.  But what about the names of all the random contigs?  Or, for
>>>> other organisms, what about going from a draft genome with 10k contigs to
>>>> a more complete genome assembled into fewer, larger chromosomes?
>>>> 
>>>> I acknowledge that this information is present in the BSgenome packages,
>>>> but it seems (to me) to be very appropriate to have them around for
>>>> cleaning up the seqinfo slot.  For some situations it is not great to
>>>> depend on a download of more than 1 GB for something that is a few bytes.
>>>> 
>>>> Best,
>>>> Kasper
>>>> 
>>>> On Wed, Jun 3, 2015 at 3:00 PM, Tim Triche, Jr. <tim.triche at gmail.com>
>>>> wrote:
>>>> 
>>>>> It would be nice (for a number of reasons) to have chromosome lengths
>>>>> readily available in a foundational package like GenomeInfoDb, so that, say,
>>>>> 
>>>>> data(seqinfo.hg19)
>>>>> seqinfo(myResults) <- seqinfo.hg19[ seqlevels(myResults) ]
>>>>> 
>>>>> would work without issues.  Is there any particular reason this couldn't
>>>>> happen for the supported/available BSgenomes?  It would seem like a simple
>>>>> matter to do
>>>>> 
>>>>> R> library(BSgenome.Hsapiens.UCSC.hg19)
>>>>> R> seqinfo.hg19 <- seqinfo(Hsapiens)
>>>>> R> save(seqinfo.hg19, file="~/bioc-devel/GenomeInfoDb/data/seqinfo.hg19.rda")
>>>>> 
>>>>> and be done with it until (say) the next release or next released
>>>>> BSgenome.  I considered looping through the following BSgenomes myself...
>>>>> and if it isn't strongly opposed by (everyone) I may still do exactly
>>>>> that.  Seems useful, no?
>>>>> 
>>>>> e.g. for the following 42 builds,
>>>>> 
>>>>> grep("(UCSC|NCBI)", unique(gsub("\\.masked$", "", available.genomes())), value=TRUE)
>>>>>  [1] "BSgenome.Amellifera.UCSC.apiMel2"   "BSgenome.Btaurus.UCSC.bosTau3"
>>>>>  [3] "BSgenome.Btaurus.UCSC.bosTau4"      "BSgenome.Btaurus.UCSC.bosTau6"
>>>>>  [5] "BSgenome.Btaurus.UCSC.bosTau8"      "BSgenome.Celegans.UCSC.ce10"
>>>>>  [7] "BSgenome.Celegans.UCSC.ce2"         "BSgenome.Celegans.UCSC.ce6"
>>>>>  [9] "BSgenome.Cfamiliaris.UCSC.canFam2"  "BSgenome.Cfamiliaris.UCSC.canFam3"
>>>>> [11] "BSgenome.Dmelanogaster.UCSC.dm2"    "BSgenome.Dmelanogaster.UCSC.dm3"
>>>>> [13] "BSgenome.Dmelanogaster.UCSC.dm6"    "BSgenome.Drerio.UCSC.danRer5"
>>>>> [15] "BSgenome.Drerio.UCSC.danRer6"       "BSgenome.Drerio.UCSC.danRer7"
>>>>> [17] "BSgenome.Ecoli.NCBI.20080805"       "BSgenome.Gaculeatus.UCSC.gasAcu1"
>>>>> [19] "BSgenome.Ggallus.UCSC.galGal3"      "BSgenome.Ggallus.UCSC.galGal4"
>>>>> [21] "BSgenome.Hsapiens.NCBI.GRCh38"      "BSgenome.Hsapiens.UCSC.hg17"
>>>>> [23] "BSgenome.Hsapiens.UCSC.hg18"        "BSgenome.Hsapiens.UCSC.hg19"
>>>>> [25] "BSgenome.Hsapiens.UCSC.hg38"        "BSgenome.Mfascicularis.NCBI.5.0"
>>>>> [27] "BSgenome.Mfuro.UCSC.musFur1"        "BSgenome.Mmulatta.UCSC.rheMac2"
>>>>> [29] "BSgenome.Mmulatta.UCSC.rheMac3"     "BSgenome.Mmusculus.UCSC.mm10"
>>>>> [31] "BSgenome.Mmusculus.UCSC.mm8"        "BSgenome.Mmusculus.UCSC.mm9"
>>>>> [33] "BSgenome.Ptroglodytes.UCSC.panTro2" "BSgenome.Ptroglodytes.UCSC.panTro3"
>>>>> [35] "BSgenome.Rnorvegicus.UCSC.rn4"      "BSgenome.Rnorvegicus.UCSC.rn5"
>>>>> [37] "BSgenome.Rnorvegicus.UCSC.rn6"      "BSgenome.Scerevisiae.UCSC.sacCer1"
>>>>> [39] "BSgenome.Scerevisiae.UCSC.sacCer2"  "BSgenome.Scerevisiae.UCSC.sacCer3"
>>>>> [41] "BSgenome.Sscrofa.UCSC.susScr3"      "BSgenome.Tguttata.UCSC.taeGut1"
>>>>> 
>>>>> 
>>>>> 
>>>>> Am I insane for suggesting this?  It would make things a little easier for
>>>>> rtracklayer, most SummarizedExperiment and SE-derived objects, blah, blah,
>>>>> blah...
>>>>> 
>>>>> 
>>>>> Best,
>>>>> 
>>>>> --t
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> Statistics is the grammar of science.
>>>>> Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
>>>> 
>>>> _______________________________________________
>>>> Bioc-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel


