[Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

Tim Triche, Jr. tim.triche at gmail.com
Fri Jun 5 03:30:19 CEST 2015


that's kind of always been my goal...
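
Roughly what I have in mind, as a sketch only. The three-sequence
seqinfo.hg19 below is a hand-built stand-in (it is not something
GenomeInfoDb ships); the real object would come from a bundled table or
from the Seqinfo(genome="hg19") lookup Hervé describes below. The toy
GRanges is made up for illustration.

```r
library(GenomeInfoDb)
library(GenomicRanges)

# Toy results on two hg19 chromosomes; seqlengths start out as NA.
gr <- GRanges(c("chr1", "chr22"),
              IRanges(start = c(1000, 2000), width = 100))

# Hand-built stand-in for a bundled hg19 table (assumption: only three
# sequences are listed here to keep the sketch short).
seqinfo.hg19 <- Seqinfo(seqnames   = c("chr1", "chr22", "chrM"),
                        seqlengths = c(249250621L, 51304566L, 16571L),
                        isCircular = c(FALSE, FALSE, TRUE),
                        genome     = "hg19")

# Keep only the sequences present in the results, then assign.
seqinfo(gr) <- seqinfo.hg19[seqlevels(gr)]

seqlengths(gr)  # lengths for chr1 and chr22
genome(gr)      # "hg19" for both
```

A genome(gr) <- "hg19" shortcut would presumably do the subset-and-assign
step behind the scenes.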


Statistics is the grammar of science.
Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>

On Thu, Jun 4, 2015 at 6:29 PM, Michael Lawrence <lawrence.michael at gene.com>
wrote:

> Maybe this could eventually support setting the seqinfo with:
>
> genome(gr) <- "hg19"
>
> Or is that being too clever?
>
> On Thu, Jun 4, 2015 at 4:28 PM, Hervé Pagès <hpages at fredhutch.org> wrote:
> > Hi,
> >
> > FWIW I started to work on supporting quick generation of a standalone
> > Seqinfo object via Seqinfo(genome="hg38") in GenomeInfoDb.
> >
> > It already supports hg38, hg19, hg18, panTro4, panTro3, panTro2,
> > bosTau8, bosTau7, bosTau6, canFam3, canFam2, canFam1, musFur1, mm10,
> > mm9, mm8, susScr3, susScr2, rn6, rheMac3, rheMac2, galGal4, galGal3,
> > gasAcu1, danRer7, apiMel2, dm6, dm3, ce10, ce6, ce4, ce2, sacCer3,
> > and sacCer2. I'll add more.
> >
> > See ?Seqinfo for some examples.
> >
> > Right now it fetches the information from the internet every time you
> > call it but maybe we should just store that information in the
> > GenomeInfoDb package as Tim suggested?
> >
> > H.
> >
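
A possible middle ground between fetching every time and shipping the
tables is to cache the first lookup per session. The sketch below is
plain R and purely illustrative: cached_lookup(), .seqinfo_cache and
fake_fetch() are hypothetical names, and fake_fetch() stands in for the
real (slow) retrieval so that the example runs offline.

```r
# Hypothetical session-level cache, keyed by genome name.
.seqinfo_cache <- new.env(parent = emptyenv())

cached_lookup <- function(genome, fetch) {
  # 'fetch' is any function(genome) doing the slow retrieval, e.g.
  # function(g) GenomeInfoDb::Seqinfo(genome = g)
  if (!exists(genome, envir = .seqinfo_cache, inherits = FALSE)) {
    assign(genome, fetch(genome), envir = .seqinfo_cache)
  }
  get(genome, envir = .seqinfo_cache, inherits = FALSE)
}

# Offline stand-in for the fetcher; it counts how often it is called
# and returns two hg19 lengths regardless of the genome requested.
n_fetches <- 0L
fake_fetch <- function(genome) {
  n_fetches <<- n_fetches + 1L
  c(chr21 = 48129895L, chr22 = 51304566L)
}

first  <- cached_lookup("hg19", fake_fetch)
second <- cached_lookup("hg19", fake_fetch)  # served from the cache
n_fetches                                    # fetcher ran only once
```

Storing the tables in the package would make even that first fetch
unnecessary, at the cost of refreshing them with each release.
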
> >
> > On 06/03/2015 12:54 PM, Tim Triche, Jr. wrote:
> >>
> >> That would be perfect actually.  And it would radically reduce &
> >> modularize maintenance.  Maybe that's the best way to go after all.
> >> Quite sensible.
> >>
> >> --t
> >>
> >>> On Jun 3, 2015, at 12:46 PM, Vincent Carey
> >>> <stvjc at channing.harvard.edu> wrote:
> >>>
> >>> It really isn't hard to have multiple OrganismDb packages in place --
> >>> the process of making new ones is documented and was given as an
> >>> exercise in the EdX course.  I don't know if we want to
> >>> institutionalize it and distribute such -- I think we might, so that
> >>> there would be Hs19, Hs38, mm9, etc. packages.  They have very little
> >>> content, they just coordinate interactions with packages that you'll
> >>> already have.
> >>>
> >>> On Wed, Jun 3, 2015 at 3:26 PM, Tim Triche, Jr. <tim.triche at gmail.com>
> >>> wrote:
> >>>
> >>>> Right, I typically do that too, and if you're working on human data it
> >>>> isn't a big deal.  What makes things a lot more of a drag is when you
> >>>> work on e.g. mouse data (mm9 vs mm10, aka NCBI37 vs GRCm38) where
> >>>> Mus.musculus is essentially a "build ahead" of Homo.sapiens.
> >>>>
> >>>> R> seqinfo(Homo.sapiens)
> >>>> Seqinfo object with 93 sequences (1 circular) from hg19 genome
> >>>>
> >>>> R> seqinfo(Mus.musculus)
> >>>> Seqinfo object with 66 sequences (1 circular) from mm10 genome:
> >>>>
> >>>> It's not as explicit as directly assigning the seqinfo from a genome
> >>>> that corresponds to that of your annotations/results/whatever.  I know
> >>>> we could all use crossmap or liftOver or whatever, but that's not
> >>>> really the same, and it takes time, whereas assigning the proper
> >>>> seqinfo for relationships is very fast.
> >>>>
> >>>> That's all I was getting at...
> >>>>
> >>>>
> >>>> On Wed, Jun 3, 2015 at 12:17 PM, Vincent Carey
> >>>> <stvjc at channing.harvard.edu> wrote:
> >>>>
> >>>>
> >>>>> I typically get this info from Homo.sapiens.  The result is parasitic
> >>>>> on the TxDb that is in there.  I don't know how easy it is to swap in
> >>>>> an alternate TxDb to get a different build.  I think it would make
> >>>>> sense to regard the OrganismDb instances as foundational for this sort
> >>>>> of structural data.
> >>>>>
> >>>>> On Wed, Jun 3, 2015 at 3:12 PM, Kasper Daniel Hansen <
> >>>>> kasperdanielhansen at gmail.com> wrote:
> >>>>>
> >>>>>> Let me rephrase this slightly.  From one POV the purpose of
> >>>>>> GenomeInfoDb is to clean up the seqinfo slot.  Currently it does most
> >>>>>> of the cleaning, but it does not add seqlengths.
> >>>>>>
> >>>>>> It is clear that seqlengths depend on the version of the genome, but
> >>>>>> I will argue that so do the seqnames.  Of course, for human, chr22
> >>>>>> will not change.  But what about the names of all the random contigs?
> >>>>>> Or, for other organisms, what about going from a draft genome with 10k
> >>>>>> contigs to a more complete genome assembled into fewer, larger
> >>>>>> chromosomes?
> >>>>>>
> >>>>>> I acknowledge that this information is present in the BSgenome
> >>>>>> packages, but it seems (to me) very appropriate to have it available
> >>>>>> for cleaning up the seqinfo slot.  For some situations it is not great
> >>>>>> to depend on a download of over 1 GB for something that is a few bytes.
> >>>>>>
> >>>>>> Best,
> >>>>>> Kasper
> >>>>>>
> >>>>>> On Wed, Jun 3, 2015 at 3:00 PM, Tim Triche, Jr.
> >>>>>> <tim.triche at gmail.com> wrote:
> >>>>>>
> >>>>>>> It would be nice (for a number of reasons) to have chromosome lengths
> >>>>>>> readily available in a foundational package like GenomeInfoDb, so
> >>>>>>> that, say,
> >>>>>>> data(seqinfo.hg19)
> >>>>>>> seqinfo(myResults) <- seqinfo.hg19[ seqlevels(myResults) ]
> >>>>>>>
> >>>>>>> would work without issues.  Is there any particular reason this
> >>>>>>> couldn't happen for the supported/available BSgenomes?  It would seem
> >>>>>>> like a simple matter to do
> >>>>>>> R> library(BSgenome.Hsapiens.UCSC.hg19)
> >>>>>>> R> seqinfo.hg19 <- seqinfo(Hsapiens)
> >>>>>>> R> save(seqinfo.hg19,
> >>>>>>> file="~/bioc-devel/GenomeInfoDb/data/seqinfo.hg19.rda")
> >>>>>>>
> >>>>>>> and be done with it until (say) the next release or next released
> >>>>>>> BSgenome.  I considered looping through the following BSgenomes
> >>>>>>> myself... and if it isn't strongly opposed by (everyone) I may still
> >>>>>>> do exactly that.  Seems useful, no?
> >>>>>>>
> >>>>>>> e.g. for the following 42 builds,
> >>>>>>>
> >>>>>>> grep("(UCSC|NCBI)",
> >>>>>>>      unique(gsub(".masked", "", available.genomes(), fixed=TRUE)),
> >>>>>>>      value=TRUE)
> >>>>>>>  [1] "BSgenome.Amellifera.UCSC.apiMel2"   "BSgenome.Btaurus.UCSC.bosTau3"
> >>>>>>>  [3] "BSgenome.Btaurus.UCSC.bosTau4"      "BSgenome.Btaurus.UCSC.bosTau6"
> >>>>>>>  [5] "BSgenome.Btaurus.UCSC.bosTau8"      "BSgenome.Celegans.UCSC.ce10"
> >>>>>>>  [7] "BSgenome.Celegans.UCSC.ce2"         "BSgenome.Celegans.UCSC.ce6"
> >>>>>>>  [9] "BSgenome.Cfamiliaris.UCSC.canFam2"  "BSgenome.Cfamiliaris.UCSC.canFam3"
> >>>>>>> [11] "BSgenome.Dmelanogaster.UCSC.dm2"    "BSgenome.Dmelanogaster.UCSC.dm3"
> >>>>>>> [13] "BSgenome.Dmelanogaster.UCSC.dm6"    "BSgenome.Drerio.UCSC.danRer5"
> >>>>>>> [15] "BSgenome.Drerio.UCSC.danRer6"       "BSgenome.Drerio.UCSC.danRer7"
> >>>>>>> [17] "BSgenome.Ecoli.NCBI.20080805"       "BSgenome.Gaculeatus.UCSC.gasAcu1"
> >>>>>>> [19] "BSgenome.Ggallus.UCSC.galGal3"      "BSgenome.Ggallus.UCSC.galGal4"
> >>>>>>> [21] "BSgenome.Hsapiens.NCBI.GRCh38"      "BSgenome.Hsapiens.UCSC.hg17"
> >>>>>>> [23] "BSgenome.Hsapiens.UCSC.hg18"        "BSgenome.Hsapiens.UCSC.hg19"
> >>>>>>> [25] "BSgenome.Hsapiens.UCSC.hg38"        "BSgenome.Mfascicularis.NCBI.5.0"
> >>>>>>> [27] "BSgenome.Mfuro.UCSC.musFur1"        "BSgenome.Mmulatta.UCSC.rheMac2"
> >>>>>>> [29] "BSgenome.Mmulatta.UCSC.rheMac3"     "BSgenome.Mmusculus.UCSC.mm10"
> >>>>>>> [31] "BSgenome.Mmusculus.UCSC.mm8"        "BSgenome.Mmusculus.UCSC.mm9"
> >>>>>>> [33] "BSgenome.Ptroglodytes.UCSC.panTro2" "BSgenome.Ptroglodytes.UCSC.panTro3"
> >>>>>>> [35] "BSgenome.Rnorvegicus.UCSC.rn4"      "BSgenome.Rnorvegicus.UCSC.rn5"
> >>>>>>> [37] "BSgenome.Rnorvegicus.UCSC.rn6"      "BSgenome.Scerevisiae.UCSC.sacCer1"
> >>>>>>> [39] "BSgenome.Scerevisiae.UCSC.sacCer2"  "BSgenome.Scerevisiae.UCSC.sacCer3"
> >>>>>>> [41] "BSgenome.Sscrofa.UCSC.susScr3"      "BSgenome.Tguttata.UCSC.taeGut1"
> >>>>>>>
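
The loop over those packages could be as small as the sketch below.
build_name() and save_seqinfo() are hypothetical helpers, not part of any
package, and the seqinfo getter is passed in as an argument so that the
sketch runs without any BSgenome installed. In practice the getter might
be function(pkg) seqinfo(BSgenome::getBSgenome(pkg)), assuming the
package in question is installed.

```r
# Derive the build name by dropping the "BSgenome.<organism>.<provider>."
# prefix, e.g. "BSgenome.Hsapiens.UCSC.hg19" -> "hg19".
build_name <- function(pkg) sub("^BSgenome\\.[^.]+\\.[^.]+\\.", "", pkg)

# Save one object per build as seqinfo.<build>.rda under 'outdir'.
# 'get_seqinfo' is any function(pkg) returning the object to store.
save_seqinfo <- function(pkg, outdir, get_seqinfo) {
  obj_name <- paste0("seqinfo.", build_name(pkg))
  assign(obj_name, get_seqinfo(pkg))        # bind under the final name
  out <- file.path(outdir, paste0(obj_name, ".rda"))
  save(list = obj_name, file = out)         # saves from this function's frame
  invisible(out)
}

# Offline stand-in getter (placeholder text, not a real Seqinfo).
stand_in <- function(pkg) paste("placeholder for", pkg)

pkgs   <- c("BSgenome.Hsapiens.UCSC.hg19", "BSgenome.Mmusculus.UCSC.mm9")
outdir <- tempdir()
files  <- vapply(pkgs, save_seqinfo, character(1),
                 outdir = outdir, get_seqinfo = stand_in)
basename(files)
```

Passing the getter in also leaves room to swap the BSgenome packages for
some lighter source later without touching the loop.
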
> >>>>>>> Am I insane for suggesting this?  It would make things a little
> >>>>>>> easier for rtracklayer, most SummarizedExperiment and SE-derived
> >>>>>>> objects, blah, blah, blah...
> >>>>>>>
> >>>>>>>
> >>>>>>> Best,
> >>>>>>>
> >>>>>>> --t
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>>         [[alternative HTML version deleted]]
> >>>>>>
> >>>>>> _______________________________________________
> >>>>>> Bioc-devel at r-project.org mailing list
> >>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> >
> > --
> > Hervé Pagès
> >
> > Program in Computational Biology
> > Division of Public Health Sciences
> > Fred Hutchinson Cancer Research Center
> > 1100 Fairview Ave. N, M1-B514
> > P.O. Box 19024
> > Seattle, WA 98109-1024
> >
> > E-mail: hpages at fredhutch.org
> > Phone:  (206) 667-5791
> > Fax:    (206) 667-1319
>
