[Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?
Gabe Becker
becker.gabe at gene.com
Fri Jun 5 20:51:12 CEST 2015
Herve,
This is probably a naive question, but what usecases are there for creating
an object with the wrong seqinfo for its genome?
~G
On Fri, Jun 5, 2015 at 11:43 AM, Michael Lawrence <lawrence.michael at gene.com
> wrote:
> On Thu, Jun 4, 2015 at 11:48 PM, Hervé Pagès <hpages at fredhutch.org> wrote:
> > I also think that we're heading towards something like that.
> >
> > So genome(gr) <- "hg19" would:
> >
> > (a) Add any missing information to the seqinfo.
> > (b) Sort the seqlevels in "canonical" order.
> > (c) Change the seqlevels style to UCSC style if they are not.
> >
> > The 3 tasks are orthogonal. I guess most of the times people want
> > an easy way to perform them all at once.
> >
> > We could easily support (a) and (b). This assumes that the current
> > seqlevels are already valid hg19 seqlevels:
> >
> > si1 <- Seqinfo(c("chrX", "chrUn_gl000249", "chr2", "chr6_cox_hap2"))
> > gr1 <- GRanges(seqinfo=si1)
> > hg19_si <- Seqinfo(genome="hg19")
> >
> > ## (a):
> > seqinfo(gr1) <- merge(seqinfo(gr1), hg19_si)[seqlevels(gr1)]
> > seqinfo(gr1)
> > # Seqinfo object with 4 sequences (1 circular) from hg19 genome:
> > # seqnames seqlengths isCircular genome
> > # chrX 155270560 FALSE hg19
> > # chrUn_gl000249 38502 FALSE hg19
> > # chr2 243199373 FALSE hg19
> > # chr6_cox_hap2 4795371 FALSE hg19
> >
> > ## (b):
> > seqlevels(gr1) <- intersect(seqlevels(hg19_si), seqlevels(gr1))
> > seqinfo(gr1)
> > # Seqinfo object with 4 sequences (1 circular) from hg19 genome:
> > # seqnames seqlengths isCircular genome
> > # chr2 243199373 FALSE hg19
> > # chrX 155270560 FALSE hg19
> > # chr6_cox_hap2 4795371 FALSE hg19
> > # chrUn_gl000249 38502 FALSE hg19
> >
> > (c) is harder because seqlevelsStyle() doesn't know how to rename
> > scaffolds yet:
> >
> > si2 <- Seqinfo(c("X", "HSCHRUN_RANDOM_CTG42", "2",
> "HSCHR6_MHC_COX_CTG1"))
> > gr2 <- GRanges(seqinfo=si2)
> >
> > seqlevelsStyle(gr2)
> > # [1] "NCBI"
> >
> > seqlevelsStyle(gr2) <- "UCSC"
> > seqlevels(gr2)
> > # [1] "chrX" "HSCHRUN_RANDOM_CTG42" "chr2"
> > # [4] "HSCHR6_MHC_COX_CTG1"
> >
> > So we need to work on this.
> >
> > I'm not sure about using genome(gr) <- "hg19" for this. Right now
> > it sets the genome column of the seqinfo with the supplied string
> > and nothing else. Aren't there valid use cases for this?
>
> Not sure. People would almost always want the seqname style and order
> to be consistent with the given genome.
>
> > What about
> > using seqinfo(gr) <- "hg19" instead? It kind of suggests that the
> > whole seqinfo component actually gets filled.
> >
>
> Yea, but "genome" is so intuitive compared to "seqinfo".
>
>
>
> > H.
> >
> > On 06/04/2015 06:30 PM, Tim Triche, Jr. wrote:
> >>
> >> that's kind of always been my goal...
> >>
> >>
> >> Statistics is the grammar of science.
> >> Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
> >>
> >> On Thu, Jun 4, 2015 at 6:29 PM, Michael Lawrence
> >> <lawrence.michael at gene.com <mailto:lawrence.michael at gene.com>> wrote:
> >>
> >> Maybe this could eventually support setting the seqinfo with:
> >>
> >> genome(gr) <- "hg19"
> >>
> >> Or is that being too clever?
> >>
> >> On Thu, Jun 4, 2015 at 4:28 PM, Hervé Pagès <hpages at fredhutch.org
> >> <mailto:hpages at fredhutch.org>> wrote:
> >> > Hi,
> >> >
> >> > FWIW I started to work on supporting quick generation of a
> >> standalone
> >> > Seqinfo object via Seqinfo(genome="hg38") in GenomeInfoDb.
> >> >
> >> > It already supports hg38, hg19, hg18, panTro4, panTro3, panTro2,
> >> > bosTau8, bosTau7, bosTau6, canFam3, canFam2, canFam1, musFur1,
> >> mm10,
> >> > mm9, mm8, susScr3, susScr2, rn6, rheMac3, rheMac2, galGal4,
> >> galGal3,
> >> > gasAcu1, danRer7, apiMel2, dm6, dm3, ce10, ce6, ce4, ce2,
> sacCer3,
> >> > and sacCer2. I'll add more.
> >> >
> >> > See ?Seqinfo for some examples.
> >> >
> >> > Right now it fetches the information from internet every time you
> >> > call it but maybe we should just store that information in the
> >> > GenomeInfoDb package as Tim suggested?
> >> >
> >> > H.
> >> >
> >> >
> >> > On 06/03/2015 12:54 PM, Tim Triche, Jr. wrote:
> >> >>
> >> >> That would be perfect actually. And it would radically reduce &
> >> >> modularize maintenance. Maybe that's the best way to go after
> >> all. Quite
> >> >> sensible.
> >> >>
> >> >> --t
> >> >>
> >> >>> On Jun 3, 2015, at 12:46 PM, Vincent Carey
> >> <stvjc at channing.harvard.edu <mailto:stvjc at channing.harvard.edu>>
> >> >>> wrote:
> >> >>>
> >> >>> It really isn't hard to have multiple OrganismDb packages in
> >> place -- the
> >> >>> process of making new ones is documented and was given as an
> >> exercise in
> >> >>> the EdX course. I don't know if we want to institutionalize it
> >> and
> >> >>> distribute such -- I think we might, so that there would be
> >> Hs19, Hs38,
> >> >>> mm9, etc. packages. They have very little content, they just
> >> coordinate
> >> >>> interactions with packages that you'll already have.
> >> >>>
> >> >>> On Wed, Jun 3, 2015 at 3:26 PM, Tim Triche, Jr.
> >> <tim.triche at gmail.com <mailto:tim.triche at gmail.com>>
> >>
> >> >>> wrote:
> >> >>>
> >> >>>> Right, I typically do that too, and if you're working on human
> >> data it
> >> >>>> isn't a big deal. What makes things a lot more of a drag is
> >> when you
> >> >>>> work
> >> >>>> on e.g. mouse data (mm9 vs mm10, aka GRCm37 vs GRCm38) where
> >> >>>> Mus.musculus
> >> >>>> is essentially a "build ahead" of Homo.sapiens.
> >> >>>>
> >> >>>> R> seqinfo(Homo.sapiens)
> >> >>>> Seqinfo object with 93 sequences (1 circular) from hg19 genome
> >> >>>>
> >> >>>> R> seqinfo(Mus.musculus)
> >> >>>> Seqinfo object with 66 sequences (1 circular) from mm10
> genome:
> >> >>>>
> >> >>>> It's not as explicit as directly assigning the seqinfo from a
> >> genome
> >> >>>> that
> >> >>>> corresponds to that of your annotations/results/whatever. I
> >> know we
> >> >>>> could
> >> >>>> all use crossmap or liftOver or whatever, but that's not
> >> really the
> >> >>>> same,
> >> >>>> and it takes time, whereas assigning the proper seqinfo for
> >> >>>> relationships
> >> >>>> is very fast.
> >> >>>>
> >> >>>> That's all I was getting at...
> >> >>>>
> >> >>>>
> >> >>>> Statistics is the grammar of science.
> >> >>>> Karl Pearson
> >> <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
> >> >>>>
> >> >>>> On Wed, Jun 3, 2015 at 12:17 PM, Vincent Carey
> >> >>>> <stvjc at channing.harvard.edu <mailto:
> stvjc at channing.harvard.edu>
> >> >>>>>
> >> >>>>> wrote:
> >> >>>>
> >> >>>>
> >> >>>>> I typically get this info from Homo.sapiens. The result is
> >> parasitic
> >> >>>>> on
> >> >>>>> the TxDb that is in there. I don't know how easy it is to
> swap
> >> >>>>> alternate
> >> >>>>> TxDb in to get a different build. I think it would make
> sense
> >> to
> >> >>>>> regard
> >> >>>>> the OrganismDb instances as foundational for this sort of
> >> structural
> >> >>>>> data.
> >> >>>>>
> >> >>>>> On Wed, Jun 3, 2015 at 3:12 PM, Kasper Daniel Hansen <
> >> >>>>> kasperdanielhansen at gmail.com
> >> <mailto:kasperdanielhansen at gmail.com>> wrote:
> >> >>>>>
> >> >>>>>> Let me rephrase this slightly. From one POV the purpose of
> >> >>>>>> GenomeInfoDb
> >> >>>>>> is
> >> >>>>>> clean up the seqinfo slot. Currently it does most of the
> >> cleaning,
> >> >>>>>> but
> >> >>>>>> it
> >> >>>>>> does not add seqlengths.
> >> >>>>>>
> >> >>>>>> It is clear that seqlengths depends on the version of the
> >> genome, but
> >> >>>>>> I
> >> >>>>>> will argue so does the seqnames. Of course, for human,
> >> chr22 will not
> >> >>>>>> change. But what about the names of all the random
> >> contigs? Or for
> >> >>>>>> other
> >> >>>>>> organisms, what about going from a draft genome with 10k
> >> contigs to a
> >> >>>>>> more
> >> >>>>>> completely genome assembled into fewer, larger chromosomes.
> >> >>>>>>
> >> >>>>>> I acknowledge that this information is present in the
> BSgenome
> >> >>>>>> packages,
> >> >>>>>> but it seems (to me) to be very appropriate to have them
> >> around for
> >> >>>>>> cleaning up the seqinfo slot. For some situations it is not
> >> great to
> >> >>>>>> depend on 1 GB> download for something that is a few bytes.
> >> >>>>>>
> >> >>>>>> Best,
> >> >>>>>> Kasper
> >> >>>>>>
> >> >>>>>> On Wed, Jun 3, 2015 at 3:00 PM, Tim Triche, Jr.
> >> <tim.triche at gmail.com <mailto:tim.triche at gmail.com>>
> >>
> >> >>>>>> wrote:
> >> >>>>>>
> >> >>>>>>> It would be nice (for a number of reasons) to have
> >> chromosome lengths
> >> >>>>>>> readily available in a foundational package like
> >> GenomeInfoDb, so
> >> >>>>>>> that,
> >> >>>>>>> say,
> >> >>>>>>>
> >> >>>>>>> data(seqinfo.hg19)
> >> >>>>>>> seqinfo(myResults) <- seqinfo.hg19[ seqlevels(myResults) ]
> >> >>>>>>>
> >> >>>>>>> would work without issues. Is there any particular reason
> >> this
> >> >>>>>>
> >> >>>>>> couldn't
> >> >>>>>>>
> >> >>>>>>> happen for the supported/available BSgenomes? It would
> >> seem like a
> >> >>>>>>
> >> >>>>>> simple
> >> >>>>>>>
> >> >>>>>>> matter to do
> >> >>>>>>>
> >> >>>>>>> R> library(BSgenome.Hsapiens.UCSC.hg19)
> >> >>>>>>> R> seqinfo.hg19 <- seqinfo(Hsapiens)
> >> >>>>>>> R> save(seqinfo.hg19,
> >> >>>>>>> file="~/bioc-devel/GenomeInfoDb/data/seqinfo.hg19.rda")
> >> >>>>>>>
> >> >>>>>>> and be done with it until (say) the next release or next
> >> released
> >> >>>>>>> BSgenome. I considered looping through the following
> >> BSgenomes
> >> >>>>>>
> >> >>>>>> myself...
> >> >>>>>>>
> >> >>>>>>> and if it isn't strongly opposed by (everyone) I may still
> >> do exactly
> >> >>>>>>> that. Seems useful, no?
> >> >>>>>>>
> >> >>>>>>> e.g. for the following 42 builds,
> >> >>>>>>>
> >> >>>>>>> grep("(UCSC|NCBI)", unique(gsub(".masked", "",
> >> available.genomes())),
> >> >>>>>>> value=TRUE)
> >> >>>>>>> [1] "BSgenome.Amellifera.UCSC.apiMel2"
> >> >>>>>>
> >> >>>>>> "BSgenome.Btaurus.UCSC.bosTau3"
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> [3] "BSgenome.Btaurus.UCSC.bosTau4"
> >> >>>>>>
> >> >>>>>> "BSgenome.Btaurus.UCSC.bosTau6"
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> [5] "BSgenome.Btaurus.UCSC.bosTau8"
> >> >>>>>>> "BSgenome.Celegans.UCSC.ce10"
> >> >>>>>>>
> >> >>>>>>> [7] "BSgenome.Celegans.UCSC.ce2"
> >> "BSgenome.Celegans.UCSC.ce6"
> >> >>>>>>>
> >> >>>>>>> [9] "BSgenome.Cfamiliaris.UCSC.canFam2"
> >> >>>>>>> "BSgenome.Cfamiliaris.UCSC.canFam3"
> >> >>>>>>> [11] "BSgenome.Dmelanogaster.UCSC.dm2"
> >> >>>>>>> "BSgenome.Dmelanogaster.UCSC.dm3"
> >> >>>>>>> [13] "BSgenome.Dmelanogaster.UCSC.dm6"
> >> >>>>>>
> >> >>>>>> "BSgenome.Drerio.UCSC.danRer5"
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> [15] "BSgenome.Drerio.UCSC.danRer6"
> >> >>>>>>
> >> >>>>>> "BSgenome.Drerio.UCSC.danRer7"
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> [17] "BSgenome.Ecoli.NCBI.20080805"
> >> >>>>>>> "BSgenome.Gaculeatus.UCSC.gasAcu1"
> >> >>>>>>> [19] "BSgenome.Ggallus.UCSC.galGal3"
> >> >>>>>>
> >> >>>>>> "BSgenome.Ggallus.UCSC.galGal4"
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> [21] "BSgenome.Hsapiens.NCBI.GRCh38"
> >> >>>>>>> "BSgenome.Hsapiens.UCSC.hg17"
> >> >>>>>>>
> >> >>>>>>> [23] "BSgenome.Hsapiens.UCSC.hg18"
> >> >>>>>>> "BSgenome.Hsapiens.UCSC.hg19"
> >> >>>>>>>
> >> >>>>>>> [25] "BSgenome.Hsapiens.UCSC.hg38"
> >> >>>>>>> "BSgenome.Mfascicularis.NCBI.5.0"
> >> >>>>>>> [27] "BSgenome.Mfuro.UCSC.musFur1"
> >> >>>>>>
> >> >>>>>> "BSgenome.Mmulatta.UCSC.rheMac2"
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> [29] "BSgenome.Mmulatta.UCSC.rheMac3"
> >> >>>>>>
> >> >>>>>> "BSgenome.Mmusculus.UCSC.mm10"
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> [31] "BSgenome.Mmusculus.UCSC.mm8"
> >> >>>>>>> "BSgenome.Mmusculus.UCSC.mm9"
> >> >>>>>>>
> >> >>>>>>> [33] "BSgenome.Ptroglodytes.UCSC.panTro2"
> >> >>>>>>> "BSgenome.Ptroglodytes.UCSC.panTro3"
> >> >>>>>>> [35] "BSgenome.Rnorvegicus.UCSC.rn4"
> >> >>>>>>
> >> >>>>>> "BSgenome.Rnorvegicus.UCSC.rn5"
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> [37] "BSgenome.Rnorvegicus.UCSC.rn6"
> >> >>>>>>> "BSgenome.Scerevisiae.UCSC.sacCer1"
> >> >>>>>>> [39] "BSgenome.Scerevisiae.UCSC.sacCer2"
> >> >>>>>>> "BSgenome.Scerevisiae.UCSC.sacCer3"
> >> >>>>>>> [41] "BSgenome.Sscrofa.UCSC.susScr3"
> >> >>>>>>
> >> >>>>>> "BSgenome.Tguttata.UCSC.taeGut1"
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> Am I insane for suggesting this? It would make things a
> >> little
> >> >>>>>>> easier
> >> >>>>>>
> >> >>>>>> for
> >> >>>>>>>
> >> >>>>>>> rtracklayer, most SummarizedExperiment and SE-derived
> >> objects, blah,
> >> >>>>>>
> >> >>>>>> blah,
> >> >>>>>>>
> >> >>>>>>> blah...
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> Best,
> >> >>>>>>>
> >> >>>>>>> --t
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> Statistics is the grammar of science.
> >> >>>>>>> Karl Pearson
> >> <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> [[alternative HTML version deleted]]
> >> >>>>>>
> >> >>>>>> _______________________________________________
> >> >>>>>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
> >> mailing list
> >> >>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> >> >>>
> >> >>>
> >> >>> [[alternative HTML version deleted]]
> >> >>>
> >> >>> _______________________________________________
> >> >>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
> >> mailing list
> >> >>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> >> >>
> >> >>
> >> >> _______________________________________________
> >> >> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
> >> mailing list
> >> >> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> >> >>
> >> >
> >> > --
> >> > Hervé Pagès
> >> >
> >> > Program in Computational Biology
> >> > Division of Public Health Sciences
> >> > Fred Hutchinson Cancer Research Center
> >> > 1100 Fairview Ave. N, M1-B514
> >> > P.O. Box 19024
> >> > Seattle, WA 98109-1024
> >> >
> >> > E-mail: hpages at fredhutch.org <mailto:hpages at fredhutch.org>
> >> > Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
> >> > Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
> >> >
> >> >
> >> > _______________________________________________
> >> > Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
> >> mailing list
> >> > https://stat.ethz.ch/mailman/listinfo/bioc-devel
> >>
> >>
> >
> > --
> > Hervé Pagès
> >
> > Program in Computational Biology
> > Division of Public Health Sciences
> > Fred Hutchinson Cancer Research Center
> > 1100 Fairview Ave. N, M1-B514
> > P.O. Box 19024
> > Seattle, WA 98109-1024
> >
> > E-mail: hpages at fredhutch.org
> > Phone: (206) 667-5791
> > Fax: (206) 667-1319
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
--
Gabriel Becker, Ph.D
Computational Biologist
Genentech Research
[[alternative HTML version deleted]]
More information about the Bioc-devel
mailing list