[Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?
Gabe Becker
becker.gabe at gene.com
Fri Jun 5 22:05:15 CEST 2015
That sounds like it calls for an (class-style) inheritence/genome-union
model to me. I should probably stop talking now before the people who
would have to implement that start throwing things at me, though.
~G
On Fri, Jun 5, 2015 at 12:54 PM, Kasper Daniel Hansen <
kasperdanielhansen at gmail.com> wrote:
> In WGBS we frequently sequence a human with spikein from the lambda
> genome. In this case, most of the chromosomes of the Granges are from
> human, except one. This is a usecase where genome(GR) is not constant. I
> suggest, partly for compatibility, to keep genome, but perhaps do something
> like
> standardizeGenome()
> or something like this.
>
> I would indeed love, love, love a function which just cleans it up.
>
> Kasper
>
> On Fri, Jun 5, 2015 at 2:51 PM, Gabe Becker <becker.gabe at gene.com> wrote:
>
>> Herve,
>>
>> This is probably a naive question, but what usecases are there for
>> creating
>> an object with the wrong seqinfo for its genome?
>>
>> ~G
>>
>> On Fri, Jun 5, 2015 at 11:43 AM, Michael Lawrence <
>> lawrence.michael at gene.com
>> > wrote:
>>
>> > On Thu, Jun 4, 2015 at 11:48 PM, Hervé Pagès <hpages at fredhutch.org>
>> wrote:
>> > > I also think that we're heading towards something like that.
>> > >
>> > > So genome(gr) <- "hg19" would:
>> > >
>> > > (a) Add any missing information to the seqinfo.
>> > > (b) Sort the seqlevels in "canonical" order.
>> > > (c) Change the seqlevels style to UCSC style if they are not.
>> > >
>> > > The 3 tasks are orthogonal. I guess most of the times people want
>> > > an easy way to perform them all at once.
>> > >
>> > > We could easily support (a) and (b). This assumes that the current
>> > > seqlevels are already valid hg19 seqlevels:
>> > >
>> > > si1 <- Seqinfo(c("chrX", "chrUn_gl000249", "chr2", "chr6_cox_hap2"))
>> > > gr1 <- GRanges(seqinfo=si1)
>> > > hg19_si <- Seqinfo(genome="hg19")
>> > >
>> > > ## (a):
>> > > seqinfo(gr1) <- merge(seqinfo(gr1), hg19_si)[seqlevels(gr1)]
>> > > seqinfo(gr1)
>> > > # Seqinfo object with 4 sequences (1 circular) from hg19 genome:
>> > > # seqnames seqlengths isCircular genome
>> > > # chrX 155270560 FALSE hg19
>> > > # chrUn_gl000249 38502 FALSE hg19
>> > > # chr2 243199373 FALSE hg19
>> > > # chr6_cox_hap2 4795371 FALSE hg19
>> > >
>> > > ## (b):
>> > > seqlevels(gr1) <- intersect(seqlevels(hg19_si), seqlevels(gr1))
>> > > seqinfo(gr1)
>> > > # Seqinfo object with 4 sequences (1 circular) from hg19 genome:
>> > > # seqnames seqlengths isCircular genome
>> > > # chr2 243199373 FALSE hg19
>> > > # chrX 155270560 FALSE hg19
>> > > # chr6_cox_hap2 4795371 FALSE hg19
>> > > # chrUn_gl000249 38502 FALSE hg19
>> > >
>> > > (c) is harder because seqlevelsStyle() doesn't know how to rename
>> > > scaffolds yet:
>> > >
>> > > si2 <- Seqinfo(c("X", "HSCHRUN_RANDOM_CTG42", "2",
>> > "HSCHR6_MHC_COX_CTG1"))
>> > > gr2 <- GRanges(seqinfo=si2)
>> > >
>> > > seqlevelsStyle(gr2)
>> > > # [1] "NCBI"
>> > >
>> > > seqlevelsStyle(gr2) <- "UCSC"
>> > > seqlevels(gr2)
>> > > # [1] "chrX" "HSCHRUN_RANDOM_CTG42" "chr2"
>> > > # [4] "HSCHR6_MHC_COX_CTG1"
>> > >
>> > > So we need to work on this.
>> > >
>> > > I'm not sure about using genome(gr) <- "hg19" for this. Right now
>> > > it sets the genome column of the seqinfo with the supplied string
>> > > and nothing else. Aren't there valid use cases for this?
>> >
>> > Not sure. People would almost always want the seqname style and order
>> > to be consistent with the given genome.
>> >
>> > > What about
>> > > using seqinfo(gr) <- "hg19" instead? It kind of suggests that the
>> > > whole seqinfo component actually gets filled.
>> > >
>> >
>> > Yea, but "genome" is so intuitive compared to "seqinfo".
>> >
>> >
>> >
>> > > H.
>> > >
>> > > On 06/04/2015 06:30 PM, Tim Triche, Jr. wrote:
>> > >>
>> > >> that's kind of always been my goal...
>> > >>
>> > >>
>> > >> Statistics is the grammar of science.
>> > >> Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
>> > >>
>> > >> On Thu, Jun 4, 2015 at 6:29 PM, Michael Lawrence
>> > >> <lawrence.michael at gene.com <mailto:lawrence.michael at gene.com>>
>> wrote:
>> > >>
>> > >> Maybe this could eventually support setting the seqinfo with:
>> > >>
>> > >> genome(gr) <- "hg19"
>> > >>
>> > >> Or is that being too clever?
>> > >>
>> > >> On Thu, Jun 4, 2015 at 4:28 PM, Hervé Pagès <
>> hpages at fredhutch.org
>> > >> <mailto:hpages at fredhutch.org>> wrote:
>> > >> > Hi,
>> > >> >
>> > >> > FWIW I started to work on supporting quick generation of a
>> > >> standalone
>> > >> > Seqinfo object via Seqinfo(genome="hg38") in GenomeInfoDb.
>> > >> >
>> > >> > It already supports hg38, hg19, hg18, panTro4, panTro3,
>> panTro2,
>> > >> > bosTau8, bosTau7, bosTau6, canFam3, canFam2, canFam1, musFur1,
>> > >> mm10,
>> > >> > mm9, mm8, susScr3, susScr2, rn6, rheMac3, rheMac2, galGal4,
>> > >> galGal3,
>> > >> > gasAcu1, danRer7, apiMel2, dm6, dm3, ce10, ce6, ce4, ce2,
>> > sacCer3,
>> > >> > and sacCer2. I'll add more.
>> > >> >
>> > >> > See ?Seqinfo for some examples.
>> > >> >
>> > >> > Right now it fetches the information from internet every time
>> you
>> > >> > call it but maybe we should just store that information in the
>> > >> > GenomeInfoDb package as Tim suggested?
>> > >> >
>> > >> > H.
>> > >> >
>> > >> >
>> > >> > On 06/03/2015 12:54 PM, Tim Triche, Jr. wrote:
>> > >> >>
>> > >> >> That would be perfect actually. And it would radically
>> reduce &
>> > >> >> modularize maintenance. Maybe that's the best way to go
>> after
>> > >> all. Quite
>> > >> >> sensible.
>> > >> >>
>> > >> >> --t
>> > >> >>
>> > >> >>> On Jun 3, 2015, at 12:46 PM, Vincent Carey
>> > >> <stvjc at channing.harvard.edu <mailto:stvjc at channing.harvard.edu>>
>> > >> >>> wrote:
>> > >> >>>
>> > >> >>> It really isn't hard to have multiple OrganismDb packages in
>> > >> place -- the
>> > >> >>> process of making new ones is documented and was given as an
>> > >> exercise in
>> > >> >>> the EdX course. I don't know if we want to
>> institutionalize it
>> > >> and
>> > >> >>> distribute such -- I think we might, so that there would be
>> > >> Hs19, Hs38,
>> > >> >>> mm9, etc. packages. They have very little content, they
>> just
>> > >> coordinate
>> > >> >>> interactions with packages that you'll already have.
>> > >> >>>
>> > >> >>> On Wed, Jun 3, 2015 at 3:26 PM, Tim Triche, Jr.
>> > >> <tim.triche at gmail.com <mailto:tim.triche at gmail.com>>
>> > >>
>> > >> >>> wrote:
>> > >> >>>
>> > >> >>>> Right, I typically do that too, and if you're working on
>> human
>> > >> data it
>> > >> >>>> isn't a big deal. What makes things a lot more of a drag
>> is
>> > >> when you
>> > >> >>>> work
>> > >> >>>> on e.g. mouse data (mm9 vs mm10, aka GRCm37 vs GRCm38)
>> where
>> > >> >>>> Mus.musculus
>> > >> >>>> is essentially a "build ahead" of Homo.sapiens.
>> > >> >>>>
>> > >> >>>> R> seqinfo(Homo.sapiens)
>> > >> >>>> Seqinfo object with 93 sequences (1 circular) from hg19
>> genome
>> > >> >>>>
>> > >> >>>> R> seqinfo(Mus.musculus)
>> > >> >>>> Seqinfo object with 66 sequences (1 circular) from mm10
>> > genome:
>> > >> >>>>
>> > >> >>>> It's not as explicit as directly assigning the seqinfo
>> from a
>> > >> genome
>> > >> >>>> that
>> > >> >>>> corresponds to that of your annotations/results/whatever.
>> I
>> > >> know we
>> > >> >>>> could
>> > >> >>>> all use crossmap or liftOver or whatever, but that's not
>> > >> really the
>> > >> >>>> same,
>> > >> >>>> and it takes time, whereas assigning the proper seqinfo for
>> > >> >>>> relationships
>> > >> >>>> is very fast.
>> > >> >>>>
>> > >> >>>> That's all I was getting at...
>> > >> >>>>
>> > >> >>>>
>> > >> >>>> Statistics is the grammar of science.
>> > >> >>>> Karl Pearson
>> > >> <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
>> > >> >>>>
>> > >> >>>> On Wed, Jun 3, 2015 at 12:17 PM, Vincent Carey
>> > >> >>>> <stvjc at channing.harvard.edu <mailto:
>> > stvjc at channing.harvard.edu>
>> > >> >>>>>
>> > >> >>>>> wrote:
>> > >> >>>>
>> > >> >>>>
>> > >> >>>>> I typically get this info from Homo.sapiens. The result
>> is
>> > >> parasitic
>> > >> >>>>> on
>> > >> >>>>> the TxDb that is in there. I don't know how easy it is to
>> > swap
>> > >> >>>>> alternate
>> > >> >>>>> TxDb in to get a different build. I think it would make
>> > sense
>> > >> to
>> > >> >>>>> regard
>> > >> >>>>> the OrganismDb instances as foundational for this sort of
>> > >> structural
>> > >> >>>>> data.
>> > >> >>>>>
>> > >> >>>>> On Wed, Jun 3, 2015 at 3:12 PM, Kasper Daniel Hansen <
>> > >> >>>>> kasperdanielhansen at gmail.com
>> > >> <mailto:kasperdanielhansen at gmail.com>> wrote:
>> > >> >>>>>
>> > >> >>>>>> Let me rephrase this slightly. From one POV the purpose
>> of
>> > >> >>>>>> GenomeInfoDb
>> > >> >>>>>> is
>> > >> >>>>>> clean up the seqinfo slot. Currently it does most of the
>> > >> cleaning,
>> > >> >>>>>> but
>> > >> >>>>>> it
>> > >> >>>>>> does not add seqlengths.
>> > >> >>>>>>
>> > >> >>>>>> It is clear that seqlengths depends on the version of the
>> > >> genome, but
>> > >> >>>>>> I
>> > >> >>>>>> will argue so does the seqnames. Of course, for human,
>> > >> chr22 will not
>> > >> >>>>>> change. But what about the names of all the random
>> > >> contigs? Or for
>> > >> >>>>>> other
>> > >> >>>>>> organisms, what about going from a draft genome with 10k
>> > >> contigs to a
>> > >> >>>>>> more
>> > >> >>>>>> completely genome assembled into fewer, larger
>> chromosomes.
>> > >> >>>>>>
>> > >> >>>>>> I acknowledge that this information is present in the
>> > BSgenome
>> > >> >>>>>> packages,
>> > >> >>>>>> but it seems (to me) to be very appropriate to have them
>> > >> around for
>> > >> >>>>>> cleaning up the seqinfo slot. For some situations it is
>> not
>> > >> great to
>> > >> >>>>>> depend on 1 GB> download for something that is a few
>> bytes.
>> > >> >>>>>>
>> > >> >>>>>> Best,
>> > >> >>>>>> Kasper
>> > >> >>>>>>
>> > >> >>>>>> On Wed, Jun 3, 2015 at 3:00 PM, Tim Triche, Jr.
>> > >> <tim.triche at gmail.com <mailto:tim.triche at gmail.com>>
>> > >>
>> > >> >>>>>> wrote:
>> > >> >>>>>>
>> > >> >>>>>>> It would be nice (for a number of reasons) to have
>> > >> chromosome lengths
>> > >> >>>>>>> readily available in a foundational package like
>> > >> GenomeInfoDb, so
>> > >> >>>>>>> that,
>> > >> >>>>>>> say,
>> > >> >>>>>>>
>> > >> >>>>>>> data(seqinfo.hg19)
>> > >> >>>>>>> seqinfo(myResults) <- seqinfo.hg19[
>> seqlevels(myResults) ]
>> > >> >>>>>>>
>> > >> >>>>>>> would work without issues. Is there any particular
>> reason
>> > >> this
>> > >> >>>>>>
>> > >> >>>>>> couldn't
>> > >> >>>>>>>
>> > >> >>>>>>> happen for the supported/available BSgenomes? It would
>> > >> seem like a
>> > >> >>>>>>
>> > >> >>>>>> simple
>> > >> >>>>>>>
>> > >> >>>>>>> matter to do
>> > >> >>>>>>>
>> > >> >>>>>>> R> library(BSgenome.Hsapiens.UCSC.hg19)
>> > >> >>>>>>> R> seqinfo.hg19 <- seqinfo(Hsapiens)
>> > >> >>>>>>> R> save(seqinfo.hg19,
>> > >> >>>>>>> file="~/bioc-devel/GenomeInfoDb/data/seqinfo.hg19.rda")
>> > >> >>>>>>>
>> > >> >>>>>>> and be done with it until (say) the next release or next
>> > >> released
>> > >> >>>>>>> BSgenome. I considered looping through the following
>> > >> BSgenomes
>> > >> >>>>>>
>> > >> >>>>>> myself...
>> > >> >>>>>>>
>> > >> >>>>>>> and if it isn't strongly opposed by (everyone) I may
>> still
>> > >> do exactly
>> > >> >>>>>>> that. Seems useful, no?
>> > >> >>>>>>>
>> > >> >>>>>>> e.g. for the following 42 builds,
>> > >> >>>>>>>
>> > >> >>>>>>> grep("(UCSC|NCBI)", unique(gsub(".masked", "",
>> > >> available.genomes())),
>> > >> >>>>>>> value=TRUE)
>> > >> >>>>>>> [1] "BSgenome.Amellifera.UCSC.apiMel2"
>> > >> >>>>>>
>> > >> >>>>>> "BSgenome.Btaurus.UCSC.bosTau3"
>> > >> >>>>>>>
>> > >> >>>>>>>
>> > >> >>>>>>> [3] "BSgenome.Btaurus.UCSC.bosTau4"
>> > >> >>>>>>
>> > >> >>>>>> "BSgenome.Btaurus.UCSC.bosTau6"
>> > >> >>>>>>>
>> > >> >>>>>>>
>> > >> >>>>>>> [5] "BSgenome.Btaurus.UCSC.bosTau8"
>> > >> >>>>>>> "BSgenome.Celegans.UCSC.ce10"
>> > >> >>>>>>>
>> > >> >>>>>>> [7] "BSgenome.Celegans.UCSC.ce2"
>> > >> "BSgenome.Celegans.UCSC.ce6"
>> > >> >>>>>>>
>> > >> >>>>>>> [9] "BSgenome.Cfamiliaris.UCSC.canFam2"
>> > >> >>>>>>> "BSgenome.Cfamiliaris.UCSC.canFam3"
>> > >> >>>>>>> [11] "BSgenome.Dmelanogaster.UCSC.dm2"
>> > >> >>>>>>> "BSgenome.Dmelanogaster.UCSC.dm3"
>> > >> >>>>>>> [13] "BSgenome.Dmelanogaster.UCSC.dm6"
>> > >> >>>>>>
>> > >> >>>>>> "BSgenome.Drerio.UCSC.danRer5"
>> > >> >>>>>>>
>> > >> >>>>>>>
>> > >> >>>>>>> [15] "BSgenome.Drerio.UCSC.danRer6"
>> > >> >>>>>>
>> > >> >>>>>> "BSgenome.Drerio.UCSC.danRer7"
>> > >> >>>>>>>
>> > >> >>>>>>>
>> > >> >>>>>>> [17] "BSgenome.Ecoli.NCBI.20080805"
>> > >> >>>>>>> "BSgenome.Gaculeatus.UCSC.gasAcu1"
>> > >> >>>>>>> [19] "BSgenome.Ggallus.UCSC.galGal3"
>> > >> >>>>>>
>> > >> >>>>>> "BSgenome.Ggallus.UCSC.galGal4"
>> > >> >>>>>>>
>> > >> >>>>>>>
>> > >> >>>>>>> [21] "BSgenome.Hsapiens.NCBI.GRCh38"
>> > >> >>>>>>> "BSgenome.Hsapiens.UCSC.hg17"
>> > >> >>>>>>>
>> > >> >>>>>>> [23] "BSgenome.Hsapiens.UCSC.hg18"
>> > >> >>>>>>> "BSgenome.Hsapiens.UCSC.hg19"
>> > >> >>>>>>>
>> > >> >>>>>>> [25] "BSgenome.Hsapiens.UCSC.hg38"
>> > >> >>>>>>> "BSgenome.Mfascicularis.NCBI.5.0"
>> > >> >>>>>>> [27] "BSgenome.Mfuro.UCSC.musFur1"
>> > >> >>>>>>
>> > >> >>>>>> "BSgenome.Mmulatta.UCSC.rheMac2"
>> > >> >>>>>>>
>> > >> >>>>>>>
>> > >> >>>>>>> [29] "BSgenome.Mmulatta.UCSC.rheMac3"
>> > >> >>>>>>
>> > >> >>>>>> "BSgenome.Mmusculus.UCSC.mm10"
>> > >> >>>>>>>
>> > >> >>>>>>>
>> > >> >>>>>>> [31] "BSgenome.Mmusculus.UCSC.mm8"
>> > >> >>>>>>> "BSgenome.Mmusculus.UCSC.mm9"
>> > >> >>>>>>>
>> > >> >>>>>>> [33] "BSgenome.Ptroglodytes.UCSC.panTro2"
>> > >> >>>>>>> "BSgenome.Ptroglodytes.UCSC.panTro3"
>> > >> >>>>>>> [35] "BSgenome.Rnorvegicus.UCSC.rn4"
>> > >> >>>>>>
>> > >> >>>>>> "BSgenome.Rnorvegicus.UCSC.rn5"
>> > >> >>>>>>>
>> > >> >>>>>>>
>> > >> >>>>>>> [37] "BSgenome.Rnorvegicus.UCSC.rn6"
>> > >> >>>>>>> "BSgenome.Scerevisiae.UCSC.sacCer1"
>> > >> >>>>>>> [39] "BSgenome.Scerevisiae.UCSC.sacCer2"
>> > >> >>>>>>> "BSgenome.Scerevisiae.UCSC.sacCer3"
>> > >> >>>>>>> [41] "BSgenome.Sscrofa.UCSC.susScr3"
>> > >> >>>>>>
>> > >> >>>>>> "BSgenome.Tguttata.UCSC.taeGut1"
>> > >> >>>>>>>
>> > >> >>>>>>>
>> > >> >>>>>>>
>> > >> >>>>>>>
>> > >> >>>>>>>
>> > >> >>>>>>> Am I insane for suggesting this? It would make things a
>> > >> little
>> > >> >>>>>>> easier
>> > >> >>>>>>
>> > >> >>>>>> for
>> > >> >>>>>>>
>> > >> >>>>>>> rtracklayer, most SummarizedExperiment and SE-derived
>> > >> objects, blah,
>> > >> >>>>>>
>> > >> >>>>>> blah,
>> > >> >>>>>>>
>> > >> >>>>>>> blah...
>> > >> >>>>>>>
>> > >> >>>>>>>
>> > >> >>>>>>> Best,
>> > >> >>>>>>>
>> > >> >>>>>>> --t
>> > >> >>>>>>>
>> > >> >>>>>>>
>> > >> >>>>>>>
>> > >> >>>>>>>
>> > >> >>>>>>> Statistics is the grammar of science.
>> > >> >>>>>>> Karl Pearson
>> > >> <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
>> > >> >>>>>>
>> > >> >>>>>>
>> > >> >>>>>> [[alternative HTML version deleted]]
>> > >> >>>>>>
>> > >> >>>>>> _______________________________________________
>> > >> >>>>>> Bioc-devel at r-project.org <mailto:
>> Bioc-devel at r-project.org>
>> > >> mailing list
>> > >> >>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>> > >> >>>
>> > >> >>>
>> > >> >>> [[alternative HTML version deleted]]
>> > >> >>>
>> > >> >>> _______________________________________________
>> > >> >>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>> > >> mailing list
>> > >> >>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>> > >> >>
>> > >> >>
>> > >> >> _______________________________________________
>> > >> >> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>> > >> mailing list
>> > >> >> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>> > >> >>
>> > >> >
>> > >> > --
>> > >> > Hervé Pagès
>> > >> >
>> > >> > Program in Computational Biology
>> > >> > Division of Public Health Sciences
>> > >> > Fred Hutchinson Cancer Research Center
>> > >> > 1100 Fairview Ave. N, M1-B514
>> > >> > P.O. Box 19024
>> > >> > Seattle, WA 98109-1024
>> > >> >
>> > >> > E-mail: hpages at fredhutch.org <mailto:hpages at fredhutch.org>
>> > >> > Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>> > >> > Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>> > >> >
>> > >> >
>> > >> > _______________________________________________
>> > >> > Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>> > >> mailing list
>> > >> > https://stat.ethz.ch/mailman/listinfo/bioc-devel
>> > >>
>> > >>
>> > >
>> > > --
>> > > Hervé Pagès
>> > >
>> > > Program in Computational Biology
>> > > Division of Public Health Sciences
>> > > Fred Hutchinson Cancer Research Center
>> > > 1100 Fairview Ave. N, M1-B514
>> > > P.O. Box 19024
>> > > Seattle, WA 98109-1024
>> > >
>> > > E-mail: hpages at fredhutch.org
>> > > Phone: (206) 667-5791
>> > > Fax: (206) 667-1319
>> >
>> > _______________________________________________
>> > Bioc-devel at r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/bioc-devel
>> >
>>
>>
>>
>> --
>> Gabriel Becker, Ph.D
>> Computational Biologist
>> Genentech Research
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>
>
--
Gabriel Becker, Ph.D
Computational Biologist
Genentech Research
[[alternative HTML version deleted]]
More information about the Bioc-devel
mailing list