[Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?
Michael Lawrence
lawrence.michael at gene.com
Fri Jun 5 20:43:40 CEST 2015
On Thu, Jun 4, 2015 at 11:48 PM, Hervé Pagès <hpages at fredhutch.org> wrote:
> I also think that we're heading towards something like that.
>
> So genome(gr) <- "hg19" would:
>
> (a) Add any missing information to the seqinfo.
> (b) Sort the seqlevels in "canonical" order.
> (c) Change the seqlevels style to UCSC style if they are not.
>
> The 3 tasks are orthogonal. I guess most of the times people want
> an easy way to perform them all at once.
>
> We could easily support (a) and (b). This assumes that the current
> seqlevels are already valid hg19 seqlevels:
>
> si1 <- Seqinfo(c("chrX", "chrUn_gl000249", "chr2", "chr6_cox_hap2"))
> gr1 <- GRanges(seqinfo=si1)
> hg19_si <- Seqinfo(genome="hg19")
>
> ## (a):
> seqinfo(gr1) <- merge(seqinfo(gr1), hg19_si)[seqlevels(gr1)]
> seqinfo(gr1)
> # Seqinfo object with 4 sequences (1 circular) from hg19 genome:
> # seqnames seqlengths isCircular genome
> # chrX 155270560 FALSE hg19
> # chrUn_gl000249 38502 FALSE hg19
> # chr2 243199373 FALSE hg19
> # chr6_cox_hap2 4795371 FALSE hg19
>
> ## (b):
> seqlevels(gr1) <- intersect(seqlevels(hg19_si), seqlevels(gr1))
> seqinfo(gr1)
> # Seqinfo object with 4 sequences (1 circular) from hg19 genome:
> # seqnames seqlengths isCircular genome
> # chr2 243199373 FALSE hg19
> # chrX 155270560 FALSE hg19
> # chr6_cox_hap2 4795371 FALSE hg19
> # chrUn_gl000249 38502 FALSE hg19
>
> (c) is harder because seqlevelsStyle() doesn't know how to rename
> scaffolds yet:
>
> si2 <- Seqinfo(c("X", "HSCHRUN_RANDOM_CTG42", "2", "HSCHR6_MHC_COX_CTG1"))
> gr2 <- GRanges(seqinfo=si2)
>
> seqlevelsStyle(gr2)
> # [1] "NCBI"
>
> seqlevelsStyle(gr2) <- "UCSC"
> seqlevels(gr2)
> # [1] "chrX" "HSCHRUN_RANDOM_CTG42" "chr2"
> # [4] "HSCHR6_MHC_COX_CTG1"
>
> So we need to work on this.
>
> I'm not sure about using genome(gr) <- "hg19" for this. Right now
> it sets the genome column of the seqinfo with the supplied string
> and nothing else. Aren't there valid use cases for this?
Not sure. People would almost always want the seqname style and order
to be consistent with the given genome.
> What about
> using seqinfo(gr) <- "hg19" instead? It kind of suggests that the
> whole seqinfo component actually gets filled.
>
Yea, but "genome" is so intuitive compared to "seqinfo".
> H.
>
> On 06/04/2015 06:30 PM, Tim Triche, Jr. wrote:
>>
>> that's kind of always been my goal...
>>
>>
>> Statistics is the grammar of science.
>> Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
>>
>> On Thu, Jun 4, 2015 at 6:29 PM, Michael Lawrence
>> <lawrence.michael at gene.com <mailto:lawrence.michael at gene.com>> wrote:
>>
>> Maybe this could eventually support setting the seqinfo with:
>>
>> genome(gr) <- "hg19"
>>
>> Or is that being too clever?
>>
>> On Thu, Jun 4, 2015 at 4:28 PM, Hervé Pagès <hpages at fredhutch.org
>> <mailto:hpages at fredhutch.org>> wrote:
>> > Hi,
>> >
>> > FWIW I started to work on supporting quick generation of a
>> standalone
>> > Seqinfo object via Seqinfo(genome="hg38") in GenomeInfoDb.
>> >
>> > It already supports hg38, hg19, hg18, panTro4, panTro3, panTro2,
>> > bosTau8, bosTau7, bosTau6, canFam3, canFam2, canFam1, musFur1,
>> mm10,
>> > mm9, mm8, susScr3, susScr2, rn6, rheMac3, rheMac2, galGal4,
>> galGal3,
>> > gasAcu1, danRer7, apiMel2, dm6, dm3, ce10, ce6, ce4, ce2, sacCer3,
>> > and sacCer2. I'll add more.
>> >
>> > See ?Seqinfo for some examples.
>> >
>> > Right now it fetches the information from internet every time you
>> > call it but maybe we should just store that information in the
>> > GenomeInfoDb package as Tim suggested?
>> >
>> > H.
>> >
>> >
>> > On 06/03/2015 12:54 PM, Tim Triche, Jr. wrote:
>> >>
>> >> That would be perfect actually. And it would radically reduce &
>> >> modularize maintenance. Maybe that's the best way to go after
>> all. Quite
>> >> sensible.
>> >>
>> >> --t
>> >>
>> >>> On Jun 3, 2015, at 12:46 PM, Vincent Carey
>> <stvjc at channing.harvard.edu <mailto:stvjc at channing.harvard.edu>>
>> >>> wrote:
>> >>>
>> >>> It really isn't hard to have multiple OrganismDb packages in
>> place -- the
>> >>> process of making new ones is documented and was given as an
>> exercise in
>> >>> the EdX course. I don't know if we want to institutionalize it
>> and
>> >>> distribute such -- I think we might, so that there would be
>> Hs19, Hs38,
>> >>> mm9, etc. packages. They have very little content, they just
>> coordinate
>> >>> interactions with packages that you'll already have.
>> >>>
>> >>> On Wed, Jun 3, 2015 at 3:26 PM, Tim Triche, Jr.
>> <tim.triche at gmail.com <mailto:tim.triche at gmail.com>>
>>
>> >>> wrote:
>> >>>
>> >>>> Right, I typically do that too, and if you're working on human
>> data it
>> >>>> isn't a big deal. What makes things a lot more of a drag is
>> when you
>> >>>> work
>> >>>> on e.g. mouse data (mm9 vs mm10, aka GRCm37 vs GRCm38) where
>> >>>> Mus.musculus
>> >>>> is essentially a "build ahead" of Homo.sapiens.
>> >>>>
>> >>>> R> seqinfo(Homo.sapiens)
>> >>>> Seqinfo object with 93 sequences (1 circular) from hg19 genome
>> >>>>
>> >>>> R> seqinfo(Mus.musculus)
>> >>>> Seqinfo object with 66 sequences (1 circular) from mm10 genome:
>> >>>>
>> >>>> It's not as explicit as directly assigning the seqinfo from a
>> genome
>> >>>> that
>> >>>> corresponds to that of your annotations/results/whatever. I
>> know we
>> >>>> could
>> >>>> all use crossmap or liftOver or whatever, but that's not
>> really the
>> >>>> same,
>> >>>> and it takes time, whereas assigning the proper seqinfo for
>> >>>> relationships
>> >>>> is very fast.
>> >>>>
>> >>>> That's all I was getting at...
>> >>>>
>> >>>>
>> >>>> Statistics is the grammar of science.
>> >>>> Karl Pearson
>> <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
>> >>>>
>> >>>> On Wed, Jun 3, 2015 at 12:17 PM, Vincent Carey
>> >>>> <stvjc at channing.harvard.edu <mailto:stvjc at channing.harvard.edu>
>> >>>>>
>> >>>>> wrote:
>> >>>>
>> >>>>
>> >>>>> I typically get this info from Homo.sapiens. The result is
>> parasitic
>> >>>>> on
>> >>>>> the TxDb that is in there. I don't know how easy it is to swap
>> >>>>> alternate
>> >>>>> TxDb in to get a different build. I think it would make sense
>> to
>> >>>>> regard
>> >>>>> the OrganismDb instances as foundational for this sort of
>> structural
>> >>>>> data.
>> >>>>>
>> >>>>> On Wed, Jun 3, 2015 at 3:12 PM, Kasper Daniel Hansen <
>> >>>>> kasperdanielhansen at gmail.com
>> <mailto:kasperdanielhansen at gmail.com>> wrote:
>> >>>>>
>> >>>>>> Let me rephrase this slightly. From one POV the purpose of
>> >>>>>> GenomeInfoDb
>> >>>>>> is
>> >>>>>> clean up the seqinfo slot. Currently it does most of the
>> cleaning,
>> >>>>>> but
>> >>>>>> it
>> >>>>>> does not add seqlengths.
>> >>>>>>
>> >>>>>> It is clear that seqlengths depends on the version of the
>> genome, but
>> >>>>>> I
>> >>>>>> will argue so does the seqnames. Of course, for human,
>> chr22 will not
>> >>>>>> change. But what about the names of all the random
>> contigs? Or for
>> >>>>>> other
>> >>>>>> organisms, what about going from a draft genome with 10k
>> contigs to a
>> >>>>>> more
>> >>>>>> completely genome assembled into fewer, larger chromosomes.
>> >>>>>>
>> >>>>>> I acknowledge that this information is present in the BSgenome
>> >>>>>> packages,
>> >>>>>> but it seems (to me) to be very appropriate to have them
>> around for
>> >>>>>> cleaning up the seqinfo slot. For some situations it is not
>> great to
>> >>>>>> depend on 1 GB> download for something that is a few bytes.
>> >>>>>>
>> >>>>>> Best,
>> >>>>>> Kasper
>> >>>>>>
>> >>>>>> On Wed, Jun 3, 2015 at 3:00 PM, Tim Triche, Jr.
>> <tim.triche at gmail.com <mailto:tim.triche at gmail.com>>
>>
>> >>>>>> wrote:
>> >>>>>>
>> >>>>>>> It would be nice (for a number of reasons) to have
>> chromosome lengths
>> >>>>>>> readily available in a foundational package like
>> GenomeInfoDb, so
>> >>>>>>> that,
>> >>>>>>> say,
>> >>>>>>>
>> >>>>>>> data(seqinfo.hg19)
>> >>>>>>> seqinfo(myResults) <- seqinfo.hg19[ seqlevels(myResults) ]
>> >>>>>>>
>> >>>>>>> would work without issues. Is there any particular reason
>> this
>> >>>>>>
>> >>>>>> couldn't
>> >>>>>>>
>> >>>>>>> happen for the supported/available BSgenomes? It would
>> seem like a
>> >>>>>>
>> >>>>>> simple
>> >>>>>>>
>> >>>>>>> matter to do
>> >>>>>>>
>> >>>>>>> R> library(BSgenome.Hsapiens.UCSC.hg19)
>> >>>>>>> R> seqinfo.hg19 <- seqinfo(Hsapiens)
>> >>>>>>> R> save(seqinfo.hg19,
>> >>>>>>> file="~/bioc-devel/GenomeInfoDb/data/seqinfo.hg19.rda")
>> >>>>>>>
>> >>>>>>> and be done with it until (say) the next release or next
>> released
>> >>>>>>> BSgenome. I considered looping through the following
>> BSgenomes
>> >>>>>>
>> >>>>>> myself...
>> >>>>>>>
>> >>>>>>> and if it isn't strongly opposed by (everyone) I may still
>> do exactly
>> >>>>>>> that. Seems useful, no?
>> >>>>>>>
>> >>>>>>> e.g. for the following 42 builds,
>> >>>>>>>
>> >>>>>>> grep("(UCSC|NCBI)", unique(gsub(".masked", "",
>> available.genomes())),
>> >>>>>>> value=TRUE)
>> >>>>>>> [1] "BSgenome.Amellifera.UCSC.apiMel2"
>> >>>>>>
>> >>>>>> "BSgenome.Btaurus.UCSC.bosTau3"
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> [3] "BSgenome.Btaurus.UCSC.bosTau4"
>> >>>>>>
>> >>>>>> "BSgenome.Btaurus.UCSC.bosTau6"
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> [5] "BSgenome.Btaurus.UCSC.bosTau8"
>> >>>>>>> "BSgenome.Celegans.UCSC.ce10"
>> >>>>>>>
>> >>>>>>> [7] "BSgenome.Celegans.UCSC.ce2"
>> "BSgenome.Celegans.UCSC.ce6"
>> >>>>>>>
>> >>>>>>> [9] "BSgenome.Cfamiliaris.UCSC.canFam2"
>> >>>>>>> "BSgenome.Cfamiliaris.UCSC.canFam3"
>> >>>>>>> [11] "BSgenome.Dmelanogaster.UCSC.dm2"
>> >>>>>>> "BSgenome.Dmelanogaster.UCSC.dm3"
>> >>>>>>> [13] "BSgenome.Dmelanogaster.UCSC.dm6"
>> >>>>>>
>> >>>>>> "BSgenome.Drerio.UCSC.danRer5"
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> [15] "BSgenome.Drerio.UCSC.danRer6"
>> >>>>>>
>> >>>>>> "BSgenome.Drerio.UCSC.danRer7"
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> [17] "BSgenome.Ecoli.NCBI.20080805"
>> >>>>>>> "BSgenome.Gaculeatus.UCSC.gasAcu1"
>> >>>>>>> [19] "BSgenome.Ggallus.UCSC.galGal3"
>> >>>>>>
>> >>>>>> "BSgenome.Ggallus.UCSC.galGal4"
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> [21] "BSgenome.Hsapiens.NCBI.GRCh38"
>> >>>>>>> "BSgenome.Hsapiens.UCSC.hg17"
>> >>>>>>>
>> >>>>>>> [23] "BSgenome.Hsapiens.UCSC.hg18"
>> >>>>>>> "BSgenome.Hsapiens.UCSC.hg19"
>> >>>>>>>
>> >>>>>>> [25] "BSgenome.Hsapiens.UCSC.hg38"
>> >>>>>>> "BSgenome.Mfascicularis.NCBI.5.0"
>> >>>>>>> [27] "BSgenome.Mfuro.UCSC.musFur1"
>> >>>>>>
>> >>>>>> "BSgenome.Mmulatta.UCSC.rheMac2"
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> [29] "BSgenome.Mmulatta.UCSC.rheMac3"
>> >>>>>>
>> >>>>>> "BSgenome.Mmusculus.UCSC.mm10"
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> [31] "BSgenome.Mmusculus.UCSC.mm8"
>> >>>>>>> "BSgenome.Mmusculus.UCSC.mm9"
>> >>>>>>>
>> >>>>>>> [33] "BSgenome.Ptroglodytes.UCSC.panTro2"
>> >>>>>>> "BSgenome.Ptroglodytes.UCSC.panTro3"
>> >>>>>>> [35] "BSgenome.Rnorvegicus.UCSC.rn4"
>> >>>>>>
>> >>>>>> "BSgenome.Rnorvegicus.UCSC.rn5"
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> [37] "BSgenome.Rnorvegicus.UCSC.rn6"
>> >>>>>>> "BSgenome.Scerevisiae.UCSC.sacCer1"
>> >>>>>>> [39] "BSgenome.Scerevisiae.UCSC.sacCer2"
>> >>>>>>> "BSgenome.Scerevisiae.UCSC.sacCer3"
>> >>>>>>> [41] "BSgenome.Sscrofa.UCSC.susScr3"
>> >>>>>>
>> >>>>>> "BSgenome.Tguttata.UCSC.taeGut1"
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> Am I insane for suggesting this? It would make things a
>> little
>> >>>>>>> easier
>> >>>>>>
>> >>>>>> for
>> >>>>>>>
>> >>>>>>> rtracklayer, most SummarizedExperiment and SE-derived
>> objects, blah,
>> >>>>>>
>> >>>>>> blah,
>> >>>>>>>
>> >>>>>>> blah...
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> Best,
>> >>>>>>>
>> >>>>>>> --t
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> Statistics is the grammar of science.
>> >>>>>>> Karl Pearson
>> <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
>> >>>>>>
>> >>>>>>
>> >>>>>> [[alternative HTML version deleted]]
>> >>>>>>
>> >>>>>> _______________________________________________
>> >>>>>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>> mailing list
>> >>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>> >>>
>> >>>
>> >>> [[alternative HTML version deleted]]
>> >>>
>> >>> _______________________________________________
>> >>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>> mailing list
>> >>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>> >>
>> >>
>> >> _______________________________________________
>> >> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>> mailing list
>> >> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>> >>
>> >
>> > --
>> > Hervé Pagès
>> >
>> > Program in Computational Biology
>> > Division of Public Health Sciences
>> > Fred Hutchinson Cancer Research Center
>> > 1100 Fairview Ave. N, M1-B514
>> > P.O. Box 19024
>> > Seattle, WA 98109-1024
>> >
>> > E-mail: hpages at fredhutch.org <mailto:hpages at fredhutch.org>
>> > Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>> > Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>> >
>> >
>> > _______________________________________________
>> > Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>> mailing list
>> > https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fredhutch.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
More information about the Bioc-devel
mailing list