[Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

Fri Jun 5 20:50:06 CEST 2015

On 06/05/2015 11:43 AM, Michael Lawrence wrote:
> On Thu, Jun 4, 2015 at 11:48 PM, Hervé Pagès <hpages at fredhutch.org> wrote:
>> I also think that we're heading towards something like that.
>>
>> So genome(gr) <- "hg19" would:
>>
>>    (a) Add any missing information to the seqinfo.
>>    (b) Sort the seqlevels in "canonical" order.
>>    (c) Change the seqlevels style to UCSC style if they are not.
>>
>> The 3 tasks are orthogonal. I guess most of the times people want
>> an easy way to perform them all at once.
>>
>> We could easily support (a) and (b). This assumes that the current
>> seqlevels are already valid hg19 seqlevels:
>>
>>    si1 <- Seqinfo(c("chrX", "chrUn_gl000249", "chr2", "chr6_cox_hap2"))
>>    gr1 <- GRanges(seqinfo=si1)
>>    hg19_si <- Seqinfo(genome="hg19")
>>
>>    ## (a):
>>    seqinfo(gr1) <- merge(seqinfo(gr1), hg19_si)[seqlevels(gr1)]
>>    seqinfo(gr1)
>>    # Seqinfo object with 4 sequences (1 circular) from hg19 genome:
>>    #   seqnames       seqlengths isCircular genome
>>    #   chrX            155270560      FALSE   hg19
>>    #   chrUn_gl000249      38502      FALSE   hg19
>>    #   chr2            243199373      FALSE   hg19
>>    #   chr6_cox_hap2     4795371      FALSE   hg19
>>
>>    ## (b):
>>    seqlevels(gr1) <- intersect(seqlevels(hg19_si), seqlevels(gr1))
>>    seqinfo(gr1)
>>    # Seqinfo object with 4 sequences (1 circular) from hg19 genome:
>>    #   seqnames       seqlengths isCircular genome
>>    #   chr2            243199373      FALSE   hg19
>>    #   chrX            155270560      FALSE   hg19
>>    #   chr6_cox_hap2     4795371      FALSE   hg19
>>    #   chrUn_gl000249      38502      FALSE   hg19
>>
>> (c) is harder because seqlevelsStyle() doesn't know how to rename
>> scaffolds yet:
>>
>>    si2 <- Seqinfo(c("X", "HSCHRUN_RANDOM_CTG42", "2", "HSCHR6_MHC_COX_CTG1"))
>>    gr2 <- GRanges(seqinfo=si2)
>>
>>    seqlevelsStyle(gr2)
>>    # [1] "NCBI"
>>
>>    seqlevelsStyle(gr2) <- "UCSC"
>>    seqlevels(gr2)
>>    # [1] "chrX"                 "HSCHRUN_RANDOM_CTG42" "chr2"
>>    # [4] "HSCHR6_MHC_COX_CTG1"
>>
>> So we need to work on this.
>>
>> I'm not sure about using genome(gr) <- "hg19" for this. Right now
>> it sets the genome column of the seqinfo with the supplied string
>> and nothing else. Aren't there valid use cases for this?
>
> Not sure. People would almost always want the seqname style and order
> to be consistent with the given genome.

Agreed but my worry is that when they don't, then they would be left
with no way to just set the genome column.

H.

>
>> What about
>> using seqinfo(gr) <- "hg19" instead? It kind of suggests that the
>> whole seqinfo component actually gets filled.
>>
>
> Yea, but "genome" is so intuitive compared to "seqinfo".
>
>
>
>> H.
>>
>> On 06/04/2015 06:30 PM, Tim Triche, Jr. wrote:
>>>
>>> that's kind of always been my goal...
>>>
>>>
>>> Statistics is the grammar of science.
>>> Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
>>>
>>> On Thu, Jun 4, 2015 at 6:29 PM, Michael Lawrence
>>> <lawrence.michael at gene.com <mailto:lawrence.michael at gene.com>> wrote:
>>>
>>>      Maybe this could eventually support setting the seqinfo with:
>>>
>>>      genome(gr) <- "hg19"
>>>
>>>      Or is that being too clever?
>>>
>>>      On Thu, Jun 4, 2015 at 4:28 PM, Hervé Pagès <hpages at fredhutch.org
>>>      <mailto:hpages at fredhutch.org>> wrote:
>>>       > Hi,
>>>       >
>>>       > FWIW I started to work on supporting quick generation of a
>>> standalone
>>>       > Seqinfo object via Seqinfo(genome="hg38") in GenomeInfoDb.
>>>       >
>>>       > It already supports hg38, hg19, hg18, panTro4, panTro3, panTro2,
>>>       > bosTau8, bosTau7, bosTau6, canFam3, canFam2, canFam1, musFur1,
>>> mm10,
>>>       > mm9, mm8, susScr3, susScr2, rn6, rheMac3, rheMac2, galGal4,
>>> galGal3,
>>>       > gasAcu1, danRer7, apiMel2, dm6, dm3, ce10, ce6, ce4, ce2, sacCer3,
>>>       > and sacCer2. I'll add more.
>>>       >
>>>       > See ?Seqinfo for some examples.
>>>       >
>>>       > Right now it fetches the information from internet every time you
>>>       > call it but maybe we should just store that information in the
>>>       > GenomeInfoDb package as Tim suggested?
>>>       >
>>>       > H.
>>>       >
>>>       >
>>>       > On 06/03/2015 12:54 PM, Tim Triche, Jr. wrote:
>>>       >>
>>>       >> That would be perfect actually.  And it would radically reduce &
>>>       >> modularize maintenance.  Maybe that's the best way to go after
>>>      all.  Quite
>>>       >> sensible.
>>>       >>
>>>       >> --t
>>>       >>
>>>       >>> On Jun 3, 2015, at 12:46 PM, Vincent Carey
>>>      <stvjc at channing.harvard.edu <mailto:stvjc at channing.harvard.edu>>
>>>       >>> wrote:
>>>       >>>
>>>       >>> It really isn't hard to have multiple OrganismDb packages in
>>>      place -- the
>>>       >>> process of making new ones is documented and was given as an
>>>      exercise in
>>>       >>> the EdX course.  I don't know if we want to institutionalize it
>>> and
>>>       >>> distribute such -- I think we might, so that there would be
>>>      Hs19, Hs38,
>>>       >>> mm9, etc. packages.  They have very little content, they just
>>>      coordinate
>>>       >>> interactions with packages that you'll already have.
>>>       >>>
>>>       >>> On Wed, Jun 3, 2015 at 3:26 PM, Tim Triche, Jr.
>>>      <tim.triche at gmail.com <mailto:tim.triche at gmail.com>>
>>>
>>>       >>> wrote:
>>>       >>>
>>>       >>>> Right, I typically do that too, and if you're working on human
>>>      data it
>>>       >>>> isn't a big deal.  What makes things a lot more of a drag is
>>>      when you
>>>       >>>> work
>>>       >>>> on e.g. mouse data (mm9 vs mm10, aka GRCm37 vs GRCm38) where
>>>       >>>> Mus.musculus
>>>       >>>> is essentially a "build ahead" of Homo.sapiens.
>>>       >>>>
>>>       >>>> R> seqinfo(Homo.sapiens)
>>>       >>>> Seqinfo object with 93 sequences (1 circular) from hg19 genome
>>>       >>>>
>>>       >>>> R> seqinfo(Mus.musculus)
>>>       >>>> Seqinfo object with 66 sequences (1 circular) from mm10 genome:
>>>       >>>>
>>>       >>>> It's not as explicit as directly assigning the seqinfo from a
>>>      genome
>>>       >>>> that
>>>       >>>> corresponds to that of your annotations/results/whatever.  I
>>>      know we
>>>       >>>> could
>>>       >>>> all use crossmap or liftOver or whatever, but that's not
>>>      really the
>>>       >>>> same,
>>>       >>>> and it takes time, whereas assigning the proper seqinfo for
>>>       >>>> relationships
>>>       >>>> is very fast.
>>>       >>>>
>>>       >>>> That's all I was getting at...
>>>       >>>>
>>>       >>>>
>>>       >>>> Statistics is the grammar of science.
>>>       >>>> Karl Pearson
>>> <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
>>>       >>>>
>>>       >>>> On Wed, Jun 3, 2015 at 12:17 PM, Vincent Carey
>>>       >>>> <stvjc at channing.harvard.edu <mailto:stvjc at channing.harvard.edu>
>>>       >>>>>
>>>       >>>>> wrote:
>>>       >>>>
>>>       >>>>
>>>       >>>>> I typically get this info from Homo.sapiens.  The result is
>>>      parasitic
>>>       >>>>> on
>>>       >>>>> the TxDb that is in there.  I don't know how easy it is to swap
>>>       >>>>> alternate
>>>       >>>>> TxDb in to get a different build.  I think it would make sense
>>> to
>>>       >>>>> regard
>>>       >>>>> the OrganismDb instances as foundational for this sort of
>>>      structural
>>>       >>>>> data.
>>>       >>>>>
>>>       >>>>> On Wed, Jun 3, 2015 at 3:12 PM, Kasper Daniel Hansen <
>>>       >>>>> kasperdanielhansen at gmail.com
>>>      <mailto:kasperdanielhansen at gmail.com>> wrote:
>>>       >>>>>
>>>       >>>>>> Let me rephrase this slightly.  From one POV the purpose of
>>>       >>>>>> GenomeInfoDb
>>>       >>>>>> is
>>>       >>>>>> clean up the seqinfo slot.  Currently it does most of the
>>>      cleaning,
>>>       >>>>>> but
>>>       >>>>>> it
>>>       >>>>>> does not add seqlengths.
>>>       >>>>>>
>>>       >>>>>> It is clear that seqlengths depends on the version of the
>>>      genome, but
>>>       >>>>>> I
>>>       >>>>>> will argue so does the seqnames.  Of course, for human,
>>>      chr22 will not
>>>       >>>>>> change.  But what about the names of all the random
>>>      contigs?  Or for
>>>       >>>>>> other
>>>       >>>>>> organisms, what about going from a draft genome with 10k
>>>      contigs to a
>>>       >>>>>> more
>>>       >>>>>> completely genome assembled into fewer, larger chromosomes.
>>>       >>>>>>
>>>       >>>>>> I acknowledge that this information is present in the BSgenome
>>>       >>>>>> packages,
>>>       >>>>>> but it seems (to me) to be very appropriate to have them
>>>      around for
>>>       >>>>>> cleaning up the seqinfo slot.  For some situations it is not
>>>      great to
>>>       >>>>>> depend on 1 GB> download for something that is a few bytes.
>>>       >>>>>>
>>>       >>>>>> Best,
>>>       >>>>>> Kasper
>>>       >>>>>>
>>>       >>>>>> On Wed, Jun 3, 2015 at 3:00 PM, Tim Triche, Jr.
>>>      <tim.triche at gmail.com <mailto:tim.triche at gmail.com>>
>>>
>>>       >>>>>> wrote:
>>>       >>>>>>
>>>       >>>>>>> It would be nice (for a number of reasons) to have
>>>      chromosome lengths
>>>       >>>>>>> readily available in a foundational package like
>>>      GenomeInfoDb, so
>>>       >>>>>>> that,
>>>       >>>>>>> say,
>>>       >>>>>>>
>>>       >>>>>>> data(seqinfo.hg19)
>>>       >>>>>>> seqinfo(myResults) <- seqinfo.hg19[ seqlevels(myResults) ]
>>>       >>>>>>>
>>>       >>>>>>> would work without issues.  Is there any particular reason
>>> this
>>>       >>>>>>
>>>       >>>>>> couldn't
>>>       >>>>>>>
>>>       >>>>>>> happen for the supported/available BSgenomes?  It would
>>>      seem like a
>>>       >>>>>>
>>>       >>>>>> simple
>>>       >>>>>>>
>>>       >>>>>>> matter to do
>>>       >>>>>>>
>>>       >>>>>>> R> library(BSgenome.Hsapiens.UCSC.hg19)
>>>       >>>>>>> R> seqinfo.hg19 <- seqinfo(Hsapiens)
>>>       >>>>>>> R> save(seqinfo.hg19,
>>>       >>>>>>> file="~/bioc-devel/GenomeInfoDb/data/seqinfo.hg19.rda")
>>>       >>>>>>>
>>>       >>>>>>> and be done with it until (say) the next release or next
>>>      released
>>>       >>>>>>> BSgenome.  I considered looping through the following
>>> BSgenomes
>>>       >>>>>>
>>>       >>>>>> myself...
>>>       >>>>>>>
>>>       >>>>>>> and if it isn't strongly opposed by (everyone) I may still
>>>      do exactly
>>>       >>>>>>> that.  Seems useful, no?
>>>       >>>>>>>
>>>       >>>>>>> e.g. for the following 42 builds,
>>>       >>>>>>>
>>>       >>>>>>> grep("(UCSC|NCBI)", unique(gsub(".masked", "",
>>>      available.genomes())),
>>>       >>>>>>> value=TRUE)
>>>       >>>>>>> [1] "BSgenome.Amellifera.UCSC.apiMel2"
>>>       >>>>>>
>>>       >>>>>> "BSgenome.Btaurus.UCSC.bosTau3"
>>>       >>>>>>>
>>>       >>>>>>>
>>>       >>>>>>> [3] "BSgenome.Btaurus.UCSC.bosTau4"
>>>       >>>>>>
>>>       >>>>>> "BSgenome.Btaurus.UCSC.bosTau6"
>>>       >>>>>>>
>>>       >>>>>>>
>>>       >>>>>>> [5] "BSgenome.Btaurus.UCSC.bosTau8"
>>>       >>>>>>> "BSgenome.Celegans.UCSC.ce10"
>>>       >>>>>>>
>>>       >>>>>>> [7] "BSgenome.Celegans.UCSC.ce2"
>>>        "BSgenome.Celegans.UCSC.ce6"
>>>       >>>>>>>
>>>       >>>>>>> [9] "BSgenome.Cfamiliaris.UCSC.canFam2"
>>>       >>>>>>> "BSgenome.Cfamiliaris.UCSC.canFam3"
>>>       >>>>>>> [11] "BSgenome.Dmelanogaster.UCSC.dm2"
>>>       >>>>>>> "BSgenome.Dmelanogaster.UCSC.dm3"
>>>       >>>>>>> [13] "BSgenome.Dmelanogaster.UCSC.dm6"
>>>       >>>>>>
>>>       >>>>>> "BSgenome.Drerio.UCSC.danRer5"
>>>       >>>>>>>
>>>       >>>>>>>
>>>       >>>>>>> [15] "BSgenome.Drerio.UCSC.danRer6"
>>>       >>>>>>
>>>       >>>>>> "BSgenome.Drerio.UCSC.danRer7"
>>>       >>>>>>>
>>>       >>>>>>>
>>>       >>>>>>> [17] "BSgenome.Ecoli.NCBI.20080805"
>>>       >>>>>>> "BSgenome.Gaculeatus.UCSC.gasAcu1"
>>>       >>>>>>> [19] "BSgenome.Ggallus.UCSC.galGal3"
>>>       >>>>>>
>>>       >>>>>> "BSgenome.Ggallus.UCSC.galGal4"
>>>       >>>>>>>
>>>       >>>>>>>
>>>       >>>>>>> [21] "BSgenome.Hsapiens.NCBI.GRCh38"
>>>       >>>>>>> "BSgenome.Hsapiens.UCSC.hg17"
>>>       >>>>>>>
>>>       >>>>>>> [23] "BSgenome.Hsapiens.UCSC.hg18"
>>>       >>>>>>> "BSgenome.Hsapiens.UCSC.hg19"
>>>       >>>>>>>
>>>       >>>>>>> [25] "BSgenome.Hsapiens.UCSC.hg38"
>>>       >>>>>>> "BSgenome.Mfascicularis.NCBI.5.0"
>>>       >>>>>>> [27] "BSgenome.Mfuro.UCSC.musFur1"
>>>       >>>>>>
>>>       >>>>>> "BSgenome.Mmulatta.UCSC.rheMac2"
>>>       >>>>>>>
>>>       >>>>>>>
>>>       >>>>>>> [29] "BSgenome.Mmulatta.UCSC.rheMac3"
>>>       >>>>>>
>>>       >>>>>> "BSgenome.Mmusculus.UCSC.mm10"
>>>       >>>>>>>
>>>       >>>>>>>
>>>       >>>>>>> [31] "BSgenome.Mmusculus.UCSC.mm8"
>>>       >>>>>>> "BSgenome.Mmusculus.UCSC.mm9"
>>>       >>>>>>>
>>>       >>>>>>> [33] "BSgenome.Ptroglodytes.UCSC.panTro2"
>>>       >>>>>>> "BSgenome.Ptroglodytes.UCSC.panTro3"
>>>       >>>>>>> [35] "BSgenome.Rnorvegicus.UCSC.rn4"
>>>       >>>>>>
>>>       >>>>>> "BSgenome.Rnorvegicus.UCSC.rn5"
>>>       >>>>>>>
>>>       >>>>>>>
>>>       >>>>>>> [37] "BSgenome.Rnorvegicus.UCSC.rn6"
>>>       >>>>>>> "BSgenome.Scerevisiae.UCSC.sacCer1"
>>>       >>>>>>> [39] "BSgenome.Scerevisiae.UCSC.sacCer2"
>>>       >>>>>>> "BSgenome.Scerevisiae.UCSC.sacCer3"
>>>       >>>>>>> [41] "BSgenome.Sscrofa.UCSC.susScr3"
>>>       >>>>>>
>>>       >>>>>> "BSgenome.Tguttata.UCSC.taeGut1"
>>>       >>>>>>>
>>>       >>>>>>>
>>>       >>>>>>>
>>>       >>>>>>>
>>>       >>>>>>>
>>>       >>>>>>> Am I insane for suggesting this?  It would make things a
>>> little
>>>       >>>>>>> easier
>>>       >>>>>>
>>>       >>>>>> for
>>>       >>>>>>>
>>>       >>>>>>> rtracklayer, most SummarizedExperiment and SE-derived
>>>      objects, blah,
>>>       >>>>>>
>>>       >>>>>> blah,
>>>       >>>>>>>
>>>       >>>>>>> blah...
>>>       >>>>>>>
>>>       >>>>>>>
>>>       >>>>>>> Best,
>>>       >>>>>>>
>>>       >>>>>>> --t
>>>       >>>>>>>
>>>       >>>>>>>
>>>       >>>>>>>
>>>       >>>>>>>
>>>       >>>>>>> Statistics is the grammar of science.
>>>       >>>>>>> Karl Pearson
>>>      <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
>>>       >>>>>>
>>>       >>>>>>
>>>       >>>>>>         [[alternative HTML version deleted]]
>>>       >>>>>>
>>>       >>>>>> _______________________________________________
>>>       >>>>>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>>>      mailing list
>>>       >>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>       >>>
>>>       >>>
>>>       >>>     [[alternative HTML version deleted]]
>>>       >>>
>>>       >>> _______________________________________________
>>>       >>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>>>      mailing list
>>>       >>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>       >>
>>>       >>
>>>       >> _______________________________________________
>>>       >> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>>>      mailing list
>>>       >> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>       >>
>>>       >
>>>       > --
>>>       > Hervé Pagès
>>>       >
>>>       > Program in Computational Biology
>>>       > Division of Public Health Sciences
>>>       > Fred Hutchinson Cancer Research Center
>>>       > 1100 Fairview Ave. N, M1-B514
>>>       > P.O. Box 19024
>>>       > Seattle, WA 98109-1024
>>>       >
>>>       > E-mail: hpages at fredhutch.org <mailto:hpages at fredhutch.org>
>>>       > Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>>>       > Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>>>       >
>>>       >
>>>       > _______________________________________________
>>>       > Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>>>      mailing list
>>>       > https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fredhutch.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319