[Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

Fri Jun 5 20:43:40 CEST 2015

On Thu, Jun 4, 2015 at 11:48 PM, Hervé Pagès <hpages at fredhutch.org> wrote:
> I also think that we're heading towards something like that.
>
> So genome(gr) <- "hg19" would:
>
>   (a) Add any missing information to the seqinfo.
>   (b) Sort the seqlevels in "canonical" order.
>   (c) Change the seqlevels style to UCSC style if they are not.
>
> The 3 tasks are orthogonal. I guess most of the times people want
> an easy way to perform them all at once.
>
> We could easily support (a) and (b). This assumes that the current
> seqlevels are already valid hg19 seqlevels:
>
>   si1 <- Seqinfo(c("chrX", "chrUn_gl000249", "chr2", "chr6_cox_hap2"))
>   gr1 <- GRanges(seqinfo=si1)
>   hg19_si <- Seqinfo(genome="hg19")
>
>   ## (a):
>   seqinfo(gr1) <- merge(seqinfo(gr1), hg19_si)[seqlevels(gr1)]
>   seqinfo(gr1)
>   # Seqinfo object with 4 sequences (1 circular) from hg19 genome:
>   #   seqnames       seqlengths isCircular genome
>   #   chrX            155270560      FALSE   hg19
>   #   chrUn_gl000249      38502      FALSE   hg19
>   #   chr2            243199373      FALSE   hg19
>   #   chr6_cox_hap2     4795371      FALSE   hg19
>
>   ## (b):
>   seqlevels(gr1) <- intersect(seqlevels(hg19_si), seqlevels(gr1))
>   seqinfo(gr1)
>   # Seqinfo object with 4 sequences (1 circular) from hg19 genome:
>   #   seqnames       seqlengths isCircular genome
>   #   chr2            243199373      FALSE   hg19
>   #   chrX            155270560      FALSE   hg19
>   #   chr6_cox_hap2     4795371      FALSE   hg19
>   #   chrUn_gl000249      38502      FALSE   hg19
>
> (c) is harder because seqlevelsStyle() doesn't know how to rename
> scaffolds yet:
>
>   si2 <- Seqinfo(c("X", "HSCHRUN_RANDOM_CTG42", "2", "HSCHR6_MHC_COX_CTG1"))
>   gr2 <- GRanges(seqinfo=si2)
>
>   seqlevelsStyle(gr2)
>   # [1] "NCBI"
>
>   seqlevelsStyle(gr2) <- "UCSC"
>   seqlevels(gr2)
>   # [1] "chrX"                 "HSCHRUN_RANDOM_CTG42" "chr2"
>   # [4] "HSCHR6_MHC_COX_CTG1"
>
> So we need to work on this.
>
> I'm not sure about using genome(gr) <- "hg19" for this. Right now
> it sets the genome column of the seqinfo with the supplied string
> and nothing else. Aren't there valid use cases for this?

Not sure. People would almost always want the seqname style and order
to be consistent with the given genome.

> What about
> using seqinfo(gr) <- "hg19" instead? It kind of suggests that the
> whole seqinfo component actually gets filled.
>

Yea, but "genome" is so intuitive compared to "seqinfo".

> H.
>
> On 06/04/2015 06:30 PM, Tim Triche, Jr. wrote:
>>
>> that's kind of always been my goal...
>>
>>
>> Statistics is the grammar of science.
>> Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
>>
>> On Thu, Jun 4, 2015 at 6:29 PM, Michael Lawrence
>> <lawrence.michael at gene.com <mailto:lawrence.michael at gene.com>> wrote:
>>
>>     Maybe this could eventually support setting the seqinfo with:
>>
>>     genome(gr) <- "hg19"
>>
>>     Or is that being too clever?
>>
>>     On Thu, Jun 4, 2015 at 4:28 PM, Hervé Pagès <hpages at fredhutch.org
>>     <mailto:hpages at fredhutch.org>> wrote:
>>      > Hi,
>>      >
>>      > FWIW I started to work on supporting quick generation of a
>> standalone
>>      > Seqinfo object via Seqinfo(genome="hg38") in GenomeInfoDb.
>>      >
>>      > It already supports hg38, hg19, hg18, panTro4, panTro3, panTro2,
>>      > bosTau8, bosTau7, bosTau6, canFam3, canFam2, canFam1, musFur1,
>> mm10,
>>      > mm9, mm8, susScr3, susScr2, rn6, rheMac3, rheMac2, galGal4,
>> galGal3,
>>      > gasAcu1, danRer7, apiMel2, dm6, dm3, ce10, ce6, ce4, ce2, sacCer3,
>>      > and sacCer2. I'll add more.
>>      >
>>      > See ?Seqinfo for some examples.
>>      >
>>      > Right now it fetches the information from internet every time you
>>      > call it but maybe we should just store that information in the
>>      > GenomeInfoDb package as Tim suggested?
>>      >
>>      > H.
>>      >
>>      >
>>      > On 06/03/2015 12:54 PM, Tim Triche, Jr. wrote:
>>      >>
>>      >> That would be perfect actually.  And it would radically reduce &
>>      >> modularize maintenance.  Maybe that's the best way to go after
>>     all.  Quite
>>      >> sensible.
>>      >>
>>      >> --t
>>      >>
>>      >>> On Jun 3, 2015, at 12:46 PM, Vincent Carey
>>     <stvjc at channing.harvard.edu <mailto:stvjc at channing.harvard.edu>>
>>      >>> wrote:
>>      >>>
>>      >>> It really isn't hard to have multiple OrganismDb packages in
>>     place -- the
>>      >>> process of making new ones is documented and was given as an
>>     exercise in
>>      >>> the EdX course.  I don't know if we want to institutionalize it
>> and
>>      >>> distribute such -- I think we might, so that there would be
>>     Hs19, Hs38,
>>      >>> mm9, etc. packages.  They have very little content, they just
>>     coordinate
>>      >>> interactions with packages that you'll already have.
>>      >>>
>>      >>> On Wed, Jun 3, 2015 at 3:26 PM, Tim Triche, Jr.
>>     <tim.triche at gmail.com <mailto:tim.triche at gmail.com>>
>>
>>      >>> wrote:
>>      >>>
>>      >>>> Right, I typically do that too, and if you're working on human
>>     data it
>>      >>>> isn't a big deal.  What makes things a lot more of a drag is
>>     when you
>>      >>>> work
>>      >>>> on e.g. mouse data (mm9 vs mm10, aka GRCm37 vs GRCm38) where
>>      >>>> Mus.musculus
>>      >>>> is essentially a "build ahead" of Homo.sapiens.
>>      >>>>
>>      >>>> R> seqinfo(Homo.sapiens)
>>      >>>> Seqinfo object with 93 sequences (1 circular) from hg19 genome
>>      >>>>
>>      >>>> R> seqinfo(Mus.musculus)
>>      >>>> Seqinfo object with 66 sequences (1 circular) from mm10 genome:
>>      >>>>
>>      >>>> It's not as explicit as directly assigning the seqinfo from a
>>     genome
>>      >>>> that
>>      >>>> corresponds to that of your annotations/results/whatever.  I
>>     know we
>>      >>>> could
>>      >>>> all use crossmap or liftOver or whatever, but that's not
>>     really the
>>      >>>> same,
>>      >>>> and it takes time, whereas assigning the proper seqinfo for
>>      >>>> relationships
>>      >>>> is very fast.
>>      >>>>
>>      >>>> That's all I was getting at...
>>      >>>>
>>      >>>>
>>      >>>> Statistics is the grammar of science.
>>      >>>> Karl Pearson
>> <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
>>      >>>>
>>      >>>> On Wed, Jun 3, 2015 at 12:17 PM, Vincent Carey
>>      >>>> <stvjc at channing.harvard.edu <mailto:stvjc at channing.harvard.edu>
>>      >>>>>
>>      >>>>> wrote:
>>      >>>>
>>      >>>>
>>      >>>>> I typically get this info from Homo.sapiens.  The result is
>>     parasitic
>>      >>>>> on
>>      >>>>> the TxDb that is in there.  I don't know how easy it is to swap
>>      >>>>> alternate
>>      >>>>> TxDb in to get a different build.  I think it would make sense
>> to
>>      >>>>> regard
>>      >>>>> the OrganismDb instances as foundational for this sort of
>>     structural
>>      >>>>> data.
>>      >>>>>
>>      >>>>> On Wed, Jun 3, 2015 at 3:12 PM, Kasper Daniel Hansen <
>>      >>>>> kasperdanielhansen at gmail.com
>>     <mailto:kasperdanielhansen at gmail.com>> wrote:
>>      >>>>>
>>      >>>>>> Let me rephrase this slightly.  From one POV the purpose of
>>      >>>>>> GenomeInfoDb
>>      >>>>>> is
>>      >>>>>> clean up the seqinfo slot.  Currently it does most of the
>>     cleaning,
>>      >>>>>> but
>>      >>>>>> it
>>      >>>>>> does not add seqlengths.
>>      >>>>>>
>>      >>>>>> It is clear that seqlengths depends on the version of the
>>     genome, but
>>      >>>>>> I
>>      >>>>>> will argue so does the seqnames.  Of course, for human,
>>     chr22 will not
>>      >>>>>> change.  But what about the names of all the random
>>     contigs?  Or for
>>      >>>>>> other
>>      >>>>>> organisms, what about going from a draft genome with 10k
>>     contigs to a
>>      >>>>>> more
>>      >>>>>> completely genome assembled into fewer, larger chromosomes.
>>      >>>>>>
>>      >>>>>> I acknowledge that this information is present in the BSgenome
>>      >>>>>> packages,
>>      >>>>>> but it seems (to me) to be very appropriate to have them
>>     around for
>>      >>>>>> cleaning up the seqinfo slot.  For some situations it is not
>>     great to
>>      >>>>>> depend on 1 GB> download for something that is a few bytes.
>>      >>>>>>
>>      >>>>>> Best,
>>      >>>>>> Kasper
>>      >>>>>>
>>      >>>>>> On Wed, Jun 3, 2015 at 3:00 PM, Tim Triche, Jr.
>>     <tim.triche at gmail.com <mailto:tim.triche at gmail.com>>
>>
>>      >>>>>> wrote:
>>      >>>>>>
>>      >>>>>>> It would be nice (for a number of reasons) to have
>>     chromosome lengths
>>      >>>>>>> readily available in a foundational package like
>>     GenomeInfoDb, so
>>      >>>>>>> that,
>>      >>>>>>> say,
>>      >>>>>>>
>>      >>>>>>> data(seqinfo.hg19)
>>      >>>>>>> seqinfo(myResults) <- seqinfo.hg19[ seqlevels(myResults) ]
>>      >>>>>>>
>>      >>>>>>> would work without issues.  Is there any particular reason
>> this
>>      >>>>>>
>>      >>>>>> couldn't
>>      >>>>>>>
>>      >>>>>>> happen for the supported/available BSgenomes?  It would
>>     seem like a
>>      >>>>>>
>>      >>>>>> simple
>>      >>>>>>>
>>      >>>>>>> matter to do
>>      >>>>>>>
>>      >>>>>>> R> library(BSgenome.Hsapiens.UCSC.hg19)
>>      >>>>>>> R> seqinfo.hg19 <- seqinfo(Hsapiens)
>>      >>>>>>> R> save(seqinfo.hg19,
>>      >>>>>>> file="~/bioc-devel/GenomeInfoDb/data/seqinfo.hg19.rda")
>>      >>>>>>>
>>      >>>>>>> and be done with it until (say) the next release or next
>>     released
>>      >>>>>>> BSgenome.  I considered looping through the following
>> BSgenomes
>>      >>>>>>
>>      >>>>>> myself...
>>      >>>>>>>
>>      >>>>>>> and if it isn't strongly opposed by (everyone) I may still
>>     do exactly
>>      >>>>>>> that.  Seems useful, no?
>>      >>>>>>>
>>      >>>>>>> e.g. for the following 42 builds,
>>      >>>>>>>
>>      >>>>>>> grep("(UCSC|NCBI)", unique(gsub(".masked", "",
>>     available.genomes())),
>>      >>>>>>> value=TRUE)
>>      >>>>>>> [1] "BSgenome.Amellifera.UCSC.apiMel2"
>>      >>>>>>
>>      >>>>>> "BSgenome.Btaurus.UCSC.bosTau3"
>>      >>>>>>>
>>      >>>>>>>
>>      >>>>>>> [3] "BSgenome.Btaurus.UCSC.bosTau4"
>>      >>>>>>
>>      >>>>>> "BSgenome.Btaurus.UCSC.bosTau6"
>>      >>>>>>>
>>      >>>>>>>
>>      >>>>>>> [5] "BSgenome.Btaurus.UCSC.bosTau8"
>>      >>>>>>> "BSgenome.Celegans.UCSC.ce10"
>>      >>>>>>>
>>      >>>>>>> [7] "BSgenome.Celegans.UCSC.ce2"
>>       "BSgenome.Celegans.UCSC.ce6"
>>      >>>>>>>
>>      >>>>>>> [9] "BSgenome.Cfamiliaris.UCSC.canFam2"
>>      >>>>>>> "BSgenome.Cfamiliaris.UCSC.canFam3"
>>      >>>>>>> [11] "BSgenome.Dmelanogaster.UCSC.dm2"
>>      >>>>>>> "BSgenome.Dmelanogaster.UCSC.dm3"
>>      >>>>>>> [13] "BSgenome.Dmelanogaster.UCSC.dm6"
>>      >>>>>>
>>      >>>>>> "BSgenome.Drerio.UCSC.danRer5"
>>      >>>>>>>
>>      >>>>>>>
>>      >>>>>>> [15] "BSgenome.Drerio.UCSC.danRer6"
>>      >>>>>>
>>      >>>>>> "BSgenome.Drerio.UCSC.danRer7"
>>      >>>>>>>
>>      >>>>>>>
>>      >>>>>>> [17] "BSgenome.Ecoli.NCBI.20080805"
>>      >>>>>>> "BSgenome.Gaculeatus.UCSC.gasAcu1"
>>      >>>>>>> [19] "BSgenome.Ggallus.UCSC.galGal3"
>>      >>>>>>
>>      >>>>>> "BSgenome.Ggallus.UCSC.galGal4"
>>      >>>>>>>
>>      >>>>>>>
>>      >>>>>>> [21] "BSgenome.Hsapiens.NCBI.GRCh38"
>>      >>>>>>> "BSgenome.Hsapiens.UCSC.hg17"
>>      >>>>>>>
>>      >>>>>>> [23] "BSgenome.Hsapiens.UCSC.hg18"
>>      >>>>>>> "BSgenome.Hsapiens.UCSC.hg19"
>>      >>>>>>>
>>      >>>>>>> [25] "BSgenome.Hsapiens.UCSC.hg38"
>>      >>>>>>> "BSgenome.Mfascicularis.NCBI.5.0"
>>      >>>>>>> [27] "BSgenome.Mfuro.UCSC.musFur1"
>>      >>>>>>
>>      >>>>>> "BSgenome.Mmulatta.UCSC.rheMac2"
>>      >>>>>>>
>>      >>>>>>>
>>      >>>>>>> [29] "BSgenome.Mmulatta.UCSC.rheMac3"
>>      >>>>>>
>>      >>>>>> "BSgenome.Mmusculus.UCSC.mm10"
>>      >>>>>>>
>>      >>>>>>>
>>      >>>>>>> [31] "BSgenome.Mmusculus.UCSC.mm8"
>>      >>>>>>> "BSgenome.Mmusculus.UCSC.mm9"
>>      >>>>>>>
>>      >>>>>>> [33] "BSgenome.Ptroglodytes.UCSC.panTro2"
>>      >>>>>>> "BSgenome.Ptroglodytes.UCSC.panTro3"
>>      >>>>>>> [35] "BSgenome.Rnorvegicus.UCSC.rn4"
>>      >>>>>>
>>      >>>>>> "BSgenome.Rnorvegicus.UCSC.rn5"
>>      >>>>>>>
>>      >>>>>>>
>>      >>>>>>> [37] "BSgenome.Rnorvegicus.UCSC.rn6"
>>      >>>>>>> "BSgenome.Scerevisiae.UCSC.sacCer1"
>>      >>>>>>> [39] "BSgenome.Scerevisiae.UCSC.sacCer2"
>>      >>>>>>> "BSgenome.Scerevisiae.UCSC.sacCer3"
>>      >>>>>>> [41] "BSgenome.Sscrofa.UCSC.susScr3"
>>      >>>>>>
>>      >>>>>> "BSgenome.Tguttata.UCSC.taeGut1"
>>      >>>>>>>
>>      >>>>>>>
>>      >>>>>>>
>>      >>>>>>>
>>      >>>>>>>
>>      >>>>>>> Am I insane for suggesting this?  It would make things a
>> little
>>      >>>>>>> easier
>>      >>>>>>
>>      >>>>>> for
>>      >>>>>>>
>>      >>>>>>> rtracklayer, most SummarizedExperiment and SE-derived
>>     objects, blah,
>>      >>>>>>
>>      >>>>>> blah,
>>      >>>>>>>
>>      >>>>>>> blah...
>>      >>>>>>>
>>      >>>>>>>
>>      >>>>>>> Best,
>>      >>>>>>>
>>      >>>>>>> --t
>>      >>>>>>>
>>      >>>>>>>
>>      >>>>>>>
>>      >>>>>>>
>>      >>>>>>> Statistics is the grammar of science.
>>      >>>>>>> Karl Pearson
>>     <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
>>      >>>>>>
>>      >>>>>>
>>      >>>>>>         [[alternative HTML version deleted]]
>>      >>>>>>
>>      >>>>>> _______________________________________________
>>      >>>>>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>>     mailing list
>>      >>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>      >>>
>>      >>>
>>      >>>     [[alternative HTML version deleted]]
>>      >>>
>>      >>> _______________________________________________
>>      >>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>>     mailing list
>>      >>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>      >>
>>      >>
>>      >> _______________________________________________
>>      >> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>>     mailing list
>>      >> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>      >>
>>      >
>>      > --
>>      > Hervé Pagès
>>      >
>>      > Program in Computational Biology
>>      > Division of Public Health Sciences
>>      > Fred Hutchinson Cancer Research Center
>>      > 1100 Fairview Ave. N, M1-B514
>>      > P.O. Box 19024
>>      > Seattle, WA 98109-1024
>>      >
>>      > E-mail: hpages at fredhutch.org <mailto:hpages at fredhutch.org>
>>      > Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>>      > Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>>      >
>>      >
>>      > _______________________________________________
>>      > Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>>     mailing list
>>      > https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fredhutch.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319