[Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

Fri Jun 5 21:54:23 CEST 2015

In WGBS we frequently sequence a human with spikein from the lambda
genome.  In this case, most of the chromosomes of the Granges are from
human, except one.  This is a usecase where genome(GR) is not constant.  I
suggest, partly for compatibility, to keep genome, but perhaps do something
like
  standardizeGenome()
or something like this.

I would indeed love, love, love a function which just cleans it up.

Kasper

On Fri, Jun 5, 2015 at 2:51 PM, Gabe Becker <becker.gabe at gene.com> wrote:

> Herve,
>
> This is probably a naive question, but what usecases are there for creating
> an object with the wrong seqinfo for its genome?
>
> ~G
>
> On Fri, Jun 5, 2015 at 11:43 AM, Michael Lawrence <
> lawrence.michael at gene.com
> > wrote:
>
> > On Thu, Jun 4, 2015 at 11:48 PM, Hervé Pagès <hpages at fredhutch.org>
> wrote:
> > > I also think that we're heading towards something like that.
> > >
> > > So genome(gr) <- "hg19" would:
> > >
> > >   (a) Add any missing information to the seqinfo.
> > >   (b) Sort the seqlevels in "canonical" order.
> > >   (c) Change the seqlevels style to UCSC style if they are not.
> > >
> > > The 3 tasks are orthogonal. I guess most of the times people want
> > > an easy way to perform them all at once.
> > >
> > > We could easily support (a) and (b). This assumes that the current
> > > seqlevels are already valid hg19 seqlevels:
> > >
> > >   si1 <- Seqinfo(c("chrX", "chrUn_gl000249", "chr2", "chr6_cox_hap2"))
> > >   gr1 <- GRanges(seqinfo=si1)
> > >   hg19_si <- Seqinfo(genome="hg19")
> > >
> > >   ## (a):
> > >   seqinfo(gr1) <- merge(seqinfo(gr1), hg19_si)[seqlevels(gr1)]
> > >   seqinfo(gr1)
> > >   # Seqinfo object with 4 sequences (1 circular) from hg19 genome:
> > >   #   seqnames       seqlengths isCircular genome
> > >   #   chrX            155270560      FALSE   hg19
> > >   #   chrUn_gl000249      38502      FALSE   hg19
> > >   #   chr2            243199373      FALSE   hg19
> > >   #   chr6_cox_hap2     4795371      FALSE   hg19
> > >
> > >   ## (b):
> > >   seqlevels(gr1) <- intersect(seqlevels(hg19_si), seqlevels(gr1))
> > >   seqinfo(gr1)
> > >   # Seqinfo object with 4 sequences (1 circular) from hg19 genome:
> > >   #   seqnames       seqlengths isCircular genome
> > >   #   chr2            243199373      FALSE   hg19
> > >   #   chrX            155270560      FALSE   hg19
> > >   #   chr6_cox_hap2     4795371      FALSE   hg19
> > >   #   chrUn_gl000249      38502      FALSE   hg19
> > >
> > > (c) is harder because seqlevelsStyle() doesn't know how to rename
> > > scaffolds yet:
> > >
> > >   si2 <- Seqinfo(c("X", "HSCHRUN_RANDOM_CTG42", "2",
> > "HSCHR6_MHC_COX_CTG1"))
> > >   gr2 <- GRanges(seqinfo=si2)
> > >
> > >   seqlevelsStyle(gr2)
> > >   # [1] "NCBI"
> > >
> > >   seqlevelsStyle(gr2) <- "UCSC"
> > >   seqlevels(gr2)
> > >   # [1] "chrX"                 "HSCHRUN_RANDOM_CTG42" "chr2"
> > >   # [4] "HSCHR6_MHC_COX_CTG1"
> > >
> > > So we need to work on this.
> > >
> > > I'm not sure about using genome(gr) <- "hg19" for this. Right now
> > > it sets the genome column of the seqinfo with the supplied string
> > > and nothing else. Aren't there valid use cases for this?
> >
> > Not sure. People would almost always want the seqname style and order
> > to be consistent with the given genome.
> >
> > > What about
> > > using seqinfo(gr) <- "hg19" instead? It kind of suggests that the
> > > whole seqinfo component actually gets filled.
> > >
> >
> > Yea, but "genome" is so intuitive compared to "seqinfo".
> >
> >
> >
> > > H.
> > >
> > > On 06/04/2015 06:30 PM, Tim Triche, Jr. wrote:
> > >>
> > >> that's kind of always been my goal...
> > >>
> > >>
> > >> Statistics is the grammar of science.
> > >> Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
> > >>
> > >> On Thu, Jun 4, 2015 at 6:29 PM, Michael Lawrence
> > >> <lawrence.michael at gene.com <mailto:lawrence.michael at gene.com>> wrote:
> > >>
> > >>     Maybe this could eventually support setting the seqinfo with:
> > >>
> > >>     genome(gr) <- "hg19"
> > >>
> > >>     Or is that being too clever?
> > >>
> > >>     On Thu, Jun 4, 2015 at 4:28 PM, Hervé Pagès <hpages at fredhutch.org
> > >>     <mailto:hpages at fredhutch.org>> wrote:
> > >>      > Hi,
> > >>      >
> > >>      > FWIW I started to work on supporting quick generation of a
> > >> standalone
> > >>      > Seqinfo object via Seqinfo(genome="hg38") in GenomeInfoDb.
> > >>      >
> > >>      > It already supports hg38, hg19, hg18, panTro4, panTro3,
> panTro2,
> > >>      > bosTau8, bosTau7, bosTau6, canFam3, canFam2, canFam1, musFur1,
> > >> mm10,
> > >>      > mm9, mm8, susScr3, susScr2, rn6, rheMac3, rheMac2, galGal4,
> > >> galGal3,
> > >>      > gasAcu1, danRer7, apiMel2, dm6, dm3, ce10, ce6, ce4, ce2,
> > sacCer3,
> > >>      > and sacCer2. I'll add more.
> > >>      >
> > >>      > See ?Seqinfo for some examples.
> > >>      >
> > >>      > Right now it fetches the information from internet every time
> you
> > >>      > call it but maybe we should just store that information in the
> > >>      > GenomeInfoDb package as Tim suggested?
> > >>      >
> > >>      > H.
> > >>      >
> > >>      >
> > >>      > On 06/03/2015 12:54 PM, Tim Triche, Jr. wrote:
> > >>      >>
> > >>      >> That would be perfect actually.  And it would radically
> reduce &
> > >>      >> modularize maintenance.  Maybe that's the best way to go after
> > >>     all.  Quite
> > >>      >> sensible.
> > >>      >>
> > >>      >> --t
> > >>      >>
> > >>      >>> On Jun 3, 2015, at 12:46 PM, Vincent Carey
> > >>     <stvjc at channing.harvard.edu <mailto:stvjc at channing.harvard.edu>>
> > >>      >>> wrote:
> > >>      >>>
> > >>      >>> It really isn't hard to have multiple OrganismDb packages in
> > >>     place -- the
> > >>      >>> process of making new ones is documented and was given as an
> > >>     exercise in
> > >>      >>> the EdX course.  I don't know if we want to institutionalize
> it
> > >> and
> > >>      >>> distribute such -- I think we might, so that there would be
> > >>     Hs19, Hs38,
> > >>      >>> mm9, etc. packages.  They have very little content, they just
> > >>     coordinate
> > >>      >>> interactions with packages that you'll already have.
> > >>      >>>
> > >>      >>> On Wed, Jun 3, 2015 at 3:26 PM, Tim Triche, Jr.
> > >>     <tim.triche at gmail.com <mailto:tim.triche at gmail.com>>
> > >>
> > >>      >>> wrote:
> > >>      >>>
> > >>      >>>> Right, I typically do that too, and if you're working on
> human
> > >>     data it
> > >>      >>>> isn't a big deal.  What makes things a lot more of a drag is
> > >>     when you
> > >>      >>>> work
> > >>      >>>> on e.g. mouse data (mm9 vs mm10, aka GRCm37 vs GRCm38) where
> > >>      >>>> Mus.musculus
> > >>      >>>> is essentially a "build ahead" of Homo.sapiens.
> > >>      >>>>
> > >>      >>>> R> seqinfo(Homo.sapiens)
> > >>      >>>> Seqinfo object with 93 sequences (1 circular) from hg19
> genome
> > >>      >>>>
> > >>      >>>> R> seqinfo(Mus.musculus)
> > >>      >>>> Seqinfo object with 66 sequences (1 circular) from mm10
> > genome:
> > >>      >>>>
> > >>      >>>> It's not as explicit as directly assigning the seqinfo from
> a
> > >>     genome
> > >>      >>>> that
> > >>      >>>> corresponds to that of your annotations/results/whatever.  I
> > >>     know we
> > >>      >>>> could
> > >>      >>>> all use crossmap or liftOver or whatever, but that's not
> > >>     really the
> > >>      >>>> same,
> > >>      >>>> and it takes time, whereas assigning the proper seqinfo for
> > >>      >>>> relationships
> > >>      >>>> is very fast.
> > >>      >>>>
> > >>      >>>> That's all I was getting at...
> > >>      >>>>
> > >>      >>>>
> > >>      >>>> Statistics is the grammar of science.
> > >>      >>>> Karl Pearson
> > >> <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
> > >>      >>>>
> > >>      >>>> On Wed, Jun 3, 2015 at 12:17 PM, Vincent Carey
> > >>      >>>> <stvjc at channing.harvard.edu <mailto:
> > stvjc at channing.harvard.edu>
> > >>      >>>>>
> > >>      >>>>> wrote:
> > >>      >>>>
> > >>      >>>>
> > >>      >>>>> I typically get this info from Homo.sapiens.  The result is
> > >>     parasitic
> > >>      >>>>> on
> > >>      >>>>> the TxDb that is in there.  I don't know how easy it is to
> > swap
> > >>      >>>>> alternate
> > >>      >>>>> TxDb in to get a different build.  I think it would make
> > sense
> > >> to
> > >>      >>>>> regard
> > >>      >>>>> the OrganismDb instances as foundational for this sort of
> > >>     structural
> > >>      >>>>> data.
> > >>      >>>>>
> > >>      >>>>> On Wed, Jun 3, 2015 at 3:12 PM, Kasper Daniel Hansen <
> > >>      >>>>> kasperdanielhansen at gmail.com
> > >>     <mailto:kasperdanielhansen at gmail.com>> wrote:
> > >>      >>>>>
> > >>      >>>>>> Let me rephrase this slightly.  From one POV the purpose
> of
> > >>      >>>>>> GenomeInfoDb
> > >>      >>>>>> is
> > >>      >>>>>> clean up the seqinfo slot.  Currently it does most of the
> > >>     cleaning,
> > >>      >>>>>> but
> > >>      >>>>>> it
> > >>      >>>>>> does not add seqlengths.
> > >>      >>>>>>
> > >>      >>>>>> It is clear that seqlengths depends on the version of the
> > >>     genome, but
> > >>      >>>>>> I
> > >>      >>>>>> will argue so does the seqnames.  Of course, for human,
> > >>     chr22 will not
> > >>      >>>>>> change.  But what about the names of all the random
> > >>     contigs?  Or for
> > >>      >>>>>> other
> > >>      >>>>>> organisms, what about going from a draft genome with 10k
> > >>     contigs to a
> > >>      >>>>>> more
> > >>      >>>>>> completely genome assembled into fewer, larger
> chromosomes.
> > >>      >>>>>>
> > >>      >>>>>> I acknowledge that this information is present in the
> > BSgenome
> > >>      >>>>>> packages,
> > >>      >>>>>> but it seems (to me) to be very appropriate to have them
> > >>     around for
> > >>      >>>>>> cleaning up the seqinfo slot.  For some situations it is
> not
> > >>     great to
> > >>      >>>>>> depend on 1 GB> download for something that is a few
> bytes.
> > >>      >>>>>>
> > >>      >>>>>> Best,
> > >>      >>>>>> Kasper
> > >>      >>>>>>
> > >>      >>>>>> On Wed, Jun 3, 2015 at 3:00 PM, Tim Triche, Jr.
> > >>     <tim.triche at gmail.com <mailto:tim.triche at gmail.com>>
> > >>
> > >>      >>>>>> wrote:
> > >>      >>>>>>
> > >>      >>>>>>> It would be nice (for a number of reasons) to have
> > >>     chromosome lengths
> > >>      >>>>>>> readily available in a foundational package like
> > >>     GenomeInfoDb, so
> > >>      >>>>>>> that,
> > >>      >>>>>>> say,
> > >>      >>>>>>>
> > >>      >>>>>>> data(seqinfo.hg19)
> > >>      >>>>>>> seqinfo(myResults) <- seqinfo.hg19[ seqlevels(myResults)
> ]
> > >>      >>>>>>>
> > >>      >>>>>>> would work without issues.  Is there any particular
> reason
> > >> this
> > >>      >>>>>>
> > >>      >>>>>> couldn't
> > >>      >>>>>>>
> > >>      >>>>>>> happen for the supported/available BSgenomes?  It would
> > >>     seem like a
> > >>      >>>>>>
> > >>      >>>>>> simple
> > >>      >>>>>>>
> > >>      >>>>>>> matter to do
> > >>      >>>>>>>
> > >>      >>>>>>> R> library(BSgenome.Hsapiens.UCSC.hg19)
> > >>      >>>>>>> R> seqinfo.hg19 <- seqinfo(Hsapiens)
> > >>      >>>>>>> R> save(seqinfo.hg19,
> > >>      >>>>>>> file="~/bioc-devel/GenomeInfoDb/data/seqinfo.hg19.rda")
> > >>      >>>>>>>
> > >>      >>>>>>> and be done with it until (say) the next release or next
> > >>     released
> > >>      >>>>>>> BSgenome.  I considered looping through the following
> > >> BSgenomes
> > >>      >>>>>>
> > >>      >>>>>> myself...
> > >>      >>>>>>>
> > >>      >>>>>>> and if it isn't strongly opposed by (everyone) I may
> still
> > >>     do exactly
> > >>      >>>>>>> that.  Seems useful, no?
> > >>      >>>>>>>
> > >>      >>>>>>> e.g. for the following 42 builds,
> > >>      >>>>>>>
> > >>      >>>>>>> grep("(UCSC|NCBI)", unique(gsub(".masked", "",
> > >>     available.genomes())),
> > >>      >>>>>>> value=TRUE)
> > >>      >>>>>>> [1] "BSgenome.Amellifera.UCSC.apiMel2"
> > >>      >>>>>>
> > >>      >>>>>> "BSgenome.Btaurus.UCSC.bosTau3"
> > >>      >>>>>>>
> > >>      >>>>>>>
> > >>      >>>>>>> [3] "BSgenome.Btaurus.UCSC.bosTau4"
> > >>      >>>>>>
> > >>      >>>>>> "BSgenome.Btaurus.UCSC.bosTau6"
> > >>      >>>>>>>
> > >>      >>>>>>>
> > >>      >>>>>>> [5] "BSgenome.Btaurus.UCSC.bosTau8"
> > >>      >>>>>>> "BSgenome.Celegans.UCSC.ce10"
> > >>      >>>>>>>
> > >>      >>>>>>> [7] "BSgenome.Celegans.UCSC.ce2"
> > >>       "BSgenome.Celegans.UCSC.ce6"
> > >>      >>>>>>>
> > >>      >>>>>>> [9] "BSgenome.Cfamiliaris.UCSC.canFam2"
> > >>      >>>>>>> "BSgenome.Cfamiliaris.UCSC.canFam3"
> > >>      >>>>>>> [11] "BSgenome.Dmelanogaster.UCSC.dm2"
> > >>      >>>>>>> "BSgenome.Dmelanogaster.UCSC.dm3"
> > >>      >>>>>>> [13] "BSgenome.Dmelanogaster.UCSC.dm6"
> > >>      >>>>>>
> > >>      >>>>>> "BSgenome.Drerio.UCSC.danRer5"
> > >>      >>>>>>>
> > >>      >>>>>>>
> > >>      >>>>>>> [15] "BSgenome.Drerio.UCSC.danRer6"
> > >>      >>>>>>
> > >>      >>>>>> "BSgenome.Drerio.UCSC.danRer7"
> > >>      >>>>>>>
> > >>      >>>>>>>
> > >>      >>>>>>> [17] "BSgenome.Ecoli.NCBI.20080805"
> > >>      >>>>>>> "BSgenome.Gaculeatus.UCSC.gasAcu1"
> > >>      >>>>>>> [19] "BSgenome.Ggallus.UCSC.galGal3"
> > >>      >>>>>>
> > >>      >>>>>> "BSgenome.Ggallus.UCSC.galGal4"
> > >>      >>>>>>>
> > >>      >>>>>>>
> > >>      >>>>>>> [21] "BSgenome.Hsapiens.NCBI.GRCh38"
> > >>      >>>>>>> "BSgenome.Hsapiens.UCSC.hg17"
> > >>      >>>>>>>
> > >>      >>>>>>> [23] "BSgenome.Hsapiens.UCSC.hg18"
> > >>      >>>>>>> "BSgenome.Hsapiens.UCSC.hg19"
> > >>      >>>>>>>
> > >>      >>>>>>> [25] "BSgenome.Hsapiens.UCSC.hg38"
> > >>      >>>>>>> "BSgenome.Mfascicularis.NCBI.5.0"
> > >>      >>>>>>> [27] "BSgenome.Mfuro.UCSC.musFur1"
> > >>      >>>>>>
> > >>      >>>>>> "BSgenome.Mmulatta.UCSC.rheMac2"
> > >>      >>>>>>>
> > >>      >>>>>>>
> > >>      >>>>>>> [29] "BSgenome.Mmulatta.UCSC.rheMac3"
> > >>      >>>>>>
> > >>      >>>>>> "BSgenome.Mmusculus.UCSC.mm10"
> > >>      >>>>>>>
> > >>      >>>>>>>
> > >>      >>>>>>> [31] "BSgenome.Mmusculus.UCSC.mm8"
> > >>      >>>>>>> "BSgenome.Mmusculus.UCSC.mm9"
> > >>      >>>>>>>
> > >>      >>>>>>> [33] "BSgenome.Ptroglodytes.UCSC.panTro2"
> > >>      >>>>>>> "BSgenome.Ptroglodytes.UCSC.panTro3"
> > >>      >>>>>>> [35] "BSgenome.Rnorvegicus.UCSC.rn4"
> > >>      >>>>>>
> > >>      >>>>>> "BSgenome.Rnorvegicus.UCSC.rn5"
> > >>      >>>>>>>
> > >>      >>>>>>>
> > >>      >>>>>>> [37] "BSgenome.Rnorvegicus.UCSC.rn6"
> > >>      >>>>>>> "BSgenome.Scerevisiae.UCSC.sacCer1"
> > >>      >>>>>>> [39] "BSgenome.Scerevisiae.UCSC.sacCer2"
> > >>      >>>>>>> "BSgenome.Scerevisiae.UCSC.sacCer3"
> > >>      >>>>>>> [41] "BSgenome.Sscrofa.UCSC.susScr3"
> > >>      >>>>>>
> > >>      >>>>>> "BSgenome.Tguttata.UCSC.taeGut1"
> > >>      >>>>>>>
> > >>      >>>>>>>
> > >>      >>>>>>>
> > >>      >>>>>>>
> > >>      >>>>>>>
> > >>      >>>>>>> Am I insane for suggesting this?  It would make things a
> > >> little
> > >>      >>>>>>> easier
> > >>      >>>>>>
> > >>      >>>>>> for
> > >>      >>>>>>>
> > >>      >>>>>>> rtracklayer, most SummarizedExperiment and SE-derived
> > >>     objects, blah,
> > >>      >>>>>>
> > >>      >>>>>> blah,
> > >>      >>>>>>>
> > >>      >>>>>>> blah...
> > >>      >>>>>>>
> > >>      >>>>>>>
> > >>      >>>>>>> Best,
> > >>      >>>>>>>
> > >>      >>>>>>> --t
> > >>      >>>>>>>
> > >>      >>>>>>>
> > >>      >>>>>>>
> > >>      >>>>>>>
> > >>      >>>>>>> Statistics is the grammar of science.
> > >>      >>>>>>> Karl Pearson
> > >>     <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
> > >>      >>>>>>
> > >>      >>>>>>
> > >>      >>>>>>         [[alternative HTML version deleted]]
> > >>      >>>>>>
> > >>      >>>>>> _______________________________________________
> > >>      >>>>>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org
> >
> > >>     mailing list
> > >>      >>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> > >>      >>>
> > >>      >>>
> > >>      >>>     [[alternative HTML version deleted]]
> > >>      >>>
> > >>      >>> _______________________________________________
> > >>      >>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
> > >>     mailing list
> > >>      >>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> > >>      >>
> > >>      >>
> > >>      >> _______________________________________________
> > >>      >> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
> > >>     mailing list
> > >>      >> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> > >>      >>
> > >>      >
> > >>      > --
> > >>      > Hervé Pagès
> > >>      >
> > >>      > Program in Computational Biology
> > >>      > Division of Public Health Sciences
> > >>      > Fred Hutchinson Cancer Research Center
> > >>      > 1100 Fairview Ave. N, M1-B514
> > >>      > P.O. Box 19024
> > >>      > Seattle, WA 98109-1024
> > >>      >
> > >>      > E-mail: hpages at fredhutch.org <mailto:hpages at fredhutch.org>
> > >>      > Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
> > >>      > Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
> > >>      >
> > >>      >
> > >>      > _______________________________________________
> > >>      > Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
> > >>     mailing list
> > >>      > https://stat.ethz.ch/mailman/listinfo/bioc-devel
> > >>
> > >>
> > >
> > > --
> > > Hervé Pagès
> > >
> > > Program in Computational Biology
> > > Division of Public Health Sciences
> > > Fred Hutchinson Cancer Research Center
> > > 1100 Fairview Ave. N, M1-B514
> > > P.O. Box 19024
> > > Seattle, WA 98109-1024
> > >
> > > E-mail: hpages at fredhutch.org
> > > Phone:  (206) 667-5791
> > > Fax:    (206) 667-1319
> >
> > _______________________________________________
> > Bioc-devel at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/bioc-devel
> >
>
>
>
> --
> Gabriel Becker, Ph.D
> Computational Biologist
> Genentech Research
>
>         [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>

	[[alternative HTML version deleted]]