[Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

Vincent Carey stvjc at channing.harvard.edu
Wed Jun 3 21:17:30 CEST 2015


I typically get this info from Homo.sapiens.  The result is parasitic on
the TxDb that is in there.  I don't know how easy it is to swap in an alternate
TxDb to get a different build.  I think it would make sense to regard
the OrganismDb instances as foundational for this sort of structural data.
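
For illustration, a minimal sketch of pulling seqinfo from the OrganismDb
and attaching it to a results object.  It assumes the Homo.sapiens package
(and the hg19 TxDb it wraps by default) is installed; `myResults` here is a
hypothetical stand-in GRanges, not an object from this thread.

```r
library(Homo.sapiens)    # OrganismDb; assumed to wrap the hg19 knownGene TxDb
library(GenomicRanges)

# Seqinfo as reported by the TxDb inside the OrganismDb
si <- seqinfo(Homo.sapiens)
genome(si)["chr22"]
seqlengths(si)["chr22"]

# Hypothetical results object that starts out without seqlengths
myResults <- GRanges(c("chr1", "chr22"), IRanges(c(100, 200), width = 50))

# Subset the Seqinfo to the seqlevels actually present, then assign it
seqinfo(myResults) <- si[seqlevels(myResults)]
seqlengths(myResults)
```

The build you get this way is whatever the embedded TxDb reports, so it is
worth checking genome(si) before relying on the lengths.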

On Wed, Jun 3, 2015 at 3:12 PM, Kasper Daniel Hansen <
kasperdanielhansen at gmail.com> wrote:

> Let me rephrase this slightly.  From one POV the purpose of GenomeInfoDb is
> to clean up the seqinfo slot.  Currently it does most of the cleaning, but it
> does not add seqlengths.
>
> It is clear that seqlengths depend on the version of the genome, but I
> will argue so do the seqnames.  Of course, for human, chr22 will not
> change.  But what about the names of all the random contigs?  Or, for other
> organisms, what about going from a draft genome with 10k contigs to a more
> complete genome assembled into fewer, larger chromosomes?
>
> I acknowledge that this information is present in the BSgenome packages,
> but it seems (to me) to be very appropriate to have them around for
> cleaning up the seqinfo slot.  For some situations it is not great to
> depend on a 1 GB download for something that is a few bytes.
>
> Best,
> Kasper
>
> On Wed, Jun 3, 2015 at 3:00 PM, Tim Triche, Jr. <tim.triche at gmail.com>
> wrote:
>
> > It would be nice (for a number of reasons) to have chromosome lengths
> > readily available in a foundational package like GenomeInfoDb, so that,
> > say,
> >
> > data(seqinfo.hg19)
> > seqinfo(myResults) <- seqinfo.hg19[ seqlevels(myResults) ]
> >
> > would work without issues.  Is there any particular reason this couldn't
> > happen for the supported/available BSgenomes?  It would seem like a
> > simple matter to do
> >
> > R> library(BSgenome.Hsapiens.UCSC.hg19)
> > R> seqinfo.hg19 <- seqinfo(Hsapiens)
> > R> save(seqinfo.hg19, file="~/bioc-devel/GenomeInfoDb/data/seqinfo.hg19.rda")
> >
> > and be done with it until (say) the next release or next released
> > BSgenome.  I considered looping through the following BSgenomes myself...
> > and if it isn't strongly opposed by (everyone) I may still do exactly
> > that.  Seems useful, no?
> >
> > e.g. for the following 42 builds,
> >
> > grep("(UCSC|NCBI)", unique(gsub(".masked", "", available.genomes())), value=TRUE)
> >  [1] "BSgenome.Amellifera.UCSC.apiMel2"   "BSgenome.Btaurus.UCSC.bosTau3"
> >  [3] "BSgenome.Btaurus.UCSC.bosTau4"      "BSgenome.Btaurus.UCSC.bosTau6"
> >  [5] "BSgenome.Btaurus.UCSC.bosTau8"      "BSgenome.Celegans.UCSC.ce10"
> >  [7] "BSgenome.Celegans.UCSC.ce2"         "BSgenome.Celegans.UCSC.ce6"
> >  [9] "BSgenome.Cfamiliaris.UCSC.canFam2"  "BSgenome.Cfamiliaris.UCSC.canFam3"
> > [11] "BSgenome.Dmelanogaster.UCSC.dm2"    "BSgenome.Dmelanogaster.UCSC.dm3"
> > [13] "BSgenome.Dmelanogaster.UCSC.dm6"    "BSgenome.Drerio.UCSC.danRer5"
> > [15] "BSgenome.Drerio.UCSC.danRer6"       "BSgenome.Drerio.UCSC.danRer7"
> > [17] "BSgenome.Ecoli.NCBI.20080805"       "BSgenome.Gaculeatus.UCSC.gasAcu1"
> > [19] "BSgenome.Ggallus.UCSC.galGal3"      "BSgenome.Ggallus.UCSC.galGal4"
> > [21] "BSgenome.Hsapiens.NCBI.GRCh38"      "BSgenome.Hsapiens.UCSC.hg17"
> > [23] "BSgenome.Hsapiens.UCSC.hg18"        "BSgenome.Hsapiens.UCSC.hg19"
> > [25] "BSgenome.Hsapiens.UCSC.hg38"        "BSgenome.Mfascicularis.NCBI.5.0"
> > [27] "BSgenome.Mfuro.UCSC.musFur1"        "BSgenome.Mmulatta.UCSC.rheMac2"
> > [29] "BSgenome.Mmulatta.UCSC.rheMac3"     "BSgenome.Mmusculus.UCSC.mm10"
> > [31] "BSgenome.Mmusculus.UCSC.mm8"        "BSgenome.Mmusculus.UCSC.mm9"
> > [33] "BSgenome.Ptroglodytes.UCSC.panTro2" "BSgenome.Ptroglodytes.UCSC.panTro3"
> > [35] "BSgenome.Rnorvegicus.UCSC.rn4"      "BSgenome.Rnorvegicus.UCSC.rn5"
> > [37] "BSgenome.Rnorvegicus.UCSC.rn6"      "BSgenome.Scerevisiae.UCSC.sacCer1"
> > [39] "BSgenome.Scerevisiae.UCSC.sacCer2"  "BSgenome.Scerevisiae.UCSC.sacCer3"
> > [41] "BSgenome.Sscrofa.UCSC.susScr3"      "BSgenome.Tguttata.UCSC.taeGut1"
> >
> > Am I insane for suggesting this?  It would make things a little easier
> > for rtracklayer, most SummarizedExperiment and SE-derived objects, blah,
> > blah, blah...
> >
> >
> > Best,
> >
> > --t
> >
> >
> >
> >
> > Statistics is the grammar of science.
> > Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
> >