[Bioc-devel] chromosome lengths (seqinfo) for supported BSgenome builds into GenomeInfoDb?

Kasper Daniel Hansen kasperdanielhansen at gmail.com
Wed Jun 3 21:12:06 CEST 2015


Let me rephrase this slightly.  From one POV the purpose of GenomeInfoDb is
to clean up the seqinfo slot.  Currently it does most of the cleaning, but it
does not add seqlengths.

It is clear that seqlengths depend on the version of the genome, but I
will argue that so do the seqnames.  Of course, for human, chr22 will not
change.  But what about the names of all the random contigs?  Or, for other
organisms, what about going from a draft genome with 10k contigs to a more
complete genome assembled into fewer, larger chromosomes?

I acknowledge that this information is present in the BSgenome packages,
but it seems (to me) to be very appropriate to have them around for
cleaning up the seqinfo slot.  For some situations it is not great to
depend on a >1 GB download for something that is a few bytes.
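
To make the "few bytes" point concrete, here is a minimal sketch of the
kind of clean-up being asked for.  It assumes a small Seqinfo object is
available without any BSgenome package; the name seqinfo.hg19 is
hypothetical (borrowed from Tim's example below) and is built by hand here
with only two hg19 chromosomes, it is not an existing GenomeInfoDb data set.

```r
library(GenomicRanges)  # also loads GenomeInfoDb

# Hand-written stand-in for a hypothetical bundled 'seqinfo.hg19' object
# (assumption: only two hg19 chromosomes are listed, for illustration).
seqinfo.hg19 <- Seqinfo(seqnames   = c("chr21", "chr22"),
                        seqlengths = c(48129895L, 51304566L),
                        isCircular = c(FALSE, FALSE),
                        genome     = "hg19")

# A toy result object that knows its seqnames but not its seqlengths.
gr <- GRanges("chr22", IRanges(start = 1000000L, width = 100L))
seqlengths(gr)   # NA for chr22

# Fill in lengths and genome from the small lookup object.
seqinfo(gr) <- seqinfo.hg19[seqlevels(gr)]
seqlengths(gr)   # chr22: 51304566
```

The whole lookup object is a handful of integers and strings, which is the
contrast with pulling in a full sequence package just to get lengths.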

Best,
Kasper

On Wed, Jun 3, 2015 at 3:00 PM, Tim Triche, Jr. <tim.triche at gmail.com>
wrote:

> It would be nice (for a number of reasons) to have chromosome lengths
> readily available in a foundational package like GenomeInfoDb, so that,
> say,
>
> data(seqinfo.hg19)
> seqinfo(myResults) <- seqinfo.hg19[ seqlevels(myResults) ]
>
> would work without issues.  Is there any particular reason this couldn't
> happen for the supported/available BSgenomes?  It would seem like a simple
> matter to do
>
> R> library(BSgenome.Hsapiens.UCSC.hg19)
> R> seqinfo.hg19 <- seqinfo(Hsapiens)
> R> save(seqinfo.hg19,
>         file="~/bioc-devel/GenomeInfoDb/data/seqinfo.hg19.rda")
>
> and be done with it until (say) the next release or next released
> BSgenome.  I considered looping through the following BSgenomes myself...
> and if it isn't strongly opposed by (everyone) I may still do exactly
> that.  Seems useful, no?
>
> e.g. for the following 42 builds,
>
> grep("(UCSC|NCBI)", unique(gsub(".masked", "", available.genomes())),
> value=TRUE)
>  [1] "BSgenome.Amellifera.UCSC.apiMel2"   "BSgenome.Btaurus.UCSC.bosTau3"
>  [3] "BSgenome.Btaurus.UCSC.bosTau4"      "BSgenome.Btaurus.UCSC.bosTau6"
>  [5] "BSgenome.Btaurus.UCSC.bosTau8"      "BSgenome.Celegans.UCSC.ce10"
>  [7] "BSgenome.Celegans.UCSC.ce2"         "BSgenome.Celegans.UCSC.ce6"
>  [9] "BSgenome.Cfamiliaris.UCSC.canFam2"  "BSgenome.Cfamiliaris.UCSC.canFam3"
> [11] "BSgenome.Dmelanogaster.UCSC.dm2"    "BSgenome.Dmelanogaster.UCSC.dm3"
> [13] "BSgenome.Dmelanogaster.UCSC.dm6"    "BSgenome.Drerio.UCSC.danRer5"
> [15] "BSgenome.Drerio.UCSC.danRer6"       "BSgenome.Drerio.UCSC.danRer7"
> [17] "BSgenome.Ecoli.NCBI.20080805"       "BSgenome.Gaculeatus.UCSC.gasAcu1"
> [19] "BSgenome.Ggallus.UCSC.galGal3"      "BSgenome.Ggallus.UCSC.galGal4"
> [21] "BSgenome.Hsapiens.NCBI.GRCh38"      "BSgenome.Hsapiens.UCSC.hg17"
> [23] "BSgenome.Hsapiens.UCSC.hg18"        "BSgenome.Hsapiens.UCSC.hg19"
> [25] "BSgenome.Hsapiens.UCSC.hg38"        "BSgenome.Mfascicularis.NCBI.5.0"
> [27] "BSgenome.Mfuro.UCSC.musFur1"        "BSgenome.Mmulatta.UCSC.rheMac2"
> [29] "BSgenome.Mmulatta.UCSC.rheMac3"     "BSgenome.Mmusculus.UCSC.mm10"
> [31] "BSgenome.Mmusculus.UCSC.mm8"        "BSgenome.Mmusculus.UCSC.mm9"
> [33] "BSgenome.Ptroglodytes.UCSC.panTro2" "BSgenome.Ptroglodytes.UCSC.panTro3"
> [35] "BSgenome.Rnorvegicus.UCSC.rn4"      "BSgenome.Rnorvegicus.UCSC.rn5"
> [37] "BSgenome.Rnorvegicus.UCSC.rn6"      "BSgenome.Scerevisiae.UCSC.sacCer1"
> [39] "BSgenome.Scerevisiae.UCSC.sacCer2"  "BSgenome.Scerevisiae.UCSC.sacCer3"
> [41] "BSgenome.Sscrofa.UCSC.susScr3"      "BSgenome.Tguttata.UCSC.taeGut1"
>
> Am I insane for suggesting this?  It would make things a little easier for
> rtracklayer, most SummarizedExperiment and SE-derived objects, blah, blah,
> blah...
>
>
> Best,
>
> --t
>
>
>
>
> Statistics is the grammar of science.
> Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
>
