[Bioc-devel] A more flexible GenomeInfoDb::mapSeqlevels(): used supported info but don't break with new organisms/toy examples
Leonardo Collado Torres
lcollado at jhu.edu
Wed Oct 22 05:47:50 CEST 2014
Hello,
In my `derfinder` package I have been relying on `GenomeInfoDb` to
correctly name the sequence names. Given
https://support.bioconductor.org/p/62136/ I now realize that I was
expecting too much from `GenomeInfoDb`.
For example, in some cases I was expecting it to guess that a single
'2' meant that the naming style was NCBI. This is true for Homo
sapiens but now I realize that it's not necessary true for Arabidopsis
thaliana.
> GenomeInfoDb:::.guessSpeciesStyle('2')
[1] "Arabidopsis thaliana" "NCBI"
## Could have been Homo sapiens (NCBI) or Arabidopbis thaliana TAIR10 or NCBI
> GenomeInfoDb:::.supportedSeqnameMappings()[['Homo_sapiens']][2, ]
NCBI UCSC dbSNP
2 2 chr2 ch2
> GenomeInfoDb:::.supportedSeqnameMappings()[['Arabidopsis_thaliana']][2, ]
NCBI TAIR10
2 2 2
I was also expecting mapSeqlevels() to return the same input if the
specified 'style' was the same as the one currently being used. For
example, if I was working with Homo sapiens NCBI style and attempted
to map to NCBI, I was expecting the same output as the input.
## More than 1 output
> mapSeqlevels('2', 'NCBI')
2
[1,] "2"
[2,] "LGII"
## Turns out there is another organism that would make this mapping valid
> GenomeInfoDb:::.supportedSeqnameMappings()[['Populus_trichocarpa']][2, ]
NCBI JGI2.F
2 LGII 2
I also noticed that `GenomeInfoDb` couldn't work with some fictional
examples. For example:
## Expected error, but this meant that `derfinder` couldn't use toy
examples from other
## packages.
> seqlevelsStyle('seq1')
Error in seqlevelsStyle("seq1") :
The style does not have a compatible entry for the species supported
by Seqname. Please see genomeStyles() for supported
species/style
## Using toy data from Rsamtools fails as expected
> fl <- system.file("extdata", "ex1.bam", package="Rsamtools",
+ mustWork=TRUE)
> library(GenomicAlignments)
> names(scanBamHeader(BamFile(fl))$targets)
[1] "seq1" "seq2"
> seqlevelsStyle(names(scanBamHeader(BamFile(fl))$targets))
Error in seqlevelsStyle(names(scanBamHeader(BamFile(fl))$targets)) :
The style does not have a compatible entry for the species supported
by Seqname. Please see genomeStyles() for supported
species/style
So, I wanted a way to use `GenomeInfoDb` to make the correct naming
style maps for supported genomes, and simply return the original
sequence names if using an unsupported species/mapping or if the
original and target styles are the same.
That's what I basically did when I wrote 'extendedMapSeqlevels' (aye,
long name) which you can see at
https://gist.github.com/3e6f0e2e2bb67ee910da If you think that others
might be interested in such a function, then by all means please add
it to `GenomeInfoDb` and I'll import it. You'll notice how it tries to
guess the species and/or current naming style when that information is
not provided. When it's used with verbose = TRUE it will show a
message explaining why the mapping failed (if it did), it'll show the
link to the `GenomeInfoDb` vignette for adding an organism, and then
return the original sequence names.
I have a slightly modified version in derfinder
https://github.com/lcolladotor/derfinder/blob/master/R/extendedMapSeqlevels.R
or https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/derfinder/R/extendedMapSeqlevels.R
where I simply assume some other default values and guarantee
continuity with how I was doing things previously.
Naturally, R CMD check now shows a NOTE because I'm using the :::
syntax for un-exported functions from `GenomeInfoDb`.
Cheers,
Leo
Leonardo Collado Torres, PhD Candidate
Department of Biostatistics
Johns Hopkins University
Bloomberg School of Public Health
Website: http://www.biostat.jhsph.edu/~lcollado/
Blog: http://lcolladotor.github.io/
More information about the Bioc-devel
mailing list