[Bioc-devel] A more flexible GenomeInfoDb::mapSeqlevels(): used supported info but don't break with new organisms/toy examples

Leonardo Collado Torres lcollado at jhu.edu
Wed Oct 22 05:47:50 CEST 2014


Hello,

In my `derfinder` package I have been relying on `GenomeInfoDb` to
correctly name the sequence names. Given
https://support.bioconductor.org/p/62136/ I now realize that I was
expecting too much from `GenomeInfoDb`.

For example, in some cases I was expecting it to guess that a single
'2' meant that the naming style was NCBI. This is true for Homo
sapiens but now I realize that it's not necessary true for Arabidopsis
thaliana.

> GenomeInfoDb:::.guessSpeciesStyle('2')
[1] "Arabidopsis thaliana" "NCBI"

## Could have been Homo sapiens (NCBI) or Arabidopbis thaliana TAIR10 or NCBI
> GenomeInfoDb:::.supportedSeqnameMappings()[['Homo_sapiens']][2, ]
  NCBI UCSC dbSNP
2    2 chr2   ch2
> GenomeInfoDb:::.supportedSeqnameMappings()[['Arabidopsis_thaliana']][2, ]
  NCBI TAIR10
2    2      2

I was also expecting mapSeqlevels() to return the same input if the
specified 'style' was the same as the one currently being used. For
example, if I was working with Homo sapiens NCBI style and attempted
to map to NCBI, I was expecting the same output as the input.

## More than 1 output
> mapSeqlevels('2', 'NCBI')
     2
[1,] "2"
[2,] "LGII"
## Turns out there is another organism that would make this mapping valid
> GenomeInfoDb:::.supportedSeqnameMappings()[['Populus_trichocarpa']][2, ]
  NCBI JGI2.F
2 LGII      2


I also noticed that `GenomeInfoDb` couldn't work with some fictional
examples. For example:

## Expected error, but this meant that `derfinder` couldn't use toy
examples from other
## packages.
> seqlevelsStyle('seq1')
Error in seqlevelsStyle("seq1") :
  The style does not have a compatible entry for the species supported
by Seqname. Please see genomeStyles() for supported
  species/style

## Using toy data from Rsamtools fails as expected
> fl <- system.file("extdata", "ex1.bam", package="Rsamtools",
+                   mustWork=TRUE)
> library(GenomicAlignments)
> names(scanBamHeader(BamFile(fl))$targets)
[1] "seq1" "seq2"
> seqlevelsStyle(names(scanBamHeader(BamFile(fl))$targets))
Error in seqlevelsStyle(names(scanBamHeader(BamFile(fl))$targets)) :
  The style does not have a compatible entry for the species supported
by Seqname. Please see genomeStyles() for supported
  species/style


So, I wanted a way to use `GenomeInfoDb` to make the correct naming
style maps for supported genomes, and simply return the original
sequence names if using an unsupported species/mapping or if the
original and target styles are the same.


That's what I basically did when I wrote 'extendedMapSeqlevels' (aye,
long name) which you can see at
https://gist.github.com/3e6f0e2e2bb67ee910da If you think that others
might be interested in such a function, then by all means please add
it to `GenomeInfoDb` and I'll import it. You'll notice how it tries to
guess the species and/or current naming style when that information is
not provided. When it's used with verbose = TRUE it will show a
message explaining why the mapping failed (if it did), it'll show the
link to the `GenomeInfoDb` vignette for adding an organism, and then
return the original sequence names.

I have a slightly modified version in derfinder
https://github.com/lcolladotor/derfinder/blob/master/R/extendedMapSeqlevels.R
 or https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/derfinder/R/extendedMapSeqlevels.R
where I simply assume some other default values and guarantee
continuity with how I was doing things previously.

Naturally, R CMD check now shows a NOTE because I'm using the :::
syntax for un-exported functions from `GenomeInfoDb`.

Cheers,
Leo


Leonardo Collado Torres, PhD Candidate
Department of Biostatistics
Johns Hopkins University
Bloomberg School of Public Health
Website: http://www.biostat.jhsph.edu/~lcollado/
Blog: http://lcolladotor.github.io/



More information about the Bioc-devel mailing list