[Bioc-devel] [devteam-bioc] A more flexible GenomeInfoDb::mapSeqlevels(): used supported info but don't break with new organisms/toy examples

Sonali Arora sarora at fhcrc.org
Wed Oct 22 19:53:37 CEST 2014


Hi Leo,

To summarize, These are the concerns raised in your previous email :

*a) mapSeqlevels() should return current style along with other mapped 
styles? **
look at extendedMapSeqlevels()

*I am looking into this. I have a few thoughts/ questions for you.

The main idea of mapSeqlevels() is to map the current seqlevelStyle to 
other seqlevelStyles.

"I was also expecting mapSeqlevels() to return the same input if the
specified 'style' was the same as the one currently being used."


Returning the existing seqlevelStyle along with the mapped Seqlevel 
style sounds a
little redundant. (given that the user is already supplying it while 
calling mapSeqlevels()
so he already knows it!)

  "For
example, if I was working with Homo sapiens NCBI style and attempted
to map to NCBI, I was expecting the same output as the input."


why would you want to map it back to NCBI ? Please explain your use case 
more
concretely.

I'd be happy to make changes to mapSeqlevels() and look at your function 
more deeply,
once I understand the use case better. Thanks for clarifying this.

*b) Do you think it would be helpful to you and other developers of I 
export**
**.guessSpeciesStyle() and .supportedSeqnameMappings() from GenomeInfoDb ?*

*c) .guessSpeciesStyle('2') should return all possible hits for "seqnames"*

This was a great find! I have fixed it in version 1.3.2 - Thanks for 
figuring this out !
Does the following style of output make sense to you ? If multiple hits 
are found
there are returned in "species" and "style" respectively.

 > .guessSpeciesStyle(c(paste0("chr",1:10)))
$species
[1] "Homo sapiens" "Mus musculus"

$style
[1] "UCSC" "UCSC"

 > .guessSpeciesStyle(c(paste0("chr",1:22)))
$species
[1] "Homo sapiens"

$style
[1] "UCSC"

 > .guessSpeciesStyle("chr2")
$species
[1] "Homo sapiens" "Mus musculus"

$style
[1] "UCSC" "UCSC"

 > .guessSpeciesStyle("2")
$species
  [1] "Arabidopsis thaliana"    "Arabidopsis thaliana" "Cyanidioschyzon 
merolae"
  [4] "Homo sapiens"            "Mus musculus"            "Oryza sativa"
  [7] "Oryza sativa"            "Populus trichocarpa"     "Zea mays"
[10] "Zea mays"

$style
  [1] "NCBI"   "TAIR10" "NCBI"   "NCBI"   "NCBI"   "NCBI" "MSU6"   
"JGI2"   "NCBI"   "AGPvF"

*SeqlevelsStyle also returns multiple styles now (if it cant guess the 
correct one!) *

 > seqnames <- c(paste0("chr",1:22))
 > seqlevelsStyle(seqnames)
[1] "UCSC"

 > seqnames <- "2"
 > seqlevelsStyle(seqnames)
warning! Multiple seqnameStyles found.
[1] "NCBI"   "TAIR10" "MSU6"   "JGI2"   "AGPvF"

Thanks again for your feedback,
Sonali.

*
*



On 10/21/2014 8:47 PM, Maintainer wrote:
> Hello,
>
> In my `derfinder` package I have been relying on `GenomeInfoDb` to
> correctly name the sequence names. Given
> https://support.bioconductor.org/p/62136/ I now realize that I was
> expecting too much from `GenomeInfoDb`.
>
> For example, in some cases I was expecting it to guess that a single
> '2' meant that the naming style was NCBI. This is true for Homo
> sapiens but now I realize that it's not necessary true for Arabidopsis
> thaliana.
>
>> GenomeInfoDb:::.guessSpeciesStyle('2')
> [1] "Arabidopsis thaliana" "NCBI"
>
> ## Could have been Homo sapiens (NCBI) or Arabidopbis thaliana TAIR10 or NCBI
>> GenomeInfoDb:::.supportedSeqnameMappings()[['Homo_sapiens']][2, ]
>    NCBI UCSC dbSNP
> 2    2 chr2   ch2
>> GenomeInfoDb:::.supportedSeqnameMappings()[['Arabidopsis_thaliana']][2, ]
>    NCBI TAIR10
> 2    2      2
>
> I was also expecting mapSeqlevels() to return the same input if the
> specified 'style' was the same as the one currently being used. For
> example, if I was working with Homo sapiens NCBI style and attempted
> to map to NCBI, I was expecting the same output as the input.
>
> ## More than 1 output
>> mapSeqlevels('2', 'NCBI')
>       2
> [1,] "2"
> [2,] "LGII"
> ## Turns out there is another organism that would make this mapping valid
>> GenomeInfoDb:::.supportedSeqnameMappings()[['Populus_trichocarpa']][2, ]
>    NCBI JGI2.F
> 2 LGII      2
>
>
> I also noticed that `GenomeInfoDb` couldn't work with some fictional
> examples. For example:
>
> ## Expected error, but this meant that `derfinder` couldn't use toy
> examples from other
> ## packages.
>> seqlevelsStyle('seq1')
> Error in seqlevelsStyle("seq1") :
>    The style does not have a compatible entry for the species supported
> by Seqname. Please see genomeStyles() for supported
>    species/style
>
> ## Using toy data from Rsamtools fails as expected
>> fl <- system.file("extdata", "ex1.bam", package="Rsamtools",
> +                   mustWork=TRUE)
>> library(GenomicAlignments)
>> names(scanBamHeader(BamFile(fl))$targets)
> [1] "seq1" "seq2"
>> seqlevelsStyle(names(scanBamHeader(BamFile(fl))$targets))
> Error in seqlevelsStyle(names(scanBamHeader(BamFile(fl))$targets)) :
>    The style does not have a compatible entry for the species supported
> by Seqname. Please see genomeStyles() for supported
>    species/style
>
>
> So, I wanted a way to use `GenomeInfoDb` to make the correct naming
> style maps for supported genomes, and simply return the original
> sequence names if using an unsupported species/mapping or if the
> original and target styles are the same.
>
>
> That's what I basically did when I wrote 'extendedMapSeqlevels' (aye,
> long name) which you can see at
> https://gist.github.com/3e6f0e2e2bb67ee910da If you think that others
> might be interested in such a function, then by all means please add
> it to `GenomeInfoDb` and I'll import it. You'll notice how it tries to
> guess the species and/or current naming style when that information is
> not provided. When it's used with verbose = TRUE it will show a
> message explaining why the mapping failed (if it did), it'll show the
> link to the `GenomeInfoDb` vignette for adding an organism, and then
> return the original sequence names.
>
> I have a slightly modified version in derfinder
> https://github.com/lcolladotor/derfinder/blob/master/R/extendedMapSeqlevels.R
>   or https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/derfinder/R/extendedMapSeqlevels.R
> where I simply assume some other default values and guarantee
> continuity with how I was doing things previously.
>
> Naturally, R CMD check now shows a NOTE because I'm using the :::
> syntax for un-exported functions from `GenomeInfoDb`.
>
> Cheers,
> Leo
>
>
> Leonardo Collado Torres, PhD Candidate
> Department of Biostatistics
> Johns Hopkins University
> Bloomberg School of Public Health
> Website: http://www.biostat.jhsph.edu/~lcollado/
> Blog: http://lcolladotor.github.io/
>
> ________________________________________________________________________
> devteam-bioc mailing list
> To unsubscribe from this mailing list send a blank email to
> devteam-bioc-leave at lists.fhcrc.org
> You can also unsubscribe or change your personal options at
> https://lists.fhcrc.org/mailman/listinfo/devteam-bioc

-- 
Thanks and Regards,
Sonali


	[[alternative HTML version deleted]]



More information about the Bioc-devel mailing list