[Bioc-devel] [devteam-bioc] A more flexible GenomeInfoDb::mapSeqlevels(): used supported info but don't break with new organisms/toy examples

Leonardo Collado Torres lcollado at jhu.edu
Fri Oct 24 00:35:31 CEST 2014


Hi,

I just found a tiny bug in .guessSpeciesStyle(). Basically, if the
style name has a dot, this function incorrectly returns the name of
the style.

See below:

## Lets guess the species and style for 'T'
> GenomeInfoDb:::.guessSpeciesStyle('T')
$species
[1] "Populus trichocarpa"

$style
[1] "JGI2"

## Same style with the higher lvl function
> seqlevelsStyle('T')
[1] "JGI2"

## Ok, style 'JGI2'

## In more dtail, this matches 'LGI' in NCBI style for this organism

> GenomeInfoDb:::.supportedSeqnameMappings()["Populus_trichocarpa"]
$Populus_trichocarpa
     NCBI JGI2.F
1     LGI      T
2    LGII      2
3   LGIII      3
4    LGIV      4
5     LGV      5
6    LGVI      6
7   LGVII      7
8  LGVIII      8
9    LGIX      9
10    LGX     10
11   LGXI     11
12  LGXII     12
13 LGXIII     13
14   LGIV     14
15   LGXV     15
16   LGVI     16
17  LGVII     17
18 LGVIII     18
19  LGXIX     19
20   Pltd   <NA>

## Guessing 'LGI' with the higher lvl function works
> seqlevelsStyle('LGI')
[1] "NCBI"

## Mapping from 'LGI' in 'NCBI' style to 'JGI2' style doesn't work
> mapSeqlevels('LGI', 'JGI2')
Error in mapSeqlevels("LGI", "JGI2") :
  supplied seqname style "JGI2" is not supported

## Using the correct 'style' name works.
## However a user might have expected it to work with 'JGI2' given
## that this was the output from seqlevelsStyle('T')
> mapSeqlevels('LGI', 'JGI2.F')
[1] "T"

## Although a user could find the correct name
> head(genomeStyles('Populus trichocarpa'), 1)
  circular auto   sex NCBI JGI2.F
1    FALSE TRUE FALSE  LGI      T

> packageVersion('GenomeInfoDb')
[1] ‘1.3.3’

Cheers,
Leo

On Wed, Oct 22, 2014 at 7:03 PM, Leonardo Collado Torres
<lcollado at jhu.edu> wrote:
> On Wed, Oct 22, 2014 at 6:59 PM, Leonardo Collado Torres
> <lcollado at jhu.edu> wrote:
>> Hi Sonali,
>>
>> The recent changes in GenomeInfoDb (1.3.3) look great!
>>
>>
>>
>> On Wed, Oct 22, 2014 at 1:53 PM, Sonali Arora <sarora at fhcrc.org> wrote:
>>>
>>> Hi Leo,
>>>
>>> To summarize, These are the concerns raised in your previous email :
>>>
>>> a) mapSeqlevels() should return current style along with other mapped styles?
>>> look at extendedMapSeqlevels()
>>>
>>> I am looking into this. I have a few thoughts/ questions for you.
>>>
>>> The main idea of mapSeqlevels() is to map the current seqlevelStyle to other seqlevelStyles.
>>>
>>
>> Totally!
>>
>>> "I was also expecting mapSeqlevels() to return the same input if the
>>> specified 'style' was the same as the one currently being used."
>>>
>>>
>>> Returning the existing seqlevelStyle along with the mapped Seqlevel style sounds a
>>> little redundant. (given that the user is already supplying it while calling mapSeqlevels()
>>> so he already knows it!)
>>>
>>>  "For
>>> example, if I was working with Homo sapiens NCBI style and attempted
>>> to map to NCBI, I was expecting the same output as the input."
>>>
>>>
>>> why would you want to map it back to NCBI ? Please explain your use case more
>>> concretely.
>>
>> I agree with you that my example doesn't make sense if a user is
>> manually calling mapSeqlevels().
>>
>> In `derfinder`, several functions split the work by sequence name
>> (chromosomes). Others take two GRanges objects and match them. Others
>> create GRanges objects. Others load data from BAM or BigWig files.
>>
>> In my use case, I wanted to make sure things would work even if the
>> user didn't make sure that (in the 2 GRanges case) that the objects
>> had the same sequence naming style. That is why the user doesn't
>> necessarily specify the `style` and I was using GenomeInfoDb to guess
>> it via `seqlevelsStyle()` as well as mapping the sequence names to the
>> default `style` to make sure the playing field was the same before
>> finding overlaps, etc via `mapSeqlevels()`.
>>
>> This means that sometimes I am working with a single sequence name.
>> Hence the `seqlevelsStyle('2')` example which is a different scenario
>> from working with multiple sequence names.
>>
>>> seqlevelsStyle('2')
>> warning! Multiple seqnameStyles found.
>> [1] "NCBI"   "TAIR10" "MSU6"   "JGI2"   "AGPvF"
>>> seqlevelsStyle(as.character(1:13))
>> [1] "NCBI"
>>
>> I'm aware that I could simply dodge using mapSeqlevels() when the
>> default `style` is the same as the `style` currently used in the input
>> as shown below:
>>
>>> foo <- function(seqnames, style = 'UCSC') {
>> +     if(seqlevelsStyle(seqnames) != style) {
>> +         res <- mapSeqlevels(seqnames, style)
>> +     } else {
>> +         res <- seqnames
>> +     }
>> +     return(res)
>> + }
>>>
>> ## Works with multiple seqnames
>>> foo(as.character(1:22))
>>       1       2       3       4       5       6       7       8
>>  "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"
>>       9      10      11      12      13      14      15      16
>>  "chr9" "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16"
>>      17      18      19      20      21      22
>> "chr17" "chr18" "chr19" "chr20" "chr21" "chr22"
>>
>> ## You also have to guess the species in using a small number of sequence names
>> ## Or use only the first result: seqlevelsStyle(seqnames)[1]
>>> foo('2')
>> warning! Multiple seqnameStyles found.
>> [1] "chr2"
>> Warning message:
>> In if (seqlevelsStyle(seqnames) != style) { :
>>   the condition has length > 1 and only the first element will be used
>>>
>>
>> The second part (if you can call it like that) of my use case is that
>> I wanted to use the same code whether or not the sequence names /
>> organism supplied are not supported by GenomeInfoDb. That is, I wanted
>> to be flexible enough to use all the power from `GenomeInfoDb` when
>> there's enough information, and simply ignore the step of renaming
>> sequence styles if the information is missing.
>> `extendedMapSeqlevels()` can do this.
>>
>> Alternatively, I could also use some tryCatch() calls and if I get an
>> error (as illustrated below) then simply avoid using `mapSeqlevels()`.
>>
>> ## Could also be '39' instead of 'seq1' if for some reason someone is
>> working with a genome that has 39 chromosomes.
>> ## Currently the supported organism in GenomeInfoDb with the highest
>> number is Canis familiaris with 38 autosomal chrs.
>>> seqlevelsStyle('seq1')
>> Error in seqlevelsStyle("seq1") :
>>   The style does not have a compatible entry for the species
>>   supported by Seqname. Please see genomeStyles() for
>>   supported species/style
>>
>> ## Modified foo() which can work with unsupported genomes/styles
>>
>>> zoo <- function(seqnames, style = 'UCSC') {
>> +     currentStyle <- tryCatch(seqlevelsStyle(seqnames), error =
>> function(e) { 'error' })
>> +     if(currentStyle[1] == 'error') {
>> +         return(seqnames)
>> +     } else {
>> +         ## Or use currentStyle[1] if you don't want the warning in zoo('2')
>> +         if(currentStyle != style)     {
>> +             res <- mapSeqlevels(seqnames, style)
>> +         } else {
>> +             res<- seqnames
>> +         }
>> +     }
>> +     return(seqnames)
>> + }
>>> zoo('seq1')
>> [1] "seq1"
>>
>> `extendedMapSeqlevels()` does more than this by guessing the current
>> style only for valid species. If the species is not supplied, it
>> guesses it as well.
>>
>> `zoo()` above can be modified to take a `species` argument and/or
>> guess it by using `.guessSpeciesStyle()`.
>>
>>>
>>> I'd be happy to make changes to mapSeqlevels() and look at your function more deeply,
>>> once I understand the use case better. Thanks for clarifying this.
>>
>> No problem =) Thanks for digesting my email!
>>
>>>
>>> b) Do you think it would be helpful to you and other developers of I export
>>> .guessSpeciesStyle() and .supportedSeqnameMappings() from GenomeInfoDb ?
>>
>> Say if `mapSeqlevels()` had a `species` argument and
>> `guessSpeciesStyle()` was exported I could simplify
>> `extendedMapSeqlevels()` [maybe a different name for it would be
>> better] for my use case.
>>
>> I think that guessing the correct `style` without the `species`
>> information is going to get harder as the number of supported
>> organisms increases in `GenomeInfoDb`. Hence why I like
>> `guessSpeciesStyle()` more than `seqlevelsStyle()`.
>
> I'm guessing that the main use of `seqlevelsStyle(x)` is when `x` is a
> GRanges or GRangesList object that has the organism information. Then
> the `species` can be inferred without error and making it much more
> straight forward to guess the naming style used.
>
>>
>> Basically, I would use `guessSpeciesStyle()` to restrict the valid
>> `style` guesses to those corresponding to the `species` of interest,
>> or guess everything if nothing is supplied. Then with `mapSeqlevels()`
>> I could specify which mapping to use. Or alternatively use
>> `supportedSeqnameMappings()` if you think that `mapSeqlevels()`
>> shouldn't have a species argument.
>>
>> Anyhow, I don't think that it's critical that you export these
>> functions. I don't think that there is a problem with having the
>> single NOTE in R CMD check.
>>
>>>
>>> c) .guessSpeciesStyle('2') should return all possible hits for "seqnames"
>>>
>>> This was a great find! I have fixed it in version 1.3.2 - Thanks for figuring this out !
>>> Does the following style of output make sense to you ? If multiple hits are found
>>> there are returned in "species" and "style" respectively.
>>>
>>> > .guessSpeciesStyle(c(paste0("chr",1:10)))
>>> $species
>>> [1] "Homo sapiens" "Mus musculus"
>>>
>>> $style
>>> [1] "UCSC" "UCSC"
>>>
>>> > .guessSpeciesStyle(c(paste0("chr",1:22)))
>>> $species
>>> [1] "Homo sapiens"
>>>
>>> $style
>>> [1] "UCSC"
>>>
>>> > .guessSpeciesStyle("chr2")
>>> $species
>>> [1] "Homo sapiens" "Mus musculus"
>>>
>>> $style
>>> [1] "UCSC" "UCSC"
>>>
>>> > .guessSpeciesStyle("2")
>>> $species
>>>  [1] "Arabidopsis thaliana"    "Arabidopsis thaliana"    "Cyanidioschyzon merolae"
>>>  [4] "Homo sapiens"            "Mus musculus"            "Oryza sativa"
>>>  [7] "Oryza sativa"            "Populus trichocarpa"     "Zea mays"
>>> [10] "Zea mays"
>>>
>>> $style
>>>  [1] "NCBI"   "TAIR10" "NCBI"   "NCBI"   "NCBI"   "NCBI"   "MSU6"   "JGI2"   "NCBI"   "AGPvF"
>>>
>>
>> Awesome! I think that these changes are great! I'll shortly modify my
>> code to use this version.
>>
>>> SeqlevelsStyle also returns multiple styles now (if it cant guess the correct one!)
>>>
>>
>> =)
>>
>>> > seqnames <- c(paste0("chr",1:22))
>>> > seqlevelsStyle(seqnames)
>>> [1] "UCSC"
>>>
>>> > seqnames <- "2"
>>> > seqlevelsStyle(seqnames)
>>> warning! Multiple seqnameStyles found.
>>> [1] "NCBI"   "TAIR10" "MSU6"   "JGI2"   "AGPvF"
>>>
>>> Thanks again for your feedback,
>>> Sonali.
>>>
>>>
>>
>> Let me know if you need anything else!
>>
>> Cheers,
>> Leo
>>
>>>
>>>
>>>
>>>
>>> On 10/21/2014 8:47 PM, Maintainer wrote:
>>>
>>> Hello,
>>>
>>> In my `derfinder` package I have been relying on `GenomeInfoDb` to
>>> correctly name the sequence names. Given
>>> https://support.bioconductor.org/p/62136/ I now realize that I was
>>> expecting too much from `GenomeInfoDb`.
>>>
>>> For example, in some cases I was expecting it to guess that a single
>>> '2' meant that the naming style was NCBI. This is true for Homo
>>> sapiens but now I realize that it's not necessary true for Arabidopsis
>>> thaliana.
>>>
>>> GenomeInfoDb:::.guessSpeciesStyle('2')
>>>
>>> [1] "Arabidopsis thaliana" "NCBI"
>>>
>>> ## Could have been Homo sapiens (NCBI) or Arabidopbis thaliana TAIR10 or NCBI
>>>
>>> GenomeInfoDb:::.supportedSeqnameMappings()[['Homo_sapiens']][2, ]
>>>
>>>   NCBI UCSC dbSNP
>>> 2    2 chr2   ch2
>>>
>>> GenomeInfoDb:::.supportedSeqnameMappings()[['Arabidopsis_thaliana']][2, ]
>>>
>>>   NCBI TAIR10
>>> 2    2      2
>>>
>>> I was also expecting mapSeqlevels() to return the same input if the
>>> specified 'style' was the same as the one currently being used. For
>>> example, if I was working with Homo sapiens NCBI style and attempted
>>> to map to NCBI, I was expecting the same output as the input.
>>>
>>> ## More than 1 output
>>>
>>> mapSeqlevels('2', 'NCBI')
>>>
>>>      2
>>> [1,] "2"
>>> [2,] "LGII"
>>> ## Turns out there is another organism that would make this mapping valid
>>>
>>> GenomeInfoDb:::.supportedSeqnameMappings()[['Populus_trichocarpa']][2, ]
>>>
>>>   NCBI JGI2.F
>>> 2 LGII      2
>>>
>>>
>>> I also noticed that `GenomeInfoDb` couldn't work with some fictional
>>> examples. For example:
>>>
>>> ## Expected error, but this meant that `derfinder` couldn't use toy
>>> examples from other
>>> ## packages.
>>>
>>> seqlevelsStyle('seq1')
>>>
>>> Error in seqlevelsStyle("seq1") :
>>>   The style does not have a compatible entry for the species supported
>>> by Seqname. Please see genomeStyles() for supported
>>>   species/style
>>>
>>> ## Using toy data from Rsamtools fails as expected
>>>
>>> fl <- system.file("extdata", "ex1.bam", package="Rsamtools",
>>>
>>> +                   mustWork=TRUE)
>>>
>>> library(GenomicAlignments)
>>> names(scanBamHeader(BamFile(fl))$targets)
>>>
>>> [1] "seq1" "seq2"
>>>
>>> seqlevelsStyle(names(scanBamHeader(BamFile(fl))$targets))
>>>
>>> Error in seqlevelsStyle(names(scanBamHeader(BamFile(fl))$targets)) :
>>>   The style does not have a compatible entry for the species supported
>>> by Seqname. Please see genomeStyles() for supported
>>>   species/style
>>>
>>>
>>> So, I wanted a way to use `GenomeInfoDb` to make the correct naming
>>> style maps for supported genomes, and simply return the original
>>> sequence names if using an unsupported species/mapping or if the
>>> original and target styles are the same.
>>>
>>>
>>> That's what I basically did when I wrote 'extendedMapSeqlevels' (aye,
>>> long name) which you can see at
>>> https://gist.github.com/3e6f0e2e2bb67ee910da If you think that others
>>> might be interested in such a function, then by all means please add
>>> it to `GenomeInfoDb` and I'll import it. You'll notice how it tries to
>>> guess the species and/or current naming style when that information is
>>> not provided. When it's used with verbose = TRUE it will show a
>>> message explaining why the mapping failed (if it did), it'll show the
>>> link to the `GenomeInfoDb` vignette for adding an organism, and then
>>> return the original sequence names.
>>>
>>> I have a slightly modified version in derfinder
>>> https://github.com/lcolladotor/derfinder/blob/master/R/extendedMapSeqlevels.R
>>>  or https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/derfinder/R/extendedMapSeqlevels.R
>>> where I simply assume some other default values and guarantee
>>> continuity with how I was doing things previously.
>>>
>>> Naturally, R CMD check now shows a NOTE because I'm using the :::
>>> syntax for un-exported functions from `GenomeInfoDb`.
>>>
>>> Cheers,
>>> Leo
>>>
>>>
>>> Leonardo Collado Torres, PhD Candidate
>>> Department of Biostatistics
>>> Johns Hopkins University
>>> Bloomberg School of Public Health
>>> Website: http://www.biostat.jhsph.edu/~lcollado/
>>> Blog: http://lcolladotor.github.io/
>>>
>>> ________________________________________________________________________
>>> devteam-bioc mailing list
>>> To unsubscribe from this mailing list send a blank email to
>>> devteam-bioc-leave at lists.fhcrc.org
>>> You can also unsubscribe or change your personal options at
>>> https://lists.fhcrc.org/mailman/listinfo/devteam-bioc
>>>
>>>
>>> --
>>> Thanks and Regards,
>>> Sonali
>>>



More information about the Bioc-devel mailing list