[Bioc-devel] [devteam-bioc] A more flexible GenomeInfoDb::mapSeqlevels(): used supported info but don't break with new organisms/toy examples

Leonardo Collado Torres lcollado at jhu.edu
Thu Oct 23 01:03:46 CEST 2014


On Wed, Oct 22, 2014 at 6:59 PM, Leonardo Collado Torres
<lcollado at jhu.edu> wrote:
> Hi Sonali,
>
> The recent changes in GenomeInfoDb (1.3.3) look great!
>
>
>
> On Wed, Oct 22, 2014 at 1:53 PM, Sonali Arora <sarora at fhcrc.org> wrote:
>>
>> Hi Leo,
>>
>> To summarize, These are the concerns raised in your previous email :
>>
>> a) mapSeqlevels() should return current style along with other mapped styles?
>> look at extendedMapSeqlevels()
>>
>> I am looking into this. I have a few thoughts/ questions for you.
>>
>> The main idea of mapSeqlevels() is to map the current seqlevelStyle to other seqlevelStyles.
>>
>
> Totally!
>
>> "I was also expecting mapSeqlevels() to return the same input if the
>> specified 'style' was the same as the one currently being used."
>>
>>
>> Returning the existing seqlevelStyle along with the mapped Seqlevel style sounds a
>> little redundant. (given that the user is already supplying it while calling mapSeqlevels()
>> so he already knows it!)
>>
>>  "For
>> example, if I was working with Homo sapiens NCBI style and attempted
>> to map to NCBI, I was expecting the same output as the input."
>>
>>
>> why would you want to map it back to NCBI ? Please explain your use case more
>> concretely.
>
> I agree with you that my example doesn't make sense if a user is
> manually calling mapSeqlevels().
>
> In `derfinder`, several functions split the work by sequence name
> (chromosomes). Others take two GRanges objects and match them. Others
> create GRanges objects. Others load data from BAM or BigWig files.
>
> In my use case, I wanted to make sure things would work even if the
> user didn't make sure that (in the 2 GRanges case) that the objects
> had the same sequence naming style. That is why the user doesn't
> necessarily specify the `style` and I was using GenomeInfoDb to guess
> it via `seqlevelsStyle()` as well as mapping the sequence names to the
> default `style` to make sure the playing field was the same before
> finding overlaps, etc via `mapSeqlevels()`.
>
> This means that sometimes I am working with a single sequence name.
> Hence the `seqlevelsStyle('2')` example which is a different scenario
> from working with multiple sequence names.
>
>> seqlevelsStyle('2')
> warning! Multiple seqnameStyles found.
> [1] "NCBI"   "TAIR10" "MSU6"   "JGI2"   "AGPvF"
>> seqlevelsStyle(as.character(1:13))
> [1] "NCBI"
>
> I'm aware that I could simply dodge using mapSeqlevels() when the
> default `style` is the same as the `style` currently used in the input
> as shown below:
>
>> foo <- function(seqnames, style = 'UCSC') {
> +     if(seqlevelsStyle(seqnames) != style) {
> +         res <- mapSeqlevels(seqnames, style)
> +     } else {
> +         res <- seqnames
> +     }
> +     return(res)
> + }
>>
> ## Works with multiple seqnames
>> foo(as.character(1:22))
>       1       2       3       4       5       6       7       8
>  "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"
>       9      10      11      12      13      14      15      16
>  "chr9" "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16"
>      17      18      19      20      21      22
> "chr17" "chr18" "chr19" "chr20" "chr21" "chr22"
>
> ## You also have to guess the species in using a small number of sequence names
> ## Or use only the first result: seqlevelsStyle(seqnames)[1]
>> foo('2')
> warning! Multiple seqnameStyles found.
> [1] "chr2"
> Warning message:
> In if (seqlevelsStyle(seqnames) != style) { :
>   the condition has length > 1 and only the first element will be used
>>
>
> The second part (if you can call it like that) of my use case is that
> I wanted to use the same code whether or not the sequence names /
> organism supplied are not supported by GenomeInfoDb. That is, I wanted
> to be flexible enough to use all the power from `GenomeInfoDb` when
> there's enough information, and simply ignore the step of renaming
> sequence styles if the information is missing.
> `extendedMapSeqlevels()` can do this.
>
> Alternatively, I could also use some tryCatch() calls and if I get an
> error (as illustrated below) then simply avoid using `mapSeqlevels()`.
>
> ## Could also be '39' instead of 'seq1' if for some reason someone is
> working with a genome that has 39 chromosomes.
> ## Currently the supported organism in GenomeInfoDb with the highest
> number is Canis familiaris with 38 autosomal chrs.
>> seqlevelsStyle('seq1')
> Error in seqlevelsStyle("seq1") :
>   The style does not have a compatible entry for the species
>   supported by Seqname. Please see genomeStyles() for
>   supported species/style
>
> ## Modified foo() which can work with unsupported genomes/styles
>
>> zoo <- function(seqnames, style = 'UCSC') {
> +     currentStyle <- tryCatch(seqlevelsStyle(seqnames), error =
> function(e) { 'error' })
> +     if(currentStyle[1] == 'error') {
> +         return(seqnames)
> +     } else {
> +         ## Or use currentStyle[1] if you don't want the warning in zoo('2')
> +         if(currentStyle != style)     {
> +             res <- mapSeqlevels(seqnames, style)
> +         } else {
> +             res<- seqnames
> +         }
> +     }
> +     return(seqnames)
> + }
>> zoo('seq1')
> [1] "seq1"
>
> `extendedMapSeqlevels()` does more than this by guessing the current
> style only for valid species. If the species is not supplied, it
> guesses it as well.
>
> `zoo()` above can be modified to take a `species` argument and/or
> guess it by using `.guessSpeciesStyle()`.
>
>>
>> I'd be happy to make changes to mapSeqlevels() and look at your function more deeply,
>> once I understand the use case better. Thanks for clarifying this.
>
> No problem =) Thanks for digesting my email!
>
>>
>> b) Do you think it would be helpful to you and other developers of I export
>> .guessSpeciesStyle() and .supportedSeqnameMappings() from GenomeInfoDb ?
>
> Say if `mapSeqlevels()` had a `species` argument and
> `guessSpeciesStyle()` was exported I could simplify
> `extendedMapSeqlevels()` [maybe a different name for it would be
> better] for my use case.
>
> I think that guessing the correct `style` without the `species`
> information is going to get harder as the number of supported
> organisms increases in `GenomeInfoDb`. Hence why I like
> `guessSpeciesStyle()` more than `seqlevelsStyle()`.

I'm guessing that the main use of `seqlevelsStyle(x)` is when `x` is a
GRanges or GRangesList object that has the organism information. Then
the `species` can be inferred without error and making it much more
straight forward to guess the naming style used.

>
> Basically, I would use `guessSpeciesStyle()` to restrict the valid
> `style` guesses to those corresponding to the `species` of interest,
> or guess everything if nothing is supplied. Then with `mapSeqlevels()`
> I could specify which mapping to use. Or alternatively use
> `supportedSeqnameMappings()` if you think that `mapSeqlevels()`
> shouldn't have a species argument.
>
> Anyhow, I don't think that it's critical that you export these
> functions. I don't think that there is a problem with having the
> single NOTE in R CMD check.
>
>>
>> c) .guessSpeciesStyle('2') should return all possible hits for "seqnames"
>>
>> This was a great find! I have fixed it in version 1.3.2 - Thanks for figuring this out !
>> Does the following style of output make sense to you ? If multiple hits are found
>> there are returned in "species" and "style" respectively.
>>
>> > .guessSpeciesStyle(c(paste0("chr",1:10)))
>> $species
>> [1] "Homo sapiens" "Mus musculus"
>>
>> $style
>> [1] "UCSC" "UCSC"
>>
>> > .guessSpeciesStyle(c(paste0("chr",1:22)))
>> $species
>> [1] "Homo sapiens"
>>
>> $style
>> [1] "UCSC"
>>
>> > .guessSpeciesStyle("chr2")
>> $species
>> [1] "Homo sapiens" "Mus musculus"
>>
>> $style
>> [1] "UCSC" "UCSC"
>>
>> > .guessSpeciesStyle("2")
>> $species
>>  [1] "Arabidopsis thaliana"    "Arabidopsis thaliana"    "Cyanidioschyzon merolae"
>>  [4] "Homo sapiens"            "Mus musculus"            "Oryza sativa"
>>  [7] "Oryza sativa"            "Populus trichocarpa"     "Zea mays"
>> [10] "Zea mays"
>>
>> $style
>>  [1] "NCBI"   "TAIR10" "NCBI"   "NCBI"   "NCBI"   "NCBI"   "MSU6"   "JGI2"   "NCBI"   "AGPvF"
>>
>
> Awesome! I think that these changes are great! I'll shortly modify my
> code to use this version.
>
>> SeqlevelsStyle also returns multiple styles now (if it cant guess the correct one!)
>>
>
> =)
>
>> > seqnames <- c(paste0("chr",1:22))
>> > seqlevelsStyle(seqnames)
>> [1] "UCSC"
>>
>> > seqnames <- "2"
>> > seqlevelsStyle(seqnames)
>> warning! Multiple seqnameStyles found.
>> [1] "NCBI"   "TAIR10" "MSU6"   "JGI2"   "AGPvF"
>>
>> Thanks again for your feedback,
>> Sonali.
>>
>>
>
> Let me know if you need anything else!
>
> Cheers,
> Leo
>
>>
>>
>>
>>
>> On 10/21/2014 8:47 PM, Maintainer wrote:
>>
>> Hello,
>>
>> In my `derfinder` package I have been relying on `GenomeInfoDb` to
>> correctly name the sequence names. Given
>> https://support.bioconductor.org/p/62136/ I now realize that I was
>> expecting too much from `GenomeInfoDb`.
>>
>> For example, in some cases I was expecting it to guess that a single
>> '2' meant that the naming style was NCBI. This is true for Homo
>> sapiens but now I realize that it's not necessary true for Arabidopsis
>> thaliana.
>>
>> GenomeInfoDb:::.guessSpeciesStyle('2')
>>
>> [1] "Arabidopsis thaliana" "NCBI"
>>
>> ## Could have been Homo sapiens (NCBI) or Arabidopbis thaliana TAIR10 or NCBI
>>
>> GenomeInfoDb:::.supportedSeqnameMappings()[['Homo_sapiens']][2, ]
>>
>>   NCBI UCSC dbSNP
>> 2    2 chr2   ch2
>>
>> GenomeInfoDb:::.supportedSeqnameMappings()[['Arabidopsis_thaliana']][2, ]
>>
>>   NCBI TAIR10
>> 2    2      2
>>
>> I was also expecting mapSeqlevels() to return the same input if the
>> specified 'style' was the same as the one currently being used. For
>> example, if I was working with Homo sapiens NCBI style and attempted
>> to map to NCBI, I was expecting the same output as the input.
>>
>> ## More than 1 output
>>
>> mapSeqlevels('2', 'NCBI')
>>
>>      2
>> [1,] "2"
>> [2,] "LGII"
>> ## Turns out there is another organism that would make this mapping valid
>>
>> GenomeInfoDb:::.supportedSeqnameMappings()[['Populus_trichocarpa']][2, ]
>>
>>   NCBI JGI2.F
>> 2 LGII      2
>>
>>
>> I also noticed that `GenomeInfoDb` couldn't work with some fictional
>> examples. For example:
>>
>> ## Expected error, but this meant that `derfinder` couldn't use toy
>> examples from other
>> ## packages.
>>
>> seqlevelsStyle('seq1')
>>
>> Error in seqlevelsStyle("seq1") :
>>   The style does not have a compatible entry for the species supported
>> by Seqname. Please see genomeStyles() for supported
>>   species/style
>>
>> ## Using toy data from Rsamtools fails as expected
>>
>> fl <- system.file("extdata", "ex1.bam", package="Rsamtools",
>>
>> +                   mustWork=TRUE)
>>
>> library(GenomicAlignments)
>> names(scanBamHeader(BamFile(fl))$targets)
>>
>> [1] "seq1" "seq2"
>>
>> seqlevelsStyle(names(scanBamHeader(BamFile(fl))$targets))
>>
>> Error in seqlevelsStyle(names(scanBamHeader(BamFile(fl))$targets)) :
>>   The style does not have a compatible entry for the species supported
>> by Seqname. Please see genomeStyles() for supported
>>   species/style
>>
>>
>> So, I wanted a way to use `GenomeInfoDb` to make the correct naming
>> style maps for supported genomes, and simply return the original
>> sequence names if using an unsupported species/mapping or if the
>> original and target styles are the same.
>>
>>
>> That's what I basically did when I wrote 'extendedMapSeqlevels' (aye,
>> long name) which you can see at
>> https://gist.github.com/3e6f0e2e2bb67ee910da If you think that others
>> might be interested in such a function, then by all means please add
>> it to `GenomeInfoDb` and I'll import it. You'll notice how it tries to
>> guess the species and/or current naming style when that information is
>> not provided. When it's used with verbose = TRUE it will show a
>> message explaining why the mapping failed (if it did), it'll show the
>> link to the `GenomeInfoDb` vignette for adding an organism, and then
>> return the original sequence names.
>>
>> I have a slightly modified version in derfinder
>> https://github.com/lcolladotor/derfinder/blob/master/R/extendedMapSeqlevels.R
>>  or https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/derfinder/R/extendedMapSeqlevels.R
>> where I simply assume some other default values and guarantee
>> continuity with how I was doing things previously.
>>
>> Naturally, R CMD check now shows a NOTE because I'm using the :::
>> syntax for un-exported functions from `GenomeInfoDb`.
>>
>> Cheers,
>> Leo
>>
>>
>> Leonardo Collado Torres, PhD Candidate
>> Department of Biostatistics
>> Johns Hopkins University
>> Bloomberg School of Public Health
>> Website: http://www.biostat.jhsph.edu/~lcollado/
>> Blog: http://lcolladotor.github.io/
>>
>> ________________________________________________________________________
>> devteam-bioc mailing list
>> To unsubscribe from this mailing list send a blank email to
>> devteam-bioc-leave at lists.fhcrc.org
>> You can also unsubscribe or change your personal options at
>> https://lists.fhcrc.org/mailman/listinfo/devteam-bioc
>>
>>
>> --
>> Thanks and Regards,
>> Sonali
>>



More information about the Bioc-devel mailing list