[Bioc-devel] proposal for additional seqlevelsStyle

Fri Jul 3 15:34:56 CEST 2020

Thanks Hervé,

this looks very useful!!!

robert.

On 03/07/2020 07:49, Hervé Pagès wrote:
> Hi Vince, Robert, Kasper,
>
> I've done some work on this. Starting with GenomeInfoDb_1.25.7 the 
> seqlevelsStyle() setter has 2 major improvements:
>
> 1. It knows how to rename contigs and scaffolds, not just the 
> chromosomes:
>
>   library(TxDb.Mmusculus.UCSC.mm10.knownGene)
>
>   seqinfo(txdb)
>   # Seqinfo object with 66 sequences (1 circular) from mm10 genome:
>   # seqnames       seqlengths isCircular genome
>   # chr1            195471971       <NA>   mm10
>   # chr2            182113224       <NA>   mm10
>   # chr3            160039680       <NA>   mm10
>   # chr4            156508116       <NA>   mm10
>   # chr5            151834684       <NA>   mm10
>   # ...                   ...        ...    ...
>   # chrUn_GL456392      23629       <NA>   mm10
>   # chrUn_GL456393      55711       <NA>   mm10
>   # chrUn_GL456394      24323       <NA>   mm10
>   # chrUn_GL456396      21240       <NA>   mm10
>   # chrUn_JH584304     114452       <NA>   mm10
>
>   seqlevelsStyle(txdb) <- "NCBI"
>
>   seqinfo(txdb)
>   # Seqinfo object with 66 sequences (1 circular) from GRCm38 genome:
>   # seqnames      seqlengths isCircular genome
>   # 1              195471971       <NA> GRCm38
>   # 2              182113224       <NA> GRCm38
>   # 3              160039680       <NA> GRCm38
>   # 4              156508116       <NA> GRCm38
>   # 5              151834684       <NA> GRCm38
>   # ...                  ...        ...    ...
>   # MSCHRUN_CTG10      23629       <NA> GRCm38
>   # MSCHRUN_CTG11      55711       <NA> GRCm38
>   # MSCHRUN_CTG12      24323       <NA> GRCm38
>   # MSCHRUN_CTG15      21240       <NA> GRCm38
>   # MSCHRUN_CTG23     114452       <NA> GRCm38
>
> 2. It supports new style RefSeq for renaming to/from RefSeq accessions:
>
>   seqlevelsStyle(txdb) <- "RefSeq"
>
>   seqinfo(txdb)
>   # Seqinfo object with 66 sequences (1 circular) from GRCm38 genome:
>   # seqnames    seqlengths isCircular genome
>   # NC_000067.6  195471971       <NA> GRCm38
>   # NC_000068.7  182113224       <NA> GRCm38
>   # NC_000069.6  160039680       <NA> GRCm38
>   # NC_000070.6  156508116       <NA> GRCm38
>   # NC_000071.6  151834684       <NA> GRCm38
>   # ...                ...        ...    ...
>   # NT_166476.1      23629       <NA> GRCm38
>   # NT_166477.1      55711       <NA> GRCm38
>   # NT_166478.1      24323       <NA> GRCm38
>   # NT_166480.1      21240       <NA> GRCm38
>   # NT_187064.1     114452       <NA> GRCm38
>
> These new features only work on objects for which the genome is set to 
> an NCBI assembly (e.g. WBcel235) or UCSC genome (e.g. ce11). This is 
> the case with TxDb, BSgenome, and SNPlocs objects.
>
> The workhorses behind them are new low-level utilities 
> getChromInfoFromNCBI() and getChromInfoFromUCSC(). These support 141 
> NCBI assemblies and 74 UCSC genomes at the moment, respectively. It's 
> easy to add new organisms. The gotcha is that they require internet 
> access and so does the seqlevelsStyle() setter. This could be 
> mitigated by caching the data via BiocFileCache.
>
> Next thing on the list is to support the GenBank style (Vince's 
> original request) to rename to/from GenBank accessions.
>
> Cheers,
> H.
>
>
> On 12/13/19 10:51, Kasper Daniel Hansen wrote:
>> If the chromosome name depends on the assembly, that makes GenomeInfoDb
>> even more useful and necessary.  Provided it is supported of course.
>>
>> On Fri, Dec 13, 2019 at 11:45 AM Vincent Carey 
>> <stvjc using channing.harvard.edu>
>> wrote:
>>
>>> I tried an inline png but I think it was rejected by bioc-devel.  
>>> Here's
>>> another try.
>>>
>>> On Fri, Dec 13, 2019 at 11:40 AM Vincent Carey 
>>> <stvjc using channing.harvard.edu
>>>>
>>> wrote:
>>>
>>>> Thanks -- It is good to know more about the complications of adding
>>>> seqlevelsStyle elements.
>>>> I am not sure how pervasive this will be in SNP annotation in the 
>>>> future.
>>>> The "new API" for dbSNP
>>>> references SPDI annotation conventions.
>>>>
>>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__api.ncbi.nlm.nih.gov_variation_v0_&d=DwIFaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=q-r9L4kC5cjCxRXc33m9n1hKs7WLWiYhNB2t6B_b7nU&s=1jolcMxvVlDvdU0u8LZeWnWHJFt-R6ZweevICNvw2oo&e= 
>>>>
>>>>
>>>> at least one dbsnp build 152 resource uses this nomenclature.  The one
>>>>
>>>> referenced below is the "go-to" resource for current rsid-coordinate
>>>>
>>>> correspondence, as far as I know.
>>>>
>>>>
>>>>> library(VariantAnnotation)
>>>>
>>>> *0/0 packages newly attached/loaded, see sessionInfo() for details.*
>>>>
>>>>> mypar = GRanges("NC_000001.11", IRanges(100000,120000)) # note 
>>>>> seqnames
>>>>
>>>>
>>>>> nn = readVcf("
>>>>
>>> https://urldefense.proofpoint.com/v2/url?u=ftp-3A__ftp.ncbi.nih.gov_snp_redesign_latest-5Frelease_VCF_GCF-5F000001405.38.gz&d=DwIFaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=q-r9L4kC5cjCxRXc33m9n1hKs7WLWiYhNB2t6B_b7nU&s=SXqRWrv1IK1WpNOf3cjMC82-fvChoJ6za48WzlRGeAU&e= 
>>>
>>>> ",
>>>>
>>>> +   genome="GRCh38", param=mypar)
>>>>
>>>>
>>>>> head(rowRanges(nn), 3)
>>>>
>>>> GRanges object with 3 ranges and 5 metadata columns:
>>>>
>>>>                     seqnames    ranges strand | paramRangeID
>>> REF
>>>>
>>>>                        <Rle> <IRanges> <Rle> |     <factor>
>>> <DNAStringSet>
>>>>
>>>>    rs1331956057 NC_000001.11    100000      * | <NA>
>>> C
>>>>
>>>>    rs1252351580 NC_000001.11    100036      * | <NA>
>>> T
>>>>
>>>>    rs1238523913 NC_000001.11    100051      * | <NA>
>>> T
>>>>
>>>>                                ALT      QUAL      FILTER
>>>>
>>>>                 <DNAStringSetList> <numeric> <character>
>>>>
>>>>    rs1331956057                  T      <NA> .
>>>>
>>>>    rs1252351580                  G      <NA> .
>>>>
>>>>    rs1238523913                  C      <NA> .
>>>>
>>>>    -------
>>>>
>>>>    seqinfo: 1 sequence from GRCh38 genome; no seqlengths
>>>>
>>>>
>>>> On Fri, Dec 13, 2019 at 11:01 AM Robert Castelo 
>>>> <robert.castelo using upf.edu>
>>>> wrote:
>>>>
>>>>> hi Hervé,
>>>>>
>>>>> i didn't know about this new sequence style until Vince posted his
>>>>> message and we briefly talked about it at the European BioC 
>>>>> meeting this
>>>>> week in Brussels. however, i didn't know that the style was 
>>>>> specific to
>>>>> a particular assembly. i have no use case of this at the mome moment,
>>>>> i.e., i have not encountered myself any annotation or BAM file with
>>>>> chromosome names written that way, so i don't know how pressing this
>>>>> issue is, maybe Vince can tell us how spread such chromosome naming
>>>>> style may become in the near future.
>>>>>
>>>>> naively, i'd think that it would be matter of adding a
>>>>> reference-specific column, i.e., 'GRCh38.p13', 'GRCh37.p13', etc., 
>>>>> but i
>>>>> can imagine that maybe the "reference style" concept might not be the
>>>>> appropriate placeholder to map all different chromosome names of all
>>>>> different individual human genomes uploaded to NCBI. maybe we should
>>>>> wait until we have a specific use case .. Vince?
>>>>>
>>>>> robert.
>>>>>
>>>>> On 12/11/19 10:06 PM, Pages, Herve wrote:
>>>>>> Hi Vince, Robert,
>>>>>>
>>>>>> Looks like Vince wants the RefSeq accession e.g. NC_000017.11 for
>>> chrom
>>>>>> 17 in the GRCh38.
>>>>>>
>>>>>> @Robert: Is this what you're also interested in?
>>>>>>
>>>>>> The problem is that the RefSeq accessions are specific to a 
>>>>>> particular
>>>>>> assembly (e.g. NC_000017.11 for chrom 17 in GRCh38 but NC_000017.10
>>> for
>>>>>> the same chrom in GRCh37).
>>>>>>
>>>>>> Currently seqlevelsStyle() doesn't know how to distinguish between
>>>>>> different assemblies of the same organism. Not saying it couldn't 
>>>>>> but
>>> it
>>>>>> would require some thinking and some significant refactoring. It
>>>>>> wouldn't be just a matter of adding a column to
>>>>>> genomeStyles()$Homo_sapiens.
>>>>>>
>>>>>> H.
>>>>>>
>>>>>>
>>>>>> On 12/10/19 14:19, Robert Castelo wrote:
>>>>>>> I second this, and would suggest to name the style as 'GRC' for
>>> "Genome
>>>>>>> Reference Consortium".
>>>>>>>
>>>>>>> thanks Vince for bringing this up, being able to easily switch
>>> between
>>>>>>> genome styles is great.
>>>>>>>
>>>>>>> if 'paste0()' in R is one of the most influential contributions to
>>>>>>> statistical computing
>>>>>>>
>>>>>>>
>>>>>
>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__simplystatistics.org_2013_01_31_paste0-2Dis-2Dstatistical-2Dcomputings-2Dmost-2Dinfluential-2Dcontribution-2Dof-2Dthe-2D21st-2Dcentury&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=LCcYSINIz3XXhf8i-26IegXRLkTO1NgVbvzgvnPA3dc&s=b0_SIu8orJ7ZcCS3TIodFvGTPibt9R8vFL5Y40YSx3Q&e= 
>>>
>>>>>>>
>>>>>>> i think that 'seqlevelsStyle()' from the GenomeInfoDb package is 
>>>>>>> one
>>> of
>>>>>>> the most influential contributions to human genetics, if you think
>>>>> about
>>>>>>> the time invested by researchers in parsing and changing between
>>>>>>> different styles of chromosome names :)
>>>>>>>
>>>>>>> robert.
>>>>>>>
>>>>>>> On 06/12/2019 15:03, Vincent Carey wrote:
>>>>>>>> I raised this issue previously with little response.
>>>>>>>>
>>>>>>>> I'd propose that we add a column or two to
>>> genomeStyles()$Homo_sapiens
>>>>>>>>
>>>>>>>>> head(genomeStyles()$Homo_sapiens, 2)
>>>>>>>>      circular auto   sex NCBI UCSC dbSNP Ensembl
>>>>>>>>
>>>>>>>> 1    FALSE TRUE FALSE    1 chr1   ch1       1
>>>>>>>>
>>>>>>>> 2    FALSE TRUE FALSE    2 chr2   ch2       2
>>>>>>>>
>>>>>>>>
>>>>>>>> that includes the values for "NCBI reference sequence names"
>>>>>>>>
>>>>>>>> See
>>>>>>>>
>>>>>
>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.ncbi.nlm.nih.gov_nuccore_568815581&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=LCcYSINIz3XXhf8i-26IegXRLkTO1NgVbvzgvnPA3dc&s=3Jy-MH7heIcrc_A4qm_izduLvBoPWHSeq4gdxf5nv24&e= 
>>>
>>>>>>>> for one report on chr17,
>>>>>>>> and
>>>>>>>>
>>>>>>>>
>>>>>
>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.ncbi.nlm.nih.gov_assembly_GCF-5F000001405.39&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=LCcYSINIz3XXhf8i-26IegXRLkTO1NgVbvzgvnPA3dc&s=y6ut_Xcc4rSbXanckiJhiwLsL0W8neJfKWQa6wnG3aM&e= 
>>>
>>>>>>>>
>>>>>>>> for a table that includes the Genbank labels.
>>>>>>>>
>>>>>>>> Should I just file a PR at
>>>>>>>>
>>>>>
>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_Bioconductor_GenomeInfoDb_&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=LCcYSINIz3XXhf8i-26IegXRLkTO1NgVbvzgvnPA3dc&s=KMzfo3_8kkJ-wdvRCNP5rUjTVMW87brj07yHaKL5Qb0&e= 
>>>
>>>>>>>> after
>>>>>>>> testing?
>>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioc-devel using r-project.org mailing list
>>>>>>>
>>>>>
>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=LCcYSINIz3XXhf8i-26IegXRLkTO1NgVbvzgvnPA3dc&s=SvtNreKVOHnSGjsRwzWWpttpEF7wBXI5utI37-qgX1A&e= 
>>>
>>>>>>>
>>>>>>
>>>>>
>>>>> -- 
>>>>> Robert Castelo, PhD
>>>>> Associate Professor
>>>>> Dept. of Experimental and Health Sciences
>>>>> Universitat Pompeu Fabra (UPF)
>>>>> Barcelona Biomedical Research Park (PRBB)
>>>>> Dr Aiguader 88
>>>>> E-08003 Barcelona, Spain
>>>>> telf: +34.933.160.514
>>>>> fax: +34.933.160.550
>>>>>
>>>>
>>>
>>> -- 
>>> The information in this e-mail is intended only for t...{{dropped:21}}
>
> _______________________________________________
> Bioc-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel