[Bioc-devel] proposal for additional seqlevelsStyle
Hervé Pagès
hp@ge@ @end|ng |rom |redhutch@org
Fri Jul 3 07:49:51 CEST 2020
Hi Vince, Robert, Kasper,
I've done some work on this. Starting with GenomeInfoDb_1.25.7 the
seqlevelsStyle() setter has 2 major improvements:
1. It knows how to rename contigs and scaffolds, not just the chromosomes:
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
seqinfo(txdb)
# Seqinfo object with 66 sequences (1 circular) from mm10 genome:
# seqnames seqlengths isCircular genome
# chr1 195471971 <NA> mm10
# chr2 182113224 <NA> mm10
# chr3 160039680 <NA> mm10
# chr4 156508116 <NA> mm10
# chr5 151834684 <NA> mm10
# ... ... ... ...
# chrUn_GL456392 23629 <NA> mm10
# chrUn_GL456393 55711 <NA> mm10
# chrUn_GL456394 24323 <NA> mm10
# chrUn_GL456396 21240 <NA> mm10
# chrUn_JH584304 114452 <NA> mm10
seqlevelsStyle(txdb) <- "NCBI"
seqinfo(txdb)
# Seqinfo object with 66 sequences (1 circular) from GRCm38 genome:
# seqnames seqlengths isCircular genome
# 1 195471971 <NA> GRCm38
# 2 182113224 <NA> GRCm38
# 3 160039680 <NA> GRCm38
# 4 156508116 <NA> GRCm38
# 5 151834684 <NA> GRCm38
# ... ... ... ...
# MSCHRUN_CTG10 23629 <NA> GRCm38
# MSCHRUN_CTG11 55711 <NA> GRCm38
# MSCHRUN_CTG12 24323 <NA> GRCm38
# MSCHRUN_CTG15 21240 <NA> GRCm38
# MSCHRUN_CTG23 114452 <NA> GRCm38
2. It supports new style RefSeq for renaming to/from RefSeq accessions:
seqlevelsStyle(txdb) <- "RefSeq"
seqinfo(txdb)
# Seqinfo object with 66 sequences (1 circular) from GRCm38 genome:
# seqnames seqlengths isCircular genome
# NC_000067.6 195471971 <NA> GRCm38
# NC_000068.7 182113224 <NA> GRCm38
# NC_000069.6 160039680 <NA> GRCm38
# NC_000070.6 156508116 <NA> GRCm38
# NC_000071.6 151834684 <NA> GRCm38
# ... ... ... ...
# NT_166476.1 23629 <NA> GRCm38
# NT_166477.1 55711 <NA> GRCm38
# NT_166478.1 24323 <NA> GRCm38
# NT_166480.1 21240 <NA> GRCm38
# NT_187064.1 114452 <NA> GRCm38
These new features only work on objects for which the genome is set to
an NCBI assembly (e.g. WBcel235) or UCSC genome (e.g. ce11). This is the
case with TxDb, BSgenome, and SNPlocs objects.
The workhorses behind them are new low-level utilities
getChromInfoFromNCBI() and getChromInfoFromUCSC(). These support 141
NCBI assemblies and 74 UCSC genomes at the moment, respectively. It's
easy to add new organisms. The gotcha is that they require internet
access and so does the seqlevelsStyle() setter. This could be mitigated
by caching the data via BiocFileCache.
Next thing on the list is to support the GenBank style (Vince's original
request) to rename to/from GenBank accessions.
Cheers,
H.
On 12/13/19 10:51, Kasper Daniel Hansen wrote:
> If the chromosome name depends on the assembly, that makes GenomeInfoDb
> even more useful and necessary. Provided it is supported of course.
>
> On Fri, Dec 13, 2019 at 11:45 AM Vincent Carey <stvjc using channing.harvard.edu>
> wrote:
>
>> I tried an inline png but I think it was rejected by bioc-devel. Here's
>> another try.
>>
>> On Fri, Dec 13, 2019 at 11:40 AM Vincent Carey <stvjc using channing.harvard.edu
>>>
>> wrote:
>>
>>> Thanks -- It is good to know more about the complications of adding
>>> seqlevelsStyle elements.
>>> I am not sure how pervasive this will be in SNP annotation in the future.
>>> The "new API" for dbSNP
>>> references SPDI annotation conventions.
>>>
>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__api.ncbi.nlm.nih.gov_variation_v0_&d=DwIFaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=q-r9L4kC5cjCxRXc33m9n1hKs7WLWiYhNB2t6B_b7nU&s=1jolcMxvVlDvdU0u8LZeWnWHJFt-R6ZweevICNvw2oo&e=
>>>
>>> at least one dbsnp build 152 resource uses this nomenclature. The one
>>>
>>> referenced below is the "go-to" resource for current rsid-coordinate
>>>
>>> correspondence, as far as I know.
>>>
>>>
>>>> library(VariantAnnotation)
>>>
>>> *0/0 packages newly attached/loaded, see sessionInfo() for details.*
>>>
>>>> mypar = GRanges("NC_000001.11", IRanges(100000,120000)) # note seqnames
>>>
>>>
>>>> nn = readVcf("
>>>
>> https://urldefense.proofpoint.com/v2/url?u=ftp-3A__ftp.ncbi.nih.gov_snp_redesign_latest-5Frelease_VCF_GCF-5F000001405.38.gz&d=DwIFaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=q-r9L4kC5cjCxRXc33m9n1hKs7WLWiYhNB2t6B_b7nU&s=SXqRWrv1IK1WpNOf3cjMC82-fvChoJ6za48WzlRGeAU&e=
>>> ",
>>>
>>> + genome="GRCh38", param=mypar)
>>>
>>>
>>>> head(rowRanges(nn), 3)
>>>
>>> GRanges object with 3 ranges and 5 metadata columns:
>>>
>>> seqnames ranges strand | paramRangeID
>> REF
>>>
>>> <Rle> <IRanges> <Rle> | <factor>
>> <DNAStringSet>
>>>
>>> rs1331956057 NC_000001.11 100000 * | <NA>
>> C
>>>
>>> rs1252351580 NC_000001.11 100036 * | <NA>
>> T
>>>
>>> rs1238523913 NC_000001.11 100051 * | <NA>
>> T
>>>
>>> ALT QUAL FILTER
>>>
>>> <DNAStringSetList> <numeric> <character>
>>>
>>> rs1331956057 T <NA> .
>>>
>>> rs1252351580 G <NA> .
>>>
>>> rs1238523913 C <NA> .
>>>
>>> -------
>>>
>>> seqinfo: 1 sequence from GRCh38 genome; no seqlengths
>>>
>>>
>>> On Fri, Dec 13, 2019 at 11:01 AM Robert Castelo <robert.castelo using upf.edu>
>>> wrote:
>>>
>>>> hi Hervé,
>>>>
>>>> i didn't know about this new sequence style until Vince posted his
>>>> message and we briefly talked about it at the European BioC meeting this
>>>> week in Brussels. however, i didn't know that the style was specific to
>>>> a particular assembly. i have no use case of this at the mome moment,
>>>> i.e., i have not encountered myself any annotation or BAM file with
>>>> chromosome names written that way, so i don't know how pressing this
>>>> issue is, maybe Vince can tell us how spread such chromosome naming
>>>> style may become in the near future.
>>>>
>>>> naively, i'd think that it would be matter of adding a
>>>> reference-specific column, i.e., 'GRCh38.p13', 'GRCh37.p13', etc., but i
>>>> can imagine that maybe the "reference style" concept might not be the
>>>> appropriate placeholder to map all different chromosome names of all
>>>> different individual human genomes uploaded to NCBI. maybe we should
>>>> wait until we have a specific use case .. Vince?
>>>>
>>>> robert.
>>>>
>>>> On 12/11/19 10:06 PM, Pages, Herve wrote:
>>>>> Hi Vince, Robert,
>>>>>
>>>>> Looks like Vince wants the RefSeq accession e.g. NC_000017.11 for
>> chrom
>>>>> 17 in the GRCh38.
>>>>>
>>>>> @Robert: Is this what you're also interested in?
>>>>>
>>>>> The problem is that the RefSeq accessions are specific to a particular
>>>>> assembly (e.g. NC_000017.11 for chrom 17 in GRCh38 but NC_000017.10
>> for
>>>>> the same chrom in GRCh37).
>>>>>
>>>>> Currently seqlevelsStyle() doesn't know how to distinguish between
>>>>> different assemblies of the same organism. Not saying it couldn't but
>> it
>>>>> would require some thinking and some significant refactoring. It
>>>>> wouldn't be just a matter of adding a column to
>>>>> genomeStyles()$Homo_sapiens.
>>>>>
>>>>> H.
>>>>>
>>>>>
>>>>> On 12/10/19 14:19, Robert Castelo wrote:
>>>>>> I second this, and would suggest to name the style as 'GRC' for
>> "Genome
>>>>>> Reference Consortium".
>>>>>>
>>>>>> thanks Vince for bringing this up, being able to easily switch
>> between
>>>>>> genome styles is great.
>>>>>>
>>>>>> if 'paste0()' in R is one of the most influential contributions to
>>>>>> statistical computing
>>>>>>
>>>>>>
>>>>
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__simplystatistics.org_2013_01_31_paste0-2Dis-2Dstatistical-2Dcomputings-2Dmost-2Dinfluential-2Dcontribution-2Dof-2Dthe-2D21st-2Dcentury&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=LCcYSINIz3XXhf8i-26IegXRLkTO1NgVbvzgvnPA3dc&s=b0_SIu8orJ7ZcCS3TIodFvGTPibt9R8vFL5Y40YSx3Q&e=
>>>>>>
>>>>>> i think that 'seqlevelsStyle()' from the GenomeInfoDb package is one
>> of
>>>>>> the most influential contributions to human genetics, if you think
>>>> about
>>>>>> the time invested by researchers in parsing and changing between
>>>>>> different styles of chromosome names :)
>>>>>>
>>>>>> robert.
>>>>>>
>>>>>> On 06/12/2019 15:03, Vincent Carey wrote:
>>>>>>> I raised this issue previously with little response.
>>>>>>>
>>>>>>> I'd propose that we add a column or two to
>> genomeStyles()$Homo_sapiens
>>>>>>>
>>>>>>>> head(genomeStyles()$Homo_sapiens, 2)
>>>>>>> circular auto sex NCBI UCSC dbSNP Ensembl
>>>>>>>
>>>>>>> 1 FALSE TRUE FALSE 1 chr1 ch1 1
>>>>>>>
>>>>>>> 2 FALSE TRUE FALSE 2 chr2 ch2 2
>>>>>>>
>>>>>>>
>>>>>>> that includes the values for "NCBI reference sequence names"
>>>>>>>
>>>>>>> See
>>>>>>>
>>>>
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.ncbi.nlm.nih.gov_nuccore_568815581&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=LCcYSINIz3XXhf8i-26IegXRLkTO1NgVbvzgvnPA3dc&s=3Jy-MH7heIcrc_A4qm_izduLvBoPWHSeq4gdxf5nv24&e=
>>>>>>> for one report on chr17,
>>>>>>> and
>>>>>>>
>>>>>>>
>>>>
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.ncbi.nlm.nih.gov_assembly_GCF-5F000001405.39&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=LCcYSINIz3XXhf8i-26IegXRLkTO1NgVbvzgvnPA3dc&s=y6ut_Xcc4rSbXanckiJhiwLsL0W8neJfKWQa6wnG3aM&e=
>>>>>>>
>>>>>>> for a table that includes the Genbank labels.
>>>>>>>
>>>>>>> Should I just file a PR at
>>>>>>>
>>>>
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_Bioconductor_GenomeInfoDb_&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=LCcYSINIz3XXhf8i-26IegXRLkTO1NgVbvzgvnPA3dc&s=KMzfo3_8kkJ-wdvRCNP5rUjTVMW87brj07yHaKL5Qb0&e=
>>>>>>> after
>>>>>>> testing?
>>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioc-devel using r-project.org mailing list
>>>>>>
>>>>
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=LCcYSINIz3XXhf8i-26IegXRLkTO1NgVbvzgvnPA3dc&s=SvtNreKVOHnSGjsRwzWWpttpEF7wBXI5utI37-qgX1A&e=
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Robert Castelo, PhD
>>>> Associate Professor
>>>> Dept. of Experimental and Health Sciences
>>>> Universitat Pompeu Fabra (UPF)
>>>> Barcelona Biomedical Research Park (PRBB)
>>>> Dr Aiguader 88
>>>> E-08003 Barcelona, Spain
>>>> telf: +34.933.160.514
>>>> fax: +34.933.160.550
>>>>
>>>
>>
>> --
>> The information in this e-mail is intended only for t...{{dropped:21}}
More information about the Bioc-devel
mailing list