[Bioc-devel] proposal for additional seqlevelsStyle
Robert Castelo
robert@c@@te|o @end|ng |rom up|@edu
Fri Jul 3 15:34:56 CEST 2020
Thanks Hervé,
this looks very useful!!!
robert.
On 03/07/2020 07:49, Hervé Pagès wrote:
> Hi Vince, Robert, Kasper,
>
> I've done some work on this. Starting with GenomeInfoDb_1.25.7 the
> seqlevelsStyle() setter has 2 major improvements:
>
> 1. It knows how to rename contigs and scaffolds, not just the
> chromosomes:
>
> library(TxDb.Mmusculus.UCSC.mm10.knownGene)
>
> seqinfo(txdb)
> # Seqinfo object with 66 sequences (1 circular) from mm10 genome:
> # seqnames seqlengths isCircular genome
> # chr1 195471971 <NA> mm10
> # chr2 182113224 <NA> mm10
> # chr3 160039680 <NA> mm10
> # chr4 156508116 <NA> mm10
> # chr5 151834684 <NA> mm10
> # ... ... ... ...
> # chrUn_GL456392 23629 <NA> mm10
> # chrUn_GL456393 55711 <NA> mm10
> # chrUn_GL456394 24323 <NA> mm10
> # chrUn_GL456396 21240 <NA> mm10
> # chrUn_JH584304 114452 <NA> mm10
>
> seqlevelsStyle(txdb) <- "NCBI"
>
> seqinfo(txdb)
> # Seqinfo object with 66 sequences (1 circular) from GRCm38 genome:
> # seqnames seqlengths isCircular genome
> # 1 195471971 <NA> GRCm38
> # 2 182113224 <NA> GRCm38
> # 3 160039680 <NA> GRCm38
> # 4 156508116 <NA> GRCm38
> # 5 151834684 <NA> GRCm38
> # ... ... ... ...
> # MSCHRUN_CTG10 23629 <NA> GRCm38
> # MSCHRUN_CTG11 55711 <NA> GRCm38
> # MSCHRUN_CTG12 24323 <NA> GRCm38
> # MSCHRUN_CTG15 21240 <NA> GRCm38
> # MSCHRUN_CTG23 114452 <NA> GRCm38
>
> 2. It supports new style RefSeq for renaming to/from RefSeq accessions:
>
> seqlevelsStyle(txdb) <- "RefSeq"
>
> seqinfo(txdb)
> # Seqinfo object with 66 sequences (1 circular) from GRCm38 genome:
> # seqnames seqlengths isCircular genome
> # NC_000067.6 195471971 <NA> GRCm38
> # NC_000068.7 182113224 <NA> GRCm38
> # NC_000069.6 160039680 <NA> GRCm38
> # NC_000070.6 156508116 <NA> GRCm38
> # NC_000071.6 151834684 <NA> GRCm38
> # ... ... ... ...
> # NT_166476.1 23629 <NA> GRCm38
> # NT_166477.1 55711 <NA> GRCm38
> # NT_166478.1 24323 <NA> GRCm38
> # NT_166480.1 21240 <NA> GRCm38
> # NT_187064.1 114452 <NA> GRCm38
>
> These new features only work on objects for which the genome is set to
> an NCBI assembly (e.g. WBcel235) or UCSC genome (e.g. ce11). This is
> the case with TxDb, BSgenome, and SNPlocs objects.
>
> The workhorses behind them are new low-level utilities
> getChromInfoFromNCBI() and getChromInfoFromUCSC(). These support 141
> NCBI assemblies and 74 UCSC genomes at the moment, respectively. It's
> easy to add new organisms. The gotcha is that they require internet
> access and so does the seqlevelsStyle() setter. This could be
> mitigated by caching the data via BiocFileCache.
>
> Next thing on the list is to support the GenBank style (Vince's
> original request) to rename to/from GenBank accessions.
>
> Cheers,
> H.
>
>
> On 12/13/19 10:51, Kasper Daniel Hansen wrote:
>> If the chromosome name depends on the assembly, that makes GenomeInfoDb
>> even more useful and necessary. Provided it is supported of course.
>>
>> On Fri, Dec 13, 2019 at 11:45 AM Vincent Carey
>> <stvjc using channing.harvard.edu>
>> wrote:
>>
>>> I tried an inline png but I think it was rejected by bioc-devel.
>>> Here's
>>> another try.
>>>
>>> On Fri, Dec 13, 2019 at 11:40 AM Vincent Carey
>>> <stvjc using channing.harvard.edu
>>>>
>>> wrote:
>>>
>>>> Thanks -- It is good to know more about the complications of adding
>>>> seqlevelsStyle elements.
>>>> I am not sure how pervasive this will be in SNP annotation in the
>>>> future.
>>>> The "new API" for dbSNP
>>>> references SPDI annotation conventions.
>>>>
>>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__api.ncbi.nlm.nih.gov_variation_v0_&d=DwIFaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=q-r9L4kC5cjCxRXc33m9n1hKs7WLWiYhNB2t6B_b7nU&s=1jolcMxvVlDvdU0u8LZeWnWHJFt-R6ZweevICNvw2oo&e=
>>>>
>>>>
>>>> at least one dbsnp build 152 resource uses this nomenclature. The one
>>>>
>>>> referenced below is the "go-to" resource for current rsid-coordinate
>>>>
>>>> correspondence, as far as I know.
>>>>
>>>>
>>>>> library(VariantAnnotation)
>>>>
>>>> *0/0 packages newly attached/loaded, see sessionInfo() for details.*
>>>>
>>>>> mypar = GRanges("NC_000001.11", IRanges(100000,120000)) # note
>>>>> seqnames
>>>>
>>>>
>>>>> nn = readVcf("
>>>>
>>> https://urldefense.proofpoint.com/v2/url?u=ftp-3A__ftp.ncbi.nih.gov_snp_redesign_latest-5Frelease_VCF_GCF-5F000001405.38.gz&d=DwIFaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=q-r9L4kC5cjCxRXc33m9n1hKs7WLWiYhNB2t6B_b7nU&s=SXqRWrv1IK1WpNOf3cjMC82-fvChoJ6za48WzlRGeAU&e=
>>>
>>>> ",
>>>>
>>>> + genome="GRCh38", param=mypar)
>>>>
>>>>
>>>>> head(rowRanges(nn), 3)
>>>>
>>>> GRanges object with 3 ranges and 5 metadata columns:
>>>>
>>>> seqnames ranges strand | paramRangeID
>>> REF
>>>>
>>>> <Rle> <IRanges> <Rle> | <factor>
>>> <DNAStringSet>
>>>>
>>>> rs1331956057 NC_000001.11 100000 * | <NA>
>>> C
>>>>
>>>> rs1252351580 NC_000001.11 100036 * | <NA>
>>> T
>>>>
>>>> rs1238523913 NC_000001.11 100051 * | <NA>
>>> T
>>>>
>>>> ALT QUAL FILTER
>>>>
>>>> <DNAStringSetList> <numeric> <character>
>>>>
>>>> rs1331956057 T <NA> .
>>>>
>>>> rs1252351580 G <NA> .
>>>>
>>>> rs1238523913 C <NA> .
>>>>
>>>> -------
>>>>
>>>> seqinfo: 1 sequence from GRCh38 genome; no seqlengths
>>>>
>>>>
>>>> On Fri, Dec 13, 2019 at 11:01 AM Robert Castelo
>>>> <robert.castelo using upf.edu>
>>>> wrote:
>>>>
>>>>> hi Hervé,
>>>>>
>>>>> i didn't know about this new sequence style until Vince posted his
>>>>> message and we briefly talked about it at the European BioC
>>>>> meeting this
>>>>> week in Brussels. however, i didn't know that the style was
>>>>> specific to
>>>>> a particular assembly. i have no use case of this at the mome moment,
>>>>> i.e., i have not encountered myself any annotation or BAM file with
>>>>> chromosome names written that way, so i don't know how pressing this
>>>>> issue is, maybe Vince can tell us how spread such chromosome naming
>>>>> style may become in the near future.
>>>>>
>>>>> naively, i'd think that it would be matter of adding a
>>>>> reference-specific column, i.e., 'GRCh38.p13', 'GRCh37.p13', etc.,
>>>>> but i
>>>>> can imagine that maybe the "reference style" concept might not be the
>>>>> appropriate placeholder to map all different chromosome names of all
>>>>> different individual human genomes uploaded to NCBI. maybe we should
>>>>> wait until we have a specific use case .. Vince?
>>>>>
>>>>> robert.
>>>>>
>>>>> On 12/11/19 10:06 PM, Pages, Herve wrote:
>>>>>> Hi Vince, Robert,
>>>>>>
>>>>>> Looks like Vince wants the RefSeq accession e.g. NC_000017.11 for
>>> chrom
>>>>>> 17 in the GRCh38.
>>>>>>
>>>>>> @Robert: Is this what you're also interested in?
>>>>>>
>>>>>> The problem is that the RefSeq accessions are specific to a
>>>>>> particular
>>>>>> assembly (e.g. NC_000017.11 for chrom 17 in GRCh38 but NC_000017.10
>>> for
>>>>>> the same chrom in GRCh37).
>>>>>>
>>>>>> Currently seqlevelsStyle() doesn't know how to distinguish between
>>>>>> different assemblies of the same organism. Not saying it couldn't
>>>>>> but
>>> it
>>>>>> would require some thinking and some significant refactoring. It
>>>>>> wouldn't be just a matter of adding a column to
>>>>>> genomeStyles()$Homo_sapiens.
>>>>>>
>>>>>> H.
>>>>>>
>>>>>>
>>>>>> On 12/10/19 14:19, Robert Castelo wrote:
>>>>>>> I second this, and would suggest to name the style as 'GRC' for
>>> "Genome
>>>>>>> Reference Consortium".
>>>>>>>
>>>>>>> thanks Vince for bringing this up, being able to easily switch
>>> between
>>>>>>> genome styles is great.
>>>>>>>
>>>>>>> if 'paste0()' in R is one of the most influential contributions to
>>>>>>> statistical computing
>>>>>>>
>>>>>>>
>>>>>
>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__simplystatistics.org_2013_01_31_paste0-2Dis-2Dstatistical-2Dcomputings-2Dmost-2Dinfluential-2Dcontribution-2Dof-2Dthe-2D21st-2Dcentury&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=LCcYSINIz3XXhf8i-26IegXRLkTO1NgVbvzgvnPA3dc&s=b0_SIu8orJ7ZcCS3TIodFvGTPibt9R8vFL5Y40YSx3Q&e=
>>>
>>>>>>>
>>>>>>> i think that 'seqlevelsStyle()' from the GenomeInfoDb package is
>>>>>>> one
>>> of
>>>>>>> the most influential contributions to human genetics, if you think
>>>>> about
>>>>>>> the time invested by researchers in parsing and changing between
>>>>>>> different styles of chromosome names :)
>>>>>>>
>>>>>>> robert.
>>>>>>>
>>>>>>> On 06/12/2019 15:03, Vincent Carey wrote:
>>>>>>>> I raised this issue previously with little response.
>>>>>>>>
>>>>>>>> I'd propose that we add a column or two to
>>> genomeStyles()$Homo_sapiens
>>>>>>>>
>>>>>>>>> head(genomeStyles()$Homo_sapiens, 2)
>>>>>>>> circular auto sex NCBI UCSC dbSNP Ensembl
>>>>>>>>
>>>>>>>> 1 FALSE TRUE FALSE 1 chr1 ch1 1
>>>>>>>>
>>>>>>>> 2 FALSE TRUE FALSE 2 chr2 ch2 2
>>>>>>>>
>>>>>>>>
>>>>>>>> that includes the values for "NCBI reference sequence names"
>>>>>>>>
>>>>>>>> See
>>>>>>>>
>>>>>
>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.ncbi.nlm.nih.gov_nuccore_568815581&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=LCcYSINIz3XXhf8i-26IegXRLkTO1NgVbvzgvnPA3dc&s=3Jy-MH7heIcrc_A4qm_izduLvBoPWHSeq4gdxf5nv24&e=
>>>
>>>>>>>> for one report on chr17,
>>>>>>>> and
>>>>>>>>
>>>>>>>>
>>>>>
>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.ncbi.nlm.nih.gov_assembly_GCF-5F000001405.39&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=LCcYSINIz3XXhf8i-26IegXRLkTO1NgVbvzgvnPA3dc&s=y6ut_Xcc4rSbXanckiJhiwLsL0W8neJfKWQa6wnG3aM&e=
>>>
>>>>>>>>
>>>>>>>> for a table that includes the Genbank labels.
>>>>>>>>
>>>>>>>> Should I just file a PR at
>>>>>>>>
>>>>>
>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_Bioconductor_GenomeInfoDb_&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=LCcYSINIz3XXhf8i-26IegXRLkTO1NgVbvzgvnPA3dc&s=KMzfo3_8kkJ-wdvRCNP5rUjTVMW87brj07yHaKL5Qb0&e=
>>>
>>>>>>>> after
>>>>>>>> testing?
>>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioc-devel using r-project.org mailing list
>>>>>>>
>>>>>
>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=LCcYSINIz3XXhf8i-26IegXRLkTO1NgVbvzgvnPA3dc&s=SvtNreKVOHnSGjsRwzWWpttpEF7wBXI5utI37-qgX1A&e=
>>>
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Robert Castelo, PhD
>>>>> Associate Professor
>>>>> Dept. of Experimental and Health Sciences
>>>>> Universitat Pompeu Fabra (UPF)
>>>>> Barcelona Biomedical Research Park (PRBB)
>>>>> Dr Aiguader 88
>>>>> E-08003 Barcelona, Spain
>>>>> telf: +34.933.160.514
>>>>> fax: +34.933.160.550
>>>>>
>>>>
>>>
>>> --
>>> The information in this e-mail is intended only for t...{{dropped:21}}
>
> _______________________________________________
> Bioc-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
More information about the Bioc-devel
mailing list