[Bioc-devel] proposal for additional seqlevelsStyle

Vincent Carey stvjc using channing.harvard.edu
Fri Dec 13 17:45:19 CET 2019


I tried an inline png but I think it was rejected by bioc-devel.  Here's
another try.

On Fri, Dec 13, 2019 at 11:40 AM Vincent Carey <stvjc using channing.harvard.edu>
wrote:

> Thanks -- it is good to know more about the complications of adding
> seqlevelsStyle elements. I am not sure how pervasive this naming will be
> in SNP annotation in the future. The "new API" for dbSNP references SPDI
> annotation conventions:
>
> https://api.ncbi.nlm.nih.gov/variation/v0/
>
> At least one dbSNP build 152 resource uses this nomenclature. The one
> referenced below is the "go-to" resource for current rsid-coordinate
> correspondence, as far as I know.
>
>
> > library(VariantAnnotation)
> *0/0 packages newly attached/loaded, see sessionInfo() for details.*
> > mypar = GRanges("NC_000001.11", IRanges(100000,120000)) # note seqnames
> > nn = readVcf("ftp://ftp.ncbi.nih.gov/snp/redesign/latest_release/VCF/GCF_000001405.38.gz",
> +   genome="GRCh38", param=mypar)
> > head(rowRanges(nn), 3)
> GRanges object with 3 ranges and 5 metadata columns:
>                    seqnames    ranges strand | paramRangeID            REF
>                       <Rle> <IRanges>  <Rle> |     <factor> <DNAStringSet>
>   rs1331956057 NC_000001.11    100000      * |         <NA>              C
>   rs1252351580 NC_000001.11    100036      * |         <NA>              T
>   rs1238523913 NC_000001.11    100051      * |         <NA>              T
>                               ALT      QUAL      FILTER
>                <DNAStringSetList> <numeric> <character>
>   rs1331956057                  T      <NA>           .
>   rs1252351580                  G      <NA>           .
>   rs1238523913                  C      <NA>           .
>   -------
>   seqinfo: 1 sequence from GRCh38 genome; no seqlengths
>
>
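For anyone who needs to combine such a VCF with UCSC-named annotation in the meantime, here is a minimal base R sketch of translating the RefSeq accessions by hand. The lookup vector and the helper translate_seqnames() are hypothetical names made up for illustration, not part of GenomeInfoDb, and only accessions mentioned in this thread are filled in.

```r
# Hypothetical lookup: RefSeq accession -> UCSC-style name.
# Valid for GRCh38 only; other assemblies use different accession versions.
refseq_to_ucsc <- c("NC_000001.11" = "chr1", "NC_000017.11" = "chr17")

# seqnames as they appear in the dbSNP VCF ranges quoted above
vcf_seqnames <- c("NC_000001.11", "NC_000001.11", "NC_000001.11")

# Translate a character vector of seqnames, failing loudly on anything
# that is missing from the lookup rather than silently returning NA.
translate_seqnames <- function(x, map) {
  unknown <- setdiff(unique(x), names(map))
  if (length(unknown) > 0L) {
    stop("no mapping for: ", paste(unknown, collapse = ", "))
  }
  unname(map[x])
}

ucsc_names <- translate_seqnames(vcf_seqnames, refseq_to_ucsc)
print(ucsc_names)
```

With a GRanges or VCF object, the same named vector could be passed to renameSeqlevels() from GenomeInfoDb instead of translating a plain character vector.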
> On Fri, Dec 13, 2019 at 11:01 AM Robert Castelo <robert.castelo using upf.edu>
> wrote:
>
>> hi Hervé,
>>
>> i didn't know about this new sequence style until Vince posted his
>> message and we briefly talked about it at the European BioC meeting this
>> week in Brussels. however, i didn't know that the style was specific to
>> a particular assembly. i have no use case of this at the moment,
>> i.e., i have not encountered myself any annotation or BAM file with
>> chromosome names written that way, so i don't know how pressing this
>> issue is, maybe Vince can tell us how widespread such chromosome naming
>> style may become in the near future.
>>
>> naively, i'd think that it would be a matter of adding a
>> reference-specific column, i.e., 'GRCh38.p13', 'GRCh37.p13', etc., but i
>> can imagine that maybe the "reference style" concept might not be the
>> appropriate placeholder to map all different chromosome names of all
>> different individual human genomes uploaded to NCBI. maybe we should
>> wait until we have a specific use case .. Vince?
>>
>> robert.
>>
>> On 12/11/19 10:06 PM, Pages, Herve wrote:
>> > Hi Vince, Robert,
>> >
>> > Looks like Vince wants the RefSeq accession e.g. NC_000017.11 for chrom
>> > 17 in the GRCh38.
>> >
>> > @Robert: Is this what you're also interested in?
>> >
>> > The problem is that the RefSeq accessions are specific to a particular
>> > assembly (e.g. NC_000017.11 for chrom 17 in GRCh38 but NC_000017.10 for
>> > the same chrom in GRCh37).
>> >
>> > Currently seqlevelsStyle() doesn't know how to distinguish between
>> > different assemblies of the same organism. Not saying it couldn't but it
>> > would require some thinking and some significant refactoring. It
>> > wouldn't be just a matter of adding a column to
>> > genomeStyles()$Homo_sapiens.
>> >
>> > H.
>> >
>> >
>> > On 12/10/19 14:19, Robert Castelo wrote:
>> >> I second this, and would suggest to name the style as 'GRC' for "Genome
>> >> Reference Consortium".
>> >>
>> >> thanks Vince for bringing this up, being able to easily switch between
>> >> genome styles is great.
>> >>
>> >> if 'paste0()' in R is one of the most influential contributions to
>> >> statistical computing
>> >>
>> >> https://simplystatistics.org/2013/01/31/paste0-is-statistical-computings-most-influential-contribution-of-the-21st-century
>> >>
>> >> i think that 'seqlevelsStyle()' from the GenomeInfoDb package is one of
>> >> the most influential contributions to human genetics, if you think about
>> >> the time invested by researchers in parsing and changing between
>> >> different styles of chromosome names :)
>> >>
>> >> robert.
>> >>
>> >> On 06/12/2019 15:03, Vincent Carey wrote:
>> >>> I raised this issue previously with little response.
>> >>>
>> >>> I'd propose that we add a column or two to genomeStyles()$Homo_sapiens
>> >>>
>> >>> > head(genomeStyles()$Homo_sapiens, 2)
>> >>>   circular auto   sex NCBI UCSC dbSNP Ensembl
>> >>> 1    FALSE TRUE FALSE    1 chr1   ch1       1
>> >>> 2    FALSE TRUE FALSE    2 chr2   ch2       2
>> >>>
>> >>>
>> >>> that includes the values for "NCBI reference sequence names".
>> >>>
>> >>> See
>> >>> https://www.ncbi.nlm.nih.gov/nuccore/568815581
>> >>> for one report on chr17, and
>> >>> https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.39
>> >>> for a table that includes the Genbank labels.
>> >>>
>> >>> Should I just file a PR at
>> >>> https://github.com/Bioconductor/GenomeInfoDb/
>> >>> after testing?
>> >>>
>> >>
>> >> _______________________________________________
>> >> Bioc-devel using r-project.org mailing list
>> >>
>> >> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>> >>
>> >
>>
>> --
>> Robert Castelo, PhD
>> Associate Professor
>> Dept. of Experimental and Health Sciences
>> Universitat Pompeu Fabra (UPF)
>> Barcelona Biomedical Research Park (PRBB)
>> Dr Aiguader 88
>> E-08003 Barcelona, Spain
>> telf: +34.933.160.514
>> fax: +34.933.160.550
>>
>
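To make the point about assembly-specific accessions concrete, here is a rough base R sketch of what an assembly-aware lookup might look like. The table refseq_styles and the function ucsc_to_refseq() are hypothetical and do not reflect how genomeStyles() is organized; the only point is that the assembly has to be part of the key, which a single extra column per organism cannot express.

```r
# Hypothetical long-format table: one row per (assembly, chromosome).
# Accessions for chr17 are the ones quoted in this thread; chr1 follows
# the same versioning pattern (.10 for GRCh37, .11 for GRCh38).
refseq_styles <- data.frame(
  assembly = c("GRCh37", "GRCh37", "GRCh38", "GRCh38"),
  UCSC     = c("chr1", "chr17", "chr1", "chr17"),
  RefSeq   = c("NC_000001.10", "NC_000017.10",
               "NC_000001.11", "NC_000017.11"),
  stringsAsFactors = FALSE
)

# Map UCSC-style names to RefSeq accessions for one given assembly.
ucsc_to_refseq <- function(ucsc, assembly, styles = refseq_styles) {
  rows <- styles[styles$assembly == assembly, , drop = FALSE]
  if (nrow(rows) == 0L) {
    stop("unknown assembly: ", assembly)
  }
  out <- rows$RefSeq[match(ucsc, rows$UCSC)]
  if (anyNA(out)) {
    stop("no RefSeq accession for: ",
         paste(ucsc[is.na(out)], collapse = ", "))
  }
  out
}

# Same chromosome, different accession depending on the assembly.
print(ucsc_to_refseq("chr17", "GRCh37"))
print(ucsc_to_refseq("chr17", "GRCh38"))
```

A style switch in the spirit of seqlevelsStyle() would then need to know the genome of the object being renamed before it could pick a row.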
