[Bioc-devel] BSgenome changes

Hervé Pagès hpages using fredhutch.org
Fri Aug 14 18:16:27 CEST 2020


Hi Felix,

On 8/13/20 21:43, Felix Ernst wrote:
> Hi Leonard, Hi Herve,
> 
> I followed your conversation, since I have noticed the same problem. Thanks, Herve, for the explanation of the recent changes on hg19.
> 
> The GRCh37.p13 report states in its last line:
> 
> MT	assembled-molecule	MT	Mitochondrion	J01415.2	=	NC_012920.1	non-nuclear	16569	chrM
> 
> Since the last column is named "UCSC-style-name", wouldn't that mean that chrM has to be renamed to MT, and not chrMT?

This is a mistake in the sequence report for GRCh37.p13: the
UCSC-style-name on that line should be chrMT. GRCh37.p13:MT is the same
sequence as hg19:chrMT, not hg19:chrM.

hg19:chrM and hg19:chrMT are **not** the same sequence. The former is
NC_001807 (length 16571); the latter is NC_012920.1 (length 16569).
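
To make the mismatch concrete, here is a minimal sketch (plain Python, not Bioconductor code) that splits the report line quoted above into its columns and checks which hg19 sequence it really describes. The column names are the ones I'd expect in the header of an NCBI assembly report; `HG19_MITO`, `parse_report_line` and `matching_hg19_names` are hypothetical names made up for this illustration.

```python
# Column names as they appear in the header of NCBI assembly report files
# (assumed here; check the header of the report you actually download).
COLUMNS = [
    "Sequence-Name", "Sequence-Role", "Assigned-Molecule",
    "Assigned-Molecule-Location/Type", "GenBank-Accn", "Relationship",
    "RefSeq-Accn", "Assembly-Unit", "Sequence-Length", "UCSC-style-name",
]

# Last line of the GRCh37.p13 report, as quoted in this thread.
REPORT_LINE = (
    "MT\tassembled-molecule\tMT\tMitochondrion\tJ01415.2\t=\t"
    "NC_012920.1\tnon-nuclear\t16569\tchrM"
)

# Accession and length of the two hg19 mitochondrial sequences,
# taken from the paragraph above.
HG19_MITO = {
    "chrM": ("NC_001807", 16571),
    "chrMT": ("NC_012920.1", 16569),
}


def parse_report_line(line):
    """Split one tab-separated report line into a column -> value dict."""
    return dict(zip(COLUMNS, line.rstrip("\n").split("\t")))


def matching_hg19_names(record):
    """Return the hg19 names whose accession AND length agree with the record."""
    accession = record["RefSeq-Accn"]
    length = int(record["Sequence-Length"])
    return [
        name
        for name, (acc, n) in HG19_MITO.items()
        if acc == accession and n == length
    ]


record = parse_report_line(REPORT_LINE)
print(record["UCSC-style-name"])    # what the report claims: chrM
print(matching_hg19_names(record))  # what accession and length say: ['chrMT']
```

The report's own UCSC-style-name says chrM, but the accession and length on the same line only match chrMT.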

Yes, seqlevelsStyle() is sorting out all this mess for you ;-)
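
For illustration only, the renaming rule can be modelled in a few lines of plain Python. This is a toy model of the behavior described in this thread, not how seqlevelsStyle() is implemented; `UCSC_TO_NCBI` and `to_ncbi_style` are hypothetical names, and the table covers only chr1 and the two mitochondrial sequences.

```python
# Toy UCSC -> NCBI name table for hg19 based on GRCh37.p13.
# chrM is deliberately absent: it has no equivalent in GRCh37.p13.
UCSC_TO_NCBI = {
    "chr1": "1",
    "chrMT": "MT",
}


def to_ncbi_style(seqlevels):
    """Rename every sequence that has an NCBI equivalent; leave the rest as-is."""
    return [UCSC_TO_NCBI.get(name, name) for name in seqlevels]


print(to_ncbi_style(["chr1", "chrM", "chrMT"]))  # ['1', 'chrM', 'MT']
```

So after switching styles, MT refers to the 16569-long sequence, and chrM keeps its UCSC name.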

Cheers,
H.

> 
> Thanks again for the explanation.
> 
> Cheers,
> Felix
> 
> -----Original Message-----
> From: Bioc-devel <bioc-devel-bounces using r-project.org> On Behalf Of Hervé Pagès
> Sent: Friday, August 14, 2020 01:08
> To: Leonard Goldstein <goldstein.leonard using gene.com>; bioc-devel using r-project.org
> Cc: charlotte.soneson using fmi.ch
> Subject: Re: [Bioc-devel] BSgenome changes
> 
> Hi Leonard,
> 
> On 8/12/20 15:22, Leonard Goldstein via Bioc-devel wrote:
>> Dear Bioc team,
>>
>> I'm following up on this recent GitHub issue
>> <https://github.com/ldg21/SGSeq/issues/5>. Please see the issue for
>> more details and code examples.
>>
>> It looks like changes in Bioc devel result in two copies of the
>> mitochondrial chromosome for BSgenome.Hsapiens.UCSC.hg19 -- one named
>> chrM like in previous package versions (length 16571) and one named
>> chrMT (length 16569).
>>
>> When using seqlevelsStyle() to change chromosome names from UCSC to
>> NCBI format, this results in new behavior -- in the past chrM was
>> simply renamed MT, now the different sequence chrMT is used. Is this intended?
> 
> Absolutely intended.
> 
> There is a long story behind the unfortunate fate of the mitochondrial chromosome in hg19. I'll try to keep it short.
> 
> When the UCSC folks released the hg19 browser more than 10 years ago, they based it on assembly GRCh37:
> 
>     https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.13
> 
> See sequence report for GRCh37:
> 
>     https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.13_GRCh37/GCF_000001405.13_GRCh37_assembly_report.txt
> 
> For some mysterious reason GRCh37 didn't include the mitochondrial chromosome so the UCSC folks decided to use mitochondrial sequence
> NC_001807 and called it chrM.
> 
> However, UCSC has recently decided to base hg19 on GRCh37.p13 instead of GRCh37. A rather surprising move after many years of hg19 being based on the latter.
> 
>     https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.25/
> 
> See sequence report for GRCh37.p13:
> 
>     https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_assembly_report.txt
> 
> Note that GRCh37.p13 does include the mitochondrial chromosome. It's called MT in the official sequence report above and chrMT in hg19.
> 
> At the same time the UCSC folks decided to keep chrM, so hg19 now contains 2 mitochondrial sequences: chrM and chrMT. Previously it had only one: chrM.
> 
> So what you see in BioC devel in BSgenome.Hsapiens.UCSC.hg19 and with
> seqlevelsStyle(genome) is only reflecting this. In particular
> seqlevelsStyle(genome) <- "NCBI" now does the following:
> 
>     - Rename chrMT -> MT.
> 
>     - chrM does NOT get renamed. There is no point in renaming this sequence because it has no equivalent in GRCh37.p13.
> 
> Hope this helps,
> 
> H.
> 
>>
>> Leonard
>>
>> _______________________________________________
>> Bioc-devel using r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
> 
> --
> Hervé Pagès
> 
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
> 
> E-mail: hpages using fredhutch.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
> 
> 



