[Bioc-devel] BSgenome changes

Hervé Pagès hpages using fredhutch.org
Thu Aug 20 19:44:19 CEST 2020


Kasper,

The tradition so far has been to package all UCSC human genomes since
hg17. We could also start producing BSgenome packages for non-UCSC
human assemblies; we just need to draw a line somewhere. If there is a
need for it, we can make BSgenome.Hsapiens.NCBI.GRCh37.p13 available, as
I said earlier. Is this what you are asking for?

H.

On 8/20/20 03:23, Kasper Daniel Hansen wrote:
> Well, the presence of two mitochondrial genomes is to fix a mistake by 
> UCSC. I can appreciate the importance of representing this mistake when 
> you build off UCSC. But it strikes me as not actually representing the 
> GRCh37 version of the genome, and it seems to me that we want such a
> representation in the project - not everything comes through UCSC. But
> perhaps I have not given this sufficient thought; this is just my
> immediate reaction.
> 
> On Tue, Aug 18, 2020 at 8:18 PM Leonard Goldstein
> <goldstein.leonard using gene.com> wrote:
> 
>     Thanks for the explanation Hervé.
> 
>     Best wishes
> 
>     Leonard
> 
> 
>     On Tue, Aug 18, 2020 at 10:06 AM Hervé Pagès
>     <hpages using fredhutch.org> wrote:
> 
>         On 8/18/20 01:40, Kasper Daniel Hansen wrote:
>          > In light of this, could we get a version of GRCh37 with only
>          > a single mitochondrial genome?
> 
>         You mean a BSgenome.Hsapiens.NCBI.GRCh37.p13 package? So it would
>         contain the same sequences as BSgenome.Hsapiens.UCSC.hg19 but
>         without the hg19:chrM sequence?
> 
>         Certainly doable, but note that by using
>         BSgenome.Hsapiens.UCSC.hg38 you stay away from this mess. I'm not
>         sure that adding yet another BSgenome package would make the
>         situation less confusing.
> 
>          >
>          > On Fri, Aug 14, 2020 at 6:17 PM Hervé Pagès
>          > <hpages using fredhutch.org> wrote:
>          >
>          >     Hi Felix,
>          >
>          >     On 8/13/20 21:43, Felix Ernst wrote:
>          >      > Hi Leonard, Hi Herve,
>          >      >
>          >      > I followed your conversation, since I have noticed the
>          >      > same problem. Thanks, Herve, for the explanation of the
>          >      > recent changes on hg19.
>          >      >
>          >      > The GRCh37.p13 report states in its last line:
>          >      >
>          >      > MT  assembled-molecule  MT  Mitochondrion  J01415.2  =  NC_012920.1  non-nuclear  16569  chrM
>          >      >
>          >      > Since the last column is called "UCSC-style-name", wouldn't
>          >      > that mean that chrM has to be renamed to MT and not chrMT?
>          >
>          >     This is a mistake in the sequence report for GRCh37.p13.
>          >     GRCh37.p13:MT is the same as hg19:chrMT, not hg19:chrM.
>          >
>          >     hg19:chrM and hg19:chrMT are **not** the same sequences. The
>          >     former is NC_001807 and has length 16571 and the latter is
>          >     NC_012920.1 and has length 16569.
>          >
>          >     Yes, seqlevelsStyle() is sorting out all this mess for you ;-)
>          >
>          >     Cheers,
>          >     H.
>          >
>          >      >
>          >      > Thanks again for the explanation.
>          >      >
>          >      > Cheers,
>          >      > Felix
>          >      >
>          >      > -----Original Message-----
>          >      > From: Bioc-devel <bioc-devel-bounces using r-project.org> On Behalf Of Hervé Pagès
>          >      > Sent: Friday, 14 August 2020 01:08
>          >      > To: Leonard Goldstein <goldstein.leonard using gene.com>; bioc-devel using r-project.org
>          >      > Cc: charlotte.soneson using fmi.ch
>          >      > Subject: Re: [Bioc-devel] BSgenome changes
>          >      >
>          >      > Hi Leonard,
>          >      >
>          >      > On 8/12/20 15:22, Leonard Goldstein via Bioc-devel wrote:
>          >      >> Dear Bioc team,
>          >      >>
>          >      >> I'm following up on this recent GitHub issue
>          >      >> <https://github.com/ldg21/SGSeq/issues/5>.
>          >      >> Please see the issue for more details and code examples.
>          >      >>
>          >      >> It looks like changes in Bioc devel result in two copies of
>          >      >> the mitochondrial chromosome for BSgenome.Hsapiens.UCSC.hg19
>          >      >> -- one named chrM like in previous package versions (length
>          >      >> 16571) and one named chrMT (length 16569).
>          >      >>
>          >      >> When using seqlevelsStyle() to change chromosome names from
>          >      >> UCSC to NCBI format, this results in new behavior -- in the
>          >      >> past chrM was simply renamed MT, now the different sequence
>          >      >> chrMT is used. Is this intended?
>          >      >
>          >      > Absolutely intended.
>          >      >
>          >      > There is a long story behind the unfortunate fate of the
>          >      > mitochondrial chromosome in hg19. I'll try to keep it short.
>          >      >
>          >      > When the UCSC folks released the hg19 browser more than 10
>          >      > years ago, they based it on assembly GRCh37:
>          >      >
>          >      > https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.13
>          >      >
>          >      > See sequence report for GRCh37:
>          >      >
>          >      > https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.13_GRCh37/GCF_000001405.13_GRCh37_assembly_report.txt
>          >      >
>          >      > For some mysterious reason GRCh37 didn't include the
>          >      > mitochondrial chromosome, so the UCSC folks decided to use
>          >      > mitochondrial sequence NC_001807 and called it chrM.
>          >      >
>          >      > However, UCSC has recently decided to base hg19 on GRCh37.p13
>          >      > instead of GRCh37. A rather surprising move after many years
>          >      > of hg19 being based on the latter.
>          >      >
>          >      > https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.25/
>          >      >
>          >      > See sequence report for GRCh37.p13:
>          >      >
>          >      > https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_assembly_report.txt
>          >      >
>          >      > Note that GRCh37.p13 does include the mitochondrial
>          >      > chromosome. It's called MT in the official sequence report
>          >      > above and chrMT in hg19.
>          >      >
>          >      > At the same time the UCSC folks decided to keep chrM, so now
>          >      > hg19 contains 2 mitochondrial sequences: chrM and chrMT.
>          >      > Previously it had only one: chrM.
>          >      >
>          >      > So what you see in BioC devel in BSgenome.Hsapiens.UCSC.hg19
>          >      > and with seqlevelsStyle(genome) is only reflecting this. In
>          >      > particular, seqlevelsStyle(genome) <- "NCBI" now does the
>          >      > following:
>          >      >
>          >      >     - Rename chrMT -> MT.
>          >      >
>          >      >     - chrM does NOT get renamed. There is no point in renaming
>          >      >       this sequence because it has no equivalent in GRCh37.p13.
>          >      >
>          >      > Hope this helps,
>          >      >
>          >      > H.
>          >      >
>          >      >>
>          >      >> Leonard
>          >      >>
>          >      >>      [[alternative HTML version deleted]]
>          >      >>
>          >      >> _______________________________________________
>          >      >> Bioc-devel using r-project.org mailing list
>          >      >> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>          >      >>
>          >      >
>          >      > --
>          >      > Hervé Pagès
>          >      >
>          >      > Program in Computational Biology
>          >      > Division of Public Health Sciences
>          >      > Fred Hutchinson Cancer Research Center
>          >      > 1100 Fairview Ave. N, M1-B514
>          >      > P.O. Box 19024
>          >      > Seattle, WA 98109-1024
>          >      >
>          >      > E-mail: hpages using fredhutch.org
>          >      > Phone:  (206) 667-5791
>          >      > Fax:    (206) 667-1319
>          >      >
>          >
>          >
>          >
>          >
>          > --
>          > Best,
>          > Kasper
> 
> -- 
> Best,
> Kasper

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages using fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


