[Bioc-devel] BSgenome changes
Hervé Pagès
hpages using fredhutch.org
Thu Aug 20 19:44:19 CEST 2020
Kasper,
The tradition so far has been to package all UCSC human genomes since
hg17. We could also start producing BSgenome packages for other non-UCSC
Human assemblies. We just need to draw a line somewhere. If there is a
need for it we can make BSgenome.Hsapiens.NCBI.GRCh37.p13 available, as
I said earlier. Is this what you are asking for?
H.
On 8/20/20 03:23, Kasper Daniel Hansen wrote:
> Well, the presence of two mitochondrial genomes is to fix a mistake by
> UCSC. I can appreciate the importance of representing this mistake when
> you build off UCSC. But it strikes me as not actually representing the
> GRCh37 version of the genome, and it seems to me that we want such a
> representation in the project - not everything comes through UCSC. But
> perhaps I have not given this sufficient thought, this is just my
> immediate reaction.
>
> On Tue, Aug 18, 2020 at 8:18 PM Leonard Goldstein
> <goldstein.leonard using gene.com> wrote:
>
> Thanks for the explanation Hervé.
>
> Best wishes
>
> Leonard
>
>
> On Tue, Aug 18, 2020 at 10:06 AM Hervé Pagès <hpages using fredhutch.org> wrote:
>
> On 8/18/20 01:40, Kasper Daniel Hansen wrote:
> > In light of this, could we get a version of GRCh37 with only a single
> > mitochondrial genome?
>
> You mean a BSgenome.Hsapiens.NCBI.GRCh37.p13 package? So it would
> contain the same sequences as BSgenome.Hsapiens.UCSC.hg19 but without
> the hg19:chrM sequence?
>
> Certainly doable but note that by using BSgenome.Hsapiens.UCSC.hg38 you
> stay away from this mess. I'm not sure that adding yet another BSgenome
> package would make the situation less confusing.
>
> >
> > On Fri, Aug 14, 2020 at 6:17 PM Hervé Pagès <hpages using fredhutch.org> wrote:
> >
> > Hi Felix,
> >
> > On 8/13/20 21:43, Felix Ernst wrote:
> > > Hi Leonard, Hi Herve,
> > >
> > > I followed your conversation, since I have noticed the same
> > > problem. Thanks, Herve, for the explanation of the recent changes on
> > > hg19.
> > >
> > > The GRCh37.p13 report states in its last line:
> > >
> > > MT assembled-molecule MT Mitochondrion J01415.2 = NC_012920.1 non-nuclear 16569 chrM
> > >
> > > Since the last column is called "UCSC-style-name", wouldn't that
> > > mean that chrM, and not chrMT, has to be renamed to MT?
> >
> > This is a mistake in the sequence report for GRCh37.p13. GRCh37.p13:MT
> > is the same as hg19:chrMT, not hg19:chrM.
> >
> > hg19:chrM and hg19:chrMT are **not** the same sequences. The former is
> > NC_001807 and has length 16571 and the latter is NC_012920.1 and has
> > length 16569.
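
The mismatch can be seen by splitting the quoted report line into its columns. The following is a minimal Python sketch: the column names follow the header of NCBI assembly report files, while the `HG19_MITO` table and both helper names are made up for illustration and only restate the accessions and lengths given above.

```python
# Column names as they appear in the header of an NCBI assembly report.
COLUMNS = [
    "Sequence-Name", "Sequence-Role", "Assigned-Molecule",
    "Assigned-Molecule-Location/Type", "GenBank-Accn", "Relationship",
    "RefSeq-Accn", "Assembly-Unit", "Sequence-Length", "UCSC-style-name",
]

# Last line of the GRCh37.p13 report as quoted in this thread
# (whitespace-separated here; the real file is tab-delimited).
REPORT_LINE = (
    "MT assembled-molecule MT Mitochondrion J01415.2 = "
    "NC_012920.1 non-nuclear 16569 chrM"
)

# Hypothetical lookup restating the facts from this thread:
# hg19 sequence name -> (RefSeq accession, length).
HG19_MITO = {
    "chrM": ("NC_001807", 16571),
    "chrMT": ("NC_012920.1", 16569),
}


def parse_report_line(line):
    """Split one report line into a dict keyed by column name."""
    fields = line.split()
    if len(fields) != len(COLUMNS):
        raise ValueError("unexpected number of fields")
    record = dict(zip(COLUMNS, fields))
    record["Sequence-Length"] = int(record["Sequence-Length"])
    return record


def matching_hg19_names(record):
    """Return the hg19 names whose accession and length match the record."""
    return [
        name
        for name, (accn, length) in HG19_MITO.items()
        if accn == record["RefSeq-Accn"] and length == record["Sequence-Length"]
    ]


record = parse_report_line(REPORT_LINE)
print(record["UCSC-style-name"])    # name claimed by the report: chrM
print(matching_hg19_names(record))  # name that actually matches: ['chrMT']
```

The report's last column says chrM, but the accession and length on the same line match only the sequence that hg19 calls chrMT.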
> >
> > Yes, seqlevelsStyle() is sorting out all this mess for you ;-)
> >
> > Cheers,
> > H.
> >
> > >
> > > Thanks again for the explanation.
> > >
> > > Cheers,
> > > Felix
> > >
> > > -----Original Message-----
> > > From: Bioc-devel <bioc-devel-bounces using r-project.org> On Behalf Of Hervé Pagès
> > > Sent: Friday, 14 August 2020 01:08
> > > To: Leonard Goldstein <goldstein.leonard using gene.com>; bioc-devel using r-project.org
> > > Cc: charlotte.soneson using fmi.ch
> > > Subject: Re: [Bioc-devel] BSgenome changes
> > >
> > > Hi Leonard,
> > >
> > > On 8/12/20 15:22, Leonard Goldstein via Bioc-devel wrote:
> > >> Dear Bioc team,
> > >>
> > >> I'm following up on this recent GitHub issue
> > >> <https://github.com/ldg21/SGSeq/issues/5>. Please see the issue for
> > >> more details and code examples.
> > >>
> > >> It looks like changes in Bioc devel result in two copies of the
> > >> mitochondrial chromosome for BSgenome.Hsapiens.UCSC.hg19 -- one named
> > >> chrM like in previous package versions (length 16571) and one named
> > >> chrMT (length 16569).
> > >>
> > >> When using seqlevelsStyle() to change chromosome names from UCSC to
> > >> NCBI format, this results in new behavior -- in the past chrM was
> > >> simply renamed MT, now the different sequence chrMT is used. Is
> > >> this intended?
> > >
> > > Absolutely intended.
> > >
> > > There is a long story behind the unfortunate fate of the
> > > mitochondrial chromosome in hg19. I'll try to keep it short.
> > >
> > > When the UCSC folks released the hg19 browser more than 10 years
> > > ago, they based it on assembly GRCh37:
> > >
> > > https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.13
> > >
> > > See sequence report for GRCh37:
> > >
> > > https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.13_GRCh37/GCF_000001405.13_GRCh37_assembly_report.txt
> > >
> > > For some mysterious reason GRCh37 didn't include the
> > > mitochondrial chromosome so the UCSC folks decided to use
> > > mitochondrial sequence NC_001807 and called it chrM.
> > >
> > > However, UCSC has recently decided to base hg19 on GRCh37.p13
> > > instead of GRCh37. A rather surprising move after many years of hg19
> > > being based on the latter.
> > >
> > > https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.25/
> > >
> > > See sequence report for GRCh37.p13:
> > >
> > > https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_assembly_report.txt
> > >
> > > Note that GRCh37.p13 does include the mitochondrial chromosome.
> > > It's called MT in the official sequence report above and chrMT in hg19.
> > >
> > > At the same time the UCSC folks decided to keep chrM so now hg19
> > > contains 2 mitochondrial sequences: chrM and chrMT. Previously it
> > > had only one: chrM.
> > >
> > > So what you see in BioC devel in BSgenome.Hsapiens.UCSC.hg19 and with
> > > seqlevelsStyle(genome) is only reflecting this. In particular
> > > seqlevelsStyle(genome) <- "NCBI" now does the following:
> > >
> > > - Rename chrMT -> MT.
> > >
> > > - chrM does NOT get renamed. There is no point in renaming
> > > this sequence because it has no equivalent in GRCh37.p13.
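
The renaming rule in the two bullets above can be sketched as follows. This is a plain-Python illustration of the described behavior, not the GenomeInfoDb implementation; `UCSC_TO_NCBI` and `to_ncbi_style` are hypothetical names, and the table covers only a few sequences.

```python
# Hypothetical, partial mapping from hg19 (UCSC) names to GRCh37.p13 (NCBI)
# names. chrM is deliberately absent: it has no equivalent in GRCh37.p13.
UCSC_TO_NCBI = {
    "chr1": "1",
    "chrX": "X",
    "chrMT": "MT",
}


def to_ncbi_style(seqnames):
    """Rename sequences that have an NCBI equivalent; leave the rest as-is."""
    return [UCSC_TO_NCBI.get(name, name) for name in seqnames]


print(to_ncbi_style(["chr1", "chrX", "chrM", "chrMT"]))
# ['1', 'X', 'chrM', 'MT']
```

Any code that used to pick up the mitochondrial sequence under the name MT after switching styles now gets the 16569-long NC_012920.1 sequence, while the old 16571-long sequence stays available as chrM.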
> > >
> > > Hope this helps,
> > >
> > > H.
> > >
> > >>
> > >> Leonard
> > >>
> > >>
> > >> _______________________________________________
> > >> Bioc-devel using r-project.org mailing list
> > >> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> > >>
> > >
> > > --
> > > Hervé Pagès
> > >
> > > Program in Computational Biology
> > > Division of Public Health Sciences
> > > Fred Hutchinson Cancer Research Center
> > > 1100 Fairview Ave. N, M1-B514
> > > P.O. Box 19024
> > > Seattle, WA 98109-1024
> > >
> > > E-mail: hpages using fredhutch.org
> > > Phone: (206) 667-5791
> > > Fax: (206) 667-1319
> > >
> > >
> >
> >
> >
> >
> >
> > --
> > Best,
> > Kasper
>
>
>
>
> --
> Best,
> Kasper
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages using fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319