[Bioc-devel] Missing CHM13v2.0 TxDB and OrgDb objects

Tue Dec 12 20:28:20 CET 2023

FWIW I've documented the process of making a TxDb object for 
T2T-CHM13v2.0 there:

https://github.com/Bioconductor/GenomicFeatures/issues/65

Please comment there for any follow-up.

Note that we're considering wrapping this in a TxDb package that we'll
make available to the community. It's a work in progress.

Thanks!

H.

On 12/12/23 07:29, James W. MacDonald wrote:
> Hi Christian,
>
> This conversation is off-topic, both for this listserv (it’s meant to help people developing Bioconductor packages) and for the support site (which, again, is meant to help people with Bioconductor packages). I’ll answer your questions one more time, but if you have other questions, please move to biostars.org, or just ask the ArchR people directly, since it’s their package.
>
> I believe you are misinterpreting what an OrgDb is intended to provide. There is no positional data in an OrgDb, and what the CHM13 project has done is completely positional (the data provided in the ‘Gene Annotation’ section of the CHM13 GitHub are all GFF files, which are meant to provide positional information of genes on a genome).
>
> The OrgDb package provides functional and within-annotation mappings. You can map an NCBI Gene ID to Ensembl, or to the HGNC gene symbol, or a GO term, etc. For example, I can map gene symbol TP53 (alias P53) to NCBI Gene ID 7157, or to its UniProt accession K7PPA8. If the new genome build says TP53 has moved to a new genomic position, that has no effect on what UniProt thinks the ID for that gene’s protein should be, or what ID NCBI uses, or what GO terms are attached to that gene. Functionally it’s the same gene. We just might think it is located in a different place in the genome.
>
> The difference between CHM13 and GRCh38 is not materially different from the difference between GRCh37 and GRCh38 (they represent the current knowledge of the genome at a point in time), and while we supply TxDb packages for GRCh38 and GRCh37 (and variants based on NCBI’s mappings as well as Ensembl’s mappings), we have never supplied more than one human OrgDb package, because the positional and functional information are orthogonal.
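
A minimal sketch of the kind of identifier mapping described above, assuming the AnnotationDbi and org.Hs.eg.db packages are installed. The official symbol TP53 is used as the lookup key, on the assumption that this is the gene meant by P53. No genome build appears anywhere in these calls, which is the point about positional and functional information being independent.

```r
library(AnnotationDbi)
library(org.Hs.eg.db)

# Map an HGNC symbol to NCBI Gene, Ensembl and UniProt identifiers.
# select() returns a data.frame and may return several rows per key.
ids <- AnnotationDbi::select(org.Hs.eg.db,
                             keys = "TP53",
                             keytype = "SYMBOL",
                             columns = c("ENTREZID", "ENSEMBL", "UNIPROT"))
print(ids)

# mapIds() returns a named vector with one value per key
# (the first match by default).
sym <- mapIds(org.Hs.eg.db, keys = "7157",
              keytype = "ENTREZID", column = "SYMBOL")
print(sym)
```

Running keytypes(org.Hs.eg.db) lists the other identifier types that can be used as keys, for example ALIAS for non-official symbols.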
>
> It seems pretty simple to make what you need though.
>
>> library(GenomicFeatures)  # makeTxDbFromGFF() is provided by GenomicFeatures
>> tx <- makeTxDbFromGFF("https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.gff.gz")
> Import genomic features from the file as a GRanges object ... trying URL 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.gff.gz'
> Content type 'application/x-gzip' length 79009538 bytes (75.3 MB)
> downloaded 75.3 MB
>
> OK
> Prepare the 'metadata' data frame ... OK
> Make the TxDb object ... OK
> Warning messages:
> 1: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID,  :
>    some transcripts have no
>    "transcript_id" attribute ==>
>    their name ("tx_name" column in
>    the TxDb object) was set to NA
> 2: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID,  :
>    the transcript names ("tx_name"
>    column in the TxDb object)
>    imported from the
>    "transcript_id" attribute are
>    not unique
> 3: In .find_exon_cds(exons, cds) : The following transcripts have
>    exons that contain more than one
>    CDS (only the first CDS was kept
>    for each exon):
>    rna-NM_001134939.1,
>    rna-NM_001172437.2,
>    rna-NM_001184961.1,
>    rna-NM_001301020.1,
>    rna-NM_001301302.1,
>    rna-NM_001301371.1,
>    rna-NM_002537.3,
>    rna-NM_004152.3,
>    rna-NM_015068.3, rna-NM_016178.2
>> tx
> TxDb object:
> # Db type: TxDb
> # Supporting package: GenomicFeatures
> # Data source:https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.gff.gz
> # Organism: NA
> # Taxonomy ID: NA
> # miRBase build ID: NA
> # Genome: NA
> # Nb of transcripts: 188205
> # Db created by: GenomicFeatures package from Bioconductor
> # Creation time: 2023-12-12 10:17:34 -0500 (Tue, 12 Dec 2023)
> # GenomicFeatures version at creation time: 1.54.1
> # RSQLite version at creation time: 2.3.1
> # DBSCHEMAVERSION: 1.2
>
>> genomeAnnotation <- createGenomeAnnotation(BSgenome.Hsapiens.NCBI.T2T.CHM13v2.0)
>> geneAnnotation <- createGeneAnnotation(TxDb = tx, OrgDb = org.Hs.eg.db)
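
Since building the TxDb downloads about 75 MB and takes a while, it may be worth saving the result and reloading it in later sessions. The following is a sketch only: the file name and the dataSource string are hypothetical, and organism is set on the assumption that this fills in the "Organism: NA" field shown in the printout above. In more recent Bioconductor releases makeTxDbFromGFF() is provided by the txdbmaker package instead.

```r
library(GenomicFeatures)  # makeTxDbFromGFF() (txdbmaker in newer releases)
library(AnnotationDbi)    # saveDb(), loadDb()

gff_url <- paste0(
  "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/",
  "GCF_009914755.1_T2T-CHM13v2.0/",
  "GCF_009914755.1_T2T-CHM13v2.0_genomic.gff.gz"
)

# Build the TxDb, recording the organism and a free-text data source
# (the dataSource text is a placeholder chosen for this example).
tx <- makeTxDbFromGFF(gff_url,
                      organism = "Homo sapiens",
                      dataSource = "NCBI RefSeq annotation of T2T-CHM13v2.0")

# Save to an SQLite file (hypothetical name) and reload it later.
txdb_file <- "TxDb.T2T-CHM13v2.0.sqlite"
saveDb(tx, file = txdb_file)
tx <- loadDb(txdb_file)
```

The reloaded object can be passed to createGeneAnnotation() in the same way as the freshly built one.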
>
>
> Best,
>
> Jim
>
> From: Christian Arnold <chrarnold using web.de>
> Sent: Tuesday, December 12, 2023 9:35 AM
> To: Vincent Carey <stvjc using channing.harvard.edu>; James W. MacDonald <jmacdon using uw.edu>
> Cc: bioc-devel using r-project.org
> Subject: Re: [Bioc-devel] Missing CHM13v2.0 TxDB and OrgDb objects
>
> Dear Vincent and others,
>
> thanks for the reply! Irrespective of whether a different OrgDb is required, the name itself suggested that there "should be" also corresponding OrgDb and TxDb packages. I see that I can build one on my own, but is anyone working on providing the TxDb object for Bioc?
>
> I am also asking this because the T2T people specifically provide an "updated" gene annotation dataset which may differ from what's inside the OrgDb and may be incompatible with it. See here: https://github.com/marbl/CHM13
>
> JHU RefSeqv110 + Liftoff v5.1 (https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/annotation/chm13v2.0_RefSeq_Liftoff_v5.1.gff3.gz): This contains curated annotations of the ampliconic genes on the Y chromosome, correcting annotation errors in GENCODEv35 CAT/Liftoff and RefSeqv110 annotation. Additional copies found in T2T-Y were annotated to the closest available gene in RefSeq, allowing multiple genes to have the same common name. This file has been modified to correct special character issues from the original file.
>
>
>
>
> For ArchR, I tried to understand how one can create a new genome by checking here: https://www.archrproject.com/bookdown/getting-set-up.html. There, they explicitly mention the TxDb and OrgDb objects that are needed for building a custom genome. There seems to be another option when either or both of these are not available ("Alternatively, if you dont have a TxDb and OrgDb object, you can create a geneAnnotation object from the following information"), but I first tried to do it the easy way, as I want to embed it properly in a pipeline with as little "custom" code as possible.
>
>
>
> Thanks,
> Christian
>
>
>
>
> On 11/12/2023 15:30, Vincent Carey wrote:
> Thanks Jim, I tend to agree with you.  Christian, I had a look at ArchR but could not tell where the
> system contacts the Bioc annotation elements.  Can you give some hints?  I'd like to be able to
> verify compatibility.
>
> On Mon, Dec 11, 2023 at 9:19 AM James W. MacDonald <jmacdon using uw.edu> wrote:
> I don't believe a different OrgDb is required. The OrgDb package is meant to provide annotations for genes such as gene symbol or GO term, etc, which are orthogonal to the sequence of the genome, so the current version should suffice.
>
> -----Original Message-----
> From: Bioc-devel <bioc-devel-bounces using r-project.org> On Behalf Of Vincent Carey
> Sent: Sunday, December 10, 2023 1:44 PM
> To: Christian Arnold <chrarnold using web.de>
> Cc: bioc-devel using r-project.org
> Subject: Re: [Bioc-devel] Missing CHM13v2.0 TxDB and OrgDb objects
>
> Good question.  I believe these will be forthcoming soon.  In the meantime you can create your own.  See, for example:
>
> https://github.com/vjcitn/BiocT2T/blob/devel/inst/scripts/makeTxDb.R
>
> It's an active area so you can pull a GFF file from https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=T2T/CHM13/assemblies/annotation/
> and adjust the code noted above for the TxDb.
>
> For the org.db I have to get back to you.
>
> On Sun, Dec 10, 2023 at 12:06 PM Christian Arnold via Bioc-devel <bioc-devel using r-project.org> wrote:
>
>> Hello, I am working with the new human T2T-CHM13v2.0  assembly and
>> while a BSgenome package already exists
>> (BSgenome.Hsapiens.NCBI.T2T.CHM13v2.0), I could not find the
>> corresponding TxDb and OrgDb packages. Is there any information when
>> they may also become available so it is easier to work with the new
>> genome for packages like ArchR, which support a custom genome but need
>> these standard annotation packages for their creation?
>>
>>
>> Thanks a lot for any information regarding this!
>>
>> Best, Christian
>>
>> _______________________________________________
>> Bioc-devel using r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>