[Bioc-devel] Missing CHM13v2.0 TxDB and OrgDb objects
James W. MacDonald
jm@cdon @end|ng |rom uw@edu
Tue Dec 12 16:29:04 CET 2023
Hi Christian,
This conversation is off-topic, both for this listserv (it’s meant to help people developing Bioconductor packages) and for the support site (which is meant to help people with, again, Bioconductor packages). I’ll answer your questions one more time, but if you have other questions, please move to biostars.org, or just ask the ArchR people directly, since it’s their package.
I believe you are misinterpreting what an OrgDb is intended to provide. There is no positional data in an OrgDb, and what the CHM13 project has done is completely positional (the data provided in the ‘Gene Annotation’ section of the CHM13 GitHub are all GFF files, which are meant to provide positional information of genes on a genome).
The OrgDb package provides functional and within-annotation mappings. You can map an NCBI Gene ID to Ensembl, or to the HGNC gene symbol, or a GO term, etc. For example, I can map gene symbol TP53 to NCBI Gene ID 7157, or to its UniProt accession K7PPA8. If the new genome build says TP53 has moved to a new genomic position, that has no effect on what UniProt thinks the ID for that gene’s protein should be, or what ID NCBI uses, or what GO terms are attached to that gene. Functionally it’s the same gene. We just might think it is located in a different place in the genome.
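As a minimal sketch of the kind of lookup described above, assuming AnnotationDbi and org.Hs.eg.db are installed from Bioconductor (TP53 is the official HGNC symbol for the gene commonly called P53):

```r
# Assumption: AnnotationDbi and org.Hs.eg.db are installed from Bioconductor.
library(AnnotationDbi)
library(org.Hs.eg.db)

# Map one gene symbol to identifiers in other annotation systems.
# The namespace prefix avoids clashes with dplyr::select if dplyr is attached.
res <- AnnotationDbi::select(org.Hs.eg.db,
                             keys = "TP53",
                             keytype = "SYMBOL",
                             columns = c("ENTREZID", "ENSEMBL", "UNIPROT"))
res
```

None of the columns requested here carries genomic coordinates, which is why the same OrgDb can be paired with a TxDb built for any genome build.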
The difference between CHM13 and GRCh38 is not materially different from the difference between GRCh37 and GRCh38 (they represent the current knowledge of the genome at a point in time), and while we supply TxDb packages for GRCh38 and GRCh37 (and variants based on NCBI’s mappings as well as Ensembl’s mappings), we have never supplied more than one human OrgDb package, because the positional and functional information are orthogonal.
It seems pretty simple to make what you need though.
> library(GenomicFeatures)
> tx <- makeTxDbFromGFF("https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.gff.gz")
Import genomic features from the file as a GRanges object ... trying URL 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.gff.gz'
Content type 'application/x-gzip' length 79009538 bytes (75.3 MB)
downloaded 75.3 MB
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning messages:
1: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID, :
some transcripts have no
"transcript_id" attribute ==>
their name ("tx_name" column in
the TxDb object) was set to NA
2: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID, :
the transcript names ("tx_name"
column in the TxDb object)
imported from the
"transcript_id" attribute are
not unique
3: In .find_exon_cds(exons, cds) : The following transcripts have
exons that contain more than one
CDS (only the first CDS was kept
for each exon):
rna-NM_015068.3, rna-NM_016178.2
> tx
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.gff.gz
# Organism: NA
# Taxonomy ID: NA
# miRBase build ID: NA
# Genome: NA
# Nb of transcripts: 188205
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2023-12-12 10:17:34 -0500 (Tue, 12 Dec 2023)
# GenomicFeatures version at creation time: 1.54.1
# RSQLite version at creation time: 2.3.1
> genomeAnnotation <- createGenomeAnnotation(BSgenome.Hsapiens.NCBI.T2T.CHM13v2.0)
> geneAnnotation <- createGeneAnnotation(TxDb = tx, OrgDb = org.Hs.eg.db)
Dear Vincent and others,
Thanks for the reply! Irrespective of whether a different OrgDb is required, the name itself suggested that there "should be" corresponding OrgDb and TxDb packages as well. I see that I can build one on my own, but is anyone working on providing the TxDb object for Bioc?
I am also asking this because the T2T people specifically provide an "updated" gene annotation dataset which may differ from what's inside the OrgDb and may be incompatible with it. See here: https://github.com/marbl/CHM13:
JHU RefSeqv110 + Liftoff v5.1 <https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/annotation/chm13v2.0_RefSeq_Liftoff_v5.1.gff3.gz>: This contains curated annotations of the ampliconic genes on the Y chromosome, correcting annotation errors in GENCODEv35 CAT/Liftoff and RefSeqv110 annotation. Additional copies found in T2T-Y were annotated to the closest available gene in RefSeq, allowing multiple genes to have the same common name. This file has been modified to correct special character issues from the original file.
For ArchR, I tried to understand how one can create a new genome by checking here: https://www.archrproject.com/bookdown/getting-set-up.html. There, they explicitly mention the TxDb and OrgDb objects that are needed for building a custom genome. There seems to be another option when either or both of these are not available ("Alternatively, if you dont have a TxDb and OrgDb object, you can create a geneAnnotation object from the following information"), but I first tried to do it the easy way, as I want to properly embed it in a pipeline with as little "custom" code as possible.
Thanks Jim, I tend to agree with you. Christian, I had a look at ArchR but could not tell where the
system contacts the Bioc annotation elements. Can you give some hints? I'd like to be able to
verify compatibility.
Good question. I believe these will be forthcoming soon. In the meantime you can create your own. See, for example
It's an active area so you can pull a gff file from https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=T2T/CHM13/assemblies/annotation/
and adjust the code noted above for the TxDb.
For the org.db I have to get back to you.
> Hello, I am working with the new human T2T-CHM13v2.0 assembly and
> while a BSgenome package already exists
> (BSgenome.Hsapiens.NCBI.T2T.CHM13v2.0), I could not find the
> corresponding TxDb and OrgDb packages. Is there any information when
> they may also become available so it is easier to work with the new
> genome for packages like ArchR, which support a custom genome but need
> these standard annotation packages for their creation?
> Thanks a lot for any information regarding this!
> Best, Christian
