[Bioc-devel] Bioconductor 3.19 db0s, OrgDbs, and TxDbs now available

Thu Mar 28 20:18:37 CET 2024

hi, related to this, I found recently the problem that the 
TxDb.Hsapiens.UCSC.hg38.knownGene package contains only a few of the 
RN7SL (RNA component of signal recognition particle 7SL) genes and I 
suspect the problem is that on the one hand UCSC currently provides the 
'knownGene' track based on GENCODE annotations, but, on the other hand, 
they map them to Entrez IDs in a way in which they miss genes in that 
mapping. See the following example:

library(org.Hs.eg.db)
library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)

## fetch all the genes with 'RN7SL' in their official HGNC symbol
allsyms <- keys(org.Hs.eg.db, keytype="SYMBOL")
whrn7sl <- grep("RN7SL", allsyms)
length(whrn7sl)
[1] 687

## using org.Hs.eg.db build a map from Entrez IDs to Ensembl IDs
rn7slmap <- select(org.Hs.eg.db, keys=allsyms[whrn7sl], 
columns=c("ENTREZID", "ENSEMBL"),
                    keytype="SYMBOL")
dim(rn7slmap)
[1] 687  3
head(rn7slmap)
    SYMBOL ENTREZID         ENSEMBL
1  RN7SL1     6029 ENSG00000276168
2 RN7SL4P     6030 ENSG00000263740
3 RN7SL5P     6031            <NA>
4 RN7SL6P     6032            <NA>
5 RN7SL7P     6033            <NA>
6 RN7SL8P     6034            <NA>
all(!is.na(rn7slmap$ENTREZID))
[1] TRUE
sum(!is.na(rn7slmap$ENSEMBL))
[1] 666

as you see, org.Hs.eg.db provides Entrez IDs for all of the RN7SL genes 
and a map to most of the corresponding Ensembl gene IDs (666 out of 
687). However, when we look up how may of these genes are in 
TxDb.Hsapiens.UCSC.hg38.knownGene only 4 of them are present!

txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
alltxdbEGs <- keys(txdb, keytype="GENEID")
sum(rn7slmap$ENTREZID %in% alltxdbEGs)
[1] 4

if you inspect the metadata of this TxDb package you'll see that it is 
based on GENCODE v44, let's build the TxDb object directly from the 
GENCODE v44 GTF file:

gencodetxdb <- 
makeTxDbFromGFF("https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.primary_assembly.annotation.gtf.gz")
allgencodeENSGs <- keys(gencodetxdb, keytype="GENEID")
head(allgencodeENSGs)
[1] "ENSG00000000003.16" "ENSG00000000005.6" "ENSG00000000419.14"
[4] "ENSG00000000457.14" "ENSG00000000460.17" "ENSG00000000938.13"

now let's try to see how many of the RN7SL genes in org.Hs.eg.db map to 
the original GENCODE annotations using the mapping in org.Hs.eg.db:

sum(rn7slmap$ENSEMBL %in% gsub("\\.[0-9]+$", "", allgencodeENSGs))
[1] 666

I guess the way in which UCSC maps Entrez IDs and Ensembl IDs could be 
improved, but obviously this is not our responsibility. What I wonder is 
whether we could build a more comprehensive TxDb package based on 
EntrezIDs. I'd say that ideally one would like to have a TxDb package 
with genomic annotations for every gene in org.Hs.eg.db, so that both 
org.Hs.eg.db and at least one of the TxDb.Hsapiens.* packages provided 
by Bioconductor have some minimum level of integrity.

robert.

On 28/3/24 16:31, James W. MacDonald wrote:
> There are EnsDbs for Ensembl builds 87-111 that Johannes Rainier submits to the AnnotationHub for those who want to use Ensembl mappings. And a direct build of a TxDb using a UCSC GTF file has the chrM genes as well.
>
>> ensdb <- hub[["AH100643"]]
>> subsetByOverlaps(transcriptsBy(ensdb), GRanges("MT:1-16569"))
> GRangesList object of length 37:
> $ENSG00000198695
> GRanges object with 1 range and 11 metadata columns:
>        seqnames      ranges strand |           tx_id     tx_biotype
>           <Rle>   <IRanges>  <Rle> |     <character>    <character>
>    [1]       MT 14149-14673      - | ENST00000361681 protein_coding
>        tx_cds_seq_start tx_cds_seq_end         gene_id tx_support_level
>               <integer>      <integer>     <character>        <integer>
>    [1]            14149          14673 ENSG00000198695             <NA>
>            tx_id_version gc_content tx_external_name tx_is_canonical
>              <character>  <numeric>      <character>       <integer>
>    [1] ENST00000361681.2    42.6667       MT-ND6-201               1
>                tx_name
>            <character>
>    [1] ENST00000361681
>    -------
>    seqinfo: 457 sequences (1 circular) from GRCh38 genome
> <snip>
>
>> z <- makeTxDbFromGFF(https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.knownGene.gtf.gz)
> Import genomic features from the file as a GRanges object ... trying URL 'https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.knownGene.gtf.gz'
> Content type 'application/x-gzip' length 38959957 bytes (37.2 MB)
> ==================================================
> downloaded 37.2 MB
>
> OK
> Prepare the 'metadata' data frame ... OK
> Make the TxDb object ... OK
> Warning message:
> In .get_cds_IDX(mcols0$type, mcols0$phase) :
>    The "phase" metadata column contains non-NA values for features of type
>    stop_codon. This information was ignored.
>> subsetByOverlaps(transcriptsBy(z), GRanges("chrM:1-16569"))
> GRangesList object of length 37:
> $ENST00000386347.1
> GRanges object with 1 range and 2 metadata columns:
>        seqnames    ranges strand |     tx_id           tx_name
>           <Rle> <IRanges>  <Rle> | <integer>       <character>
>    [1]     chrM 3230-3304      + |    252881 ENST00000386347.1
>    -------
>    seqinfo: 439 sequences (1 circular) from an unspecified genome; no seqlengths
> <snip>
>
> As with most human endeavors, the weight of history hangs heavy on Bioconductor, and it’s often easier to understand how the machine works rather than trying to change things that are set in stone. I mean I have tried many (many) times to get affy removed in lieu of oligo, to no avail.
>
> From: Tim Triche, Jr.<tim.triche using gmail.com>
> Sent: Thursday, March 28, 2024 10:40 AM
> To: James W. MacDonald<jmacdon using uw.edu>
> Cc: Vincent Carey<stvjc using channing.harvard.edu>;bioc-devel using r-project.org
> Subject: Re: [Bioc-devel] Bioconductor 3.19 db0s, OrgDbs, and TxDbs now available
>
> is this an argument in favor of using ENSEMBL gene and transcript IDs rather than ENTREZ or UCSC? Or just changing the way that the databases are keyed? There really ought not to be transcripts for a gene on a different chromosome from the
> ZjQcmQRYFpfptBannerStart
> This Message Is From an Untrusted Sender
> You have not previously corresponded with this sender.
> Seehttps://itconnect.uw.edu/email-tags  for additional information. Please contact the UW-IT Service Center,help using uw.edu<mailto:help using uw.edu>  206.221.5000, for assistance.
> ZjQcmQRYFpfptBannerEnd
> is this an argument in favor of using ENSEMBL gene and transcript IDs rather than ENTREZ or UCSC?  Or just changing the way that the databases are keyed?  There really ought not to be transcripts for a gene on a different chromosome from the gene, although the MHC and KIR loci (with alt contigs) somewhat force the issue for that.  (we could discuss graph genomes here, but we aren't going to do that, because all gene->transcript->contig mappings start to break)
>
> Omission of an entire chromosome seems... bad?  Regardless of the technical reason why.  There are arguments in favor of e.g. gene -> transcript -> contig where each relationship is potentially 1:many, but if chrM can't be sorted out then I am dubious that more complicated mappings can be efficiently handled.  chrM is particularly weird in that it can have multiple haplotypes (i.e. contigs) even within the same cell, but at some point, simplifications are merited
>
>
> --t
>
>
> On Thu, Mar 28, 2024 at 10:12 AM James W. MacDonald <jmacdon using uw.edu<mailto:jmacdon using uw.edu>> wrote:
> As well as
>
>> subsetByOverlaps(transcripts(Homo.sapiens), GRanges("chrM:1-16569"))
> 'select()' returned 1:1 mapping between keys and columns
> GRanges object with 37 ranges and 2 metadata columns:
>         seqnames      ranges strand |          TXID            TXNAME
>            <Rle>   <IRanges>  <Rle> | <IntegerList>   <CharacterList>
>     [1]     chrM     577-647      + |        252799 ENST00000387314.1
>     [2]     chrM    648-1601      + |        252800 ENST00000389680.2
>     [3]     chrM   1602-1670      + |        252801 ENST00000387342.1
>     [4]     chrM   1671-3229      + |        252802 ENST00000387347.2
>     [5]     chrM   3230-3304      + |        252803 ENST00000386347.1
>     ...      ...         ...    ... .           ...               ...
>    [33]     chrM   5826-5891      - |        252831 ENST00000387409.1
>    [34]     chrM   7446-7514      - |        252832 ENST00000387416.2
>    [35]     chrM 14149-14673      - |        252833 ENST00000361681.2
>    [36]     chrM 14674-14742      - |        252834 ENST00000387459.1
>    [37]     chrM 15956-16023      - |        252835 ENST00000387461.2
>    -------
>    seqinfo: 711 sequences (1 circular) from hg38 genome
>
> However
>
>> subsetByOverlaps(transcriptsBy(Homo.sapiens), GRanges("chrM:1-16569"))
> GRangesList object of length 0:
> <0 elements>
>
> And
>
>> subsetByOverlaps(transcripts(Homo.sapiens, columns = c("GENEID","SYMBOL")), GRanges("chrM:1-16569"))
> 'select()' returned 1:1 mapping between keys and columns
> GRanges object with 37 ranges and 2 metadata columns:
>         seqnames      ranges strand |          GENEID          SYMBOL
>            <Rle>   <IRanges>  <Rle> | <CharacterList> <CharacterList>
>     [1]     chrM     577-647      + |            <NA>            <NA>
>     [2]     chrM    648-1601      + |            <NA>            <NA>
>     [3]     chrM   1602-1670      + |            <NA>            <NA>
>     [4]     chrM   1671-3229      + |            <NA>            <NA>
>     [5]     chrM   3230-3304      + |            <NA>            <NA>
>     ...      ...         ...    ... .             ...             ...
>    [33]     chrM   5826-5891      - |            <NA>            <NA>
>    [34]     chrM   7446-7514      - |            <NA>            <NA>
>    [35]     chrM 14149-14673      - |            <NA>            <NA>
>    [36]     chrM 14674-14742      - |            <NA>            <NA>
>    [37]     chrM 15956-16023      - |            <NA>            <NA>
>    -------
>    seqinfo: 711 sequences (1 circular) from hg38 genome
>
> Everything is mapped via the GENEID, and if you query the UCSC genome browser for hg38/knownGene, asking for gene name, known gene ID and gene symbol, you will get the first and last but not the middle.
>
>
>
> -----Original Message-----
> From: Bioc-devel <bioc-devel-bounces using r-project.org<mailto:bioc-devel-bounces using r-project.org>> On Behalf Of Vincent Carey
> Sent: Thursday, March 28, 2024 10:00 AM
> To: Tim Triche, Jr. <tim.triche using gmail.com<mailto:tim.triche using gmail.com>>
> Cc:bioc-devel using r-project.org<mailto:bioc-devel using r-project.org>
> Subject: Re: [Bioc-devel] Bioconductor 3.19 db0s, OrgDbs, and TxDbs now available
>
> winging it here tim
>
>> select(Homo.sapiens, keys="ENSG00000198727", keytype="ENSEMBL",
> columns=c("GENENAME", "GENEID", "CDSCHROM", "SYMBOL")) 'select()' returned 1:1 mapping between keys and columns
>            ENSEMBL     GENENAME SYMBOL CDSCHROM GENEID
> 1 ENSG00000198727 cytochrome b   CYTB     <NA>   4519
>> select(Homo.sapiens, keys= "MTCYBP1", keytype="SYMBOL",
> columns=c("GENENAME", "GENEID", "CDSCHROM", "SYMBOL")) 'select()' returned 1:1 mapping between keys and columns
>     SYMBOL            GENENAME CDSCHROM    GENEID
> 1 MTCYBP1 MT-CYB pseudogene 1     <NA> 100499418
>
> relevant?
>
> On Thu, Mar 28, 2024 at 9:17 AM Tim Triche, Jr. <tim.triche using gmail.com<mailto:tim.triche using gmail.com>>
> wrote:
>
>> Hi Lori and fellow maintainers,
>>
>> I had a strange experience yesterday where I pulled down genes and
>> transcripts from Homo.sapiens, only to discover that all mitochondrial
>> encoded genes (MT-CYB, MT-CO2, etc) were missing.
>>
>> Is there an historical reason why this is so? Obviously these
>> transcripts are physiologically important, but beyond that, they’re
>> also used all the time in single cell sequencing to estimate viability.
>>
>> Best,
>>
>> --t
>>
>>> On Mar 28, 2024, at 8:47 AM, Kern, Lori via Bioc-devel <
>> bioc-devel using r-project.org<mailto:bioc-devel using r-project.org>> wrote:
>>> Hello Bioconductor community,
>>>
>>> The newest db0, OrgDb, and TxDb annotation packages for the upcoming
>> Bioconductor 3.19 release are up and available for download in the
>> devel version of Bioconductor.
>>> The deadline for submitting contributed annotation packages will be
>> Wednesday April 17 th.
>>> The new db0 packages are:
>>>
>>> anopheles.db0_3.19.0.tar.gz
>>> arabidopsis.db0_3.19.0.tar.gz
>>> bovine.db0_3.19.0.tar.gz
>>> canine.db0_3.19.0.tar.gz
>>> chicken.db0_3.19.0.tar.gz
>>> chimp.db0_3.19.0.tar.gz
>>> ecoliK12.db0_3.19.0.tar.gz
>>> ecoliSakai.db0_3.19.0.tar.gz
>>> fly.db0_3.19.0.tar.gz
>>> human.db0_3.19.0.tar.gz
>>> malaria.db0_3.19.0.tar.gz
>>> mouse.db0_3.19.0.tar.gz
>>> pig.db0_3.19.0.tar.gz
>>> rat.db0_3.19.0.tar.gz
>>> rhesus.db0_3.19.0.tar.gz
>>> worm.db0_3.19.0.tar.gz
>>> xenopus.db0_3.19.0.tar.gz
>>> yeast.db0_3.19.0.tar.gz
>>> zebrafish.db0_3.19.0.tar.gz
>>>
>>> The new OrgDb packages are:
>>>
>>> GO.db_3.19.0.tar.gz
>>> org.Ag.eg.db_3.19.0.tar.gz
>>> org.At.tair.db_3.19.0.tar.gz
>>> org.Bt.eg.db_3.19.0.tar.gz
>>> org.Ce.eg.db_3.19.0.tar.gz
>>> org.Cf.eg.db_3.19.0.tar.gz
>>> org.Dm.eg.db_3.19.0.tar.gz
>>> org.Dr.eg.db_3.19.0.tar.gz
>>> org.EcK12.eg.db_3.19.0.tar.gz
>>> org.EcSakai.eg.db_3.19.0.tar.gz
>>> org.Gg.eg.db_3.19.0.tar.gz
>>> org.Hs.eg.db_3.19.0.tar.gz
>>> org.Mm.eg.db_3.19.0.tar.gz
>>> org.Mmu.eg.db_3.19.0.tar.gz
>>> org.Pt.eg.db_3.19.0.tar.gz
>>> org.Rn.eg.db_3.19.0.tar.gz
>>> org.Sc.eg.db_3.19.0.tar.gz
>>> org.Ss.eg.db_3.19.0.tar.gz
>>> org.Xl.eg.db_3.19.0.tar.gz
>>> Orthology.eg.db_3.19.0.tar.gz
>>> PFAM.db_3.19.0.tar.gz
>>>
>>> The new TxDb packages are:
>>>
>>> TxDb.Hsapiens.UCSC.hg38.refGene_3.19.0.tar.gz
>>> TxDb.Mmusculus.UCSC.mm39.refGene_3.19.0.tar.gz
>>>
>>> Thank you
>>>
>>>
>>> Lori Shepherd - Kern
>>>
>>> Bioconductor Core Team
>>>
>>> Roswell Park Comprehensive Cancer Center
>>>
>>> Department of Biostatistics & Bioinformatics
>>>
>>> Elm & Carlton Streets
>>>
>>> Buffalo, New York 14263
>>>
>>>
>>> This email message may contain legally privileged and/or
>>> confidential
>> information.  If you are not the intended recipient(s), or the
>> employee or agent responsible for the delivery of this message to the
>> intended recipient(s), you are hereby notified that any disclosure,
>> copying, distribution, or use of this email message is prohibited.  If
>> you have received this message in error, please notify the sender
>> immediately by e-mail and delete this email message from your computer. Thank you.
>>>     [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioc-devel using r-project.org<mailto:Bioc-devel using r-project.org>  mailing list
>>> https://urldefense.com/v3/__https://stat.ethz.ch/mailman/listinfo/bi<https://urldefense.com/v3/__https:/stat.ethz.ch/mailman/listinfo/bi>
>>> oc-devel__;!!K-Hz7m0Vt54!ixxrV4cynVQr_14T7XsAJir0gOIlLduVfG5aOUHpbF0
>>> cO2xJulG_Fb0BdHs7hb-iOay_QMdEi_zp2wWMcftbdXE$
>> _______________________________________________
>> Bioc-devel using r-project.org<mailto:Bioc-devel using r-project.org>  mailing list
>> https://urldefense.com/v3/__https://stat.ethz.ch/mailman/listinfo/bioc<https://urldefense.com/v3/__https:/stat.ethz.ch/mailman/listinfo/bioc>
>> -devel__;!!K-Hz7m0Vt54!ixxrV4cynVQr_14T7XsAJir0gOIlLduVfG5aOUHpbF0cO2x
>> JulG_Fb0BdHs7hb-iOay_QMdEi_zp2wWMcftbdXE$
>>
> --
> The information in this email is intended only for the...{{dropped:26}}