[Bioc-devel] Apparent error in illuminaHumanv4.db
James W. MacDonald
jmacdon at uw.edu
Thu Mar 27 14:38:00 CET 2014
Hi Taku,
This 'error' is not due to anything in the illuminaHumanv4.db package.
All that package does is link the probe IDs to Entrez Gene IDs, and then
the org.Hs.eg.db package does the remainder of the annotation. So if we
look at org.Hs.eg.db, we get this:
> select(org.Hs.eg.db, c("C16ORF15","C16orf15","C15orf16"),
         c("ENTREZID","SYMBOL","GENENAME"), "ALIAS")
     ALIAS ENTREZID SYMBOL                 GENENAME
1 C16ORF15   161725 OTUD7A OTU domain containing 7A
2 C16orf15   197335  WDR90      WD repeat domain 90
3 C15orf16   161725 OTUD7A OTU domain containing 7A
And if we go to NCBI and search the Gene database, we get (in order):
Gene ID: 161725
Official Symbol: OTUD7A (provided by HGNC <http://www.genenames.org/>)
Official Full Name: OTU deubiquitinase 7A (provided by HGNC <http://www.genenames.org/>)
Primary source: HGNC:20718 <http://www.genenames.org/data/hgnc_data.php?hgnc_id=20718>
See related:
    Ensembl:ENSG00000169918 <http://www.ensembl.org/id/ENSG00000169918>
    HPRD:12666 <http://www.hprd.org/protein/12666>
    MIM:612024 <http://www.ncbi.nlm.nih.gov/omim/612024>
    Vega:OTTHUMG00000129275 <http://vega.sanger.ac.uk/id/OTTHUMG00000129275>
Gene type: protein coding
RefSeq status: PROVISIONAL
Organism: Homo sapiens <https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9606>
Lineage: Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
    Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
    Catarrhini; Hominidae; Homo
Also known as: OTUD7; C15orf16; C16ORF15; CEZANNE2
And
Gene ID: 197335
Official Symbol: WDR90 (provided by HGNC <http://www.genenames.org/>)
Official Full Name: WD repeat domain 90 (provided by HGNC <http://www.genenames.org/>)
Primary source: HGNC:26960 <http://www.genenames.org/data/hgnc_data.php?hgnc_id=26960>
See related:
    Ensembl:ENSG00000161996 <http://www.ensembl.org/id/ENSG00000161996>
    HPRD:08311 <http://www.hprd.org/protein/08311>
    HPRD:14118 <http://www.hprd.org/protein/14118>
    Vega:OTTHUMG00000048040 <http://vega.sanger.ac.uk/id/OTTHUMG00000048040>
Gene type: protein coding
RefSeq status: PROVISIONAL
Organism: Homo sapiens <https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9606>
Lineage: Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
    Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
    Catarrhini; Hominidae; Homo
Also known as: C16orf15; C16orf16; C16orf17; C16orf18; C16orf19
So what is in the org.Hs.eg.db package conforms exactly to the data from
NCBI. Please note that the annotation packages supplied by Bioconductor
are simply re-formulations of data we get from sources like NCBI, and we
make no claims as to the accuracy of those data. In other words, we try
our best to ensure that the information you get from a given annotation
package conforms exactly to what you would get by going to the NCBI
website and searching by hand, but do NOT make any claims as to the
accuracy of the data on the NCBI website.
And there have been any number of emails on this list by Marc Carlson,
explaining to people that HGNC symbols and especially other random
aliases are not unique, and should not be relied upon for annotating
data accurately. So yeah, don't do that.
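To make the point concrete, here is a minimal sketch in base R. It uses only the three alias rows from the select() output above as a toy table. The names `alias_map` and `ambiguous_aliases` are made up for illustration and are not part of any Bioconductor package. The sketch flags aliases that collide once case is ignored but point to different Entrez Gene IDs:

```r
# Toy alias table copied from the select() output earlier in this message.
# `alias_map` is a hypothetical name used only for this illustration.
alias_map <- data.frame(
  ALIAS    = c("C16ORF15", "C16orf15", "C15orf16"),
  ENTREZID = c("161725", "197335", "161725"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the rows whose upper-cased alias maps to
# more than one distinct Entrez Gene ID.
ambiguous_aliases <- function(map) {
  key <- toupper(map$ALIAS)
  # Count distinct Entrez Gene IDs per case-folded alias.
  n_ids <- tapply(map$ENTREZID, key, function(ids) length(unique(ids)))
  bad <- names(n_ids)[n_ids > 1]
  map[key %in% bad, , drop = FALSE]
}

ambiguous <- ambiguous_aliases(alias_map)
print(ambiguous)
```

With the real package, a similar table could presumably be built from a select() call keyed on "ALIAS", but the safer route is to key everything on Entrez Gene ID and treat symbols and aliases as display labels only.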
Best,
Jim
On 3/27/2014 6:11 AM, Taku Tokuyasu wrote:
> Hello Mark,
>
> I'm writing to report an apparent error in the illuminaHumanv4.db package,
> version 1.20.0. Specifically, the mapping for "C16ORF15" in ALIAS2PROBE
> appears to be incorrect. Below is an R code snippet:
>
> library("illuminaHumanv4.db")
> packageVersion("illuminaHumanv4.db")
> # [1] '1.20.0'
> # http://www.bioconductor.org/packages/release/data/annotation/html/illuminaHumanv4.db.html
>
> # Define some mappings
> xxAP <- as.list(illuminaHumanv4ALIAS2PROBE)
> xxS <- as.list(illuminaHumanv4SYMBOL)
>
> # Compare these two:
> xxS[xxAP[["C16ORF15"]]]
> xxS[xxAP[["C16orf15"]]]
>
> # I get:
> # > xxS[xxAP[["C16ORF15"]]]
> # $ILMN_1718060
> # [1] "OTUD7A"
> # $ILMN_1785146
> # [1] "OTUD7A"
> # $ILMN_2298160
> # [1] "OTUD7A"
> #
> # > xxS[xxAP[["C16orf15"]]]
> # $ILMN_1693042
> # [1] "WDR90"
> # $ILMN_1698185
> # [1] "WDR90"
>
> According to HGNC (via DuckDuckGo):
> OTUD7A (OTU domain containing 7A)
> Protein-coding gene on human chromosome 15q13.1, also known as *C15orf16*,
> CEZANNE2, OTU domain containing 7, OTUD7, chromosome 15 open reading frame
> 16.
>
> WDR90 (WD repeat domain 90)
> Protein-coding gene on human chromosome 16p13.3, also known as *C16orf15*,
> C16orf16, C16orf17, C16orf18, C16orf19, FLJ36483, KIAA1924, chromosome 16
> open reading frame 15, chromosome 16 open reading frame 16, chromosome 16
> open reading frame 17, chromosome 16 open reading frame 18, chromosome 16
> open reading frame 19.
>
> So it appears the ALIAS2PROBE mapping for C16ORF15 is actually for
> C15orf16. Indeed,
> all.equal(xxAP[["C16ORF15"]], xxAP[["C15orf16"]])
> # [1] TRUE
>
> Some questions:
> 1) Why is there a mapping for both C16orf15 and C16ORF15?
> 2) Can you make the names for mappings like ALIAS2PROBE all upper case?
> Perhaps there is a Bioconductor annotation convention that prevents this?
> 3) I also noticed:
> nms <- names(xxAP)
> length(nms); length(unique(nms)); length(unique(toupper(nms)))
> # [1] 99696
> # [1] 99696
> # [1] 99378
> Is there potentially a problem with the 300-odd names that are no longer
> unique when raised to upper case?
>
> Regards,
>
> _Taku
>
> Taku A. Tokuyasu, PhD
> Computational Biology Core
> UCSF Helen Diller Family Comprehensive Cancer Center
>
--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099