[Bioc-devel] Apparent error in illuminaHumanv4.db

Thu Mar 27 14:38:00 CET 2014

Hi Taku,

This 'error' is not due to anything in the illuminaHumanv4.db package. 
All that package does is link the probe IDs to Entrez Gene IDs, and then 
the org.Hs.eg.db package does the remainder of the annotation. So if we 
look at org.Hs.eg.db, we get this:

> library(org.Hs.eg.db)
> select(org.Hs.eg.db, keys = c("C16ORF15", "C16orf15", "C15orf16"),
+        columns = c("ENTREZID", "SYMBOL", "GENENAME"), keytype = "ALIAS")
      ALIAS ENTREZID SYMBOL                 GENENAME
1 C16ORF15   161725 OTUD7A OTU domain containing 7A
2 C16orf15   197335  WDR90      WD repeat domain 90
3 C15orf16   161725 OTUD7A OTU domain containing 7A

And if we go to NCBI and search the Gene database, we get (in order):

Gene ID 161725

Official Symbol
    OTUD7A, provided by HGNC <http://www.genenames.org/>
Official Full Name
    OTU deubiquitinase 7A, provided by HGNC <http://www.genenames.org/>
Primary source
    HGNC:20718 <http://www.genenames.org/data/hgnc_data.php?hgnc_id=20718> 
See related
    Ensembl:ENSG00000169918 <http://www.ensembl.org/id/ENSG00000169918>;
    HPRD:12666 <http://www.hprd.org/protein/12666>;
    MIM:612024 <http://www.ncbi.nlm.nih.gov/omim/612024>;
    Vega:OTTHUMG00000129275 <http://vega.sanger.ac.uk/id/OTTHUMG00000129275>
Gene type
    protein coding
RefSeq status
    PROVISIONAL
Organism
    Homo sapiens
    <https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9606> 
Lineage
    Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
    Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
    Catarrhini; Hominidae; Homo
Also known as
    OTUD7; C15orf16; C16ORF15; CEZANNE2

And

Gene ID 197335

Official Symbol
    WDR90, provided by HGNC <http://www.genenames.org/>
Official Full Name
    WD repeat domain 90, provided by HGNC <http://www.genenames.org/>
Primary source
    HGNC:26960 <http://www.genenames.org/data/hgnc_data.php?hgnc_id=26960> 
See related
    Ensembl:ENSG00000161996 <http://www.ensembl.org/id/ENSG00000161996>;
    HPRD:08311 <http://www.hprd.org/protein/08311>;
    HPRD:14118 <http://www.hprd.org/protein/14118>;
    Vega:OTTHUMG00000048040 <http://vega.sanger.ac.uk/id/OTTHUMG00000048040>
Gene type
    protein coding
RefSeq status
    PROVISIONAL
Organism
    Homo sapiens
    <https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9606> 
Lineage
    Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
    Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
    Catarrhini; Hominidae; Homo
Also known as
    C16orf15; C16orf16; C16orf17; C16orf18; C16orf19

So what is in the org.Hs.eg.db package conforms exactly to the data from 
NCBI. Please note that the annotation packages supplied by Bioconductor 
are simply re-formulations of data we get from sources like NCBI, and we 
make no claims as to the accuracy of those data. In other words, we try 
our best to ensure that the information you get from a given annotation 
package conforms exactly to what you would get by going to the NCBI 
website and searching by hand, but do NOT make any claims as to the 
accuracy of the data on the NCBI website.

And there have been any number of emails on this list by Marc Carlson, 
explaining to people that HGNC symbols and especially other random 
aliases are not unique, and should not be relied upon for annotating 
data accurately. So yeah, don't annotate by symbol or alias.
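
To make the case-sensitivity point concrete, here is a minimal base R 
sketch. It uses only the three rows from the select() output above as a 
toy table, not the real database, and the names alias_map, exact, folded 
and ambiguous are made up for illustration. Under those assumptions it 
suggests why upper-casing the alias names would put two different genes 
under one key:

```r
# Toy alias table copied from the select() output above
# (three rows only; an assumption for illustration, not the full database).
alias_map <- data.frame(
  ALIAS    = c("C16ORF15", "C16orf15", "C15orf16"),
  ENTREZID = c("161725", "197335", "161725"),
  SYMBOL   = c("OTUD7A", "WDR90", "OTUD7A"),
  stringsAsFactors = FALSE
)

# Case-sensitive lookup: Entrez Gene IDs grouped by the alias as given.
exact <- split(alias_map$ENTREZID, alias_map$ALIAS)

# Case-folded lookup: distinct Entrez Gene IDs per upper-cased alias.
folded <- lapply(split(alias_map$ENTREZID, toupper(alias_map$ALIAS)), unique)

# Aliases that point at more than one gene once case is ignored.
n_genes   <- vapply(folded, length, integer(1))
ambiguous <- names(folded)[n_genes > 1]

print(ambiguous)
print(folded[ambiguous])
```

In this toy table each alias is unambiguous as long as case is kept, 
and the clash only appears after toupper().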

Best,

Jim

On 3/27/2014 6:11 AM, Taku Tokuyasu wrote:
> Hello Mark,
>
> I'm writing to report an apparent error in the illuminaHumanv4.db package,
> version 1.20.0.  Specifically, the mapping for "C16ORF15" in ALIAS2PROBE
> appears to be incorrect.  Below is an R code snippet:
>
> library("illuminaHumanv4.db")
> packageVersion("illuminaHumanv4.db")
> # [1] '1.20.0'
> # http://www.bioconductor.org/packages/release/data/annotation/html/illuminaHumanv4.db.html
>
> # Define some mappings
> xxAP <- as.list(illuminaHumanv4ALIAS2PROBE)
> xxS <- as.list(illuminaHumanv4SYMBOL)
>
> # Compare these two:
> xxS[xxAP[["C16ORF15"]]]
> xxS[xxAP[["C16orf15"]]]
>
> # I get:
> # > xxS[xxAP[["C16ORF15"]]]
> # $ILMN_1718060
> # [1] "OTUD7A"
> # $ILMN_1785146
> # [1] "OTUD7A"
> # $ILMN_2298160
> # [1] "OTUD7A"
> #
> # > xxS[xxAP[["C16orf15"]]]
> # $ILMN_1693042
> # [1] "WDR90"
> # $ILMN_1698185
> # [1] "WDR90"
>
> According to HGNC (via DuckDuckGo):
> OTUD7A (OTU domain containing 7A)
> Protein-coding gene on human chromosome 15q13.1, also known as *C15orf16*,
> CEZANNE2, OTU domain containing 7, OTUD7, chromosome 15 open reading frame
> 16.
>
> WDR90 (WD repeat domain 90)
> Protein-coding gene on human chromosome 16p13.3, also known as *C16orf15*,
> C16orf16, C16orf17, C16orf18, C16orf19, FLJ36483, KIAA1924, chromosome 16
> open reading frame 15, chromosome 16 open reading frame 16, chromosome 16
> open reading frame 17, chromosome 16 open reading frame 18, chromosome 16
> open reading frame 19.
>
> So it appears the ALIAS2PROBE mapping for C16ORF15 is actually for
> C15orf16.  Indeed,
> all.equal(xxAP[["C16ORF15"]], xxAP[["C15orf16"]])
> # [1] TRUE
>
> Some questions:
> 1) Why is there a mapping for both C16orf15 and C16ORF15?
> 2) Can you make the names for mappings like ALIAS2PROBE all upper case?
> Perhaps there is a Bioconductor annotation convention that prevents this?
> 3) I also noticed:
> nms <- names(xxAP)
> length(nms); length(unique(nms)); length(unique(toupper(nms)))
> # [1] 99696
> # [1] 99696
> # [1] 99378
> Is there potentially a problem with the 300-odd names that are no longer
> unique when raised to upper case?
>
> Regards,
>
> _Taku
>
>   Taku A. Tokuyasu, PhD
> Computational Biology Core
> UCSF Helen Diller Family Comprehensive Cancer Center

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099