[BioC] annotate and biomaRt: inconsistent behaviour; nsFilter question

Wed Sep 26 21:13:19 CEST 2007

Hi Saira,

Saira Mian wrote:
>    I noticed that for some Affymetrix probe sets, "genenames" (annotate) 
> returns a single gene whereas "getBM" (biomaRt) returns two:
> 
> annotate:
>  > library(hgu133a)
>  > genenames <- as.list(hgu133aGENENAME)
>  > genenames[["200710_at"]]
> [1] "acyl-Coenzyme A dehydrogenase, very long chain"
> 
> biomaRt:
>  > ensemblhuman <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
>  > getBM(attributes=c("affy_hg_u133a", "hgnc_symbol", 
> "ensembl_transcript_id"),filters="affy_hg_u133a",values="200710_at",mart=ensemblhuman)
> affy_hg_u133a hgnc_symbol ensembl_transcript_id
> 1 200710_at ACADVL ENST00000356839
> 2 200710_at ACADVL ENST00000322910
> 3 200710_at ACADVL ENST00000350303
> 4 200710_at DVL2   ENST00000380838
> 5 200710_at DVL2   ENST00000005340
> 
>   Why are the results from annotate and biomaRt inconsistent? Is there a 
> "correct" answer? The above probe set is just one of the examples I came 
> across when learning biomaRt using the first 30 rows of my ExpressionSet 
> object "eset" produced by nsFilter (see below). My cursory examination 
> of ACADVL and DVL2 using the UCSC genome browser suggests that the 
> one-to-many behaviour may occur because the genes are physically 
> adjacent in the genome (for this and one other example I inspected, the 
> genes were head-to-tail).

The inconsistency arises from the annotation you are using. In the 
first case you are using Entrez Gene (which, as the name implies, is a 
_gene_ level annotation). In the second case you are using Ensembl 
transcript-level annotation, which is annotation at the mRNA level. 
Since there can be splice variants for a given gene that may result in 
different protein products, you can get several rows, and sometimes 
different names, for a single probe set.

The Entrez Gene ID for this probeset is 37. If you look that up on NCBI 
you will see that there are two RefSeq IDs associated, which indicates 
that NCBI thinks there are two isoforms for this gene.

Other inconsistencies may arise from the fact that you are using two 
different sources for annotation. You cannot always assume that two 
different groups will have the same information.
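
To separate the two effects, you can collapse the transcript-level 
rows to one row per probe set and symbol, then count the symbols. This 
is only a minimal sketch in base R: the data frame is typed in by hand 
from the getBM() output you posted, and the object names (bm, genes, 
n_symbols, multi) are made up for illustration. Whatever is still 
one-to-many after collapsing is not due to splice variants but to how 
the source maps the probe set.

```r
# Rows copied by hand from the getBM() output quoted above
# (transcript-level, one row per Ensembl transcript).
bm <- data.frame(
  affy_hg_u133a = rep("200710_at", 5),
  hgnc_symbol = c("ACADVL", "ACADVL", "ACADVL", "DVL2", "DVL2"),
  ensembl_transcript_id = c("ENST00000356839", "ENST00000322910",
                            "ENST00000350303", "ENST00000380838",
                            "ENST00000005340"),
  stringsAsFactors = FALSE
)

# Drop the transcript column and de-duplicate, leaving one row per
# probe set / gene symbol pair.
genes <- unique(bm[, c("affy_hg_u133a", "hgnc_symbol")])

# Count distinct symbols per probe set; more than one means the
# mapping is still one-to-many at the gene level.
n_symbols <- tapply(genes$hgnc_symbol, genes$affy_hg_u133a,
                    function(x) length(unique(x)))
multi <- names(n_symbols)[n_symbols > 1]

print(genes)
print(multi)
```

The same two steps should work unchanged on the full data frame 
returned by getBM() for all 30 probe sets.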

> 
>  > ans <- nsFilter(eset)
>  > eset <- ans$eset
>  > affyids <- rownames(exprs(eset[1:30, ]))
>  > affyids
>  [1] "214440_at"   "202376_at"   "201511_at"   "201000_at"   "209459_s_at"
>  [6] "203504_s_at" "212772_s_at" "204343_at"   "209620_s_at" "200045_at"
> [11] "202123_s_at" "206411_s_at" "212895_s_at" "214274_s_at" "212186_at"
> [16] "43427_at"    "202502_at"   "202366_at"   "205355_at"   "200710_at"
> [21] "205412_at"   "209608_s_at" "210337_s_at" "207071_s_at" "200793_s_at"
> [26] "213501_at"   "201629_s_at" "202767_at"   "204393_s_at" "200974_at"
>  > annotation <- unique(getBM(c("affy_hg_u133a", "hgnc_symbol"), 
> filters="affy_hg_u133a", values=affyids, mart=ensemblhuman))
>  > annotation
> affy_hg_u133a hgnc_symbol
>  1 200045_at   ABCF1
>  4 200710_at   ACADVL
>  7 200710_at   DVL2
>  9 200793_s_at ACO2
> 10 200793_s_at POLR3H
> 13 200974_at
> 14 200974_at   ACTA2
> 16 201000_at   AARS
> 19 201511_at   GPBAR1
> 20 201511_at   AAMP
> 21 201629_s_at ACP1
> 23 202123_s_at ABL1
> 25 202366_at   ACADS
> 26 202376_at   SERPINA3
> 28 202502_at   ACADM
> 30 202767_at   DDB2
> 34 202767_at   ACP2
> 35 203504_s_at ABCA1
> 36 204343_at   ABCA3
> 39 204393_s_at ACPP
> 40 205355_at   ACADSB
> 42 205412_at   ACAT1
> 43 206411_s_at ABL2
> 46 207071_s_at ACO1
> 49 209459_s_at ABAT
> 50 209608_s_at ACAT2
> 51 209608_s_at TCP1
> 52 209620_s_at ABCB7
> 55 210337_s_at ACLY
> 57 212186_at   ACACA
> 61 212772_s_at ABCA2
> 64 212895_s_at TIMM22
> 65 212895_s_at ABR
> 69 213501_at   ACOX1
> 71 214274_s_at DLEC1
> 74 214274_s_at ACAA1
> 77 214440_at   NAT1
> 79 43427_at    ACACB
> 
>   I don't understand why the results for "200974_at" are a gene with no 
> hgnc_symbol and ACTA2 since I thought nsFilter would have removed the 
> gene with no name.

Why would you think that? I don't see anything in the help page for 
nsFilter() that would indicate any probeset without a gene symbol would 
be removed.
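
If you want those rows gone, you can drop them yourself after the 
getBM() call. A minimal sketch in base R, assuming the missing symbol 
comes back as an empty string (check your own result; it could also be 
NA). The three rows are typed in from your output, and has_symbol and 
named are just illustrative names.

```r
# A few rows copied by hand from the 'annotation' output quoted above.
# Assumption: the blank hgnc_symbol for 200974_at is an empty string.
annotation <- data.frame(
  affy_hg_u133a = c("200974_at", "200974_at", "201000_at"),
  hgnc_symbol = c("", "ACTA2", "AARS"),
  stringsAsFactors = FALSE
)

# TRUE where a symbol is present (neither NA nor an empty string).
has_symbol <- !is.na(annotation$hgnc_symbol) &
  nzchar(annotation$hgnc_symbol)

# Keep only the rows that carry a symbol.
named <- annotation[has_symbol, ]
print(named)
```

Note that this only removes the unnamed row; 200974_at itself stays in 
the result because it also maps to ACTA2.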

Best,

Jim

> 
>   I'm an inexperienced R/Bioconductor user and so am unsure whether I've 
> simply made some elementary mistakes.
> 
> Saira Mian

-- 
James W. MacDonald, M.S.
Biostatistician
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623