[BioC] annotate and biomaRt: inconsistent behaviour; nsFilter question

Wed Sep 26 21:55:33 CEST 2007

Hi Saira,

You can add an extra filter so that only probes with an associated 
gene symbol are returned:

annotation <- unique(getBM(c("affy_hg_u133a", "hgnc_symbol"), 
filters=c("affy_hg_u133a","with_hgnc_symbol"), 
values=list(affyids,TRUE), mart=ensemblhuman))

Note as well that with the version of biomaRt included in the new BioC 
release there will be no need to apply the unique function to the getBM 
output, as getBM will do this by default at the web service side.
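
If you already have a getBM result in hand, roughly the same clean-up can 
be done locally. This is only a sketch: the data frame below is a 
hand-built stand-in for a getBM result (probe IDs and symbols copied from 
this thread, the repeated rows are illustrative), and the names res, 
has_symbol and annotation_local are made up for the example.

```r
# Hand-built stand-in for a getBM() result; no web service call is made.
# An empty string marks a mapping that has no HGNC symbol.
res <- data.frame(
  affy_hg_u133a = c("200974_at", "200974_at", "200710_at", "200710_at", "200710_at"),
  hgnc_symbol   = c("", "ACTA2", "ACADVL", "ACADVL", "DVL2"),
  stringsAsFactors = FALSE
)

# Keep only rows that carry a symbol (neither NA nor an empty string) ...
has_symbol <- !is.na(res$hgnc_symbol) & nzchar(res$hgnc_symbol)

# ... and drop duplicated probe/symbol pairs.
annotation_local <- unique(res[has_symbol, ])
print(annotation_local)
```

The server-side filter is still preferable where available, since it 
avoids transferring the rows you are going to discard anyway.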

Ensembl does an independent mapping of the Affymetrix probes to the 
genome. If they find multiple gene matches for one probe, they will 
return all of these matches.
The two genes that are retrieved in your query are indeed next to each 
other, not overlapping and on opposite strands. As Ensembl associates 
both of these genes with the 200710_at Affymetrix probe, there must be a 
match for this probe to each of these two genes. Maybe they share some 
homology and the affy probe happens to be in that region?
We could mail the Ensembl helpdesk at helpdesk at ensembl.org to get more 
details on this particular mapping.
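
To see which probes are affected, the one-to-many cases can be pulled out 
of the annotation table. A minimal sketch, assuming a two-column table 
like the one in the quoted message below; the rows are a hand-copied 
subset of that output, and symbols_by_probe, multi and collapsed are 
names invented for this example.

```r
# Hand-copied subset of the probe-to-symbol table from this thread.
annotation <- data.frame(
  affy_hg_u133a = c("200045_at", "200710_at", "200710_at",
                    "200793_s_at", "200793_s_at", "201000_at"),
  hgnc_symbol   = c("ABCF1", "ACADVL", "DVL2", "ACO2", "POLR3H", "AARS"),
  stringsAsFactors = FALSE
)

# Group the symbols by probe ID.
symbols_by_probe <- split(annotation$hgnc_symbol, annotation$affy_hg_u133a)

# Probes that Ensembl associates with more than one gene symbol.
multi <- symbols_by_probe[lengths(symbols_by_probe) > 1]
print(names(multi))

# One row per probe, with all symbols joined by ";".
collapsed <- vapply(symbols_by_probe, paste, character(1), collapse = ";")
print(collapsed)
```

Whether to keep, collapse or discard such probes is a judgment call that 
depends on the downstream analysis.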

Cheers,
Steffen

James W. MacDonald wrote:
> Hi Saira,
>
> Saira Mian wrote:
>   
>>    I noticed that for some Affymetrix probe sets, "genenames" (annotate) 
>> returns a single gene whereas "getBM" (biomaRt) returns two:
>>
>> annotate:
>>  > library(hgu133a)
>>  > genenames <- as.list(hgu133aGENENAME)
>>  > genenames[["200710_at"]]
>> [1] "acyl-Coenzyme A dehydrogenase, very long chain"
>>
>> biomaRt:
>>  > ensemblhuman <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
>>  > getBM(attributes=c("affy_hg_u133a", "hgnc_symbol", 
>> "ensembl_transcript_id"),filters="affy_hg_u133a",values="200710_at",mart=ensemblhuman)
>> affy_hg_u133a hgnc_symbol ensembl_transcript_id
>> 1 200710_at ACADVL ENST00000356839
>> 2 200710_at ACADVL ENST00000322910
>> 3 200710_at ACADVL ENST00000350303
>> 4 200710_at DVL2   ENST00000380838
>> 5 200710_at DVL2   ENST00000005340
>>
>>   Why are the results from annotate and biomaRt inconsistent? Is there a 
>> "correct" answer? The above probe set is just one of the examples I came 
>> across when learning biomaRt using the first 30 rows of my ExpressionSet 
>> object "eset" produced by nsFilter (see below). My cursory examination 
>> of ACADVL and DVL2 using the UCSC genome browser suggests that the 
>> one-to-many behaviour may occur because the genes are physically 
>> adjacent in the genome (for this and one other example I inspected, the 
>> genes were head-to-tail).
>>     
>
> The inconsistency arises because of the annotation you are using. In the 
> first case you are using Entrez Gene (which as the name implies is a 
> _gene_ level annotation). In the second case you are using Ensembl 
> transcript level annotations, which is annotation at the mRNA level. 
> Since there can be splice variants for a given gene that may result in 
> different protein products, you can always get different names.
>
> The Entrez Gene ID for this probeset is 37. If you look that up on NCBI 
> you will see that there are two RefSeq IDs associated, which indicates 
> that NCBI thinks there are two isoforms for this gene.
>
> Other inconsistencies may arise from the fact that you are using two 
> different sources for annotation. You cannot always assume that two 
> different groups will have the same information.
>
>
>   
>>  > ans <- nsFilter(eset)
>>  > eset <- ans$eset
>>  > affyids <- rownames(exprs(eset[1:30, ]))
>>  > affyids
>>  [1] "214440_at"   "202376_at"   "201511_at"   "201000_at"   "209459_s_at"
>>  [6] "203504_s_at" "212772_s_at" "204343_at"   "209620_s_at" "200045_at"
>> [11] "202123_s_at" "206411_s_at" "212895_s_at" "214274_s_at" "212186_at"
>> [16] "43427_at"    "202502_at"   "202366_at"   "205355_at"   "200710_at"
>> [21] "205412_at"   "209608_s_at" "210337_s_at" "207071_s_at" "200793_s_at"
>> [26] "213501_at"   "201629_s_at" "202767_at"   "204393_s_at" "200974_at"
>>  > annotation <- unique(getBM(c("affy_hg_u133a", "hgnc_symbol"), 
>> filters="affy_hg_u133a", values=affyids, mart=ensemblhuman))
>>  > annotation
>> affy_hg_u133a hgnc_symbol
>>  1 200045_at   ABCF1
>>  4 200710_at   ACADVL
>>  7 200710_at   DVL2
>>  9 200793_s_at ACO2
>> 10 200793_s_at POLR3H
>> 13 200974_at
>> 14 200974_at   ACTA2
>> 16 201000_at   AARS
>> 19 201511_at   GPBAR1
>> 20 201511_at   AAMP
>> 21 201629_s_at ACP1
>> 23 202123_s_at ABL1
>> 25 202366_at   ACADS
>> 26 202376_at   SERPINA3
>> 28 202502_at   ACADM
>> 30 202767_at   DDB2
>> 34 202767_at   ACP2
>> 35 203504_s_at ABCA1
>> 36 204343_at   ABCA3
>> 39 204393_s_at ACPP
>> 40 205355_at   ACADSB
>> 42 205412_at   ACAT1
>> 43 206411_s_at ABL2
>> 46 207071_s_at ACO1
>> 49 209459_s_at ABAT
>> 50 209608_s_at ACAT2
>> 51 209608_s_at TCP1
>> 52 209620_s_at ABCB7
>> 55 210337_s_at ACLY
>> 57 212186_at   ACACA
>> 61 212772_s_at ABCA2
>> 64 212895_s_at TIMM22
>> 65 212895_s_at ABR
>> 69 213501_at   ACOX1
>> 71 214274_s_at DLEC1
>> 74 214274_s_at ACAA1
>> 77 214440_at   NAT1
>> 79 43427_at    ACACB
>>
>>   I don't understand why the results for "200974_at" are a gene with no 
>> hgnc_symbol and ACTA2 since I thought nsFilter would have removed the 
>> gene with no name.
>>     
>
> Why would you think that? I don't see anything in the help page for 
> nsFilter() that would indicate any probeset without a gene symbol would 
> be removed.
>
> Best,
>
> Jim
>
>
>   
>>   I'm an inexperienced R/Bioconductor user and so am unsure whether I've 
>> simply made some elementary mistakes.
>>
>> Saira Mian