[BioC] annotate and biomaRt: inconsistent behaviour; nsFilter question
James W. MacDonald
jmacdon at med.umich.edu
Wed Sep 26 21:13:19 CEST 2007
Hi Saira,
Saira Mian wrote:
> I noticed that for some Affymetrix probe sets, "genenames" (annotate)
> returns a single gene whereas "getBM" (biomaRt) returns two:
>
> annotate:
> > library(hgu133a)
> > genenames <- as.list(hgu133aGENENAME)
> > genenames[["200710_at"]]
> [1] "acyl-Coenzyme A dehydrogenase, very long chain"
>
> biomaRt:
> > ensemblhuman <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
> > getBM(attributes=c("affy_hg_u133a", "hgnc_symbol",
> "ensembl_transcript_id"),filters="affy_hg_u133a",values="200710_at",mart=ensemblhuman)
>   affy_hg_u133a hgnc_symbol ensembl_transcript_id
> 1     200710_at      ACADVL       ENST00000356839
> 2     200710_at      ACADVL       ENST00000322910
> 3     200710_at      ACADVL       ENST00000350303
> 4     200710_at        DVL2       ENST00000380838
> 5     200710_at        DVL2       ENST00000005340
>
> Why are the results from annotate and biomaRt inconsistent? Is there a
> "correct" answer? The above probe set is just one of the examples I came
> across when learning biomaRt using the first 30 rows of my ExpressionSet
> object "eset" produced by nsFilter (see below). My cursory examination
> of ACADVL and DVL2 using the UCSC genome browser suggests that the
> one-to-many behaviour may occur because the genes are physically
> adjacent in the genome (for this and one other example I inspected, the
> genes were head-to-tail).
The inconsistency arises from the annotation sources you are using. In the
first case you are using Entrez Gene (which, as the name implies, is a
_gene_-level annotation). In the second case you are using Ensembl
transcript-level annotation, which is annotation at the mRNA level.
Since a given gene can have splice variants that result in different
protein products, the two levels will not always give you the same names.
The Entrez Gene ID for this probeset is 37. If you look that up on NCBI
you will see that there are two RefSeq IDs associated, which indicates
that NCBI thinks there are two isoforms for this gene.
Other inconsistencies may arise from the fact that you are using two
different sources for annotation. You cannot always assume that two
different groups will have the same information.
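
If you want to see which probesets Ensembl maps to more than one symbol,
here is a minimal sketch in base R. It assumes the getBM() result has
already been reduced to unique probeset/symbol pairs; the data frame is
typed in from a few rows of the output in your message rather than
fetched from Ensembl, and the names `pairs`, `counts` and `multi` are
only illustrative.

```r
# A few probeset/symbol pairs copied from the getBM() output in the
# original message (not a live query against Ensembl).
pairs <- data.frame(
  affy_hg_u133a = c("200045_at", "200710_at", "200710_at",
                    "200793_s_at", "200793_s_at"),
  hgnc_symbol   = c("ABCF1", "ACADVL", "DVL2", "ACO2", "POLR3H"),
  stringsAsFactors = FALSE
)

# Make sure each probeset/symbol pair appears only once.
pairs <- unique(pairs)

# Count how many distinct symbols each probeset maps to.
counts <- table(pairs$affy_hg_u133a)

# Probesets with more than one symbol are the one-to-many cases.
multi <- names(counts)[counts > 1]
multi
```

Probesets listed in `multi` are the ones worth checking by hand against
the gene-level annotation package.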
>
> > ans <- nsFilter(eset)
> > eset <- ans$eset
> > affyids <- rownames(exprs(eset[1:30, ]))
> > affyids
> [1] "214440_at" "202376_at" "201511_at" "201000_at" "209459_s_at"
> [6] "203504_s_at" "212772_s_at" "204343_at" "209620_s_at" "200045_at"
> [11] "202123_s_at" "206411_s_at" "212895_s_at" "214274_s_at" "212186_at"
> [16] "43427_at" "202502_at" "202366_at" "205355_at" "200710_at"
> [21] "205412_at" "209608_s_at" "210337_s_at" "207071_s_at" "200793_s_at"
> [26] "213501_at" "201629_s_at" "202767_at" "204393_s_at" "200974_at"
> > annotation <- unique(getBM(c("affy_hg_u133a", "hgnc_symbol"),
> filters="affy_hg_u133a", values=affyids, mart=ensemblhuman))
> > annotation
>    affy_hg_u133a hgnc_symbol
> 1      200045_at       ABCF1
> 4      200710_at      ACADVL
> 7      200710_at        DVL2
> 9    200793_s_at        ACO2
> 10   200793_s_at      POLR3H
> 13     200974_at
> 14     200974_at       ACTA2
> 16     201000_at        AARS
> 19     201511_at      GPBAR1
> 20     201511_at        AAMP
> 21   201629_s_at        ACP1
> 23   202123_s_at        ABL1
> 25     202366_at       ACADS
> 26     202376_at    SERPINA3
> 28     202502_at       ACADM
> 30     202767_at        DDB2
> 34     202767_at        ACP2
> 35   203504_s_at       ABCA1
> 36     204343_at       ABCA3
> 39   204393_s_at        ACPP
> 40     205355_at      ACADSB
> 42     205412_at       ACAT1
> 43   206411_s_at        ABL2
> 46   207071_s_at        ACO1
> 49   209459_s_at        ABAT
> 50   209608_s_at       ACAT2
> 51   209608_s_at        TCP1
> 52   209620_s_at       ABCB7
> 55   210337_s_at        ACLY
> 57     212186_at       ACACA
> 61   212772_s_at       ABCA2
> 64   212895_s_at      TIMM22
> 65   212895_s_at         ABR
> 69     213501_at       ACOX1
> 71   214274_s_at       DLEC1
> 74   214274_s_at       ACAA1
> 77     214440_at        NAT1
> 79      43427_at       ACACB
>
> I don't understand why the results for "200974_at" are a gene with no
> hgnc_symbol and ACTA2 since I thought nsFilter would have removed the
> gene with no name.
Why would you think that? I don't see anything in the help page for
nsFilter() that would indicate any probeset without a gene symbol would
be removed.
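
If you do want to drop the rows without a symbol, you would have to do
that yourself on the getBM() result. A minimal sketch in base R,
assuming missing symbols come back as empty strings (as in your output)
or as NA; the data frame is typed in from your output and `named` is
just an illustrative name:

```r
# Three rows copied from the quoted getBM() output, including the
# 200974_at row that has no hgnc_symbol.
annotation <- data.frame(
  affy_hg_u133a = c("200974_at", "200974_at", "201000_at"),
  hgnc_symbol   = c("", "ACTA2", "AARS"),
  stringsAsFactors = FALSE
)

# Keep only rows whose symbol is neither NA nor an empty string.
has_symbol <- !is.na(annotation$hgnc_symbol) & nzchar(annotation$hgnc_symbol)
named <- annotation[has_symbol, ]
named
```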
Best,
Jim
>
> I'm an inexperienced R/Bioconductor user and so am unsure whether I've
> simply made some elementary mistakes.
>
> Saira Mian
--
James W. MacDonald, M.S.
Biostatistician
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623