[BioC] understanding multiples matches between probesets and entrezgene (biomart)
James W. MacDonald
jmacdon at uw.edu
Wed Jun 13 18:13:53 CEST 2012
Hi Juliet,
On 6/13/2012 11:01 AM, Juliet Hannah wrote:
> All,
>
> I understand the concept of multiple probesets corresponding to one
> identifier. But what is the meaning of
> a probeset corresponding to multiple identifiers? And below, given
> that 220547_s_at has a match,
> why should another row be returned with NA.
>
> Did I happen to choose a few probesets where the gene definition is
> changing, or am I misunderstanding
> something else, such as the biomart syntax.
I'm not sure about the NA being returned. That probably has something to
do with how the Biomart database is set up.
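Whatever the cause of the NA row, it is easy to set aside before you
summarize the mapping. Here is a minimal sketch that rebuilds the result
from your getBM() output by hand (so it needs no Biomart connection),
drops the NA row, and counts the distinct Entrez Gene IDs per probeset.
The names res, res_clean and n_genes are only for illustration, and
dropping the NA assumes it carries no information you need, which I have
not verified.

```r
# Result copied by hand from the getBM() output quoted in this thread
res <- data.frame(
  affy_hg_u133a = c("220547_s_at", "218034_at", "220547_s_at",
                    "219666_at", "220547_s_at", "220547_s_at"),
  entrezgene = c(54537, 51024, NA, 64231, 414241, 439965),
  stringsAsFactors = FALSE
)

# Assumption: rows without an Entrez Gene ID can simply be discarded
res_clean <- res[!is.na(res$entrezgene), ]

# Number of distinct Entrez Gene IDs mapped to each probeset
n_genes <- tapply(res_clean$entrezgene, res_clean$affy_hg_u133a,
                  function(x) length(unique(x)))
n_genes
```

With your three probesets this leaves five rows, and only 220547_s_at
maps to more than one Entrez Gene ID.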
As for the multiple genes per probeset, this has to do with the fact
that a 25-mer isn't really long enough to distinguish between genes with
relatively high homology, so a single probeset can hybridize to
transcripts from more than one gene. That is supposed to be reflected in
the suffix of the probeset ID, although the annotation has changed quite
a bit since UniGene build 133.
The probeset you are showing below has a _s_at identifier, which
indicates that it cross-hybridizes to multiple members of a related gene
family (in this case the FAM35 gene family). Other suffixes, such as
_x_at, indicate cross-hybridization to unrelated genes.
http://www.affymetrix.com/support/help/faqs/hgu133/index.jsp
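If you want to flag such probesets up front, the suffix can be pulled
out of the ID with a regular expression. This is only a sketch:
probeset_suffix is a hypothetical helper, not part of biomaRt or any
Bioconductor package, and it only recognizes the suffixes mentioned in
this message.

```r
# Hypothetical helper: return the suffix of each Affymetrix probeset ID.
# The more specific patterns are tested first, because "_s_at" and
# "_x_at" also end in "_at".
probeset_suffix <- function(ids) {
  ifelse(grepl("_s_at$", ids), "_s_at",
  ifelse(grepl("_x_at$", ids), "_x_at",
  ifelse(grepl("_at$", ids), "_at", NA_character_)))
}

probeSets <- c("219666_at", "220547_s_at", "218034_at")
probeset_suffix(probeSets)
# [1] "_at"   "_s_at" "_at"
```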
Best,
Jim
>
> Thanks,
>
> Juliet
>
> library("biomaRt")
> probeSets <- c("219666_at", "220547_s_at", "218034_at")
> ensembl <- useMart("ensembl")
> ensembl <- useDataset("hsapiens_gene_ensembl", mart = ensembl)
>
> getBM(attributes = c("affy_hg_u133a", "entrezgene"),
>       filters = "affy_hg_u133a", values = probeSets, mart = ensembl)
>
>
>   affy_hg_u133a entrezgene
> 1   220547_s_at      54537
> 2     218034_at      51024
> 3   220547_s_at         NA
> 4     219666_at      64231
> 5   220547_s_at     414241
> 6   220547_s_at     439965
>
>> sessionInfo()
> R version 2.15.0 (2012-03-30)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] biomaRt_2.12.0 BiocInstaller_1.4.6
>
> loaded via a namespace (and not attached):
> [1] RCurl_1.91-1 tools_2.15.0 XML_3.9-4
>
--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099