[BioC] understanding multiple matches between probesets and entrezgene (biomart)

Wed Jun 13 18:13:53 CEST 2012

Hi Juliet,

On 6/13/2012 11:01 AM, Juliet Hannah wrote:
> All,
>
> I understand the concept of multiple probesets corresponding to one
> identifier. But what is the meaning of
> a probeset corresponding to multiple identifiers?  And below, given
> that 220547_s_at has a match,
> why should another row be returned with NA.
>
> Did I happen to choose a few probesets where the gene definition is
> changing, or am I misunderstanding
> something else, such as the biomart syntax.

I'm not sure about the NA being returned. That probably has something to 
do with how the Biomart database is set up.
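
One plausible reading (an assumption, not something confirmed in this thread) is that getBM() returns one row per distinct combination of attributes, so a probeset that also maps to an Ensembl gene with no Entrez Gene ID would show up as an extra NA row. Whatever the cause, the NA rows can be dropped after the query. A minimal sketch in base R, with the result from the quoted message typed in by hand so that it runs without a network connection:

```r
# getBM() output copied from the quoted message below (entered by hand).
res <- data.frame(
  affy_hg_u133a = c("220547_s_at", "218034_at", "220547_s_at",
                    "219666_at", "220547_s_at", "220547_s_at"),
  entrezgene = c(54537, 51024, NA, 64231, 414241, 439965),
  stringsAsFactors = FALSE
)

# Keep only the rows that carry an Entrez Gene ID.
mapped <- res[!is.na(res$entrezgene), ]

# Collect the Entrez Gene IDs belonging to each probeset.
genes_per_probeset <- split(mapped$entrezgene, mapped$affy_hg_u133a)

# Count the IDs per probeset; a count above 1 marks a one-to-many probeset.
n_genes <- sapply(genes_per_probeset, length)
n_genes
```

The names res, mapped, genes_per_probeset and n_genes are just illustrative. Whether to keep every gene for a one-to-many probeset or to discard such probesets depends on the downstream analysis.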

As for the multiple genes per probeset, this comes from the fact 
that a 25-mer probe isn't really long enough to distinguish between genes 
with relatively high homology. This is supposed to be reflected in the 
suffix of the probeset ID, although the annotations have changed quite a 
bit since UniGene build 133.

The probeset you are showing below has a _s_at identifier, which 
indicates that it cross-hybridizes to multiple members of a related gene 
family (in this case the FAM35 gene family). There are other suffixes as 
well, such as _x_at, which indicates cross-hybridization to unrelated genes.

http://www.affymetrix.com/support/help/faqs/hgu133/index.jsp
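
To see at a glance which probesets in a list carry one of these suffixes, something like the following could be used. It is only a sketch: probeset_suffix_type() is a hypothetical helper, not a function from any Bioconductor package, and it covers only the two suffixes mentioned above, treating a plain _at ID as having no cross-hybridization flag.

```r
# Hypothetical helper: label probeset IDs by their suffix.
# Only _s_at and _x_at are handled; other suffixes fall into "other".
probeset_suffix_type <- function(ids) {
  ifelse(grepl("_s_at$", ids), "s_at: related gene family",
    ifelse(grepl("_x_at$", ids), "x_at: unrelated genes",
      ifelse(grepl("_at$", ids), "at: no suffix flag", "other")))
}

# The three probesets from the original question.
probeSets <- c("219666_at", "220547_s_at", "218034_at")
data.frame(probeset = probeSets,
           type = probeset_suffix_type(probeSets),
           stringsAsFactors = FALSE)
```

Since the annotations have moved on since the array was designed, the suffix is best treated as a hint and checked against a current mapping.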

Best,

Jim

>
> Thanks,
>
> Juliet
>
> library("biomaRt")
> probeSets<- c("219666_at", "220547_s_at", "218034_at")
> ensembl = useMart("ensembl")
> ensembl = useDataset("hsapiens_gene_ensembl",mart=ensembl)
>
> getBM(attributes = c("affy_hg_u133a", "entrezgene"), filters =
> "affy_hg_u133a",values = probeSets, mart = ensembl)
>
>
>    affy_hg_u133a entrezgene
> 1   220547_s_at      54537
> 2     218034_at      51024
> 3   220547_s_at         NA
> 4     219666_at      64231
> 5   220547_s_at     414241
> 6   220547_s_at     439965
>
>
>
>
>> sessionInfo()
> R version 2.15.0 (2012-03-30)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=C                 LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] biomaRt_2.12.0      BiocInstaller_1.4.6
>
> loaded via a namespace (and not attached):
> [1] RCurl_1.91-1 tools_2.15.0 XML_3.9-4

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099