[BioC] understanding multiples matches between probesets and entrezgene (biomart)

Wed Jun 13 18:16:57 CEST 2012

Hi,

On Wed, Jun 13, 2012 at 11:01 AM, Juliet Hannah <juliet.hannah at gmail.com> wrote:
> All,
>
> I understand the concept of multiple probesets corresponding to one
> identifier. But what is the meaning of
> a probeset corresponding to multiple identifiers?  And below, given
> that 220547_s_at has a match,
> why should another row be returned with NA.

[snip]

Looking at the output you posted (below, in the remaining quoted
text), the multiple Entrez IDs returned for the same probeset are
these:

http://www.ncbi.nlm.nih.gov/gene?term=414241
http://www.ncbi.nlm.nih.gov/gene?term=54537
http://www.ncbi.nlm.nih.gov/gene?term=439965

They're all within the same gene family, and at least one is a
pseudogene. In their "Gene description" field, they all mention that
they have "high sequence similarity 35".

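Given that information, my guess as to why this is happening: the
sequences of these genes are similar enough that the probes in
220547_s_at match transcripts from more than one of them, so the
probeset is annotated to all of them. The "_s_at" suffix is
Affymetrix's flag for probesets whose probes match multiple
transcripts.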
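
If it helps, here is a minimal sketch of how you could flag such
probesets in a getBM() result. It uses only base R and a data.frame
typed in by hand from the output quoted below, so it does not query
biomart; the variable names (res, mapped, n_ids, multi, unique_only)
are just ones I made up for illustration. It simply sets the NA row
aside without explaining it.

```r
# Result table copied by hand from the getBM() output quoted below.
res <- data.frame(
  affy_hg_u133a = c("220547_s_at", "218034_at", "220547_s_at",
                    "219666_at", "220547_s_at", "220547_s_at"),
  entrezgene    = c(54537, 51024, NA, 64231, 414241, 439965),
  stringsAsFactors = FALSE
)

# Set aside rows that carry no Entrez ID.
mapped <- res[!is.na(res$entrezgene), ]

# Count the distinct Entrez IDs attached to each probeset.
n_ids <- tapply(mapped$entrezgene, mapped$affy_hg_u133a,
                function(x) length(unique(x)))

# Probesets annotated to more than one Entrez gene.
multi <- names(n_ids)[n_ids > 1]

# Keep only the probesets with a single, unambiguous Entrez ID.
unique_only <- mapped[!mapped$affy_hg_u133a %in% multi, ]
```

With the three probesets from your example, only 220547_s_at ends up
in multi. Whether to drop such probesets or handle them some other
way is a judgment call that depends on the analysis.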

You might consider looking into the custom CDFs that the "brainarray"
group publishes, which may let you avoid these probes altogether:

http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/genomic_curated_CDF.asp

Not sure about the NA part of your question ...

HTH,
-steve

>
> Did I happen to choose a few probesets where the gene definition is
> changing, or am I misunderstanding
> something else, such as the biomart syntax.
>
> Thanks,
>
> Juliet
>
> library("biomaRt")
> probeSets <- c("219666_at", "220547_s_at", "218034_at")
> ensembl = useMart("ensembl")
> ensembl = useDataset("hsapiens_gene_ensembl",mart=ensembl)
>
> getBM(attributes = c("affy_hg_u133a", "entrezgene"), filters =
> "affy_hg_u133a",values = probeSets, mart = ensembl)
>
>
>  affy_hg_u133a entrezgene
> 1   220547_s_at      54537
> 2     218034_at      51024
> 3   220547_s_at         NA
> 4     219666_at      64231
> 5   220547_s_at     414241
> 6   220547_s_at     439965
>
>
>
>
>> sessionInfo()
> R version 2.15.0 (2012-03-30)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=C                 LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] biomaRt_2.12.0      BiocInstaller_1.4.6
>
> loaded via a namespace (and not attached):
> [1] RCurl_1.91-1 tools_2.15.0 XML_3.9-4
>

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact