[BioC] understanding multiple matches between probesets and entrezgene (biomart)
Steve Lianoglou
mailinglist.honeypot at gmail.com
Wed Jun 13 18:16:57 CEST 2012
Hi,
On Wed, Jun 13, 2012 at 11:01 AM, Juliet Hannah <juliet.hannah at gmail.com> wrote:
> All,
>
> I understand the concept of multiple probesets corresponding to one
> identifier. But what is the meaning of
> a probeset corresponding to multiple identifiers? And below, given
> that 220547_s_at has a match,
> why should another row be returned with NA.
[snip]
In the output you posted (below, in the remaining quoted text),
probeset 220547_s_at maps to three different entrez IDs:
http://www.ncbi.nlm.nih.gov/gene?term=414241
http://www.ncbi.nlm.nih.gov/gene?term=54537
http://www.ncbi.nlm.nih.gov/gene?term=439965
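They're all within the same family, and at least one of them is a
pseudogene. In their "Gene description" field, they all mention that
they have "high sequence similarity 35".
Given that, we can take a reasonable guess as to why this is
happening: the probes in this probeset presumably match sequence that
is shared by all three genes, so the probeset can't be assigned to
just one of them.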
You might consider looking into the CDFs the "brainarray" people are
publishing to perhaps avoid these probes altogether:
http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/genomic_curated_CDF.asp
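If you'd rather stay with the standard annotation, a rough sketch of
how one might flag and set aside the ambiguous probesets is below. It
uses base R only, and the data frame is typed in by hand from the
getBM() output quoted below, so it runs without a biomart connection.
The names bm, mapped, n.ids, ambiguous and unambiguous are made up for
this example, and dropping multi-gene probesets is just one possible
policy, not a recommendation from the biomaRt authors.

```r
# Rows copied by hand from the getBM() output quoted in this thread.
bm <- data.frame(
  affy_hg_u133a = c("220547_s_at", "218034_at", "220547_s_at",
                    "219666_at", "220547_s_at", "220547_s_at"),
  entrezgene    = c(54537, 51024, NA, 64231, 414241, 439965),
  stringsAsFactors = FALSE
)

# Drop rows without an entrez ID, plus any exact duplicate pairs.
mapped <- unique(bm[!is.na(bm$entrezgene), ])

# Count the distinct entrez IDs attached to each probeset.
n.ids <- table(mapped$affy_hg_u133a)

# Probesets that hit more than one gene, and the rows that remain
# once those probesets are set aside.
ambiguous   <- names(n.ids)[n.ids > 1]
unambiguous <- mapped[!mapped$affy_hg_u133a %in% ambiguous, ]

print(ambiguous)
print(unambiguous)
```

With a real query you would replace the hand-typed bm with the data
frame that getBM() returns; the rest should work unchanged as long as
the column names match.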
Not sure about the NA part of your question ...
HTH,
-steve
>
> Did I happen to choose a few probesets where the gene definition is
> changing, or am I misunderstanding
> something else, such as the biomart syntax.
>
> Thanks,
>
> Juliet
>
> library("biomaRt")
> probeSets <- c("219666_at", "220547_s_at", "218034_at")
> ensembl = useMart("ensembl")
> ensembl = useDataset("hsapiens_gene_ensembl",mart=ensembl)
>
> getBM(attributes = c("affy_hg_u133a", "entrezgene"), filters =
> "affy_hg_u133a",values = probeSets, mart = ensembl)
>
>
> affy_hg_u133a entrezgene
> 1 220547_s_at 54537
> 2 218034_at 51024
> 3 220547_s_at NA
> 4 219666_at 64231
> 5 220547_s_at 414241
> 6 220547_s_at 439965
>
>
>
>
>> sessionInfo()
> R version 2.15.0 (2012-03-30)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] biomaRt_2.12.0 BiocInstaller_1.4.6
>
> loaded via a namespace (and not attached):
> [1] RCurl_1.91-1 tools_2.15.0 XML_3.9-4
>
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact