[BioC] how does an annotation package handle ambiguous probe set id mappings

Mon Oct 19 19:53:38 CEST 2009

Thank you, Jim,

The probeset is not displayed in the regular mappings precisely because
it maps to multiple genes.  Because its assignment is ambiguous, most
people will want to avoid it most of the time.  Also, legacy code that
depends on getting one value back for such operations needs to keep
working.  However, as you have so carefully illustrated, you can now get
the complete data for any mapping by using toggleProbes().

toggleProbes() accepts three settings for any mapping: "all", which
exposes every mapping regardless of what it is; "multiple", which
exposes only those probes that map to more than one gene; and "single",
the default, which exposes only those probe IDs that the manufacturer
has assigned to a single gene.  So with the default setting, the
troublesome ambiguously assigned probes are not represented in the
mapping (i.e. you will get an NA).

Most probes are assigned to a single gene, so most things are
represented just fine by "single".  This default has the side benefit
of shielding you from probes whose identity the manufacturer is less
than certain about.  But where multiple genes are assigned to a probe
or probeset, you can now get all the assignments out if you wish (or
just the troublesome ones if you want to focus on them), so that you
can make a guess about which gene you have actually measured.
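
To make the three settings concrete, here is a minimal base-R sketch
of the filtering behaviour described above.  It is only an
illustration, not how AnnotationDbi implements toggleProbes(): the
helper filter_probes() and the probe "toy_probe_at" (with gene "1234")
are hypothetical, and only the three Entrez Gene IDs for
1552641_3p_s_at are taken from this thread.

```r
# Toy probe-to-gene table. The first probe and its three IDs come from
# the thread; "toy_probe_at" and gene "1234" are invented for illustration.
pairs <- data.frame(
  probe = c("1552641_3p_s_at", "1552641_3p_s_at", "1552641_3p_s_at",
            "toy_probe_at"),
  gene  = c("55210", "732419", "83858", "1234"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return a named list of gene IDs per probe.
# Probes hidden by the chosen setting are returned as NA.
filter_probes <- function(pairs, value = c("single", "multiple", "all")) {
  value <- match.arg(value)            # defaults to "single"
  genes_by_probe <- split(pairs$gene, pairs$probe)
  n_genes <- lengths(genes_by_probe)   # number of genes per probe
  keep <- switch(value,
    all      = rep(TRUE, length(n_genes)),
    single   = n_genes == 1,
    multiple = n_genes > 1
  )
  genes_by_probe[!keep] <- list(NA_character_)
  genes_by_probe
}

filter_probes(pairs)              # ambiguous probe is hidden (NA)
filter_probes(pairs, "multiple")  # only the ambiguous probe is exposed
filter_probes(pairs, "all")       # every assignment is exposed
```

With the real annotation package, the corresponding call would be
toggleProbes(u133x3pENTREZID, "multiple") and so on, as in Jim's
example quoted below.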

  Marc

James W. MacDonald wrote:
> Hi Andrew,
>
> Andrew Yee wrote:
>> Apologies if this has been asked before, but how does an annotation
>> package handle an ambiguous probe set ID mapping?
>>
>> Take for example the Affymetrix chip U133X3P.
>>
>> When I use the annotation for this chip for probe set ID
>> 1552641_3p_s_at, it returns only one match:
>>
>>> library('u133x3p.db')
>>> mget('1552641_3p_s_at', env=u133x3pSYMBOL)
>> $`1552641_3p_s_at`
>> [1] "ATAD3B"
>>> mget('1552641_3p_s_at', env=u133x3pENTREZID)
>> $`1552641_3p_s_at`
>> [1] "83858"
>>
>> However, when I search Affymetrix, with:
>>
>> https://www.affymetrix.com/analysis/netaffx/fullrecord.affx?pk=U133_X3P:1552641_3P_S_AT
>>
>>
>> it states that it ambiguously maps to three gene symbols, ATAD3A,
>> ATAD3B, and LOC732419.
>>
>> How does the annotation package determine which gene symbol it should
>> map to?
>
> In the past we just used the first probeset ==> Entrez Gene ID
> mapping. However, in the soon-to-be-released BioC 2.5 annotation
> packages all the mappings are included (thanks to Marc Carlson).
>
> > tmp <- toggleProbes(u133x3pENTREZID, "all")
> > get('1552641_3p_s_at', tmp)
> [1] "55210"  "732419" "83858"
> > tmp2 <- toggleProbes(u133x3pSYMBOL, "all")
> > get('1552641_3p_s_at', tmp2)
> [1] "ATAD3A"    "LOC732419" "ATAD3B"
>
> Oddly enough, this probeset isn't mapped in the 'regular' mappings:
>
> > get('1552641_3p_s_at', u133x3pENTREZID)
> [1] NA
> > get('1552641_3p_s_at', u133x3pSYMBOL)
> [1] NA
>
> Marc?
>
> > sessionInfo()
> R version 2.10.0 Under development (unstable) (2009-09-21 r49780)
> i386-pc-mingw32
>
> locale:
> [1] LC_COLLATE=English_United States.1252
> [2] LC_CTYPE=English_United States.1252
> [3] LC_MONETARY=English_United States.1252
> [4] LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] stats     graphics  grDevices datasets  utils     methods   base
>
> other attached packages:
> [1] u133x3p.db_2.3.5     org.Hs.eg.db_2.3.4   RSQLite_0.7-2
> [4] DBI_0.2-4            AnnotationDbi_1.7.17 Biobase_2.5.6
>
> loaded via a namespace (and not attached):
> [1] tools_2.10.0
> >
>
> Best,
>
> Jim
>
>
>>
>> Thanks,
>> Andrew