[BioC] confused about mget with u133x3p.db and u133x3pALIAS2PROBE

Marc Carlson mcarlson at fhcrc.org
Tue Jan 27 18:13:39 CET 2009


Hi Andrew,

Another thing worth mentioning, which your example points out, is that
gene symbols are AWFUL as identifiers: not only can a given gene have
many symbols, but a given symbol is not even guaranteed to be unique to
one gene.

Because of these problems, the best thing to do is to avoid using gene
symbols to identify genes at all.  Usually you only want to provide them
as additional information about a gene, and never as a primary
identifier.  The alias mapping is therefore really just for those times
when a gene symbol is all you have and you need to go fishing for what
it might refer to.  But be careful when using it.  Sometimes a symbol
maps to more than one gene, and there is not really much we can do about
that.  We simply have no way to know which gene you will want in those
cases.
That actually happened in this case as well, which is why it's a good
idea to start with the symbol mapping first when you are forced to go
fishing for a gene's identity like that.
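
A rough sketch of that symbol-first approach is below.  The helper names
lookup_probes() and genes_for_probes() are made up for illustration and
are not part of any package; the only real calls assumed are mget() (with
ifnotfound=NA, so a missing key gives NA instead of an error) and
revmap().  The last part only runs if u133x3p.db is installed.

```r
# Hypothetical helper (not part of AnnotationDbi or u133x3p.db):
# look up probesets for a gene symbol, trying the official-symbol map
# first and falling back to the alias map only when that finds nothing.
# The two maps can be anything mget() accepts (a Bimap from an
# annotation package, or a plain environment).
lookup_probes <- function(symbol, symbol2probe, alias2probe) {
  hit <- mget(symbol, envir = symbol2probe, ifnotfound = NA)[[1]]
  via <- "official symbol"
  if (all(is.na(hit))) {
    hit <- mget(symbol, envir = alias2probe, ifnotfound = NA)[[1]]
    via <- "alias"
  }
  if (all(is.na(hit))) {
    return(list(probes = character(0), via = NA_character_))
  }
  list(probes = hit, via = via)
}

# Hypothetical helper: return the distinct gene IDs behind a set of
# probesets, so a symbol that hits more than one gene is easy to spot.
genes_for_probes <- function(probes, probe2gene) {
  ids <- unlist(mget(probes, envir = probe2gene, ifnotfound = NA),
                use.names = FALSE)
  unique(ids[!is.na(ids)])
}

# Usage with the real annotation package, if it is available.
if (requireNamespace("u133x3p.db", quietly = TRUE)) {
  library(u133x3p.db)
  res <- lookup_probes("MYC", revmap(u133x3pSYMBOL), u133x3pALIAS2PROBE)
  print(res$via)
  # More than one ID here means the symbol is ambiguous.
  print(genes_for_probes(res$probes, u133x3pENTREZID))
}
```

If genes_for_probes() gives back more than one ID, the symbol is
ambiguous and you will have to decide for yourself which gene you meant.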


  Marc



Andrew Yee wrote:
> Thanks for your help.  And I should have looked at the entry on NCBI
> more carefully for the aliases of NOL3.
>
> In the future, I should just use revmap() to restrict the query to
> official symbols.
>
> mget("MYC", env=revmap(u133x3pSYMBOL))
>
> Thanks,
> Andrew
>
>
> On Mon, Jan 26, 2009 at 7:29 PM, Marc Carlson <mcarlson at fhcrc.org> wrote:
>
>     Hi Andrew,
>
>     I think that what is confusing you is that the "alias" and "symbol"
>     mappings have different meanings.  The "alias" mapping covers all
>     the gene symbols that scientists use for a gene, while the "symbol"
>     mapping only returns one symbol per probeset (the "official" one
>     according to NCBI).
>
>     If you were to look at the reverse map of the alias map, you could see
>     this demonstrated here:
>
>     mget("4854714C_3p_s_at", env=revmap(u133x3pALIAS2PROBE))
>
>     This returns all the symbols associated with this probeset,
>     including the "official" one, NOL3:
>     $`4854714C_3p_s_at`
>     [1] "ARC"   "CARD2" "MYC"   "MYP"   "NOL3"  "NOP"   "NOP30"
>
>     In contrast, the u133x3pSYMBOL mapping only maps to the "official"
>     gene symbol for a probeset, so all you get there is NOL3.
>
>
>      Marc
>
>
>
>     Andrew Yee wrote:
>     > I was trying to figure out why I was getting this output with the
>     > u133x3p.db package.
>     >
>     > Specifically, if I enter:
>     >
>     > mget("MYC", env=u133x3pALIAS2PROBE)
>     >
>     > It returns:
>     >
>     > $MYC
>     > [1] "4854714C_3p_s_at"      "Hs.300470.0.A1_3p_at"
>     >     "Hs2.372887.1.S1_3p_at" "g12962934_3p_a_at"
>     >     "g3126906_3p_a_at"
>     >
>     > However, if you try to convert the first probe set ID returned,
>     >
>     > mget( "4854714C_3p_s_at", env=u133x3pSYMBOL)
>     >
>     > it returns:
>     >
>     > $`4854714C_3p_s_at`
>     > [1] "NOL3"
>     >
>     > I'm puzzled why the output from u133x3pALIAS2PROBE doesn't exactly
>     > match up with u133x3pSYMBOL.
>     >
>     > Thanks,
>     > Andrew


