[Bioc-devel] GSEABase::mapIdentifiers() stopped mapping identifiers

Wed Mar 14 23:58:20 CET 2012

On 03/14/2012 07:08 AM, Robert Castelo wrote:
> dear list, dear Martin,
>
> recently
>
> https://stat.ethz.ch/pipermail/bioc-devel/2012-March/003173.html
>
> the mapping of identifiers between ExpressionSet objects and
> GeneSetCollection objects was modified to avoid an unsuccessful mapping
> operation when *both* objects had features based on Entrez Gene
> identifiers since the function mapIdentifiers() would not find a
> corresponding org.Hs.egENTREZID bimap.
>
> however, it seems that this last modification has broken the regular
> mapping between two different kinds of identifiers. this does not
> surface as an error during the mapping, but it may have unpredictable
> consequences downstream; for instance, it currently breaks the GSVA
> vignette because the feature ids in the GeneSetCollection object do not
> map to the feature ids in the ExpressionSet object. here is the code
> reproducing the problem:
>
> library(GSEABase)
> library(GSVAdata)
>
> data(leukemia)
> annotation(leukemia_eset) ## hgu95a chip!
> [1] "hgu95a"
>
> data(c2BroadSets)
>
> gsc_hgu95a <- mapIdentifiers(c2BroadSets,
>     AnnotationIdentifier(annotation(leukemia_eset)))
>
> head(lapply(geneIds(gsc_hgu95a), head)) ## these are not hgu95a IDs!
> $NAKAMURA_CANCER_MICROENVIRONMENT_UP
> [1] "5167"      "100288400" "338328"    "388"       "10631"     "440387"
>
> $NAKAMURA_CANCER_MICROENVIRONMENT_DN
> [1] "55215" "9319"  "81610" "9455"  "64759" "8767"
>
> $WEST_ADRENOCORTICAL_TUMOR_MARKERS_UP
> [1] "5142"   "6781"   "580"    "6713"   "112950" "11182"
>
> $WEST_ADRENOCORTICAL_TUMOR_MARKERS_DN
> [1] "125"  "2619" "5919" "4856" "5156" "4046"
>
> $WINTER_HYPOXIA_UP
> [1] "7022"   "404550" "5738"   "9456"   "5230"   "10856"
>
> $WINTER_HYPOXIA_DN
> [1] "5168"  "9452"  "3112"  "91526" "55843" "9459"
>
> sessionInfo()
> R Under development (unstable) (2012-01-31 r58242)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=C                 LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
>   [1] GSVAdata_0.99.3       hgu95a.db_2.6.3       org.Hs.eg.db_2.6.4
>   [4] RSQLite_0.11.1        DBI_0.2-5             GSEABase_1.17.3
>   [7] graph_1.33.1          annotate_1.33.2       AnnotationDbi_1.17.23
> [10] Biobase_2.15.4        BiocGenerics_0.1.11
>
> loaded via a namespace (and not attached):
> [1] IRanges_1.13.27 stats4_2.15.0   tools_2.15.0    XML_3.9-4
> [5] xtable_1.7-0
>
> note that the identifiers shown for the gene sets above should have
> been hgu95a probeset identifiers, which is what you get when you run
> the same code with the current release version of GSEABase:
>
> library(GSEABase)
> library(GSVAdata)
> data(leukemia)
> data(c2BroadSets)
> gsc_hgu95a <- mapIdentifiers(c2BroadSets,
>     AnnotationIdentifier(annotation(leukemia_eset)))
> head(lapply(geneIds(gsc_hgu95a), head))
> $NAKAMURA_CANCER_MICROENVIRONMENT_UP
> [1] "342_at"    "343_s_at"  "1826_at"   "1451_s_at" "33436_at"  "32488_at"
>
> $NAKAMURA_CANCER_MICROENVIRONMENT_DN
> [1] "32617_at" "36813_at" "38292_at" "41384_at" "36205_at" "39425_at"
>
> $WEST_ADRENOCORTICAL_TUMOR_MARKERS_UP
> [1] "33705_at" "41354_at" "1801_at"  "35839_at" "33432_at" "36907_at"
>
> $WEST_ADRENOCORTICAL_TUMOR_MARKERS_DN
> [1] "35730_at" "41839_at" "661_at"   "34407_at" "39250_at" "1731_at"
>
> $WINTER_HYPOXIA_UP
> [1] "40303_at"   "41010_at"   "31488_s_at" "37677_at"   "35758_at"
> [6] "34301_r_at"
>
> $WINTER_HYPOXIA_DN
> [1] "41123_s_at" "41124_r_at" "41125_r_at" "40775_at"   "38570_at"
> [6] "37543_at"
>
> sessionInfo()
> R version 2.14.0 (2011-10-31)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=C                 LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
>   [1] GSVAdata_0.99.3       hgu95a.db_2.6.3       org.Hs.eg.db_2.6.4
>   [4] RSQLite_0.11.1        DBI_0.2-5             GSEABase_1.16.0
>   [7] graph_1.32.0          annotate_1.32.1       AnnotationDbi_1.16.16
> [10] Biobase_2.14.0
>
> loaded via a namespace (and not attached):
> [1] IRanges_1.12.6 tools_2.14.0   XML_3.9-4      xtable_1.7-0
>
>
> i'm sorry that my previous request for idempotent maps broke the more
> fundamental mapping functionality but i hope that this has an easy fix.

So the answer to the original post should have been closer to 'use 
EntrezIdentifier for org.Hs.eg.db annotations'? As in the final line of

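library(Biobase)
library(GSEABase)      # mapIdentifiers(), EntrezIdentifier()
library(org.Hs.eg.db)
library(GSVAdata)
data(c2BroadSets)
## Entrez Gene ids that have a gene symbol in org.Hs.eg.db
mapped_genes <- mappedkeys(org.Hs.egSYMBOL)
## toy expression matrix: 100 Entrez-keyed features x 10 samples
exp <- matrix(rnorm(1000), nrow=100,
     dimnames=list(mapped_genes[1:100], paste("sample", 1:10, sep="")))
eset <- new("ExpressionSet", exprs=exp, annotation="org.Hs.eg.db")
## the annotation is an org package, so use EntrezIdentifier
gsc <- mapIdentifiers(c2BroadSets, EntrezIdentifier(annotation(eset)))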

class?GeneIdentifierType says AnnotationIdentifier is meant for
Affymetrix chip packages (there seem to be some underlying problems
anyway; GeneSet(eset) creates an AnnotationIdentifier).
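
One way to make that choice explicit is a small helper that picks the
identifier type from the annotation string. This is only a sketch:
targetIdType() is a hypothetical name, not part of GSEABase, and the
assumption that org-level Entrez packages are named like "org.Hs.eg.db"
is a naming heuristic rather than anything the package checks.

```r
library(GSEABase)

## Hypothetical helper (not part of GSEABase): choose the target
## identifier type from an annotation string.
## Assumption: org-level Entrez Gene packages are named "org.<Sp>.eg.db";
## anything else is treated as a chip annotation package.
targetIdType <- function(annotation) {
    if (grepl("^org\\..*\\.eg(\\.db)?$", annotation)) {
        EntrezIdentifier(annotation)
    } else {
        AnnotationIdentifier(annotation)
    }
}

id_org  <- targetIdType("org.Hs.eg.db")  # an EntrezIdentifier
id_chip <- targetIdType("hgu95a")        # an AnnotationIdentifier
```

With an ExpressionSet 'eset' and a GeneSetCollection 'gsc' at hand, the
call would then read mapIdentifiers(gsc, targetIdType(annotation(eset))).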

Martin

>
> thanks!!
> robert.
>

-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793