[BioC] Changes in annotations?

Loren Engrav engrav at u.washington.edu
Wed Jul 8 18:22:29 CEST 2009


For probe 238900_at, the symbol field in the Affy csv file is
  
 HLA-DRB1 /// HLA-DRB2 /// HLA-DRB3 /// HLA-DRB4 /// HLA-DRB5 ///
LOC100133484 /// LOC100133661 /// LOC100133811 /// LOC730415 /// RNASE2 ///
ZNF749  

And the Gene ID field is
  
 100133484 /// 100133661 /// 100133811 /// 3123 /// 3124 /// 3125 /// 3126
/// 3127 /// 388567 /// 6036 /// 730415
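
For reference, a small sketch in base R that splits the two fields quoted above on the " /// " separator and counts the entries. The variable names and the helper `split_affy` are made up for illustration; they are not part of any Affymetrix or Bioconductor package.

```r
# Symbol and Gene ID fields for 238900_at, copied from the Affy csv excerpt above
symbol_field <- paste(
  "HLA-DRB1 /// HLA-DRB2 /// HLA-DRB3 /// HLA-DRB4 /// HLA-DRB5 ///",
  "LOC100133484 /// LOC100133661 /// LOC100133811 /// LOC730415 /// RNASE2 ///",
  "ZNF749"
)
id_field <- paste(
  "100133484 /// 100133661 /// 100133811 /// 3123 /// 3124 /// 3125 /// 3126",
  "/// 3127 /// 388567 /// 6036 /// 730415"
)

# Hypothetical helper: split one "///"-delimited field into a trimmed vector
split_affy <- function(x) {
  trimws(strsplit(x, "///", fixed = TRUE)[[1]])
}

symbols <- split_affy(symbol_field)
ids     <- split_affy(id_field)

length(symbols)  # number of candidate symbols
length(ids)      # number of candidate Entrez Gene IDs
```

Both fields split into 11 entries, but nothing in the excerpt says the i-th symbol belongs to the i-th ID. In the older result quoted below, HLA-DRB1 (first symbol) goes with 3123 (fourth ID), so I would not pair them by position.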

So it is confusing even without the "moving target" matter.

If a probe has multiple choices for symbol and ID, what would happen if the
response were "multiple, you must choose one"?

Or if the response were to select all of the identifiers?
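
To make the alternatives concrete, here is a minimal sketch in base R. `resolve_ids` and its policy names are hypothetical, not an annotate or AnnotationDbi function. The "first" policy mimics the one-to-one choice described in the quoted reply below, "all" keeps every identifier, and "reject" stands in for "multiple, you must choose one".

```r
# Entrez Gene IDs listed for 238900_at in the Affy csv, in file order
ids <- c("100133484", "100133661", "100133811", "3123", "3124", "3125",
         "3126", "3127", "388567", "6036", "730415")

# Hypothetical helper: resolve a one-to-many mapping under a stated policy
resolve_ids <- function(ids, policy = c("first", "all", "reject")) {
  policy <- match.arg(policy)
  # Nothing to resolve when there is at most one candidate
  if (length(ids) <= 1L) return(ids)
  switch(policy,
    first  = ids[1L],        # keep only the first listed ID
    all    = ids,            # keep every listed ID
    reject = NA_character_   # refuse to pick; caller must choose
  )
}

resolve_ids(ids, "first")
resolve_ids(ids, "all")
resolve_ids(ids, "reject")
```

With "first" this returns 100133484, which is the ID that shows up for 238900_at in the R 2.9 / BioC 2.4 output quoted below.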

Thank you

> From: "James W. MacDonald" <jmacdon at med.umich.edu>
> Date: Mon, 06 Jul 2009 09:58:05 -0400
> To: Alex Sanchez <asanchez at ub.edu>
> Cc: "bioconductor at stat.math.ethz.ch" <bioconductor at stat.math.ethz.ch>
> Subject: Re: [BioC] Changes in annotations?
> 
> Hi Alex,
> 
> This is a question that comes up on the Bioc list fairly regularly, and
> the answer is in two parts:
> 
> First, the annotations supplied in the various metadata packages
> supplied by BioC are *not* our annotations, but are simply a
> re-packaging of data we collect from various sources. As an example, we
> use the mappings of Affymetrix Probe ID to Entrez Gene ID from the
> annotation csv files you can download from the Affy website. We then map
> the Entrez Gene IDs to other annotation using primarily NCBI data. So if
> you go to Affy's netaffx site (free registration required) and query on
> say, 238900_at, you get this:
> 
> https://www.affymetrix.com/analysis/netaffx/fullrecord.affx?pk=HG-U133_PLUS_2%3A238900_AT
> 
> And you will note that the first Entrez Gene ID listed there is
> 100133484, which happens to be a defunct ID. However, this is the first
> of many listed there (and we need a one-to-one mapping), so we chose
> that one. A more likely Entrez Gene ID can be found further down the
> list, but we simply don't have the resources to figure out if there is a
> better choice in that list (for every reporter on every Affy chip we
> annotate). Nor do we have the resources to ensure that any of the
> mappings that Affy make are reasonable to begin with. We have to trust
> that they (with *way* more resources than us) are doing a reasonable job.
> 
> The second part of the answer has to do with the 'moving target' aspect
> of Biological annotations. These data change all the time, and there is
> the recurring question of whether one should do an analysis and 'freeze'
> it to that point in time, or should the annotations be updated on a
> regular basis, with the realization that things can and will change?
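
One way to "freeze" an analysis is to store the probe-to-gene table together with the versions that produced it. A minimal sketch in base R, using a two-probe excerpt of the R 2.9 / BioC 2.4 table from this thread; the list layout and field names are my own choice, and the package version string is copied from the sessionInfo() quoted further down.

```r
# Two rows from the R 2.9 / BioC 2.4 result in this thread
mapping <- data.frame(
  probe  = c("238900_at", "223620_at"),
  entrez = c("100133484", "2857"),
  symbol = c("LOC100133484", "GPR34"),
  stringsAsFactors = FALSE
)

# Bundle the table with provenance so a later re-run can be compared against it
snapshot <- list(
  mapping        = mapping,
  annotation_pkg = "hgu133plus2.db_2.2.11",  # from the quoted sessionInfo()
  r_version      = R.version.string,         # the R running this sketch
  frozen_on      = Sys.Date()
)

# Write the snapshot to disk (a temporary file here, for illustration)
path <- tempfile(fileext = ".rds")
saveRDS(snapshot, path)

# Later: reload the frozen table instead of querying the current packages
restored <- readRDS(path)
restored$annotation_pkg
restored$mapping
```

This does not settle whether to freeze or update, it only makes it possible to tell afterwards which annotation a given result was based on.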
> 
> Without looking at each reporter ID you list, I can't say if the changes
> are due to Affy changing their annotation csv files, or to changing
> knowledge of the genome, but I suspect it is a combination of the two.
> 
> Best,
> 
> Jim
> 
> Alex Sanchez wrote:
>> Hello
>> 
>> I recently had to review an analysis I did some time ago. It was done
>> on Affymetrix hgu133plus2 chips with R 2.4 and BioC 1.9. I have re-run the
>> analyses using R 2.9 and BioC 2.4 (sessionInfo below).
>> I was surprised by the changes in the annotations: many probesets that
>> had an annotation have become NAs, whereas some have changed their symbol
>> and their Entrez Gene ID.
>> 
>> To be specific I summarize my question with the top genes of my list
>> 
>> The list I obtained 2 years ago is:
>> 
>> probeset      locuslink  symbol
>> 238900_at     3123       HLA-DRB1
>> 232583_at     8440       NCK2
>> 236307_at     60468      BACH2
>> 223620_at     2857       GPR34
>> 219759_at     64167      LRAP
>> 201702_s_at   5514       PPP1R10
>> 232882_at     2308       FOXO1A
>> 213446_s_at   8826       IQGAP1
>> 234033_at     9693       RAPGEF2
>> 243006_at     2534       FYN
>> 244648_at     54520      CCDC93
>> 243691_at     23142      DCUN1D4
>> 239264_at     60412      EXOC4
>> 243546_at     143686     SESN3
>> 205239_at     374        AREG
>> 1565703_at    55520      ELAC1
>> 244061_at     55843      ARHGAP15
>> 230505_at     26037      SIPA1L1
>> 242688_at     9320       TRIP12
>> 1556474_a_at  285097     FLJ38379
>> 232614_at     596        BCL2
>> 1565689_at    3839       KPNA3
>> 236685_at     NA         NA
>> 225173_at     93663      ARHGAP18
>> 241893_at     4249       MGAT5
>> 
>> I used the following code to reproduce the issue with the annotations:
>> 
>> 
>> #####################################################################
>> ## Verification using R 2.9 & BioC 2.4
>> #####################################################################
>> 
>>> probes <- c("238900_at", "232583_at", "236307_at", "223620_at", "219759_at",
>> +   "201702_s_at", "232882_at", "213446_s_at", "234033_at", "243006_at",
>> +   "244648_at", "243691_at", "239264_at", "243546_at", "205239_at",
>> +   "1565703_at", "244061_at", "230505_at", "242688_at", "1556474_a_at",
>> +   "232614_at", "1565689_at", "236685_at", "225173_at", "241893_at")
>>> library(hgu133plus2.db)
>>> library(annotate)
>>> 
>>> entrezs<- getEG(probes, "hgu133plus2")
>>> symbols<- getSYMBOL(probes, "hgu133plus2")
>>> sel2<- cbind(probes, entrezs, symbols)
>>> sel2
>>              probes         entrezs     symbols
>> 238900_at    "238900_at"    "100133484" "LOC100133484"
>> 232583_at    "232583_at"    NA          NA
>> 236307_at    "236307_at"    NA          NA
>> 223620_at    "223620_at"    "2857"      "GPR34"
>> 219759_at    "219759_at"    "64167"     "ERAP2"
>> 201702_s_at  "201702_s_at"  "5514"      "PPP1R10"
>> 232882_at    "232882_at"    NA          NA
>> 213446_s_at  "213446_s_at"  "8826"      "IQGAP1"
>> 234033_at    "234033_at"    NA          NA
>> 243006_at    "243006_at"    NA          NA
>> 244648_at    "244648_at"    NA          NA
>> 243691_at    "243691_at"    NA          NA
>> 239264_at    "239264_at"    NA          NA
>> 243546_at    "243546_at"    NA          NA
>> 205239_at    "205239_at"    "374"       "AREG"
>> 1565703_at   "1565703_at"   "4089"      "SMAD4"
>> 244061_at    "244061_at"    NA          NA
>> 230505_at    "230505_at"    "145474"    "LOC145474"
>> 242688_at    "242688_at"    NA          NA
>> 1556474_a_at "1556474_a_at" "285097"    "FLJ38379"
>> 232614_at    "232614_at"    NA          NA
>> 1565689_at   "1565689_at"   NA          NA
>> 236685_at    "236685_at"    NA          NA
>> 225173_at    "225173_at"    "93663"     "ARHGAP18"
>> 241893_at    "241893_at"    NA          NA
>>> sessionInfo()
>> R version 2.9.0 (2009-04-17)
>> i386-pc-mingw32 
>> 
>> locale:
>> LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;
>> LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
>> 
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>> 
>> other attached packages:
>> [1] annotate_1.22.0       hgu133plus2.db_2.2.11 RSQLite_0.7-1
>> DBI_0.2-4             AnnotationDbi_1.6.0   Biobase_2.4.1
>> 
>> loaded via a namespace (and not attached):
>> [1] xtable_1.5-5
>> #############################################
>> 
>> Many probesets seem to have changed.
>> Can someone explain to me what is happening (or what may I be doing wrong)?
>> 
>> The same code does not work with R 2.4, but if I replace hgu133plus2.db with
>> hgu133plus2 and getEG with getLL I obtain the original results:
>> 
>> ###############################################
>> ### Review of annotations with R 2.4 and BioC 1.9
>> ###############################################
>> 
>> ### This code is executed on a clean new session with R 2.4 and BioC 1.9
>> 
>>> probes <- c("238900_at", "232583_at", "236307_at", "223620_at", "219759_at",
>> +   "201702_s_at", "232882_at", "213446_s_at", "234033_at", "243006_at",
>> +   "244648_at", "243691_at", "239264_at", "243546_at", "205239_at",
>> +   "1565703_at", "244061_at", "230505_at", "242688_at", "1556474_a_at",
>> +   "232614_at", "1565689_at", "236685_at", "225173_at", "241893_at")
>>> LLs<- getLL(probes, "hgu133plus2")
>>> symbols<- getSYMBOL(probes, "hgu133plus2")
>>> sel1<- cbind(probes, LLs, symbols)
>>> sel1
>>              probes         LLs      symbols
>> 238900_at    "238900_at"    "3123"   "HLA-DRB1"
>> 232583_at    "232583_at"    "8440"   "NCK2"
>> 236307_at    "236307_at"    "60468"  "BACH2"
>> 223620_at    "223620_at"    "2857"   "GPR34"
>> 219759_at    "219759_at"    "64167"  "ERAP2"
>> 201702_s_at  "201702_s_at"  "5514"   "PPP1R10"
>> 232882_at    "232882_at"    "2308"   "FOXO1"
>> 213446_s_at  "213446_s_at"  "8826"   "IQGAP1"
>> 234033_at    "234033_at"    "9693"   "RAPGEF2"
>> 243006_at    "243006_at"    "2534"   "FYN"
>> 244648_at    "244648_at"    "54520"  "CCDC93"
>> 243691_at    "243691_at"    "23142"  "DCUN1D4"
>> 239264_at    "239264_at"    "60412"  "EXOC4"
>> 243546_at    "243546_at"    "143686" "SESN3"
>> 205239_at    "205239_at"    "374"    "AREG"
>> 1565703_at   "1565703_at"   "4089"   "SMAD4"
>> 244061_at    "244061_at"    "55843"  "ARHGAP15"
>> 230505_at    "230505_at"    "145474" "LOC145474"
>> 242688_at    "242688_at"    "9320"   "TRIP12"
>> 1556474_a_at "1556474_a_at" "285097" "FLJ38379"
>> 232614_at    "232614_at"    "596"    "BCL2"
>> 1565689_at   "1565689_at"   "3839"   "KPNA3"
>> 236685_at    "236685_at"    NA       NA
>> 225173_at    "225173_at"    "93663"  "ARHGAP18"
>> 241893_at    "241893_at"    "4249"   "MGAT5"
>> 
>>> sessionInfo()
>> R version 2.4.1 (2006-12-18)
>> i386-pc-mingw32 
>> 
>> locale:
>> LC_COLLATE=Spanish_Spain.1252;LC_CTYPE=Spanish_Spain.1252;
>> LC_MONETARY=Spanish_Spain.1252;LC_NUMERIC=C;LC_TIME=Spanish_Spain.1252
>> 
>> attached base packages:
>> [1] "tools"     "stats"     "graphics"  "grDevices"
>> [5] "utils"     "datasets"  "methods"   "base"
>> 
>> other attached packages:
>>    annotate     Biobase hgu133plus2
>>    "1.12.1"    "1.12.2"    "1.14.0"
>> 
>> ########################################################
>> 
>> In summary, if I use R 2.4/BioC 1.9 I obtain the same results I obtained 2
>> years ago, but if I do the same steps using R 2.9/BioC 2.4 the results change
>> dramatically.
>> I have repeated the analyses using BioC 2.01 in R 2.7 and BioC 2.2 in R 2.8
>> (results not shown here). BioC 2.0 yields the same as 1.9 and BioC 2.2 the
>> same as 2.4.
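
The difference between the two runs can be tabulated directly. A minimal sketch in base R, with the Entrez Gene IDs typed in from the two console tables above (R 2.4 / BioC 1.9 as `old_ids`, R 2.9 / BioC 2.4 as `new_ids`); the variable names and status labels are illustrative only.

```r
probes <- c("238900_at", "232583_at", "236307_at", "223620_at", "219759_at",
  "201702_s_at", "232882_at", "213446_s_at", "234033_at", "243006_at",
  "244648_at", "243691_at", "239264_at", "243546_at", "205239_at",
  "1565703_at", "244061_at", "230505_at", "242688_at", "1556474_a_at",
  "232614_at", "1565689_at", "236685_at", "225173_at", "241893_at")

# IDs from the R 2.4 / BioC 1.9 run (getLL)
old_ids <- c("3123", "8440", "60468", "2857", "64167",
  "5514", "2308", "8826", "9693", "2534",
  "54520", "23142", "60412", "143686", "374",
  "4089", "55843", "145474", "9320", "285097",
  "596", "3839", NA, "93663", "4249")

# IDs from the R 2.9 / BioC 2.4 run (getEG)
new_ids <- c("100133484", NA, NA, "2857", "64167",
  "5514", NA, "8826", NA, NA,
  NA, NA, NA, NA, "374",
  "4089", NA, "145474", NA, "285097",
  NA, NA, NA, "93663", NA)

# Classify each probeset by what happened to its Entrez Gene ID
status <- ifelse(is.na(old_ids) & is.na(new_ids), "none in either",
          ifelse(is.na(new_ids), "lost",
          ifelse(is.na(old_ids), "gained",
          ifelse(old_ids == new_ids, "same", "changed"))))
names(status) <- probes

table(status)
names(status)[status == "changed"]
```

On these 25 probesets that gives 14 that lost their annotation, 9 that are unchanged, 1 that changed (238900_at) and 1 that had no annotation in either run.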
>> 
>> Any help to understand what's happening would be appreciated
>> 
>> Alex Sanchez
>> 
>> -----------------------------------------------------------------------------
>> Dr. Alex  Sánchez. Statistics Department. University of Barcelona.
>> Facultat de Biologia UB. Avda Diagonal 645. 08028 Barcelona. Spain
>> asanchez_at_ub.edu
>> Statistics and Bioinformatics Unit
>> Institut de Recerca. Hospital Universitari Vall d'Hebron
>> Passeig Vall d'Hebron 112-119. 08034 Barcelona
>> asanchez_at_ir.vhebron.net
>> -----------------------------------------------------------------------------
>> 
>> ------------------------------------------------------------------------
>> 
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> -- 
> James W. MacDonald, M.S.
> Biostatistician
> Douglas Lab
> University of Michigan
> Department of Human Genetics
> 5912 Buhl
> 1241 E. Catherine St.
> Ann Arbor MI 48109-5618
> 734-615-7826
> 