[BioC] Changes in annotations?

Loren Engrav engrav at u.washington.edu
Fri Jul 10 04:43:09 CEST 2009


Thank you

Why is it not "best" to include everything Affy includes? All the synonyms
and IDs

Or is this not software possible?



> From: "James W. MacDonald" <jmacdon at med.umich.edu>
> Date: Thu, 09 Jul 2009 09:36:46 -0400
> To: Alex Sanchez <asanchez at ub.edu>
> Cc: "bioconductor at stat.math.ethz.ch" <bioconductor at stat.math.ethz.ch>
> Subject: Re: [BioC] Changes in annotations?
> 
> Hi Alex,
> 
> We have taken this up as an internal topic of discussion and hopefully
> will be able to come up with a solution that is a bit better than what
> we are doing right now.
> 
> However, some of this is going to be unavoidable. The chips were based
> on UniGene build 133 (and we are now on build 219), so things are going
> to tend to move around. But I don't think this is the main reason for
> what you are seeing, as the latest Affy annotation file is still based
> on UCSC build hg18 (which is based on NCBI build 36.1), so these data
> are over three years old.
> 
> Instead, it appears that Affy have changed something about how they are
> annotating things. Some of this is an improvement, and some not so much.
> 
> For instance, 203315_at does interrogate NCK2:
> 
> http://genome.ucsc.edu/cgi-bin/hgTracks?insideX=115&revCmplDisp=0&hgsid=136510
> 693&hgt.out3=10x&position=chr2%3A105876577-105876826&hgtgroup_map_close=0&hgtg
> roup_phenDis_close=1&hgtgroup_genes_close=0&hgtgroup_rna_close=0&hgtgroup_expr
> ession_close=1&hgtgroup_regulation_close=0&hgtgroup_compGeno_close=0&hgtgroup_
> varRep_close=0&hgtgroup_encodeGenes_close=1&hgtgroup_encodeTxLevels_close=1&hg
> tgroup_encodeChip_close=1&hgtgroup_encodeChrom_close=1&hgtgroup_encodeCompAndV
> ar_close=1
> 
> whereas 232583_at does not:
> 
> http://genome.ucsc.edu/cgi-bin/hgTracks?insideX=115&revCmplDisp=0&hgsid=136510
> 693&hgt.out3=10x&position=chr2%3A105752637-105752886&hgtgroup_map_close=0&hgtg
> roup_phenDis_close=1&hgtgroup_genes_close=0&hgtgroup_rna_close=0&hgtgroup_expr
> ession_close=1&hgtgroup_regulation_close=0&hgtgroup_compGeno_close=0&hgtgroup_
> varRep_close=0&hgtgroup_encodeGenes_close=1&hgtgroup_encodeTxLevels_close=1&hg
> tgroup_encodeChip_close=1&hgtgroup_encodeChrom_close=1&hgtgroup_encodeCompAndV
> ar_close=1
> 
> so in this case they have improved their annotations.
> 
> In the case of 238900_at, they seem to have added in a bunch of EG IDs
> that are based on a model rather than being actual reviewed genes. In
> addition, that probeset seems to hit an intron, so adding in all this
> other annotation appears pointless:
> 
> http://genome.ucsc.edu/cgi-bin/hgTracks?insideX=115&revCmplDisp=0&hgsid=136511
> 081&hgt.out3=10x&position=chr6%3A32653723-32653972&hgtgroup_map_close=0&hgtgro
> up_phenDis_close=1&hgtgroup_genes_close=0&hgtgroup_rna_close=0&hgtgroup_expres
> sion_close=1&hgtgroup_regulation_close=0&hgtgroup_compGeno_close=0&hgtgroup_va
> rRep_close=0&hgtgroup_encodeGenes_close=1&hgtgroup_encodeTxLevels_close=1&hgtg
> roup_encodeChip_close=1&hgtgroup_encodeChrom_close=1&hgtgroup_encodeCompAndVar
> _close=1
> 
> What this really comes down to is the fact that you can't take anything
> for granted. Not only do you have to validate a set of significant
> results using some other technology, you also need to ensure that the
> chip you are using is actually measuring what you think prior to doing
> the validation.
> 
> It should be relatively painless to set up a pipeline where you could
> use the BSgenome.Hsapiens.UCSC.hg18 package along with functions from
> BSgenome/Biostrings to map probesets to the genome and then use the
> org.Hs.eg.db package to see if a given probeset is even interrogating
> the gene it is supposed to. One could then use rtracklayer to visualize
> things in the UCSC genome browser to make sure you weren't interrogating
> an intron.
> 
> Best,
> 
> Jim
> 
> 
> 
> 
> Alex Sanchez wrote:
>> Hi James and thanks for the help.
>> 
>> I know, of course, that what Bioconductor does is to take the
>> annotations from public data sources.
>> I will now turn to affymetrix to see if someone there, or in their
>> forums, can explain why so many annotations have been turned into "NA's"
>> 
>> There seem to be two problems here
>> - The first one, pointed by Loren in his message today is synonims. For
>> instance the first probeset in my list was 238900_at,
>>   In the first version I used (BioC 1.9) the Gene Symbol provide by
>> getSymbol was HLA-DRB1 whereas now (BioC 2.4) it is LOC100133484
>>   However the original symbol is still in the Affymetrix annotation file
>> "HG-U133_Plus_2.na28.annot.csv". It seems as if the new symbol is intended
>>   to show the relation with the new Entrez ID (100133484).
>> In any case it is ennoying but it can be assumed.
>> - What seems worse is what happens to some annotations: If, for
>> instance,  I take the second probeset in my list, 232583_at, and I look
>> for it in the affy annotations file I find that apart some links to
>> databases as genebank it does not have a gene symbol or an entrez (or
>> most annotations anymore).
>> In the previous version of the annotations this probeset was one of two
>> associated with gene NCK2 (232583_at;203315_at) .
>> The problem is that the second probeset (203315_at) was not selected by
>> the analysis. That is the only evidence suggesting that gene NCK2 was
>> differentially expressed (232583_at) does not suggest it anymore.
>> 
>> Last, but in any case least, this is not happening to a few probesets.
>> My original list had 417 probesets. 210 have changed/synonimized their
>> gene symbols, but from these 210, 160 have become NA's
>> 
>> Seems too many to feel comfortable :-(
>> 
>> Thanks for the help
>> 
>> Alex
>> 
>> .............................................................................
>> ..................................
>> 
>> Dr. Alex  Sánchez.
>> Associate Professor. Statistics Department. University of Barcelona.
>> Facultat de Biologia UB. Avda Diagonal 645. 08028 Barcelona. Spain
>> asanchez_at_ub.edu
>> Statistics and Bioinformatics Unit
>> Institut de Recerca. Hospital Universitari Vall 'Hebron
>> Passeig Vall d'Hebron 112-119. 08034 Barcelona
>> asanchez_at_ir.vhebron.net
>> .............................................................................
>> ..................................
>> 
>> 
>> 
>> 
>> 
>> ----- Original Message ----- From: "James W. MacDonald"
>> <jmacdon at med.umich.edu>
>> To: "Alex Sanchez" <asanchez at ub.edu>
>> Cc: <bioconductor at stat.math.ethz.ch>
>> Sent: Monday, July 06, 2009 3:58 PM
>> Subject: Re: [BioC] Changes in annotations?
>> 
>> 
>>> Hi Alex,
>>> 
>>> This is a question that comes up on the Bioc list fairly regularly,
>>> and the answer is in two parts:
>>> 
>>> First, the annotations supplied in the various metadata packages
>>> supplied by BioC are *not* our annotations, but are simply a
>>> re-packaging of data we collect from various sources. As an example,
>>> we use the mappings of Affymetrix Probe ID to Entrez Gene ID from the
>>> annotation csv files you can download from the Affy website. We then
>>> map the Entrez Gene IDs to other annotation using primarily NCBI data.
>>> So if you go to Affy's netaffx site (free registration required) and
>>> query on say, 238900_at, you get this:
>>> 
>>> https://www.affymetrix.com/analysis/netaffx/fullrecord.affx?pk=HG-U133_PLUS_
>>> 2%3A238900_AT 
>>> 
>>> 
>>> And you will note that the first Entrez Gene ID listed there is
>>> 100133484, which happens to be a defunct ID. However, this is the
>>> first of many listed there (and we need a one-to-one mapping), so we
>>> chose that one. A more likely Entrez Gene ID can be found further down
>>> the list, but we simply don't have the resources to figure out if
>>> there is a better choice in that list (for every reporter on every
>>> Affy chip we annotate). Nor do we have the resources to ensure that
>>> any of the mappings that Affy make are reasonable to begin with. We
>>> have to trust that they (with *way* more resources that us) are doing
>>> a reasonable job.
>>> 
>>> The second part of the answer has to do with the 'moving target'
>>> aspect of Biological annotations. These data change all the time, and
>>> there is the recurring question of whether one should do an analysis
>>> and 'freeze' it to that point in time, or should the annotations be
>>> updated on a regular basis, with the realization that things can and
>>> will change?
>>> 
>>> Without looking at each reporter ID you list, I can't say if the
>>> changes are due to Affy changing their annotation csv files, or to
>>> changing knowledge of the genome, but I suspect it is a combination of
>>> the two.
>>> 
>>> Best,
>>> 
>>> Jim
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Alex Sanchez wrote:
>>>> Hello
>>>> 
>>>> I have had to review recently an analysis I did some time ago. This
>>>> was done on affymetrix hgu133plus2 chips with R 2.4 and BioC 1.9 I
>>>> have re-run the analyses using R 2.9 and BioC 2.4 (sessionInfo below).
>>>> I have been surprised by the changes in the annotations: Many
>>>> probesets that had had an annotation have become NA's whereas some
>>>> have changed their symbol and their Entrez gene.
>>>> 
>>>> To be specific I summarize my question with the top genes of my list
>>>> 
>>>> The list I obtained 2 years ago is:
>>>> 
>>>> probeset    locuslink    symbol
>>>> 238900_at 3123 HLA-DRB1
>>>> 232583_at 8440 NCK2
>>>> 236307_at 60468 BACH2
>>>> 223620_at 2857 GPR34
>>>> 219759_at 64167 LRAP
>>>> 201702_s_at 5514 PPP1R10
>>>> 232882_at 2308 FOXO1A
>>>> 213446_s_at 8826 IQGAP1
>>>> 234033_at 9693 RAPGEF2
>>>> 243006_at 2534 FYN
>>>> 244648_at 54520 CCDC93
>>>> 243691_at 23142 DCUN1D4
>>>> 239264_at 60412 EXOC4
>>>> 243546_at 143686 SESN3
>>>> 205239_at 374 AREG
>>>> 1565703_at 55520 ELAC1
>>>> 244061_at 55843 ARHGAP15
>>>> 230505_at 26037 SIPA1L1
>>>> 242688_at 9320 TRIP12
>>>> 1556474_a_at 285097 FLJ38379
>>>> 232614_at 596 BCL2
>>>> 1565689_at 3839 KPNA3
>>>> 236685_at NA NA
>>>> 225173_at 93663 ARHGAP18
>>>> 241893_at 4249 MGAT5
>>>> 
>>>> I used the following code to reproduce the issue with the annotations:
>>>> 
>>>> 
>>>> #####################################################################
>>>> ## Verification using R 2.9 & BioC 2.4
>>>> #####################################################################
>>>> 
>>>>> probes<-c("238900_at" , "232583_at", "236307_at" ,"223620_at" ,
>>>>> "219759_at" ,
>>>> +  "201702_s_at" , "232882_at"  ,  "213446_s_at",  "234033_at",
>>>> "243006_at" ,  +  "244648_at" ,   "243691_at" ,   "239264_at" ,
>>>> "243546_at" ,   "205239_at" ,
>>>> +  "1565703_at" ,  "244061_at"  ,  "230505_at" ,   "242688_at" ,
>>>> "1556474_a_at",
>>>> +  "232614_at"  ,  "1565689_at" ,  "236685_at"  ,  "225173_at" ,
>>>> "241893_at")
>>>>> library(hgu133plus2.db)
>>>>> library(annotate)
>>>>> 
>>>>> entrezs<- getEG(probes, "hgu133plus2")
>>>>> symbols<- getSYMBOL(probes, "hgu133plus2")
>>>>> sel2<- cbind(probes, entrezs, symbols)
>>>>> sel2
>>>>              probes         entrezs     symbols       238900_at
>>>> "238900_at"    "100133484" "LOC100133484"
>>>> 232583_at    "232583_at"    NA          NA           236307_at
>>>> "236307_at"    NA          NA           223620_at    "223620_at"
>>>> "2857"      "GPR34"       219759_at    "219759_at"    "64167"
>>>> "ERAP2" 201702_s_at  "201702_s_at"  "5514"      "PPP1R10"
>>>> 232882_at "232882_at"    NA          NA           213446_s_at
>>>> "213446_s_at" "8826"      "IQGAP1"      234033_at    "234033_at"
>>>> NA          NA 243006_at    "243006_at"    NA          NA
>>>> 244648_at "244648_at"    NA          NA           243691_at
>>>> "243691_at"    NA NA           239264_at    "239264_at"
>>>> NA          NA 243546_at    "243546_at"    NA          NA
>>>> 205239_at "205239_at"    "374"       "AREG"        1565703_at
>>>> "1565703_at" "4089"      "SMAD4"       244061_at    "244061_at"
>>>> NA          NA 230505_at    "230505_at"    "145474"    "LOC145474"
>>>> 242688_at "242688_at"    NA          NA           1556474_a_at
>>>> "1556474_a_at" "285097"    "FLJ38379"    232614_at    "232614_at"
>>>> NA          NA 1565689_at   "1565689_at"   NA          NA
>>>> 236685_at "236685_at"    NA          NA           225173_at
>>>> "225173_at" "93663"     "ARHGAP18"    241893_at    "241893_at"
>>>> NA          NA
>>>>> sessionInfo()
>>>> R version 2.9.0 (2009-04-17) i386-pc-mingw32 locale:
>>>> LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
>>>> States.1252;LC_MONETARY=English_United
>>>> States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
>>>> 
>>>> attached base packages:
>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>> other attached packages:
>>>> [1] annotate_1.22.0       hgu133plus2.db_2.2.11 RSQLite_0.7-1
>>>> DBI_0.2-4             AnnotationDbi_1.6.0   Biobase_2.4.1
>>>> loaded via a namespace (and not attached):
>>>> [1] xtable_1.5-5
>>>> #############################################
>>>> 
>>>> Many probesets seem to have changed.
>>>> Can someone explain to me what is happening (or what may I be doing
>>>> wrong)?
>>>> 
>>>> The same code does not work with R 2.4 but if I change hgu133plus2.db
>>>> by hgu133plus2 and getEG by getLL I obtain the original results:
>>>> 
>>>> ###############################################
>>>> ### Review of annotatons with R 2.4 and BioC 1.9
>>>> ###############################################
>>>> 
>>>> ### This code is executed on a clean new session with R 2. and BioC 1.9
>>>> 
>>>>> probes<-c("238900_at" , "232583_at", "236307_at" ,"223620_at" ,
>>>>> "219759_at" ,
>>>> +  "201702_s_at" , "232882_at"  ,  "213446_s_at",  "234033_at",
>>>> "243006_at" ,  +  "244648_at" ,   "243691_at" ,   "239264_at" ,
>>>> "243546_at" ,   "205239_at" ,
>>>> +  "1565703_at" ,  "244061_at"  ,  "230505_at" ,   "242688_at" ,
>>>> "1556474_a_at",
>>>> +  "232614_at"  ,  "1565689_at" ,  "236685_at"  ,  "225173_at" ,
>>>> "241893_at")
>>>>> LLs<- getLL(rownames(sel), "hgu133plus2")
>>>>> symbols<- getSYMBOL(rownames(sel), "hgu133plus2")
>>>>> sel1<- cbind(probes, LLs, symbols)
>>>>> sel1
>>>>              probes         LLs      symbols    238900_at
>>>> "238900_at" "3123"   "HLA-DRB1" 232583_at    "232583_at"    "8440"
>>>> "NCK2" 236307_at    "236307_at"    "60468"  "BACH2"    223620_at
>>>> "223620_at" "2857"   "GPR34"    219759_at    "219759_at"    "64167"
>>>> "ERAP2" 201702_s_at  "201702_s_at"  "5514"   "PPP1R10"  232882_at
>>>> "232882_at" "2308"   "FOXO1"    213446_s_at  "213446_s_at"  "8826"
>>>> "IQGAP1" 234033_at    "234033_at"    "9693"   "RAPGEF2"  243006_at
>>>> "243006_at" "2534"   "FYN"      244648_at    "244648_at"    "54520"
>>>> "CCDC93" 243691_at    "243691_at"    "23142"  "DCUN1D4"  239264_at
>>>> "239264_at" "60412"  "EXOC4"    243546_at    "243546_at"    "143686"
>>>> "SESN3" 205239_at    "205239_at"    "374"    "AREG"     1565703_at
>>>> "1565703_at" "4089"   "SMAD4"    244061_at    "244061_at"    "55843"
>>>> "ARHGAP15" 230505_at    "230505_at"    "145474" "LOC145474"
>>>> 242688_at    "242688_at"    "9320"   "TRIP12"   1556474_a_at
>>>> "1556474_a_at" "285097" "FLJ38379" 232614_at    "232614_at"    "596"
>>>> "BCL2"     1565689_at   "1565689_at"   "3839"   "KPNA3"    236685_at
>>>> "236685_at"    NA       NA         225173_at    "225173_at"
>>>> "93663" "ARHGAP18" 241893_at    "241893_at"    "4249"   "MGAT5"
>>>>> sessionInfo()
>>>> R version 2.4.1 (2006-12-18) i386-pc-mingw32 locale:
>>>> LC_COLLATE=Spanish_Spain.1252;LC_CTYPE=Spanish_Spain.1252;LC_MONETARY=Spani
>>>> sh_Spain.1252;LC_NUMERIC=C;LC_TIME=Spanish_Spain.1252
>>>> 
>>>> 
>>>> attached base packages:
>>>> [1] "tools"     "stats"     "graphics"  "grDevices"
>>>> [5] "utils"     "datasets"  "methods"   "base"     other attached
>>>> packages:
>>>>    annotate     Biobase hgu133plus2 "1.12.1"    "1.12.2"    "1.14.0"
>>>> ########################################################
>>>> 
>>>> In summary. If I use R 2.4/BioC 1.9 I obtain the same results I
>>>> ibtained 2 years ago, but If I do the same steps using R2.9/BioC2.4
>>>> the results change dramatically.
>>>> I have repeated the analyses using BioC 2.01 in R 2.7 and BioC 2.2 in
>>>> R 2.8 (results not shown here). BioC 2.0 yield the same as 1.9 and
>>>> BioC 2.2 the same as 2.4,
>>>> 
>>>> Any help to understand what's happening would be appreciated
>>>> 
>>>> Alex Sanchez
>>>> 
>>>> ---------------------------------------------------------------------------
>>>> --------------------------
>>>> 
>>>> Dr. Alex  Sánchez. Statistics Department. University of Barcelona.
>>>> Facultat de Biologia UB. Avda Diagonal 645. 08028 Barcelona. Spain
>>>> asanchez_at_ub.edu
>>>> Statistics and Bioinformatics Unit
>>>> Institut de Recerca. Hospital Universitari Vall 'Hebron
>>>> Passeig Vall d'Hebron 112-119. 08034 Barcelona
>>>> asanchez_at_ir.vhebron.net
>>>> ---------------------------------------------------------------------------
>>>> -------------------------
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> [[alternative HTML version deleted]]
>>>> 
>>>> 
>>>> 
>>>> ------------------------------------------------------------------------
>>>> 
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at stat.math.ethz.ch
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>> 
>>> -- 
>>> James W. MacDonald, M.S.
>>> Biostatistician
>>> Douglas Lab
>>> University of Michigan
>>> Department of Human Genetics
>>> 5912 Buhl
>>> 1241 E. Catherine St.
>>> Ann Arbor MI 48109-5618
>>> 734-615-7826
>>> 
>> 
>> 
> 
> -- 
> James W. MacDonald, M.S.
> Biostatistician
> Douglas Lab
> University of Michigan
> Department of Human Genetics
> 5912 Buhl
> 1241 E. Catherine St.
> Ann Arbor MI 48109-5618
> 734-615-7826
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list