[BioC] Changes in annotations?
Loren Engrav
engrav at u.washington.edu
Sat Jul 11 05:54:10 CEST 2009
Thank you
> From: Marc Carlson <mcarlson at fhcrc.org>
> Date: Fri, 10 Jul 2009 08:46:36 -0700
> To: Loren Engrav <engrav at u.washington.edu>
> Cc: "bioconductor at stat.math.ethz.ch" <bioconductor at stat.math.ethz.ch>
> Subject: Re: [BioC] Changes in annotations?
>
> Hi Loren,
>
> I am working on this right now. But it is really not as simple as it
> sounds. I hope to have more to say about this very soon.
>
>
> Marc
>
>
>
> Loren Engrav wrote:
>> Thank you
>>
>> Why is it not "best" to include everything Affy includes? All the synonyms
>> and IDs
>>
>> Or is this not software possible?
>>
>>
>>
>>
>>> From: "James W. MacDonald" <jmacdon at med.umich.edu>
>>> Date: Thu, 09 Jul 2009 09:36:46 -0400
>>> To: Alex Sanchez <asanchez at ub.edu>
>>> Cc: "bioconductor at stat.math.ethz.ch" <bioconductor at stat.math.ethz.ch>
>>> Subject: Re: [BioC] Changes in annotations?
>>>
>>> Hi Alex,
>>>
>>> We have taken this up as an internal topic of discussion and hopefully
>>> will be able to come up with a solution that is a bit better than what
>>> we are doing right now.
>>>
>>> However, some of this is going to be unavoidable. The chips were based
>>> on UniGene build 133 (and we are now on build 219), so things are going
>>> to tend to move around. But I don't think this is the main reason for
>>> what you are seeing, as the latest Affy annotation file is still based
>>> on UCSC build hg18 (which is based on NCBI build 36.1), so these data
>>> are over three years old.
>>>
>>> Instead, it appears that Affy have changed something about how they are
>>> annotating things. Some of this is an improvement, and some not so much.
>>>
>>> For instance, 203315_at does interrogate NCK2:
>>>
>>> http://genome.ucsc.edu/cgi-bin/hgTracks?insideX=115&revCmplDisp=0&hgsid=1365
>>> 10
>>> 693&hgt.out3=10x&position=chr2%3A105876577-105876826&hgtgroup_map_close=0&hg
>>> tg
>>> roup_phenDis_close=1&hgtgroup_genes_close=0&hgtgroup_rna_close=0&hgtgroup_ex
>>> pr
>>> ession_close=1&hgtgroup_regulation_close=0&hgtgroup_compGeno_close=0&hgtgrou
>>> p_
>>> varRep_close=0&hgtgroup_encodeGenes_close=1&hgtgroup_encodeTxLevels_close=1&
>>> hg
>>> tgroup_encodeChip_close=1&hgtgroup_encodeChrom_close=1&hgtgroup_encodeCompAn
>>> dV
>>> ar_close=1
>>>
>>> whereas 232583_at does not:
>>>
>>> http://genome.ucsc.edu/cgi-bin/hgTracks?insideX=115&revCmplDisp=0&hgsid=1365
>>> 10
>>> 693&hgt.out3=10x&position=chr2%3A105752637-105752886&hgtgroup_map_close=0&hg
>>> tg
>>> roup_phenDis_close=1&hgtgroup_genes_close=0&hgtgroup_rna_close=0&hgtgroup_ex
>>> pr
>>> ession_close=1&hgtgroup_regulation_close=0&hgtgroup_compGeno_close=0&hgtgrou
>>> p_
>>> varRep_close=0&hgtgroup_encodeGenes_close=1&hgtgroup_encodeTxLevels_close=1&
>>> hg
>>> tgroup_encodeChip_close=1&hgtgroup_encodeChrom_close=1&hgtgroup_encodeCompAn
>>> dV
>>> ar_close=1
>>>
>>> so in this case they have improved their annotations.
>>>
>>> In the case of 238900_at, they seem to have added in a bunch of EG IDs
>>> that are based on a model rather than being actual reviewed genes. In
>>> addition, that probeset seems to hit an intron, so adding in all this
>>> other annotation appears pointless:
>>>
>>> http://genome.ucsc.edu/cgi-bin/hgTracks?insideX=115&revCmplDisp=0&hgsid=1365
>>> 11
>>> 081&hgt.out3=10x&position=chr6%3A32653723-32653972&hgtgroup_map_close=0&hgtg
>>> ro
>>> up_phenDis_close=1&hgtgroup_genes_close=0&hgtgroup_rna_close=0&hgtgroup_expr
>>> es
>>> sion_close=1&hgtgroup_regulation_close=0&hgtgroup_compGeno_close=0&hgtgroup_
>>> va
>>> rRep_close=0&hgtgroup_encodeGenes_close=1&hgtgroup_encodeTxLevels_close=1&hg
>>> tg
>>> roup_encodeChip_close=1&hgtgroup_encodeChrom_close=1&hgtgroup_encodeCompAndV
>>> ar
>>> _close=1
>>>
>>> What this really comes down to is the fact that you can't take anything
>>> for granted. Not only do you have to validate a set of significant
>>> results using some other technology, you also need to ensure that the
>>> chip you are using is actually measuring what you think prior to doing
>>> the validation.
>>>
>>> It should be relatively painless to set up a pipeline where you could
>>> use the BSgenome.Hsapiens.UCSC.hg18 package along with functions from
>>> BSgenome/Biostrings to map probesets to the genome and then use the
>>> org.Hs.eg.db package to see if a given probeset is even interrogating
>>> the gene it is supposed to. One could then use rtracklayer to visualize
>>> things in the UCSC genome browser to make sure you weren't interrogating
>>> an intron.
>>>
>>> Best,
>>>
>>> Jim
>>>
>>>
>>>
>>>
>>> Alex Sanchez wrote:
>>>
>>>> Hi James and thanks for the help.
>>>>
>>>> I know, of course, that what Bioconductor does is to take the
>>>> annotations from public data sources.
>>>> I will now turn to affymetrix to see if someone there, or in their
>>>> forums, can explain why so many annotations have been turned into "NA's"
>>>>
>>>> There seem to be two problems here
>>>> - The first one, pointed by Loren in his message today is synonims. For
>>>> instance the first probeset in my list was 238900_at,
>>>> In the first version I used (BioC 1.9) the Gene Symbol provide by
>>>> getSymbol was HLA-DRB1 whereas now (BioC 2.4) it is LOC100133484
>>>> However the original symbol is still in the Affymetrix annotation file
>>>> "HG-U133_Plus_2.na28.annot.csv". It seems as if the new symbol is intended
>>>> to show the relation with the new Entrez ID (100133484).
>>>> In any case it is ennoying but it can be assumed.
>>>> - What seems worse is what happens to some annotations: If, for
>>>> instance, I take the second probeset in my list, 232583_at, and I look
>>>> for it in the affy annotations file I find that apart some links to
>>>> databases as genebank it does not have a gene symbol or an entrez (or
>>>> most annotations anymore).
>>>> In the previous version of the annotations this probeset was one of two
>>>> associated with gene NCK2 (232583_at;203315_at) .
>>>> The problem is that the second probeset (203315_at) was not selected by
>>>> the analysis. That is the only evidence suggesting that gene NCK2 was
>>>> differentially expressed (232583_at) does not suggest it anymore.
>>>>
>>>> Last, but in any case least, this is not happening to a few probesets.
>>>> My original list had 417 probesets. 210 have changed/synonimized their
>>>> gene symbols, but from these 210, 160 have become NA's
>>>>
>>>> Seems too many to feel comfortable :-(
>>>>
>>>> Thanks for the help
>>>>
>>>> Alex
>>>>
>>>> ...........................................................................
>>>> ..
>>>> ..................................
>>>>
>>>> Dr. Alex Sánchez.
>>>> Associate Professor. Statistics Department. University of Barcelona.
>>>> Facultat de Biologia UB. Avda Diagonal 645. 08028 Barcelona. Spain
>>>> asanchez_at_ub.edu
>>>> Statistics and Bioinformatics Unit
>>>> Institut de Recerca. Hospital Universitari Vall 'Hebron
>>>> Passeig Vall d'Hebron 112-119. 08034 Barcelona
>>>> asanchez_at_ir.vhebron.net
>>>> ...........................................................................
>>>> ..
>>>> ..................................
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> ----- Original Message ----- From: "James W. MacDonald"
>>>> <jmacdon at med.umich.edu>
>>>> To: "Alex Sanchez" <asanchez at ub.edu>
>>>> Cc: <bioconductor at stat.math.ethz.ch>
>>>> Sent: Monday, July 06, 2009 3:58 PM
>>>> Subject: Re: [BioC] Changes in annotations?
>>>>
>>>>
>>>>
>>>>> Hi Alex,
>>>>>
>>>>> This is a question that comes up on the Bioc list fairly regularly,
>>>>> and the answer is in two parts:
>>>>>
>>>>> First, the annotations supplied in the various metadata packages
>>>>> supplied by BioC are *not* our annotations, but are simply a
>>>>> re-packaging of data we collect from various sources. As an example,
>>>>> we use the mappings of Affymetrix Probe ID to Entrez Gene ID from the
>>>>> annotation csv files you can download from the Affy website. We then
>>>>> map the Entrez Gene IDs to other annotation using primarily NCBI data.
>>>>> So if you go to Affy's netaffx site (free registration required) and
>>>>> query on say, 238900_at, you get this:
>>>>>
>>>>> https://www.affymetrix.com/analysis/netaffx/fullrecord.affx?pk=HG-U133_PLU
>>>>> S_
>>>>> 2%3A238900_AT
>>>>>
>>>>>
>>>>> And you will note that the first Entrez Gene ID listed there is
>>>>> 100133484, which happens to be a defunct ID. However, this is the
>>>>> first of many listed there (and we need a one-to-one mapping), so we
>>>>> chose that one. A more likely Entrez Gene ID can be found further down
>>>>> the list, but we simply don't have the resources to figure out if
>>>>> there is a better choice in that list (for every reporter on every
>>>>> Affy chip we annotate). Nor do we have the resources to ensure that
>>>>> any of the mappings that Affy make are reasonable to begin with. We
>>>>> have to trust that they (with *way* more resources that us) are doing
>>>>> a reasonable job.
>>>>>
>>>>> The second part of the answer has to do with the 'moving target'
>>>>> aspect of Biological annotations. These data change all the time, and
>>>>> there is the recurring question of whether one should do an analysis
>>>>> and 'freeze' it to that point in time, or should the annotations be
>>>>> updated on a regular basis, with the realization that things can and
>>>>> will change?
>>>>>
>>>>> Without looking at each reporter ID you list, I can't say if the
>>>>> changes are due to Affy changing their annotation csv files, or to
>>>>> changing knowledge of the genome, but I suspect it is a combination of
>>>>> the two.
>>>>>
>>>>> Best,
>>>>>
>>>>> Jim
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Alex Sanchez wrote:
>>>>>
>>>>>> Hello
>>>>>>
>>>>>> I have had to review recently an analysis I did some time ago. This
>>>>>> was done on affymetrix hgu133plus2 chips with R 2.4 and BioC 1.9 I
>>>>>> have re-run the analyses using R 2.9 and BioC 2.4 (sessionInfo below).
>>>>>> I have been surprised by the changes in the annotations: Many
>>>>>> probesets that had had an annotation have become NA's whereas some
>>>>>> have changed their symbol and their Entrez gene.
>>>>>>
>>>>>> To be specific I summarize my question with the top genes of my list
>>>>>>
>>>>>> The list I obtained 2 years ago is:
>>>>>>
>>>>>> probeset locuslink symbol
>>>>>> 238900_at 3123 HLA-DRB1
>>>>>> 232583_at 8440 NCK2
>>>>>> 236307_at 60468 BACH2
>>>>>> 223620_at 2857 GPR34
>>>>>> 219759_at 64167 LRAP
>>>>>> 201702_s_at 5514 PPP1R10
>>>>>> 232882_at 2308 FOXO1A
>>>>>> 213446_s_at 8826 IQGAP1
>>>>>> 234033_at 9693 RAPGEF2
>>>>>> 243006_at 2534 FYN
>>>>>> 244648_at 54520 CCDC93
>>>>>> 243691_at 23142 DCUN1D4
>>>>>> 239264_at 60412 EXOC4
>>>>>> 243546_at 143686 SESN3
>>>>>> 205239_at 374 AREG
>>>>>> 1565703_at 55520 ELAC1
>>>>>> 244061_at 55843 ARHGAP15
>>>>>> 230505_at 26037 SIPA1L1
>>>>>> 242688_at 9320 TRIP12
>>>>>> 1556474_a_at 285097 FLJ38379
>>>>>> 232614_at 596 BCL2
>>>>>> 1565689_at 3839 KPNA3
>>>>>> 236685_at NA NA
>>>>>> 225173_at 93663 ARHGAP18
>>>>>> 241893_at 4249 MGAT5
>>>>>>
>>>>>> I used the following code to reproduce the issue with the annotations:
>>>>>>
>>>>>>
>>>>>> #####################################################################
>>>>>> ## Verification using R 2.9 & BioC 2.4
>>>>>> #####################################################################
>>>>>>
>>>>>>
>>>>>>> probes<-c("238900_at" , "232583_at", "236307_at" ,"223620_at" ,
>>>>>>> "219759_at" ,
>>>>>>>
>>>>>> + "201702_s_at" , "232882_at" , "213446_s_at", "234033_at",
>>>>>> "243006_at" , + "244648_at" , "243691_at" , "239264_at" ,
>>>>>> "243546_at" , "205239_at" ,
>>>>>> + "1565703_at" , "244061_at" , "230505_at" , "242688_at" ,
>>>>>> "1556474_a_at",
>>>>>> + "232614_at" , "1565689_at" , "236685_at" , "225173_at" ,
>>>>>> "241893_at")
>>>>>>
>>>>>>> library(hgu133plus2.db)
>>>>>>> library(annotate)
>>>>>>>
>>>>>>> entrezs<- getEG(probes, "hgu133plus2")
>>>>>>> symbols<- getSYMBOL(probes, "hgu133plus2")
>>>>>>> sel2<- cbind(probes, entrezs, symbols)
>>>>>>> sel2
>>>>>>>
>>>>>> probes entrezs symbols 238900_at
>>>>>> "238900_at" "100133484" "LOC100133484"
>>>>>> 232583_at "232583_at" NA NA 236307_at
>>>>>> "236307_at" NA NA 223620_at "223620_at"
>>>>>> "2857" "GPR34" 219759_at "219759_at" "64167"
>>>>>> "ERAP2" 201702_s_at "201702_s_at" "5514" "PPP1R10"
>>>>>> 232882_at "232882_at" NA NA 213446_s_at
>>>>>> "213446_s_at" "8826" "IQGAP1" 234033_at "234033_at"
>>>>>> NA NA 243006_at "243006_at" NA NA
>>>>>> 244648_at "244648_at" NA NA 243691_at
>>>>>> "243691_at" NA NA 239264_at "239264_at"
>>>>>> NA NA 243546_at "243546_at" NA NA
>>>>>> 205239_at "205239_at" "374" "AREG" 1565703_at
>>>>>> "1565703_at" "4089" "SMAD4" 244061_at "244061_at"
>>>>>> NA NA 230505_at "230505_at" "145474" "LOC145474"
>>>>>> 242688_at "242688_at" NA NA 1556474_a_at
>>>>>> "1556474_a_at" "285097" "FLJ38379" 232614_at "232614_at"
>>>>>> NA NA 1565689_at "1565689_at" NA NA
>>>>>> 236685_at "236685_at" NA NA 225173_at
>>>>>> "225173_at" "93663" "ARHGAP18" 241893_at "241893_at"
>>>>>> NA NA
>>>>>>
>>>>>>> sessionInfo()
>>>>>>>
>>>>>> R version 2.9.0 (2009-04-17) i386-pc-mingw32 locale:
>>>>>> LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
>>>>>> States.1252;LC_MONETARY=English_United
>>>>>> States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
>>>>>>
>>>>>> attached base packages:
>>>>>> [1] stats graphics grDevices utils datasets methods base
>>>>>> other attached packages:
>>>>>> [1] annotate_1.22.0 hgu133plus2.db_2.2.11 RSQLite_0.7-1
>>>>>> DBI_0.2-4 AnnotationDbi_1.6.0 Biobase_2.4.1
>>>>>> loaded via a namespace (and not attached):
>>>>>> [1] xtable_1.5-5
>>>>>> #############################################
>>>>>>
>>>>>> Many probesets seem to have changed.
>>>>>> Can someone explain to me what is happening (or what may I be doing
>>>>>> wrong)?
>>>>>>
>>>>>> The same code does not work with R 2.4 but if I change hgu133plus2.db
>>>>>> by hgu133plus2 and getEG by getLL I obtain the original results:
>>>>>>
>>>>>> ###############################################
>>>>>> ### Review of annotatons with R 2.4 and BioC 1.9
>>>>>> ###############################################
>>>>>>
>>>>>> ### This code is executed on a clean new session with R 2. and BioC 1.9
>>>>>>
>>>>>>
>>>>>>> probes<-c("238900_at" , "232583_at", "236307_at" ,"223620_at" ,
>>>>>>> "219759_at" ,
>>>>>>>
>>>>>> + "201702_s_at" , "232882_at" , "213446_s_at", "234033_at",
>>>>>> "243006_at" , + "244648_at" , "243691_at" , "239264_at" ,
>>>>>> "243546_at" , "205239_at" ,
>>>>>> + "1565703_at" , "244061_at" , "230505_at" , "242688_at" ,
>>>>>> "1556474_a_at",
>>>>>> + "232614_at" , "1565689_at" , "236685_at" , "225173_at" ,
>>>>>> "241893_at")
>>>>>>
>>>>>>> LLs<- getLL(rownames(sel), "hgu133plus2")
>>>>>>> symbols<- getSYMBOL(rownames(sel), "hgu133plus2")
>>>>>>> sel1<- cbind(probes, LLs, symbols)
>>>>>>> sel1
>>>>>>>
>>>>>> probes LLs symbols 238900_at
>>>>>> "238900_at" "3123" "HLA-DRB1" 232583_at "232583_at" "8440"
>>>>>> "NCK2" 236307_at "236307_at" "60468" "BACH2" 223620_at
>>>>>> "223620_at" "2857" "GPR34" 219759_at "219759_at" "64167"
>>>>>> "ERAP2" 201702_s_at "201702_s_at" "5514" "PPP1R10" 232882_at
>>>>>> "232882_at" "2308" "FOXO1" 213446_s_at "213446_s_at" "8826"
>>>>>> "IQGAP1" 234033_at "234033_at" "9693" "RAPGEF2" 243006_at
>>>>>> "243006_at" "2534" "FYN" 244648_at "244648_at" "54520"
>>>>>> "CCDC93" 243691_at "243691_at" "23142" "DCUN1D4" 239264_at
>>>>>> "239264_at" "60412" "EXOC4" 243546_at "243546_at" "143686"
>>>>>> "SESN3" 205239_at "205239_at" "374" "AREG" 1565703_at
>>>>>> "1565703_at" "4089" "SMAD4" 244061_at "244061_at" "55843"
>>>>>> "ARHGAP15" 230505_at "230505_at" "145474" "LOC145474"
>>>>>> 242688_at "242688_at" "9320" "TRIP12" 1556474_a_at
>>>>>> "1556474_a_at" "285097" "FLJ38379" 232614_at "232614_at" "596"
>>>>>> "BCL2" 1565689_at "1565689_at" "3839" "KPNA3" 236685_at
>>>>>> "236685_at" NA NA 225173_at "225173_at"
>>>>>> "93663" "ARHGAP18" 241893_at "241893_at" "4249" "MGAT5"
>>>>>>
>>>>>>> sessionInfo()
>>>>>>>
>>>>>> R version 2.4.1 (2006-12-18) i386-pc-mingw32 locale:
>>>>>> LC_COLLATE=Spanish_Spain.1252;LC_CTYPE=Spanish_Spain.1252;LC_MONETARY=Spa
>>>>>> ni
>>>>>> sh_Spain.1252;LC_NUMERIC=C;LC_TIME=Spanish_Spain.1252
>>>>>>
>>>>>>
>>>>>> attached base packages:
>>>>>> [1] "tools" "stats" "graphics" "grDevices"
>>>>>> [5] "utils" "datasets" "methods" "base" other attached
>>>>>> packages:
>>>>>> annotate Biobase hgu133plus2 "1.12.1" "1.12.2" "1.14.0"
>>>>>> ########################################################
>>>>>>
>>>>>> In summary. If I use R 2.4/BioC 1.9 I obtain the same results I
>>>>>> ibtained 2 years ago, but If I do the same steps using R2.9/BioC2.4
>>>>>> the results change dramatically.
>>>>>> I have repeated the analyses using BioC 2.01 in R 2.7 and BioC 2.2 in
>>>>>> R 2.8 (results not shown here). BioC 2.0 yield the same as 1.9 and
>>>>>> BioC 2.2 the same as 2.4,
>>>>>>
>>>>>> Any help to understand what's happening would be appreciated
>>>>>>
>>>>>> Alex Sanchez
>>>>>>
>>>>>> -------------------------------------------------------------------------
>>>>>> --
>>>>>> --------------------------
>>>>>>
>>>>>> Dr. Alex Sánchez. Statistics Department. University of Barcelona.
>>>>>> Facultat de Biologia UB. Avda Diagonal 645. 08028 Barcelona. Spain
>>>>>> asanchez_at_ub.edu
>>>>>> Statistics and Bioinformatics Unit
>>>>>> Institut de Recerca. Hospital Universitari Vall 'Hebron
>>>>>> Passeig Vall d'Hebron 112-119. 08034 Barcelona
>>>>>> asanchez_at_ir.vhebron.net
>>>>>> -------------------------------------------------------------------------
>>>>>> --
>>>>>> -------------------------
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> [[alternative HTML version deleted]]
>>>>>>
>>>>>>
>>>>>>
>>>>>> ------------------------------------------------------------------------
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at stat.math.ethz.ch
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>
>>>>> --
>>>>> James W. MacDonald, M.S.
>>>>> Biostatistician
>>>>> Douglas Lab
>>>>> University of Michigan
>>>>> Department of Human Genetics
>>>>> 5912 Buhl
>>>>> 1241 E. Catherine St.
>>>>> Ann Arbor MI 48109-5618
>>>>> 734-615-7826
>>>>>
>>>>>
>>>>
>>> --
>>> James W. MacDonald, M.S.
>>> Biostatistician
>>> Douglas Lab
>>> University of Michigan
>>> Department of Human Genetics
>>> 5912 Buhl
>>> 1241 E. Catherine St.
>>> Ann Arbor MI 48109-5618
>>> 734-615-7826
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>
More information about the Bioconductor
mailing list