[BioC] Affymetrix annotation using R or NetAffx

Thu Aug 16 17:38:26 CEST 2012

Hi Natasha,

On 8/15/2012 12:05 PM, Natasha Sahgal wrote:
> Dear List,
>
>
>
> Normally I use the batch query mode of the NetAffx (affymetrix website) to annotate my gene lists (downside: It can do only 10K probes at a time).
>
> I decided to try the R packages to do the same.
>
>
>
> However, I decided to check a couple of probes to see if the output in either case is the same but it is not! So I do not know where the error lies: in my code, in the meta data packages or the website or if I have the incorrect array package?

Well, it appears this is some weird disagreement between Affy's netaffx 
database and the transcript files they release.

Note that the annotation files we supply are simply repackaged versions 
of data we get from the manufacturer. There is no implied warranty that 
the data are right. And in the case of these particular data packages, 
we don't even make them - they are supplied by Arthur Li.

So back to the task at hand. If you query netaffx for the first ID you 
supply (8180413), it will tell you that it is unmapped. However, if you 
download the transcript annotations for this chip and look for that ID, 
you get this:

NM_175039.3 // --- // Homo sapiens ST6 
(alpha-N-acetyl-neuraminyl-2,3-beta-galactosyl-1,3)-N-acetylgalactosaminide 
alpha-2,6-sialyltransferase 4 (ST6GALNAC4), transcript variant 1, mRNA 
// --- // --- // --- // --- // --- // ---

in the mrna_assignment column. In addition, in the 'category' column it says
flmrna->unmapped

So this probeset is either unmapped, or it queries a transcript variant 
of ST6GALNAC4. No telling which. But the annotation package claims the 
variant because that is what is in the annotation file.

I suppose you could take all the probe sequences for this probeset and 
Blat them against the genome to see where they bind. Or you could just 
remove all control probesets before annotating.

The gene and exon chips have way more control probes than the old school 
3' biased chips had, and they don't have the convenient AFFX handle that 
would let you know they were creeping into your set of significant 
genes. On one hand, these should never end up in your list of 
significant genes because they are controls. On the other hand, what 
exactly is a control, and who is to say it won't change expression 
(seriously - does this probeset measure something or not?).

Recently I have taken to summarily excluding all non-main probesets from 
my analyses, right after the eBayes() step in limma. You can do 
something super cool like

> library(pd.hugene.1.0.st.v1)
> con <- db(pd.hugene.1.0.st.v1)
> types <- dbGetQuery(con, "select meta_fsetid, type from featureSet
inner join core_mps on featureSet.fsetid=core_mps.fsetid;")
> head(types)
   meta_fsetid type
1     7892501    6
2     7892502    7
3     7892503    7
4     7892504    7
5     7892505    7
6     7892506    7

ind <- types[,2] == 1

and then you can subset using the ind variable.

Best,

Jim

>
>
>
>
>
>> eset
> ExpressionSet (storageMode: lockedEnvironment)
>
> assayData: 32321 features, 21 samples
>
>    element names: exprs, se.exprs
>
> protocolData
>
>    sampleNames: CD8 5_14_(HuGene-1_0-st-v1).CEL CD8
>
>      5_16_(HuGene-1_0-st-v1).CEL ... CD8 6_8_(HuGene-1_0-st-v1).CEL (21
>
>     total)
>
>    varLabels: ScanDate
>
>    varMetadata: labelDescription
>
> phenoData
>
>    sampleNames: CD8 5_14_(HuGene-1_0-st-v1).CEL CD8
>
>      5_16_(HuGene-1_0-st-v1).CEL ... CD8 6_8_(HuGene-1_0-st-v1).CEL (21
>
>      total)
>
>    varLabels: sample
>
>    varMetadata: labelDescription
>
> featureData: none
>
> experimentData: use 'experimentData(object)'
>
> Annotation: hugene10stv1
>
>
>
>> library(annotate)
>> library(hugene10stv1cdf)
>> library(hugene10sttranscriptcluster.db)
>> ls("package:hugene10sttranscriptcluster.db")
>> ID<- featureNames(eset)
>> Symbol<- getSYMBOL(ID, "hugene10sttranscriptcluster.db")
>> Name<- as.character(lookUp(ID, "hugene10sttranscriptcluster.db", "GENENAME"))
>> Entrez<- as.character(lookUp(ID, "hugene10sttranscriptcluster.db", "ENTREZID"))
>
>
>> tail(Symbol)
>       8180413      8180414      8180415      8180416      8180417      8180418
>
> "ST6GALNAC4" "ST6GALNAC1"     "MCART1"      "OR3A2"    "METTL2B"           NA
>
>
>
>> tail(ID)
> [1] "8180413" "8180414" "8180415" "8180416" "8180417" "8180418"
>
>
>
>> tail(Entrez)
> [1] "27090" "55808" "92014" "4995"  "55798" "NA"
>
>
>
>
>
> Which does not match the output from the website below for the same set of IDs:
>
> *         GeneChip(r) Human Gene 1.0 ST Array
>
> Displaying Results: 1-6 of 6.
>
>
>
> Transcript Cluster ID
>
> GO Description
>
> Chromosome
>
> Gene Title
>
> Pathway
>
> Gene Symbol
>
> Entrez Gene ID
>
> Cytoband
>
>
> 8180416
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> 8180417
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> 8180418
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> 8180413
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> 8180414
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> 8180415
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>> sessionInfo()
> R version 2.15.0 (2012-03-30)
>
> Platform: x86_64-pc-linux-gnu (64-bit)
>
>
>
> locale:
>
> [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
>
>   [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
>
>   [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
>
>   [7] LC_PAPER=C                 LC_NAME=C
>
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
>
> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>
>
>
> attached base packages:
>
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
>
>
> other attached packages:
>
> [1] hugene10sttranscriptcluster.db_8.0.1 hugene10stprobeset.db_8.0.1
>
>   [3] org.Hs.eg.db_2.7.1                   RSQLite_0.11.1
>
>   [5] DBI_0.2-5                            annotate_1.34.0
>
>   [7] rgl_0.92.892                         R.utils_1.12.1
>
>   [9] R.oo_1.9.3                           R.methodsS3_1.2.2
>
> [11] plotrix_3.4                          hugene10stv1cdf_2.10.0
>
> [13] AnnotationDbi_1.18.0                 scatterplot3d_0.3-33
>
> [15] WriteXLS_2.1.0                       affyQCReport_1.34.0
>
> [17] lattice_0.20-6                       affy_1.34.0
>
> [19] Biobase_2.16.0                       BiocGenerics_0.2.0
>
> [21] gdata_2.8.2                          limma_3.12.0
>
>
>
> loaded via a namespace (and not attached):
>
> [1] affyio_1.24.0         affyPLM_1.32.0        BiocInstaller_1.4.7
>
>   [4] Biostrings_2.24.1     gcrma_2.28.0          genefilter_1.38.0
>
>   [7] grid_2.15.0           gtools_2.6.2          IRanges_1.14.2
>
> [10] preprocessCore_1.18.0 RColorBrewer_1.0-5    simpleaffy_2.32.0
>
> [13] splines_2.15.0        stats4_2.15.0         survival_2.36-14
>
> [16] tcltk_2.15.0          tools_2.15.0          xtable_1.7-0
>
> [19] zlibbioc_1.2.0
>
>
>
>
>
> Any help/suggestion would be appreciated.
>
> Apologies if this is posted more than once!!
>
>
>
>
>
> Many Thanks,
>
> Natasha
>
>
>
>
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099