[BioC] annotating microarray data with mogene10stv1

Sat Jul 19 14:30:56 CEST 2014

Hello everyone,

This my first attempt so it may not be a perfect email. I am not very advanced in bioinformatics so I tried to be very detailed.
Basically I am trying to annotate microarray dataset from Affymetrix using bioconductor.

I used following steps:

1. loading libraries

> library(annotate)
> library(limma)
> library(mogene10sttranscriptcluster.db)
> library(affy)

2. I read my target files into annotated data frame using
>adf <- read.AnnotatedDataFrame("target.txt",header=TRUE,row.names=1 ,as.is=TRUE)

3. I read my expression .CEL files into target files  using
>mydata <- ReadAffy(filenames=pData(adf)$FileName,phenoData=adf)

> > mydata
> AffyBatch object
> size of arrays=1050x1050 features (19 kb)
> cdf=MoGene-1_0-st-v1 (34760 affyids)
> number of samples=6
> number of genes=34760
> annotation=mogene10stv1
> notes=

4. I normalised my data with ram
>eset <- rma(mydata)

when you check eset it looks ok
> > eset
> ExpressionSet (storageMode: lockedEnvironment)
> assayData: 34760 features, 6 samples 
>   element names: exprs 
> protocolData
>   sampleNames: GSM910962.CEL GSM910963.CEL ... GSM910967.CEL (6 total)
>   varLabels: ScanDate
>   varMetadata: labelDescription
> phenoData
>   sampleNames: GSM910962.CEL GSM910963.CEL ... GSM910967.CEL (6 total)
>   varLabels: FileName Description
>   varMetadata: labelDescription
> featureData
>   featureNames: 10338001 10338003 ... 10608724 (34760 total)
>   fvarLabels: ID Symbol Name
>   fvarMetadata: labelDescription
> experimentData: use 'experimentData(object)'
> Annotation: mogene10stv1 

5. I was able to generate fold change vector for interesting samples but I have problem annotating eset.

6. I tried annotate the eset file using annotation from above using mogene10sttranscriptcluster.db or mogene10stprobes.db.

#I build an annotation table

ID <- featureNames(eset)

Symbol <- getSYMBOL(ID, "mogene10sttranscriptcluster.db")

Name <- as.character(lookUp(ID, "mogene10sttranscriptcluster.db", "GENENAME"))

tmp <- data.frame(ID=ID, Symbol=Symbol, Name=Name,
stringsAsFactors=F)

tmp[tmp=="NA"] <- NA #fix padding with NA characters

The problem I have is that for large number of IDs - all initial 6500 - I am getting Symbol and Name annotated as NA
Here is the output for some of them

> > Name[6500:6550]
>  [1] "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA"
> [22] "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA"
> [43] "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" “NA"
> 
> > Symbol[6500:6550]
> 10344545 10344546 10344547 10344548 10344549 10344550 10344551 10344552 10344553 10344554 10344555 10344556 
>       NA       NA       NA       NA       NA       NA       NA       NA       NA       NA       NA       NA 
> 10344557 10344558 10344559 10344560 10344561 10344562 10344563 10344564 10344565 10344566 10344567 10344568 
>       NA       NA       NA       NA       NA       NA       NA       NA       NA       NA       NA       NA 
> 10344569 10344570 10344571 10344572 10344573 10344574 10344575 10344576 10344577 10344578 10344579 10344580 
>       NA       NA       NA       NA       NA       NA       NA       NA       NA       NA       NA       NA 
> 10344581 10344582 10344583 10344584 10344585 10344586 10344587 10344588 10344589 10344590 10344591 10344592 
>       NA       NA       NA       NA       NA       NA       NA       NA       NA       NA       NA       NA 
> 10344593 10344594 10344595 
>       NA       NA       NA 

#So if I assign it as feature data of the current Eset 

fData(eset) <- tmp

#and perform stat with limma

#Build the design matrix
design <- model.matrix(~-1+factor(c(1,1,2,2,3,3)))
colnames(design) <- c(“mock”,"siGFP”,"siLin28a",”")

> > design
>   mock siGFP siLin28a
> 1    1     0        0
> 2    1     0        0
> 3    0     1        0
> 4    0     1        0
> 5    0     0        1
> 6    0     0        1
> attr(,"assign")
> [1] 1 1 1
> attr(,"contrasts")
> attr(,"contrasts")$`factor(c(1, 1, 2, 2, 3, 3))`
> [1] "contr.treatment"

# instructs Limma which comparisons to make
contrastmatrix <- makeContrasts(mock-siGFP,mock-siLin28a,siGFP-siLin28a,levels=design)

> > contrastmatrix
>           Contrasts
> Levels     mock - siGFP mock - siLin28a siGFP - siLin28a
>   mock                1               1                0
>   siGFP              -1               0                1
>   siLin28a            0              -1               -1

 # make the contrasts
fit <- lmFit(eset, design)

fit2 <- contrasts.fit(fit, contrastmatrix)
fit2 <- eBayes(fit2)

#listed the top differentially expressed genes 

> > topTable(fit2,coef=1,adjust="fdr")
>                ID Symbol Name     logFC  AveExpr          t      P.Value adj.P.Val         B
> 10342604 10342604   <NA> <NA>  1.831809 2.845863  14.037480 1.248257e-05 0.4338942 -3.694032
> 10343224 10343224   <NA> <NA> -2.751868 2.658551 -10.992289 4.802754e-05 0.8347186 -3.715792
> 10339733 10339733   <NA> <NA>  1.703917 3.426402   9.860457 8.674159e-05 1.0000000 -3.728996
> 10341175 10341175   <NA> <NA> -2.861665 2.167471  -8.877976 1.526481e-04 1.0000000 -3.744350
> 10340405 10340405   <NA> <NA> -1.368074 2.944242  -8.245297 2.263752e-04 1.0000000 -3.756919
> 10343199 10343199   <NA> <NA> -1.238289 2.077839  -8.130892 2.437752e-04 1.0000000 -3.759472
> 10339048 10339048   <NA> <NA>  1.101959 2.129288   7.949683 2.746238e-04 1.0000000 -3.763713
> 10344413 10344413   <NA> <NA> -1.407850 2.259821  -7.810608 3.014011e-04 1.0000000 -3.767144
> 10343919 10343919   <NA> <NA>  1.134134 2.650166   7.644642 3.374231e-04 1.0000000 -3.771452
> 10338867 10338867   <NA> <NA>  1.162114 5.792621   7.454279 3.850651e-04 1.0000000 -3.776701

As you can see just this small portion is already missing information about Symbol and Name.

So my question is do I use a correct .db library for annotation? As it looks like I missing a lot of ID cannot be annotated 

How can I fix that problem?

Many Thanks,

Jakub

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.