[BioC] annotating microarray data with mogene10stv1
Jakub Stanislaw Nowak
jakub.nowak at ed.ac.uk
Sat Jul 19 14:30:56 CEST 2014
Hello everyone,
This my first attempt so it may not be a perfect email. I am not very advanced in bioinformatics so I tried to be very detailed.
Basically I am trying to annotate microarray dataset from Affymetrix using bioconductor.
I used following steps:
1. loading libraries
> library(annotate)
> library(limma)
> library(mogene10sttranscriptcluster.db)
> library(affy)
2. I read my target files into annotated data frame using
>adf <- read.AnnotatedDataFrame("target.txt",header=TRUE,row.names=1 ,as.is=TRUE)
3. I read my expression .CEL files into target files using
>mydata <- ReadAffy(filenames=pData(adf)$FileName,phenoData=adf)
> > mydata
> AffyBatch object
> size of arrays=1050x1050 features (19 kb)
> cdf=MoGene-1_0-st-v1 (34760 affyids)
> number of samples=6
> number of genes=34760
> annotation=mogene10stv1
> notes=
4. I normalised my data with ram
>eset <- rma(mydata)
when you check eset it looks ok
> > eset
> ExpressionSet (storageMode: lockedEnvironment)
> assayData: 34760 features, 6 samples
> element names: exprs
> protocolData
> sampleNames: GSM910962.CEL GSM910963.CEL ... GSM910967.CEL (6 total)
> varLabels: ScanDate
> varMetadata: labelDescription
> phenoData
> sampleNames: GSM910962.CEL GSM910963.CEL ... GSM910967.CEL (6 total)
> varLabels: FileName Description
> varMetadata: labelDescription
> featureData
> featureNames: 10338001 10338003 ... 10608724 (34760 total)
> fvarLabels: ID Symbol Name
> fvarMetadata: labelDescription
> experimentData: use 'experimentData(object)'
> Annotation: mogene10stv1
5. I was able to generate fold change vector for interesting samples but I have problem annotating eset.
6. I tried annotate the eset file using annotation from above using mogene10sttranscriptcluster.db or mogene10stprobes.db.
#I build an annotation table
ID <- featureNames(eset)
Symbol <- getSYMBOL(ID, "mogene10sttranscriptcluster.db")
Name <- as.character(lookUp(ID, "mogene10sttranscriptcluster.db", "GENENAME"))
tmp <- data.frame(ID=ID, Symbol=Symbol, Name=Name,
stringsAsFactors=F)
tmp[tmp=="NA"] <- NA #fix padding with NA characters
The problem I have is that for large number of IDs - all initial 6500 - I am getting Symbol and Name annotated as NA
Here is the output for some of them
> > Name[6500:6550]
> [1] "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA"
> [22] "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA"
> [43] "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" “NA"
>
> > Symbol[6500:6550]
> 10344545 10344546 10344547 10344548 10344549 10344550 10344551 10344552 10344553 10344554 10344555 10344556
> NA NA NA NA NA NA NA NA NA NA NA NA
> 10344557 10344558 10344559 10344560 10344561 10344562 10344563 10344564 10344565 10344566 10344567 10344568
> NA NA NA NA NA NA NA NA NA NA NA NA
> 10344569 10344570 10344571 10344572 10344573 10344574 10344575 10344576 10344577 10344578 10344579 10344580
> NA NA NA NA NA NA NA NA NA NA NA NA
> 10344581 10344582 10344583 10344584 10344585 10344586 10344587 10344588 10344589 10344590 10344591 10344592
> NA NA NA NA NA NA NA NA NA NA NA NA
> 10344593 10344594 10344595
> NA NA NA
#So if I assign it as feature data of the current Eset
fData(eset) <- tmp
#and perform stat with limma
#Build the design matrix
design <- model.matrix(~-1+factor(c(1,1,2,2,3,3)))
colnames(design) <- c(“mock”,"siGFP”,"siLin28a",”")
> > design
> mock siGFP siLin28a
> 1 1 0 0
> 2 1 0 0
> 3 0 1 0
> 4 0 1 0
> 5 0 0 1
> 6 0 0 1
> attr(,"assign")
> [1] 1 1 1
> attr(,"contrasts")
> attr(,"contrasts")$`factor(c(1, 1, 2, 2, 3, 3))`
> [1] "contr.treatment"
# instructs Limma which comparisons to make
contrastmatrix <- makeContrasts(mock-siGFP,mock-siLin28a,siGFP-siLin28a,levels=design)
> > contrastmatrix
> Contrasts
> Levels mock - siGFP mock - siLin28a siGFP - siLin28a
> mock 1 1 0
> siGFP -1 0 1
> siLin28a 0 -1 -1
# make the contrasts
fit <- lmFit(eset, design)
fit2 <- contrasts.fit(fit, contrastmatrix)
fit2 <- eBayes(fit2)
#listed the top differentially expressed genes
> > topTable(fit2,coef=1,adjust="fdr")
> ID Symbol Name logFC AveExpr t P.Value adj.P.Val B
> 10342604 10342604 <NA> <NA> 1.831809 2.845863 14.037480 1.248257e-05 0.4338942 -3.694032
> 10343224 10343224 <NA> <NA> -2.751868 2.658551 -10.992289 4.802754e-05 0.8347186 -3.715792
> 10339733 10339733 <NA> <NA> 1.703917 3.426402 9.860457 8.674159e-05 1.0000000 -3.728996
> 10341175 10341175 <NA> <NA> -2.861665 2.167471 -8.877976 1.526481e-04 1.0000000 -3.744350
> 10340405 10340405 <NA> <NA> -1.368074 2.944242 -8.245297 2.263752e-04 1.0000000 -3.756919
> 10343199 10343199 <NA> <NA> -1.238289 2.077839 -8.130892 2.437752e-04 1.0000000 -3.759472
> 10339048 10339048 <NA> <NA> 1.101959 2.129288 7.949683 2.746238e-04 1.0000000 -3.763713
> 10344413 10344413 <NA> <NA> -1.407850 2.259821 -7.810608 3.014011e-04 1.0000000 -3.767144
> 10343919 10343919 <NA> <NA> 1.134134 2.650166 7.644642 3.374231e-04 1.0000000 -3.771452
> 10338867 10338867 <NA> <NA> 1.162114 5.792621 7.454279 3.850651e-04 1.0000000 -3.776701
As you can see just this small portion is already missing information about Symbol and Name.
So my question is do I use a correct .db library for annotation? As it looks like I missing a lot of ID cannot be annotated
How can I fix that problem?
Many Thanks,
Jakub
--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
More information about the Bioconductor
mailing list