[BioC] Using limma to identify differentially expressed genes

James W. MacDonald jmacdon at uw.edu
Tue Apr 10 21:58:55 CEST 2012


Hi David,

On 4/10/2012 11:14 AM, David Westergaard wrote:
> Hello,
>
> I've been trying to use limma to identify genes from the following
> data: http://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-21340 -
> It's a simple control vs. disease experiment
>
>
> # SDRF downloaded from above page
> SDRF<- read.table(file="E-GEOD-21340.sdrf.txt",header=TRUE,stringsAsFactors=FALSE,sep="\t")
>
> # Looking to compare Family-history negativer versus Diabetis
> controls<- SDRF[grep("Control, Family History
> Neg",SDRF$Comment..Sample_source_name.),]
> disease<- SDRF[grep("^DM",SDRF$Characteristics.DiseaseState.),]
> Batch<- rbind(controls,disease)
>
> # Read in CEL files
> mixture.batch<- ReadAffy(filenames=Batch$Array.Data.File)
>
> # Preprocess data
> mixture.processed<- expresso(mixture.batch, bgcorrect.method = "rma",
> normalize.method = "quantiles", pmcorrect.method = "pmonly",
> summary.method = "medianpolish")
>
> # Get data in matrix
> signals<- exprs(mixture.prepared)
> cl<- ifelse(colnames(signals) %in% disease$Array.Data.File,1,0)
>
> # Do design matrix and fit
> design<- model.matrix(~factor(cl))
> fit<- lmFit(signals,design)
> fit<- eBayes(fit)
> topTable(fit2,coef=2)
>
> Which yields the following:
>                 ID  logFC AveExpr     t P.Value adj.P.Val     B
> 7513    208004_at -0.323    5.43 -4.65 0.00191     0.999 -3.10
> 11225 211829_s_at  0.340    5.07  4.36 0.00278     0.999 -3.17
> 5950    206424_at -0.907    6.65 -4.15 0.00363     0.999 -3.23
> 1354  201826_s_at -0.447    8.37 -4.13 0.00374     0.999 -3.24
> 19782   220418_at  0.392    5.43  4.02 0.00431     0.999 -3.27
> 8889  209396_s_at  1.899    7.47  4.01 0.00437     0.999 -3.28
> 5005    205478_at -0.931    9.22 -3.94 0.00481     0.999 -3.30
> 9469  209983_s_at  0.412    5.72  3.92 0.00492     0.999 -3.31
> 2936    203409_at  0.506    6.93  3.87 0.00531     0.999 -3.32
> 5054  205527_s_at  0.331    6.80  3.84 0.00549     0.999 -3.33
>
> I'm abit puzzled over the adjusted P-values. Can it really be true
> that ALL of the adjusted P-values are 0.999, or did I make a rookie
> mistake somewhere?

Well, there definitely is some weirdness going on. To answer your 
question, yes it is possible for all the adjusted p-values to be right 
close to 1, if there isn't any evidence for differential expression. 
There are any number of reasons this can arise, but the main culprits in 
my experience are usually very noisy intra-group data with hardly any 
inter-group differences. A PCA plot can often shed light on this sort of 
problem.

Now to get to the weirdness. You are getting these data by hand, when it 
might be easier to get from within R, using GEOquery. Note here that I 
have snipped extraneous output from my R session.

 > library(GEOquery)
 > geo <- getGEO("GSE21340")
 > dat <- getGEOSuppFiles("GSE21340")
 > setwd("GSE21340")
 > system("tar xvf GSE21340_RAW.tar")
 > library(affy)
 > eset <- rma(ReadAffy())
 > design <- model.matrix(~pData(geo[[1]])$source_name_ch1)

And just to note that I am using the correct data for the design matrix:

 > levels(pData(geo[[1]])$source_name_ch1)
[1] "Control, Family History Neg" "Control, Family History Pos"
[3] "Replicate from 1 patient"    "Type 2 DM"


 > fit <- lmFit(eset, design)
 > fit <- eBayes(fit)
 > topTable(fit, 4)[,c(1,6)]
                  ID   adj.P.Val
2454      M19483_at 0.005571008
204       D10523_at 0.005571008
2126      L38941_at 0.009749608
2299 M11717_rna1_at 0.015951421
2574      M25077_at 0.015951421
193       D00760_at 0.022937490
824  D87002_cds2_at 0.022937490
3079    M71243_f_at 0.027533357
1467      J02902_at 0.027533357
4802      U59423_at 0.028686408


Sorry about the alignment, but you see here that I do get some 
differentially expressed genes, but not many. This is probably due in 
part to the fact that I am using all four sample types in this analysis, 
so the denominator in any contrast for me will be based on the sums of 
squares of error for all four samples, whereas you are using only two. 
This will increase the power to detect differentially expressed genes to 
some extent.

However, note the probeset IDs, as well as the annotation of my eset object:

 > annotation(eset)
[1] "hu6800"

If I search for the top probeset ID from your topTable, it appears to be 
from a hgu133a or hgfocus array, not a hu6800 array (which is also known 
as a HuGeneFl array). So it looks like your expression data might not be 
from this experiment.

Best,

Jim


>
> Best Regards,
> David Westergaard
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099



More information about the Bioconductor mailing list