[BioC] Limma issues.

Marcus marcusb at biotech.kth.se
Tue Nov 11 21:51:51 MET 2003


Hi limma users.

I have some general questions regarding the use of dupcor.series and 
gls.series functions.

I use an experimental design that compares three different conditions to a 
fourth (which I will call a reference).
So the design of the experiment looks something like:


Cond 1             Cond2              Cond2


                 Ref

(There are supposed to be double arrows between the different conditions 
and the reference.)

The experiment is done in four (and between some conditions, five) 
biological replicates, so in total I have made 4 times 6 (dye-swap) = 24 
hybridizations which compares all the conditions with the reference, plus 
four additional hybridizations which is a dye-swap between cond2 and 
reference plus cond3 and the reference. This gives in total 27 slides since 
one slide was excluded due to a terrible hybridization. Each slide has a 
duplicate spot.

This data has been filtered using different filter cut-offs and the 
remaining data has been put into a matrix of M-values. The data has been 
normalized and dye-swapped.

 From this point I have made two different analysis paths.

One analysis path was to divide the M-matrix into three different matrices 
which contains data from the different conditions and I continued using the 
matrices with the following 
command:     dupcor.series(M.matrix.cond1,design,ndups=2,spacing=13824)->corr

Where the design was something like: design <- c(1,1,1,1,1,1,1,1)

The correlation is later used in the function:  gls.series(M.matrix.cond1, 
design, ndups=2,spacing=13824,corr$cor)->fit

and continuing with:    eb<-ebayes(fit)

the data from ebayes is later used in the toptable 
function:    toptable(number=100,genelist=list[,],fit=fit,eb=eb,adjust="fdr")

When I make this analysis I get a list of 82 genes for cond1 that has a 
B-value larger than zero.

But if I use another strategy where I use all the slides in the same 
analysis and use a design matrix which looks something like:

  Con1 Con2 Con3
1     1    0     0
2     1    0     0
3     1    0     0
4     1    0     0
5     1    0     0
6     1    0     0
7     1    0     0
8     1    0     0
9     0    1     0
10    0    1     0
11    0    1     0
12    0    1     0
13    0    1     0
14    0    1     0
15    0    1     0
16    0    1     0
17    0    1     0
18    0    1     0
19    0    0     1
20    0    0     1
21    0    0     1
22    0    0     1
23    0    0     1
24    0    0     1
25    0    0     1
26    0    0     1
27    0    0     1

And make the function calls like the procedures above except the toptable 
where I specify the coef number to be the condition which I want to 
investigate. The thing is, when I use this approach I get more genes that 
have a b-value greater than zero, namely 110.

My guess is that when I run the dupcor.series the correlations is based on 
all the slides in some way, since corr$cor is a lonely number and not a 
correlation for every comparison in the experimental design. The cor 
numbers from the different analysis pathways are not equal either.

When I run the gls.series function I get a fit$sigma that is based on all 
the slides or?

What is actually happening when the sigma is calculated. The sigma is also 
generally smaller when all the slides are included compared to the analysis 
path where I used one condition at a time, which I guess should be the case 
if all slides are included.

The lods which is calculated in the ebayes function is larger then the lods 
which originates from the analysis path where I examined one condition at a 
time, so this result in more genes that is larger than zero.

So to my questions (sorry for the long introduction):

Is the design matrix correct when the experiment is performed the way it 
is? According to the user guide, this is an optional way when you have a 
reference design but since the result is deviating from the one comparison 
at a time I am not sure.

If the design is correct, which of the two analysis pathways is the 
correct/better one?

I dont consider myself as a statistician, more like a biologist, but I 
reckon the sigma factor as a crucial variable in these analyses. I changed 
the fit$sigma for the "one condition" analysis path to the sigma that 
originates from the analysis that contains all the slides and the number of 
genes that reach above B-value zero increased to 111 which is quite like 
the 110 that I got from the other analysis pathway. So I guess that the 
sigma factor is quite responsible for the B-value spectra.

If I would like to compare, lets say the cond2 with cond3 can I then add a 
another column to the design matrix, and if so, how would it look like.

During these analysis I have used R-version 1.8 and limma version 1.3.1

I would be very grateful for some response.

Regards

Marcus


*******************************************************************************************
Marcus Gry Björklund

Royal Institute of Technology
AlbaNova University Center
Stockholm Center for Physics, Astronomy and Biotechnology
Department of Molecular Biotechnology
106 91 Stockholm, Sweden

Phone (office): +46 8 553 783 45
Fax: + 46 8 553 784 81
Visiting adress: Roslagstullsbacken 21, Floor 3
Delivery adress: Roslagsvägen 30B



More information about the Bioconductor mailing list