[BioC] Linear modeling for affy experiment

Fri Jun 11 01:22:49 CEST 2004

Hi Fai,

Below I comment on some of your questions. I'm courious my self about the answers other people will post ...

> -----Original Message-----
> From: bioconductor-bounces at stat.math.ethz.ch
> [mailto:bioconductor-bounces at stat.math.ethz.ch]On Behalf Of YUK FAI
> LEUNG
> Sent: 10 June 2004 23:06
> To: bioconductor at stat.math.ethz.ch
> Subject: [BioC] Linear modeling for affy experiment
> 
> 
> Hi there,
> 
> I am going to do an affy experiment for the first time. I have a few 
> questions about the linear model design for my experiment.
> 
> I have a 3x2 factorial experiment. Three biological samples 
> (wild type 
> whole animal (WA), wild type tissue (WT), mutant tissue (MT)) and two 
> time points (t1 and t2). The effects of interest are the 
> mutant (M) and 
> tissue (T) specific expression and their changes other time (Ti).
> 
> I suppose the model should have the following equations (error term 
> omitted) and I would have 3 affy biological replicates for 
> each condition.
> 
> WA.t1 = mu
> WT.t1 = mu + T
> MT.t1 = mu + T + M + T*M
> WA.t2 = mu + Ti
> WT.t2 = mu + T + Ti + T*Ti
> MT.t2 = mu + T + M + Ti + T*M + M*Ti + T*Ti + T*M*Ti

I guess the model in R would just look like this:

fit <- lm(Value ~ Sample*Time)

where Sample is a factor with the levels WA, WT, MT, and Time has two levels.

If you also want to include a 3rd factor which is either Genotype (wild type or mutant) or something like "origin" which is Animal or Tissue. I guess one will then have a problem with nesting. Any coments on this, did I miss the point ;-) ?

> My questions are:
> 
> 1. How many degree of freedom do I have in the model? How do 
> I calculate 
> degree of freedom in linear model in general? For my case, is it  3 
> arrays * 6 conditions - 8 coefficients to be estimated = 10 degree of 
> freedom?

Yes for the 2 factor model.

> 2. If I want to increase my degree of freedom, is it true 
> that I can do 
> it by increasing my replicate? If it is true, is there a difference 
> between repeating a sample with more coeffcients (e.g. MT.t2) and a 
> sample with less coefficients (e.g. WA.t1)? It seems to me having a 
> repeat with more coefficients is better off, but I don't know have to 
> stay it out statistically.

hm, I'm not sure what you mean. Off course it's good to run more replicates ;-) . You can boost your DFs by running your model on the probe level instead on the summarized probe sets. The full model would the look like this:

fit <- lm(Value ~ Sample*Time*Probe)

This adds a lot of DFs, since you've about 16 probes per probe set.

> 3. What is the formal way to determine whether an interaction term is 
> meaningful/significant in the model or not? Is it by the p-value? And 
> should I remove the term and fit the model (& again) if it is not 
> significant and deemed not important by biological knowledge? 
> Or should 
> I just fit the full model once and go ahead to interpret the 
> contrasts 
> of interest? Is there a formal way (e.g. the diagnostics 
> people use to 
> assess ANOVA models) for evaluating the quality of the whole fitted 
> model? Or I need not worry about this at all?

look at interaction.plot, this is helpful. Usually if lines in an interaction plot are sort of parallel there is no interaction, oterwise it means that the one factor is dependent on the level of the other. 

You can off course not look at all these plot if you run one model per gene,s o you may just judge by the p-value if there's a significant interaction.

If there are not many significant interactions you can well omit the interacion term.

The overall quality of the model can be tested by looking at the distribution of the R-squared values and the distribution of the residuals, these should be normal distributed. You can test this with a ks.test or shapiro.test. Often you'll see that the residuals are not normal (extreme tails) - this has an impact on. Kerr and Churchill have done work on this.

You can look at the distribution of residuals of all models (from all genes) by calculating normalized residuals. 

    (obs.values - fitted.values) / var(obs.values)

where obs.values are the observed values

You may also need to log transform your initial data.

	regards,

	Arne

> 5. I have some confusion about the multiple hypothesis testing 
> adjustment for many contrasts. (I know I should better only use the 
> p-values/B/moderated t for ranking genes, but I am just curious to 
> know). For example in limma one would extract the contrast of 
> interest 
> and list the candidate genes out on Toptable with the option 
> = FDR etc., 
> but isn't it true that this is just the adjustment for that estimate? 
> When I evaluate all possible contrasts, how can I adjust the multiple 
> hypothesis testing for the genes in all the contrasts that I 
> have made?
> 
> 6. A minor question. What does M & A in the Toptable of a 
> coefficient/contrast mean for affy data? If A stands the log2 
> intensity 
> estimate for that coefficient/contrast, is M the log2 ratio of (mu + 
> (coefficient or contrast estimate))/mu?
> 
> Thanks a lot for answering my questions. Any other advice for 
> my design 
> is also welcome.
> 
> Best regards,
> Fai
> -- 
> Yuk Fai Leung
> Department of Molecular and Cellular Biology
> Harvard University
> BL 2079, 16 Divinity Avenue
> Cambridge, MA 02138
> Tel: 617-495-2599
> Fax: 617-496-3321
> email: yfleung at mcb.harvard.edu; yfleung at genomicshome.com
> URL: http://genomicshome.com
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://www.stat.math.ethz.ch/mailman/listinfo/bioconductor
>