[BioC] when do linear models work?

James MacDonald jmacdon at med.umich.edu
Thu Mar 4 21:28:12 MET 2004

```The linear model fit here is not what you think. Since we are using
factors, this is an analysis of variance model, so there is no
assumption of linearity per se. In other words, we are not testing to
see if there is a linear relationship between, say, treatment and no
treatment. Instead, what we are testing is to see if there is a
difference in the mean expression of each gene at the two (or more)
factor levels.

So if you are testing the five different treatment levels you mention,
you are really testing to see if the mean expression level for each gene
is the same at all levels or not. If it is not, you then have to fit
contrasts to see where they differ. You can also fit different contrasts
to see if, say, the mean expression is the same at 0 mM and 0.1 mM, but
then changes at 0.25 mM (here you would be comparing the mean expression
of the 0 mM and 0.1 mM samples to the 0.25 mM samples).

If the book(s) you are reading cover ANOVA, you should take a look at
those sections, especially the parts about design matrices and
contrasts.

HTH,

Jim

James W. MacDonald
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623

>>> <Arne.Muller at aventis.com> 03/04/04 01:48PM >>>
Hello All,

I have two fundamental problems with linear models (lm); maybe you can
help me to clarify these issues:

1. Irrespective of how many factors you use in your experiment, the
relationship is always assumed to be linear. If you have a response
vector Y and a vector X of independent variables, then Y ~ X basically
assumes a straight line (with some kind of slope). If you fit, say,
Y ~ X + Z, then one can think of the lm as a *flat* surface. The same
is true for higher dimensions (Y ~ dose + time + batch + gender + ...).

This assumption is really dangerous, I think, since many
treatment/response relationships are not linear. For example, think
about an experiment: I have 5 doses (0.0 mM, 0.10 mM, 0.25 mM, 0.5 mM
and 1.0 mM) of a drug with which cell cultures are treated. The 0.1 mM
dose causes hardly any change in gene expression, whereas there is a
big difference in gene expression at 0.25 mM. Then at 0.5 mM and
1.0 mM the response is not much stronger than at 0.25 mM.

If one just looks at a single gene, then expression of this gene goes
up quite strongly from 0.1 mM to 0.25 mM, and then expression flattens
out for the higher doses. The response reaches saturation. Other
responses are more like a logistic curve. This is a typical scenario.

The problem is that many genes within one experiment behave as
described above, while others change linearly, others exponentially ...

Could I still use lm for this kind of experiment? Would I have to
decide on a gene-by-gene basis?

2. Some of the factors in an experiment, such as treatment (T), can
only take, say, 2 distinct values: treated (t) and untreated (ut).
Does a model such as Y ~ T make any sense in this case?

Doesn't this assume a linear relationship between just 2 "clouds" of
data (assume there are many samples for each factor level)? Even if one
can clearly distinguish between t and ut, assuming a straight line may
be wrong. This is like drawing a straight line between two points. Just
as in my example above with the different doses, you may have already
reached some kind of saturation. Using such a model for prediction
would then give wrong results.

However, if one just wants to distinguish between t and ut, would the
lm be a valid method?

to
understand what's going on ... .

Maybe you could comment on this. I'd be very interested in any
explanation or clarification.

kind regards,

Arne

--
Arne Muller, Ph.D.
Toxicogenomics, Aventis Pharma
arne dot muller domain=aventis com

_______________________________________________
Bioconductor mailing list
Bioconductor at stat.math.ethz.ch
https://www.stat.math.ethz.ch/mailman/listinfo/bioconductor

```
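To make the point of the reply concrete, here is a minimal sketch in plain Python (standard library only, not R's lm or any Bioconductor package). It treats dose as a factor for a single gene, so the "fit" is just one mean per dose level, and then evaluates the contrast described in the reply: the 0.25 mM mean against the average of the 0 mM and 0.1 mM means. The expression values are made-up numbers chosen to mimic the saturating response in the question, and the helper names `group_means`, `one_way_anova_f` and `contrast_t` are hypothetical, introduced only for this illustration.

```python
from math import sqrt
from statistics import mean

# Hypothetical log-expression values for ONE gene (invented numbers),
# three replicates per dose, with a saturating dose response.
expr = {
    "0.0mM": [7.0, 7.2, 6.8],
    "0.1mM": [7.1, 6.9, 7.0],
    "0.25mM": [9.0, 9.2, 8.8],
    "0.5mM": [9.1, 9.3, 8.9],
    "1.0mM": [9.2, 9.0, 9.4],
}


def group_means(groups):
    """With dose as a factor, the fitted value at each level is its mean."""
    return {level: mean(values) for level, values in groups.items()}


def one_way_anova_f(groups):
    """One-way ANOVA: F statistic for 'all level means are equal'."""
    all_values = [x for values in groups.values() for x in values]
    grand = mean(all_values)
    means = group_means(groups)
    ss_between = sum(
        len(values) * (means[level] - grand) ** 2
        for level, values in groups.items()
    )
    ss_within = sum(
        (x - means[level]) ** 2
        for level, values in groups.items()
        for x in values
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    mse = ss_within / df_within  # pooled within-level variance
    f_stat = (ss_between / df_between) / mse
    return f_stat, mse, df_between, df_within


def contrast_t(groups, weights, mse):
    """Estimate and t statistic for a contrast (weights sum to zero)."""
    means = group_means(groups)
    estimate = sum(w * means[level] for level, w in weights.items())
    se = sqrt(mse * sum(w ** 2 / len(groups[level])
                        for level, w in weights.items()))
    return estimate, estimate / se


f_stat, mse, df1, df2 = one_way_anova_f(expr)

# Contrast from the reply: 0.25 mM versus the average of 0 mM and 0.1 mM.
low_vs_mid = {"0.0mM": -0.5, "0.1mM": -0.5, "0.25mM": 1.0}
estimate, t_stat = contrast_t(expr, low_vs_mid, mse)

print(f"F({df1}, {df2}) = {f_stat:.1f}")
print(f"contrast estimate = {estimate:.2f}, t = {t_stat:.1f}")
```

Nothing in this calculation assumes a straight line through the doses: the levels are only labels, so a saturating, logistic, or linear response is handled the same way, one mean per level. A straight-line assumption would only enter if dose were used as a numeric covariate instead of a factor.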