[BioC] when do linear models work?

Thu Mar 4 21:37:05 MET 2004

Dear Arne,

If you declare your factors to be factors, e.g. with as.factor(x), then lm creates
indicator variables, which allow a different mean for each
treatment.  These means need not lie on a straight line.  So we are not
assuming linearity in the sense you discuss below.
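
As a rough illustration of the point above, here is a minimal sketch in R. The data are simulated, and the saturating pattern, the noise level, and all variable names are assumptions made up for this example, not values from any real experiment. It fits the same response once with dose as a number and once with dose as a factor.

```r
# Hypothetical saturating dose-response data (values invented for illustration)
set.seed(1)
dose <- rep(c(0, 0.1, 0.25, 0.5, 1.0), each = 4)   # the five doses, 4 replicates each
true_mean <- c(0, 0.1, 2.0, 2.2, 2.3)              # assumed saturating pattern
y <- rep(true_mean, each = 4) + rnorm(length(dose), sd = 0.2)

fit_line   <- lm(y ~ dose)             # dose as a number: intercept + one slope
fit_factor <- lm(y ~ as.factor(dose))  # dose as a factor: one mean per dose

# The fitted values of the factor model are the per-dose sample means,
# so they follow the data and are not forced onto a straight line.
group_means    <- tapply(y, dose, mean)
fitted_by_dose <- tapply(fitted(fit_factor), dose, mean)
print(cbind(group_means, fitted_by_dose))

# Two coefficients for the straight line, five for the factor model
print(length(coef(fit_line)))
print(length(coef(fit_factor)))
```

With dose declared as a factor, the model spends one parameter per dose level, so it can follow a saturating or logistic-looking pattern at the observed doses. What it cannot do is predict the response at doses that were not in the experiment.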

The "linear" in linear models does not indicate that the data vary around a
line.  It indicates that the estimated effects are linear functions of the
dependent variable.  For example, if you multiply all of your response values by
the same constant, the estimated effects are multiplied by the same
constant.  The t and F-tests are therefore independent of the measurement
units.  If you are using the log of the data, it means that your tests of
statistical significance will not depend on whether you use log2, log10 or
natural log.
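
The invariance to the log base can be checked with a small sketch in R. The intensities below are simulated, and the group sizes, means, and names (intensity, treatment, t_stat, effect) are assumptions chosen only for this example. Changing the base of the logarithm multiplies the response by a constant, so the estimated effect is rescaled by that constant while the t statistic stays the same.

```r
# Hypothetical two-group data: untreated (ut) vs treated (t), 6 samples each
set.seed(2)
treatment <- factor(rep(c("ut", "t"), each = 6), levels = c("ut", "t"))
intensity <- exp(rnorm(12, mean = ifelse(treatment == "t", 7.5, 7), sd = 0.3))

# Same model, three different log bases for the response
fit_ln    <- lm(log(intensity)   ~ treatment)
fit_log2  <- lm(log2(intensity)  ~ treatment)
fit_log10 <- lm(log10(intensity) ~ treatment)

# Helpers to pull out the treatment effect and its t statistic
effect <- function(fit) coef(fit)[["treatmentt"]]
t_stat <- function(fit) summary(fit)$coefficients["treatmentt", "t value"]

# The effects differ by the constant factor relating the log bases ...
print(c(ln = effect(fit_ln), log2 = effect(fit_log2), log10 = effect(fit_log10)))
print(effect(fit_log2) * log(2))   # equals the natural-log effect

# ... but the t statistics are identical
print(c(ln = t_stat(fit_ln), log2 = t_stat(fit_log2), log10 = t_stat(fit_log10)))
```

The same sketch also covers a factor with only two levels: the single coefficient is the difference between the two group means, and its t-test is the usual equal-variance two-sample t-test.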

--Naomi Altman

At 01:48 PM 3/4/2004, Arne.Muller at aventis.com wrote:
>Hello All,
>
>I've two fundamental problems with linear models (lm), maybe you can help me
>to clarify these issues:
>
>1. Irrespective of how many factors you use in your experiment, the
>relationship is always assumed to be linear. If you've a response vector Y
>and a vector X of independent variables, then Y ~ X basically assumes a
>straight line (with some kind of slope). If you do, say, Y ~ X + Z then one can
>think of the lm as a *flat* surface. The same is true for higher dimensions
>(Y ~ dose + time + batch + gender + ... )
>
>This assumption is really dangerous I think, since many treatment/response
>relationships are not linear. For example think about an experiment: I've 5
>doses 0.0mM, 0.10mM, 0.25mM, 0.5mM and 1.0mM of a drug with which cell
>cultures get treated. The 0.1mM dose causes hardly any change in gene
>expression, whereas there's a big difference in gene expression at 0.25mM.
>Then at 0.5mM and 1.0mM the response is not much stronger than at 0.25mM.
>
>If one just looks at a single gene, then expression of this gene goes up
>quite strongly from 0.1mM to 0.25mM, and then expression flattens out for the
>higher doses. The response reaches saturation. Other responses are more like
>a logistic curve. This is a typical scenario.
>
>The problem is that many genes within one experiment behave as described
>above, others change linearly, others exponentially ...
>
>Could I still use lm for this kind of experiment? Would I have to decide on a
>gene by gene basis?
>
>2. Some of the factors, such as treatment (T) for an experiment, can only take
>say 2 distinct values: treated (t) and untreated (ut). Does a model such as Y
>~ T make any sense in this case?
>
>Doesn't this assume a linear relationship between just 2 "clouds" of data
>(assume there are many samples for each factor level)? Even if one can
>clearly distinguish between t and ut - assuming a straight line may be wrong.
>This is like drawing a straight line between two points. Just like in my
>example above with the different doses, you may have already reached some
>kind of saturation. Using such a model for prediction would then give wrong
>results.
>
>However, if one just wants to distinguish between t and ut, would the lm be a
>valid method?
>
>I'm reading some "beginners" literature about lm's, and I'm just trying to
>understand what's going on ... .
>
>Maybe you could comment on this. I'd be very interested in any explanation or
>clarification.
>
>         kind regards,
>
>         Arne
>
>--
>Arne Muller, Ph.D.
>Toxicogenomics, Aventis Pharma
>arne dot muller domain=aventis com

Naomi S. Altman                                814-865-3791 (voice)
Associate Professor
Bioinformatics Consulting Center
Dept. of Statistics                              814-863-7114 (fax)
Penn State University                         814-865-1348 (Statistics)
University Park, PA 16802-2111