[BioC] robust linear model in anovas
Arne.Muller at aventis.com
Arne.Muller at aventis.com
Tue Apr 6 17:52:03 CEST 2004
Hello,
I'm trying to analyze some toxicology data, and I got some problems with lm
and rlm for which I'm looking for help. The experiment is a 3 factor design:
- the value (the y vector) is a response for a treatment with a drug
- a dose factor with 5 levles (a drug in 5 concentrations)
- a time factor with 2 levels (time of treatment)
- a batch factor with 3 levels corresponding to the laboratories that
analyzed the data
Within each dose, time and batch I've 3 "technical" replicates. I'm looking
for genes for which the expression is altered mainly by the dose at any of
the time points.
Ideally I'd just like to merge the 3 replicates produced by each lab into one
group with then 9 replicates (i.e. omitting the batch factor). Unfortunately
the batch factor is very strong and introduces quite a large variance within
each time and dose level.
I know that sometimes one laboratory is quite a strong outlier, but the
"extreme" lab changes from gene to gene! When looking at a stripchart for the
dose (within one timepoint) and merged batches the datapoints are scattered,
and often 2 or 3 of the replicates from one lab are >2 standard deviations
away from the mean.
I'm doing a simple lm to quantify the main effects (creating an lm for each
gene).
fit <- lm(value ~ dose + time + batch, d)
> summary(fit)$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.772575997 0.04040316 167.6249085 2.174127e-99
dose010mM -0.038755967 0.04435994 -0.8736704 3.850501e-01
dose025mM -0.028481675 0.04598919 -0.6193123 5.375628e-01
dose050mM -0.069415506 0.05087185 -1.3645171 1.764316e-01
dose100mM 0.008306756 0.04435994 0.1872580 8.519573e-01
time24h 0.015798762 0.02976119 0.5308511 5.970699e-01
batchOLD 0.134205422 0.03534714 3.7967832 2.930010e-04
batchPRG 0.040506343 0.03837539 1.0555291 2.945274e-01
> anova(fit)
Analysis of Variance Table
Response: value
Df Sum Sq Mean Sq F value Pr(>F)
dose 4 0.06086 0.01521 0.8180 0.517660
time 1 0.00524 0.00524 0.2818 0.597070
batch 2 0.27889 0.13944 7.4968 0.001068 **
Residuals 76 1.41362 0.01860
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
> anova(rfit)
Analysis of Variance Table
To minumize the batch effect, e.g. to find true influences of the dose I
thought it may be a good idea to exclude obvious outliers. I've done this by
excludung data point >2 SD from the mean within a dose at a timepoint. I
guess that's not realy elegant ... .
Would it be apporpiate to use a robust lm (rlm or lqs) to "downgrade" the
outliers in importance? Could a robutsly fitted linear model be used for an
anova, e.g. do weight the data within the groups? E.g.
rfit <- rlm(value ~ dose + time + batch, d)
Why are there no p-values given for the t-statistics?
> summary(rfit)$coefficients
Value Std. Error t value
(Intercept) 6.781613686 0.04215968 160.8554538
dose010mM -0.038060320 0.04628847 -0.8222418
dose025mM -0.042400425 0.04798856 -0.8835528
dose050mM -0.078787526 0.05308349 -1.4842192
dose100mM 0.008569384 0.04628847 0.1851300
time24h 0.005742874 0.03105505 0.1849256
batchOLD 0.140608264 0.03688384 3.8121911
batchPRG 0.031820089 0.04004375 0.7946331
... and no p-values for the F-stats/F-values either ...
> anova(rfit)
Analysis of Variance Table
Response: value
Df Sum Sq Mean Sq F value Pr(>F)
dose 4 0.07794 0.01949
time 1 0.00165 0.00165
batch 2 0.29484 0.14742
Residuals 1.42161
lqs doesn't have any implementation for anova, is it theoretically not
possible to use the model for an anova?
As an alternative I'm thinking about using the "weights" option for lm. Maybe
the distance from the mean in units of standard deviations
w <- abs(mean(i) - i) / sd(i)
w <- 1/w # to give small weight to outliers
I'd be happy to discuss these options with you.
kind regards + thanks a lot for your comments,
Arne
--
Arne Muller, Ph.D.
Toxicogenomics, Aventis Pharma
arne dot muller domain=aventis com
More information about the Bioconductor
mailing list