[R] Trust p-values or not ?
p.dalgaard at biostat.ku.dk
Sun Oct 7 22:06:39 CEST 2007
rocker turtle wrote:
> First of all, kudos to the creators/contributors of R! This is a great
> package and I am finding it very useful in my research, will love to
> contribute any modules and dataset which I develop to the project.
> While doing multiple regression I arrived at the following peculiar
> result: out of 8 variables only 4 have p-values < 0.04 (for the
> t-statistic); the rest all have p-values between 0.1 and 1.0. The
> R-squared is coming out around 0.8 (adjusted ~0.78). The F-statistic is
> around 30 and its own p-value is ~0. Also, I am constrained to a dataset
> of 130 data points.
Nothing particularly peculiar about this...
> Being new to statistics I would really appreciate if someone can help me
> understand these values.
> 1) Do the above test values indicate a statistically sound and significant
> model?
Significant, yes, in a sense (see below). Soundness is something you
cannot really see from the output of a regression analysis, because it
contains results which are valid _provided_ the model assumptions hold.
To check the assumptions there is a battery of techniques, e.g. residual
plots and interaction tests -- there are books about this, which won't
really fit into a short email....
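As a rough illustration of the kind of checks meant here, a minimal sketch
on simulated data follows. The data frame, the variable names x1..x8 and the
choice of x1:x2 as the interaction to test are all made up for the example;
they are not taken from the original poster's data.

```r
# Simulated stand-in data: 130 rows, 8 predictors (names are hypothetical)
set.seed(1)
n <- 130
X <- as.data.frame(matrix(rnorm(n * 8), n, 8))
names(X) <- paste0("x", 1:8)
X$y <- 1 + 2 * X$x1 - 1.5 * X$x2 + X$x3 + 0.8 * X$x4 + rnorm(n)

fit <- lm(y ~ ., data = X)

# Standard diagnostic plots: residuals vs fitted, normal Q-Q,
# scale-location, residuals vs leverage
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)

# One possible interaction test: does adding x1:x2 improve the fit?
fit_int <- update(fit, . ~ . + x1:x2)
int_test <- anova(fit, fit_int)
print(int_test)
```

Curvature or a funnel shape in the residual plots, or a clearly significant
interaction, would be a hint that the additive linear model is too simple.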
Re. significance, it is important to realise that you generally need to
compare multiple model fits to assess which variables are important.
With one fit, you can say what happens if you drop single variables from
the model, so in your case, you have four seven-variable models that do
not fit significantly worse than the full model. You can't really say anything
about what happens if you remove two or more variables. You can also see
what happens if you drop all variables; this is the overall F test,
which in your case is highly significant, so at least one variable must
be required. You can be fairly confident that variables with very small
p-values cannot be removed, whereas borderline cases may become
insignificant once other variables are removed.
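The three kinds of comparison described above can be sketched in R roughly
as follows. The data are simulated and the names x1..x8 are invented for the
example, as is the particular choice of dropping x5..x8 together.

```r
# Simulated stand-in data: 130 rows, 8 predictors (names are hypothetical)
set.seed(1)
n <- 130
d <- as.data.frame(matrix(rnorm(n * 8), n, 8))
names(d) <- paste0("x", 1:8)
d$y <- 1 + 2 * d$x1 - 1.5 * d$x2 + d$x3 + 0.8 * d$x4 + rnorm(n)

full <- lm(y ~ ., data = d)

# Single-variable deletions: one F test per variable, each against the
# full model. For one-degree-of-freedom terms these p-values are the same
# as the t-test p-values printed by summary(full).
single <- drop1(full, test = "F")
print(single)

# Dropping several variables at once is a different question and needs
# its own fit and an explicit comparison.
reduced <- update(full, . ~ . - x5 - x6 - x7 - x8)
joint <- anova(reduced, full)
print(joint)

# Dropping all variables: comparison with the intercept-only model,
# i.e. the overall F test reported at the bottom of summary(full).
null <- lm(y ~ 1, data = d)
overall <- anova(null, full)
print(overall)
```

Note that the joint test for x5..x8 cannot be read off the individual
p-values; it has to be computed from the two fits.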
> 2) Is a dataset of 130 enough to run linear regression with ~7-10 variables?
> If not, what is approximately a good size?
Wrong question, I think. Some people suggest heuristics like 10-20
observations per variable, but this contains an implicit understanding
that you are dealing with "typical problems" in e.g. clinical
epidemiology. Designed experiments can contain many more parameters,
data with strong correlations require more observations to untangle
which variables are important, and even otherwise, you might be looking
for effects that are small compared to the residual variation and
consequently require more observations. When you do have the data, I
think it is more sound to look at the standard errors of the regression
coefficients and discuss whether they are sufficiently small for the
kinds of conclusions you want to make.
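For what it is worth, looking at the standard errors could go something
like the sketch below. Again the data are simulated and the variable names
are made up; the 95% level is just the conventional default, not something
implied by the question.

```r
# Simulated stand-in data: 130 rows, 8 predictors (names are hypothetical)
set.seed(1)
n <- 130
d <- as.data.frame(matrix(rnorm(n * 8), n, 8))
names(d) <- paste0("x", 1:8)
d$y <- 1 + 2 * d$x1 - 1.5 * d$x2 + d$x3 + 0.8 * d$x4 + rnorm(n)

fit <- lm(y ~ ., data = d)

# Coefficient estimates with their standard errors
est <- coef(summary(fit))[, c("Estimate", "Std. Error")]
print(round(est, 3))

# 95% confidence intervals for the coefficients
ci <- confint(fit, level = 0.95)
print(round(ci, 3))

# Interval widths, to be compared with the smallest effect that would
# matter for the conclusion you want to draw
width <- ci[, 2] - ci[, 1]
print(round(width, 3))
```

If the intervals are wide compared to the effects you care about, the
data set is too small for that purpose, whatever the observations-per-variable
ratio happens to be.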
> Thanks in advance.
O__ ---- Peter Dalgaard Øster Farimagsgade 5, Entr.B
c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
(*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45) 35327907