[R] Trust p-values or not ?

Sun Oct 7 22:06:39 CEST 2007

rocker turtle wrote:
> Hi,
>
> First of all kudos to the creators/contributors of R! This is a great
> package and I am finding it very useful in my research; I would love to
> contribute any modules and datasets which I develop to the project.
>
> While doing multiple regression I arrived at the following peculiar
> situation.
> Out of 8 variables only 4 have p-values < 0.04 (of the t-statistic); the rest
> all have p-values between 0.1 and 1.0, and the coeff of Regression is coming
> around ~0.8 (adjusted ~0.78). The F-statistic is
> around 30 and its own p-value is ~0. Also I am constrained with a dataset of
> 130 datapoints.
>
>   
Nothing particularly peculiar about this...

> Being new to statistics I would really appreciate if someone can help me
> understand these values.
> 1) Do the above test values indicate a statistically sound and significant
> model ?
>   
Significant, yes, in a sense (see below). Soundness is something you 
cannot really see from the output of a regression analysis, because it 
contains results which are valid _provided_ the model assumptions hold. 
To check the assumptions there is a battery of techniques, e.g. residual 
plots and interaction tests -- there are books about this, which won't 
really fit into a short email....

Re. significance, it is important to realise that you generally need to 
compare multiple model fits to assess which variables are important. 
With one fit, you can say what happens if you drop single variables from 
the model, so in your case, you have four seven-variable models that do 
not fit significantly worse than the full model. You can't really say anything 
about what happens if you remove two or more variables. You can also see 
what happens if you drop all variables; this is the overall F test, 
which in your case is highly significant, so at least one variable must 
be required. You can be fairly confident that variables with very small 
p-values cannot be removed, whereas borderline cases may end up with 
their p-values becoming insignificant when other variables are removed.
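
To make the "compare multiple model fits" point concrete, here is a minimal 
sketch of the F statistic for dropping several variables at once. 
Everything in it is an assumption for illustration: the data are simulated 
(130 observations, 8 predictors, of which the first 4 have real effects), 
and the helper names solve, rss and partial_f are made up. In R itself the 
same comparison is what anova(reduced_fit, full_fit) gives you for two 
nested lm() fits.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x

def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on the columns of X."""
    p = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    beta = solve(XtX, Xty)
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(X, y))

def partial_f(X, y, drop):
    """F statistic for removing the columns listed in `drop` from the full model."""
    n, p = len(X), len(X[0])
    keep = [j for j in range(p) if j not in drop]
    rss_full = rss(X, y)
    rss_reduced = rss([[r[j] for j in keep] for r in X], y)
    q = len(drop)
    return ((rss_reduced - rss_full) / q) / (rss_full / (n - p))

# Simulated data (an assumption, not the poster's dataset):
# column 0 is the intercept, columns 1-4 have true effects, columns 5-8 are noise.
rng = random.Random(2007)
X, y = [], []
for _ in range(130):
    xs = [rng.gauss(0, 1) for _ in range(8)]
    X.append([1.0] + xs)
    y.append(1.0 + sum(xs[:4]) + rng.gauss(0, 1))

f_noise = partial_f(X, y, drop=[5, 6, 7, 8])   # drop the four noise variables together
f_active = partial_f(X, y, drop=[1, 2, 3, 4])  # drop the four real effects together
print(f"F for dropping 4 noise variables:  {f_noise:.2f}")
print(f"F for dropping 4 active variables: {f_active:.2f}")
```

Each statistic would be referred to an F distribution with 4 and 121 degrees 
of freedom; the p-value is left out here only because the Python standard 
library has no F distribution.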

> 2) Is a dataset of 130 enough to run linear regression with ~7-10 variables
> ? If not what is approximately a good size.
>
>   
Wrong question, I think. Some people suggest heuristics like 10-20 
observations per variable, but this contains an implicit understanding 
that you are dealing with "typical problems" in e.g. clinical 
epidemiology. Designed experiments can contain many more parameters; 
data with strong correlations require more observations to untangle 
which variables are important; and even otherwise, you might be looking 
for effects that are small compared to the residual variation and 
consequently require more observations. When you do have the data, I 
think it is more sound to look at the standard errors of the regression 
coefficients and discuss whether they are sufficiently small for the 
kinds of conclusions you want to make.
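
As a rough illustration of judging sample size by the standard errors, the 
sketch below fits the same simulated model with 130 and with 520 
observations and prints the coefficient standard errors, computed as the 
square roots of the diagonal of sigma^2 (X'X)^-1. The data, the true 
coefficients (2.0, 0.5, 0.1) and the helper names are all assumptions made 
up for this example; in R the same numbers are in the "Std. Error" column 
of summary() on an lm() fit.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x

def ols_standard_errors(X, y):
    """Return least-squares estimates and their standard errors."""
    n, p = len(X), len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    beta = solve(XtX, Xty)
    resid_ss = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
                   for r, yi in zip(X, y))
    sigma2 = resid_ss / (n - p)  # residual variance estimate
    se = []
    for j in range(p):
        unit = [1.0 if i == j else 0.0 for i in range(p)]
        column = solve(XtX, unit)  # j-th column of the inverse of X'X
        se.append(math.sqrt(sigma2 * column[j]))
    return beta, se

def simulate(n, rng):
    """Simulated data: intercept, one moderate effect (x1), one small effect (x2)."""
    X, y = [], []
    for _ in range(n):
        x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
        X.append([1.0, x1, x2])
        y.append(2.0 + 0.5 * x1 + 0.1 * x2 + rng.gauss(0, 1))
    return X, y

rng = random.Random(1)
beta_130, se_130 = ols_standard_errors(*simulate(130, rng))
beta_520, se_520 = ols_standard_errors(*simulate(520, rng))

for name, b, s in zip(["intercept", "x1", "x2"], beta_130, se_130):
    print(f"n=130  {name}: estimate {b:.3f}, SE {s:.3f}, roughly +/- {2 * s:.3f}")
for name, b, s in zip(["intercept", "x1", "x2"], beta_520, se_520):
    print(f"n=520  {name}: estimate {b:.3f}, SE {s:.3f}, roughly +/- {2 * s:.3f}")
```

With unit residual variation and uncorrelated predictors the standard 
errors at n=130 come out a little under 0.1, so an effect of 0.5 is pinned 
down reasonably well while an effect of 0.1 is not; quadrupling the sample 
roughly halves them.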

> Thanks in advance.
> -Ankit

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907