[Rd] R and Gnumeric
Peter Dalgaard
P.Dalgaard at biostat.ku.dk
Mon Jun 9 16:34:53 CEST 2008
Jean Bréfort wrote:
> One other, totally unrelated thing: we recently got a bug report
> about an incorrect R squared in the Gnumeric regression code
> (http://bugzilla.gnome.org/show_bug.cgi?id=534659). R (version 2.7.0)
> gives the same result as Gnumeric, as can be seen below:
>
>
>> mydata <- read.csv(file="data.csv",sep=",")
>> mydata
>>
>   X  Y
> 1 1  2
> 2 2  4
> 3 3  5
> 4 4  8
> 5 5  0
> 6 6  7
> 7 7  8
> 8 8  9
> 9 9 10
>
>> summary(lm(mydata$Y~mydata$X))
>>
>
> Call:
> lm(formula = mydata$Y ~ mydata$X)
>
> Residuals:
>     Min      1Q  Median      3Q     Max
> -5.8889  0.2444  0.5111  0.7111  2.9778
>
> Coefficients:
>             Estimate Std. Error t value Pr(>|t|)
> (Intercept)   1.5556     1.8587   0.837   0.4303
> mydata$X      0.8667     0.3303   2.624   0.0342 *
> ---
> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> Residual standard error: 2.559 on 7 degrees of freedom
> Multiple R-squared: 0.4958, Adjusted R-squared: 0.4238
> F-statistic: 6.885 on 1 and 7 DF, p-value: 0.03422
>
>
>> summary(lm(mydata$Y~mydata$X-1))
>>
>
> Call:
> lm(formula = mydata$Y ~ mydata$X - 1)
>
> Residuals:
>     Min      1Q  Median      3Q     Max
> -5.5614  0.1018  0.3263  1.6632  3.5509
>
> Coefficients:
>          Estimate Std. Error t value Pr(>|t|)
> mydata$X   1.1123     0.1487   7.481 7.06e-05 ***
> ---
> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> Residual standard error: 2.51 on 8 degrees of freedom
> Multiple R-squared: 0.8749, Adjusted R-squared: 0.8593
> F-statistic: 55.96 on 1 and 8 DF, p-value: 7.056e-05
>
> I am unable to figure out what this 0.8749 value might represent. If
> it is intended to be the Pearson moment, it should be 0.4958, and if
> it is the coefficient of determination, I think the correct value
> would be 0.4454, as given by Excel. It's of course nice to have the
> same result in R and Gnumeric, but it would be better if this result
> were accurate (if it is, we need some documentation fix). BTW, I am
> not a statistics expert at all.
>
This horse has been flogged multiple times on the list.

It is of course mainly a matter of convention, but the convention used
by R has been around at least since Genstat in the mid-1970s. In the
no-intercept case, you get the _uncentered_ version of R-squared, that
is, the proportion of the total sum of squares explained by the model
(as opposed to the sum of squares of _deviations_ from the mean, as in
the usual case). The rationale is that R^2 should be based on the
reduction in residual variation between two nested models, and when
there is no intercept, the only well-determined nested model is the
one in which mydata$Y has mean zero for all x, corresponding to
all-zero regression coefficients. The resulting R^2 is directly
related to the F statistic, which you'll see is also larger and more
significant when the intercept is removed.
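
For concreteness, here is a quick sketch reproducing both numbers by
hand (the vectors x and y are mine, copied from the data above):

x <- 1:9
y <- c(2, 4, 5, 8, 0, 7, 8, 9, 10)

fit1 <- lm(y ~ x)      # with intercept
fit0 <- lm(y ~ x - 1)  # intercept removed

## Centered R^2: reduction relative to the nested model y ~ 1
1 - sum(residuals(fit1)^2) / sum((y - mean(y))^2)   # 0.4958

## Uncentered R^2: reduction relative to the mean-zero model
1 - sum(residuals(fit0)^2) / sum(y^2)               # 0.8749

## The same 0.8749 backed out of the F statistic on 1 and 8 df,
## using R^2 = F/(F + df2) when df1 = 1
55.96 / (55.96 + 8)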
BTW, lm(mydata$Y ~ mydata$X) is bad practice; use lm(Y ~ X,
data = mydata) instead. Using predict() on new data will demonstrate
why, as in the sketch below.
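
Something along these lines (a sketch; newdat and its X values are
hypothetical):

fit_bad  <- lm(mydata$Y ~ mydata$X)
fit_good <- lm(Y ~ X, data = mydata)

newdat <- data.frame(X = c(2.5, 10))

predict(fit_good, newdata = newdat)  # predictions at X = 2.5 and X = 10
predict(fit_bad,  newdata = newdat)  # newdat is effectively ignored: the
                                     # formula hard-wires mydata$X, so you
                                     # get back the 9 original fitted values
                                     # (recent versions of R at least warn)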
--
O__ ---- Peter Dalgaard Øster Farimagsgade 5, Entr.B
c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
(*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45) 35327907