[R] lm and R-squared (newbie)

Gabor Grothendieck ggrothendieck at gmail.com
Thu Dec 15 15:20:34 CET 2011


On Thu, Dec 15, 2011 at 8:35 AM, PtitBleu <ptit_bleu at yahoo.fr> wrote:
> Hello,
>
> I have two data frames (data1 and data4), read with dec="." and sep=";".
> http://r.789695.n4.nabble.com/file/n4199964/data1.txt data1.txt
> http://r.789695.n4.nabble.com/file/n4199964/data4.txt data4.txt
>
> When I do
> plot(data1$nx, data1$ny, col = "red")
> points(data4$nx, data4$ny, col = "blue")
> the results look very similar (at least to me), but the R-squared values of
> summary(lm(data1$ny ~ data1$nx))
> and
> summary(lm(data4$ny ~ data4$nx))
> are very different (0.48 versus 0.89).
>
> Could someone explain the reason to me?
>
> To be complete: I am looking for a simple indicator telling me whether
> it is worthwhile to keep the values provided by lm. I thought that
> R-squared could do the job; for me, if R-squared is far from 1, the
> data are not good enough for a linear fit.
> It seems that I'm wrong.
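
For anyone reproducing this, the two files can be read with the
separators described above. A minimal sketch, assuming each file has a
header row with columns nx and ny:

# sep and dec are as stated in the question; header is an assumption
data1 <- read.table("data1.txt", header = TRUE, sep = ";", dec = ".")
data4 <- read.table("data4.txt", header = TRUE, sep = ";", dec = ".")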

The problem is the outliers. In data1 a few outlying points inflate the
residual variance and drag the R-squared down, even though the bulk of
the points follow much the same line as in data4. For a model with an
intercept, the R-squared reported by summary(lm(...)) equals the squared
Pearson correlation between the fitted and observed values, and the
Pearson correlation is sensitive to outliers. Try a robust measure
instead: if we replace the Pearson correlation with the Spearman (rank)
correlation, the two values are much closer:

> # R^2 based on Pearson correlations
> cor(fitted(lm(ny ~ nx, data4)), data4$ny)^2
[1] 0.8916924
> cor(fitted(lm(ny ~ nx, data1)), data1$ny)^2
[1] 0.4868575
>
> # R^2 based on Spearman (rank) correlations
> cor(fitted(lm(ny ~ nx, data4)), data4$ny, method = "spearman")^2
[1] 0.8104026
> cor(fitted(lm(ny ~ nx, data1)), data1$ny, method = "spearman")^2
[1] 0.7266705
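
Another option along these lines is to make the fit itself robust, so
that outliers are downweighted rather than merely measured around. A
minimal sketch using rlm from the MASS package (my suggestion, not
something the original post uses; it assumes data1 and data4 are
already loaded as above):

library(MASS)

# Huber M-estimation: points with large residuals get reduced weight
fit1 <- rlm(ny ~ nx, data = data1)
fit4 <- rlm(ny ~ nx, data = data4)

coef(fit1)  # robust intercept and slope for data1
coef(fit4)  # robust intercept and slope for data4

# final IWLS weights: values well below 1 flag the outlying points
summary(fit1$w)

If the robust coefficients of the two fits agree while the plain lm
coefficients differ, that is further evidence that outliers, not the
bulk of the data, are driving the difference in R-squared.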

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com


