[R] cor(data.frame) infelicities
Liaw, Andy
andy_liaw at merck.com
Mon Dec 3 19:58:21 CET 2007
I'd call that another infelicity. Species is supposed to be nominal,
not ordinal, so rank correlation wouldn't make much sense. So what does
cor(, method="kendall") do? It looks like it simply uses the underlying
numeric code. (Change Species to numerics and you'll see the same
answer.) However, reordering the levels changes the result:
R> iris2 <- iris
R> levels(iris2$Species) <- levels(iris2$Species)[c(2, 1, 3)]
R> cor(iris2, method = "kendall")
             Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
Sepal.Length   1.00000000 -0.07699679    0.7185159   0.6553086 0.1897778
Sepal.Width   -0.07699679  1.00000000   -0.1859944  -0.1571257 0.1439793
Petal.Length   0.71851593 -0.18599442    1.0000000   0.8068907 0.2677154
Petal.Width    0.65530856 -0.15712566    0.8068907   1.0000000 0.2724843
Species        0.18977778  0.14397927    0.2677154   0.2724843 1.0000000
To me, this is dangerous!
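
For illustration, a minimal sketch of the level-order dependence, assuming the rank correlation is effectively computed on the factor's integer codes. It converts Species to codes explicitly, so it does not rely on how cor() itself treats a factor column, and it reorders the levels with factor(x, levels = ...) rather than the levels<- assignment used above. The variable names are only illustrative:

```r
# Kendall's tau against a factor's integer codes depends on the level order,
# which is arbitrary for a nominal variable such as Species.
x  <- iris$Petal.Length
sp <- iris$Species

# Default level order is alphabetical: setosa = 1, versicolor = 2, virginica = 3
codes_default <- as.integer(sp)

# Same observations, but with the levels declared in a different order:
# versicolor = 1, setosa = 2, virginica = 3
sp_reordered    <- factor(sp, levels = levels(sp)[c(2, 1, 3)])
codes_reordered <- as.integer(sp_reordered)

tau_default   <- cor(x, codes_default,   method = "kendall")
tau_reordered <- cor(x, codes_reordered, method = "kendall")

# The labels attached to each row are unchanged, yet the two taus differ
print(c(default = tau_default, reordered = tau_reordered))
```

Nothing about the data changes between the two calls, only the declared order of the levels, so any correlation reported for a nominal factor this way is an artifact of that order.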
Andy
From: Gabor Grothendieck
>
> You can calculate the Kendall rank correlation with such a matrix
> so you would not want to exclude factors in that case:
>
> > cor(iris, method = "kendall")
>              Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
> Sepal.Length   1.00000000 -0.07699679    0.7185159   0.6553086  0.6704444
> Sepal.Width   -0.07699679  1.00000000   -0.1859944  -0.1571257 -0.3376144
> Petal.Length   0.71851593 -0.18599442    1.0000000   0.8068907  0.8229112
> Petal.Width    0.65530856 -0.15712566    0.8068907   1.0000000  0.8396874
> Species        0.67044444 -0.33761438    0.8229112   0.8396874  1.0000000
>
>
> On Dec 3, 2007 9:27 AM, Michael Friendly <friendly at yorku.ca> wrote:
> > In using cor(data.frame), it is annoying that you have to explicitly
> > filter out non-numeric columns, and when you don't, the error message
> > is misleading:
> >
> > > cor(iris)
> > Error in cor(iris) : missing observations in cov/cor
> > In addition: Warning message:
> > In cor(iris) : NAs introduced by coercion
> >
> > It would be nicer if stats::cor() *itself* did the equivalent of the
> > following for a data.frame:
> > > cor(iris[,sapply(iris, is.numeric)])
> >              Sepal.Length Sepal.Width Petal.Length Petal.Width
> > Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
> > Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
> > Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
> > Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000
> > >
> >
> > A change could be implemented here:
> >     if (is.data.frame(x))
> >         x <- as.matrix(x)
> >
> > Second, the default, use="all.obs", throws an error if there are any
> > NAs. It would be nicer if the default were use="complete.obs",
> > with a warning about the dropped cases instead of an error. Most other
> > statistical software is more tolerant of missing data.
> >
> > > library(corrgram)
> > > data(auto)
> > > cor(auto[,sapply(auto, is.numeric)])
> > Error in cor(auto[, sapply(auto, is.numeric)]) :
> > missing observations in cov/cor
> > > cor(auto[,sapply(auto, is.numeric)],use="complete")
> > # works; output elided
> >
> > -Michael
> >
> > --
> > Michael Friendly Email: friendly AT yorku DOT ca
> > Professor, Psychology Dept.
> > York University Voice: 416 736-5115 x66249 Fax: 416 736-5814
> > 4700 Keele Street http://www.math.yorku.ca/SCS/friendly.html
> > Toronto, ONT M3J 1P3 CANADA
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
>
>
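
As a rough sketch of the behaviour Michael asks for above, the wrapper below keeps only the numeric columns of a data frame, reports which columns it dropped, and defaults to use = "complete.obs". The name cor_numeric is hypothetical and not part of stats; this only illustrates the request and is not how cor() is implemented:

```r
# Hypothetical helper (not part of stats): correlate only the numeric
# columns of a data frame, using complete observations by default.
cor_numeric <- function(df, use = "complete.obs", method = "pearson") {
  stopifnot(is.data.frame(df))
  is_num <- vapply(df, is.numeric, logical(1))
  if (!any(is_num)) stop("no numeric columns to correlate")
  if (!all(is_num)) {
    # Say what was dropped rather than failing with a coercion error
    message("dropping non-numeric columns: ",
            paste(names(df)[!is_num], collapse = ", "))
  }
  cor(df[, is_num, drop = FALSE], use = use, method = method)
}

# Species is dropped; the four measurement columns are correlated
r <- cor_numeric(iris)

# With a missing value, rows containing NA are left out instead of erroring
iris_na <- iris
iris_na$Sepal.Length[1] <- NA
r_na <- cor_numeric(iris_na)
```

The message keeps the dropped columns visible, so a factor is never silently ignored or silently turned into numbers.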