[R] cor(data.frame) infelicities

Michael Friendly friendly at yorku.ca
Mon Dec 3 22:31:12 CET 2007


Returning to my original post, I still believe that a basic work-horse
like cor(data.frame) with the default method="pearson" should try to do 
something more useful in this case than barf with a misleading error
message if the data frame contains character variables.
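
A minimal sketch of the kind of behaviour meant here, written as a
user-level wrapper rather than as a patch to stats. The name cor_numeric()
and the wording of its warning are assumptions made up for illustration;
nothing like it exists in base R.

```r
# Hypothetical wrapper (not part of stats): keep the numeric columns of a
# data frame, report which columns were dropped, then call cor() as usual.
cor_numeric <- function(x, ...) {
    stopifnot(is.data.frame(x))
    keep <- vapply(x, is.numeric, logical(1))
    if (!any(keep))
        stop("no numeric columns in data frame")
    if (!all(keep))
        warning("dropping non-numeric column(s): ",
                paste(names(x)[!keep], collapse = ", "))
    cor(x[, keep, drop = FALSE], ...)
}

# Usage: Species is dropped (with a warning) instead of causing an error.
r <- suppressWarnings(cor_numeric(iris))
dim(r)
```

The warning names the offending columns, which is the part the current
error message leaves out.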

To paraphrase Einstein,
``Things [in R] should be made as simple as possible, but not any simpler''

The case that Andy Liaw cited is a good example of the 'not any
simpler' part.
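
One way to respect that limit for rank correlations, following Gabor's
suggestion to admit only factors of class "ordered", might look like the
sketch below. The helper cor_rank() and the derived Size column are
hypothetical and exist only for this illustration.

```r
# Hypothetical helper (not part of stats): for rank correlations keep numeric
# columns and ordered factors, replacing the latter by their integer codes.
# Unordered factors such as iris$Species are dropped, because the order of
# their levels is arbitrary.
cor_rank <- function(x, method = c("kendall", "spearman"), ...) {
    stopifnot(is.data.frame(x))
    method <- match.arg(method)
    ok <- vapply(x, function(col) is.numeric(col) || is.ordered(col),
                 logical(1))
    if (!any(ok))
        stop("no numeric or ordered columns in data frame")
    if (!all(ok))
        warning("dropping column(s): ",
                paste(names(x)[!ok], collapse = ", "))
    out <- x[, ok, drop = FALSE]
    for (nm in names(out)) {
        if (is.ordered(out[[nm]]))
            out[[nm]] <- as.integer(out[[nm]])
    }
    cor(out, method = method, ...)
}

# Illustration only: Size is an ordered factor invented for this example.
iris3 <- iris
iris3$Size <- cut(iris3$Petal.Length, breaks = c(0, 2, 5, 7),
                  labels = c("small", "medium", "large"),
                  ordered_result = TRUE)
r <- suppressWarnings(cor_rank(iris3))
colnames(r)  # Size is kept, Species is not
```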

-Michael

Gabor Grothendieck wrote:
> You are right but I was just trying to stick to the same example.
> In reality it would be ok as long as it's an ordered factor.  One could
> restrict it to those of class "ordered".
> 
> 
> On Dec 3, 2007 1:58 PM, Liaw, Andy <andy_liaw at merck.com> wrote:
>> I'd call that another infelicity.  Species is supposed to be nominal,
>> not ordinal, so rank correlation wouldn't make much sense.  So what does
>> cor(, method="kendall") do?  It looks like it simply uses the underlying
>> numeric code.  (Change Species to numerics and you'll see the same
>> answer.)  However, reordering the levels changes the result:
>>
>> R> iris2 <- iris
>> R> levels(iris2$Species) <- levels(iris2$Species)[c(2, 1, 3)]
>> R> cor(iris2, method = "kendall")
>>             Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
>> Sepal.Length   1.00000000 -0.07699679    0.7185159   0.6553086 0.1897778
>> Sepal.Width   -0.07699679  1.00000000   -0.1859944  -0.1571257 0.1439793
>> Petal.Length   0.71851593 -0.18599442    1.0000000   0.8068907 0.2677154
>> Petal.Width    0.65530856 -0.15712566    0.8068907   1.0000000 0.2724843
>> Species        0.18977778  0.14397927    0.2677154   0.2724843 1.0000000
>>
>> To me, this is dangerous!
>>
>> Andy
>>
>>
>> From: Gabor Grothendieck
>>
>>> You can calculate the Kendall rank correlation with such a matrix
>>> so you would not want to exclude factors in that case:
>>>
>>>> cor(iris, method = "kendall")
>>>              Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
>>> Sepal.Length   1.00000000 -0.07699679    0.7185159   0.6553086  0.6704444
>>> Sepal.Width   -0.07699679  1.00000000   -0.1859944  -0.1571257 -0.3376144
>>> Petal.Length   0.71851593 -0.18599442    1.0000000   0.8068907  0.8229112
>>> Petal.Width    0.65530856 -0.15712566    0.8068907   1.0000000  0.8396874
>>> Species        0.67044444 -0.33761438    0.8229112   0.8396874  1.0000000
>>>
>>>
>>> On Dec 3, 2007 9:27 AM, Michael Friendly <friendly at yorku.ca> wrote:
>>>> In using cor(data.frame), it is annoying that you have to explicitly
>>>> filter out non-numeric columns, and when you don't, the error message
>>>> is misleading:
>>>>
>>>>  > cor(iris)
>>>> Error in cor(iris) : missing observations in cov/cor
>>>> In addition: Warning message:
>>>> In cor(iris) : NAs introduced by coercion
>>>>
>>>> It would be nicer if stats:::cor() did the equivalent *itself* of the
>>>> following for a data.frame:
>>>>  > cor(iris[,sapply(iris, is.numeric)])
>>>>              Sepal.Length Sepal.Width Petal.Length Petal.Width
>>>> Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
>>>> Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
>>>> Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
>>>> Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000
>>>>  >
>>>>
>>>> A change could be implemented here:
>>>>     if (is.data.frame(x))
>>>>         x <- as.matrix(x)
>>>>
>>>> Second, the default, use="all.obs", throws an error if there are any
>>>> NAs.  It would be nicer if the default was use="complete.obs",
>>>> which would generate warnings instead.  Most other statistical
>>>> software is more tolerant of missing data.
>>>>
>>>>  > library(corrgram)
>>>>  > data(auto)
>>>>  > cor(auto[,sapply(auto, is.numeric)])
>>>> Error in cor(auto[, sapply(auto, is.numeric)]) :
>>>>   missing observations in cov/cor
>>>>  > cor(auto[,sapply(auto, is.numeric)],use="complete")
>>>> # works; output elided
>>>>
>>>> -Michael
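
On the missing-data point in the quoted message, a small sketch of how the
use= settings differ. It substitutes the built-in airquality data for the
auto data so that it runs without corrgram, and it passes use= explicitly
rather than relying on the default, which may differ between R versions.

```r
# airquality ships with R; Ozone and Solar.R contain NAs.
aq <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]

# "all.obs" refuses to proceed when any NA is present.
res <- try(cor(aq, use = "all.obs"), silent = TRUE)
inherits(res, "try-error")

# "complete.obs" drops every row that has an NA in any column.
r_complete <- cor(aq, use = "complete.obs")

# "pairwise.complete.obs" drops NAs separately for each pair of columns,
# so different entries may be based on different numbers of rows.
r_pairwise <- cor(aq, use = "pairwise.complete.obs")

# Number of rows that "complete.obs" actually uses, out of nrow(aq).
sum(complete.cases(aq))
```

The two tolerant settings generally give different matrices, so whichever
became the default would still need to be stated in the output or the
documentation.
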
-- 
Michael Friendly     Email: friendly AT yorku DOT ca
Professor, Psychology Dept.
York University      Voice: 416 736-5115 x66249 Fax: 416 736-5814
4700 Keele Street    http://www.math.yorku.ca/SCS/friendly.html
Toronto, ONT  M3J 1P3 CANADA


