[Rd] Incorrect Kendall's tau for ordered variables (PR#14207)

Prof Brian Ripley ripley at stats.ox.ac.uk
Mon Feb 15 08:41:10 CET 2010


What seems a more serious error is that the current code (and Peter's 
modification) returns correlations computed from unordered factors, 
and packages 'agsemisc', 'ggm' and 'mi' contain examples that do this.
In all of those cases the correlations are Pearson correlations, as 
is the use of ordered factors in 'sfsmisc'.

The as.vector() seems to have been introduced to combat PR#7116, but 
it is not the right fix as swapping the 'x' and 'y' arguments in the 
regression example for that still crashed.  (It seems to me that the 
correct C-level fix is to check the length of the dimnames before 
trying to access the second element.)

It would be tricky to do the coercion right for ordered factors (or 
more general rankable classes): cor() accepts a data frame and does 
as.matrix() on it: if the data frame includes such columns the 
coercion has to be done column-by-column.  So I decided to pass the 
responsibility back to the caller, and only accept numeric arguments 
(as the help page says).  However, package 'mice' passes a logical 
matrix, and as we usually promote logical to numeric silently, I 
have continued to allow that.
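
As a rough illustration of what column-by-column coercion on the 
caller's side might look like, here is a minimal sketch.  The helper 
name to_numeric_cols() is hypothetical and not part of R; it assumes 
that ordered factors should be mapped through xtfrm() and that 
anything else non-numeric should be rejected.

```r
# Hypothetical helper (not part of R): coerce each column of a data
# frame to numeric before handing it to cor().
to_numeric_cols <- function(df) {
  as.data.frame(lapply(df, function(col) {
    if (is.numeric(col) || is.logical(col)) {
      as.numeric(col)      # logical is promoted to 0/1
    } else if (is.ordered(col)) {
      xtfrm(col)           # integer codes that follow the level order
    } else {
      stop("column is not numeric, logical or an ordered factor")
    }
  }))
}

# The data from the report: levels are ordered 9 < 10 < 11.
df  <- data.frame(x = as.ordered(9:11), y = 1:3)
num <- to_numeric_cols(df)

tau <- cor(num$x, num$y, method = "kendall")
rho <- cor(num$x, num$y, method = "spearman")
```

With the ordering preserved in this way, both tau and rho should 
agree with the results for the plain integer vector 9:11.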

Experience suggests that we have been too generous in doing automatic 
coercion in the past.  It seems every time we tighten something up we 
find a handful of packages that got dubious results from inappropriate 
conversions.


On Mon, 8 Feb 2010, Prof Brian Ripley wrote:

> On Mon, 8 Feb 2010, Peter Dalgaard wrote:
>
>> msa at biostat.mgh.harvard.edu wrote:
>>> Full_Name: Marek Ancukiewicz
>>> Version: 2.10.1
>>> OS: Linux
>>> Submission from: (NULL) (74.0.49.2)
>>> 
>>> 
>>> Both cor() and cor.test() incorrectly handle ordered variables with
>>> method="kendall", cor() incorrectly handles ordered variables for
>>> method="spearman" (method="pearson" always works correctly, while
>>> method="spearman" works for cor.test, but not for cor()).
>>> 
>>> In erroneous calculations these functions ignore the inherent ordering
>>> of the ordered variable (e.g., '9'<'10'<'11') and instead seem to assume
>>> an alphabetic ordering ('10'<'11'<'9').
>> 
>> Strictly speaking, not a bug, since the documentation has
>>
>>       x: a numeric vector, matrix or data frame.
>> 
>> respectively
>>
>>    x, y: numeric vectors of data values.  ‘x’ and ‘y’ must have the
>>          same length.
>> 
>> so no one ever claimed that class "ordered" variables should work.
>> 
>> However, the root cause is that as.vector on a factor variable (ordered
>> or not) converts it to a character vector, hence
>> 
>>> rank(as.vector(as.ordered(9:11)))
>> [1] 3 1 2
>> 
>> Looks like a simple fix would be to use as.vector(x, "numeric") inside
>> the definition of cor().
>
> A fix for that particular case: the problem is that it relies on the underlying 
> representation.  I think a better fix would be to do either of
>
> - test for numeric and throw an error otherwise, or
> - use xtfrm, which has the advantage of being more general and
>  allowing methods to be written (S3 or S4 methods in R-devel).
>
>> 
>> 
>>>> cor(9:11,1:3,method="k")
>>> [1] 1
>>>> cor(as.ordered(9:11),1:3,method="k")
>>> [1] -0.3333333
>>>> cor.test(as.ordered(9:11),1:3,method="k")
>>>
>>> 	Kendall's rank correlation tau
>>> 
>>> data:  as.ordered(9:11) and 1:3
>>> T = 1, p-value = 1
>>> alternative hypothesis: true tau is not equal to 0
>>> sample estimates:
>>>        tau
>>> -0.3333333
>>> 
>>>> cor(9:11,1:3,method="s")
>>> [1] 1
>>>> cor(as.ordered(9:11),1:3,method="s")
>>> [1] -0.5
>>>> cor.test(as.ordered(9:11),1:3,method="s")
>>>
>>> 	Spearman's rank correlation rho
>>> 
>>> data:  as.ordered(9:11) and 1:3
>>> S = 0, p-value = 0.3333
>>> alternative hypothesis: true rho is not equal to 0
>>> sample estimates:
>>> rho
>>>   1
>>> 
>>> ______________________________________________
>>> R-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>> 
>> 
>> --
>>   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
>>  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
>> (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
>> ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907
>> 
>
> -- 
> Brian D. Ripley,                  ripley at stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272866 (PA)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

