[R] faster algorithm for Kendall's tau

ferdinand principia ferd_principia at yahoo.ca
Wed Jun 29 00:14:07 CEST 2005


Sorry,

I should specifiy in more detail what my data looks
like. The data vectors (simulations) are mostly
composed of floats (for which it's pretty unlikely to
produce ties), but there are integer values to be
found as well (up to 10% of vector elements). 
As I undestand, Marc's algo is not suited for this
case. 

Is there another solution?

Thanks
Ferdinand


--- "Marc Schwartz (via MN)" <mschwartz at mn.rr.com>
wrote:

> On Tue, 2005-06-28 at 13:03 -0400, ferdinand
> principia wrote:
> > Hi,
> > 
> > I need to calculate Kendall's tau for large data
> > vectors (length > 100'000). 
> > Is somebody aware of a faster algorithm or package
> > function than "cor(, method="kendall")"? 
> > There are ties in the data to be considered
> (Kendall's
> > tau-b).
> > 
> > Any suggestions?
> > 
> > Regards
> > Ferdinand
> 
> 
> The time intensive part of the process is typically
> the ranking/ordering
> of the vector pairs to calculate the numbers of
> concordant and
> discordant pairs.
> 
> If the number of _unique pairs_ in your data is
> substantially less than
> the number of total pairs (in other words, creating
> a smaller 2d
> contingency table from a pair of your vectors makes
> sense), then the
> following may be of help.
> 
> # Calculate CONcordant Pairs in a table
> # cycle through x[r, c] and multiply by
> # sum(x elements below and to the right of x[r, c])
> # x = table
> concordant <- function(x)
> {
>   x <- matrix(as.numeric(x), dim(x))
>   
>   # get sum(matrix values > r AND > c)
>   # for each matrix[r, c]
>   mat.lr <- function(r, c)
>   { 
>     lr <- x[(r.x > r) & (c.x > c)]
>     sum(lr)
>   }
> 
>   # get row and column index for each
>   # matrix element
>   r.x <- row(x)
>   c.x <- col(x)
> 
>   # return the sum of each matrix[r, c] * sums
>   # using mapply to sequence thru each matrix[r, c]
>   sum(x * mapply(mat.lr, r = r.x, c = c.x))
> }
> 
> # Calculate DIScordant Pairs in a table
> # cycle through x[r, c] and multiply by
> # sum(x elements below and to the left of x[r, c])
> # x = table
> discordant <- function(x)
> {
>   x <- matrix(as.numeric(x), dim(x))
>   
>   # get sum(matrix values > r AND < c)
>   # for each matrix[r, c]
>   mat.ll <- function(r, c)
>   { 
>     ll <- x[(r.x > r) & (c.x < c)]
>     sum(ll)
>   }
> 
>   # get row and column index for each
>   # matrix element
>   r.x <- row(x)
>   c.x <- col(x)
> 
>   # return the sum of each matrix[r, c] * sums
>   # using mapply to sequence thru each matrix[r, c]
>   sum(x * mapply(mat.ll, r = r.x, c = c.x))
> }
> 
> 
> # Calculate Kendall's Tau-b
> # x = table
> calc.KTb <- function(x)
> {
>   x <- matrix(as.numeric(x), dim(x))
>   
>   c <- concordant(x)
>   d <- discordant(x)
> 
>   n <- sum(x)
>   SumR <- rowSums(x)
>   SumC <- colSums(x)
> 
>   KTb <- (2 * (c - d)) / sqrt(((n ^ 2) -
>          (sum(SumR ^ 2))) * ((n ^ 2) -
>          (sum(SumC ^ 2))))
> 
>   KTb
> }
> 
> 
> Note that I made some modifications of the above,
> relative to prior
> versions that I have posted to handle large numbers
> of pairs to avoid
> integer overflows in summations. Hence the:
> 
>   x <- matrix(as.numeric(x), dim(x))
> 
> conversion in each function.
> 
> Now, create some random test data, with 100,000
> elements in each vector,
> sampling from 'letters', which would yield a 26 x 26
> table:
> 
>  a <- sample(letters, 100000, replace = TRUE)
>  b <- sample(letters, 100000, replace = TRUE)
>  
>  > dim(table(a, b))
>  [1] 26 26
> 
>  > system.time(print(calc.KTb(table(a, b))))
> [1] 0.0006906088
> [1] 0.77 0.02 0.83 0.00 0.00
> 
> Note that in the above, the initial table takes most
> of the time:
> 
> > system.time(table(a, b))
> [1] 0.55 0.00 0.56 0.00 0.00
> 
> Hence:
> 
> > tab.ab <- table(a, b)
> > system.time(print(calc.KTb(tab.ab)))
> [1] 0.0006906088
> [1] 0.25 0.01 0.27 0.00 0.00
> 
> 
> I should note that I also ran:
> 
> > system.time(print(cor(a, b, method = "kendall")))
> [1] 0.0006906088 
> [1] 694.80   7.72 931.89   0.00   0.00 
> 
> Nice to know the results work out at least...  :-)
> 
> 
> I have not tested with substantially larger 2d
> matrices, but would
> envision that as the dimensions of the resultant
> tabulation increases,
> my method probably approaches and may even become
> less efficient than
> the approach implemented in cor(). Some testing
> would validate this and
> perhaps point to coding the concordant() and
> discordant() functions in C
> for improvement in timing.
> 
> HTH,
> 
> Marc Schwartz
> 
> 
>




More information about the R-help mailing list