[R] Efficient distance calculation on big matrix
Boel Brynedal
brynedal at gmail.com
Mon Jun 18 16:28:10 CEST 2012
Stefan, that looks wonderful! I am most certainly going to try to
download and use your 'wordspace', even though I am not yet sure how
to download it. Many thanks! Unfortunately, yes, I do need the
Canberra distance. But decreasing the time by more than 60% will save
me days, if I can get there!
2012/6/17 Stefan Evert <stefanML at collocations.de>:
>
>> I'm working on analyzing a large data set; let's assume that
>> dim(Data) = c(1000, 8700). I want to calculate the Canberra distance
>> between the columns of this matrix. Using a toy example ('test' is a
>> matrix filled with random numbers between 0 and 1):
>>
>>> system.time(d<-as.matrix(dist(t(test), method = "canberra", diag = FALSE, upper = FALSE, p = 2)))
>> user system elapsed
>> 1417.713 3.219 1421.144
>> The system.time results also confuse me a bit, since 99% of the time
>> is not system time but user time. What does that mean?
>
> User time is the time that R spends working on your problem; system time refers to tasks done by the operating system, e.g. disk access, managing locks and, most importantly, swapping when you run out of RAM. With multi-threaded code, the total CPU time (user plus system) can be much larger than the time that has actually elapsed.
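A quick way to see the distinction (purely illustrative): a CPU-bound computation accumulates mostly user time, whereas waiting accumulates almost none, so only its elapsed time grows.

    system.time(sort(runif(2e7)))   # CPU-bound: user time is close to elapsed time
    system.time(Sys.sleep(2))       # waiting: elapsed is ~2 s, user/system near zero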
>
>>
>> Is there any way to calculate the distance which would take less time?
>
> Well, one thing you can do is to get a faster computer. :-) The command above takes only 670 seconds on my MacBook Pro (without multi-threading).
>
> Calculating a distance matrix is an expensive computation. In your example, R has to carry out (8700 * 8700 * 1000) / 2 = 37.8 billion floating point divisions. With approx. 27 clock cycles per division (according to tables I've found on the Web), this takes at least 340 seconds even on a 3GHz CPU (and ignoring memory access, addition/subtraction, loops, etc.).
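A quick back-of-the-envelope check of those figures in R itself:

    n.cols <- 8700; n.rows <- 1000
    divisions <- n.cols^2 / 2 * n.rows   # ~37.8 billion divisions
    divisions * 27 / 3e9                 # ~340 seconds at 27 cycles each on a 3 GHz CPU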
>
> You can shave off some of the time if you implement the distance calculation in C, inline the code to avoid callback functions in loops, operate on columns of the matrix directly (which should be more cache-friendly than rows) and don't check for NA's, NaN's and other degenerate cases.
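For reference, here is a minimal plain-R sketch of what has to be computed for every pair of columns (the function name is made up for illustration; the real speed-up comes from running this inner loop in compiled C code over the matrix columns):

    # Canberra distance between two column vectors:
    #   sum( |x - y| / (|x| + |y|) ), skipping terms whose denominator is zero
    canberra.cols <- function(x, y) {
      s <- abs(x) + abs(y)
      ok <- s > 0                    # avoid 0/0 when both entries are zero
      sum(abs(x[ok] - y[ok]) / s[ok])
    }

    # sanity check against dist() on a tiny matrix (columns as objects)
    m <- matrix(runif(50), 10, 5)
    all.equal(canberra.cols(m[, 1], m[, 2]),
              as.matrix(dist(t(m), method = "canberra"))[1, 2])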
>
> I've done just that in my experimental R package "wordspace", which isn't on CRAN yet:
>
>> library(wordspace)
>> A <- matrix(runif(8.7e6), 1000, 8700)
>
>> system.time(d1 <- as.matrix(dist(t(A), method="canberra")))
> user system elapsed
> 669.207 2.724 669.305
>
>> system.time(d2 <- dist.matrix(A, method="canberra", byrow=FALSE))
> user system elapsed
> 250.534 0.784 250.301
>
>> all(d1 == d2)
> [1] TRUE
>
> If you aren't tied to Canberra distance, you can use a less expensive metric such as the Manhattan distance for an additional, more substantial speed boost:
>
>> system.time(d3 <- dist.matrix(A, method="manhattan", byrow=FALSE))
> user system elapsed
> 42.488 0.999 43.569
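The Manhattan metric is so much cheaper largely because it is the Canberra sum without the per-term division, as this purely illustrative comparison shows:

    x <- runif(1000); y <- runif(1000)
    sum(abs(x - y) / (abs(x) + abs(y)))   # Canberra: one division per element pair
    sum(abs(x - y))                       # Manhattan: additions and subtractions only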
>
> This calculation is still single-threaded, so you can run several of them in parallel, depending on how many cores your server has.
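One rough sketch of such a split (assuming a Unix-alike machine with the 'parallel' package; the block layout and the canberra() helper below are made up for illustration and are not part of wordspace): give each core a block of columns, let it compute the distances from its block to all columns, and glue the pieces back together.

    library(parallel)

    A <- matrix(runif(1000 * 200), 1000, 200)    # small toy matrix
    canberra <- function(x, y) sum(abs(x - y) / (abs(x) + abs(y)))

    n <- ncol(A)
    blocks <- split(seq_len(n), cut(seq_len(n), 4, labels = FALSE))  # 4 column blocks

    # each worker computes the distances from its block of columns to all columns
    pieces <- mclapply(blocks, function(cols) {
      sapply(seq_len(n), function(j)
        sapply(cols, function(i) canberra(A[, i], A[, j])))
    }, mc.cores = 4)

    D <- do.call(rbind, pieces)                  # full n x n distance matrix
    all.equal(D, as.matrix(dist(t(A), method = "canberra")),
              check.attributes = FALSE)

This computes the full symmetric matrix (so twice the necessary work), but it shows the idea of spreading the columns over several cores.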
>
> Hope this helps,
> Stefan
>
>
> PS: In case you'd like to give it a try yourself and aren't daunted by a complete lack of documentation:
>
> svn checkout svn://scm.r-forge.r-project.org/svnroot/wordspace/pkg
>
>
> [ evert at linglit.tu-darmstadt.de | http://purl.org/stefan.evert ]
>