[R] Efficient distance calculation on big matrix

Sat Jun 16 19:04:38 CEST 2012

Hi All,

I'm working on analyzing a large data set; let's assume that
dim(Data) = c(1000, 8700). I want to calculate the Canberra distance
between the columns of this matrix. Here is the timing for a toy
example ('test' is a matrix filled with random numbers between 0 and 1):

> system.time(d<-as.matrix(dist(t(test), method = "canberra", diag = FALSE, upper = FALSE, p = 2)))
    user   system  elapsed
1417.713    3.219 1421.144
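
For anyone who wants to try this, here is a scaled-down sketch of the toy example. The dimensions, the seed, and the use of runif() are assumptions chosen only so that it runs quickly; they are not the real data. It also compares one entry of the result against the Canberra sum written out by hand, sum(|x - y| / |x + y|), which should agree when no entries are zero.

```r
# Scaled-down toy example; sizes and seed are illustrative assumptions.
set.seed(1)
n_rows <- 100   # the example above uses 1000 rows
n_cols <- 200   # the example above uses 8700 columns
test <- matrix(runif(n_rows * n_cols), nrow = n_rows, ncol = n_cols)

# dist() compares rows, so transpose to get column-to-column distances.
timing <- system.time(
  d <- as.matrix(dist(t(test), method = "canberra"))
)
print(timing)

# Canberra distance between columns 1 and 2, written out by hand.
x <- test[, 1]
y <- test[, 2]
manual <- sum(abs(x - y) / abs(x + y))
print(all.equal(d[1, 2], manual))
```

The diag, upper and p arguments are left out here: diag and upper default to FALSE, and p is only used by the "minkowski" method.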

Is there any way to calculate the distances in less time?
I am already parallelizing this a great deal (the real data has
many more rows), but I can't go below 1000 rows and still get
reliable results. I will also calculate the distances repeatedly (about
100 times with 1000 rows) while removing small parts of the matrix.
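
To make the repeated calculation concrete, the loop looks roughly like the sketch below. It is only an illustration: it assumes the removed parts are contiguous blocks of rows, and the small matrix and block_size are made-up stand-ins for the real data.

```r
# Hypothetical sketch: recompute the distance matrix after dropping
# one block of rows at a time. All sizes here are assumptions.
set.seed(1)
test <- matrix(runif(60 * 20), nrow = 60, ncol = 20)  # small stand-in
block_size <- 10                                      # assumed block size
starts <- seq(1, nrow(test), by = block_size)

results <- lapply(starts, function(s) {
  # Row indices of the block removed in this iteration.
  drop_rows <- s:min(s + block_size - 1, nrow(test))
  reduced <- test[-drop_rows, , drop = FALSE]
  # Full column-to-column Canberra distances on the reduced matrix.
  as.matrix(dist(t(reduced), method = "canberra"))
})

length(results)  # one distance matrix per removed block
```

Each iteration repeats the full dist() call, so the total cost is the single-run time multiplied by the number of iterations.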

The system.time results also confuse me a bit, since 99% of the time
is user time rather than system time. What does that mean?

I'm on a Linux server with about 48 GB of RAM available.

Any suggestions appreciated,

Bo

> sessionInfo()
R version 2.12.1 (2010-12-16)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.iso885915       LC_NUMERIC=C
 [3] LC_TIME=en_US.iso885915        LC_COLLATE=en_US.iso885915
 [5] LC_MONETARY=C                  LC_MESSAGES=en_US.iso885915
 [7] LC_PAPER=en_US.iso885915       LC_NAME=C
 [9] LC_ADDRESS=C                   LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] hash_2.1.0 mmap_0.6-9

loaded via a namespace (and not attached):
[1] tools_2.12.1

$ uname -a
Linux compute-13-2.local 2.6.18-164.6.1.el5 #1 SMP Tue Nov 3 16:12:36
EST 2009 x86_64 x86_64 x86_64 GNU/Linux