[R] Efficient distance calculation on big matrix
Boel Brynedal
brynedal at gmail.com
Sat Jun 16 19:04:38 CEST 2012
Hi All,
I'm working on analyzing a large data set; let's assume that
dim(Data)=c(1000,8700). I want to calculate the Canberra distance
between the columns of this matrix. Using a toy example ('test' is
a matrix filled with random numbers 0-1):
> system.time(d<-as.matrix(dist(t(test), method = "canberra", diag = FALSE, upper = FALSE, p = 2)))
    user   system  elapsed
1417.713    3.219 1421.144
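For anyone wanting to reproduce this at a smaller scale, the toy setup might look like the sketch below. The matrix size (200 x 300), the seed, and the helper name `canberra_pair` are assumptions chosen for illustration and are not taken from the real data. The helper only checks a single pair by hand, and it relies on the values being non-negative, as they are in the toy example.

```r
# Toy setup (assumed sizes, much smaller than the 1000 x 8700 matrix in the post)
set.seed(1)
n_rows <- 200
n_cols <- 300
test <- matrix(runif(n_rows * n_cols), nrow = n_rows, ncol = n_cols)

# dist() compares rows, so transpose to get distances between columns
timing <- system.time(
  d <- as.matrix(dist(t(test), method = "canberra"))
)
print(timing)

# Hypothetical helper: Canberra distance for one pair of non-negative vectors
canberra_pair <- function(x, y) sum(abs(x - y) / (abs(x) + abs(y)))

# The hand-computed value should agree with dist() for columns 1 and 2
all.equal(d[1, 2], canberra_pair(test[, 1], test[, 2]))
```

Timing a few increasing values of `n_cols` this way gives a rough idea of how the cost grows, since the number of column pairs is n_cols * (n_cols - 1) / 2.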
Is there any way to calculate the distance which would take less time?
I am already parallelizing this to a great extent (the real data has
many more rows), but I can't go below 1000 rows and still get
reliable results. I will also calculate the distances repeatedly (about
100 times if 1000 rows) while removing small parts of the matrix.
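The repeated calculation described above might look roughly like the following sketch. It assumes that "removing small parts of the matrix" means dropping a random set of rows before each run; the post does not say this, and the names `n_rep` and `drop_size` as well as all sizes are hypothetical.

```r
# Assumed toy sizes; the matrix in the post is 1000 x 8700
set.seed(2)
test <- matrix(runif(100 * 50), nrow = 100, ncol = 50)

n_rep <- 5       # hypothetical number of repeats (about 100 in the post)
drop_size <- 10  # hypothetical number of rows removed per repeat

dists <- vector("list", n_rep)
for (i in seq_len(n_rep)) {
  # Assumption: "removing small parts" means dropping a random set of rows
  drop <- sample(nrow(test), drop_size)
  reduced <- test[-drop, , drop = FALSE]
  # Column-wise Canberra distances on the reduced matrix
  dists[[i]] <- dist(t(reduced), method = "canberra")
}

# Each result holds choose(50, 2) = 1225 pairwise distances
length(dists[[1]])
```

Written this way, every repeat recomputes all pairwise distances from scratch, so the total cost is roughly the single-run time multiplied by the number of repeats.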
The system.time results also confuse me a bit, since 99% of the time
is not system time but user time. What does that mean?
I'm on a Linux server and should have about 48GB RAM here.
Any suggestions appreciated,
Bo
> sessionInfo()
R version 2.12.1 (2010-12-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C
[3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915
[5] LC_MONETARY=C LC_MESSAGES=en_US.iso885915
[7] LC_PAPER=en_US.iso885915 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] hash_2.1.0 mmap_0.6-9
loaded via a namespace (and not attached):
[1] tools_2.12.1
$ uname -a
Linux compute-13-2.local 2.6.18-164.6.1.el5 #1 SMP Tue Nov 3 16:12:36
EST 2009 x86_64 x86_64 x86_64 GNU/Linux