[R] Any better way of optimizing time for calculating distances in the mentioned scenario??

Stefan Evert stefanML at collocations.de
Fri Oct 12 14:47:25 CEST 2012


On 12 Oct 2012, at 09:46, Purna chander wrote:

> 4) scenario4:
>> x<-read.table("query.vec")
>> v<-read.table("query.vec2")
>> v<-as.matrix(v)
>> d<-dist(rbind(v,x),method="manhattan")
>> m<-as.matrix(d)
>> m2<-m[1:nrow(v),(nrow(v)+1):nrow(x)]
>> print(m2[1,1:10])
> 
> time taken for running the code:
> real    0m0.445s
> user    0m0.401s
> sys     0m0.041s
> 1) Though scenario 4 is optimum, this scenario failed when matrix 'v'
> having more no. of rows. An error occurred while converting distance
> object 'd' to a matrix 'm'.
>     For E.g: > m<-as.matrix(d)
>       the above command resulted in error: "Error: cannot allocate
> vector of size 922.7 MB".

That's because you're calculating a full distance matrix with (10000+100) * (10000+100) entries and then extracting only the much smaller number of distance values (10000 * 100) that you actually need.
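
To illustrate the point in plain base R, here is a minimal sketch that computes only the cross-distances between the rows of 'v' and the rows of 'x', without ever building the full square matrix. The helper name cross_manhattan() and the small random matrices are made up for this example; they are not from the original post, and this is not the implementation used in any package.

```r
# Hypothetical helper (name and code are illustrative only):
# Manhattan distances between every row of A and every row of B,
# computed one row of A at a time so that only an
# nrow(A) x nrow(B) result matrix is allocated.
cross_manhattan <- function(A, B) {
  A <- as.matrix(A)
  B <- as.matrix(B)
  stopifnot(ncol(A) == ncol(B))
  tB <- t(B)                          # columns of tB are the rows of B
  out <- matrix(0, nrow(A), nrow(B))
  for (i in seq_len(nrow(A))) {
    # A[i, ] is recycled down each column of tB, i.e. row i of A
    # is subtracted from every row of B
    out[i, ] <- colSums(abs(tB - A[i, ]))
  }
  out
}

# Small made-up data standing in for 'v' and 'x'
set.seed(1)
v <- matrix(runif(5 * 4), nrow = 5)
x <- matrix(runif(7 * 4), nrow = 7)
m2 <- cross_manhattan(v, x)

# Cross-check against dist() on the stacked matrix; note that the
# columns belonging to 'x' are (nrow(v)+1):(nrow(v)+nrow(x))
full <- as.matrix(dist(rbind(v, x), method = "manhattan"))
ref  <- full[1:nrow(v), (nrow(v) + 1):(nrow(v) + nrow(x))]
stopifnot(isTRUE(all.equal(m2, ref, check.attributes = FALSE)))
```

The loop is slower than compiled code, but its memory use grows with nrow(v) * nrow(x) rather than with the square of the combined row count.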

I have a use case with similar requirements, so ...

> 3) Any other ideas to optimize the problem i'm facing with.

... my experimental "wordspace" package includes a function dist.matrix() for calculating such cross-distance matrices.  The function is written in C and doesn't handle NAs and NaNs properly, but it's considerably faster than the current implementation of dist().

I haven't uploaded the package to CRAN yet, but you should be able to install it with
 
	install.packages("wordspace", repos="http://R-Forge.R-project.org")
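
Once the package is installed, a call for the scenario above might look roughly like the sketch below. The argument names (a second matrix passed after the first, and method="manhattan") are an assumption based on the documented interface of later wordspace releases and may differ in the experimental build; the small random matrices are made up for illustration.

```r
# Sketch only: assumes the wordspace package is installed and that
# dist.matrix() accepts two row matrices plus a 'method' argument.
if (requireNamespace("wordspace", quietly = TRUE)) {
  set.seed(1)
  v <- matrix(runif(5 * 4), nrow = 5)   # stand-in for the small matrix
  x <- matrix(runif(7 * 4), nrow = 7)   # stand-in for the large matrix
  # Distances between each row of v and each row of x
  m2 <- wordspace::dist.matrix(v, x, method = "manhattan")
  print(dim(m2))
}
```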

Best,
Stefan


PS: Glad to see that daily builds on R-Forge work again -- that's an extremely useful feature to get beta testers for experimental package versions. :-)



