[R] millions of comparisons, speed wanted
Martin Maechler
maechler at stat.math.ethz.ch
Fri Dec 16 16:27:42 CET 2005
I have not taken the time to look into this example, but daisy()
from the (recommended, hence part of R) package 'cluster'
is more flexible than dist(), particularly in the case of NAs
and for (a mixture of continuous and) categorical variables.
It uses a version of Gower's formula in order to deal with NAs
and asymmetric binary variables. The example quoted below looks
like a very good match for this problem.
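A minimal sketch of what a daisy() call could look like for such data. The toy matrix and the choice to declare every column as symmetric binary are assumptions made for illustration, not something taken from the original data:

```r
library(cluster)  # recommended package, shipped with standard R installations

# Hypothetical toy data: four 0/1 rows, one of them containing an NA.
# Every column contains both a 0 and a 1.
x <- data.frame(rbind(c(0,  1, 0, 1, 1),
                      c(0,  0, 0, 1, 1),
                      c(NA, 0, 0, 1, 1),
                      c(1,  1, 1, 0, 0)))

# Assumption: treat all five columns as symmetric binary variables.
# Declaring variable types makes daisy() use Gower's formula.
d <- daisy(x, metric = "gower", type = list(symm = 1:5))

# Full symmetric dissimilarity matrix, one row/column per observation
m <- as.matrix(d)
print(m)
```

In Gower's formula a variable that is missing in either row of a pair is left out for that pair, so rows 2 and 3 here agree on every position observed in both of them.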
Regards,
Martin Maechler, ETH Zurich
>>>>> "Adrian" == Adrian DUSA <adi at roda.ro>
>>>>> on Thu, 15 Dec 2005 22:04:01 +0200 writes:
Adrian> Dear Andy,
Adrian> On Thursday 15 December 2005 20:57, Liaw, Andy wrote:
>> Just some untested idea:
>> If the data are all 0/1, you could use dist(input, method="manhattan"), and
>> then check which entries equal 1. This should be much faster than creating
>> all pairs of rows and checking position by position.
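The idea quoted above could be sketched roughly as follows. The matrix `input` is made-up toy data and the name `pairs` is hypothetical; for pure 0/1 data the Manhattan distance between two rows is simply the number of positions in which they differ:

```r
# Hypothetical 0/1 input matrix, one configuration per row
input <- rbind(c(0, 1, 0, 1, 1),
               c(0, 0, 0, 1, 1),
               c(1, 0, 0, 1, 0),
               c(1, 1, 0, 1, 1))

# For 0/1 data, Manhattan distance = number of differing positions
d <- as.matrix(dist(input, method = "manhattan"))

# Keep each pair only once (upper triangle) and only where exactly
# one position differs
pairs <- which(d == 1 & upper.tri(d), arr.ind = TRUE)
print(pairs)
```

Each row of `pairs` holds the indices of two rows of `input` that differ in exactly one position.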
Adrian> Thanks for the idea, I played a little with it. At the beginning yes, the data
Adrian> are all 0/1, but during the minimizing iterations there are also "x" values;
Adrian> for example comparing:
Adrian> 0 1 0 1 1
Adrian> 0 0 0 1 1
Adrian> should return
Adrian> 0 "x" 0 1 1
Adrian> whereas
Adrian> 0 "x" 0 1 1
Adrian> 0 0 0 1 1
Adrian> shouldn't even be compared (they have a different number of figures).
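The comparison rule described above might be written as a small helper like the one below. The function name `combine_rows` is hypothetical, and the reading of "different number of figures" as "the "x" marks are not in the same positions" is an assumption:

```r
# Combine two rows that differ in exactly one position, marking that
# position with "x". Returns NULL when the rows should not be combined.
combine_rows <- function(a, b) {
  a <- as.character(a)
  b <- as.character(b)
  stopifnot(length(a) == length(b))
  # Assumption: rows are comparable only if their "x" marks coincide
  if (!identical(a == "x", b == "x")) return(NULL)
  differs <- which(a != b)
  if (length(differs) != 1) return(NULL)
  a[differs] <- "x"
  a
}

# The two cases from the message above
r1 <- combine_rows(c(0, 1, 0, 1, 1), c(0, 0, 0, 1, 1))
r2 <- combine_rows(c("0", "x", "0", "1", "1"), c("0", "0", "0", "1", "1"))
print(r1)
print(r2)
```

Applying this to every pair of rows is the slow step the thread is about, so the helper is meant only to pin down the rule, not as a fast solution.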
Adrian> Replacing "x" with NA in dist is not yielding results either, as with
Adrian> NA 0 0 1 1
Adrian> 0 0 0 1 1
Adrian> dist returns 0.
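The behaviour reported above can be reproduced in a few lines: dist() drops any column that is NA in either row of a pair before computing the distance. The helper `mismatches` is a hypothetical alternative, not part of the thread, that counts an NA facing a non-NA value as a difference:

```r
x <- rbind(c(NA, 0, 0, 1, 1),
           c(0,  0, 0, 1, 1))

# The NA column is excluded for this pair, the rest are identical
d_na <- dist(x, method = "manhattan")
print(d_na)

# Hypothetical helper: count differing positions, treating an NA that
# faces a non-NA value as a mismatch and NA vs NA as a match
mismatches <- function(a, b) {
  both_na <- is.na(a) & is.na(b)
  one_na  <- xor(is.na(a), is.na(b))
  sum(one_na) + sum(a != b & !both_na, na.rm = TRUE)
}

n_diff <- mismatches(x[1, ], x[2, ])
print(n_diff)
```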
Adrian> I even wanted to see if I could tweak the dist code, but it calls a C program
Adrian> and I gave up.
Adrian> Nice idea anyhow, maybe I'll find a way to use it further.
Adrian> Best,
Adrian> Adrian
Adrian> --
Adrian> Adrian DUSA
Adrian> Romanian Social Data Archive
Adrian> 1, Schitu Magureanu Bd
Adrian> 050025 Bucharest sector 5
Adrian> Romania
Adrian> Tel./Fax: +40 21 3126618 \
Adrian> +40 21 3120210 / int.101
Adrian> ______________________________________________
Adrian> R-help at stat.math.ethz.ch mailing list
Adrian> https://stat.ethz.ch/mailman/listinfo/r-help
Adrian> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html