[R] Very Slow Gower Similarity Function

Jari Oksanen jari.oksanen at oulu.fi
Mon Apr 18 21:00:10 CEST 2005


On 18 Apr 2005, at 20:36, Anon. wrote:

> Jari Oksanen wrote:
>
>>
>> On 18 Apr 2005, at 19:10, Tyler Smith wrote:
>>
>>> Hello,
>>>
>>> I am a relatively new user of R. I have written a basic function to
>>> calculate the Gower similarity function. I was motivated to do so
>>> partly as an exercise in learning R, and partly because the existing
>>> option (vegdist in the vegan package) does not accept missing values.
>>>
>> Speed is the reason to use C instead of R. It should be easy, almost 
>> trivial, to modify vegdist.c so that it handles missing values. 
>> I guess this handling means ignoring the value pair if one of the 
>> values is missing -- which is not so gentle to the metric properties 
>> so dear to Gower. Package vegan is designed for ecological community 
>> data which generally do not have missing values (except in 
>> environmental data), but contributions are welcome.
>>
> The only reason you never see ecological community data with missing 
> values is because the ecologists remove those species/sites from their 
> Excel sheets before they give it to you to sort out their mess.

Well, ecologists have plenty of missing species in their community 
data, but these have zero values since they were not observed. I guess 
some Bob O'Hara is going to have a paper about this in JAE.

> This is actually one of the few things they know how to do in Excel - 
> I'm dreading the day when a paper appears in JAE saying that you can 
> use Excel to produce P-values.
>
The "A" in "JAE" stands for "Animal": for real things they still have 
Journal of Ecology.

> To be slightly more serious, as an exercise the OP could consider 
> writing a wrapper function in R that removes the missing data and then 
> calls vegdist to calculate his Gower similarity index.
>
The looping happens inside the C code, so a wrapper is difficult for 
pairwise deletion of missing values. With complete.cases this is trivial 
(and then your result would be more metric as well).
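
To make the difference between the two strategies concrete, here is a
minimal base R sketch. It is not the vegan code: the function names
gower_pairwise and gower_complete are hypothetical, only numeric columns
are handled, and every column is assumed to have at least one
non-missing value. Each variable is scaled by its observed range, and
the distance is the mean absolute difference over the variables that
were compared.

```r
# Hypothetical helper: Gower distance for numeric columns, skipping any
# variable that is missing in either member of a pair (pairwise deletion).
gower_pairwise <- function(x) {
  x <- as.matrix(x)
  # Range of each variable, computed from the non-missing values only.
  rng <- apply(x, 2, function(v) diff(range(v, na.rm = TRUE)))
  rng[rng == 0] <- 1                       # constant columns contribute 0
  z <- sweep(x, 2, rng, "/")               # scale each column by its range
  n <- nrow(z)
  d <- matrix(0, n, n, dimnames = list(rownames(x), rownames(x)))
  for (i in seq_len(n)) {
    # Absolute differences between row i and every row, one pass per row.
    diffs <- abs(sweep(z, 2, z[i, ], "-"))
    # Average over the variables observed in both rows; NA pairs are dropped.
    d[i, ] <- rowMeans(diffs, na.rm = TRUE)
  }
  as.dist(d)
}

# Hypothetical helper: listwise deletion. Rows with any NA are removed
# first, so every remaining pair is compared on the same variables.
gower_complete <- function(x) {
  x <- as.matrix(x)
  gower_pairwise(x[complete.cases(x), , drop = FALSE])
}

# Small made-up example: row "b" has one missing value.
x <- rbind(a = c(1, 10, 0),
           b = c(3, NA, 4),
           c = c(5, 30, 8))

d_pair <- gower_pairwise(x)   # a-b 0.5, a-c 1, b-c 0.5
d_comp <- gower_complete(x)   # only a and c remain: a-c 1
```

With pairwise deletion the a-b and b-c distances are averaged over two
variables while a-c uses all three, which is why the metric properties
can suffer. The complete.cases version compares every pair on the same
variables, at the cost of dropping row b entirely.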
--
Jari Oksanen, Oulu, Finland



