[R] hierarchical clustering of large dataset

Fri Mar 9 21:39:34 CET 2012

2012/3/9 Uwe Ligges <ligges at statistik.tu-dortmund.de>:
> I think the main issue of the OP is that he generates a 55000x55000 distance
> matrix and has to calculate on it. Besides the immense main memory consumption,
> this may take ages to complete with hierarchical clustering.

Indeed. I missed that in the original email.
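
For scale, here is a rough back-of-the-envelope estimate of the memory
involved. It assumes dist() stores only the lower triangle as 8-byte
doubles, and it ignores any working memory hclust() needs on top of that:

```r
n <- 55000                   # number of observations, from the original post
n_dist <- n * (n - 1) / 2    # pairwise distances stored by dist() (lower triangle)
gib <- n_dist * 8 / 2^30     # size in GiB at 8 bytes per double
gib                          # roughly 11 GiB for the dist object alone
```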

If a non-hierarchical clustering is acceptable, clara() from the
cluster package may be of use.
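
A minimal sketch of what that could look like, not tested on the actual
data. The data frame below is simulated as a stand-in for columns a to e,
and k = 4 and samples = 50 are arbitrary values chosen for illustration:

```r
library(cluster)

set.seed(1)
# Simulated stand-in for the real data: 55000 rows, 5 numeric columns (a-e).
# These values are made up; substitute the real environmental parameters.
n <- 55000
env <- data.frame(a = rnorm(n, 65, 2),
                  b = rnorm(n, 6, 2),
                  c = rnorm(n, 1.4, 0.2),
                  d = rnorm(n, 20, 8),
                  e = rnorm(n, 3, 0.5))

env_u <- unique(env)     # drop exact duplicate rows
env_s <- scale(env_u)    # center and scale so no variable dominates the distance

# clara() runs PAM on subsamples, so it never builds the full distance matrix.
# k = 4 is an assumption for illustration, not a recommendation.
cl <- clara(env_s, k = 4, samples = 50)
table(cl$clustering)     # number of rows assigned to each cluster
```

Whether to drop duplicates first is a judgment call here: unlike in a
dendrogram, repeated rows can influence which medoids clara() picks.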

Sarah

> Uwe Ligges
>
>
> On 08.03.2012 15:02, Sarah Goslee wrote:
>>
>> See inline:
>>
>> On Thu, Mar 8, 2012 at 7:41 AM, Massimo Di Stefano
>> <massimodisasha at gmail.com>  wrote:
>>>
>>>
>>> Hello All,
>>>
>>> I have a set of observations in the form:
>>>
>>> a,    b,    c,    d,    e,    f
>>> 67.12,    4.28,    1.7825,    30,    3,    16001
>>> 67.12,    4.28,    1.7825,    30,    3,    16001
>>> 66.57,    4.28,    1.355,    30,    3,    16001
>>> 66.2,    4.28,    1.3459,    13,    3,    16001
>>> 66.2,    4.28,    1.3459,    13,    3,    16001
>>> 66.2,    4.28,    1.3459,    13,    3,    16001
>>> 66.2,    4.28,    1.3459,    13,    3,    16001
>>> 66.2,    4.28,    1.3459,    13,    3,    16001
>>> 66.2,    4.28,    1.3459,    13,    3,    16001
>>> 63.64,    9.726,    1.3004,    6,    3,    11012
>>> 63.28,    9.725,    1.2755,    6,    3,    11012
>>> 63.28,    9.725,    1.2755,    6,    3,    11012
>>> 63.28,    9.725,    1.2755,    6,    3,    11012
>>> 63.28,    9.725,    1.2755,    6,    3,    11012
>>> 63.28,    9.725,    1.2755,    6,    3,    11012
>>> …
>>> ….
>>>
>>> 55,000 observations in total.
>>>
>>> where :
>>>
>>> a,    b,    c,    d,    e
>>> are environmental parameters
>>> and f  is a label.
>>>
>>> as you can see, some rows are duplicated;
>>> this means that the observation occurred multiple times
>>
>>
>> If you use dput() for the first 10 or 20 rows of your data, then you will
>> have provided the requested reproducible example.
>>
>>> (in my use case the observation is the presence of a specific
>>>  biological species in a photo;
>>> if the photo contains more than one individual of the same species, I
>>> have a duplicated row)
>>>
>>>
>>> i'm trying to learn how to use R in order to build a dendrogram
>>> that will help me to 'group' several species in communities, based on the
>>> similarity of the env. parameters.
>>>
>>> i tried with
>>>
>>> d<- diet(as.matrix(my data))
>>> hc<- hclust(d)
>>>
>>> but it doesn't work.
>>
>>
>> I'm assuming you mean dist() instead of diet()? I don't know of any
>> function named diet().
>>
>> What "doesn't work"? We can't answer your question unless we know what it
>> is.
>>
>>> is the 'redundancy' of my data (multiple rows with same information) a
>>> problem?
>>> should i remove all the rows that are exactly the same ?
>>
>>
>> Yes. Identical rows have a distance of 0, so they're clustered together
>> immediately, so a dendrogram that includes them is identical to one that
>> has only unique rows.
>>
>>> in that case, how do I account for the fact that I have multiple
>>> observations for the same environmental parameters?
>>> maybe this information is not relevant for building the dendrogram?
>>>
>>> Please, can you suggest a valid approach for clustering such a
>>> dataset?
>>> forgive me, I have an evident lack of statistical knowledge; thank you very
>>> much for your help!
>>
>>
>> Perhaps some reading in one of the many excellent ecologically-based
>> multivariate statistics books is called for?
>>

-- 
Sarah Goslee
http://www.functionaldiversity.org