[R] hierarchical clustering of large dataset

Fri Mar 9 20:47:25 CET 2012

I think the main issue of the OP is that he generates a 55000x55000 
distance matrix and has to compute on it. Besides the immense main memory 
consumption, this may take ages to complete with hierarchical clustering.
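
To put a rough number on the memory side, here is a minimal sketch. It assumes that dist() stores only the lower triangle of the matrix as 8-byte doubles; the helper name dist_size_gib is made up for illustration and is not part of R.

```r
# Back-of-the-envelope size of the object returned by dist() for n rows.
# Assumption: only the lower triangle is stored, i.e. n * (n - 1) / 2 doubles
# of 8 bytes each. dist_size_gib is a hypothetical helper, not an R function.
dist_size_gib <- function(n) {
  n_pairs <- n * (n - 1) / 2
  n_pairs * 8 / 1024^3
}

dist_size_gib(55000)  # roughly 11 GiB for the distances alone
dist_size_gib(5000)   # under 0.1 GiB
```

Any working memory that hclust() needs comes on top of this, so reducing the number of rows before calling dist() is the first thing to try.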

Uwe Ligges

On 08.03.2012 15:02, Sarah Goslee wrote:
> See inline:
>
> On Thu, Mar 8, 2012 at 7:41 AM, Massimo Di Stefano
> <massimodisasha at gmail.com>  wrote:
>>
>> Hello All,
>>
>> i've a set of observations that is in the form :
>>
>> a,    b,    c,    d,    e,    f
>> 67.12,    4.28,    1.7825,    30,    3,    16001
>> 67.12,    4.28,    1.7825,    30,    3,    16001
>> 66.57,    4.28,    1.355,    30,    3,    16001
>> 66.2,    4.28,    1.3459,    13,    3,    16001
>> 66.2,    4.28,    1.3459,    13,    3,    16001
>> 66.2,    4.28,    1.3459,    13,    3,    16001
>> 66.2,    4.28,    1.3459,    13,    3,    16001
>> 66.2,    4.28,    1.3459,    13,    3,    16001
>> 66.2,    4.28,    1.3459,    13,    3,    16001
>> 63.64,    9.726,    1.3004,    6,    3,    11012
>> 63.28,    9.725,    1.2755,    6,    3,    11012
>> 63.28,    9.725,    1.2755,    6,    3,    11012
>> 63.28,    9.725,    1.2755,    6,    3,    11012
>> 63.28,    9.725,    1.2755,    6,    3,    11012
>> 63.28,    9.725,    1.2755,    6,    3,    11012
>> …
>> ….
>>
>> 55,000 observations in total.
>>
>> where :
>>
>> a,    b,    c,    d,    e
>> are environmental parameters
>> and f  is a label.
>>
>> as you can see some rows are duplicated,
>> this means that the observation occurred more than once
>
> If you use dput() for the first 10 or 20 rows of your data, then you will
> have provided the requested reproducible example.
>
>> (in my use case the observation is the presence of a specific biological species in a photo;
>> if there is more than one individual of the same species in the photo, i have a duplicated row)
>>
>>
>> i'm trying to learn how to use R in order to build a dendrogram
>> that will help me to 'group' several species in communities, based on the similarity of the env. parameters.
>>
>> i tried with
>>
>> d<- diet(as.matrix(my data))
>> hc<- hclust(d)
>>
>> but it doesn't work.
>
> I'm assuming you mean dist() instead of diet()? I don't know of any
> function named diet().
>
> What "doesn't work"? We can't answer your question unless we know what it is.
>
>> is the 'redundancy' of my data (multiple rows with same information) a problem?
>> should i remove all the rows that are exactly the same ?
>
> Yes. Identical rows have a distance of 0, so they're clustered together
> immediately, so a dendrogram that includes them is identical to one that
> has only unique rows.
>
>> in that case, how do i take into account the fact that for the same environmental parameters i've multiple observations ?
>> maybe this information is not relevant in order to build the dendrogram ?
>>
>> Please, can you suggest a valid approach for clustering such a dataset ?
>> forgive me, i've an evident lack of statistical knowledge, thank you very much for your help!
>
> Perhaps some reading in one of the many excellent ecologically-based
> multivariate statistics books is called for?
>
> Sarah
>
>
>
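
A rough sketch of the advice above on duplicated rows, using a few of the rows quoted in the post as toy data. Several things here are assumptions rather than anything stated in the thread: the object names obs, uniq, env and the count column n are made up, the label f is left out of the distance calculation, column e is dropped because it is constant in this sample, and the remaining columns are standardised with scale() so that no single parameter dominates the Euclidean distance.

```r
# Toy data in the layout from the post: a..e are environmental parameters,
# f is a label. Values are copied from the rows quoted in the thread.
obs <- data.frame(
  a = c(67.12, 67.12, 66.57, 66.2, 66.2, 63.64, 63.28, 63.28),
  b = c(4.28, 4.28, 4.28, 4.28, 4.28, 9.726, 9.725, 9.725),
  c = c(1.7825, 1.7825, 1.355, 1.3459, 1.3459, 1.3004, 1.2755, 1.2755),
  d = c(30, 30, 30, 13, 13, 6, 6, 6),
  e = c(3, 3, 3, 3, 3, 3, 3, 3),
  f = c(16001, 16001, 16001, 16001, 16001, 11012, 11012, 11012)
)

# Collapse identical rows, keeping the number of occurrences in column n
# (n is a hypothetical column name, so the count is not lost).
obs$n <- 1
uniq <- aggregate(n ~ a + b + c + d + e + f, data = obs, FUN = sum)

# Assumption: cluster on the environmental parameters only, not the label f.
# e is constant in this sample, so it is dropped before standardising.
env <- scale(uniq[, c("a", "b", "c", "d")])

d  <- dist(env)   # Euclidean distance by default
hc <- hclust(d)   # complete linkage by default
```

After this, plot(hc) draws the dendrogram, and dput(head(obs, 10)) prints the first rows in a form that can be pasted into a mail as a reproducible example. With the full data set, nrow(uniq) tells you how many rows actually go into dist().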