[R] hierarchical clustering of large dataset
Uwe Ligges
ligges at statistik.tu-dortmund.de
Fri Mar 9 20:47:25 CET 2012
I think the OP's main issue is that he generates a 55000x55000
distance matrix and then has to compute on it. Besides the immense main
memory consumption, hierarchical clustering on it may take ages to complete.
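To put a rough number on that: dist() stores only the lower triangle,
i.e. n * (n - 1) / 2 doubles at 8 bytes each, which for n = 55000 is
roughly 12 GB before hclust() allocates any working memory of its own.
The sketch below is only a back-of-the-envelope illustration. The helper
name dist_bytes is made up, the rows are a handful copied from the
original post, and leaving column f out of the distance (because it is
described as a label) is my assumption, not something the OP confirmed.

```r
# Hypothetical helper: bytes needed to hold a "dist" object for n rows.
# dist() keeps the lower triangle only: n * (n - 1) / 2 doubles, 8 bytes each.
dist_bytes <- function(n) n * (n - 1) / 2 * 8

n <- 55000
cat(sprintf("distances alone: %.1f GiB\n", dist_bytes(n) / 1024^3))

# A few rows copied from the original post (columns a..f).
x <- data.frame(
  a = c(67.12, 67.12, 66.57, 66.2, 66.2, 63.64, 63.28, 63.28),
  b = c(4.28, 4.28, 4.28, 4.28, 4.28, 9.726, 9.725, 9.725),
  c = c(1.7825, 1.7825, 1.355, 1.3459, 1.3459, 1.3004, 1.2755, 1.2755),
  d = c(30, 30, 30, 13, 13, 6, 6, 6),
  e = c(3, 3, 3, 3, 3, 3, 3, 3),
  f = c(16001, 16001, 16001, 16001, 16001, 11012, 11012, 11012)
)

# Drop exact duplicate rows first; the distance matrix shrinks quadratically.
u <- unique(x)
cat(nrow(x), "rows ->", nrow(u), "unique rows\n")

# Assumption: f is a label, so only a..e enter the distance.
hc <- hclust(dist(u[, c("a", "b", "c", "d", "e")]))
```

Whether the real data shrink enough to make hclust() feasible depends on
how many of the 55000 rows are distinct, which only the OP can check.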
Uwe Ligges
On 08.03.2012 15:02, Sarah Goslee wrote:
> See inline:
>
> On Thu, Mar 8, 2012 at 7:41 AM, Massimo Di Stefano
> <massimodisasha at gmail.com> wrote:
>>
>> Hello All,
>>
>> i've a set of observations that is in the form :
>>
>> a, b, c, d, e, f
>> 67.12, 4.28, 1.7825, 30, 3, 16001
>> 67.12, 4.28, 1.7825, 30, 3, 16001
>> 66.57, 4.28, 1.355, 30, 3, 16001
>> 66.2, 4.28, 1.3459, 13, 3, 16001
>> 66.2, 4.28, 1.3459, 13, 3, 16001
>> 66.2, 4.28, 1.3459, 13, 3, 16001
>> 66.2, 4.28, 1.3459, 13, 3, 16001
>> 66.2, 4.28, 1.3459, 13, 3, 16001
>> 66.2, 4.28, 1.3459, 13, 3, 16001
>> 63.64, 9.726, 1.3004, 6, 3, 11012
>> 63.28, 9.725, 1.2755, 6, 3, 11012
>> 63.28, 9.725, 1.2755, 6, 3, 11012
>> 63.28, 9.725, 1.2755, 6, 3, 11012
>> 63.28, 9.725, 1.2755, 6, 3, 11012
>> 63.28, 9.725, 1.2755, 6, 3, 11012
>> …
>> ….
>>
>> 55,000 observations in total.
>>
>> where :
>>
>> a, b, c, d, e
>> are environmental parameters
>> and f is a label.
>>
>> as you can see, some rows are duplicated;
>> this means that the observation occurred more than once
>
> If you use dput() for the first 10 or 20 rows of your data, then you will
> have provided the requested reproducible example.
>
>> (in my use case the observation is the presence of a specific biological species in a photo;
>> if the photo contains more than one individual of the same species, i have a duplicated row)
>>
>>
>> i'm trying to learn how to use R in order to build a dendrogram
>> that will help me to 'group' several species in communities, based on the similarity of the env. parameters.
>>
>> i tried with
>>
>> d<- diet(as.matrix(my data))
>> hc<- hclust(d)
>>
>> but it doesn't work.
>
> I'm assuming you mean dist() instead of diet()? I don't know of any
> function named diet().
>
> What "doesn't work"? We can't answer your question unless we know what it is.
>
>> is the 'redundancy' of my data (multiple rows with same information) a problem?
>> should i remove all the rows that are exactly the same ?
>
> Yes. Identical rows have a distance of 0, so they're clustered
> together immediately, so a dendrogram that includes them is identical
> to one that has only unique rows.
>
>> in that case, how do i account for the fact that i have multiple observations for the same environmental parameters ?
>> maybe this information is not relevant in order to build the dendrogram ?
>>
>> Please, can you suggest a valid approach for clustering such a dataset ?
>> forgive me, i have an evident lack of statistical knowledge, thank you very much for your help!
>
> Perhaps some reading in one of the many excellent ecologically-based
> multivariate statistics books is called for?
>
> Sarah
>
>
>