[R] hierarchical clustering of large dataset
Sarah Goslee
sarah.goslee at gmail.com
Fri Mar 9 21:39:34 CET 2012
2012/3/9 Uwe Ligges <ligges at statistik.tu-dortmund.de>:
> I think the main issue of the OP is that he generates a 55000x55000 distance
> matrix and has to calculate on it. Besides immense main memory consumption
> this may take ages to complete with hierarchical clustering.
Indeed. I missed that in the original email.
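For a sense of scale, here is a rough back-of-the-envelope check. It assumes dist() stores only the lower triangle of the matrix as 8-byte doubles, and it ignores whatever working memory hclust() needs on top of that:

```r
n <- 55000                     # number of observations in the original post
n_pairs <- n * (n - 1) / 2     # pairwise distances kept in the lower triangle
bytes <- n_pairs * 8           # assuming 8 bytes per double
bytes / 1e9                    # approximate size in GB
```

Under those assumptions the distance object alone comes to roughly 12 GB.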
If a non-hierarchical clustering is acceptable, clara() from the
cluster package may be of use.
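A minimal sketch of what that could look like. The data below are made up to mimic the shape of the posted example (columns a to e, with repeated rows), and k = 4 is an arbitrary choice for illustration, not a recommendation:

```r
library(cluster)  # provides clara()

set.seed(1)

# Hypothetical stand-in for the real data: five environmental parameters,
# with rows repeated to mimic multiple sightings under the same conditions.
pool <- data.frame(a = rnorm(2000, 65, 2),
                   b = rnorm(2000, 6, 2),
                   c = rnorm(2000, 1.4, 0.2),
                   d = sample(c(6, 13, 30), 2000, replace = TRUE),
                   e = sample(1:5, 2000, replace = TRUE))
env <- pool[sample(nrow(pool), 55000, replace = TRUE), ]

# clara() clusters around medoids using subsamples of the data, so it never
# builds the full 55000 x 55000 distance matrix.
# stand = TRUE standardizes the columns before distances are computed;
# rngR = TRUE makes the subsampling use R's RNG, so set.seed() applies.
fit <- clara(env, k = 4, stand = TRUE, samples = 50, rngR = TRUE)

table(fit$clustering)   # number of observations assigned to each cluster
fit$medoids             # one representative observation per cluster
```

The number of clusters has to be chosen up front, which is the main trade-off compared with cutting a dendrogram afterwards.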
Sarah
> Uwe Ligges
>
>
> On 08.03.2012 15:02, Sarah Goslee wrote:
>>
>> See inline:
>>
>> On Thu, Mar 8, 2012 at 7:41 AM, Massimo Di Stefano
>> <massimodisasha at gmail.com> wrote:
>>>
>>>
>>> Hello All,
>>>
>>> i've a set of observations that is in the form :
>>>
>>> a, b, c, d, e, f
>>> 67.12, 4.28, 1.7825, 30, 3, 16001
>>> 67.12, 4.28, 1.7825, 30, 3, 16001
>>> 66.57, 4.28, 1.355, 30, 3, 16001
>>> 66.2, 4.28, 1.3459, 13, 3, 16001
>>> 66.2, 4.28, 1.3459, 13, 3, 16001
>>> 66.2, 4.28, 1.3459, 13, 3, 16001
>>> 66.2, 4.28, 1.3459, 13, 3, 16001
>>> 66.2, 4.28, 1.3459, 13, 3, 16001
>>> 66.2, 4.28, 1.3459, 13, 3, 16001
>>> 63.64, 9.726, 1.3004, 6, 3, 11012
>>> 63.28, 9.725, 1.2755, 6, 3, 11012
>>> 63.28, 9.725, 1.2755, 6, 3, 11012
>>> 63.28, 9.725, 1.2755, 6, 3, 11012
>>> 63.28, 9.725, 1.2755, 6, 3, 11012
>>> 63.28, 9.725, 1.2755, 6, 3, 11012
>>> …
>>> ….
>>>
>>> 55,000 observations in total.
>>>
>>> where :
>>>
>>> a, b, c, d, e
>>> are environmental parameters
>>> and f is a label.
>>>
>>> as you can see some rows are duplicated,
>>> this means that the observation occurred more than once
>>
>>
>> If you use dput() for the first 10 or 20 rows of your data, then you will
>> have provided the requested reproducible example.
>>
>>> (in my use case the observation is the presence of a specific
>>> biological species in a photo;
>>> if there is more than one individual of the same species in the photo, i
>>> have a duplicated row)
>>>
>>>
>>> i'm trying to learn how to use R in order to build a dendrogram
>>> that will help me to 'group' several species in communities, based on the
>>> similarity of the env. parameters.
>>>
>>> i tried with
>>>
>>> d<- diet(as.matrix(my data))
>>> hc<- hclust(d)
>>>
>>> but it doesn't work.
>>
>>
>> I'm assuming you mean dist() instead of diet()? I don't know of any
>> function named diet().
>>
>> What "doesn't work"? We can't answer your question unless we know what it
>> is.
>>
>>> is the 'redundancy' of my data (multiple rows with same information) a
>>> problem?
>>> should i remove all the rows that are exactly the same ?
>>
>>
>> Yes. Identical rows have a distance of 0, so they're clustered together
>> immediately, so a dendrogram that includes them is identical to one that
>> has only unique rows.
>>
>>> if i do that, how do i account for the fact that i have multiple
>>> observations for the same environmental parameters?
>>> maybe this information is not relevant for building the dendrogram?
>>>
>>> Please, can you suggest a valid approach for clustering such a
>>> dataset?
>>> forgive me, i have an evident lack of statistical knowledge, thank you very
>>> much for your help!
>>
>>
>> Perhaps some reading in one of the many excellent ecologically-based
>> multivariate statistics books is called for?
>>
--
Sarah Goslee
http://www.functionaldiversity.org