[R] hierarchical clustering of large dataset

Massimo Di Stefano massimodisasha at gmail.com
Fri Mar 9 22:50:53 CET 2012


Peter,

thanks a lot for your answer.



install.packages("flashClust")
library(flashClust)
data <- read.csv('/Users/epifanio/Desktop/cluster/x.txt')
data <- na.omit(data)
data <- scale(data)
head(data)
                 a             b            c          d           e
1     -0.207709346 -6.618558e-01  0.481413046  0.7761133  0.96473124
2     -0.207709346 -6.618558e-01  0.481413046  0.7761133  0.96473124
3     -0.256330843 -6.618558e-01 -0.352285877  0.7761133  0.96473124
4     -0.289039851 -6.618558e-01 -0.370032451 -0.2838308  0.96473124


my target is to group my observations by 'speciesID';
the speciesID is the last column, 'e'.



Before going ahead, I need to understand how to tell R that it has to generate the groups using the column 'e' as the grouping variable,
so as to have the groups by speciesID.

Using these instructions:

d <- dist(data)
clust <- hclust(d)

it is not clear to me how R will know to use the column 'e' as the label.
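As far as I understand, hclust() simply clusters the rows of whatever matrix dist() was given, so nothing in the call above tells it that 'e' is special. A minimal sketch of what I think is needed (toy numbers in the same shape as my data, columns a-d environment, e speciesID): compute the distance on the environmental columns only, and keep 'e' aside as leaf labels:

```r
# toy data in the same shape as mine: a-d = environment, e = speciesID
data <- data.frame(a = c(67.12, 66.57, 66.20, 63.64),
                   b = c(4.28,  4.28,  4.28,  9.726),
                   c = c(1.7825, 1.355, 1.3459, 1.3004),
                   d = c(30, 30, 13, 6),
                   e = c(3, 3, 3, 3))

env    <- data[, c("a", "b", "c", "d")]  # distance uses only the environment
labels <- data[, "e"]                    # speciesID kept aside, not clustered

d     <- dist(scale(env))                # standardize, then Euclidean distance
clust <- hclust(d)
# plot(clust, labels = labels)           # dendrogram leaves labelled by speciesID
```

Is that the right way to think about it?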




####
Sarah said:

Yes. Identical rows have a distance of 0, so they're clustered
together immediately,
so a dendrogram that includes them is identical to one that has only
unique rows.
####
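Just to convince myself of Sarah's point, a tiny check with made-up numbers: identical rows have distance 0, so they merge first in the dendrogram, and unique() drops the repeats:

```r
x <- rbind(c(1, 2), c(1, 2), c(5, 6))   # rows 1 and 2 are identical
d <- dist(x)                            # d[1] is the distance between rows 1 and 2
h <- hclust(d)
h$height[1]                             # first merge is at height 0: the duplicates
nrow(unique(x))                         # unique() drops the repeated row
```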

In this way I will lose a lot of information!
It seems relevant to me that a species is found 4 times, rather than once, with a specific combination of environmental parameters.
No?


Maybe a way to decrease the size of my dataset could be:

convert my repeated rows to abundance values; I mean:
if a species occurs four times with exactly the same environmental parameters,
I'll add a column for "abundance", fill in a "4", and then remove three rows.
In this way I can decrease the size of my dataset (in rows), but I'll add a column.

Does that make sense?
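In base R the collapsing step might look like this (a sketch with made-up rows; the real data would come from read.csv as above): add a constant column of 1s and sum it within groups of identical rows:

```r
# made-up rows: the first three are identical observations
df <- data.frame(a = c(66.2, 66.2, 66.2, 63.28),
                 b = c(4.28, 4.28, 4.28, 9.725),
                 e = c(3,    3,    3,    3))

# one row per unique combination, plus a count of how often it occurred
agg <- aggregate(abundance ~ ., data = transform(df, abundance = 1), FUN = sum)
```

This keeps every distinct combination once and records how many times it was observed, so the count is still available to weight the analysis afterwards.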

Thanks a lot for your help (and patience),

Massimo.







On Mar 9, 2012, at 3:54 PM, Peter Langfelder wrote:

> On Thu, Mar 8, 2012 at 4:41 AM, Massimo Di Stefano
> <massimodisasha at gmail.com> wrote:
>> 
>> Hello All,
>> 
>> i've a set of observations that is in the form :
>> 
>> a,    b,    c,    d,    e,    f
>> 67.12,    4.28,    1.7825,    30,    3,    16001
>> 67.12,    4.28,    1.7825,    30,    3,    16001
>> 66.57,    4.28,    1.355,    30,    3,    16001
>> 66.2,    4.28,    1.3459,    13,    3,    16001
>> 66.2,    4.28,    1.3459,    13,    3,    16001
>> 66.2,    4.28,    1.3459,    13,    3,    16001
>> 66.2,    4.28,    1.3459,    13,    3,    16001
>> 66.2,    4.28,    1.3459,    13,    3,    16001
>> 66.2,    4.28,    1.3459,    13,    3,    16001
>> 63.64,    9.726,    1.3004,    6,    3,    11012
>> 63.28,    9.725,    1.2755,    6,    3,    11012
>> 63.28,    9.725,    1.2755,    6,    3,    11012
>> 63.28,    9.725,    1.2755,    6,    3,    11012
>> 63.28,    9.725,    1.2755,    6,    3,    11012
>> 63.28,    9.725,    1.2755,    6,    3,    11012
>> ….
>> 
>> 55,000 observations in total.
> 
> Hi Massimo,
> 
> you don't want to use the entire matrix to calculate the distance. You
> will want to select the environmental columns and you may want to
> standardize them to prevent one of them having more influence than
> others.
> 
> Second, if you want to cluster such a huge data set using hierarchical
> clustering, you need a lot of memory, at least 32GB but preferably
> 64GB. If you don't have that much, you cannot use hierarchical
> clustering.
> 
> Third, if you do have enough memory, use package flashClust or
> fastcluster (I am the maintainer of flashClust.)
> For flashClust, you can install it using
> install.packages("flashClust") and load it using library(flashClust).
> The standard R implementation of hclust is unnecessarily slow (order
> n^3). flashClust provides a replacement (function hclust) that is
> approximately n^2. I have clustered data sets of 30000 variables in a
> minute or two, so 55000 shouldn't take more than 4-5 minutes, again
> assuming your computer has enough memory.
> 
> HTH,
> 
> Peter


