[R] hierarchical clustering of large dataset
Massimo Di Stefano
massimodisasha at gmail.com
Fri Mar 9 22:50:53 CET 2012
Peter,
thank you very much for your answer.
install.packages("flashClust")
library(flashClust)
data <- read.csv('/Users/epifanio/Desktop/cluster/x.txt')
data <- na.omit(data)   # drop rows with missing values
data <- scale(data)     # standardize every column (mean 0, sd 1)
> data
a b c d e
1 -0.207709346 -6.618558e-01 0.481413046 0.7761133 0.96473124
2 -0.207709346 -6.618558e-01 0.481413046 0.7761133 0.96473124
3 -0.256330843 -6.618558e-01 -0.352285877 0.7761133 0.96473124
4 -0.289039851 -6.618558e-01 -0.370032451 -0.2838308 0.96473124
My target is to group my observations by 'speciesID'.
The speciesID is the last column: 'e'.
Before going ahead, I need to understand how to tell R that it has to generate the groups using the column 'e' as the label,
so that I have the groups by speciesID.
Using these instructions:
d <- dist(data)
clust <- hclust(d)
it is not clear to me how R will know to use the column 'e' as the label.
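If I understand correctly, something like this is what I'm after (a toy version of my table above, just to make the idea concrete: drop 'e' before dist() and pass it back only as leaf labels):

```r
# toy version of my table above, with made-up variation in each column
data <- data.frame(a = c(67.12, 66.57, 66.20, 63.64),
                   b = c(4.28, 4.28, 4.28, 9.726),
                   c = c(1.7825, 1.3550, 1.3459, 1.3004),
                   d = c(30, 30, 13, 6),
                   e = c(16001, 16001, 11012, 11012))
env   <- data[, c("a", "b", "c", "d")]  # environmental columns only
d     <- dist(scale(env))               # distances on standardized predictors
clust <- hclust(d)
plot(clust, labels = data$e)            # speciesID appears only as leaf labels
```

This way column 'e' never enters the distance matrix, it is only used to label the dendrogram leaves. Is that the right approach?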
####
Sarah said :
Yes. Identical rows have a distance of 0, so they're clustered together immediately,
so a dendrogram that includes them is identical to one that has only unique rows.
####
This way I would lose a lot of information!
It seems relevant to me that a species is found 4 times, instead of once, with a specific combination of environmental parameters.
No?
Maybe a way to decrease the size of my dataset could be to convert my multiple rows to abundance values. I mean:
if a species occurs four times with exactly the same environmental parameters,
I'll add a column for "abundance", fill in a "4", and then remove the three duplicate rows.
This way I can decrease the size of my dataset (in rows), but I'll add a column.
Does that make sense?
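A sketch of the abundance idea using base R's aggregate() (toy rows copied from my table; 'abundance' is the new column I would add):

```r
# four observations, three of them identical
data <- data.frame(a = c(66.20, 66.20, 66.20, 63.28),
                   d = c(13, 13, 13, 6),
                   e = c(16001, 16001, 16001, 11012))
# collapse duplicate rows into one row plus a count
counts <- aggregate(list(abundance = rep(1, nrow(data))),
                    by = data, FUN = sum)
counts  # 2 rows; the (66.20, 13, 16001) row gets abundance 3
```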
Thanks a lot for your help (and patience),
Massimo.
On Mar 9, 2012, at 3:54 PM, Peter Langfelder wrote:
> On Thu, Mar 8, 2012 at 4:41 AM, Massimo Di Stefano
> <massimodisasha at gmail.com> wrote:
>>
>> Hello All,
>>
>> i've a set of observations that is in the form :
>>
>> a, b, c, d, e, f
>> 67.12, 4.28, 1.7825, 30, 3, 16001
>> 67.12, 4.28, 1.7825, 30, 3, 16001
>> 66.57, 4.28, 1.355, 30, 3, 16001
>> 66.2, 4.28, 1.3459, 13, 3, 16001
>> 66.2, 4.28, 1.3459, 13, 3, 16001
>> 66.2, 4.28, 1.3459, 13, 3, 16001
>> 66.2, 4.28, 1.3459, 13, 3, 16001
>> 66.2, 4.28, 1.3459, 13, 3, 16001
>> 66.2, 4.28, 1.3459, 13, 3, 16001
>> 63.64, 9.726, 1.3004, 6, 3, 11012
>> 63.28, 9.725, 1.2755, 6, 3, 11012
>> 63.28, 9.725, 1.2755, 6, 3, 11012
>> 63.28, 9.725, 1.2755, 6, 3, 11012
>> 63.28, 9.725, 1.2755, 6, 3, 11012
>> 63.28, 9.725, 1.2755, 6, 3, 11012
>> …
>> ….
>>
>> 55,000 observations in total.
>
> Hi Massimo,
>
> you don't want to use the entire matrix to calculate the distance. You
> will want to select the environmental columns, and you may want to
> standardize them to prevent one of them from having more influence than
> the others.
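To make the standardization step concrete: scale() centers each column to mean 0 and divides by its standard deviation, so a column measured in big units cannot dominate the Euclidean distance. A quick check (my toy values, any column names):

```r
env <- data.frame(a = c(67.12, 66.57, 66.20),
                  d = c(30, 30, 13))
std <- scale(env)              # (x - mean(x)) / sd(x), column by column
round(colMeans(std), 10)       # effectively 0 for every column
apply(std, 2, sd)              # exactly 1 for every column
```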
>
> Second, if you want to cluster such a huge data set using hierarchical
> clustering, you need a lot of memory, at least 32GB but preferably
> 64GB. If you don't have that much, you cannot use hierarchical
> clustering.
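As a rough sanity check of the memory figures (my arithmetic, not from the mail): the dist object alone for n = 55000 observations stores n*(n-1)/2 doubles, and hclust needs working copies on top of that:

```r
n  <- 55000
gb <- n * (n - 1) / 2 * 8 / 1024^3  # 8 bytes per double
gb                                  # roughly 11.3 GB for a single copy of d
```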
>
> Third, if you do have enough memory, use package flashClust or
> fastcluster (I am the maintainer of flashClust.)
> For flashClust, you can install it using
> install.packages("flashClust") and load it using library(flashClust).
> The standard R implementation of hclust is unnecessarily slow (order
> n^3). flashClust provides a replacement (function hclust) that is
> approximately n^2. I have clustered data sets of 30000 variables in a
> minute or two, so 55000 shouldn't take more than 4-5 minutes, again
> assuming your computer has enough memory.
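A minimal sketch of that flashClust workflow, using the built-in iris table as a small stand-in for the real 55,000-row data (method = "average" and k = 3 are my choices, just for illustration):

```r
library(flashClust)                 # masks stats::hclust with the fast version
d      <- dist(scale(iris[, 1:4])) # standardized numeric columns only
clust  <- hclust(d, method = "average")
groups <- cutree(clust, k = 3)     # cut the dendrogram into 3 clusters
table(groups)                      # cluster sizes
```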
>
> HTH,
>
> Peter