[R] hierarchical clustering of large dataset
Massimo Di Stefano
massimodisasha at gmail.com
Sat Mar 10 02:26:01 CET 2012
I'll try to describe the data.
Here [1] is a subdataset (255 rows, 6 columns named a to f).
The last column, f, contains the identification number (ID) of a particular species.
Column f holds 20 different species IDs, and it should be my 'label':
16001
11012
25011
13011
11029
11027
10022
10024
20009
11016
20002
13001
11010
22037
15001
30016
21005
11028
15002
20008
The other variables (columns 'a' to 'e') are:
depth
temperature
salinity
substrate-class
morphology-class
My target is to obtain 'groups of species' based on the similarity of their environmental parameters, and to build a dendrogram like [2].
The full dataset (1.5 MB) is available here [3].
[1] http://massimo-timecapsule.whoi.edu//data/img/subdataset.txt
[2] http://massimo-timecapsule.whoi.edu//data/img/manova_clust_matlab.png
[3] http://massimo-timecapsule.whoi.edu//data/img/x.txt
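
To make the goal above concrete, here is a minimal sketch of one possible approach: average the environmental variables per species, then cluster the species. The toy values are made up and only the column names a to f follow the description above. Treating the two class columns (d, e) as plain numbers and using average linkage are assumptions, not something the real data necessarily supports.

```r
# Toy stand-in for the real table (values are invented for illustration):
# a..e are the environmental variables, f is the species ID.
set.seed(1)
dat <- data.frame(
  a = runif(60, 10, 500),      # depth
  b = runif(60, 2, 20),        # temperature
  c = runif(60, 33, 38),       # salinity
  d = sample(1:4, 60, TRUE),   # substrate class (treated as numeric here)
  e = sample(1:3, 60, TRUE),   # morphology class (treated as numeric here)
  f = rep(c(16001, 11012, 25011, 13011, 11029, 11027), each = 10)
)

env_cols <- c("a", "b", "c", "d", "e")

# One row per species: the mean of each environmental variable.
sp_means <- aggregate(dat[env_cols], by = list(species = dat$f), FUN = mean)

# Standardise the environmental columns only; the ID becomes the row name,
# so it is used as the leaf label and never enters the distance.
m <- scale(sp_means[env_cols])
rownames(m) <- sp_means$species

hc <- hclust(dist(m), method = "average")
plot(hc, main = "Species grouped by mean environment")
```

With the real file, dat would come from read.csv() instead of the simulated data frame; the rest should carry over as long as the columns are named the same way.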
On Mar 9, 2012, at 7:18 PM, Peter Langfelder wrote:
> On Fri, Mar 9, 2012 at 1:50 PM, Massimo Di Stefano
> <massimodisasha at gmail.com> wrote:
>> Peter,
>>
>> thanks a lot for your answer.
>>
>>
>>
>> install.packages("flashClust")
>> library(flashClust)
>> data <- read.csv('/Users/epifanio/Desktop/cluster/x.txt')
>> data <- na.omit(data)
>> data <- scale(data)
>> > mydata
>>              a             b            c          d          e
>> 1 -0.207709346 -6.618558e-01  0.481413046  0.7761133 0.96473124
>> 2 -0.207709346 -6.618558e-01  0.481413046  0.7761133 0.96473124
>> 3 -0.256330843 -6.618558e-01 -0.352285877  0.7761133 0.96473124
>> 4 -0.289039851 -6.618558e-01 -0.370032451 -0.2838308 0.96473124
>>
>>
>> My target is to group my observations by 'speciesID'.
>> The speciesID is the last column: 'e'.
>>
>>
>>
>> Before going ahead, I need to understand how to tell R to generate the groups using column 'e' as the grouping variable,
>> so as to have the groups by speciesID.
>>
>> Using these instructions:
>>
>> d <- dist(data)
>> clust <- hclust(d)
>>
>> it is not clear to me how R will know to use column 'e' as the label.
>
> Well, you didn't say that column e was a label that you wanted to keep
> separate. Any other labels in the data? You may not want to use labels
> in the distance calculation.
>
> Do I understand right that you want to cluster each species separately?
>
> Peter
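
On the point about keeping labels out of the distance calculation, a small sketch of what that could look like at the observation level. The data frame obs and its values are hypothetical; only the idea (drop the ID column before dist(), then reuse it for the leaf labels) comes from the exchange above.

```r
# Hypothetical toy data: two environmental columns and a species ID column f.
set.seed(2)
obs <- data.frame(
  a = c(rnorm(10, 50, 5), rnorm(10, 300, 5)),
  b = c(rnorm(10, 18, 1), rnorm(10, 6, 1)),
  f = rep(c(16001, 11012), each = 10)
)

# Scale and measure distances on the environmental columns only;
# the label column f is deliberately left out.
x <- scale(obs[, c("a", "b")])
hc_obs <- hclust(dist(x))

# The species ID is used purely to label the leaves of the dendrogram.
plot(hc_obs, labels = as.character(obs$f))

# Cut the tree into two groups and compare them with the species IDs.
groups <- cutree(hc_obs, k = 2)
print(table(species = obs$f, cluster = groups))
```

The cross-table shows how the clusters found from the environment line up with the species, without the ID itself having influenced the clustering.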