[R-sig-Geo] dendrogram of large dataset

Tue Apr 5 18:42:38 CEST 2011

Hi All,

I'm trying to generate a dendrogram from a set of observations.

I have all my data in PostGIS. Using a spatial query, I extracted the data for a specific bounding box and generated the following output in CSV format:

a,b,c,d,e,f
1      67.12 4.280 1.78250 30 3 16001
2      67.12 4.280 1.78250 30 3 16001
3      66.57 4.280 1.35500 30 3 16001
...
16665  68.21 3.605 2.34190 48 2 18004
16666  68.21 3.605 2.34190 48 2 18004
...
18665 ... ... ... ... 18004
18666 ... ... ... ... 18004
...

where:

a, b, c, d, e are parameters associated with a specific ID (f) that represents a biological species.

> data <- read.csv('x.txt', header = TRUE)
> data
           a     b       c  d e     f
1      67.12 4.280 1.78250 30 3 16001
2      67.12 4.280 1.78250 30 3 16001
3      66.57 4.280 1.35500 30 3 16001
...
16665  68.21 3.605 2.34190 48 2 16001
16666  68.21 3.605 2.34190 48 2 16001
 [ reached getOption("max.print") -- omitted 36039 rows ]]

> summary(data)
       a                b               c                d        
 Min.   : 57.97   Min.   :3.594   Min.   :0.7037   Min.   : 1.00  
 1st Qu.: 64.74   1st Qu.:4.299   1st Qu.:1.1792   1st Qu.: 6.00  
 Median : 67.30   Median :4.551   Median :1.4144   Median : 6.00  
 Mean   : 69.47   Mean   :5.796   Mean   :1.5356   Mean   :17.55  
 3rd Qu.: 69.13   3rd Qu.:8.861   3rd Qu.:1.7391   3rd Qu.:30.00  
 Max.   :133.69   Max.   :9.745   Max.   :4.9751   Max.   :54.00  
       e               f        
 Min.   :1.000   Min.   :10022  
 1st Qu.:1.000   1st Qu.:11027  
 Median :2.000   Median :16001  
 Mean   :2.088   Mean   :15191  
 3rd Qu.:3.000   3rd Qu.:16001  
 Max.   :4.000   Max.   :30016  

> str(data)
'data.frame':	52705 obs. of  6 variables:
 $ a: num  67.1 67.1 66.6 66.2 66.2 ...
 $ b: num  4.28 4.28 4.28 4.28 4.28 4.28 4.28 4.28 4.28 4.28 ...
 $ c: num  1.78 1.78 1.35 1.35 1.35 ...
 $ d: int  30 30 30 13 13 13 13 13 13 13 ...
 $ e: int  3 3 3 3 3 3 3 3 3 3 ...
 $ f: int  16001 16001 16001 16001 16001 16001 16001 16001 16001 16001 ...

# f is the species ID (a, b, c, d, e are environmental parameters derived from measurements)
# I'm trying to group my species into several communities based on common environmental parameters

> x <- as.matrix(data[-6]) 
> x
          a     b       c  d e
[1,]  67.12 4.280 1.78250 30 3
[2,]  67.12 4.280 1.78250 30 3
[3,]  66.57 4.280 1.35500 30 3
...
[19997,]  67.30 9.736 1.25000  2 2
[19998,]  67.30 9.736 1.76990  2 2
[19999,]  67.85 9.735 1.15740  2 2
 [ reached getOption("max.print") -- omitted 32706 rows ]]

> x <- x[sample(seq_len(nrow(x))), ] 
> d <- dist(x)
Error: cannot allocate vector of size 10.3 Gb

> sessionInfo()
R version 2.12.1 (2010-12-16)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] C

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] seriation_1.0-3  colorspace_1.0-1 gclus_1.3        TSP_1.0-2       
[5] cluster_1.13.3  
> 

I'm using the seriation package. From googling, it seems to do what I need, but it gave me the error above.
Do you know if I'm doing something wrong? Or maybe I have to change something in the memory management?
I'm running R on a 64-bit Debian sid distro (dual quad-core, 4 GB RAM).
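
For reference, the size in the error message can be reproduced by hand. This is a minimal sketch, assuming that dist() stores one double (8 bytes) per pair of rows, which is how the stats package documents its result (the lower triangle of the distance matrix), and taking the row count from the str(data) output above:

```r
# Number of observations reported by str(data) in the post.
n <- 52705

# dist() keeps only the lower triangle: n * (n - 1) / 2 pairwise distances.
n_pairs <- n * (n - 1) / 2

# Each distance is a double (8 bytes); R reports sizes in units of 1024^3 bytes.
size_gb <- n_pairs * 8 / 1024^3
round(size_gb, 1)   # 10.3
```

The result matches the "10.3 Gb" in the error, so the allocation is far larger than the 4 GB of RAM mentioned, regardless of memory settings.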

(I tried a similar analysis in MATLAB using the function "manovacluster" [1]. Maybe it uses a different approach, but with the same dataset I had no memory issue.)
[1] http://www.mathworks.com/help/toolbox/stats/manovacluster.html
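
One possible reason for the difference, offered as an assumption rather than something confirmed here: as far as I know, manovacluster builds its dendrogram from distances between group means, so it never needs a distance for every pair of observations. A rough R sketch of that idea follows. The data frame is a small synthetic stand-in (hypothetical values, with the same column names a to f), and it uses Euclidean distance on standardised species means, which is a simplification and not the Mahalanobis distance that manovacluster is documented to use.

```r
# Small synthetic stand-in for the real table (hypothetical values):
# five environmental parameters a..e and a species ID f.
set.seed(1)
data <- data.frame(
  a = rnorm(300, 67, 3),
  b = runif(300, 3.5, 9.7),
  c = runif(300, 0.7, 5),
  d = sample(1:54, 300, replace = TRUE),
  e = sample(1:4, 300, replace = TRUE),
  f = sample(c(10022, 11027, 16001, 18004, 30016), 300, replace = TRUE)
)

# One row per species: the mean of each parameter within that species.
means <- aggregate(. ~ f, data = data, FUN = mean)
rownames(means) <- means$f

# Standardise columns so no single parameter dominates (an assumption,
# not from the post), then compute distances between species only.
m <- scale(as.matrix(means[, c("a", "b", "c", "d", "e")]))
d <- dist(m)

# Agglomerative clustering; the dendrogram leaves are species IDs.
hc <- hclust(d, method = "average")
plot(hc, main = "Species clustered by mean environmental parameters")
```

With one row per species, the distance object has only k * (k - 1) / 2 entries for k species, so memory is no longer the limiting factor. Whether species means are an adequate summary for this data is a separate question.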

Thanks for any hints!

Massimo.