[R-sig-Geo] dendrogram of large dataset
Massimo Di Stefano
massimodisasha at gmail.com
Tue Apr 5 18:42:38 CEST 2011
Hi All,
I'm trying to generate a dendrogram from a set of observations.
All my data is in PostGIS. Using a spatial query, I extracted the data for a specific bounding box and generated the following output as a CSV file:
a,b,c,d,e,f
1 67.12 4.280 1.78250 30 3 16001
2 67.12 4.280 1.78250 30 3 16001
3 66.57 4.280 1.35500 30 3 16001
...
16665 68.21 3.605 2.34190 48 2 18004
16666 68.21 3.605 2.34190 48 2 18004
...
18665 ... ... ... ... 18004
18666 ... ... ... ... 18004
...
where:
a, b, c, d, e are parameters associated with a specific ID (f) that represents a biological species.
> data <- read.csv('x.txt', header = TRUE)
> data
a b c d e f
1 67.12 4.280 1.78250 30 3 16001
2 67.12 4.280 1.78250 30 3 16001
3 66.57 4.280 1.35500 30 3 16001
...
16665 68.21 3.605 2.34190 48 2 16001
16666 68.21 3.605 2.34190 48 2 16001
[ reached getOption("max.print") -- omitted 36039 rows ]]
> summary(data)
a b c d
Min. : 57.97 Min. :3.594 Min. :0.7037 Min. : 1.00
1st Qu.: 64.74 1st Qu.:4.299 1st Qu.:1.1792 1st Qu.: 6.00
Median : 67.30 Median :4.551 Median :1.4144 Median : 6.00
Mean : 69.47 Mean :5.796 Mean :1.5356 Mean :17.55
3rd Qu.: 69.13 3rd Qu.:8.861 3rd Qu.:1.7391 3rd Qu.:30.00
Max. :133.69 Max. :9.745 Max. :4.9751 Max. :54.00
e f
Min. :1.000 Min. :10022
1st Qu.:1.000 1st Qu.:11027
Median :2.000 Median :16001
Mean :2.088 Mean :15191
3rd Qu.:3.000 3rd Qu.:16001
Max. :4.000 Max. :30016
> str(data)
'data.frame': 52705 obs. of 6 variables:
$ a: num 67.1 67.1 66.6 66.2 66.2 ...
$ b: num 4.28 4.28 4.28 4.28 4.28 4.28 4.28 4.28 4.28 4.28 ...
$ c: num 1.78 1.78 1.35 1.35 1.35 ...
$ d: int 30 30 30 13 13 13 13 13 13 13 ...
$ e: int 3 3 3 3 3 3 3 3 3 3 ...
$ f: int 16001 16001 16001 16001 16001 16001 16001 16001 16001 16001 ...
# f is the species ID (a, b, c, d, e are environmental parameters derived from measurements)
# I'm trying to group my species into several communities based on common environmental parameters
> x <- as.matrix(data[-6])
> x
a b c d e
[1,] 67.12 4.280 1.78250 30 3
[2,] 67.12 4.280 1.78250 30 3
[3,] 66.57 4.280 1.35500 30 3
...
[19997,] 67.30 9.736 1.25000 2 2
[19998,] 67.30 9.736 1.76990 2 2
[19999,] 67.85 9.735 1.15740 2 2
[ reached getOption("max.print") -- omitted 32706 rows ]]
> x <- x[sample(seq_len(nrow(x))), ]
> d <- dist(x)
Error: cannot allocate vector of size 10.3 Gb
> sessionInfo()
R version 2.12.1 (2010-12-16)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] C
attached base packages:
[1] grid stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] seriation_1.0-3 colorspace_1.0-1 gclus_1.3 TSP_1.0-2
[5] cluster_1.13.3
>
I'm using the seriation package. From what I found by googling, it seems to do what I need, but it gave me the error shown above.
Do you know if I'm doing something wrong? Or maybe I have to change something in the memory management?
I'm running R on a 64-bit Debian sid distro (dual quad-core, 4 GB RAM).
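As a rough check on where the size in the error message comes from: dist() keeps one double for every pair of rows, so the request grows with the square of the number of observations. The sketch below only redoes that arithmetic for the 52705 rows reported by str(data). It assumes 8 bytes per double and that R reports sizes in units of 1024^3 bytes.

```r
# Number of observations reported by str(data)
n <- 52705

# dist() stores the lower triangle of the distance matrix:
# one value per unordered pair of rows
n_pairs <- n * (n - 1) / 2

# 8 bytes per double, expressed in units of 1024^3 bytes
size_gb <- n_pairs * 8 / 1024^3
print(round(size_gb, 1))
```

This comes out at about 10.3, the same figure as in the error, and well above the 4 GB of RAM on the machine.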
(I tried a similar analysis in MATLAB, using the function "Manova" [1]. Maybe it uses a different approach, but with the same dataset I had no memory issue.)
[1] http://www.mathworks.com/help/toolbox/stats/manovacluster.html
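Since the goal is to group species rather than individual observations, one possible way around the memory limit is to reduce the table to one row per species ID before calling dist(). The following is a minimal sketch of that idea, not a confirmed solution. It assumes the per-species mean is an acceptable summary, that the parameters should be scaled, and that average linkage is wanted. The toy data frame is a hypothetical stand-in with the same column names as the real table, and its values are illustrative only.

```r
# Hypothetical toy table: columns a..e are parameters, f is the species ID.
# Values are for illustration and are not the real dataset.
toy <- data.frame(
  a = c(67.12, 66.57, 68.21, 68.21, 60.10, 61.30),
  b = c(4.280, 4.280, 3.605, 3.605, 9.100, 9.300),
  c = c(1.7825, 1.3550, 2.3419, 2.3419, 1.1000, 1.2000),
  d = c(30, 30, 48, 48, 6, 6),
  e = c(3, 3, 2, 2, 1, 1),
  f = c(16001, 16001, 18004, 18004, 10022, 10022)
)

# One row per species: mean of each parameter within a species (assumption)
per_species <- aggregate(cbind(a, b, c, d, e) ~ f, data = toy, FUN = mean)

# Scale the columns so that no single parameter dominates the distance
m <- scale(as.matrix(per_species[, c("a", "b", "c", "d", "e")]))
rownames(m) <- per_species$f

# The distance matrix now has one entry per pair of species,
# not per pair of observations
d <- dist(m)
hc <- hclust(d, method = "average")
plot(as.dendrogram(hc))
```

With the real table in place of toy, the size of the distance matrix would depend on the number of distinct species IDs rather than on the 52705 observations.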
Thanks for any hints!
Massimo.