[R-sig-Geo] Skater grouping effectiveness, within- and between-group similarities

Wed Sep 21 12:44:38 CEST 2016

I asked Elias Krainski (the author of skater()), who replied as copied 
inline below:

On Tue, 13 Sep 2016, Michael O'Donnell wrote:

> Hi,
>
> I am interested in calculating multiple statistics based on 
> skater{spdep} results for a SpatialPointsDataFrame, and I was wondering 
> if someone could help me verify that what I have done is correct (Q1).
>
> My objective is to evaluate the performance of the clustering while 
> using different parameters for different skater() runs. Specifically, I 
> am not sure how to measure the within-group similarity and I believe the 
> other statistics are defined correctly.
>
> Also, can someone provide more details on the objects "not.prune" and 
> "candidates" (Q2)?

not.prune is the set of edges that, if pruned, would generate groups that 
do not satisfy the restriction. For example, when you want groups with at 
least 10 areas, at some point a group stops being considered for pruning 
because of this restriction.
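
To make the size restriction concrete, here is a toy sketch in base R. It 
is not spdep's internal code: it assumes a simple chain of areas, and 
can_cut() is a hypothetical helper written only for this illustration.

```r
# Toy illustration only: a chain of n areas, where cutting the edge between
# area i and area i + 1 yields two groups of sizes i and n - i.
# can_cut() is a hypothetical helper, not part of spdep.
can_cut <- function(n, min_size) {
  i <- seq_len(n - 1)                      # candidate edges 1 .. n - 1
  i[i >= min_size & (n - i) >= min_size]   # keep cuts where both sides are big enough
}

min_size <- 10

# A group of 25 areas can still be split at a few edges
first_cuts <- can_cut(25, min_size)
first_cuts

# A group of 13 areas cannot be split without breaking the restriction,
# so it would no longer be considered for pruning
later_cuts <- can_cut(13, min_size)
length(later_cuts)
```

With a minimum size of 10, a group of 13 areas has no admissible cut left, 
which is the situation described above.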

>
> Q1 ------------------------------ These are the statistics that I would
> like to calculate:
> res1 <- skater() # Example of skater object
>
> # The sum of the between-group dissimilarity
> sst <- res1$ssto
>
> # The within-group similarity
> sse <- sum(res1$ssw)/max(res1$groups)

SSW is the sum of homogeneity at each step of the SKATER algorithm. So the 
first number coincides with SSTO, the second is for the case of two 
groups, the third for the case of three groups, and so on. That is, it has 
length equal to the number of clusters. However, res1$groups identifies 
which group each area belongs to and has length equal to the number of 
areas. So it doesn't make sense to divide sum(res1$ssw) by the number of 
groups. You may want 
res1$ssw/1:length(res1$ssw)
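
As a sketch of the difference, using made-up numbers in place of a real 
skater() result (the list below is hypothetical and only mimics the ssw 
and groups components as described here):

```r
# Hypothetical stand-in for a skater() result with 4 groups over 8 areas;
# all numbers are invented for illustration.
res1 <- list(
  ssw    = c(100, 60, 45, 40),        # one value per partition: 1, 2, 3, 4 groups
  groups = c(1, 1, 2, 2, 3, 3, 4, 4)  # one group label per area
)

length(res1$ssw)     # number of partitions (here, the number of clusters)
length(res1$groups)  # number of areas

# Sums the values of all partitions together, so it is hard to interpret
mixed <- sum(res1$ssw) / max(res1$groups)

# The suggestion above: each partition's value divided by its number of groups
per_group <- res1$ssw / seq_along(res1$ssw)
per_group
```

seq_along(res1$ssw) is equivalent to 1:length(res1$ssw) here and is a 
little safer if the vector happens to be empty.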

>
> # R2
> R2 <- (sst-sse)/sst

Is the aim to compute some kind of gain from having groups? The gain 
can be the difference between consecutive partitions, like diff(res1$ssw)
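
A minimal sketch of that idea, again with invented numbers rather than a 
real result. Expressing the gain relative to the first element is my own 
assumption (it relies on the first element coinciding with SSTO):

```r
# Invented ssw values for partitions with 1, 2, 3 and 4 groups
ssw <- c(100, 60, 45, 40)

# diff() returns negative numbers when ssw decreases, so flip the sign
# to read it as the reduction achieved by each additional group
gain <- -diff(ssw)
gain

# Assumption: express each gain as a share of the first element of ssw
rel_gain <- gain / ssw[1]
rel_gain
```

A sharp drop in the gain from one partition to the next may suggest that 
further cuts add little.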

>
> # AIC,AICc
> # AIC = n*log(SSD/n)+2*cov_count
> # AICc = AIC + 2*cov_count*(cov_count+1)/(n-cov_count-1)
> # Number of covariates considered by skater and provided in data
> cov_count <- 1
> n_count <- nrow(shape2) # Node count
> aic <- n_count * log(sst / n_count) + 2.0 * cov_count
> aicc <- aic + 2.0 * cov_count * (cov_count + 1.0)/(n_count - cov_count -
> 1.0)

I'm not sure about this anymore...

>
> # Calinski-Harabasz pseudo F-statistic
> nc <- max(res1$groups)
> n <- nrow(shape2)
> fstat = (R2 / (nc - 1)) / ((1 - R2) / (n - nc))

It will be useful to consider the function index.G1 from the clusterSim 
package.
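
For reference, a base R sketch of the quantity involved, the 
Calinski-Harabasz pseudo F-statistic. ch_index() is a hypothetical helper 
written for this message, not the clusterSim implementation, and the data 
are a toy example. As far as I know, index.G1(x, cl) in clusterSim 
computes the same statistic, but please check its help page.

```r
# Calinski-Harabasz pseudo F: (between SS / (k - 1)) / (within SS / (n - k))
# ch_index() is a hypothetical helper, not part of spdep or clusterSim.
ch_index <- function(x, cl) {
  x <- as.matrix(x)
  n <- nrow(x)
  k <- length(unique(cl))
  # Total sum of squares around the overall column means
  total_ss <- sum(scale(x, center = TRUE, scale = FALSE)^2)
  # Within-group sum of squares around each group's column means
  within_ss <- sum(vapply(split(seq_len(n), cl), function(idx) {
    sum(scale(x[idx, , drop = FALSE], center = TRUE, scale = FALSE)^2)
  }, numeric(1)))
  between_ss <- total_ss - within_ss
  (between_ss / (k - 1)) / (within_ss / (n - k))
}

# Toy data: one variable, two well separated groups of three areas each
x  <- matrix(c(1, 2, 3, 11, 12, 13), ncol = 1)
cl <- c(1, 1, 1, 2, 2, 2)   # with skater, this would be res1$groups
f_stat <- ch_index(x, cl)
f_stat
```

With R2 defined as between SS over total SS, this is the same as 
(R2 / (k - 1)) / ((1 - R2) / (n - k)).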

>
> # Review
> print(c(aic, aicc, fstat, R2))
>
> Q2 ------------------------------
> Define "not.prune" and "candidates"
>
> For example, are candidates a list of cluster groups that are 
> statistically significant while not.prune is a list of nodes that did 
> not get assigned to a group. I have not been able to locate enough 
> documentation on these objects and I am not sure how to interpret.

No. We haven't considered any kind of statistical test. As I mentioned 
above, the not.prune are those that don't match the criteria (about the 
size of the cluster).

Elias

>
> Thank you for your assistance,
> Mike
>
>

-- 
Roger Bivand
Department of Economics, Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; fax +47 55 95 91 00
e-mail: Roger.Bivand at nhh.no
http://orcid.org/0000-0003-2392-6140
https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en
http://depsy.org/person/434412