[R-sig-Geo] Skater grouping effectiveness, within- and between-group similarities
Roger Bivand
Roger.Bivand at nhh.no
Wed Sep 21 12:44:38 CEST 2016
I asked Elias Krainski (the author of skater()), who replied as copied
inline below:
On Tue, 13 Sep 2016, Michael O'Donnell wrote:
> Hi,
>
> I am interested in calculating multiple statistics based on
> skater{spdep} results for a SpatialPointsDataFrame, and I was wondering
> if someone could help me verify that what I have done is correct (Q1).
>
> My objective is to evaluate the performance of the clustering while
> using different parameters for different skater() runs. Specifically, I
> am not sure how to measure the within-group similarity and I believe the
> other statistics are defined correctly.
>
> Also, can someone provide more details on the objects "not.prune" and
> "candidates" (Q2)?
not.prune is the set of edges that, if pruned, would generate groups that
do not satisfy the restriction. For example, when you want groups with at
least 10 areas, at some point a group stops being considered for pruning
because of this.
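
To illustrate the restriction, here is a small base-R sketch on a made-up
tree. It is not the code used inside skater(); the edge list, the minimum
size and the helper component_size() are hypothetical and only show the
idea of an edge whose removal would leave a group that is too small.

```r
# Hypothetical toy tree (not skater() output): 6 nodes joined in a line.
edges <- cbind(from = 1:5, to = 2:6)
min_size <- 3  # assumed minimum number of areas per group

# Size of the component containing node `start` after removing edge `cut_idx`
component_size <- function(edges, cut_idx, start) {
  kept <- edges[-cut_idx, , drop = FALSE]
  seen <- unname(start)
  frontier <- seen
  while (length(frontier) > 0) {
    # Nodes joined to the current frontier by one of the remaining edges
    nbrs <- c(kept[kept[, 1] %in% frontier, 2],
              kept[kept[, 2] %in% frontier, 1])
    frontier <- setdiff(nbrs, seen)
    seen <- c(seen, frontier)
  }
  length(seen)
}

n_nodes <- length(unique(as.vector(edges)))
violates <- vapply(seq_len(nrow(edges)), function(i) {
  size_a <- component_size(edges, i, edges[i, 1])
  # An edge violates the restriction if either side would be too small
  min(size_a, n_nodes - size_a) < min_size
}, logical(1))

# Edges whose removal would create a group smaller than min_size
print(which(violates))
```

In this toy case only the middle edge can be cut, because it is the only
one that leaves at least three nodes on each side.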
>
> Q1 ------------------------------ These are the statistics that I would
> like to calculate:
> res1 <- skater() # Example of skater object
>
> # The sum of the between-group dissimilarity
> sst <- res1$ssto
>
> # The within-group similarity
> sse <- sum(res1$ssw)/max(res1$groups)
SSW is the sum of homogeneity at each step of the SKATER algorithm. So the
first number coincides with SSTO, the second is for the case of two
groups, the third for the case of three groups, and so on. That is, it has
length equal to the number of clusters. However, res1$groups identifies
the group each area belongs to and has length equal to the number of
areas. So it doesn't make sense to divide sum(res1$ssw) by the number of
groups. You may want
res1$ssw/1:length(res1$ssw)
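
A minimal sketch of the length mismatch, using made-up numbers rather than
real skater() output. The vectors ssw and groups are hypothetical stand-ins
for res1$ssw and res1$groups, built to follow the description above (first
element equal to SSTO, one element per number of groups).

```r
# Hypothetical stand-ins for res1$ssw and res1$groups (not real skater() output).
# ssw[k] is assumed to be the within-group sum for the partition with k groups.
ssw <- c(120, 80, 55, 40)                  # ssw[1] plays the role of SSTO
groups <- c(1, 1, 2, 2, 2, 3, 3, 4, 4, 4)  # one group label per area

length(ssw)     # number of partitions considered (1 to 4 groups)
length(groups)  # number of areas

# Average within-group sum per group, for each partition size
ssw_per_group <- ssw / seq_along(ssw)
print(ssw_per_group)
```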
>
> # R2
> R2 <- (sst-sse)/sst
Is the aim to compute some kind of gain from having groups? The gain
can be the difference between consecutive partitions, like diff(res1$ssw)
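
As a rough sketch of that idea, again with made-up values standing in for
res1$ssw: the sign is flipped so that a drop in the within-group sum shows
up as a positive gain, and the R2-like share is one possible reading of
what was asked for, not something skater() returns.

```r
# Hypothetical values standing in for res1$ssw (not real skater() output).
ssw <- c(120, 80, 55, 40)

# Reduction in the within-group sum gained by each additional cut
gain <- -diff(ssw)

# Share of the total (ssw[1], i.e. SSTO) removed by each partition,
# an R2-like measure for 1, 2, 3, ... groups
r2_by_step <- 1 - ssw / ssw[1]

print(gain)
print(r2_by_step)
```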
>
> # AIC,AICc
> # AIC = n*log(SSD/n)+2*cov_count
> # AICc = AIC + 2*cov_count*(cov_count+1)/(n-cov_count-1)
> cov_count <- 1 # Number of covariates considered by skater and provided in data
> n_count <- nrow(shape2) # Node count
> aic <- n_count * log(sst / n_count) + 2.0 * cov_count
> aicc <- aic + 2.0 * cov_count * (cov_count + 1.0)/(n_count - cov_count -
> 1.0)
I'm not sure about this anymore...
>
> # Calinski-Harabasz pseudo F-statistic
> nc <- max(res1$groups)
> n <- nrow(shape2)
> fstat = (R2 / (nc - 1)) / ((1 - R2) / (n - nc))
It will be useful to consider the function index.G1 from the clusterSim
package.
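
For reference, a base-R sketch of the Calinski-Harabasz pseudo F-statistic
on made-up data. The matrix x, the labels cl and the helper
calinski_harabasz() are hypothetical; with clusterSim installed,
index.G1(x, cl) should give the packaged version of the same statistic
(assuming its default centroid-based form).

```r
# Made-up data: two well-separated clouds of 20 points each.
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 4), ncol = 2))
cl <- rep(1:2, each = 20)  # hypothetical group labels, as in res1$groups

calinski_harabasz <- function(x, cl) {
  n <- nrow(x)
  k <- length(unique(cl))
  # Total sum of squares around the overall centroid
  total_ss <- sum(scale(x, center = TRUE, scale = FALSE)^2)
  # Within-group sum of squares around each group centroid
  within_ss <- sum(vapply(split(as.data.frame(x), cl), function(g) {
    sum(scale(as.matrix(g), center = TRUE, scale = FALSE)^2)
  }, numeric(1)))
  between_ss <- total_ss - within_ss
  (between_ss / (k - 1)) / (within_ss / (n - k))
}

fstat <- calinski_harabasz(x, cl)
print(fstat)
```

This is algebraically the same as (R2 / (nc - 1)) / ((1 - R2) / (n - nc))
when R2 is the between-group share of the total sum of squares.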
>
> # Review
> print(c(aic, aicc, fstat, R2))
>
> Q2 ------------------------------
> Define "not.prune" and "candidates"
>
> For example, are candidates a list of cluster groups that are
> statistically significant while not.prune is a list of nodes that did
> not get assigned to a group. I have not been able to locate enough
> documentation on these objects and I am not sure how to interpret.
No. We haven't considered any kind of statistical test. As I mentioned
above, the not.prune ones are those that do not match the criteria (about
the size of the cluster).
Elias
>
> Thank you for your assistance,
> Mike
>
>
--
Roger Bivand
Department of Economics, Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; fax +47 55 95 91 00
e-mail: Roger.Bivand at nhh.no
http://orcid.org/0000-0003-2392-6140
https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en
http://depsy.org/person/434412