[R] Issue:CCC Doesn't Match in R and SAS
sagnik chakravarty
sagnik.stats at gmail.com
Fri Feb 20 14:34:56 CET 2015
Hi,
I was trying to calculate the CCC metric in R with the help of "NbClust"
source code and the available SAS manual for CCC. All the Pseudo-F and
R-square are matching exactly with the SAS output except for the E_R2 and
hence CCC. I have tried and tested in multiple ways but couldn’t get any
explanation for this.
I have attached sample data, initial seed and also the SAS cluster output
in Rdata format which I used for E_R2 and CCC calculation.
FYI, following are the values of E_R2 in SAS and R respectively:
E_R2=0.4630339 (R); but ERSQ=0.3732597284 (SAS)
Could you please help me out with finding what's going wrong in the
background?
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Kindly find below the codes I have used for this:
----------
<SAS>
----------
proc fastclus data=sample_data
maxiter=100 seed=initial_seed maxc=5 outstat=metrics out=output;
Var v1 v2 v3 v4 v5 v6;
run;
----------
<R>
----------
load("C:\\Users\\sagnik\\Desktop\\SAS_cluster.Rdata")
clust.perf.metrics <- function(data, cl) {
data1 <- as.matrix(data)
numberObsBefore <- dim(data1)[1]
data <- na.omit(data1)
nn <- numberObsAfter <- dim(data)[1]
pp <- dim(data)[2]
qq <- max(cl)
TT <- t(data) %*% data
sizeEigenTT <- length(eigen(TT)$value)
eigenValues <- eigen(TT/(nn - 1))$value
for (i in 1:sizeEigenTT) {
if (eigenValues[i] < 0) {
cat(paste("There are only", numberObsAfter, "non-missing observations
out of a possible",
numberObsBefore, "observations."))
stop("The TSS matrix is indefinite. There must be too many missing
values. The index cannot be calculated.")
}
}
s1 <- sqrt(eigenValues)
ss <- rep(1, sizeEigenTT)
for (i in 1:sizeEigenTT) {
if (s1[i] != 0)
ss[i] = s1[i]
}
vv <- prod(ss)
z <- matrix(0, ncol = qq, nrow = nn)
clX <- as.matrix(cl)
for (i in 1:nn)
for (j in 1:qq) {
z[i, j] == 0
if (clX[i, 1] == j) z[i, j] = 1
}
xbar <- solve(t(z) %*% z) %*% t(z) %*% data
B <- t(xbar) %*% t(z) %*% z %*% xbar
W <- TT - B
R2 <- 1 - (sum(diag(W))/sum(diag(TT)))
PseudoF <- (sum(diag(B))/(qq-1))/(sum(diag(W))/(nn-qq))
v1 <- 1
u1 <- rep(0, pp)
c1 <- (vv/qq)^(1/pp)
u1 <- ss/c1
k1 <- sum((u1 >= 1) == TRUE)
p1 <- min(k1, qq - 1)
if (all(p1 > 0, p1 < pp)) {
for (i in 1:p1) { v1 <- v1 * ss[i]}
c <- (v1/qq)^(1/p1)
u <- ss/c
b1 <- sum(1/(nn + u[1:p1]))
b2 <- sum(u[(p1 + 1):pp]^2/(nn + u[(p1 + 1):pp]), na.rm = TRUE)
E_R2 <- 1 - ((b1 + b2)/sum(u^2)) * ((nn - qq)^2/nn) * (1 + (4/nn))
ccc <- log((1 - E_R2)/(1 - R2)) * (sqrt(nn * p1/2)/((0.001 + E_R2)^1.2))
} else {
b1 <- sum(1/(nn + u))
E_R2 <- 1 - (b1/sum(u^2)) * ((nn - qq)^2/nn) * (1 + 4/nn)
ccc <- log((1 - E_R2)/(1 - R2)) * (sqrt(nn * pp/2)/((0.001 + E_R2)^1.2))
}
results <- list(R_2=R2, PseudoF=PseudoF, CCC = ccc, E_R2=E_R2);
return(results)
}
clust.perf.metrics(output[,1:6],output[,7])
#-----------------------------------------------------------------------------------------------------------------------------------
THANKS IN ADVANCE,
REGARDS,
SAGNIK
More information about the R-help
mailing list