[R] Issue:CCC Doesn't Match in R and SAS

sagnik chakravarty sagnik.stats at gmail.com
Fri Feb 20 14:34:56 CET 2015


Hi,

I was trying to calculate the CCC metric in R with the help of "NbClust"
source code and the available SAS manual for CCC. All the Pseudo-F and
R-square are matching exactly with the SAS output except for the E_R2 and
hence CCC. I have tried and tested in multiple ways but couldn’t get any
explanation for this.

I have attached sample data, initial seed and also the SAS cluster output
in Rdata format which I used for E_R2 and CCC calculation.

FYI, following are the values of E_R2 in SAS and R respectively:

E_R2=0.4630339 (R); but ERSQ=0.3732597284 (SAS)

Could you please help me out with finding what's going wrong in the
background?

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Kindly find below the codes I have used for this:

----------

<SAS>

----------

proc fastclus data=sample_data

maxiter=100 seed=initial_seed maxc=5 outstat=metrics out=output;

Var v1 v2 v3 v4 v5 v6;

run;



----------

<R>

----------

load("C:\\Users\\sagnik\\Desktop\\SAS_cluster.Rdata")



clust.perf.metrics <- function(data, cl) {

  data1 <- as.matrix(data)

  numberObsBefore <- dim(data1)[1]

  data <- na.omit(data1)

  nn <- numberObsAfter <- dim(data)[1]

  pp <- dim(data)[2]

  qq <- max(cl)

  TT <- t(data) %*% data

  sizeEigenTT <- length(eigen(TT)$value)

  eigenValues <- eigen(TT/(nn - 1))$value

  for (i in 1:sizeEigenTT) {

    if (eigenValues[i] < 0) {

      cat(paste("There are only", numberObsAfter, "non-missing observations
out of a possible",

                numberObsBefore, "observations."))

      stop("The TSS matrix is indefinite. There must be too many missing
values. The index cannot be calculated.")

    }

  }



  s1 <- sqrt(eigenValues)

  ss <- rep(1, sizeEigenTT)

  for (i in 1:sizeEigenTT) {

    if (s1[i] != 0)

      ss[i] = s1[i]

  }

  vv <- prod(ss)

  z <- matrix(0, ncol = qq, nrow = nn)

  clX <- as.matrix(cl)

  for (i in 1:nn)

    for (j in 1:qq) {

      z[i, j] == 0

      if (clX[i, 1] == j) z[i, j] = 1

    }

  xbar <- solve(t(z) %*% z) %*% t(z) %*% data

  B <- t(xbar) %*% t(z) %*% z %*% xbar

  W <- TT - B

  R2 <- 1 - (sum(diag(W))/sum(diag(TT)))

  PseudoF <- (sum(diag(B))/(qq-1))/(sum(diag(W))/(nn-qq))



  v1 <- 1

  u1 <- rep(0, pp)

  c1 <- (vv/qq)^(1/pp)

  u1 <- ss/c1

  k1 <- sum((u1 >= 1) == TRUE)

  p1 <- min(k1, qq - 1)



  if (all(p1 > 0, p1 < pp)) {

    for (i in 1:p1) { v1 <- v1 * ss[i]}

    c <- (v1/qq)^(1/p1)

    u <- ss/c

    b1 <- sum(1/(nn + u[1:p1]))

    b2 <- sum(u[(p1 + 1):pp]^2/(nn + u[(p1 + 1):pp]), na.rm = TRUE)

    E_R2 <- 1 - ((b1 + b2)/sum(u^2)) * ((nn - qq)^2/nn) * (1 + (4/nn))

    ccc <- log((1 - E_R2)/(1 - R2)) * (sqrt(nn * p1/2)/((0.001 + E_R2)^1.2))

  } else {

    b1 <- sum(1/(nn + u))

    E_R2 <- 1 - (b1/sum(u^2)) * ((nn - qq)^2/nn) * (1 + 4/nn)

    ccc <- log((1 - E_R2)/(1 - R2)) * (sqrt(nn * pp/2)/((0.001 + E_R2)^1.2))

  }

  results <- list(R_2=R2, PseudoF=PseudoF, CCC = ccc, E_R2=E_R2);
return(results)

}



clust.perf.metrics(output[,1:6],output[,7])

#-----------------------------------------------------------------------------------------------------------------------------------

THANKS IN ADVANCE,

REGARDS,

SAGNIK


More information about the R-help mailing list