[R] homogenity inside groups

Fri Nov 16 13:21:13 CET 2007

On Fri, 2007-11-16 at 11:49 +0100, Petr PIKAL wrote:
> Yes, that is what I meant. It is not a species but some products and I 
> have various parameters measured for each product. But basically I thought 
> that ecological data are quite similar.
> 
> So I would be glad to be able to try your code.
> 
> Thank you

Dear Petr,

The code is attached the file. The functions use my analogue package (on
CRAN). I haven't written the Rd files yet but the functions are quite
simple and require few arguments. The code chunk below use data from the
vegan package as an example.

require(vegan)
require(analogue)
data(varespec)
dis <- vegdist(varespec)
groups <- factor(c(rep(1,16), rep(2,8)), labels = c("grazed","ungrazed"))
mod <- dispersion(dis, groups)
mod
anova(mod)
system.time(permDisp(mod))
system.time(perm.mod <- permDisp(mod, nperm = 9999))
perm.mod
plot(mod)
boxplot(mod)

Hope you find it useful and if you have any comments, let me know.

All the best,

G

> 
> Petr Pikal
> petr.pikal at precheza.cz
> 
> Gavin Simpson <gavin.simpson at ucl.ac.uk> napsal dne 15.11.2007 18:32:08:
> 
> > On Thu, 2007-11-15 at 16:07 +0100, Petr PIKAL wrote:
> > > Dear all
> > > 
> > > I would like to show my audience that some variables are homogenous 
> inside 
> > > groups but different outside. I can use by with summary for all 
> variables
> > > 
> > >  by(iris[,1:4],  iris$Species, summary)
> > > 
> > > what can be quite messy in case of more than few variables and about 8 
> 
> > > groups
> > > 
> > > or densityplot for one variable
> > > 
> > > densityplot(~Petal.Length | Species, iris)
> > > 
> > > I have two questions:
> > > 
> > > 1.      Is there any other plot to show all variables at once? 
> Something 
> > > like
> > > 
> > > densityplot(~iris[,1:4] | Species, iris)
> > > 
> > > 2.      Is it possible to evaluate homogenity of many (20-30) 
> variables 
> > > inside groups by some other function/table/graph?
> > 
> > Hi Petr,
> > 
> > I haven't replied-all by the way, in case I've misunderstood, but...
> > 
> > If you mean that you have a data set with say 10 samples split into 2
> > groups, and for each sample you have measured many variables (say
> > species in a quadrat or lots of morphological parameters on individual
> > plants), then one way might be to look at the work of Marti Anderson
> > [*]. She has developed a method that calculates the multivariate
> > distance between each sample in a group and that group's multivariate
> > centroid. You then take these distances to group centroid and do an
> > ANOVA on them, the general point being that this is a multivariate
> > analogue of something like a Levene's test and if groups variances are
> > heterogeneous then one or more groups will have a higher/lower mean
> > distance to centroid than the other groups.
> > 
> > groups <- something.that.gives.groups.as.a.factor()
> > dis <- something.that.gives.distances.centroids(my_data, groups)
> > anova(lm(dis ~ groups))
> > 
> > If this is the case, then I have some code that I'm currently working on
> > which does this, which works (!) and which I can send to you. Marti
> > tests for homogeneity using a permutation test. I have that as well, but
> > currently it doesn't give the same results as Marti's PERMDISP2
> > programme (standalone Fortran, source not available), though I can't see
> > what I'm doing wrong, if anything - and my permutation p-value closely
> > matches the ANOVA p-value for tests where the data don't violate ANOVA
> > assumptions too grossly - Levene's test is quite robust in this regard.
> > 
> > Let me know if this is what you meant and if the code will be useful and
> > I'll send a reply to the list for the archives and send you my code.
> > 
> > All the best,
> > 
> > G
> > 
> > [*] Anderson, M.J. (2006) Distance-based tests for homogeneity of
> > multivariate dispersions. Biometrics 62, 245--253
> > 
> > http://www.stat.auckland.ac.nz/~mja/Programs.htm
> > 
> > > 
> > > Thank you
> > > 
> > > Petr Pikal
> > > petr.pikal at precheza.cz
> > > 
> > > ______________________________________________
> > > R-help at r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> > > and provide commented, minimal, self-contained, reproducible code.
> > -- 
> > %~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
> >  Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
> >  ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
> >  Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
> >  Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
> >  UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
> > %~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
> > 
> 
-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
 ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
 Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
-------------- next part --------------
## calculates multivariate dispersion for groups, with
## appropriate permutation test
##
## Gavin Simpson (c) 2007
##
## Licence: GNU GPL Version 2
##
dispersion <- function(d, group, type = c("centroid","median"))
  {
    ## uses code from stats:::cmdscale by R Core Development Team
    if(!inherits(d, "dist"))
      stop("distances 'd' must be a 'dist' object")
    ## checks for groups - need to be a factor for later
    if(!is.factor(group))
      group <- as.factor(group)
    n <- attr(d, "Size")
    x <- matrix(0, ncol = n, nrow = n)
    x[row(x) > col(x)] <- d^2
    x <- x + t(x)
    storage.mode(x) <- "double"
    .C(stats:::R_dblcen, x, as.integer(n), DUP = FALSE)
    e <- eigen(-x/2, symmetric = TRUE)
    vectors <- e$vectors
    eig <- e$values
    ## store which are the positive eigenvalues
    pos <- eig > 0
    ## scale Eigenvectors
    vectors <- vectors %*% diag(sqrt(abs(eig)))
    ## Eigenvectors for negative Eigenvalues need a further scaling
    ## actually appears they don't need this at all - checked with PERMDISP2
    #vectors[, !pos] <- Re(sqrt(-1i) * vectors[, !pos])
    ## storage for centroids and distances to centroids
    centroids <- matrix(0, ncol = length(eig), nrow = length(levels(group)))
    dist.pos <- dist.neg <- numeric(length = n)
    ## for each of the groups, calculate distance to centroid for
    ## observation in the group
    for(i in levels(group)) {
      j <- which(levels(group) == i)
      centroids[j, ] <- colMeans(vectors[group == i, ])
      dist.pos[group == i] <- distance(vectors[group == i, pos],
                 centroids[j, pos, drop = FALSE], method = "SQeuclidean")
      dist.neg[group == i] <- distance(vectors[group == i, !pos],
                 centroids[j, !pos, drop = FALSE], method = "SQeuclidean")
    }
    colnames(vectors) <- colnames(centroids) <- names(eig) <-
      paste("PCoA", 1:n, sep = "")
    rownames(vectors) <- attr(d, "Labels")
    rownames(centroids) <- as.character(levels(group))
    ## zij are the distances of each point to its group centroid
    zij <- sqrt(dist.pos - dist.neg)
    retval <- list(eig = eig, vectors = vectors, distances = zij,
                   group = group, centroids = centroids, call = match.call())
    class(retval) <- "dispersion"
    attr(retval, "method") <- attr(d, "method")
    retval
  }

print.dispersion <- function(x, digits = max(3, getOption("digits") - 3),
                             ...) {
  cat("\n")
  writeLines(strwrap("Homogeneity of multivariate dispersions\n",
                     prefix = "\t"))
  cat("\n")
  writeLines(strwrap(pasteCall(x$call)))
  cat(paste("\nNo. of Positive Eigenvalues:", sum(x$eig > 0)))
  cat(paste("\nNo. of Negative Eigenvalues:", sum(x$eig < 0)))
  cat("\n\n")
  writeLines(strwrap("Average distance to centroid:\n"))
  print.default(tapply(x$distances, x$group, mean), digits = digits)
  cat("\n")
  writeLines(strwrap("Eigenvalues for PCoA axes:\n"))
  print.default(round(x$eig, digits = digits))
  invisible(x)
}

pasteCall <- function (call, prefix = "Call:") 
{
    objcall <- deparse(call)
    if (length(objcall) > 1) {
        call.str <- ""
        for (i in seq(along = objcall)) {
            call.str <- paste(call.str, objcall[i], sep = " ")
        }
    }
    else {
        call.str <- objcall
    }
    call.str <- paste(prefix, call.str, "\n", sep = " ")
    return(call.str)
}

boxplot.dispersion <- function(x, ylab = "Distance to centroid", ...) {
  tmp <- boxplot(x$distances ~ x$group, ylab = ylab, ...)
  invisible(tmp)
}

plot.dispersion <- function(x, axes = c(1,2), cex = 0.7, hull = TRUE,
                            ylab, xlab, main, sub, ...) {
  if(missing(main))
    main <- deparse(substitute(x))
  if(missing(sub))
    sub <- paste("method = \"", attr(x, "method"), "\"", sep = "")
  if(missing(xlab))
    xlab <- paste("PCoA", axes[1])
  if(missing(ylab))
    ylab <- paste("PCoA", axes[2])
  plot(x$vectors[, axes], asp = 1, type = "n",
       axes = FALSE, ann = FALSE, ...)
  for(i in levels(x$group)) {
    j <- which(levels(x$group) == i)
    segments(x$centroids[j, axes[1]],
             x$centroids[j, axes[2]],
             x$vectors[x$group == i, axes[1]],
             x$vectors[x$group == i, axes[2]], col = "blue")
    if(hull) {
      ch <- chull(x$vectors[x$group == i, axes])
      ch <- c(ch, ch[1])
      lines(x$vectors[x$group == i, axes][ch, ],
            col = "black", lty = "dashed")
    }
  }
  points(x$centroids[, axes], pch = 16, cex = 1, col = "red")
  points(x$vectors[, axes], pch = as.numeric(x$group),
         cex = cex, ...)
  title(main = main, xlab = xlab, ylab = ylab, sub = sub)
  axis(1)
  axis(2)
  box()
}

anova.dispersion <- function(object, ...)
  {
    model.dat <- with(object, data.frame(Distances = distances,
                                         Groups = group))
    anova(lm(Distances ~ Groups, data = model.dat))
  }

permDisp <- function(object, nperm = 999, ...)
  {
    if(!inherits(object, "dispersion"))
      stop("Only for class \"dispersion\"")
    nobs <- length(object$distances)
    mod <- lm(object$distances ~ object$group)
    mod.Q <- mod$qr
    p <- mod.Q$rank
    resids <- qr.resid(mod.Q, object$distances)
    res <- numeric(length = nperm + 1)
    res[1] <- summary(mod)$fstatistic[1]
    for(i in seq(along = res[-1])) {
      perm.resid <- sample(resids)
      f <- qr.fitted(mod.Q, perm.resid)
      mss <- sum((f - mean(f))^2)
      r <- qr.resid(mod.Q, perm.resid)
      rss <- sum(r^2)
      rdf <- nobs - p
      resvar <- rss / rdf
      res[i+1] <- (mss / (p - 1)) / resvar
    }
    pval <- sum(res >= res[1]) / length(res)
    mod.aov <- anova(object)
    retval <- cbind(mod.aov[, 1:4], c(nperm, NA), c(pval, NA))
    dimnames(retval) <- list(c("Groups", "Residuals"),
                             c("Df", "Sum Sq", "Mean Sq", "F", "N.Perm",
                               "Pr(>F)")) 
    class(retval) <- c("permDisp", class(retval))
    retval
  }

print.permDisp <- function(x, digits = max(getOption("digits") - 2, 3),
                           ...) {
  ## uses code from stats:::print.anova by R Core Development Team
  cat("\n")
  writeLines(strwrap("Permutation test for homogeneity of multivariate dispersions\n"))
  cat("\n")
  nc <- dim(x)[2]
  cn <- colnames(x)
  has.P <- substr(cn[nc], 1, 3) == "Pr("
  zap.i <- 1:(if (has.P) 
              nc - 1
  else nc)
  i <- which(substr(cn, 2, 7) == " value")
  i <- c(i, which(!is.na(match(cn, "F"))))
  if (length(i)) 
    zap.i <- zap.i[!(zap.i %in% i)]
  tst.i <- i
  if (length(i <- grep("Df$", cn))) 
    zap.i <- zap.i[!(zap.i %in% i)]
  if (length(i <- grep("N.Perm$", cn))) 
    zap.i <- zap.i[!(zap.i %in% i)]
  cat("Response: Distances", sep = "\n")
  printCoefmat(x, digits = digits,
               signif.stars = getOption("show.signif.stars"),
               has.Pvalue = has.P, P.values = has.P, cs.ind = NULL,
               zap.ind = zap.i, tst.ind = tst.i, na.print = "", ...)
  invisible(x)
}

##
## Example usage:
##
#require(vegan)
#require(analogue)
#data(varespec)
#dis <- vegdist(varespec)
#groups <- factor(c(rep(1,16), rep(2,8)), labels = c("grazed","ungrazed"))
#mod <- dispersion(dis, groups)
#mod
#anova(mod)
#system.time(permDisp(mod))
#system.time(perm.mod <- permDisp(mod, nperm = 9999))
#perm.mod
#plot(mod)
#boxplot(mod)