aov design questions

Fri, 29 May 1998 16:44:12 -0600 (MDT)

R developers,

  I have a first attempt to make an aov function.  Eventually I want to
  build in  Error() structure, but first I am trying to get this
  presentable for balanced data with only a single stratum, just using 
  residual error.  I  am following R. M. Heiberger's Computation for the 
  Analysis of Designed Experiments, Wiley (1989)

  I a using a wrapper (aov.bal) to call the fitting function in the same 
way that lm() does a bunch of stuff and then calls lm.fit(), in fact I 
took most of the wrapper right out of lm(). I'm trying to make the aov 
object look as much as possible like an lm.object

  Several questions have come up which need discussion.

Q 1)  Is aov supposed to avoid qr decomposition and matrix inversion?
      In S it is claimed to be faster than lm() for large datasets.  Is
      that due to avoidance of qr()? 

      (My main goal is to get the Error strata working.)

Q 2)  If the speed is important, can we give up choice of contrasts used to
  set up the dummy factor variables? ( contr.treatment, contr.sum) 

   Why do I ask? 
   In pursuit of speed (we're talking nanoseconds for a reasonable size
   data set) I have forced in helmert  contrasts because their
   orthogonality makes it slick to compute. So in this version, whatever
   option(contrasts) is set will be ignored.

   I don't like making that choice for the user esp. since it gives wrong 
   answers under the default contrasts, and have thought about
   converting back into the original contrasts, but it isn't easy with
   higher-way anova models. It looks easy in VR2 p 197-204 
   (BTW I have kroenecker and ginv functions if anyone wants them)
  but I don't get it to work out nicely.  V&R show the contrast
  transformation as:
   alpha_T = ginv(C_T) %*% C_H %*% alpha_H
  where C_T is the contrast matrix for treatment contrasts, and C_H for
  Helmert contrasts. The alpha's contain only the treatment contrast 
  effects, not the mean.  This would leave the intercept estimate
   unchanged (not right).  For single factors it works to stick a 
  col.vector of 1s on the front of C_H and C_T, and include the mu.hat in 
  the alpha_H or alpha_T.  But when I get into interactions, I don't see 
  what is needed.

Q 3)  I am having a problem with keep.order. With balanced data, I know
    the effects are orthogonal, so order is not an issue, but as I move
    onto the tougher cases, I need model.matrix to preserve order, and
    it doesn't.

Q 4)  What are the $effects in lm.objects?
     For the moment I've stuck in the coefficients.

Q 5)  Any objection to passing cov.unscaled from aov to summary.aov?
    (Since I'm not using qr and can't pass R)

Feedback needed, comments are welcome.

Jim Robison-Cox                 ____________            
Department of Math Sciences    |            |           phone: (406)994-5340
2-214 Wilson Hall               \   BZN, MT |           FAX:   (406)994-1789
Montana State University         |  *_______|
Bozeman, MT 59717                 \_|         e-mail: jimrc@math.montana.edu 

#---------------------------------------------------------------------------
##  First hack at aov for balanced data

aov.bal <- function(formula, data = list(), subset, weights, na.action, 
        method = "qr", model = TRUE, keep.order=F, ...) 
{  
     mt <- terms(formula, data = data, keep.order = keep.order)
     nt <- length(attr(mt,"order")[attr(mt,"order") ==1])
     lt <- rep("contr.helmert",nt)            ## Sets helmert contrasts
     names(lt) <- attr(mt,"term.labels")[ attr(mt,"order") ==1]
     mf <- match.call()
     mf$model <- NULL
     mf[[1]] <- as.name("model.frame")
     mf <- eval(mf, sys.frame(sys.parent()))
     if (method == "model.frame") 
          return(mf)
     else if (method != "qr")  ## change for aov??
         warning(paste("method =", method, "is not supported. Using \"qr\"."))
     if (length(list(...))) 
          warning(paste("Extra arguments", deparse(substitute(...)), 
                        "are just disregarded."))
     y <- model.response(mf, "numeric")
     w <- model.weights(mf)
     if (is.empty.model(mt)) {
                z <- list(coefficients = numeric(0), residuals = y, 
                        fitted.values = 0 * y, weights = w, rank = 0, 
                        df.residual = length(y))
                class(z) <- if (is.matrix(y)) 
                        c("mlm.null", "lm.null", "mlm", "lm")
                else c("lm.null", "lm")
     }
     else {
                x <- model.matrix(mt, mf, as.list(lt))
                          # forces in contr.helmert
                z <- if (is.null(w)) {
                   aov.balanced.fit(x, y, mt)
		}
                else {
                  stop("Weighted aov not implemented, use lm")
                  #aov.balanced.wfit(x, y, mt, w)
                }
                class(z) <- c(if (is.matrix(y)) c("maov","mlm"), c("aov","lm"))
     }
     z$call <- match.call()
     z$terms <- mt
     if (model) 
         z$model <- mf
     z
}

aov.balanced.fit <- function(x,y,mod.terms,wt=NULL){
    x <- as.matrix(x); dimx <- dim(x)
    xnames <- dimnames(x)[[2]]
    y <- as.matrix(y); dimy <- dim(y)
    assn <- attr(x,"assign")
    df  <-   c(table(assn), dimy[1]- dimx[2])
    D <- diag(1/diag(t(x) %*% x))
    dimnames(D) <- list(xnames,xnames)
    coef <- as.numeric(D %*% t(x) %*%  y)
    names(coef) <- xnames
    fits <- x %*% coef
    resid <- y - fits
    eff <-  apply((t(x) * rep(coef,dimx[1]))^2,1,sum)
             ## Take each column of x (row of t(x)) times its
             ## coefficient, square the result and sum (over columns of x)
    eff <-  c(sapply( split(eff,assn), sum),   sum(resid^2))
             ## Adds them up over the factors.
    if(attr(mod.terms,"intercept") == 1) {
      eff <- eff[-1]
      df  <- df[-1]
    }
    name <- c(attr(mod.terms,"term.labels"), "Residuals")
    names(eff) <- name
    names(df)  <- name
    list(coefficients= coef, residuals = resid, effects = eff, 
         rank= dimx[2], fitted.values = fits, assign= assn,
         cov.unscaled = D, df.residual =  dimy[1] - dimx[2],
         df =  df  )

}

print.aov <- 
function (x, digits = max(3, .Options$digits - 3), ...) 
{  
   lenx <- length(x$df)
   tabl <- cbind(x$df,x$effects, x$effects/x$df)
   tabl <- cbind(tabl,c(tabl[1:(lenx-1),3]/tabl[lenx,3],NA))
   tabl <- cbind(tabl,1-pf(tabl[,4],tabl[,1],tabl[lenx,1]))
   dimnames(tabl) <- list(names(x$effects), c("df","SS","MS","F","P > F"))
   cat("\nAnalysis of Variance:\n\n")
   print.default(round(tabl, digits), na = "", print.gap = 2)
   cat("\n")
   invisible(x)
}

summary.aov <- function (z, correlation=F) 
{
        p <- z$rank
        p1 <- 1:p
        r <- resid(z)
        f <- fitted(z)
        n <- length(f)
        w <- weights(z)
        if (is.null(z$terms)) {
                stop("invalid 'aov' object:  no terms component")
        }
        else {
                if (is.null(w)) {
                        mss <- if (attr(z$terms, "intercept")) 
                                sum((f - mean(f))^2)
                        else sum(f^2)
                        rss <- sum(r^2)
                }
        }
        resvar <- rss/(n - p)
        se <- sqrt(resvar * diag(z$cov.unscaled))
        est <- z$coefficients
        tval <- est/se
        ans <- z[c("call", "terms")]
        ans$residuals <- as.numeric(r)
        ans$coefficients <- cbind(est, se, tval, 2 * (1 - pt(abs(tval), 
                n - p)))
        dimnames(ans$coefficients) <- list(names(z$coefficients),
                c("Estimate", "Std.Error", "t Value", "Pr(>|t|)"))
        ans$sigma <- sqrt(resvar)
        ans$df <- c(p, n - p)
        if (p != attr(z$terms, "intercept")) {
                df.int <- if (attr(z$terms, "intercept")) 
                        1
                else 0
                ans$r.squared <- mss/(mss + rss)
                #0.14 :	(n/(n-p))
                ans$adj.r.squared <- 1 - (1 - ans$r.squared) * 
                        ((n - df.int)/(n - p))
                ans$fstatistic <- c((mss/(p - df.int))/(rss/(n - 
                        p)), p - df.int, n - p)
                #0.14: ans$fstatistic <- c((mss/(p-1))/(rss/(n-p)),p-1,n-p)
                names(ans$fstatistic) <- c("value", "numdf", 
                        "dendf")
        }
        ans$cov.unscaled <- z$cov.unscaled
        if (correlation) {
                invsqrt.diag <- diag(sqrt(1/diag(z$cov.unscaled)))
                ans$correlation <-
                      invsqrt.diag %*% z$cov.unscaled %*% invsqrt.diag
                dimnames(ans$correlation) <- dimnames(z$cov.unscaled)
        }
        class(ans) <- "summary.aov"
        ans
}

print.summary.aov <- function (x, digits = max(3, .Options$digits - 3), 
                              roundfun = round, ...) 
{
        cat("\nCall:\n")
        cat(paste(deparse(x$call), sep = "\n", collapse = "\n"), 
                "\n\n", sep = "")
        resid <- x$residuals
        df <- x$df
        rdf <- df[2]
        if (rdf > 5) {
                cat("Residuals:\n")
                if (length(dim(resid)) == 2) {
                        rq <- apply(t(resid), 1, quantile)
                        dimnames(rq) <- list(c("Min", "1Q", "Median", 
                                "3Q", "Max"), dimnames(resid)[[2]])
                }
                else {
                        rq <- quantile(as.numeric(resid))
                        names(rq) <- c("Min", "1Q", "Median", 
                                "3Q", "Max")
                }
                print(rq, digits = digits, ...)
        }
        else if (rdf > 0) {
                cat("Residuals:\n")
                print(resid, digits = digits, ...)
        }
#        if (nsingular <- df[3] - df[1]) 
#                cat("\nCoefficients: (", nsingular, " not defined because of singularities)\n", 
#                        sep = "")
#        else
        cat("\nCoefficients:\n")
        print(roundfun(x$coefficients, digits = digits), quote = FALSE, 
                ...)
        cat("\nResidual standard error:", format(signif(x$sigma, 
                digits)), "on", rdf, "degrees of freedom\n")
        if (!is.null(x$fstatistic)) {
                cat("Multiple R-Squared:", format(signif(x$r.squared, 
                        digits)))
                cat(",\tAdjusted R-squared:", format(signif(x$adj.r.squared, 
                        digits)), "\n")
                cat("F-statistic:", format(signif(x$fstatistic[1], 
                        digits)), "on", x$fstatistic[2], "and", 
                        x$fstatistic[3], "degrees of freedom")
                cat(",\tp-value:", format(signif(1 - pf(x$fstatistic[1], 
                        x$fstatistic[2], x$fstatistic[3]), digits)), 
                        "\n")
        }
        correl <- x$correlation
        if (!is.null(correl)) {
                p <- dim(correl)[2]
                if (p > 1) {
                        cat("\nCorrelation of Coefficients:\n")
                        correl[!lower.tri(correl)] <- NA
                        print(correl[-1, -NCOL(correl)], digits = digits, 
                                na = "")
                }
        }
        cat("\n")
        invisible(x)
}

#-------------------------------------------------------------------------
##  Rd file:
\name{aov.bal}
\title{Function to compute analysis of variance for balanced designs}
\usage{
aov.bal(formula, data=list(), subset, weights, na.action, method="qr", model=TRUE, keep.order=F, ...)
}
%- maybe also `usage' for other functions documented here.
\alias{aov.bal}
\alias{summary.aov}
\alias{coefficients.lm}
\alias{df.residual.lm}
\alias{fitted.values.lm}
\alias{residuals.lm}
\alias{aov.balanced.fit}
\alias{print.aov}
\alias{print.summary.aov}
\keyword{regression}
%- Also NEED an `\alias' for EACH other function documented here.
\arguments{
 \item{formula}{A detailed description of the model to be fit.  See
   details under \code{\link{lm}} }
 \item{data}{ data frame contining the data to be fit. Default is
   working directory.}
 \item{subset}{Description of what rows of the data frame to use in the fit. }
 \item{na.action}{ Currently not used }
 \item{method}{Not currently used, except that method="model.frame" will
 extract the model frame only and not fit the anova.}
 \item{model}{ If true, model frame is included in the output.}
 \item{keep.order}{If true, model should be evaluated in the order
   specified by the model statement. (Doesn't work.)}
 \item{\dots}{ other arguments (not presently used) }
}
\description{
\code{aov.bal} is used to fit analysis of variance models.  Eventually it
will handle multiple strata. The generic accessor functions
\code{coefficients}, \code{fitted.values} and \code{residuals}
can be used to extract various useful features of the
value returned by \code{aov.bal}.  Currently \code{aov.bal} uses only the
Helmert contrasts and ignores the contrast functions set by
options()$contrast. 
}
\value{
\code{aov.bal} returns an object of class code\aov} which may be printed
with the \code{print.aov} or \code{summary.aov}.
  \item{coefficients}{Coefficients under the Helmert contrasts.}
  \item{residuals} {Residuals}
  \item{effects } {effects ?=? coefficients?}
  \item{rank}     
  \item{fitted.values}
  \item{assign}
  \item{var }
  \item{df.residual}
  \item{df}
  \item{call}
  \item{terms}
}
\references{ Heiberger, R. M., 1989, Computation for the Analysis of
  Designed Experiments, Wiley}
\author{ Jim Robison-Cox }
\seealso{ \code{\link{lm}},\code{\link{glm}},\code{\link{coefficients}},\code{\link{residuals}},\code{\link{fitted.values}} }

\examples{
 speed <- c(59, 69, 66, 75, 54, 65, 58, 70, 66, 77, 71, 71, 58, 65, 56,
            62, 54, 60, 52, 64, 59, 73, 60, 70)
 typists <- gl(6,4,24)
 brands  <- gl(4,1,24)
 type.aov <- aov.bal(speed ~ typists + brands)
 summary(type.aov)
}
\keyword{ regression }%-- one or more ...
\keyword{models}

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._