[R] Outlier identification according to Hardin & Rocke (1999)

Wed May 26 21:31:03 CEST 2004

I'm trying to use a paper by Hardin & Rocke: http://handel.cipic.ucdavis.edu/~dmrocke/Robdist5.pdf
as a guide for a function to identify outliers in multivariate data.  Attached below is a function that is my attempt to reproduce their method and also a test to see what fraction of the data are identified as outliers.  Using this function I am able to reproduce their results regarding the asymptotic chi-square method, but not their new method using an asymptotic F method.  In particular, the asymptotic F method (and adjusted F method) seem to give critical distances that are too large and therefore no (or few) points are identified as outliers.

I have tried unsuccessfully to contact the authors of the paper to seek further information and / or numerical examples.

I would be most interested if anybody has used the method of Hardin & Rocke or can take the function I have provided below and modify it to reproduce their results.  Note: The mvtnorm package is required.

Best,

Kevin Wright

outliers.id <- function(x, alpha=.05){
  # See: Hardin & Rocke (1999), "The Distribution of Robust Distances"
  # http://handel.cipic.ucdavis.edu/~dmrocke/Robdist5.pdf

  # See: Hardin & Rocke (2002), "Outlier Detection in the Multiple Cluster
  # Setting Using the Minimum Covariance Determinant Estimator"
  # http://bioinfo.cipic.ucdavis.edu/publications/print_pub?pub_id=736&category=1

  # Drop factors first
  factors <- names(x)[sapply(x,is.factor)] 
  if(length(factors)>0)
    x <- x[-factors]

  # Get the robust location/scale estimates
  require(MASS)
  covResult <- cov.rob(x)

  # Calculate the mahalanobis distance for each datum
  distance <- mahalanobis(x,covResult$center,covResult$cov)

  n <- nrow(x)
  p <- ncol(x)
  h <- floor((n+p+1)/2)

  # Asymptotic chi-square method (page 11)
  # Often identifies too many points as outliers
  critical.chi <- qchisq(1-alpha, p)
  cat("Chi square critical distance:", critical.chi,"\n")

  # Now the approximate F method.  First estimate c (page 19)
  c <- pchisq(qchisq(1-h/n, p),p+2) / (h/n)

  # Now to estimate m (page 22)
  a <- (n-h)/n
  qa <- qchisq(1-a,p)
  ca <- (1-a)/pchisq(qa,p+2)
  c2 <- -pchisq(qa,p+2)/2
  c3 <- -pchisq(qa,p+4)/2
  c4 <- 3 * c3
  b1 <- ca * (c3 - c4) / (1-a)
  b2 <- 0.5 + ca/(1-a) * (c3 - qa/p * (c2 + (1-a)/2))
  v1 <- (1-a) * b1^2 * (a * (ca * qa/p -1)^2 -1) -
    2 * c3 * ca^2 * (3* (b1 - p * b2)^2 + (p+2) * b2 * (2*b1 - p*b2))
  v2 <- n * (b1 * (b1 - p*b2) * (1-a))^2 * ca^2
  v <- v1/v2
  m <- 2/(ca^2 * v)

  # (page 17)
  critical.F.asy <- p*m * qf(1-alpha, p, m-p+1) / (c * (m-p+1))
  cat("Asymptotic F critical distance:",critical.F.asy,"\n")

  # The small-sample (hundreds of points) adjustment to m.
  # Hardin & Rocke, 2002, page 631
  m <- m * exp(0.725 - 0.00663*p -0.0780 * log(n))
  # Finally, the critical point (using adjusted m)
  critical.F.adj <- p*m*qf(1-alpha, p, m-p+1) / (c * (m-p+1))
  cat("Adjusted asymptotic F critical distance:",critical.F.adj,"\n")

  #outliers <- as.integer(distance > critical)
  x$Distances <- distance
  #x$Outliers <- outliers
  attr(x,"critical.chi") <- critical.chi
  attr(x,"critical.F.asy") <- critical.F.asy
  attr(x,"critical.F.adj") <- critical.F.adj
  return(x)
}

# Try to reproduce the tables in the paper
if(FALSE){
  require(mvtnorm)
  # Simulate data
  n <- 500;p <- 5
  dat <- rmvnorm(n,rep(0,p),diag(p))

  # Identify outliers
  dat.out <- outliers.id(as.data.frame(dat),alpha=.05)

  # Now see what percent are identified as outliers
  cat("Chi-square \n")
  100*sum(dat.out$Distances>attr(dat.out,"critical.chi"))/n
  cat("Approximate F asymptotic \n")
  100*sum(dat.out$Distances>attr(dat.out,"critical.F.asy"))/n
  cat("Approximate F adjusted \n")
  100*sum(dat.out$Distances>attr(dat.out,"critical.F.adj"))/n
}