[R] Outlier identification according to Hardin & Rocke (1999)
Kevin Wright
kwright at eskimo.com
Wed May 26 21:31:03 CEST 2004
I'm trying to use a paper by Hardin & Rocke: http://handel.cipic.ucdavis.edu/~dmrocke/Robdist5.pdf
as a guide for a function to identify outliers in multivariate data. Attached below is a function that is my attempt to reproduce their method and also a test to see what fraction of the data are identified as outliers. Using this function I am able to reproduce their results regarding the asymptotic chi-square method, but not their new method using an asymptotic F method. In particular, the asymptotic F method (and adjusted F method) seem to give critical distances that are too large and therefore no (or few) points are identified as outliers.
I have tried unsuccessfully to contact the authors of the paper to seek further information and / or numerical examples.
I would be most interested if anybody has used the method of Hardin & Rocke or can take the function I have provided below and modify it to reproduce their results. Note: The mvtnorm package is required.
Best,
Kevin Wright
outliers.id <- function(x, alpha=.05){
# See: Hardin & Rocke (1999), "The Distribution of Robust Distances"
# http://handel.cipic.ucdavis.edu/~dmrocke/Robdist5.pdf
# See: Hardin & Rocke (2002), "Outlier Detection in the Multiple Cluster
# Setting Using the Minimum Covariance Determinant Estimator"
# http://bioinfo.cipic.ucdavis.edu/publications/print_pub?pub_id=736&category=1
# Drop factors first
factors <- names(x)[sapply(x,is.factor)]
if(length(factors)>0)
x <- x[-factors]
# Get the robust location/scale estimates
require(MASS)
covResult <- cov.rob(x)
# Calculate the mahalanobis distance for each datum
distance <- mahalanobis(x,covResult$center,covResult$cov)
n <- nrow(x)
p <- ncol(x)
h <- floor((n+p+1)/2)
# Asymptotic chi-square method (page 11)
# Often identifies too many points as outliers
critical.chi <- qchisq(1-alpha, p)
cat("Chi square critical distance:", critical.chi,"\n")
# Now the approximate F method. First estimate c (page 19)
c <- pchisq(qchisq(1-h/n, p),p+2) / (h/n)
# Now to estimate m (page 22)
a <- (n-h)/n
qa <- qchisq(1-a,p)
ca <- (1-a)/pchisq(qa,p+2)
c2 <- -pchisq(qa,p+2)/2
c3 <- -pchisq(qa,p+4)/2
c4 <- 3 * c3
b1 <- ca * (c3 - c4) / (1-a)
b2 <- 0.5 + ca/(1-a) * (c3 - qa/p * (c2 + (1-a)/2))
v1 <- (1-a) * b1^2 * (a * (ca * qa/p -1)^2 -1) -
2 * c3 * ca^2 * (3* (b1 - p * b2)^2 + (p+2) * b2 * (2*b1 - p*b2))
v2 <- n * (b1 * (b1 - p*b2) * (1-a))^2 * ca^2
v <- v1/v2
m <- 2/(ca^2 * v)
# (page 17)
critical.F.asy <- p*m * qf(1-alpha, p, m-p+1) / (c * (m-p+1))
cat("Asymptotic F critical distance:",critical.F.asy,"\n")
# The small-sample (hundreds of points) adjustment to m.
# Hardin & Rocke, 2002, page 631
m <- m * exp(0.725 - 0.00663*p -0.0780 * log(n))
# Finally, the critical point (using adjusted m)
critical.F.adj <- p*m*qf(1-alpha, p, m-p+1) / (c * (m-p+1))
cat("Adjusted asymptotic F critical distance:",critical.F.adj,"\n")
#outliers <- as.integer(distance > critical)
x$Distances <- distance
#x$Outliers <- outliers
attr(x,"critical.chi") <- critical.chi
attr(x,"critical.F.asy") <- critical.F.asy
attr(x,"critical.F.adj") <- critical.F.adj
return(x)
}
# Try to reproduce the tables in the paper
if(FALSE){
require(mvtnorm)
# Simulate data
n <- 500;p <- 5
dat <- rmvnorm(n,rep(0,p),diag(p))
# Identify outliers
dat.out <- outliers.id(as.data.frame(dat),alpha=.05)
# Now see what percent are identified as outliers
cat("Chi-square \n")
100*sum(dat.out$Distances>attr(dat.out,"critical.chi"))/n
cat("Approximate F asymptotic \n")
100*sum(dat.out$Distances>attr(dat.out,"critical.F.asy"))/n
cat("Approximate F adjusted \n")
100*sum(dat.out$Distances>attr(dat.out,"critical.F.adj"))/n
}
More information about the R-help
mailing list