[R] Oja median
Roger Koenker
roger at ysidro.econ.uiuc.edu
Fri Aug 15 17:29:21 CEST 2003
I discovered recently that the phrase "Oja median" produces no hits in
Jonathan Baron's very valuable R search engine. I found this surprising
since I've long regarded this idea as one of the more interesting notions
in the multivariate robustness literature. To begin to remedy this oversight
I wrote a bivariate version and then decided that writing a general p-variate
version might make a nice weekend programming puzzle for R-help.
Here are a few more details:
The Oja median of n p-variate observations minimizes over theta
in R^p the sum of the volumes of the simplices formed by theta,
and p of the observed points, the sum being taken over all n choose p
groups of p observations. Thus, in the bivariate case we are minimizing
the sum of the areas of all triangles formed by the the point theta
and pairs of observations. Here is a simple bivariate implementation:
oja.median <-function(x) {
#bivariate version -- x is assumed to be an n by 2 matrix
require(quantreg)
n <- dim(x)[1]
A <- matrix(rep(1:n, n), n)
i <- A[col(A) < row(A)]
j <- A[n + 1. - col(A) > row(A)]
xx <- cbind(x[i, ], x[j, ])
y <- xx[, 1] * xx[, 4] - xx[, 2] * xx[, 3]
z1 <- (xx[, 4] - xx[, 2])
z2 <- - (xx[, 3] - xx[, 1])
return(rq(y~cbind(z1, z2)-1)$coef)
}
To understand the strategy, note that the area of the triangle formed
by the points x_i = (x_i1,x_i2), x_j = (x_j1,x_j2),
and theta = (theta_1,theta_2) is given by the determinant,
| 1 1 1 |
Delta(x_i, x_j, theta) = .5 |y_i1 yj1 theta_1|.
|y_i2 yj2 theta_2|
Expanding the determinant in the unknown parameters theta gives
the l1 regression formulation. Remarkably, a result of Wilks says
that if the call to rq() is replaced with a call to lm() you get
the sample mean -- this gives an impressively inefficient least
squares regression based alternative to apply(x,2,mean)!
It also provides a useful debugging check for proposed algorithms.
Obviously, the expansion of the determinant gives the same formulation
for p>2, the challenge is to find a clean way to generate the
design matrix and response vector for the general setting.
Bon weekend!
url: www.econ.uiuc.edu/~roger/my.html Roger Koenker
email rkoenker at uiuc.edu Department of Economics
vox: 217-333-4558 University of Illinois
fax: 217-244-6678 Champaign, IL 61820
More information about the R-help
mailing list