[R] some thoughts on outlier detection, need help!
Weiwei Shi
helprhelp at gmail.com
Thu Aug 4 21:13:14 CEST 2005
Dear listers:
I have an idea for outlier detection and would like to implement it
in R first. I hope to get some input from the gurus here.
I chose a distance-based approach:
step 1:
Calculate the distance between every pair of rows of a data frame. To
account for the different scales of the variables, I chose the
Mahalanobis distance, using the variance as the scaler.
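A minimal sketch of step 1, assuming the full sample covariance of the data is used and is positive definite. The helper name pairwise_mahalanobis and the simulated data are hypothetical and only for illustration. The rows are whitened with the Cholesky factor of the covariance, after which ordinary Euclidean distances from dist() equal Mahalanobis distances.

```r
# Pairwise Mahalanobis distances between the rows of a numeric matrix.
# Assumption: the sample covariance of x is used and is positive definite.
pairwise_mahalanobis <- function(x) {
  x <- as.matrix(x)
  S <- cov(x)
  R <- chol(S)            # upper triangular, t(R) %*% R == S
  z <- x %*% solve(R)     # whitened rows: Euclidean distance on z is
                          # Mahalanobis distance on x
  as.matrix(dist(z))      # full n x n symmetric distance matrix
}

# Illustrative data only (simulated, not from the original post)
set.seed(1)
x <- matrix(rnorm(40), nrow = 10, ncol = 4)
D <- pairwise_mahalanobis(x)
```

If only the per-variable variances are meant as the scaler (no covariances), as.matrix(dist(scale(x))) would give that simpler variant.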
step 2:
Let k be the number of points in one "cluster". k is chosen by
answering the following question: how many neighbors does a point need
in order not to be an outlier?
For each point, take the smallest (k-1) distances from step 1, and
record the largest of those (k-1) distances for that point.
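Step 2 could be sketched as below, assuming D is a full symmetric distance matrix with zeros on the diagonal. The function name knn_max_distance and the toy points are hypothetical. For each point, the largest of its (k-1) smallest distances is simply the distance to its (k-1)-th nearest neighbor.

```r
# For each row of a distance matrix, return the largest of the (k-1)
# smallest distances to the other points.
knn_max_distance <- function(D, k) {
  D <- as.matrix(D)
  stopifnot(k >= 2, k <= nrow(D))
  unname(sapply(seq_len(nrow(D)), function(i) {
    d <- sort(D[i, -i])   # distances to all other points, ascending
    d[k - 1]              # distance to the (k-1)-th nearest neighbor
  }))
}

# Toy example: four points on a line, the last one far from the rest
pts <- c(0, 1, 2, 10)
D_toy <- as.matrix(dist(pts))
score_toy <- knn_max_distance(D_toy, k = 3)
score_toy   # 2 1 2 9
```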
step 3:
Look at the distribution of those maxima over all points. The
multivariate problem thus becomes a univariate one: an outlier among
the maxima identifies the corresponding point as an outlier.
My questions are:
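One possible way to sketch step 3, since no particular univariate rule is fixed here: flag scores above the upper boxplot fence (third quartile plus 1.5 times the interquartile range). That rule is an assumption chosen for illustration, and flag_outliers and the example scores are hypothetical.

```r
# Flag unusually large scores with the upper boxplot fence.
# Assumption: Q3 + coef * IQR is used as the cutoff; any other
# univariate outlier rule could be substituted.
flag_outliers <- function(score, coef = 1.5) {
  q <- quantile(score, c(0.25, 0.75), names = FALSE)
  score > q[2] + coef * (q[2] - q[1])
}

# Illustrative scores only: one point is far from its neighbors
score <- c(2, 1, 2, 9, 1.5, 1.2, 1.8)
which(flag_outliers(score))   # 4
```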
1. I don't know whether Mahalanobis distance is appropriate, since most
clustering algorithms implemented in R (like pam or clara) use
Euclidean or Manhattan distance.
2. Is there a way to get the Mahalanobis distance matrix for every pair
of rows of a data frame or matrix?
3. My approach allows a point to belong to more than one
k-cluster. Is there a similar algorithm in R or in the literature?
Thanks for any suggestions,
weiwei
--
Weiwei Shi, Ph.D
"Did you always know?"
"No, I did not. But I believed..."
---Matrix III