[R] Finding suspicious data points?
Carl Witthoft
carl at witthoft.com
Thu Jan 26 15:00:13 CET 2012
According to the help file for 'outlier' (quoting):
x    a data sample, vector in most cases. If argument is a dataframe, then
     outlier is calculated for each column by sapply. The same behavior is
     applied by apply when the matrix is given.
(endquote)
So it looks like you could build an "upper triangular" matrix, padded
with NA, such as
1 1 1
NA 2 2
NA NA 3
and look at the result for each column. However, since 'outlier' just
returns the single value furthest from the mean (one per column here),
this doesn't really provide much information.
If I were to write a function to find "genuine" outliers, I would do
something like
    x[abs(x - mean(x)) > 3 * sd(x)]
thus returning all values more than 3-sigma from the mean.
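
Wrapped up as a function, that might look like the rough sketch below.
The name flag_outliers and the arguments k and na.rm are made up for
illustration, the seed is arbitrary, and the data are just the simulated
example from the quoted question, so treat it as a starting point rather
than a tested solution.

```r
# Hypothetical helper: return every value more than k standard deviations
# from the mean. The name, 'k', and 'na.rm' are assumptions for illustration.
flag_outliers <- function(x, k = 3, na.rm = TRUE) {
  m <- mean(x, na.rm = na.rm)
  s <- sd(x, na.rm = na.rm)
  # which() drops NA comparisons, so missing values are never flagged
  x[which(abs(x - m) > k * s)]
}

set.seed(42)  # arbitrary seed, only so the example is reproducible
x <- c(1:3, rnorm(1000, mean = 100000, sd = 300))
flagged <- flag_outliers(x)
print(flagged)
```

On that data the values returned are 1, 2 and 3. Keep in mind that the
extreme values themselves pull the mean and inflate sd, so this is only a
rough screen. For the norms and angles you could run it on each vector
separately, for instance lapply(df, flag_outliers) if they sit in the
columns of a data frame (df being a hypothetical name).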
<quote>
I would like to find data points that at least should be checked one
more time before I process them further.
I've had a look at the outliers package for this, and the outliers
function in that package, but it appears to only return one value.
An example:
> outlier(c(1:3,rnorm(1000,mean=100000,sd=300)))
[1] 1
I think at least 1,2 and 3 should be checked in this case.
Any ideas on how to achieve this in R?
Actually, the real data I will be investigating consist of vector norms
and angles (in an attempt to identify either very short, very long
vectors, or vectors pointing in an odd angle for the category to which
it has been assigned) so a 2D method would be even better.
I would much appreciate any help I can get on this,
--
Sent from my Cray XK6
"Pendeo-navem mei anguillae plena est."