[R] Data frame manipulation by eliminating rows containing extreme values

aajit75 aajit75 at yahoo.co.in
Sun Oct 23 11:26:42 CEST 2011


Hi David,

Thanks for the reply.


f <- function(x) {
  quantile(x, c(0.25, 0.75), na.rm = TRUE) -
    matrix(IQR(x, na.rm = TRUE) * c(1.5), nrow = 1) %*% c(-1, 1)
}

The multiplier 1.5 is hard-coded in the function above only as an example;
it could be larger, perhaps 3.0, once I have analyzed the actual data. The
aim is to find a cut-off on both sides (high and low values) for each
variable, as in a box plot, and then to eliminate observations based on
those cut-offs.
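
A minimal sketch of the box-plot style cut-offs described above, with the
multiplier exposed as an argument instead of being hard-coded. The names
fences and k are hypothetical and not part of the original function. Note
that f as posted evaluates to Q1 + 1.5*IQR and Q3 - 1.5*IQR, which is
narrower than the usual box-plot whisker limits of Q1 - 1.5*IQR and
Q3 + 1.5*IQR; the sketch uses the usual limits.

```r
# Hypothetical helper: box-plot style fences for one numeric vector.
# k is the IQR multiplier (1.5 by default; 3 gives a looser rule).
fences <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
}

# Small made-up vector with one obviously extreme value.
example_fences <- fences(c(1, 2, 3, 4, 100))
print(example_fences)
```

Calling fences(x, k = 3) would widen the limits without editing the
function body.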

On the second point, I am extremely sorry. That was a typo: the call
xyz <- lapply(data1, f) should have used data2, as in the code below.

n <- 100 
x1 <- runif(n) 
x2 <- runif(n) 
x3 <- x1 + x2 + runif(n)/10 
x4 <- x1 + x2 + x3 + runif(n)/10 
x5 <- factor(sample(c('a','b','c'),n,replace=TRUE)) 
x6 <- 1*(x5=='a' | x5=='c') 
data1 <- cbind(x1,x2,x3,x4,x5,x6) 
data2 <- data.frame(data1) 
xyz <- lapply(data2, f) 
str (xyz)

Now it is a list of only six elements:
List of 6
 $ x1: num [1, 1:2] 0.7797 0.0613
 $ x2: num [1, 1:2] 0.9533 0.0194
 $ x3: num [1, 1:2] 1.438 0.532
 $ x4: num [1, 1:2] 2.85 1.03
 $ x5: num [1, 1:2] 4 0
 $ x6: num [1, 1:2] 1.5 -0.5
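
The numeric cut-offs for x5 appear because cbind() builds a numeric matrix
and stores the factor as its integer codes. A small sketch of the
difference, assuming one would rather keep x5 as a factor and compute
cut-offs for numeric columns only; set.seed(1) and the names via_cbind,
direct and num_cols are assumptions made for illustration.

```r
set.seed(1)  # assumed seed, only to make the sketch reproducible
n <- 100
x1 <- runif(n)
x5 <- factor(sample(c("a", "b", "c"), n, replace = TRUE))

via_cbind <- data.frame(cbind(x1, x5))  # x5 is stored as integer codes 1..3
direct <- data.frame(x1, x5)            # x5 stays a factor

# Pick out the numeric columns, so cut-offs are computed only for them.
is_num <- vapply(direct, is.numeric, logical(1))
num_cols <- names(direct)[is_num]
str(via_cbind)
str(direct)
```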

The third point you mentioned is the problem still to be resolved. At the
moment I am overwriting data2 by applying these cut-offs one variable at a
time. Is there a more efficient way to do this?

 data2 <- subset(data2, x1 <= xyz$x1[,1] & x1 >= xyz$x1[,2])
 data2 <- subset(data2, x2 <= xyz$x2[,1] & x2 >= xyz$x2[,2])
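
One possible way to avoid a separate subset() call per variable is to build
a single logical index over all numeric columns and subset once. This is
only a sketch under assumptions: drop_extreme_rows, within_fences and k are
hypothetical names, the limits are the usual box-plot ones
(Q1 - k*IQR, Q3 + k*IQR) rather than what f returns, and NA values are
treated as not extreme.

```r
# Hypothetical helper: keep rows whose numeric values all lie within
# box-plot style fences computed per column from the full data.
drop_extreme_rows <- function(df, k = 1.5) {
  within_fences <- function(x) {
    q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
    iqr <- q[2] - q[1]
    # Assumption: missing values are kept rather than flagged.
    is.na(x) | (x >= q[1] - k * iqr & x <= q[2] + k * iqr)
  }
  is_num <- vapply(df, is.numeric, logical(1))
  # Combine the per-column logical vectors with a row-wise AND.
  keep <- Reduce(`&`, lapply(df[is_num], within_fences), rep(TRUE, nrow(df)))
  df[keep, , drop = FALSE]
}

# Made-up data: the last row has an extreme value in column a.
demo <- data.frame(a = c(1, 2, 3, 4, 100), b = c(10, 11, 12, 13, 14))
cleaned <- drop_extreme_rows(demo)
print(cleaned)
```

Because every cut-off is computed from the original data before any row is
dropped, the result does not depend on the order of the variables.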

On the last point, I agree that removing "extreme values" is a serious
distortion of the data. But in my data, the values for some observations
are set to a very high number, say 999999999999, and this is not
consistent across all variables. So I can set a value higher than 1.5 in
the function, get cut-offs for each variable, and remove such
observations. I am using the function above because rm.outlier removes
only one value.
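
A small sketch of why a larger multiplier still catches such placeholder
values, assuming the usual box-plot limits (Q1 - k*IQR, Q3 + k*IQR). The
data are made up; only the value 999999999999 comes from the description
above.

```r
sentinel <- 999999999999                 # placeholder value mentioned above
x <- c(seq(0.1, 1, by = 0.1), sentinel)  # made-up data plus one placeholder

k <- 3  # looser multiplier than 1.5
q <- quantile(x, c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]
lower <- q[1] - k * iqr
upper <- q[2] + k * iqr

# TRUE for values outside the fences; only the placeholder is flagged here.
flagged <- x < lower | x > upper
print(which(flagged))
```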

Thanks for the help in advance.

Regards,
-Ajit






