[R] Capping outliers

jim holtman jholtman at gmail.com
Tue Nov 22 13:54:27 CET 2011


Here is the solution using pmin/pmax for 10,000 rows.

> min_pctle_cut <- 0.01
> max_pctle_cut <- 0.99
> library(outliers)
>
> n <- 10000
> x1 <- runif(n)
> x2 <- runif(n)
> x3 <- x1 + x2 + runif(n)/10
> x4 <- x1 + x2 + x3 + runif(n)/10
> x5 <- factor(sample(c('a','b','c'),n,replace=TRUE))
> x6 <- factor(1*(x5=='a' | x5=='c'))
> data1 <- cbind(x1,x2,x3,x4,x5,x6)
> x <- data.frame(data1)
>
> z <- x[,sapply(x,is.numeric)]
> zNew <- z  # save for 2nd test
>
> qs <- sapply(z, function(z) quantile(z,
+        c(min_pctle_cut, max_pctle_cut), na.rm = TRUE))
>
>
> #Loop below taking time for execution
>
> system.time(for (i in 1:ncol(z))
+ {
+        for (j in 1:nrow(z))
+ {
+ if (z[j,i] < qs[1,i]) z[j,i]=qs[1,i]
+ if (z[j,i] > qs[2,i]) z[j,i]=qs[2,i]
+ }
+ })
   user  system elapsed
   6.64    0.00    7.76
>
> system.time({
+     for (i in 1:ncol(z)) zNew[[i]] <- pmax(qs[1,i], pmin(qs[2,i], z[[i]]))
+ })
   user  system elapsed
   0.02    0.00    0.00
>
> all(z == zNew)  # are they the same?
[1] TRUE
>


On Tue, Nov 22, 2011 at 6:24 AM, Jim Holtman <jholtman at gmail.com> wrote:
> You can easily vectorize this code using pmin/pmax.
>
> Sent from my iPad
>
> On Nov 22, 2011, at 1:06, Aher <ajit.aher at cedar-consulting.com> wrote:
>
>> Hi Experts,
>>
>> I am new to R, using following sample code for capping outliers using
>> percentile information.  Working on large data (30000 observations and 150
>> variables), loop I am using in the below mentioned code for detecting
>> outliers and capping to upper /lower percentile value is taking much time
>> for the execution.
>> Is there anything wrong with code, can anyone suggest improvement in the
>> script to enhance performance!
>> min_pctle_cut <- 0.01
>> max_pctle_cut <- 0.99
>> library(outliers)
>>
>> n <- 100
>> x1 <- runif(n)
>> x2 <- runif(n)
>> x3 <- x1 + x2 + runif(n)/10
>> x4 <- x1 + x2 + x3 + runif(n)/10
>> x5 <- factor(sample(c('a','b','c'),n,replace=TRUE))
>> x6 <- factor(1*(x5=='a' | x5=='c'))
>> data1 <- cbind(x1,x2,x3,x4,x5,x6)
>> x <- data.frame(data1)
>>
>> z <- x[,sapply(x,is.numeric)]
>>
>> qs <- sapply(z, function(z) quantile(z,
>>    c(min_pctle_cut, max_pctle_cut), na.rm = TRUE))
>>
>>
>> #Loop below taking time for execution
>>
>> system.time(for (i in 1:ncol(z))
>> {
>>    for (j in 1:nrow(z))
>> {
>> if (z[j,i] < qs[1,i]) z[j,i]=qs[1,i]
>> if (z[j,i] > qs[2,i]) z[j,i]=qs[2,i]
>> }
>> })
>>
>>
>>
>> --
>> View this message in context: http://r.789695.n4.nabble.com/Capping-outliers-tp4094647p4094647.html
>> Sent from the R help mailing list archive at Nabble.com.
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>



-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.



More information about the R-help mailing list