[R] remove extreme values or winsorize – loop - dataframe

jim holtman jholtman at gmail.com
Sun Aug 1 04:10:38 CEST 2010


This will split the data by industry and year, and then keep only the
rows whose X1 lies in the middle 80% of each group (>= the 10% quantile
and <= the 90% quantile); rows outside those limits are dropped.

# split the data by industry/year
d.s <- split(data, list(data$industry, data$year), drop=TRUE)
result <- lapply(d.s, function(.id){
    # get 10/90% values
    .limit <- quantile(.id$X1, probs=c(.1, .9), na.rm=TRUE)
    subset(.id, X1 >= .limit[1] & X1 <= .limit[2])
})

This returns a list of 100 data frames, one for each industry/year
combination.
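
If you would rather keep every row and clamp the extremes to the limits
(winsorizing, as the original question asks for X2), a sketch along these
lines might do it. The helper name winsorize and the small stand-in data
frame are assumptions for illustration; they are not part of base R or of
the original post:

```r
# Hypothetical helper: clamp x to its 10%/90% quantiles instead of
# dropping the values outside them. NAs in x stay NA.
winsorize <- function(x, probs = c(0.1, 0.9)) {
    limit <- quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
    pmin(pmax(x, limit[1]), limit[2])
}

# Small stand-in for the poster's data: 100 firms, 10 years, 10 industries
set.seed(1)
data <- data.frame(
    firm     = rep(1:100, each = 10),
    industry = rep(rep(1:10, each = 10), times = 10),
    year     = rep(1998:2007, times = 100),
    X1       = rnorm(1000)
)

# ave() applies winsorize() within each industry/year group and returns
# the values in the original row order, so X2 lines up with X1.
data$X2 <- ave(data$X1, data$industry, data$year, FUN = winsorize)
```

This adds X2 to the original data frame without any splitting and
re-binding by hand. If the trimmed list from the code above is what you
want as a single data frame, do.call(rbind, result) should combine it.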

On Sat, Jul 31, 2010 at 9:39 PM, Cecilia Carmo <cecilia.carmo at ua.pt> wrote:
> Hi everyone!
>
> #I need a loop or a function that creates an X2 variable that is X1 without
> the extreme values (or X1 winsorized) by industry and year.
>
> #My reproducible example:
> firm<-rep(1:1000,each=10)
> year<-rep(1998:2007,1000)
> # each firm stays in one industry; 100 repeats give 10000 rows, matching firm and year
> industry<-rep(rep(1:10,each=10),100)
> X1<-rnorm(10000)
> data<-data.frame(firm, industry,year,X1)
> data
>
> The way I’m doing this is very hard. I split my sample by industry and year,
> for each industry and year I calculate the 10% and 90% quantiles, then I
> create a X2 variable like this:
>
> industry1<-subset(data,data$industry==1)
>
> ind1year1999<-subset(industry1,industry1$year==1999)
> q10<-quantile(ind1year1999$X1,probs=0.1,na.rm=TRUE)
> q90<-quantile(ind1year1999$X1,probs=0.90,na.rm=TRUE)
> ind1year1999winsorized<-transform(ind1year1999,X2=ifelse(X1<q10,q10,ifelse(X1>q90,q90,X1)))
>
> ind1year2000<-subset(industry1,industry1$year==2000)
> q10<-quantile(ind1year2000$X1,probs=0.1,na.rm=TRUE)
> q90<-quantile(ind1year2000$X1,probs=0.90,na.rm=TRUE)
> ind1year2000winsorized<-transform(ind1year2000,X2=ifelse(X1<q10,q10,ifelse(X1>q90,q90,X1)))
>
> I repeat this for all years and industries, and then I merge/bind all again
> to have a new dataframe with all the columns of the dataframe «data» plus
> X2.
>
> Could anyone help me do this in an easier way?
>
> Thanks
> Cecília Carmo
> Universidade de Aveiro - Portugal
>



-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?


