[R] remove extreme values or winsorize – loop - dataframe
Cecilia Carmo
cecilia.carmo at ua.pt
Tue Aug 3 00:54:12 CEST 2010
Thank you again, but I think I need to do some homework
about the split function, because I'm not understanding it
very well.
Besides, I think I still have a problem. I also need X2 =
X1 winsorized: X2 is equal to X1 between 10%-90%, and is
equal to the 10% value when < 10% and equal to the 90%
value when it is >.
Could you help me?
Thank you
Cecília
Em Mon, 2 Aug 2010 18:42:27 -0400
jim holtman <jholtman at gmail.com> escreveu:
> This is just following up with the example data you
>sent. This will
> create a list 'result' that will have the subset of data
>between the
> 10% & 90%-tiles of the data:
>
>> #My reproducible example:
>> firm<-sort(rep(1:1000,10),decreasing=F)
>> year<-rep(1998:2007,1000)
>> industry<-rep(c(rep(1,10),rep(2,10),rep(3,10),rep(4,10),rep(5,10),rep(6,10),rep(7,10),rep(8,10),rep(9,10),
> + rep(10,10)),1000)
>> X1<-rnorm(10000)
>> data<-data.frame(firm, industry,year,X1)
>> # split the data by industry/year
>> d.s <- split(data, list(data$industry, data$year),
>>drop=TRUE)
>> result <- lapply(d.s, function(.id){
> + # get 10/90% values
> + .limit <- quantile(.id$X1, prob=c(.1, .9))
> + subset(.id, X1 >= .limit[1] & X1 <= .limit[2])
> + })
>> str(result)
> List of 100
> $ 1.1998 :'data.frame': 800 obs. of 4 variables:
> ..$ firm : int [1:800] 1 21 31 41 51 61 71 81 91 111
>...
> ..$ industry: num [1:800] 1 1 1 1 1 1 1 1 1 1 ...
> ..$ year : int [1:800] 1998 1998 1998 1998 1998 1998
>1998 1998
> 1998 1998 ...
> ..$ X1 : num [1:800] 0.659 -0.105 -0.617 0.342
>-1.077 ...
> $ 2.1998 :'data.frame': 800 obs. of 4 variables:
> ..$ firm : int [1:800] 2 32 42 52 62 72 102 112 132
>162 ...
> ..$ industry: num [1:800] 2 2 2 2 2 2 2 2 2 2 ...
> ..$ year : int [1:800] 1998 1998 1998 1998 1998 1998
>1998 1998
> 1998 1998 ...
> ..$ X1 : num [1:800] -1.1044 -0.0666 -0.9184
>0.3469 -0.2348 ...
>
> You can see that the 'name' of the list element is the
>industry.year
> combination; this can also be seen in the data.
>
> On Mon, Aug 2, 2010 at 6:20 PM, Cecilia Carmo
><cecilia.carmo at ua.pt> wrote:
>> Thank you for your help but I don't understand how can I
>>have a dataframe
>> with the columns: firm, year, industry, X1 and X2. Could
>>you help me
>> (again)?
>>
>>
>> Cecília Carmo
>>
>>
>> Em Sat, 31 Jul 2010 22:10:38 -0400
>> jim holtman <jholtman at gmail.com> escreveu:
>>>
>>> This will split the data by industry & year and then
>>>return the values
>>> that include the 80%-tile (>=10% & <= 90%)
>>>
>>> # split the data by industry/year
>>> d.s <- split(data, list(data$industry, data$year),
>>>drop=TRUE)
>>> result <- lapply(d.s, function(.id){
>>> # get 10/90% values
>>> .limit <- quantile(.id$X1, prob=c(.1, .9))
>>> subset(.id, X1 >= .limit[1] & X1 <= .limit[2])
>>> })
>>>
>>> This returns a list of 100 elements for each
>>>combination.
>>>
>>> On Sat, Jul 31, 2010 at 9:39 PM, Cecilia Carmo
>>><cecilia.carmo at ua.pt>
>>> wrote:
>>>>
>>>> Hi everyone!
>>>>
>>>> #I need a loop or a function that creates a X2 variable
>>>>that is X1
>>>> without
>>>> the extreme values (or X1 winsorized) by industry and
>>>>year.
>>>>
>>>> #My reproducible example:
>>>> firm<-sort(rep(1:1000,10),decreasing=F)
>>>> year<-rep(1998:2007,1000)
>>>>
>>>> industry<-rep(c(rep(1,10),rep(2,10),rep(3,10),rep(4,10),rep(5,10),rep(6,10),rep(7,10),rep(8,10),rep(9,10),
>>>> rep(10,10)),1000)
>>>> X1<-rnorm(10000)
>>>> data<-data.frame(firm, industry,year,X1)
>>>> data
>>>>
>>>> The way I’m doing this is very hard. I split my sample
>>>>by industry and
>>>> year,
>>>> for each industry and year I calculate the 10% and 90%
>>>>quantiles, then I
>>>> create a X2 variable like this:
>>>>
>>>> industry1<-subset(data,data$industry==1)
>>>>
>>>> ind1year1999<-subset(industry1,industry1$year==1999)
>>>> q1<-quantile(ind1year1999$X1,probs=0.1,na.rm=TRUE)
>>>> q99<-quantile(ind1year1999$X1,probs=0.90,na.rm=TRUE)
>>>>
>>>> ind1year1999winsorized<-transform(ind1year1999,X2=ifelse(X1<q1,q1,ifelse(X1>q99,q99,X1)))
>>>>
>>>> ind1year2000<-subset(industry1,industry1$year==2000)
>>>> q1<-quantile(ind1year2000$X1,probs=0.1,na.rm=TRUE)
>>>> q99<-quantile(ind1year2000$X1,probs=0.90,na.rm=TRUE)
>>>>
>>>> ind1year2000winsorized<-transform(ind1year2000,X2=ifelse(X1<q1,q1,ifelse(X1>q99,q99,X1)))
>>>>
>>>> I repeat this for all years and industries, and then I
>>>>merge/bind all
>>>> again
>>>> to have a new dataframe with all the columns of the
>>>>dataframe «data» plus
>>>> X2.
>>>>
>>>> Could anyone help me doing this in a easier way?
>>>>
>>>> Thanks
>>>> Cecília Carmo
>>>> Universidade de Aveiro - Portugal
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained,
>>>>reproducible code.
>>>>
>>>
>>>
>>>
>>> --
>>> Jim Holtman
>>> Cincinnati, OH
>>> +1 513 646 9390
>>>
>>> What is the problem that you are trying to solve?
>>
>>
>>
>
>
>
> --
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
>
> What is the problem that you are trying to solve?
More information about the R-help
mailing list