[R] Discretize factors?
Peter Ehlers
ehlers at ucalgary.ca
Mon May 17 00:10:49 CEST 2010
And if you do have many variables in one dataframe, you might
wish to construct the formula first using paste():
nm <- c("0", names(d)[-c(1,2)])
fo <- as.formula(paste("~", paste(nm, collapse= "+")))
d <- cbind(d, model.matrix(fo, data=d))
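
A minimal self-contained sketch of the idea, assuming a data frame shaped like the one in the quoted message below (two numeric columns followed by the factor columns). The name d2 is only for illustration, and grp is made a factor explicitly so the result does not depend on the R version's stringsAsFactors default:

```r
# Example data: two numeric columns, then the factor column(s).
d <- data.frame(a = c(1, 4, 3, 4, 5, 6),
                b = c(5, 4, 5, 3, 4, 5),
                grp = factor(c("A", "B", "B", "C", "C", "C")))

# Build the formula "~ 0+grp" from the column names,
# skipping the first two (numeric) columns.
nm <- c("0", names(d)[-c(1, 2)])
fo <- as.formula(paste("~", paste(nm, collapse = "+")))

# Expand the factor into 0/1 indicator columns and bind them to the data.
d2 <- cbind(d, model.matrix(fo, data = d))
names(d2)
```

Because the formula refers to the variable by name (grp) rather than by an expression such as d[,3], the new columns should come out labelled grpA, grpB and grpC.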
-Peter Ehlers
On 2010-05-16 15:30, Thomas Stewart wrote:
> Maybe this will lead you to an acceptable solution. Note that I changed how
> the data set is created. (In your example, the numeric variables were being
> converted to factor variables. It seems to me that you want something
> different.) The key difference between my code and yours is that I use the
> variable name in the model matrix function; that is, I use ~0+grp instead of
> ~0+d[,3]. As seen below, this change creates non-ugly results.
>
>> grp<- c("A", "B","B","C","C","C")
>> a<- c(1,4,3,4,5,6)
>> b<- c(5,4,5,3,4,5)
>> d<- data.frame(a=a,b=b,grp=grp)
>>
>> str(d)
> 'data.frame': 6 obs. of 3 variables:
> $ a : num 1 4 3 4 5 6
> $ b : num 5 4 5 3 4 5
> $ grp: Factor w/ 3 levels "A","B","C": 1 2 2 3 3 3
>>
>> d<-cbind(d,model.matrix(~0+grp,data=d))
>>
>> d
> a b grp grpA grpB grpC
> 1 1 5 A 1 0 0
> 2 4 4 B 0 1 0
> 3 3 5 B 0 1 0
> 4 4 3 C 0 0 1
> 5 5 4 C 0 0 1
> 6 6 5 C 0 0 1
>> str(d)
> 'data.frame': 6 obs. of 6 variables:
> $ a : num 1 4 3 4 5 6
> $ b : num 5 4 5 3 4 5
> $ grp : Factor w/ 3 levels "A","B","C": 1 2 2 3 3 3
> $ grpA: num 1 0 0 0 0 0
> $ grpB: num 0 1 1 0 0 0
> $ grpC: num 0 0 0 1 1 1
>
> If you are trying to automate the process---convert factor variables to
> dummy variables without direct user input of variable names---you have
> several options. Here is a quick function I wrote that you may have to
> alter for your own needs.
>
> -tgs
>
> grp<- c("A", "B","B","C","C","C")
> sex<-c("m","m","m","f","f","f")
> educ<-c("none","some","some","grad","law","med")
> a<- c(1,4,3,4,5,6)
> b<- c(5,4,5,3,4,5)
> d<- data.frame(a=a,b=b,grp=grp,sex=sex,educ=educ)
>
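> Factors.to.dummies<-function(data){
>   # flag the columns that are factors
>   Factor.Flag<-sapply(data,is.factor)
>   # build a formula such as ~0+grp+sex+educ from the factor names
>   formula<-paste("~0+",paste(colnames(data)[Factor.Flag],collapse="+"),sep="")
>   # expand the factors into dummy columns and append them to the data
>   data2<-model.matrix(as.formula(formula),data=data)
>   return(cbind(data,data2))}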
>
> Factors.to.dummies(d)
> a b grp sex educ grpA grpB grpC sexm educlaw educmed educnone educsome
> 1 1 5 A m none 1 0 0 1 0 0 1 0
> 2 4 4 B m some 0 1 0 1 0 0 0 1
> 3 3 5 B m some 0 1 0 1 0 0 0 1
> 4 4 3 C f grad 0 0 1 0 0 0 0 0
> 5 5 4 C f law 0 0 1 0 1 0 0 0
> 6 6 5 C f med 0 0 1 0 0 1 0 0
>
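
Note that in the output above only grp gets a column for every level; sex and educ each lose their first level (there is no sexf or educgrad column), because with ~0+... only the first factor is coded in full. If a complete set of indicators is wanted for every factor, one possible approach is to pass non-contrast codings through the contrasts.arg argument of model.matrix(). This is only a sketch, assuming the redundant (collinear) columns are acceptable to the downstream model; the helper name all.dummies is hypothetical and not from the thread:

```r
# Example data as in the message above; factors are created explicitly.
d <- data.frame(a = c(1, 4, 3, 4, 5, 6),
                b = c(5, 4, 5, 3, 4, 5),
                grp = factor(c("A", "B", "B", "C", "C", "C")),
                sex = factor(c("m", "m", "m", "f", "f", "f")),
                educ = factor(c("none", "some", "some", "grad", "law", "med")))

# Hypothetical helper: one 0/1 column per level of every factor.
# Assumes the data frame contains at least one factor column.
all.dummies <- function(data) {
  is.fac <- sapply(data, is.factor)
  fo <- as.formula(paste("~0+", paste(names(data)[is.fac], collapse = "+")))
  # contrasts = FALSE yields an identity coding, so no level is dropped
  full <- lapply(data[is.fac], contrasts, contrasts = FALSE)
  cbind(data, model.matrix(fo, data = data, contrasts.arg = full))
}

d2 <- all.dummies(d)
colnames(d2)
```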
> On Sun, May 16, 2010 at 2:24 PM, Noah Silverman <noah at smartmediacorp.com> wrote:
>
>> I could, but with close to 100 columns, it's messy.
>>
>>
>> On 5/16/10 11:22 AM, Peter Ehlers wrote:
>>> On 2010-05-16 11:06, Noah Silverman wrote:
>>>> Update,
>>>>
>>>> I have it working, but now it's producing really ugly labels. Must be a
>>>> small adjustment to the code. Any ideas??
>>>>
>>>> ##Create example data.frame
>>>> group<- c("A", "B","B","C","C","C")
>>>> a<- c(1,4,3,4,5,6)
>>>> b<- c(5,4,5,3,4,5)
>>>> d<- data.frame(cbind(a,b,group))
>>>>
>>>> #create new frame with discretized group
>>>>> cbind(d[,1:2], model.matrix(~0+d[,3]) )
>>>> a b d[, 3]A d[, 3]B d[, 3]C
>>>> 1 1 5 1 0 0
>>>> 2 4 4 0 1 0
>>>> 3 3 5 0 1 0
>>>> 4 4 3 0 0 1
>>>> 5 5 4 0 0 1
>>>> 6 6 5 0 0 1
>>>>
>>>>
>>>> So, as you can see, it works, but the labels for the groups don't
>>>>
>>>> I then tried using the column name instead of number and still got ugly
>>>> results:
>>>>
>>>>> cbind(d[,1:2], model.matrix(~0+d[,"group"]) )
>>>> a b d[, "group"]A d[, "group"]B d[, "group"]C
>>>> 1 1 5 1 0 0
>>>> 2 4 4 0 1 0
>>>> 3 3 5 0 1 0
>>>> 4 4 3 0 0 1
>>>> 5 5 4 0 0 1
>>>> 6 6 5 0 0 1
>>>>
>>>>
>>>>
>>>> Any ideas?
>>>>
>>>
>>> Can't you just use names(...)<- c() on your final dataframe?
>>>
>>> -Peter Ehlers
>>>
>>>> -N
>>>>
>>>>
>>>>
>>>> On 5/15/10 11:02 AM, Noah Silverman wrote:
>>>>> Hi,
>>>>>
>>>>> I'm looking for an easy way to discretize factors in R
>>>>>
>>>>> I've noticed that the lm function does this automatically with a nice
>>>>> result.
>>>>>
>>>>> If I have
>>>>>
>>>>> group<- c("A", "B","B","C","C","C")
>>>>>
>>>>> and run:
>>>>>
>>>>> lm(result ~ x1 + group)
>>>>>
>>>>> The lm function has split the group into separate binary variables
>>>>> {0,1}
>>>>> before performing the regression. I now have:
>>>>> groupA
>>>>> groupB
>>>>> groupC
>>>>>
>>>>> Some of the other models that I want to try won't accept factors, so
>>>>> they need to be discretized this way.
>>>>>
>>>>> Is there a command in R for this, or some easy shortcut? (I tried
>>>>> digging into the lm code, but couldn't find where this is being done.)
>>>>>
>>>>> Thanks!
>>>>>
>>>>> -N
>>>>>