[R] Discretize factors?

Peter Ehlers ehlers at ucalgary.ca
Mon May 17 00:10:49 CEST 2010


And if you do have many variables in one dataframe, you might
wish to construct the formula first using paste():

  nm <- c("0", names(d)[-c(1,2)])
  fo <- as.formula(paste("~", paste(nm, collapse= "+")))
  d <- cbind(d, model.matrix(fo, data=d))
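
For reference, a self-contained sketch of the same idea. It assumes toy
data like that in Thomas's message below and picks the factor columns by
type rather than by position (the -c(1,2) above assumes the first two
columns are the numeric ones). The name d2 is only illustrative, and
stringsAsFactors = TRUE is spelled out because R 4.0.0 and later no longer
convert strings to factors by default.

```r
# Toy data; stringsAsFactors = TRUE is explicit because R >= 4.0.0
# keeps character columns as character by default.
d <- data.frame(a   = c(1, 4, 3, 4, 5, 6),
                b   = c(5, 4, 5, 3, 4, 5),
                grp = c("A", "B", "B", "C", "C", "C"),
                sex = c("m", "m", "m", "f", "f", "f"),
                stringsAsFactors = TRUE)

# Pick the factor columns by type instead of by position.
nm <- names(d)[sapply(d, is.factor)]

# Build the formula "~ 0+grp+sex"; the leading 0 drops the intercept.
fo <- as.formula(paste("~", paste(c("0", nm), collapse = "+")))

# Append the dummy columns to the original data frame.
d2 <- cbind(d, model.matrix(fo, data = d))
names(d2)
```

Selecting by type should scale to the ~100-column case without any
hand-maintained list of column positions.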

  -Peter Ehlers

On 2010-05-16 15:30, Thomas Stewart wrote:
> Maybe this will lead you to an acceptable solution.  Note that I changed how
> the data set is created.  (In your example, the numeric variables were being
> converted to factor variables.  It seems to me that you want something
> different.)  The key difference between my code and yours is that I use the
> variable name in the model matrix function; that is, I use ~0+grp instead of
> ~0+d[,3].  As seen below, this change creates non-ugly results.
>
>> grp<- c("A", "B","B","C","C","C")
>> a<- c(1,4,3,4,5,6)
>> b<- c(5,4,5,3,4,5)
>> d<- data.frame(a=a,b=b,grp=grp)
>>
>> str(d)
> 'data.frame':   6 obs. of  3 variables:
>   $ a  : num  1 4 3 4 5 6
>   $ b  : num  5 4 5 3 4 5
>   $ grp: Factor w/ 3 levels "A","B","C": 1 2 2 3 3 3
>>
>> d<-cbind(d,model.matrix(~0+grp,data=d))
>>
>> d
>    a b grp grpA grpB grpC
> 1 1 5   A    1    0    0
> 2 4 4   B    0    1    0
> 3 3 5   B    0    1    0
> 4 4 3   C    0    0    1
> 5 5 4   C    0    0    1
> 6 6 5   C    0    0    1
>> str(d)
> 'data.frame':   6 obs. of  6 variables:
>   $ a   : num  1 4 3 4 5 6
>   $ b   : num  5 4 5 3 4 5
>   $ grp : Factor w/ 3 levels "A","B","C": 1 2 2 3 3 3
>   $ grpA: num  1 0 0 0 0 0
>   $ grpB: num  0 1 1 0 0 0
>   $ grpC: num  0 0 0 1 1 1
>
> If you are trying to automate the process---convert factor variables to
> dummy variables without direct user input of variable names---you have
> several options.  Here is a quick function I wrote that you may have to
> alter for your own needs.
>
> -tgs
>
> grp<- c("A", "B","B","C","C","C")
> sex<-c("m","m","m","f","f","f")
> educ<-c("none","some","some","grad","law","med")
> a<- c(1,4,3,4,5,6)
> b<- c(5,4,5,3,4,5)
> d<- data.frame(a=a,b=b,grp=grp,sex=sex,educ=educ)
>
> Factors.to.dummies <- function(data) {
>   Factor.Flag <- sapply(data, is.factor)
>   formula <- paste("~0+", paste(colnames(data)[Factor.Flag], collapse="+"), sep="")
>   data2 <- model.matrix(as.formula(formula), data=data)
>   return(cbind(data, data2))
> }
>
> Factors.to.dummies(d)
>    a b grp sex educ grpA grpB grpC sexm educlaw educmed educnone educsome
> 1 1 5   A   m none    1    0    0    1       0       0        1        0
> 2 4 4   B   m some    0    1    0    1       0       0        0        1
> 3 3 5   B   m some    0    1    0    1       0       0        0        1
> 4 4 3   C   f grad    0    0    1    0       0       0        0        0
> 5 5 4   C   f  law    0    0    1    0       1       0        0        0
> 6 6 5   C   f  med    0    0    1    0       0       1        0        0
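
Note that in the output above only grp gets a column for every level; sex
and educ each lose their first level (there is no sexf or educgrad),
because model.matrix() applies the default treatment contrasts to every
factor after the first. If a full set of indicators per factor is wanted,
one possible approach is to pass identity contrasts through contrasts.arg.
The sketch below assumes the same toy data; full_dummies is a hypothetical
helper name, not something from this thread.

```r
# Same toy data as above; factors are requested explicitly (R >= 4.0.0).
d <- data.frame(a    = c(1, 4, 3, 4, 5, 6),
                b    = c(5, 4, 5, 3, 4, 5),
                grp  = c("A", "B", "B", "C", "C", "C"),
                sex  = c("m", "m", "m", "f", "f", "f"),
                educ = c("none", "some", "some", "grad", "law", "med"),
                stringsAsFactors = TRUE)

# Hypothetical helper: one 0/1 column per level of every factor.
full_dummies <- function(data) {
  fac <- names(data)[sapply(data, is.factor)]
  # contrasts(f, contrasts = FALSE) returns an identity matrix over the
  # levels of f, so no level is dropped when it is used as the coding.
  ctr <- lapply(data[fac], contrasts, contrasts = FALSE)
  fo  <- as.formula(paste("~0+", paste(fac, collapse = "+"), sep = ""))
  mm  <- model.matrix(fo, data = data, contrasts.arg = ctr)
  cbind(data, mm)
}

out <- full_dummies(d)
names(out)
```

The resulting matrix is rank deficient by construction, which is fine for
feeding indicator columns to other tools but not for an unpenalized
regression that also includes them all.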
>
> On Sun, May 16, 2010 at 2:24 PM, Noah Silverman <noah at smartmediacorp.com> wrote:
>
>> I could, but with close to 100 columns, it's messy.
>>
>>
>> On 5/16/10 11:22 AM, Peter Ehlers wrote:
>>> On 2010-05-16 11:06, Noah Silverman wrote:
>>>> Update,
>>>>
>>>> I have it working, but now it's producing really ugly labels.  Must be a
>>>> small adjustment to the code.  Any ideas??
>>>>
>>>> ##Create example data.frame
>>>> group<- c("A", "B","B","C","C","C")
>>>> a<- c(1,4,3,4,5,6)
>>>> b<- c(5,4,5,3,4,5)
>>>> d<- data.frame(cbind(a,b,group))
>>>>
>>>> #create new frame with discretized group
>>>>> cbind(d[,1:2], model.matrix(~0+d[,3]) )
>>>>     a b d[, 3]A d[, 3]B d[, 3]C
>>>> 1 1 5       1       0       0
>>>> 2 4 4       0       1       0
>>>> 3 3 5       0       1       0
>>>> 4 4 3       0       0       1
>>>> 5 5 4       0       0       1
>>>> 6 6 5       0       0       1
>>>>
>>>>
>>>> So, as you can see, it works, but the labels for the groups don't
>>>>
>>>> I then tried using the column name instead of number and still got ugly
>>>> results:
>>>>
>>>>> cbind(d[,1:2], model.matrix(~0+d[,"group"]) )
>>>>     a b d[, "group"]A d[, "group"]B d[, "group"]C
>>>> 1 1 5             1             0             0
>>>> 2 4 4             0             1             0
>>>> 3 3 5             0             1             0
>>>> 4 4 3             0             0             1
>>>> 5 5 4             0             0             1
>>>> 6 6 5             0             0             1
>>>>
>>>>
>>>>
>>>> Any ideas?
>>>>
>>>
>>> Can't you just use names(...)<- c() on your final dataframe?
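
A minimal sketch of that renaming, assuming the example data above with
group stored as a factor. The names mm and res are only illustrative. The
literal "d[, 3]" prefix is swapped for the column name before binding.

```r
# Example data from the question, with group as a factor.
d <- data.frame(a     = c(1, 4, 3, 4, 5, 6),
                b     = c(5, 4, 5, 3, 4, 5),
                group = factor(c("A", "B", "B", "C", "C", "C")))

# Indexing by position yields labels such as "d[, 3]A".
mm <- model.matrix(~ 0 + d[, 3])

# Replace the literal "d[, 3]" prefix; fixed = TRUE avoids regex
# interpretation of the square brackets.
colnames(mm) <- sub("d[, 3]", "group", colnames(mm), fixed = TRUE)

res <- cbind(d[, 1:2], mm)
names(res)
```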
>>>
>>>   -Peter Ehlers
>>>
>>>> -N
>>>>
>>>>
>>>>
>>>> On 5/15/10 11:02 AM, Noah Silverman wrote:
>>>>> Hi,
>>>>>
>>>>> I'm looking for an easy way to discretize factors in R
>>>>>
>>>>> I've noticed that the lm function does this automatically with a nice
>>>>> result.
>>>>>
>>>>> If I have
>>>>>
>>>>> group<- c("A", "B","B","C","C","C")
>>>>>
>>>>> and run:
>>>>>
>>>>> lm(result ~ x1 + group)
>>>>>
>>>>> The lm function has split the group into separate binary variables
>>>>> {0,1}
>>>>> before performing the regression.  I now have:
>>>>> groupA
>>>>> groupB
>>>>> groupC
>>>>>
>>>>> Some of the other models that I want to try won't accept factors, so
>>>>> they need to be discretized this way.
>>>>>
>>>>> Is there a command in R for this, or some easy shortcut?  (I tried
>>>>> digging into the lm code, but couldn't find where this is being done.)
>>>>>
>>>>> Thanks!
>>>>>
>>>>> -N
>>>>>


