[R] Lasso with Categorical Variables

Tue May 3 08:40:55 CEST 2011

For performance reasons, I advise using the following function instead of
model.matrix:

factorsToDummyVariables<-function(dfr, betweenColAndLevel="")
{
	nc<-dim(dfr)[2]
	firstRow<-dfr[1,]
	coln<-colnames(dfr)
	retval<-do.call(cbind, lapply(seq(nc), function(ci){
			if(is.factor(firstRow[,ci]))
			{
				lvls<-levels(firstRow[,ci])[-1]
				stretchedcols<-sapply(lvls, function(lvl){
						rv<-dfr[,ci]==lvl
						mode(rv)<-"integer"
						return(rv)
					})
				if(!is.matrix(stretchedcols))
					stretchedcols<-matrix(stretchedcols, nrow=1)
				colnames(stretchedcols)<-paste(coln[ci],
					lvls, sep=betweenColAndLevel)
				return(stretchedcols)
			}
			else
			{
				curcol<-matrix(dfr[,ci], ncol=1)
				colnames(curcol)<-coln[ci]
				return(curcol)
			}
		}))
	rownames(retval)<-rownames(dfr)
	return(retval)
}
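
To illustrate the idea behind the function on a single column, here is a
minimal sketch (not part of the function itself): each non-reference level is
compared against the factor, which gives one 0/1 column per level. The toy
factor f and the column prefix "X1" are made up for this example.

```r
# Toy factor; the values are invented for illustration only.
f <- factor(c("D", "A", "C", "A", "B"))

# Drop the first level ("A"), which acts as the reference level.
lvls <- levels(f)[-1]

# One integer 0/1 column per remaining level.
dummies <- sapply(lvls, function(lvl) as.integer(f == lvl))
colnames(dummies) <- paste("X1", lvls, sep = "")

# Same coding as the default treatment contrasts in model.matrix,
# once the intercept column is removed.
mm <- model.matrix(~ X1, data.frame(X1 = f))[, -1]
print(all(dummies == mm))
```

This is the same comparison the function does per factor column, only without
the bookkeeping for non-factor columns and row names.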

Just for comparison: here is my old version of the same function, using
model.matrix:

factorsToDummyVariables.old<-function(dfrPredictors,
	form=paste("~",paste(colnames(dfrPredictors), collapse="+"), sep=""))
{
	#note: this function seems to operate quite slowly!
	#Because it is used often, it may be worth improving its speed
	dfrTmp<-model.frame(dfrPredictors, na.action=na.pass)
	frm<-as.formula(form)
	mm<-model.matrix(frm, data=dfrTmp)
	retval<-as.matrix(mm)[,-1]

	return(retval)
}

In a test case with a reasonably big dataset, I compared the speeds:

#system.time(tmp.fd.convds.full.man<-manualFactorsToDummyVariables(ds))
##   user  system elapsed
##   9.44    0.00    9.48
#system.time(tmp.fd.convds.full<-factorsToDummyVariables.old(ds))
##   user  system elapsed
##  15.49    0.00   15.64
#system.time(invisible(factorsToDummyVariables (ds[10,])))
##   user  system elapsed
##   0.36    0.00    0.36
#system.time(invisible(factorsToDummyVariables.old (ds[10,])))
##   user  system elapsed
##   2.18    0.00    2.20
#system.time(invisible(factorsToDummyVariables (ds[20:30,])))
##   user  system elapsed
##   0.34    0.00    0.38
#system.time(invisible(factorsToDummyVariables.old (ds[20:30,])))
##   user  system elapsed
##   2.11    0.00    2.15

If you have to do this quite often, the difference surely adds up...
More improvements may be possible.
This function only works if you don't include interactions, though.
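
If you do need interactions, model.matrix can still build them from a formula.
A small sketch, assuming a made-up data frame X with one factor and one numeric
column (the names X1 and X3 just mirror the thread's example):

```r
set.seed(1)
# Invented data: a 3-level factor and an integer column.
X <- data.frame(
  X1 = factor(sample(LETTERS[1:3], 8, replace = TRUE), levels = LETTERS[1:3]),
  X3 = sample(30:55, 8, replace = TRUE)
)

# X1 * X3 expands to X1 + X3 + X1:X3; [, -1] drops the intercept column.
mm <- model.matrix(~ X1 * X3, data = X)[, -1]
print(colnames(mm))
```

The interaction columns (named like X1B:X3) are the products of the dummy
columns with X3, which is the part the faster function above does not produce.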

Nick Sabbe
--
ping: nick.sabbe at ugent.be
link: http://biomath.ugent.be
wink: A1.056, Coupure Links 653, 9000 Gent
ring: 09/264.59.36

-- Do Not Disapprove

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
Behalf Of David Winsemius
Sent: maandag 2 mei 2011 20:48
To: Steve Lianoglou
Cc: r-help at r-project.org
Subject: Re: [R] Lasso with Categorical Variables

On May 2, 2011, at 10:51 AM, Steve Lianoglou wrote:

> Hi,
>
> On Mon, May 2, 2011 at 12:45 PM, Clemontina Alexander
> <ckalexa2 at ncsu.edu> wrote:
>> Hi! This is my first time posting. I've read the general rules and
>> guidelines, but please bear with me if I make some fatal error in
>> posting. Anyway, I have a continuous response and 29 predictors made
>> up of continuous variables and nominal and ordinal categorical
>> variables. I'd like to do lasso on these, but I get an error. The way
>> I am using "lars" doesn't allow for the factors. Is there a special
>> option or some other method in order to do lasso with cat. variables?
>>
>> Here is an example (considering ordinal variables as just nominal):
>>
>> set.seed(1)
>> Y <- rnorm(10,0,1)
>> X1 <- factor(sample(x=LETTERS[1:4], size=10, replace = TRUE))
>> X2 <- factor(sample(x=LETTERS[5:10], size=10, replace = TRUE))
>> X3 <- sample(x=30:55, size=10, replace=TRUE)  # think age
>> X4 <- rchisq(10, df=4, ncp=0)
>> X <- data.frame(X1,X2,X3,X4)
>>
>>> str(X)
>> 'data.frame':   10 obs. of  4 variables:
>>  $ X1: Factor w/ 4 levels "A","B","C","D": 4 1 3 1 2 2 1 2 4 2
>>  $ X2: Factor w/ 5 levels "E","F","G","H",..: 3 4 3 2 5 5 5 1 5 3
>>  $ X3: int  51 46 50 44 43 50 30 42 49 48
>>  $ X4: num  2.86 1.55 1.94 2.45 2.75 ...
>>
>>
>> I'd like to do:
>> obj <- lars(x=X, y=Y, type = "lasso")
>>
>> Instead, what I have been doing is converting all data to continuous
>> but I think this is really bad!
>
> Yeah, it is.
>
> Check out the "Categorical Predictor Variables" section here for a way
> to handle such predictor vars:
> http://www.psychstat.missouristate.edu/multibook/mlt08m.html

Steve's citation is somewhat helpful, but not sufficient to take the  
next steps. You can find details regarding the mechanics of typical  
linear regression in R on the ?lm page where you find that the factor  
variables are typically handled by model.matrix. See below:

 > model.matrix(~X1 + X2 + X3 + X4, X)
    (Intercept) X1B X1C X1D X2F X2G X2H X2I X3        X4
1            1   0   0   1   0   1   0   0 51 2.8640884
2            1   0   0   0   0   0   1   0 46 1.5462243
3            1   0   1   0   0   1   0   0 50 1.9430901
4            1   0   0   0   1   0   0   0 44 2.4504180
5            1   1   0   0   0   0   0   1 43 2.7535052
6            1   1   0   0   0   0   0   1 50 1.6200326
7            1   0   0   0   0   0   0   1 30 0.5750533
8            1   1   0   0   0   0   0   0 42 5.9224777
9            1   0   0   1   0   0   0   1 49 2.0401528
10           1   1   0   0   0   1   0   0 48 6.2995288
attr(,"assign")
  [1] 0 1 1 1 2 2 2 2 3 4
attr(,"contrasts")
attr(,"contrasts")$X1
[1] "contr.treatment"

attr(,"contrasts")$X2
[1] "contr.treatment"

The numeric variables are passed through, while the dummy variables  
for factor columns are constructed (as treatment contrasts) and the  
whole thing is returned in a neat package.
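
As a rough sketch of the remaining step: the matrix from model.matrix, minus
its intercept column, can be passed as x to lars. The data below follow the
same construction as the example in the question, but with 50 rows instead of
10 (an assumption made here so that every level is likely to occur), and the
lars call only runs if the package is installed.

```r
set.seed(1)
n <- 50  # assumption: more rows than in the original example
Y <- rnorm(n)
X <- data.frame(
  X1 = factor(sample(LETTERS[1:4], n, replace = TRUE)),
  X2 = factor(sample(LETTERS[5:10], n, replace = TRUE)),
  X3 = sample(30:55, n, replace = TRUE),  # think age
  X4 = rchisq(n, df = 4)
)

# Expand factors to treatment-contrast dummies; drop the intercept column.
xmat <- model.matrix(~ ., data = X)[, -1]

# Fit the lasso path only when the lars package is available.
if (requireNamespace("lars", quietly = TRUE)) {
  obj <- lars::lars(x = xmat, y = Y, type = "lasso")
  print(obj)
}
```

Note that the lasso then selects individual dummy columns, not whole factors.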

-- 
David.
>
> HTH,
> -steve
>
-- 
David Winsemius, MD
Heritage Laboratories
West Hartford, CT
