[R] Preparing dataset for glmnet: factors to dummies

Tue Feb 1 12:33:53 CET 2011

>>>>> "NS" == Nick Sabbe <nick.sabbe at ugent.be>
>>>>>     on Tue, 1 Feb 2011 10:46:01 +0100 writes:

    NS> Hello list.
    NS> For some reason, the makers of glmnet do not accept a dataframe as input.
    NS> They expect the input to be a matrix, where the dummies are already
    NS> precoded.
    NS> Now I have created a sample dataset with
    NS> . 11 factor columns with two levels
    NS> . 4 factor columns with three levels
    NS> . 135 continuous columns (from a standard normal)
    NS> . 100 observations (rows)
    NS> Say this dataframe is in dfrPredictors.

Please do provide your R code next time, so that we have a fully
reproducible example.

    NS> What I do now, is use the following code:

    NS> form<-paste("~",paste(colnames(dfrPredictors), collapse="+"), sep="")
    NS> dfrTmp<-model.frame(dfrPredictors, na.action=na.pass)
    NS> result<- as.matrix(model.matrix(as.formula(form), data=dfrTmp))[,-1]

    NS> This works (although admittedly, I don't understand everything of it).
    NS> However, I notice that for this rather limited dataset, this conversion
    NS> takes around 0.1 seconds user/elapsed time (on a relatively speedy laptop).

    NS> For my current work, I need to do this a lot of times on very similar
    NS> dataframes (in fact, they are multiply imputed from the same 'original'
    NS> dataframe), so I need all the speed I can get.

    NS> Does anybody know of a way that is quicker than the above? Note: because of
    NS> other uses of the dataframe, I don't have the option to do this conversion
    NS> before the imputation, so I really need the conversion itself to work
    NS> quickly.

Fortunately, the glmnet package also works with sparse matrices
(as provided by the 'Matrix' package).  In Matrix, there's the
function sparse.model.matrix(), which should work like
model.matrix() but produce a sparse matrix.
This is typically considerably faster when the resulting matrix
is large and sparse, notably because the memory footprint is so
much smaller.
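
A minimal sketch of that comparison, assuming a stand-in data frame
built to match the description above. The column names (f2_*, f3_*,
x*) and the response y are invented for illustration, and "~ ." is
used in place of the pasted formula; the data contain no missing
values, so the na.pass step is not reproduced here.

```r
library(Matrix)

set.seed(1)
n <- 100

# Hypothetical stand-in for dfrPredictors (column names are invented):
# 11 two-level factors, 4 three-level factors, 135 standard-normal columns.
cols <- c(
  replicate(11, factor(sample(c("a", "b"), n, replace = TRUE),
                       levels = c("a", "b")), simplify = FALSE),
  replicate(4, factor(sample(c("a", "b", "c"), n, replace = TRUE),
                      levels = c("a", "b", "c")), simplify = FALSE),
  replicate(135, rnorm(n), simplify = FALSE)
)
names(cols) <- c(paste0("f2_", 1:11), paste0("f3_", 1:4), paste0("x", 1:135))
dfrPredictors <- as.data.frame(cols)

# Dense route: "~ ." expands to all columns of the data frame.
dense <- model.matrix(~ ., data = dfrPredictors)[, -1]

# Sparse route: same formula interface, but the result is a sparse Matrix.
sparse <- sparse.model.matrix(~ ., data = dfrPredictors)[, -1]

# Both routes should hold the same numbers.
stopifnot(isTRUE(all.equal(as.matrix(sparse), dense, check.attributes = FALSE)))

# Rough timings; the numbers will vary by machine and data.
print(system.time(for (i in 1:20) model.matrix(~ ., data = dfrPredictors)))
print(system.time(for (i in 1:20) sparse.model.matrix(~ ., data = dfrPredictors)))

# glmnet accepts the sparse matrix directly (skipped if glmnet is absent).
if (requireNamespace("glmnet", quietly = TRUE)) {
  y <- rnorm(n)  # made-up response, for illustration only
  fit <- glmnet::glmnet(x = sparse, y = y)
}
```

Note that with 135 continuous columns most entries are non-zero, so
for data like these the sparse route may bring little or no gain; it
should pay off mainly when the dummy columns dominate.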

We (the Matrix authors) have gone a step further and written
a model.Matrix() function with an argument 'sparse = FALSE / TRUE',
which should mirror the functionality of R's model.matrix() even
more closely (model.matrix() itself produces only standard,
i.e., dense, matrices).

The functionality of model.Matrix() has been moved out of the
Matrix package into the package 'MatrixModels',
and that package also provides -- somewhat experimental --
functionality for fitting GLMs with sparse model matrices.
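
A small sketch of calling model.Matrix() from 'MatrixModels', assuming
that package is installed (the block is skipped otherwise). The data
frame d and its columns are invented for illustration.

```r
# Small invented data frame: one three-level factor, one numeric column.
d <- data.frame(g = factor(c("a", "b", "c", "a", "b", "c")),
                x = c(0.5, 1.2, -0.3, 2.0, 0.0, 1.1))

if (requireNamespace("MatrixModels", quietly = TRUE)) {
  # Same formula, dense or sparse result depending on 'sparse'.
  mm_dense  <- MatrixModels::model.Matrix(~ g + x, data = d, sparse = FALSE)
  mm_sparse <- MatrixModels::model.Matrix(~ g + x, data = d, sparse = TRUE)

  print(class(mm_dense))
  print(class(mm_sparse))

  # The shape should match what base R's model.matrix() gives:
  # intercept + 2 dummies for g + x = 4 columns.
  base_mm <- model.matrix(~ g + x, data = d)
  stopifnot(identical(dim(mm_sparse), dim(base_mm)))
}
```

Unlike sparse.model.matrix(), this keeps one entry point for both
storage modes, so switching between them is a matter of one argument.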

We'd be glad to get feedback on your uses and observations with
these sparse model matrices.

Martin Maechler, ETH Zurich