[R] Creating dummy vars with contrasts - why does the returned identity matrix contain all levels (and not n-1 levels) ?

E Joffe ejoffe at hotmail.com
Sat Sep 14 08:21:10 CEST 2013


Hi David,

First, I ordered the levels of each factor in descending order of
frequency.
Then I used the following code to generate a matrix with dummy variables
from the data frame and subsequently ran glmnet (coxnet):

  ## transform categorical variables into binary dummy variables for trainSet
  ## (drop = FALSE keeps a data frame even if there is only one factor column)
  predict_matrix <- model.matrix(~ ., data = trainSet,
                                 contrasts.arg = lapply(
                                   trainSet[, sapply(trainSet, is.factor), drop = FALSE],
                                   contrasts))
  
  ## remove the status/time variables from the predictor matrix (x) for glmnet
  predict_matrix <- subset(predict_matrix, select = c(-time, -status))
  
  ## create a glmnet cox object using lasso regularization and cross validation
  glmnet.cv <- cv.glmnet(predict_matrix, surv_obj, family = "cox")
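
For illustration only, here is a minimal sketch of the two steps described
above (ordering levels by frequency, then expanding factors into dummy
columns). The data frame `toy` and its columns `age` and `trt` are made up
for this example and are not the original `trainSet`; dropping the intercept
column is an assumption about what the predictor matrix should contain, not
something stated in the original code.

```r
# Hypothetical toy data; 'toy', 'age' and 'trt' are illustrative names only.
set.seed(1)
toy <- data.frame(
  age = rnorm(100, mean = 60, sd = 10),
  trt = factor(sample(c("PLACEBO", "300 MG", "600 MG", "1200 MG"),
                      100, replace = TRUE, prob = c(0.4, 0.3, 0.2, 0.1)))
)

# Reorder the levels so the most frequent level comes first; with the
# default treatment contrasts the first level becomes the reference.
freq <- sort(table(toy$trt), decreasing = TRUE)
toy$trt <- factor(toy$trt, levels = names(freq))

# Expand the factor into dummy columns, then drop the constant
# "(Intercept)" column, which carries no information as a predictor.
x <- model.matrix(~ ., data = toy)[, -1, drop = FALSE]

# 'age' plus one dummy column per non-reference level of 'trt'
colnames(x)
```

With treatment contrasts, each dummy column compares its level with the
reference level, here the most frequent one.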


I hope I did not do anything wrong .....

Can't thank you enough for your advice and interest.

Erel 



-----Original Message-----
From: David Winsemius [mailto:dwinsemius at comcast.net] 
Sent: Friday, September 13, 2013 8:51 PM
To: E Joffe
Cc: r-help at r-project.org
Subject: Re: [R] Creating dummy vars with contrasts - why does the returned
identity matrix contain all levels (and not n-1 levels) ?


On Sep 13, 2013, at 9:33 AM, E Joffe wrote:

> Thank you so much for your answer!
> As far as I understand, glmnet doesn't accept categorical variables,
> only binary factors, so I had to create dummy variables for all
> categorical variables.

I was rather puzzled by your question. The conventions used by glmnet should
prevent contrasts from being pre-specified. Only matrices are accepted as
data objects, and one cannot assign contrast attributes to matrix columns.
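
As a small base-R sketch of this point (the five-element factor below is
invented for the example): contrasts belong to the factor, so any coding has
to be chosen before the design matrix is built. The result of model.matrix()
is a plain numeric matrix.

```r
# Illustrative factor with four levels (values made up for this sketch).
trt <- factor(c("PLACEBO", "300 MG", "600 MG", "1200 MG", "300 MG"))

# Default treatment contrasts: n - 1 columns, first level is the baseline.
n_default <- ncol(contrasts(trt))                 # 3

# contrasts = FALSE: full identity coding, one column per level.
n_full <- ncol(contrasts(trt, contrasts = FALSE)) # 4

# A different coding is assigned to the factor itself ...
contrasts(trt) <- contr.sum(nlevels(trt))

# ... and takes effect when the design matrix is built.
mm <- model.matrix(~ trt)
is.matrix(mm) && is.numeric(mm)
```

Under sum-to-zero coding the last level ("PLACEBO" here) is coded as -1 in
every dummy column instead of being an all-zero baseline.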

> It worked perfectly.
> Erel
>
>
> Erel Joffe MD MSc
> School of Biomedical Informatics
> University of Texas - Health Science Center in Houston
> 832.287.0829 (c)
>
> -----Original Message-----
> From: David Winsemius [mailto:dwinsemius at comcast.net]
> Sent: Friday, September 13, 2013 3:05 PM
> To: E Joffe
> Cc: r-help at r-project.org
> Subject: Re: [R] Creating dummy vars with contrasts - why does the 
> returned identity matrix contain all levels (and not n-1 levels) ?
>
>
> On Sep 13, 2013, at 4:15 AM, E Joffe wrote:
>
>> Hello,
>>
>>
>>
>> I have a problem with creating an identity matrix for glmnet by using 
>> the contrasts function.
>
> Why do you want to do this?
>
>> I have a factor with 4 levels.
>>
>> When I create dummy variables I think there should be n-1 variables 
>> (in this case 3) - so that the contrasts would be against the 
>> baseline level.
>>
>> This is also what is written in the help file for 'contrasts'.
>>
>> The problem is that the function creates a matrix with n variables 
>> (i.e. the same as the number of levels) and not n-1 (where I would 
>> have a baseline level for comparison).
>
> Only if you specify contrasts=FALSE does it do so and this is 
> documented in that help file.
>>
>>
>>
>> My questions are:
>>
>> 1.       How can I create a matrix with n-1 dummy vars ?
>
> See below.
>
>> was I supposed to
>> define explicitly that I want contr.treatment (contrasts) ?
>
> No need to do so.
>
>>
>> 2.       If it is not possible, how should I interpret the hazard ratios
>> in the Cox regression I am generating (I use glmnet for variable
>> selection and then generate a Cox regression)? That is, if I get an HR
>> of 3 for the variable 300mg, what does it mean? The hazard is 3 times
>> higher than what?
>>
>
> Relative hazards are generally referenced to the "baseline hazard",
> i.e. the hazard for a group with the omitted level for treatment
> contrasts and the mean value for any numeric predictors.
>
>> Here is some code to reproduce the issue:
>>
>> # Create a 4 level example factor
>>
>> trt <- factor( sample( c("PLACEBO", "300 MG", "600 MG", "1200 MG"),
>>                      100, replace=TRUE ) )
>
> # If your intent is to use contrasts different from the defaults used by
> # regression functions, these factor contrasts need to be assigned, either
> # within the construction of the factor or after the fact.
>
>> contrasts(trt)
>         300 MG 600 MG PLACEBO
> 1200 MG      0      0       0
> 300 MG       1      0       0
> 600 MG       0      1       0
> PLACEBO      0      0       1
>
> # the default value for the contrasts parameter is TRUE and the
> # default type is treatment
>
> # That did not cause any change to the 'trt'-object:
> trt
>
> #To make a change you need to use the `contrasts<-` function:
>
> contrasts (trt) <- contrasts(trt)
> trt
>
>>
>> # Use contrasts to get the identity matrix of dummy variables to be
>> # used in glmnet
>>
>> trt2 <- contrasts (trt,contrasts=FALSE)
>>
>> Results (as you can see all levels are represented in the identity
>> matrix):
>>
>>> levels (trt)
>> [1] "1200 MG" "300 MG"  "600 MG"  "PLACEBO"
>>
>>
>>> print (trt2)
>>
>>         1200 MG 300 MG 600 MG PLACEBO
>> 1200 MG       1      0      0       0
>> 300 MG        0      1      0       0
>> 600 MG        0      0      1       0
>> PLACEBO       0      0      0       1
>>
>>
>>
>> 	[[alternative HTML version deleted]]
>
> Rhelp is a plain text mailing list.
>
> -- 
> David Winsemius, MD
> Alameda, CA, USA
>

David Winsemius, MD
Alameda, CA, USA


