[R] Creating dummy vars with contrasts - why does the returned identity matrix contain all levels (and not n-1 levels) ?

E Joffe ejoffe at hotmail.com
Fri Sep 13 16:33:16 CEST 2013


Thank you so much for your answer  !
As far as I understand, glmnet doesn't accept categorical variables, only
binary factors, so I had to create dummy variables for all categorical
variables.
It worked perfectly.
Erel 


Erel Joffe MD MSc
School of Biomedical Informatics
University of Texas - Health Science Center in Houston
832.287.0829 (c)

-----Original Message-----
From: David Winsemius [mailto:dwinsemius at comcast.net] 
Sent: Friday, September 13, 2013 3:05 PM
To: E Joffe
Cc: r-help at r-project.org
Subject: Re: [R] Creating dummy vars with contrasts - why does the returned
identity matrix contain all levels (and not n-1 levels) ?


On Sep 13, 2013, at 4:15 AM, E Joffe wrote:

> Hello,
>
>
>
> I have a problem with creating an identity matrix for glmnet by using 
> the contrasts function.

Why do you want to do this?

> I have a factor with 4 levels.
>
> When I create dummy variables I think there should be n-1 variables 
> (in this case 3) - so that the contrasts would be against the baseline 
> level.
>
> This is also what is written in the help file for 'contrasts'.
>
> The problem is that the function creates a matrix with n variables 
> (i.e. the same as the number of levels) and not n-1 (where I would 
> have a baseline level for comparison).

Only if you specify contrasts=FALSE does it do so and this is documented in
that help file.
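
A minimal sketch of that difference, assuming R's default contrast options
are in effect. The name trt_demo is made up for illustration and is not the
poster's object:

```r
# Hypothetical demo factor with the same four levels as in the question
trt_demo <- factor(c("PLACEBO", "300 MG", "600 MG", "1200 MG"))

# Default (contrasts = TRUE): treatment contrasts, n - 1 = 3 columns
default_coding <- contrasts(trt_demo)
dim(default_coding)

# contrasts = FALSE: one indicator column per level, n = 4 columns
full_coding <- contrasts(trt_demo, contrasts = FALSE)
dim(full_coding)
```

The first level in alphabetical order ("1200 MG") is the one dropped by the
default coding, which is why it has no column of its own.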
>
>
>
> My questions are:
>
> 1.       How can I create a matrix with n-1 dummy vars ?

See below.

> was I supposed to
> define explicitly that I want contr.treatment (contrasts) ?

No need to do so.
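
One possible way to get the n-1 dummy columns as a numeric matrix is
model.matrix() with the intercept column dropped. This is only a sketch and
not something taken from the thread: trt_demo and x_dummy are hypothetical
names, and the factor is built deterministically rather than with sample().

```r
# Hypothetical demo factor: 25 observations of each of the four levels
trt_demo <- factor(rep(c("PLACEBO", "300 MG", "600 MG", "1200 MG"),
                       times = 25))

# model.matrix() applies the default treatment contrasts to the factor;
# dropping the intercept column leaves n - 1 = 3 dummy variables
x_dummy <- model.matrix(~ trt_demo)[, -1, drop = FALSE]
dim(x_dummy)
colnames(x_dummy)
```

Each row has at most a single 1; rows that are all zero belong to the
reference level.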

>
> 2.       If it is not possible, how should I interpret the hazard ratios in
> the Cox regression I am generating (I use glmnet for variable selection and
> then generate a Cox regression)? That is, if I get an HR of 3 for the
> variable 300mg, what does it mean? The hazard is 3 times higher than what?
>

Relative hazards are generally referenced to the "baseline hazard",
i.e. the hazard for a group with the omitted level for treatment
contrasts and the mean value for any numeric predictors.
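
Which level is omitted depends on the level order. A small sketch, assuming
you would rather compare against placebo than against "1200 MG" (trt_demo
and trt_ref are hypothetical names):

```r
# Hypothetical demo factor: 25 observations of each of the four levels
trt_demo <- factor(rep(c("PLACEBO", "300 MG", "600 MG", "1200 MG"),
                       times = 25))

# With the default alphabetical ordering the reference level is "1200 MG"
levels(trt_demo)[1]

# relevel() moves "PLACEBO" to the front so that it becomes the reference
trt_ref <- relevel(trt_demo, ref = "PLACEBO")
contrasts(trt_ref)
```

With this coding in an ordinary (unpenalized) Cox model, a hazard ratio for
the "300 MG" column would be read relative to the placebo group, with the
other predictors held fixed.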

> Here is some code to reproduce the issue:
>
> # Create a 4 level example factor
>
> trt <- factor( sample( c("PLACEBO", "300 MG", "600 MG", "1200 MG"),
>                       100, replace=TRUE ) )

# If your intent is to use contrasts different from the defaults used by
# regression functions, these factor contrasts need to be assigned, either
# within the construction of the factor or after the fact.

 >  contrasts(trt)
         300 MG 600 MG PLACEBO
1200 MG      0      0       0
300 MG       1      0       0
600 MG       0      1       0
PLACEBO      0      0       1

# The default value for the contrasts parameter is TRUE and the default
# type is treatment.

# That did not cause any change to the 'trt'-object:
trt

#To make a change you need to use the `contrasts<-` function:

contrasts (trt) <- contrasts(trt)
trt
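
A short sketch of what the assignment form changes, again with a made-up
factor trt_demo. Choosing base = 4 (PLACEBO as the omitted level) is my
assumption for illustration, not something the original poster asked for:

```r
# Hypothetical demo factor with the same four levels as in the question
trt_demo <- factor(c("PLACEBO", "300 MG", "600 MG", "1200 MG"))

# Merely calling contrasts() stores nothing on the factor
before <- attr(trt_demo, "contrasts")   # NULL

# The replacement form attaches the coding matrix as an attribute;
# base = 4 makes the fourth level ("PLACEBO") the omitted one
contrasts(trt_demo) <- contr.treatment(levels(trt_demo), base = 4)
after <- attr(trt_demo, "contrasts")
after
```

Modelling functions that build a design matrix from the factor will then
pick up the stored contrasts instead of the session defaults.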

>
> # Use contrasts to get the identity matrix of dummy variables to be used in
> # glmnet
>
> trt2 <- contrasts (trt,contrasts=FALSE)
>
> Results (as you can see all levels are represented in the identity  
> matrix):
>
>> levels (trt)
> [1] "1200 MG" "300 MG"  "600 MG"  "PLACEBO"
>
>
>> print (trt2)
>
>         1200 MG 300 MG 600 MG PLACEBO
> 1200 MG       1      0      0       0
> 300 MG        0      1      0       0
> 600 MG        0      0      1       0
> PLACEBO       0      0      0       1
>
>
>
> 	[[alternative HTML version deleted]]

Rhelp is a plain text mailing list.

-- 
David Winsemius, MD
Alameda, CA, USA


