[R] "Centered" dummy variables; non zero/one coding
Prof Brian Ripley
ripley at stats.ox.ac.uk
Wed Oct 13 12:48:50 CEST 2004
This can be done by setting a contrast function or matrix on a variable.
Look in e.g. chapter 6 of MASS (the only comprehensive tutorial on coding
factors in R, it seems).
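
As a rough sketch of what setting a contrast matrix could look like here: the
data below are invented for illustration (a hypothetical RACE factor with 30%
of records "H" and a made-up binary response y), and subtracting the
frequency-weighted column means from the usual treatment coding is only one
possible choice of contrast matrix, not necessarily the one MASS recommends.

```r
# Hypothetical data: RACE with four levels, 30% of records "H"
RACE <- factor(rep(c("B", "W", "H", "OTHER"), times = c(40, 80, 60, 20)))
set.seed(1)
y <- rbinom(length(RACE), 1, 0.5)          # made-up binary response

p  <- as.vector(prop.table(table(RACE)))   # level proportions, in levels(RACE) order
C  <- contr.treatment(levels(RACE))        # usual 0/1 coding, first level as baseline
Cc <- sweep(C, 2, drop(p %*% C))           # subtract frequency-weighted column means
contrasts(RACE) <- Cc                      # attach the centered coding to the factor

X <- model.matrix(~ RACE)
round(colMeans(X), 12)                     # coded columns now average zero
fit <- glm(y ~ RACE, family = binomial)    # family chosen only for the example
coef(fit)
```

With this coding the "H" column takes the values -0.3 and 0.7, the race
coefficients are still differences from the baseline level, and the intercept
becomes the linear predictor averaged over the dataset.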
On Tue, 12 Oct 2004, Peter Holck wrote:
> I'm uncertain if this is perhaps a stupid question:
>
> I want to create "centered" dummy variables to use in a call to glm(), and
> wondering if there's some slick method in R to do so. That is, rather than
> have a factor, which results in a glm() fit returning coefficients
> specifying either absence or presence of the factor, I'd like to fit a glm()
> without intercept such that the estimated coefficients (standard errors)
> represent the "average" value in my data set for that variable.
Is that really what you want? An `average' person having linear predictor
0, or more precisely, the linear predictor having average zero over the
dataset? What family of glm is this?
> An example: a data set has Race specified with 4 levels. I can manually
> specify 4 dummy variables for a no-intercept model with each variable rather
> than having a value of zero or one, has a centered value based on its
> frequency of occurrence in the data set. Thus if 30% of the records in the
> data set have Race of Hispanic, I can define a variable HISP that has a
> value of either -.3 or .7, resulting in my coefficient estimate for HISP
> representing the effect of an "average" person in the database (and a
> corresponding valid standard error).
Nope. A person can only have one race, so the coefficient estimates can
only represent jointly the effect of picking one of the possible races.
I think what you are striving for is that the average of the term `race'
be zero over the whole dataset. That's easy -- just compute the average
and subtract it via an offset term.
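
A minimal sketch of that computation, under assumptions: the data are invented
(hypothetical RACE factor and binary response y), the binomial family is
chosen only for the example, and the average is moved into the intercept after
fitting rather than through a literal offset() term in the formula.

```r
# Hypothetical data, for illustration only
RACE <- factor(rep(c("B", "W", "H", "OTHER"), times = c(40, 80, 60, 20)))
set.seed(1)
y <- rbinom(length(RACE), 1, 0.5)

fit <- glm(y ~ RACE, family = binomial)    # default treatment contrasts
X   <- model.matrix(fit)
rc  <- grep("^RACE", colnames(X))          # columns belonging to the term `race'

# Contribution of the race term to each record's linear predictor
race_term <- drop(X[, rc, drop = FALSE] %*% coef(fit)[rc])
shift     <- mean(race_term)               # its average over the dataset

race_centered <- race_term - shift         # term now averages zero
intercept_adj <- coef(fit)[["(Intercept)"]] + shift

# The linear predictor itself is unchanged by the shift
all.equal(unname(intercept_adj + race_centered),
          unname(predict(fit, type = "link")))
```

The fitted values are the same either way; only the split between the
intercept and the race term changes.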
Once you have two or more factor predictors, doing it your way will give
you aliasing.
> One way to create these "centered dummy variables" from the original factor
> is:
> "B"=scale(RACE=="B",scale=FALSE),
> "W"=scale(RACE=="W",scale=FALSE),
> "H"=scale(RACE=="H",scale=FALSE),
> "OTHRACE"=scale(RACE=="OTHER",scale=FALSE)
>
> However I wonder if there is some method in R to avoid having to manually
> define a large number of these dummy variables for a more complicated
> dataset.
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595