[R] Questions on factors in regression analysis
dwinsemius at comcast.net
Thu Aug 20 21:53:19 CEST 2009
On Aug 20, 2009, at 3:42 PM, guox at ucalgary.ca wrote:
>> On Aug 20, 2009, at 1:46 PM, guox at ucalgary.ca wrote:
>>> I got two questions on factors in regression:
>>> In a table, there are a few categorical/factor variables, a few
>>> variables and the response variable is numeric. Some factors are
>>> but others not.
>>> How to determine which categorical variables are significant to the
>>> response variable?
>> Seems that you should engage the services of a consulting statistician
>> for that sort of question. Or post in a venue where statistical
>> consulting is supposed to occur, such as one of the sci.stat.* newsgroups.
> I googled sci.stat.* and got sci.stat.math and sci.stat.consult.
> Are they good?
The quality of responses varies. You may get what you pay for. On the
other hand sometimes you get high-quality advice for free.
> I have no idea to do this. So any clue will be appreciated.
>>> As we knew, lm can deal with categorical variables.
>>> I thought, when there is a categorical predictor, we may use lm
>>> without quantifying these factors and assigning different values to
>>> would not change the fittings as shown:
>> The "numbers" that you are attempting to assign are really just labels
>> for the factor levels. The regression functions in R will not use them
>> for any calculations. They should not be thought of as having numeric
>> "values". Even if the factor is an ordered factor, the labels may not
>> be interpretable as having the same numerical order as the string
>> values might suggest.
>>> x <- 1:20 ## numeric predictor
>>> yes.no <- c("yes","no")
>>> factors <- gl(2,10,20,yes.no) ##factor predictor
>>> factors.quant <- rep(c(18.8,29.9),c(10,10)) ##quantification of factors
>> Not sure what that is supposed to mean. It is not a factor object
>> though you may be misleading yourself into believing it should be.
>> It's a numeric vector.
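
A quick way to check how R sees the two objects is sketched below. It reuses the names from the quoted code; the model.matrix() call is an addition here, only meant to show that the column built from the factor is a 0/1 indicator rather than anything derived from the label text.

```r
yes.no <- c("yes", "no")
factors <- gl(2, 10, 20, yes.no)                 # factor with levels "yes", "no"
factors.quant <- rep(c(18.8, 29.9), c(10, 10))   # plain numeric vector

is.factor(factors)         # TRUE
is.factor(factors.quant)   # FALSE
str(factors.quant)         # num [1:20] 18.8 18.8 ...

# The design matrix codes level "no" as a 0/1 indicator column
mm <- model.matrix(~ factors)
unique(mm[, "factorsno"])  # 0 1
```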
> Yes, levels are not numeric but just labels. But
> after the levels of factors being assigned to numeric values as factors.quant
> and factors.quant.1,
> lm(response ~ x + factors.quant) and lm(response ~ x + factors.quant.1)
> produced the same fitted curve as lm(response ~ x + factors). This
> is what
> I could not understand.
In both the factor variable case and the numeric variable case there
was no variation in the predictor variable within a level. So the
predictions will all be the same within levels in each case. There
will be differences in the coefficients arrived at to achieve that
result.
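
To make that concrete, here is a minimal sketch using the same simulated setup as the quoted code. The set.seed() call is an assumption added only so the run is repeatable. With two levels, the numeric coding is just a shifted and rescaled version of the 0/1 indicator, so both models fit the same values and only the coefficients change.

```r
set.seed(1)   # assumption: seed added for reproducibility only
x <- 1:20
factors <- gl(2, 10, 20, c("yes", "no"))
factors.quant <- rep(c(18.8, 29.9), c(10, 10))
response <- 0.8 * x + 18 + factors.quant + rnorm(20)

lm.fact  <- lm(response ~ x + factors)
lm.quant <- lm(response ~ x + factors.quant)

# Both design matrices span the same column space, so the fits agree
same.fit <- all.equal(fitted(lm.fact), fitted(lm.quant))

# The factor coefficient is the numeric slope times the gap between the codes
b.fact  <- coef(lm.fact)[["factorsno"]]
b.quant <- coef(lm.quant)[["factors.quant"]]
same.gap <- all.equal(b.fact, b.quant * (29.9 - 18.8))

# The intercepts differ by the numeric slope times the first code
i.fact  <- coef(lm.fact)[["(Intercept)"]]
i.quant <- coef(lm.quant)[["(Intercept)"]]
same.int <- all.equal(i.fact, i.quant + b.quant * 18.8)
```

The same relations can be checked by hand against the coefficients printed further down: 1.2350 * 11.1 and 0.6231 * 22 both come out close to 13.709.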
>> num [1:20] 18.8 18.8 18.8 18.8 18.8 18.8 18.8 18.8 18.8 18.8 ...
>>> factors.quant.1 <- rep(c(16.9,38.9),c(10,10))
>>> ##second quantification of factors
>>> response <- 0.8*x + 18 + factors.quant + rnorm(20) ##response
>>> lm.quant <- lm(response ~ x + factors.quant) ##lm with
>>> lm.fact <- lm(response ~ x + factors) ##lm with factors
>> lm(formula = response ~ x + factors.quant)
>>   (Intercept)              x  factors.quant
>>       14.9098         0.5385         1.2350
>> lm(formula = response ~ x + factors)
>>   (Intercept)              x      factorsno
>>       38.1286         0.5385        13.7090
>>> lm.quant.1 <- lm(response ~ x + factors.quant.1) ##lm with
>> lm(formula = response ~ x + factors.quant.1)
>>   (Intercept)              x  factors.quant.1
>>       27.5976         0.5385           0.6231
>>> lm.fact.1 <- lm(response ~ x + factors) ##lm with factors
>>> par(mfrow=c(2,2)) ## comparisons of two fittings
>>> plot(x, response)
>>> lines(x,fitted(lm.fact),col = "red")
>>> plot(x, response)
>>> lines(x,fitted(lm.quant.1),lty =2,col="blue")
>>> lines(x,fitted(lm.fact.1),lty =2,col = "red")
>>> par(mfrow = c(1,1))
>>> So, is it right that we can assign any numeric values to factors,
>>> for example, c(yes, no) = c(18.8,29.9) or (16.9,38.9) in the above,
>>> before doing lm, glm, aov, even nls?
>> You can give factor levels any name you like, including any sequence
>> of digit characters. Unlike "ordinary" R, where unquoted numbers cannot
>> start variable names, factor functions will coerce numeric vectors to
>> character vectors when assigning level names. But you seem to be
>> conflating factors with numeric vectors that have many ties. Those
>> entities would have different handling by R's regression functions.
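
The difference in handling shows up once there are more than two levels. The sketch below is a hypothetical three-level example (the names g, g.num and y, the level means and the seed are all made up for illustration). A factor gets one indicator column per non-reference level, while a numeric vector with ties gets a single slope, so the fits no longer coincide.

```r
set.seed(2)   # assumption: seed added for reproducibility only
g <- gl(3, 10, labels = c("a", "b", "c"))   # hypothetical three-level factor
g.num <- c(1, 2, 3)[g]                      # numeric coding of the same groups
y <- c(0, 5, 1)[g] + rnorm(30)              # group means not linear in the coding

fit.fact <- lm(y ~ g)       # intercept plus two indicator columns
fit.num  <- lm(y ~ g.num)   # intercept plus one slope

n.fact <- length(coef(fit.fact))   # 3
n.num  <- length(coef(fit.num))    # 2

# The fitted values now differ between the two models
fits.match <- isTRUE(all.equal(fitted(fit.fact), fitted(fit.num)))
```

With only two levels any numeric coding is equivalent to the indicator up to shift and scale, which is why the yes/no example gave identical fitted lines.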
David Winsemius, MD
West Hartford, CT