[R] Questions on factors in regression analysis

guox at ucalgary.ca guox at ucalgary.ca
Thu Aug 20 21:42:01 CEST 2009


Thanks!
>
> On Aug 20, 2009, at 1:46 PM, guox at ucalgary.ca wrote:
>
>> I got two questions on factors in regression:
>>
>> Q1.
>> In a table, there are a few categorical/factor variables, a few numerical
>> variables, and the response variable is numeric. Some factors are
>> important but others are not.
>> How can I determine which categorical variables are significant to the
>> response variable?
>
> Seems that you should engage the services of a consulting statistician
> for that sort of question. Or post in a venue where statistical
> consulting is supposed to occur, such as one of the sci.stat.*
> newsgroups.

I googled sci.stat.* and found sci.stat.math and sci.stat.consult.
Are they good?
I have no idea how to do this, so any clue will be appreciated.
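
One common starting point for Q1, offered as a minimal sketch rather than as statistical advice, is to fit the full model and test each term as a whole with drop1(). The data frame and the names dat, f.strong, f.noise and z below are hypothetical and simulated only for illustration; they are not from the original post.

```r
set.seed(123)  # assumed seed, only so the sketch is reproducible
n <- 60

# Hypothetical data: one factor that matters, one that does not,
# and one numeric predictor.
dat <- data.frame(
  f.strong = gl(3, 20, n, labels = c("a", "b", "c")),
  f.noise  = factor(sample(c("u", "v"), n, replace = TRUE)),
  z        = rnorm(n)
)
dat$y <- 2 * dat$z + c(0, 3, 6)[dat$f.strong] + rnorm(n)

fit <- lm(y ~ z + f.strong + f.noise, data = dat)

# One F test per term: each factor is tested as a whole,
# not one dummy coefficient at a time.
tab <- drop1(fit, test = "F")
print(tab)
```

Each row of the table compares the full model with the model that omits that term, so a multi-level factor gets a single p-value. Whether such tests are appropriate for a given data set is still a question for a statistician.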

>
>>
>> Q2.
>> As we know, lm can deal with categorical variables.
>> I thought that, when there is a categorical predictor, we may use lm
>> directly without quantifying the factor, and that assigning different
>> numeric values to the factor levels would not change the fit, as shown:
>
> The "numbers" that you are attempting to assign are really just labels
> for the factor levels. The regression functions in R will not use them
> for any calculations. They should not be thought of as having
> "values". Even if the factor is an ordered factor, the labels may not
> be interpretable as having the same numerical order as the string
> values might suggest.
>
>>
>> x <- 1:20 ## numeric predictor
>> yes.no <- c("yes","no")
>> factors <- gl(2,10,20,yes.no) ##factor predictor
>> factors.quant <- rep(c(18.8,29.9),c(10,10)) ## quantification of factors
>
> Not sure what that is supposed to mean. It is not a factor object, even
> though you may be misleading yourself into believing it should be.
> It's a numeric vector.

Yes, levels are not numeric but just labels. But after replacing the
levels of factors with the numeric values in factors.quant and
factors.quant.1, lm(response ~ x + factors.quant) and
lm(response ~ x + factors.quant.1) produced the same fitted curve as
lm(response ~ x + factors). This is what I could not understand.
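
A likely explanation, sketched here under the assumption that the factor has exactly two levels: any numeric vector that takes one value on "yes" and another on "no" is a linear function of the 0/1 dummy that lm builds for the factor. The three models therefore span the same column space and give identical fitted values; only the intercept and the coefficient scale differ. The code reuses the names from the post, and set.seed is an addition for reproducibility.

```r
set.seed(1)  # assumed seed; the original post did not set one
x <- 1:20
factors <- gl(2, 10, 20, labels = c("yes", "no"))
factors.quant   <- rep(c(18.8, 29.9), c(10, 10))
factors.quant.1 <- rep(c(16.9, 38.9), c(10, 10))
response <- 0.8 * x + 18 + factors.quant + rnorm(20)

lm.fact    <- lm(response ~ x + factors)
lm.quant   <- lm(response ~ x + factors.quant)
lm.quant.1 <- lm(response ~ x + factors.quant.1)

# Identical fitted values from all three parameterisations
same.fit   <- isTRUE(all.equal(fitted(lm.fact), fitted(lm.quant)))
same.fit.1 <- isTRUE(all.equal(fitted(lm.fact), fitted(lm.quant.1)))

# The numeric slope times the gap between the two codes
# reproduces the factor coefficient.
gap <- 29.9 - 18.8
slope.times.gap <- unname(coef(lm.quant)["factors.quant"]) * gap
factor.effect   <- unname(coef(lm.fact)["factorsno"])
```

The coefficients quoted in this thread show the same relation: 1.2350 * 11.1 and 0.6231 * 22 are both about 13.709, the factorsno estimate. This equivalence is special to two-level factors.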

>  > str(factors.quant)
>   num [1:20] 18.8 18.8 18.8 18.8 18.8 18.8 18.8 18.8 18.8 18.8 ...
>
>> factors.quant.1 <- rep(c(16.9,38.9),c(10,10)) ## second quantification of factors
>> response <- 0.8*x + 18 + factors.quant + rnorm(20) ##response
>> lm.quant <- lm(response ~ x + factors.quant) ##lm with quantifications
>> lm.fact <- lm(response ~ x + factors) ##lm with factors
>
>  > lm.quant
>
> Call:
> lm(formula = response ~ x + factors.quant)
>
> Coefficients:
>    (Intercept)              x  factors.quant
>        14.9098         0.5385         1.2350
>
>  > lm.fact
>
> Call:
> lm(formula = response ~ x + factors)
>
> Coefficients:
> (Intercept)            x    factorsno
>      38.1286       0.5385      13.7090
>>
>> lm.quant.1 <- lm(response ~ x + factors.quant.1) ## lm with quantifications
>
>  > lm.quant.1
>
> Call:
> lm(formula = response ~ x + factors.quant.1)
>
> Coefficients:
>      (Intercept)                x  factors.quant.1
>          27.5976           0.5385           0.6231
>
>> lm.fact.1 <- lm(response ~ x + factors) ##lm with factors
>>
>> par(mfrow=c(2,2)) ## comparisons of two fittings
>> plot(x, response)
>> lines(x,fitted(lm.quant),col="blue")
>> grid()
>> plot(x,response)
>> lines(x,fitted(lm.fact),col = "red")
>> grid()
>> plot(x, response)
>> lines(x,fitted(lm.quant.1),lty =2,col="blue")
>> grid()
>> plot(x,response)
>> lines(x,fitted(lm.fact.1),lty =2,col = "red")
>> grid()
>> par(mfrow = c(1,1))
>>
>> So, is it right that we can assign any numeric values to factors,
>> for example, c(yes, no) = c(18.8,29.9) or (16.9,38.9) in the above,
>> before doing lm, glm, aov, even nls?
>
> You can give factor levels any name you like, including any sequence
> of digit characters. Unlike "ordinary" R, where unquoted numbers cannot
> start variable names, factor functions will coerce numeric vectors to
> character vectors when assigning level names. But you seem to be
> conflating factors with numeric vectors that have many ties. Those two
> entities are handled differently by R's regression functions.
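
The different handling becomes visible as soon as the factor has more than two levels. This is a minimal sketch with simulated data; f3, num3 and the numeric codes 1, 2, 10 are hypothetical choices for illustration. A factor contributes one dummy column per non-reference level, while a numeric vector with ties contributes a single slope.

```r
set.seed(7)  # assumed seed, only so the sketch is reproducible
x <- 1:30
f3   <- gl(3, 10, 30, labels = c("low", "mid", "high"))
num3 <- rep(c(1, 2, 10), each = 10)  # hypothetical numeric codes
y <- 0.5 * x + rep(c(0, 6, 2), each = 10) + rnorm(30)

fit.factor  <- lm(y ~ x + f3)    # intercept, x, two dummy columns
fit.numeric <- lm(y ~ x + num3)  # intercept, x, one slope

n.col.factor  <- ncol(model.matrix(fit.factor))
n.col.numeric <- ncol(model.matrix(fit.numeric))

# With three levels the two fits no longer coincide.
fits.differ <- !isTRUE(all.equal(fitted(fit.factor), fitted(fit.numeric)))
```

The numeric version forces the three group effects to be proportional to the chosen codes, so arbitrary numbers assigned to the levels do change the fit here.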
>
> --
>
> David Winsemius, MD
> Heritage Laboratories
> West Hartford, CT
>
>
>



