[R] Questions on factors in regression analysis

Thu Aug 20 23:36:41 CEST 2009

>
> On Aug 20, 2009, at 4:07 PM, guox at ucalgary.ca wrote:
>
>>>
>>> On Aug 20, 2009, at 3:42 PM, guox at ucalgary.ca wrote:
>>>
>>>> Thanks!
>>>>>
>>>>> On Aug 20, 2009, at 1:46 PM, guox at ucalgary.ca wrote:
>>>>>
>>>>>> I got two questions on factors in regression:
>>>>>>
>>>>>> Q1.
>>>>>> In a table, there are a few categorical/factor variables, a few
>>>>>> numerical
>>>>>> variables and the response variable is numeric. Some factors are
>>>>>> important
>>>>>> but others not.
>>>>>> How can I determine which categorical variables are significant
>>>>>> for the response variable?
>>>>>
>>>>> Seems that you should engage the services of a consulting
>>>>> statistician
>>>>> for that sort of question. Or post in a venue where statistical
>>>>> consulting is supposed to occur, such as one of the sci.stat.*
>>>>> newsgroups.
>>>>
>>>> I googled sci.stat.* and got sci.stat.math and sci.stat.consult.
>>>> Are they good?
>>>
>>> The quality of responses varies. You may get what you pay for. On the
>>> other hand sometimes you get high-quality advice for free.
>>>
>>>> I have no idea how to do this. So any clue will be appreciated.
>>>
>>> http://groups.google.com/?hl=en
>>>
>>>>
>>>>>
>>>>>>
>>>>>> Q2.
>>>>>> As we know, lm can deal with categorical variables.
>>>>>> I thought that, when there is a categorical predictor, we may use
>>>>>> lm directly without quantifying the factor, and that assigning
>>>>>> different numeric values to the factor levels would not change
>>>>>> the fit, as shown:
>>>>>
>>>>> The "numbers" that you are attempting to assign are really just
>>>>> labels
>>>>> for the factor levels. The regression functions in R will not use
>>>>> them
>>>>> for any calculations. They should not be thought of as having
>>>>> "values". Even if the factor is an ordered factor, the labels may
>>>>> not
>>>>> be interpretable as having the same numerical order as the string
>>>>> values might suggest.
>>>>>
>>>>>>
>>>>>> x <- 1:20 ## numeric predictor
>>>>>> yes.no <- c("yes","no")
>>>>>> factors <- gl(2,10,20,yes.no) ##factor predictor
>>>>>> factors.quant <-  rep(c(18.8,29.9),c(10,10)) ##quantification of
>>>>>> factors
>>>>>
>>>>> Not sure what that is supposed to mean. It is not a factor object
>>>>> even
>>>>> though you may be misleading yourself into believing it should be.
>>>>> It's a numeric vector.
>>>>
>>>> Yes, levels are not numeric but just labels. But
>>>> after the levels of factors were assigned numeric values as
>>>> factors.quant and factors.quant.1,
>>>> lm(response ~ x + factors.quant) and lm(response ~ x +
>>>> factors.quant.1)
>>>> produced the same fitted curve as lm(response ~ x + factors). This
>>>> is what
>>>> I could not understand.
>>>
>>> In both the factor variable case and the numeric variable case there
>>> was no variation in the predictor variable within a level. So the
>>> predictions will all be the same within levels in each case. There
>>> will be differences in the coefficients arrived at to achieve that
>>> result, however.
>>
>> I even tried
>>
>>> cor(response, factors)
>> [1] 0.968241
>>> cor(response, factors.quant)
>> [1] 0.968241
>>> cor(response, factors.quant.1)
>> [1] 0.968241
>>
>> If assigning values to factors does not change the curve fitting,
>> one may use factors.quant in the regression analysis if one wants to
>> find the curve patterns.
>> The coefficients are different since the models use different
>> predictors. If the coefficients were the same, then the fitted curves
>> would be different.
>
> Try setting up with 3 factor levels and three discrete values for the
> numeric predictor. The cor() function will continue to give meaningful
> results for the numeric variable but not for the factor variable. The
> interpretation of the coefficients from a model with three-level
> factors may require further study on your part.
>
Yes, when the number of levels is greater than or equal to 3,
that is not true. Thanks,
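
A minimal sketch of the three-level case, using made-up data. The names
grp and grp.num, the seed, and the effect sizes are my own assumptions
and do not come from the code quoted in this thread. With two levels,
any numeric coding gives the same fitted values as the factor; with
three levels the factor model has an extra coefficient and the fits
generally differ.

```r
set.seed(1)                                   # arbitrary seed, for reproducibility
x       <- rep(1:10, times = 3)               # numeric predictor
grp     <- gl(3, 10, labels = c("a", "b", "c"))  # three-level factor
grp.num <- c(1, 2, 3)[as.integer(grp)]        # hypothetical numeric coding of the levels

# Level effects 0, 10, 2 are deliberately not linear in the coding 1, 2, 3
response <- 0.8 * x + c(0, 10, 2)[as.integer(grp)] + rnorm(30)

fit.fact <- lm(response ~ x + grp)       # two dummy columns for grp
fit.num  <- lm(response ~ x + grp.num)   # a single slope for grp.num

length(coef(fit.fact))                                  # 4
length(coef(fit.num))                                   # 3
isTRUE(all.equal(fitted(fit.fact), fitted(fit.num)))    # FALSE

# Restricting to two levels makes the two fits coincide again
keep      <- grp != "c"
fit2.fact <- lm(response ~ x + grp,     subset = keep)
fit2.num  <- lm(response ~ x + grp.num, subset = keep)
isTRUE(all.equal(fitted(fit2.fact), fitted(fit2.num)))  # TRUE
```

In the two-level case the numeric column is just a shifted and rescaled
copy of the dummy column, which is why the fitted values agree while the
coefficients do not.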
>
>>
>> Can I rank factors.1 and factors.2 using
>> cor(response, factors.1) and cor(response, factors.2)?
>> Thanks,
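
For what it is worth, one common alternative to cor() for judging
whether a factor matters is an F test for dropping each term from the
fitted model. The sketch below is only an illustration under my own
assumptions: the data are simulated, and f1, f2, and the effect sizes
are hypothetical names and values, not the variables from this thread.
It is not a substitute for statistical advice on a real data set.

```r
set.seed(2)                                  # arbitrary seed, for reproducibility
n  <- 60
x  <- rnorm(n)                               # numeric predictor
f1 <- gl(3, 20, labels = c("a", "b", "c"))   # factor that shifts the response
f2 <- gl(2, 1, n, labels = c("u", "v"))      # factor with no effect

response <- 1 + 0.5 * x + c(0, 2, 4)[as.integer(f1)] + rnorm(n)

fit <- lm(response ~ x + f1 + f2)

# One F test per term: each row compares the full model with the model
# that omits that term, so a factor with several levels gets a single test
tab <- drop1(fit, test = "F")
print(tab)
```

The "Pr(>F)" column gives one p-value per term, which works for factors
with any number of levels, whereas cor() needs numeric input.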
>>>
>>>>
>>>>>> str(factors.quant)
>>>>> num [1:20] 18.8 18.8 18.8 18.8 18.8 18.8 18.8 18.8 18.8 18.8 ...
>>>>>
>>>>>> factors.quant.1 <-  rep(c(16.9,38.9),c(10,10))
>>>>>> ##second quantification of factors
>>>>>> response <- 0.8*x + 18 + factors.quant + rnorm(20) ##response
>>>>>> lm.quant <- lm(response ~ x + factors.quant) ##lm with
>>>>>> quantifications
>>>>>> lm.fact <- lm(response ~ x + factors) ##lm with factors
>>>>>
>>>>>> lm.quant
>>>>>
>>>>> Call:
>>>>> lm(formula = response ~ x + factors.quant)
>>>>>
>>>>> Coefficients:
>>>>>  (Intercept)              x  factors.quant
>>>>>      14.9098         0.5385         1.2350
>>>>>
>>>>>> lm.fact
>>>>>
>>>>> Call:
>>>>> lm(formula = response ~ x + factors)
>>>>>
>>>>> Coefficients:
>>>>> (Intercept)            x    factorsno
>>>>>    38.1286       0.5385      13.7090
>>>>>>
>>>>>> lm.quant.1 <- lm(response ~ x + factors.quant.1) ##lm with
>>>>>> quantifications
>>>>>
>>>>>> lm.quant.1
>>>>>
>>>>> Call:
>>>>> lm(formula = response ~ x + factors.quant.1)
>>>>>
>>>>> Coefficients:
>>>>>    (Intercept)                x  factors.quant.1
>>>>>        27.5976           0.5385           0.6231
>>>>>
>>>>>> lm.fact.1 <- lm(response ~ x + factors) ##lm with factors
>>>>>>
>>>>>> par(mfrow=c(2,2)) ## comparisons of two fittings
>>>>>> plot(x, response)
>>>>>> lines(x,fitted(lm.quant),col="blue")
>>>>>> grid()
>>>>>> plot(x,response)
>>>>>> lines(x,fitted(lm.fact),col = "red")
>>>>>> grid()
>>>>>> plot(x, response)
>>>>>> lines(x,fitted(lm.quant.1),lty =2,col="blue")
>>>>>> grid()
>>>>>> plot(x,response)
>>>>>> lines(x,fitted(lm.fact.1),lty =2,col = "red")
>>>>>> grid()
>>>>>> par(mfrow = c(1,1))
>>>>>>
>>>>>> So, is it right that we can assign any numeric values to factors,
>>>>>> for example, c(yes, no) = c(18.8,29.9) or (16.9,38.9) in the
>>>>>> above,
>>>>>> before doing lm, glm, aov, even nls?
>>>>>
>>>>> You can give factor levels any name you like, including any
>>>>> sequence
>>>>> of digit characters. Unlike "ordinary" R, where unquoted numbers
>>>>> cannot
>>>>> start variable names, factor functions will coerce numeric
>>>>> vectors to
>>>>> character vectors when assigning level names. But you seem to be
>>>>> conflating factors with numeric vectors that have many ties. Those
>>>>> two
>>>>> entities would have different handling by R's regression functions.
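
A small sketch of that distinction, assuming the same two values used
earlier in the thread (the names quant and f are my own). A factor built
from numbers stores character labels plus integer codes, while the
numeric vector keeps the values themselves, and the model matrix treats
the two differently.

```r
quant <- rep(c(18.8, 29.9), c(10, 10))  # numeric vector with many ties
f     <- factor(quant)                  # factor whose labels look like numbers

is.numeric(quant)                       # TRUE
is.numeric(f)                           # FALSE
levels(f)                               # "18.8" "29.9" (character labels)
as.integer(f)[c(1, 11)]                 # 1 2 (internal codes, not the values)
as.numeric(as.character(f))[c(1, 11)]   # 18.8 29.9 (recovered from the labels)

# The factor enters a model as a 0/1 dummy column, the numeric vector as itself
colnames(model.matrix(~ f))             # "(Intercept)" "f29.9"
colnames(model.matrix(~ quant))         # "(Intercept)" "quant"
```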
> --
>
>
> David Winsemius, MD
> Heritage Laboratories
> West Hartford, CT
>
>
>