[R] Multiplication of regression coefficient by factor type variable
Julian Righ Sampedro
juli@n@righ@@@mpedro @ending from gm@il@com
Sat Nov 17 21:01:12 CET 2018
Dear all,
In a regression context, I have three regressors, two of which are
categorical variables (sex and education) of class 'factor'.
y = data$income
x1 = as.factor(data$sex) # 2 levels
x2 = data$age # continuous
x3 = as.factor(data$ed) # 8 levels
for example, the first entries of x3 are
head(x3)
[1] 5 3 5 5 4 2
Levels: 1 2 3 4 5 6 7 8
When we fit the model, the output looks like this:
model1=lm(y ~ x1 + x2 + x3, data = data)
summary(model1)
Residuals:
   Min     1Q Median     3Q    Max
-31220  -6300   -594   4429 190731

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   1440.66    3809.99   0.378 0.705417
x1           -4960.88     772.96  -6.418 2.13e-10 ***
x2             181.45      25.03   7.249 8.41e-13 ***
x32           2174.95    3453.22   0.630 0.528948
x33           7497.68    3428.94   2.187 0.029004 *
x34           8278.97    3576.30   2.315 0.020817 *
x35          13686.88    3454.93   3.962 7.97e-05 ***
x36          15902.92    4408.49   3.607 0.000325 ***
x37          28773.13    3696.77   7.783 1.76e-14 ***
x38          31455.55    5448.11   5.774 1.03e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 12060 on 1001 degrees of freedom
Multiple R-squared: 0.2486, Adjusted R-squared: 0.2418
F-statistic: 36.79 on 9 and 1001 DF, p-value: < 2.2e-16
Now suppose I want to compute the residuals. To do so, I first need to
compute the model's predictions. (I use this in a cross-validation
context, so this is only a partial display of the code.)
yhat1 = model1$coef[1] + model1$coef[2]*x1[i] + model1$coef[3]*x2[i] +
model1$coef[4]*x3[i]
But I get the following warnings:
Warning messages:
1: In Ops.factor(model1$coef[2], x1[i]) : ‘*’ not meaningful for factors
2: In Ops.factor(model1$coef[4], x3[i]) : ‘*’ not meaningful for factors
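For what it is worth, this warning is easy to reproduce with any factor, since ‘*’ is simply not defined for objects of class 'factor' (toy objects below, invented for illustration only):

```r
# a toy factor: arithmetic is not defined for class 'factor'
f <- factor(c("a", "b", "a"))

# capture the warning that multiplication raises
res <- tryCatch(2 * f, warning = function(w) conditionMessage(w))
print(res)  # the "not meaningful for factors" message

# the underlying integer codes exist, but they are NOT what
# lm() uses: lm() expands the factor into dummy columns instead
print(as.integer(f))
```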
1st question: Is there a way to multiply the coefficient by the 'factor'
without having to convert my 'factor' into a 'numeric' variable?
2nd question: Since x3 is associated with 7 parameters (one for x32, one
for x33, ..., one for x38), how do I multiply the 'correct' parameter
coefficient by my 'factor' x3?
I have been considering an 'if-then' solution, but to no avail. I have
also considered splitting my x3 variable into 8 binary variables,
without success. What would be the best approach? Thank you for your help.
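To illustrate what I mean by the 'correct' parameter: model.matrix() shows how lm() expands the factor into dummy columns, in exactly the same order as coef(), so row i of the design matrix picks out the right level automatically. A minimal sketch with simulated data (all names and values below are invented for illustration):

```r
set.seed(1)
# simulated stand-ins for the real variables
ed  <- factor(sample(1:8, 100, replace = TRUE))
age <- runif(100, 20, 60)
y   <- 1000 + 200 * age + 500 * as.integer(ed) + rnorm(100, sd = 100)

fit <- lm(y ~ age + ed)

# coefficient names line up with the dummy columns of the design matrix
print(names(coef(fit)))

# row i of model.matrix() has a 1 in the column of observation i's
# level, so X %*% coef multiplies each coefficient by the right dummy
X    <- model.matrix(fit)
yhat <- drop(X %*% coef(fit))

# this reproduces the fitted values without any factor arithmetic
stopifnot(all.equal(unname(yhat), unname(fitted(fit))))
```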
Since I understand this may not be specific enough, I add the complete
code here:
# for n-fold cross-validation
# fit models on leave-one-out samples
x1= as.factor(data$sex)
x2= data$age
x3= as.factor(data$ed)
yn=data$income
n = length(yn)
e1 = e2 = numeric(n)
for (i in 1:n) {
  # the ith observation is excluded
  y   = yn[-i]
  x_1 = x1[-i]
  x_2 = x2[-i]
  x_3 = x3[-i]
  x_4 = as.factor(cf4)[-i]
  # fit the first model without the ith observation
  J1 = lm(y ~ x_1 + x_2 + x_3)
  yhat1 = J1$coef[1] + J1$coef[2]*x1[i] + J1$coef[3]*x2[i] + J1$coef[4]*x3[i]
  # construct the ith part of the loss function for model 1
  e1[i] = yn[i] - yhat1
  # fit the second model without the ith observation
  J2 = lm(y ~ x_1 + x_2 + x_3 + x_4)
  yhat2 = J2$coef[1] + J2$coef[2]*x1[i] + J2$coef[3]*x2[i] + J2$coef[4]*x3[i] + J2$coef[5]*cf4[i]
  e2[i] = yn[i] - yhat2
}
sqrt(c(mean(e1^2),mean(e2^2))) # RMSE
cf4 is a variable corresponding to groups (obtained from mixtures). What
we want to demonstrate is that the cross-validated prediction error is
lower when we include this latent grouping variable. It works wonders
when the categorical variables are treated as 'numeric' variables,
though the OLS estimates are then obviously very different.
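In case it helps frame the question, the prediction step I am trying to get right can presumably be written with predict() on the held-out row, which applies the dummy coding internally. A sketch on simulated data (the data frame d, its columns, and the formula are illustrative, not my real data):

```r
set.seed(2)
n   <- 50
age <- runif(n, 20, 60)
sex <- factor(sample(c("F", "M"), n, replace = TRUE))
ed  <- factor(rep(1:4, length.out = n))  # balanced, so no level vanishes
d   <- data.frame(income = 10000 + 300 * age + rnorm(n, sd = 1000),
                  sex = sex, age = age, ed = ed)

e1 <- numeric(n)
for (i in 1:n) {
  # fit without the ith observation...
  fit <- lm(income ~ sex + age + ed, data = d[-i, ])
  # ...then predict it; predict() handles the factor coding itself,
  # so no coefficient-times-factor multiplication is needed
  yhat  <- predict(fit, newdata = d[i, ])
  e1[i] <- d$income[i] - yhat
}
rmse <- sqrt(mean(e1^2))
print(rmse)
```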
Thank you in advance for your views on the problem.
Best Regards,
julian