[R] glm predict issue

Mon Dec 26 13:29:51 CET 2011

Hello,

I have read the documentation and searched online for an answer, but the results I found left me more confused than before.

My problem seems simple. I fit a glm model to data from a 2^k experiment, and I would now like to predict the response variable (Throughput) for factor levels that were not present in the fitted data.

When I try to predict I get the following error:
> throughput.pred <- predict(throughput.fit,experiments,type="response")
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : 
  factor 'No_databases' has new level(s) 200, 400, 600, 800, 1000

Of course these are new factor levels; that is exactly what I am trying to achieve, i.e. to extrapolate the values of Throughput.
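
A minimal sketch of how the mismatch could be checked, using made-up toy data rather than the real throughput data (toy, toy.fit, newdata, trained and unseen are hypothetical names chosen for illustration). It assumes the fitted object keeps its training levels in $xlevels, which is what the error message above refers to:

```r
# Toy data, invented for illustration only (not the throughput data).
set.seed(1)
toy <- data.frame(
  No_databases = factor(rep(c(1, 4), each = 10)),
  y            = rnorm(20)
)
toy.fit <- glm(y ~ No_databases, data = toy)

# Levels the model was trained on, stored on the fitted object.
trained <- toy.fit$xlevels$No_databases

# New data whose levels never appeared in the fit.
newdata <- data.frame(No_databases = factor(c(200, 400)))

# Levels in newdata for which the model has no coefficient.
unseen <- setdiff(levels(newdata$No_databases), trained)
print(unseen)

# predict() stops with an error for these levels; capture the message.
res <- tryCatch(predict(toy.fit, newdata, type = "response"),
                error = function(e) conditionMessage(e))
print(res)
```

A factor term has one coefficient per observed level, so there is nothing for predict() to apply to a level it has never seen.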

Can anyone please advise? I include all the details below.

Thanks in advance,
Best regards,
Giovanni

> # define the extreme (factors and levels)
> experiments <- expand.grid(No_databases   = seq(1000,100,by=-200),
+                            Partitioning   = c("sharding", "replication"),
+                            No_middlewares = seq(500,100,by=-100),
+                            Queue_size     = c(100))
> experiments$No_databases <- as.factor(experiments$No_databases)
> experiments$Partitioning <- as.factor(experiments$Partitioning)
> experiments$No_middlewares <- as.factor(experiments$No_middlewares)
> experiments$Queue_size <- as.factor(experiments$Queue_size)		   	  			   
> str(experiments)
'data.frame':	50 obs. of  4 variables:
 $ No_databases  : Factor w/ 5 levels "200","400","600",..: 5 4 3 2 1 5 4 3 2 1 ...
 $ Partitioning  : Factor w/ 2 levels "sharding","replication": 1 1 1 1 1 2 2 2 2 2 ...
 $ No_middlewares: Factor w/ 5 levels "100","200","300",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ Queue_size    : Factor w/ 1 level "100": 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "out.attrs")=List of 2
  ..$ dim     : Named int  5 2 5 1
  .. ..- attr(*, "names")= chr  "No_databases" "Partitioning" "No_middlewares" "Queue_size"
  ..$ dimnames:List of 4
  .. ..$ No_databases  : chr  "No_databases=1000" "No_databases= 800" "No_databases= 600" "No_databases= 400" ...
  .. ..$ Partitioning  : chr  "Partitioning=sharding" "Partitioning=replication"
  .. ..$ No_middlewares: chr  "No_middlewares=500" "No_middlewares=400" "No_middlewares=300" "No_middlewares=200" ...
  .. ..$ Queue_size    : chr "Queue_size=100"
> head(experiments)
  No_databases Partitioning No_middlewares Queue_size
1         1000     sharding            500        100
2          800     sharding            500        100
3          600     sharding            500        100
4          400     sharding            500        100
5          200     sharding            500        100
6         1000  replication            500        100
> # or
> throughput.fit <- glm(log(Throughput)~(No_databases*No_middlewares)+Partitioning+Queue_size,
+                       data=throughput)
> summary(throughput.fit)

Call:
glm(formula = log(Throughput) ~ (No_databases * No_middlewares) + 
    Partitioning + Queue_size, data = throughput)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.5966  -0.6612  -0.1944   0.5548   3.2136  

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    5.74701    0.09127  62.970  < 2e-16 ***
No_databases4                  0.43309    0.10985   3.943 8.66e-05 ***
No_middlewares2               -1.99374    0.11035 -18.067  < 2e-16 ***
No_middlewares4               -1.23004    0.10969 -11.214  < 2e-16 ***
Partitioningreplication        0.33291    0.06181   5.386 9.15e-08 ***
Queue_size100                  0.15850    0.06181   2.564   0.0105 *  
No_databases4:No_middlewares2  2.71525    0.15262  17.791  < 2e-16 ***
No_databases4:No_middlewares4  1.94191    0.15226  12.754  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for gaussian family taken to be 0.8921778)

    Null deviance: 2175.58  on 936  degrees of freedom
Residual deviance:  828.83  on 929  degrees of freedom
AIC: 2562.2

Number of Fisher Scoring iterations: 2

> throughput.pred <- predict(throughput.fit,experiments,type="response")
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : 
  factor 'No_databases' has new level(s) 200, 400, 600, 800, 1000
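
One possible way around this, sketched under assumptions: if the counts are kept as numeric predictors instead of being converted with as.factor(), the model estimates slopes and predict() can be evaluated at values that were never observed. The data below are simulated and the names toy, toy.fit, newdata, replicate_id and Throughput_hat are hypothetical; whether a linear trend on the log scale is reasonable for the real throughput data is not something the output above can confirm.

```r
# Simulated toy data, invented for illustration (not the real measurements).
set.seed(42)
toy <- expand.grid(No_databases   = c(1, 2, 4),
                   No_middlewares = c(1, 2, 4),
                   replicate_id   = 1:5)
toy$Throughput <- exp(1 + 0.3 * toy$No_databases +
                      0.2 * toy$No_middlewares +
                      rnorm(nrow(toy), sd = 0.1))

# Predictors stay numeric, so the fit estimates slopes, not per-level effects.
toy.fit <- glm(log(Throughput) ~ No_databases * No_middlewares, data = toy)

# New values outside the observed range; numeric as well, not factors.
newdata <- expand.grid(No_databases = c(6, 8), No_middlewares = c(1, 4))

# The modelled response is log(Throughput), so back-transform with exp().
newdata$Throughput_hat <- exp(predict(toy.fit, newdata, type = "response"))
print(newdata)
```

Note that type="response" returns values on the scale of the modelled response, which here is log(Throughput), hence the exp(). Predictions far outside the range covered by the fitted data should be treated with caution.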