[R] glm predict issue
Giovanni Azua
bravegag at gmail.com
Mon Dec 26 13:29:51 CET 2011
Hello,
I have read the documentation and searched online for an answer, but the results I found left me more confused than before.
My problem is apparently simple. I fit a glm model (2^k experiment), and then I would like to predict the response variable (Throughput) for unseen factor levels.
When I try to predict I get the following error:
> throughput.pred <- predict(throughput.fit,experiments,type="response")
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor 'No_databases' has new level(s) 200, 400, 600, 800, 1000
Of course these are new factor levels; that is exactly what I am trying to achieve, i.e. to extrapolate the values of Throughput.
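To make the mismatch concrete, here is a minimal sketch that reproduces the same kind of error with made-up data. Everything in it is hypothetical and not part of my real analysis: the data frame `toy`, its values, and the names `toy.fit`, `newdata`, `unseen` and `err`. It only assumes that the levels recorded in the fit (the `xlevels` component named in the error message) differ from the levels in the new data.

```r
# Hypothetical toy data; NOT the real 'throughput' data set.
set.seed(1)
toy <- expand.grid(No_databases = c(1, 4),
                   No_middlewares = c(1, 2, 4),
                   rep = 1:5)
toy$Throughput <- exp(5 + 0.1 * toy$No_databases + rnorm(nrow(toy), sd = 0.1))
toy$No_databases <- factor(toy$No_databases)
toy$No_middlewares <- factor(toy$No_middlewares)

toy.fit <- glm(log(Throughput) ~ No_databases * No_middlewares, data = toy)

# New data whose No_databases levels were never seen during fitting
newdata <- expand.grid(No_databases = factor(c(200, 400)),
                       No_middlewares = factor(c(1, 2)))

# Levels recorded in the fit versus levels present in the new data
toy.fit$xlevels$No_databases
unseen <- setdiff(levels(newdata$No_databases), toy.fit$xlevels$No_databases)

# predict() stops on the unseen levels; capture the message instead of aborting
err <- tryCatch(predict(toy.fit, newdata),
                error = function(e) conditionMessage(e))
```

Here `unseen` lists the levels that have no coefficient in the fit, which seems to be what `predict` is complaining about.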
Can anyone please advise? All the details are included below.
Thanks in advance,
Best regards,
Giovanni
> # define the extreme (factors and levels)
> experiments <- expand.grid(No_databases = seq(1000,100,by=-200),
+ Partitioning = c("sharding", "replication"),
+ No_middlewares = seq(500,100,by=-100),
+ Queue_size = c(100))
> experiments$No_databases <- as.factor(experiments$No_databases)
> experiments$Partitioning <- as.factor(experiments$Partitioning)
> experiments$No_middlewares <- as.factor(experiments$No_middlewares)
> experiments$Queue_size <- as.factor(experiments$Queue_size)
> str(experiments)
'data.frame': 50 obs. of 4 variables:
$ No_databases : Factor w/ 5 levels "200","400","600",..: 5 4 3 2 1 5 4 3 2 1 ...
$ Partitioning : Factor w/ 2 levels "sharding","replication": 1 1 1 1 1 2 2 2 2 2 ...
$ No_middlewares: Factor w/ 5 levels "100","200","300",..: 5 5 5 5 5 5 5 5 5 5 ...
$ Queue_size : Factor w/ 1 level "100": 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "out.attrs")=List of 2
..$ dim : Named int 5 2 5 1
.. ..- attr(*, "names")= chr "No_databases" "Partitioning" "No_middlewares" "Queue_size"
..$ dimnames:List of 4
.. ..$ No_databases : chr "No_databases=1000" "No_databases= 800" "No_databases= 600" "No_databases= 400" ...
.. ..$ Partitioning : chr "Partitioning=sharding" "Partitioning=replication"
.. ..$ No_middlewares: chr "No_middlewares=500" "No_middlewares=400" "No_middlewares=300" "No_middlewares=200" ...
.. ..$ Queue_size : chr "Queue_size=100"
> head(experiments)
  No_databases Partitioning No_middlewares Queue_size
1         1000     sharding            500        100
2          800     sharding            500        100
3          600     sharding            500        100
4          400     sharding            500        100
5          200     sharding            500        100
6         1000  replication            500        100
> # or
> throughput.fit <- glm(log(Throughput)~(No_databases*No_middlewares)+Partitioning+Queue_size,
+ data=throughput)
> summary(throughput.fit)
Call:
glm(formula = log(Throughput) ~ (No_databases * No_middlewares) +
Partitioning + Queue_size, data = throughput)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.5966  -0.6612  -0.1944   0.5548   3.2136
Coefficients:
                              Estimate Std. Error t value Pr(>|t|)
(Intercept)                    5.74701    0.09127  62.970  < 2e-16 ***
No_databases4                  0.43309    0.10985   3.943 8.66e-05 ***
No_middlewares2               -1.99374    0.11035 -18.067  < 2e-16 ***
No_middlewares4               -1.23004    0.10969 -11.214  < 2e-16 ***
Partitioningreplication        0.33291    0.06181   5.386 9.15e-08 ***
Queue_size100                  0.15850    0.06181   2.564   0.0105 *
No_databases4:No_middlewares2  2.71525    0.15262  17.791  < 2e-16 ***
No_databases4:No_middlewares4  1.94191    0.15226  12.754  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 0.8921778)
Null deviance: 2175.58 on 936 degrees of freedom
Residual deviance: 828.83 on 929 degrees of freedom
AIC: 2562.2
Number of Fisher Scoring iterations: 2
> throughput.pred <- predict(throughput.fit,experiments,type="response")
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor 'No_databases' has new level(s) 200, 400, 600, 800, 1000
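One direction I have considered, without knowing whether it is statistically sound, is to keep the predictor numeric so that the model estimates a slope rather than one coefficient per level. The sketch below uses made-up data; `toy`, `num.fit`, `newdata` and `pred` are hypothetical names, and the values are invented. It also assumes that, because the response in the formula is log(Throughput), the predictions have to be back-transformed with exp() to get Throughput itself.

```r
# Hypothetical toy data; NOT the real 'throughput' data set.
set.seed(1)
toy <- expand.grid(No_databases = c(1, 2, 4), rep = 1:10)
toy$Throughput <- exp(5 + 0.3 * toy$No_databases + rnorm(nrow(toy), sd = 0.1))

# No_databases stays numeric: a single slope, not one coefficient per level
num.fit <- glm(log(Throughput) ~ No_databases, data = toy)

# Values outside the fitted range are accepted for a numeric predictor
newdata <- data.frame(No_databases = c(8, 16))

# predict() returns values on the log scale here, so undo the log with exp()
pred <- exp(predict(num.fit, newdata, type = "response"))
```

This runs without the "new level(s)" error, but whether extrapolating that far beyond the measured settings gives meaningful numbers is a separate question that I cannot answer.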