[R] Time and space considerations in using predict.glm.
David Winsemius
dwinsemius at comcast.net
Tue Aug 24 22:31:58 CEST 2010
On Aug 24, 2010, at 3:16 PM, Daniel Yarlett wrote:
> Hello,
>
> I am using R to train a logistic regression model and save the resulting
> model to disk. I am then subsequently reloading these saved objects, and
> using predict.glm on them in order to make predictions about single-row
> data frames that are generated in real-time from requests arriving at an
> HTTP server. The following code demonstrates the sort of R calls that I
> have in mind:
>
>> cases <- 2000000
>> data <- data.frame(x1=runif(cases),x2=runif(cases),y=sample(0:1,cases,replace=TRUE))
>> lr1 <- glm(y~x1*x2,family=binomial,data=data)
>> new_data <- data.frame(x1=0,x2=0)
>> out <- predict(lr1,type="response",newdata=new_data)
>
> The first thing I am noticing is that the models that I am storing are
> very large because I am using large data-sets, and the models seem to
> store residuals, fitted values and so on, by default.
>
>> object.size(lr1)
> 1056071320 bytes
>
> Access to all this information is not necessary for my application -- all
> I really need is access to model$coefficients in order to make my
> predictions -- so I am wondering if there is some way to prevent this
> information from getting stored in the glm objects when they are created
> (or of removing it after the models have been trained)? I have discovered
> the model=FALSE,x=FALSE,y=FALSE switches to glm() and these seem to help
> somewhat, but perhaps there is some other way of only recording the
> coefficients of the model and other minimal details?
Perhaps instead:
>
>> lr2 <- coef(glm(y~x1*x2,family=binomial,data=data,model=FALSE,x=FALSE,y=FALSE))
>> object.size(lr2)
>
That will be much smaller, since only the coefficient vector is kept.
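
If you would rather keep a glm object that predict() still accepts, another option is to drop the large per-observation components after fitting. This is only a sketch: the helper name strip_glm is made up for illustration, and the list of components that predict.glm can do without (when newdata is supplied and se.fit is FALSE) is an assumption that should be checked on your own R version.

```r
set.seed(1)
cases <- 2000
data <- data.frame(x1 = runif(cases), x2 = runif(cases),
                   y = sample(0:1, cases, replace = TRUE))
fit <- glm(y ~ x1 * x2, family = binomial, data = data)

# Hypothetical helper: remove components that grow with the number of rows.
# Assumption: predict(..., newdata = , se.fit = FALSE) only needs the
# coefficients, terms, family, rank and the pivot stored in fit$qr.
strip_glm <- function(fit) {
  for (nm in c("model", "data", "y", "residuals", "fitted.values",
               "linear.predictors", "weights", "prior.weights", "effects")) {
    fit[[nm]] <- NULL
  }
  fit$qr$qr <- NULL   # keep fit$qr$pivot, drop the n-by-p matrix
  fit
}

slim <- strip_glm(fit)
new_data <- data.frame(x1 = 0, x2 = 0)
p_full <- predict(fit,  newdata = new_data, type = "response")
p_slim <- predict(slim, newdata = new_data, type = "response")

all.equal(p_full, p_slim)   # compare predictions from both objects
object.size(fit)            # size before stripping
object.size(slim)           # size after stripping
```

Anything that relies on the removed pieces (summary(), residuals(), standard errors from predict()) should not be expected to work on the stripped object.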
>
> Secondly, on data-sets of the scale I am using, predict.glm seems to be
> taking a very long time to make its predictions.
>
>> print(system.time(predict(lr1,type="response",newdata=new_data)))
> user system elapsed
> 0.136 0.040 0.175
>> print(system.time(predict(lr2,type="response",newdata=new_data)))
> user system elapsed
> 0.109 0.013 0.121
>
> This may be an issue of swap-time, and so it could potentially be solved
> by addressing my first question above. However, given that I am
> essentially asking R to compute
>
> 1 / (1 + exp(-(b0 + b1*x1 + b2*x2 + b3*x1*x2)))
>
> I can't see any reason why this request should be taking longer than a
> hundredth or a thousandth of a second, say.
You could try crossprod with a data.matrix and a matrix of coefficients:
> 1 / (1 + exp(-(crossprod(lr2, new_data))))
> cases <- 2000
> data <- data.frame(x1=runif(cases),x2=runif(cases),y=sample(0:1,cases,replace=TRUE))
> lr1 <- coef(glm(y~x1*x2,family=binomial,data=data))
> new_data <- matrix(c(1, x1=0,x2=0, x1x2=0), nrow=4)
# took me a while to figure out that I needed an interaction entry.
> out <- 1 / (1 + exp(-(crossprod(new_data,lr1))))
> out
          [,1]
[1,] 0.5107252
> lr1
(Intercept)          x1          x2       x1:x2
 0.04290728 -0.16826991 -0.03561711  0.06229122
> object.size(lr1)
456 bytes
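
To avoid building the interaction entry by hand, the same calculation can be wrapped in a small function that lets model.matrix construct the row from the right-hand side of the formula. This is a sketch rather than a drop-in replacement for predict.glm: predict_from_coef is a made-up name, and it assumes numeric predictors and the logit link used above.

```r
set.seed(42)
cases <- 2000
data <- data.frame(x1 = runif(cases), x2 = runif(cases),
                   y = sample(0:1, cases, replace = TRUE))
fit  <- glm(y ~ x1 * x2, family = binomial, data = data)
beta <- coef(fit)   # named vector: (Intercept), x1, x2, x1:x2

# Hypothetical helper: build the design matrix from the right-hand side of
# the formula, then apply the inverse logit (plogis) to X %*% beta.
predict_from_coef <- function(beta, newdata, rhs = ~ x1 * x2) {
  X <- model.matrix(rhs, data = newdata)   # adds intercept and x1:x2 columns
  drop(plogis(X %*% beta[colnames(X)]))    # match coefficients by name
}

new_data <- data.frame(x1 = c(0, 0.3), x2 = c(0, 0.8))
p_fast <- predict_from_coef(beta, new_data)
p_glm  <- predict(fit, newdata = new_data, type = "response")

all.equal(unname(p_fast), unname(p_glm))   # agreement with predict.glm
```

Matching coefficients by column name rather than by position guards against the design matrix and the coefficient vector being ordered differently.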
> Obviously R is providing a much greater level of functionality than I am
> requiring in this particular instance, so my overall question is what is
> the best way for me to reduce the size of the data I have to store in my
> GLM models, and to increase the speed at which I can use R to generate
> predictions of this sort (i.e. for novel x1,x2 pairs)?
>
> I could obviously write a custom function / class which only stores the
> model coefficients and computes predictions based on these using the
> equation above, but before I go down this route I wanted to get some
> advice from the R community about whether there might be a better way to
> address this problem and/or whether I have missed something obvious (to
> others). I also want to avoid writing custom code if possible because
> that obviously means sacrificing the great generality and power of R
> which could clearly be useful in my application down the line.
>
> Many thanks in advance for your assistance,
>
> Dan.
>
David Winsemius, MD
West Hartford, CT