[R] Time and space considerations in using predict.glm.

Tue Aug 24 22:31:58 CEST 2010

On Aug 24, 2010, at 3:16 PM, Daniel Yarlett wrote:

> Hello,
>
> I am using R to train a logistic regression model and save the  
> resulting
> model to disk. I am then subsequently reloading these saved objects,  
> and
> using predict.glm on them in order to make predictions about single- 
> row data
> frames that are generated in real-time from requests arriving at an  
> HTTP
> server. The following code demonstrates the sort of R calls that I  
> have in
> mind:
>
>> cases <- 2000000
>> data <-
> data 
> .frame 
> (x1=runif(cases),x2=runif(cases),y=sample(0:1,cases,replace=TRUE))
>> lr1 <- glm(y~x1*x2,family=binomial,data=data)
>> new_data <- data.frame(x1=0,x2=0)
>> out <- predict(lr1,type="response",newdata=new_data)
>
> The first thing I am noticing is that the models that I am storing  
> are very
> large because I am using large data-sets, and the models seem to store
> residuals, fitted values and so on, by default.
>
>> object.size(lr1)
> 1056071320 bytes
>
> Access to all this information is not necessary for my application  
> -- all I
> really need is access to model$coefficients in order to make my  
> predictions
> -- so I am wondering if there is some way to prevent this  
> information from
> getting stored in the glm objects when they are created (or of  
> removing it
> after the models have been trained)? I have discovered the
> model=FALSE,x=FALSE,y=FALSE switches to glm() and these seem to help
> somewhat, but perhaps there is some other way of only recording the
> coefficients of the model and other minimal details?

Perhaps instead:
>
>> lr2 <-  
>> coef 
>> ( glm 
>> (y~x1*x2,family=binomial,data=data,model=FALSE,x=FALSE,y=FALSE) )

>> object.size(lr2)
>
Will be much smaller

>
> Secondly, on data-sets of the scale I am using, predict.glm seems to  
> be
> taking a very long time to make its predictions.
>
>> print(system.time(predict(lr1,type="response",newdata=new_data)))
>   user  system elapsed
>  0.136   0.040   0.175
>> print(system.time(predict(lr2,type="response",newdata=new_data)))
>   user  system elapsed
>  0.109   0.013   0.121
>
> This may be an issue of swap-time, and so it could potentially be  
> solved by
> addressing my first question above. However, given that I am  
> essentially
> asking R to compute
>
> 1 / (1 + exp(-(b0 + b1*x1 + b2*x2 + b3*x1*x2)))
>
> I can't see any reason why this request should be taking longer than a
> hundredth or a thousandth of a second, say.

You could try crossprod with a data.matrix and a matrix of coefficients.

> 1 / (1 + exp(-(crossprod(lr2, new_data)))

 > cases <- 2000
 > data <-  
data 
.frame(x1=runif(cases),x2=runif(cases),y=sample(0:1,cases,replace=TRUE))
 > lr1 <- coef(glm(y~x1*x2,family=binomial,data=data))
 > new_data <- matrix(c(1, x1=0,x2=0, x1x2=0), nrow=4)
# took me a while to figure out that I needed an interaction entry.
 > out <- 1 / (1 + exp(-(crossprod(new_data,lr1))))
 > out
           [,1]
[1,] 0.5107252
 > lr1
(Intercept)          x1          x2       x1:x2
  0.04290728 -0.16826991 -0.03561711  0.06229122
 > > object.size(lr1)
456 bytes

> Obviously R is providing a much
> greater level of functionality than I am requiring in this particular
> instance, so my overall question is what is the best way for me to  
> reduce
> the size of the data I have to store in my GLM models, and to  
> increase the
> speed at which I can use R to generate predictions of this sort  
> (i.e. for
> novel x1,x2 pairs)?
>
> I could obviously write a custom function / class which only stores  
> the
> model coefficients and computes predictions based on these using the
> equation above, but before I go down this route I wanted to get come  
> advice
> from the R community about whether there might be a better way to  
> address
> this problem and/or whether I have missed something obvious (to  
> others). I
> also want to avoid writing custom code if possible because that  
> obviously
> means sacrificing the great generality and power of R which could  
> clearly be
> useful in my application down the line.
>
> Many thanks in advance for your assistance,
>
> Dan.
>

David Winsemius, MD
West Hartford, CT