[R] Regression model with proportional dependent variable

Tue Apr 12 09:43:51 CEST 2011

On Tue, 12 Apr 2011, peter dalgaard wrote:

>
> On Apr 12, 2011, at 08:45 , Achim Zeileis wrote:
>
>> On Mon, 11 Apr 2011, ty ty wrote:
>>
>>> Hello, dear experts. I don't have much experience in building
>>> regression models, so sorry if this is too simple and not a very
>>> interesting question.
>>> Currently I'm working on a model that has to predict the proportion of
>>> the debt returned by the debtor in some period of time. So the
>>> dependent variable can be any number between 0 and 1, with a very high
>>> probability of 0 (if there is no payment); if there are some
>>> payments it is very likely to be 1 (all debt paid), although it can be any
>>> number from 0 to 1.
>>> Not having much knowledge in this area I can't think about any
>>> appropriate model and wasn't able to find much on the Internet. Can
>>> anyone give me some ideas about possible models, any information
>>> on-line and some R functions and packages that can implement it.
>>> Thank you in advance for any help.
>>
>> Beta regression is one possibility to model proportions in the open unit interval (0, 1). It is available in R in the package "betareg":
>>
>>  http://CRAN.R-project.org/package=betareg
>>  http://www.jstatsoft.org/v34/i02/
>>
>> If 0 and 1 can occur, some authors have suggested to scale the response so that 0 and 1 are avoided. See the paper linked above for an example. If, however, there are many 0s and/or 1s, one might want to take a hurdle or inflation type approach. One such approach is implemented in the "gamlss" package:
>>
>>  http://CRAN.R-project.org/package=gamlss
>>  http://www.jstatsoft.org/v23/i07/
>>  http://www.gamlss.org/
>>
>> The hurdle approach can be implemented using separate building blocks.
>> First, a binary regression model that captures whether the dependent variable is greater than 0 (i.e., crosses the hurdle): glm(I(y > 0) ~ ..., family = binomial).
>> Second, a beta regression for only the observations in (0, 1) that crossed the hurdle: betareg(y ~ ..., subset = y > 0). A recent technical report introduces such a family of models along with many further techniques (specialized residuals and regression diagnostics) that are not yet available in R:
>>
>>  http://arxiv.org/abs/1103.2372
>
> Hmm, but this is actually 0-_and_-1 inflated, is it not?

That is also my understanding. But you could also set up two hurdles 
instead of just one.
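
A rough sketch of what two hurdles could look like, using simulated data. The data frame d, the covariate x, and all coefficient values are made up for illustration, and the way the three parts are combined into one fitted proportion is my own assumption rather than something taken from a package:

```r
library(betareg)  # assumed to be installed from CRAN

set.seed(1)
n <- 500
d <- data.frame(x = rnorm(n))                    # hypothetical covariate
paid <- rbinom(n, 1, plogis(-0.5 + d$x))         # any payment at all?
full <- rbinom(n, 1, plogis(0.2 + 0.5 * d$x))    # paid in full, given some payment
part <- rbeta(n, 2, 3)                           # share repaid when strictly in (0, 1)
d$y <- ifelse(paid == 0, 0, ifelse(full == 1, 1, part))

# Hurdle 1: is anything repaid?
m_any <- glm(I(y > 0) ~ x, family = binomial, data = d)
# Hurdle 2: among those who pay something, is the debt repaid in full?
m_full <- glm(I(y == 1) ~ x, family = binomial, data = d, subset = y > 0)
# Interior part: beta regression on partial repayments only
m_part <- betareg(y ~ x, data = d, subset = y > 0 & y < 1)

# Expected proportion:
# P(y > 0) * [P(y = 1 | y > 0) + P(y < 1 | y > 0) * E(y | 0 < y < 1)]
p_any  <- predict(m_any,  newdata = d, type = "response")
p_full <- predict(m_full, newdata = d, type = "response")
mu     <- predict(m_part, newdata = d, type = "response")
d$fitted <- p_any * (p_full + (1 - p_full) * mu)
```

Each of the three models can have its own set of regressors, which is the main attraction of this approach compared with a single censored regression.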

> Various versions of censored regression come to mind (like a 
> generalized tobit), but I don't know anything that is spot on.

With the tobit() function from "AER" -- a convenience interface to 
survreg() from "survival" -- you can set up such a doubly censored 
regression: tobit(y ~ ..., left = 0, right = 1).
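
For illustration, a minimal sketch on simulated data. The latent variable ystar, the covariate x, and the coefficient values are assumptions chosen only to produce a response that piles up at 0 and 1:

```r
library(AER)  # assumed to be installed from CRAN; loads "survival" as well

set.seed(2)
n <- 300
x <- rnorm(n)                                  # hypothetical covariate
ystar <- 0.5 + 0.6 * x + rnorm(n, sd = 0.5)    # latent, uncensored response
d <- data.frame(x = x, y = pmin(pmax(ystar, 0), 1))  # censor at 0 and 1

# Doubly censored Gaussian regression with limits 0 and 1
m_tobit <- tobit(y ~ x, left = 0, right = 1, data = d)
summary(m_tobit)
```

Note that the coefficients refer to the latent response, not to the observed proportion.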

> Doubly censored regression is not hard to set up using generic 
> likelihood methods, once you decide on the underlying distribution. 
> Obviously, a basic modelling decision is whether the same parameters 
> apply to the censoring process as to the continuous part.

Yes, this is one limitation. Other potential disadvantages may include 
that some link function should be employed and that the response is 
heteroskedastic. If these are an issue, it is typically more convenient to 
address them using the beta regression approach which encompasses a link 
function and a second regression equation for the precision.
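
As a sketch of how the two equations are specified in betareg(), again with simulated data: the covariates x and z and all coefficient values are hypothetical, and the response is drawn strictly inside (0, 1):

```r
library(betareg)  # assumed to be installed from CRAN

set.seed(3)
n <- 400
x <- rnorm(n)                      # hypothetical regressor for the mean
z <- rnorm(n)                      # hypothetical regressor for the precision
mu  <- plogis(0.3 + 0.5 * x)       # mean on the logit scale
phi <- exp(3 + 0.5 * z)            # precision on the log scale
d <- data.frame(x = x, z = z, y = rbeta(n, mu * phi, (1 - mu) * phi))

# Mean equation before "|", precision equation after it
m_beta <- betareg(y ~ x | z, data = d, link = "logit")
summary(m_beta)
```

With a precision equation present, betareg() uses a log link for the precision by default, so heteroskedasticity is modelled rather than ignored.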

> -- 
> Peter Dalgaard
> Center for Statistics, Copenhagen Business School
> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> Phone: (+45)38153501
> Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
>
>