[R] Prediction with multiple zeros in the dependent variable
Thomas Lumley
tlumley at u.washington.edu
Thu Sep 8 16:22:32 CEST 2005
On Thu, 8 Sep 2005, John Sorkin wrote:
> I have a batch of data in which each line contains three values:
> calcium score, age, and sex. I would like to predict calcium scores as a
> function of age and sex, i.e. calcium=f(age,sex). Unfortunately the
> calcium scores have a very "ugly distribution". There are multiple
> zeros, and multiple values between 300 and 600. There are no values
> between zero and 300. Needless to say, the calcium scores are not
> normally distributed, however, the values between 300 and 600 have a
> distribution that is log normal.
[Coronary artery calcium by EBCT, I presume]
Our approach to modelling calcium scores is to do it in two parts. First
fit something like a logistic regression model where the outcome is zero
vs non-zero calcium. Then, for the non-zero values, use something like a
linear regression model for log calcium.
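
As a rough illustration of the two-part idea, here is a minimal sketch in plain Python (standard library only). Everything in it is an assumption made for the example: the toy data are invented, age is the only predictor (sex is left out to keep it short), and the helper names fit_logistic and fit_ols are hypothetical. In practice you would use a proper regression routine and could put different predictors in each part.

```python
import math

# Hypothetical toy data, invented for illustration only (not from any study).
age = [45, 50, 52, 55, 58, 60, 62, 65, 68, 70, 72, 75]
calcium = [0, 0, 0, 320, 0, 0, 350, 410, 0, 450, 520, 580]

mean_age = sum(age) / len(age)
age_c = [v - mean_age for v in age]  # centred age keeps the Newton steps stable


def fit_logistic(x, y, iters=25):
    """Newton-Raphson fit of P(y = 1) = 1 / (1 + exp(-(b0 + b1 * x)))."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p            # score for the intercept
            g1 += (yi - p) * xi     # score for the slope
            h00 += w                # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system (information matrix) * step = score.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def fit_ols(x, y):
    """Least-squares line y = a + b * x; also returns the residual SD."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return intercept, slope, math.sqrt(rss / (n - 2))


# Part 1: zero vs non-zero calcium, using all observations.
nonzero = [1 if c > 0 else 0 for c in calcium]
b0, b1 = fit_logistic(age_c, nonzero)

# Part 2: log calcium, using only the non-zero observations.
pos = [(x, math.log(c)) for x, c in zip(age_c, calcium) if c > 0]
a0, a1, sigma = fit_ols([t[0] for t in pos], [t[1] for t in pos])


def prob_nonzero(a):
    """Estimated probability of a non-zero score at age a."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * (a - mean_age))))


def mean_log_calcium(a):
    """Estimated mean of log calcium at age a, given a non-zero score."""
    return a0 + a1 * (a - mean_age)


print(prob_nonzero(65), math.exp(mean_log_calcium(65)))
```

The two fits share nothing except the data, which is what lets each part have its own set of predictors.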
You could presumably use such a model for prediction or imputation too,
and you can work out means, medians etc from the two models.
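
To make "work out means, medians etc" concrete, here is a minimal sketch, assuming the non-zero scores are exactly log-normal, i.e. the log-scale residuals are normal with standard deviation sigma. Here p is the estimated probability of a non-zero score and mu the estimated mean of log calcium, both for one subject; the function names and the numbers in the usage lines are hypothetical.

```python
import math
from statistics import NormalDist


def two_part_mean(p, mu, sigma):
    """E[Y] when Y = 0 with probability 1 - p, else log Y ~ N(mu, sigma^2)."""
    return p * math.exp(mu + sigma ** 2 / 2)


def two_part_quantile(q, p, mu, sigma):
    """q-th quantile (0 < q < 1) of the zero / log-normal mixture."""
    if q <= 1 - p:
        # The point mass at zero already covers this quantile.
        return 0.0
    # Rescale q to a quantile level within the non-zero part.
    u = (q - (1 - p)) / p
    return math.exp(mu + sigma * NormalDist().inv_cdf(u))


# Hypothetical fitted values for one subject.
p, mu, sigma = 0.6, 6.0, 0.2
print(two_part_mean(p, mu, sigma))           # overall mean, zeros included
print(two_part_quantile(0.5, p, mu, sigma))  # overall median
```

Note that the median is exactly zero whenever the probability of a non-zero score is one half or less, so the mean and the median can tell quite different stories for the same subject.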
One particular reason for using this two-part model is that we find
different predictors of zero/non-zero and of amount. This makes biological
sense -- a factor that makes arterial plaques calcify might well have no
impact until you have arterial plaques.
Or you could use smooth quantile regression with rq() in the quantreg package.
-thomas