[R] prediction based on conditional logistic regression clogit
Charles Berry
ccberry at ucsd.edu
Tue Jun 17 02:48:28 CEST 2014
peter dalgaard <pdalgd <at> gmail.com> writes:
>
>
> On 16 Jun 2014, at 05:22 , array chip <arrayprofile <at> yahoo.com> wrote:
>
> > Hi, I am using clogit() from survival package to do conditional
> > logistic regression. I also need to make prediction on an
> > independent dataset to calculate predicted probability. Here is an
> > example:
[snip]
> > Can anyone suggest what's wrong here?
> >
> The direct cause is that clogit() works by using the fact that the
> likelihood is equivalent to a coxph() likelihood with stratification
> and all observation lengths set to 1. Therefore the analysis is
> formally on Surv(rep(1, 150L), status) and that goes belly-up if you
> apply the same formula to a data set of different length.
> However, notice that there is no such thing as predict.clogit(), so
> you are attempting predict.coxph() on a purely formal Cox model. It
> is unclear to what extent predicted values, in the sense of coxph()
> are compatible with predictions in conditional logit models.
>
> I'm rusty on this, but I think what you want is something like
>
> m <- model.matrix(~ x1 + x2 - 1, data=dat.test)
> pp <- exp(m %*% coef(fit))
> pps <- ave(pp, dat.test$set, FUN=sum)
> pp/pps
>
> i.e. the conditional probability that an observation is a case given
> covariates and that there is on case in each set (in the data given,
> you have sets of three with one being a case, so all predicted
> probabilities are close to 0.33). For more general matched sets, I'm
> not really sure what one wants. Real experts are welcome to chime
> in.
For the general situation of n cases in a stratum of size N, you want the
probability that the unit in question is one of n units drawn from a
stratum of size N without replacement with unequal probabilities of
selection over the units.
I am *not* an expert on that, but there is plenty written on it.
Horvitz, Daniel G., and Donovan J. Thompson. "A generalization of
sampling without replacement from a finite universe." Journal of the
American Statistical Association 47.260 (1952): 663-685.
is a place to start.
The probability in question is a sum over the
factorial(n)*choose(N-1,n-1))
elements corresponding to the number of samples (and orders) that
include a chosen element.
Of course, for n=1 there is just the one element, pp/pps.
HTH,
Chuck
More information about the R-help
mailing list