[R] "logistic" + "neg binomial" + ...

Sat Sep 23 13:23:48 CEST 2006

On 23-Sep-06 Prof Brian Ripley wrote:
> On Fri, 22 Sep 2006, Ted.Harding at nessie.mcc.ac.uk wrote:
> 
>> Hi Folks,
>>
>> I've just come across a kind of problem which leads
>> me to wonder how to approach it in R.
>>
>> Basically, each a set of items is subjected to a series
>> of "impacts" until it eventually "fails". The "force"
>> of each impact would depend on  covariates X,Y say;
>> but as a result of preceding impacts an item would be
>> expected to have a "cumulative frailty" such that the
>> probability of failure due to a particular impact would
>> possibly increase according to the series of impacts
>> already survived.
> 
> So this is a discrete-time survival model.

Essentially, yes. Though without "cumulative frailty" one
need not take into account that each item is repeatedly
"hit" until it "breaks" -- it would be the same as if
each item was discarded after impact, being replaced by
an identical one.

>> Without the "cumulative frailty" one could envisage
>> something like a logistic model for the probabiliy
>> of failure at each impact, leading to a kind of
>> generalised "exponential distribution" -- that is,
>> the likelihood for each item would be of the form
>>
>>  (1-P[1])*(1-P[2])*...*(1-P[n-1])*P[n]
>>
>> where P[i] could have a logistic model in terms of
>> the values of X[i] and Y[i], and n is the index of
>> the impact at which failure occurred. That is then
>> a solvable problem.
>>
>> Even so, I'm not (so far) finding in the R resources
>> the appropriate analogue of glm for this kind of
>> model. I dare say a prolonged trawl through the various
>> "survival" resources might lead to something applicable,
>> but ...
> 
> What is inadequate about glm itself?

My concern about using glm is that it does not have an option
like "family=negbinomial", so that binary data are treated
as if binomially distributed. For example, consider the two
cases (without covariates):

A: 0 0 0 0 0 0 0 0 0 1
B: 0 0 0 0 0 0 0 0 0 1

where A results from a fixed number of 10 trials, of which
one results in "1", while B results from repeating trials
until a "1" is obtained, as it happens on the 10th trial.

Now, granted that both A and B have the same likelihood
function for p (prob of "1"), namely p*(1-p)^9, so the same
MLE of p would result, namely p=1/10. However, they have
quite different sampling properties. If, with

  y <- cbind(c(0,0,0,0,0,0,0,0,0,1),c(1,1,1,1,1,1,1,1,1,0))

you fit y~1, the estimate returned by glm is the Intercept
which is the estimated value of t = log(p/(1-p)):

  summary(glm(y~1,family=binomial))
  [...]
  Coefficients:
              Estimate Std. Error z value Pr(>|z|)  
  (Intercept)   -2.197      1.054  -2.085   0.0371 *

and indeed, with t=Intercept, you find that exp(t)/(1+exp(t)) = 0.1

However, I would not trust the "Std. Error" from this output!
It is computed on the basis that the 0/1 data are binomial with
fixed size n=10, whereas I would want a "Std. Error" computed
on the basis that the random variable is n, for fixed r=1.
(Indeed, in both cases the "Std. Error" is not quite what it seems,
since there is positive probability in case A for both t = -inf
and t = +inf when estimated, and in case B positive probability
for t = +inf).

Although in the simple cases of A and B one can work directly
with p, avoiding such problems, when one introduces covariates
for each trial and it is natural to postulate a logistic model
for P[i] in the i-th trial, then it would be very convenient
to use the 'glm' mechanism for this. Again, the same estimated
values of the coefficients would result from maximising the
likelihood function whether the data are generated as in A or
as in B, for the same reasons (and even more strongly) I would
not trust the SEs of the coefficients as output from glm.

> The log-likelihood is a sum of terms over impacts, so fitting
> logisitic models for each impact can be done separately for
> each model.

I'm not sure how to interpret this. I'm envisaging only one
model, say logit(P[i]) = a + b*X[i] + c*Y[i] for the i-th
impact; "each impact" is a single event so I woudn't envisage
fitting any model for this single event.

> However, nnet() can fit them simultaneously (and couple
> them if you want).
> 
>> And then there's the cumulative frailty ... !
> 
> Add the history to that point into the model.

I'll look futher into those suggestions.

Thanks,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 23-Sep-06                                       Time: 12:23:44
------------------------------ XFMail ------------------------------