[R] random truncation

Sat Jul 13 14:47:29 CEST 2019

On 2019-07-12 22:31, Abby Spurdle wrote:
> > The distribution of the randomly truncated variable has thus four
> > parameters: a, b, mu and sigma.  I was able to write down the likelihood
> > and attempted to maximise it
>
> I read the Wikipedia article more carefully.
> The formula is relatively simple, and is based on the application of 
> Bayes Theorem.
> If one doesn't want to work out the integral, numerical methods can be 
> used.
>
> However, the problem needs to be defined *precisely* first.
>

Correct:  In my case, I confess I hadn't thought this through completely 
before posting.  I tried Rseek, as Bert Gunter suggested.  That led me 
to the "truncreg" and "DTDA" packages, neither of which seemed to be 
what I wanted;  thanks to Bert, Rolf, and Abby for your comments.

       I'm observing Y[i] = (X[i]'b+e) given Y[i]>(z[i]'c+f) where e and 
f are normally distributed with standard deviations s and t, 
respectively, i = 1:n.  I want the total of all the Y's, including those 
truncated (and not observed).

       I think the likelihood is the product of the density of Y[i] 
given x[i] and given that Y[i] is actually observed.  The latter can be 
further written as follows:

             (x[i]'b+e)>(z[i]c+f)
iff
             (x[i]'b-z[i]'c)>(f-e),

Therefore, the probability that Y[i] is observed is

             Pr{Y[i] observed} = Phi((x[i]'b-z[i]'c)/sqrt(s^2+t^2))

where Phi is the cdf of the standard normal.

       And then the likelihood for observation i can be written as follows:

             f(y[i]|x[i], z[i], b, c, s, t) = phi((y[i]-x[i]'b)/s) / 
Phi((x[i]'b-z[i]'c)/sqrt(s^2+t^2).

       We may not be able to estimate "c" in this, because if one of the 
z[i]'s is nonzero, we can pick "c" so z[i]'c is Inf.  That makes the 
denominator 0 and the likelihood Inf.  (If all the z[i]'s are zero, we 
still cannot estimate "c".)  However, if "b" is estimable, ignoring the 
truncation, then we can estimate "b", "s" and "t" given "c".

       And then the desired total of all the Y's, observed and 
unobserved, would be the sum of y[i] divided by Pr{Y[i] observed}.

       This likelihood is simple enough, it can be easily programmed in 
R and maximized over variations in "b", "s" and "t" given "c".  I can 
get starting values for "b" and "s" from "lm", ignoring the truncation.  
And I can first fit the model assuming t = s, then test whether it's 
different using likelihood ratio.  And I can try to estimate "c", but I 
should probably use values I can estimate from other sources until I'm 
comfortable with the estimates I get for "b", "s" and "t" given an 
assumed value for "c".

       Comments?
       Thanks so much.
       Spencer Graves