[R] random truncation
Spencer Graves
@pencer@gr@ve@ @end|ng |rom e||ect|vede|en@e@org
Sun Jul 14 04:04:29 CEST 2019
PLEASE EXCUSE: This discussion has diverged from R into discussing the
precise assumptions seemingly descriptive of an application that drove
the initial post to this thread. A reply by Abby Spurdle seemed to me
to raise questions, whose answers may not be intelligible without
material snipped from Spurdle's reply. I wish to thank Spurdle from the
reply and apologize to those who feel this is an abuse of this list. I
trust that those in the latter category will please not bother to read
further. For anyone still interested in this problem, below please find
my earlier analysis with corrections that attempt to respond to
Spurdle's most recent concerns. Thanks, Spencer Graves
On 2019-07-12 22:31, Abby Spurdle wrote:
> > The distribution of the randomly truncated variable has thus four
> > parameters: a, b, mu and sigma. I was able to write down the likelihood
> > and attempted to maximise it
>
> I read the Wikipedia article more carefully.
> The formula is relatively simple, and is based on the application of
> Bayes Theorem.
> If one doesn't want to work out the integral, numerical methods can be
> used.
>
> However, the problem needs to be defined *precisely* first.
>
Correct: In my case, I confess I hadn't thought this through completely
before posting. I tried Rseek, as Bert Gunter suggested. That led me
to the "truncreg" and "DTDA" packages, neither of which seemed to be
what I wanted; thanks to Bert, Rolf, and Abby for your comments.
I'm observing a random variable Y[i] = (x[i]'b+e[i]) given
Y[i]>(z[i]'c+f[i]) where the tick mark (') denotes transpose of a
vector, and e and f are normally distributed with mean 0 and standard
deviations s and t, respectively, i = 1:n. Thus, Y[i] follows a
truncated normal distribution with mean x[i]'b and standard deviation s,
with the truncation condition being that Y[i]>(z[i]'c+f[i]).
I want the total of all the Y's from the untruncated
distribution, i.e., including those truncated (and not observed).
I think the likelihood is the product of the density of Y[i]
given x[i] and given that Y[i] is actually observed. By substituting
Y[i] = (x[i]'b+e[i]) into the truncation condition Y[i]>(z[i]'c+f[i]),
we get the following:
(x[i]'b+e)>(z[i]c+f).
This occurs if and only if:
(x[i]'b-z[i]'c)>(f-e),
Therefore, the probability that Y[i] is observed (and not truncated) is
Pr{Y[i] observed} = Phi((x[i]'b-z[i]'c)/sqrt(s^2+t^2))
where Phi is the cdf of the standard normal.
And then the likelihood for observation i can be written as follows:
f(y[i]|x[i], z[i], b, c, s, t) = phi((y[i]-x[i]'b)/s) /
Phi((x[i]'b-z[i]'c)/sqrt(s^2+t^2)).
We may not be able to estimate "c" in this, because if one of the
z[i]'s is nonzero, we can pick "c" so z[i]'c is Inf. That makes the
denominator 0 and the likelihood Inf. (If all the z[i]'s are zero, we
still cannot estimate "c".) However, if "b" is estimable, ignoring the
truncation, then we can estimate "b", "s" and "t" given "c".
And then the desired total of all the Y's, observed and
unobserved, would be the sum of y[i] divided by Pr{Y[i] observed}.
This likelihood is simple enough, it can be easily programmed in
R and maximized over variations in "b", "s" and "t" given "c". I can
get starting values for "b" and "s" from "lm", ignoring the truncation.
And I can first fit the model assuming t = s, then test whether it's
different using likelihood ratio. And I can try to estimate "c", but I
should probably use values I can estimate from other sources until I'm
comfortable with the estimates I get for "b", "s" and "t" given an
assumed value for "c".
Comments?
Thanks so much.
Spencer Graves
More information about the R-help
mailing list