[R] Help with predict.lm
(Ted Harding)
Ted.Harding at nessie.mcc.ac.uk
Tue Apr 19 23:27:01 CEST 2005
On 19-Apr-05 Ted Harding wrote:
> On 19-Apr-05 Mike White wrote:
>> Hi
>> I have measured the UV absorbance (abs) of 10 solutions
>> of a substance at known concentrations (conc) and have
>> used a linear model to plot a calibration graph with
>> confidence limits. I now want to predict the concentration
>> of solutions with UV absorbance results given in the
>> new.abs data.frame, however predict.lm only appears to work
>> for new "conc" variables not new "abs" variables.
>> [...]
>> conc<-seq(100, 280, 20) # mg/l
>> abs<-c(1.064, 1.177, 1.303, 1.414, 1.534, 1.642, 1.744,
>> 1.852, 1.936, 2.046) # absorbance units
>> lm.calibration<-lm(abs ~ conc)
>> pred.w.plim <- predict(lm.calibration, interval="prediction")
>> pred.w.clim <- predict(lm.calibration, interval="confidence")
>> matplot(conc, cbind(pred.w.clim, pred.w.plim[,-1]),
>> lty=c(1,2,2,3,3), type="l", ylab="abs", xlab= "conc mg/l")
>> points(conc, abs, pch=21, col="blue")
>>
>> new.abs<-data.frame(abs=c(1.251, 1.324, 1.452))
>> [...]
> The correct approach has in the past been the subject of
> at times quite controversial discussion, under the title
> indeed of "The Calibration Problem". Nowadays this problem
> would be approached by making the concentrations to be
> "predicted" additional unknown parameters, and evaluating
> likelihood ratios for possible values of these.
>
> I don't have time at the moment to go into this approach,
> but will try to write something later.
Here is the basic theory, and an associated method, for
the case of estimating one value of X ("conc") corresponding
to one given value of Y ("abs"), when data x1,...,xn and
y1,...,yn have been obtained (your data "conc" and "abs"
above, respectively).
The model

  y = a + b*x + e

is assumed, where the x-values are given (as is typical in
a calibration experiment -- e.g. measuring standard test
samples of known x-value), and the y-values are measured
according to the above, with errors e assumed to be
distributed as N(0,s^2).
For simplicity, centre the x and y observations at their
means, so that Sum[xi]=0 and Sum[yi]=0.
Let an observation Y of y be given; we want to infer the
corresponding value of X. (X is measured from the mean of
the xi, and Y from the mean of the yi.)
Let X play the role of an additional parameter in the
problem. Then the likelihood function for (a,b,s,X) is
proportional to

  (1/s^(n+1)) * exp( -( Sum[(y-a-b*x)^2] + (Y-a-b*X)^2 ) / (2*s^2) )
The MLEs a#, b#, s#, X# of a, b, s, X are then given by
  a# = 0
  b# = (Sum[x*y])/(Sum[x^2])
  X# = Y/b#
  (s#)^2 = (1/(n+1)) * Sum[(y - (b#)*x)^2]
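(Note, in passing, that b# is just the ordinary least-squares
slope of y on x, so a quick sanity check with the data above
-- a sketch only -- is

  x <- conc - mean(conc) ; y <- abs - mean(abs)  # centred data
  bh <- sum(x*y)/sum(x^2)       # identical to the slope from lm(y ~ x)
  (1.251 - mean(abs))/bh + mean(conc)  # X# back on the original scale

which reproduces the Xhat value 131.2771 found below.)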
Now suppose that X is set at a fixed value (again denoted by X).
Then the MLEs a~, b~, s~ of a, b and s are now given by

  a~ = (Y*Sum[x^2] - X*Sum[x*y])/D
  b~ = ((n+1)*Sum[x*y] + n*X*Y)/D
  (s~)^2 = (Sum[(y - a~ - (b~)*x)^2] + (Y - a~ - (b~)*X)^2)/(n+1)

where

  D = (n+1)*Sum[x^2] + n*X^2
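(These expressions can be checked numerically if you wish; a
sketch, using the data above and an arbitrary fixed X:

  x <- conc - mean(conc) ; y <- abs - mean(abs)
  n <- length(x)
  X <- 131.3813 - mean(conc) ; Y <- 1.251 - mean(abs)
  D  <- (n+1)*sum(x^2) + n*X^2
  at <- (Y*sum(x^2) - X*sum(x*y))/D
  bt <- ((n+1)*sum(x*y) + n*X*Y)/D
  # direct minimisation of the constrained sum of squares
  ss <- function(p) sum((y - p[1] - p[2]*x)^2) + (Y - p[1] - p[2]*X)^2
  optim(c(0, 0), ss)$par   # should agree with c(at, bt)

to within optim's tolerance.)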
The likelihood ratio (profile likelihood for X) of the hypothesis
that X has a fixed value (X) versus the hypothesis that X might
be anything is therefore

  ((s#)/(s~))^(n+1)

which depends only on

  R(X) = (s#)^2/(s~)^2

where s~ (but not s#) depends on X.

Now

  (n-2)*((s~)^2 - (s#)^2)/((s#)^2) = (n-2)*(1 - R(X))/R(X)

has the F distribution with 1 and (n-2) d.f. (the full model has
n+1 observations and the 3 parameters a, b, X, hence n-2 residual
degrees of freedom), quite independently of the true values of
a, b and s^2; large values of R(X) correspond to small values of
this "F ratio".
Hence the MLE of X is Y/b#, and a confidence set for X at
a given confidence level P0 is the set of all X such that

  (n-2)*(1 - R(X))/R(X) < qf(1-P0, 1, n-2)

(in the notation of R's F functions pf, qf, rf, etc.).
The following function implements the above expressions.
It is a very crude approach to solving the problem, and
I'm sure that a more thoughtful approach would lead more
directly to the answer, but at least it gets you there
eventually.
===========================================
R.calib <- function(x, y, X, Y){
  n <- length(x) ; mx <- mean(x) ; my <- mean(y)
  # centre everything at the means of the calibration data
  x <- (x - mx) ; y <- (y - my) ; X <- (X - mx) ; Y <- (Y - my)
  # unrestricted MLEs (X free): a# = 0, b# = LS slope, X# = Y/b#
  ah <- 0 ; bh <- (sum(x*y))/(sum(x^2)) ; Xh <- Y/bh
  sh2 <- (sum((y - ah - bh*x)^2))/(n+1)
  # MLEs with X held fixed at the supplied value
  D  <- (n+1)*sum(x^2) + n*X^2
  at <- (Y*sum(x^2) - X*sum(x*y))/D ; bt <- ((n+1)*sum(x*y) + n*X*Y)/D
  st2 <- (sum((y - at - bt*x)^2) + (Y - at - bt*X)^2)/(n+1)
  # likelihood ratio R(X), the F ratio, and its F(1, n-2) CDF value
  R <- sh2/st2
  F <- (n-2)*(1 - R)/R
  # undo the centring before reporting
  x <- x + mx ; y <- y + my
  X <- X + mx ; Y <- Y + my ; Xh <- Xh + mx
  PF <- pf(F, 1, (n-2))
  list(x=x, y=y, X=X, Y=Y, R=R, F=F, PF=PF,
       ahat=ah, bhat=bh, sh2=sh2,
       atil=at, btil=bt, st2=st2,
       Xhat=Xh)
}
============================================
Now let's take your original data and the first Y-value
in your list (namely Y = 1.251), and suppose you want
a 95% confidence interval for X. The X-value corresponding
to Y which you would get by regressing x (conc) on y (abs)
is X = 131.3813, so use this as a "starting value".
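(If you want to see where that number comes from, a one-line
sketch using the inverse regression is

  predict(lm(conc ~ abs), newdata=data.frame(abs=1.251))  # 131.3813

though this inverse estimate is only a convenient starting
point, not the calibration estimate itself.)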
So now run this function with x<-conc, y<-abs, and these values
of X and Y:
R.calib(x,y,131.3813,1.251)
You get a long list of stuff, amongst which
$PF
[1] 0.02711878
and
$Xhat
[1] 131.2771
So now you know that Xhat (the MLE of X for that Y) = 131.2771
and the F-ratio probability is 0.027...
You want to push $PF upwards till it reaches 0.05, so work
*outwards* in the X-value:
R.calib(x,y,131.4000,1.251)$PF
[1] 0.03198323
R.calib(x,y,131.4500,1.251)$PF
[1] 0.0449835
...
R.calib(x,y,131.4693,1.251)$PF
[1] 0.04999851
and you're there in that direction. Now go back to the MLE
and work out in the other direction:
R.calib(x,y,131.2771,1.251)$PF
[1] 1.987305e-06
R.calib(x,y,131.2000,1.251)$PF
[1] 0.02005908
R.calib(x,y,131.1000,1.251)$PF
[1] 0.04604698
...
R.calib(x,y,131.0847,1.251)$PF
[1] 0.0500181
and again you're there.
So now you have the MLE Xhat = 131.2771, and the two
limits of a 95% confidence interval (131.0847, 131.4693)
for X, corresponding to the given value 1.251 of Y.
I leave it to you (or any other interested parties)
to wrap this basic method in a proper routine which
will avoid the above groping (but at least it shows
explicitly what's going on).
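For instance, one possible wrapper is sketched below. It simply
automates the above search with uniroot; the name calib.ci, the
default level and the bracketing half-width `width` are my own
choices, and it assumes the crossing point lies within `width`
of Xhat.

===========================================
calib.ci <- function(x, y, Y, level=0.95, width=50){
  # starting value from the inverse regression of x on y
  X0 <- unname(predict(lm(x ~ y), newdata=data.frame(y=Y)))
  Xh <- R.calib(x, y, X0, Y)$Xhat
  # $PF rises from ~0 at Xhat towards 1 as X moves away from it,
  # so find where it crosses (1 - level) on each side of Xhat
  f <- function(X) R.calib(x, y, X, Y)$PF - (1 - level)
  lower <- uniroot(f, c(Xh - width, Xh))$root
  upper <- uniroot(f, c(Xh, Xh + width))$root
  c(Xhat=Xh, lower=lower, upper=upper)
}
===========================================

For Y = 1.251 this should reproduce the interval found by hand:

  calib.ci(conc, abs, 1.251)  # roughly (131.0847, 131.4693)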
As a refinement: in your original query you put up
three values of Y: 1.251, 1.324, 1.452.
Now you could, of course, simply apply the above
method to each one separately as above.
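For example, with the calib.ci sketch above:

  sapply(c(1.251, 1.324, 1.452), function(Y) calib.ci(conc, abs, Y))

giving a separate Xhat and interval for each Y.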
However, the same theoretical approach can be used
to obtain a joint confidence region for the three
corresponding X values, which is theoretically
more correct, since these are 3 extra parameters and
they will have a covariance structure. You would need
to work through the algebra for k such Y-values and the
corresponding X-values.
In practice this may not matter much -- probably not
at all for your data.
Note: the above analysis of the single-case situation
was exhibited in an address I gave to a modelling symposium
in 1985, subsequently published along with the other
presentations in a special number of The Statistician:

  Modelling: the classical approach.
  The Statistician (1986), vol. 35, pp. 115-134.
NB that the equation for alpha-tilde (corresponding to a~ above)
in that article has a misprint ("Y" missing before Sum(x^2)).
This analysis had something in common with work by Phil Brown:

  P.J. Brown (1982), Multivariate Calibration.
  JRSS Ser. B, vol. 44, pp. 287-321.

though Brown's approach did not start from the likelihood-ratio
principle.
Hoping this helps!
Best wishes,
Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 19-Apr-05 Time: 22:27:01
------------------------------ XFMail ------------------------------