[R] Help with predict.lm
(Ted Harding)
Ted.Harding at nessie.mcc.ac.uk
Tue Apr 19 23:27:01 CEST 2005
On 19-Apr-05 Ted Harding wrote:
> On 19-Apr-05 Mike White wrote:
>> Hi
>> I have measured the UV absorbance (abs) of 10 solutions
>> of a substance at known concentrations (conc) and have
>> used a linear model to plot a calibration graph with
>> confidence limits. I now want to predict the concentration
>> of solutions with UV absorbance results given in the
>> new.abs data.frame, however predict.lm only appears to work
>> for new "conc" variables not new "abs" variables.
>> [...]
>> conc<-seq(100, 280, 20) # mg/l
>> abs<-c(1.064, 1.177, 1.303, 1.414, 1.534, 1.642, 1.744,
>> 1.852, 1.936, 2.046) # absorbance units
>> lm.calibration<-lm(abs ~ conc)
>> pred.w.plim <- predict(lm.calibration, interval="prediction")
>> pred.w.clim <- predict(lm.calibration, interval="confidence")
>> matplot(conc, cbind(pred.w.clim, pred.w.plim[,-1]),
>> lty=c(1,2,2,3,3), type="l", ylab="abs", xlab= "conc mg/l")
>> points(conc, abs, pch=21, col="blue")
>>
>> new.abs<-data.frame(abs=c(1.251, 1.324, 1.452))
>> [...]
> The correct approach has in the past been the subject of
> at times quite controversial discussion, under the title
> indeed of "The Calibration Problem". Nowadays this problem
> would be approached by making the concentrations to be
> "predicted" additional unknown parameters, and evaluating
> likelihood ratios for possible values of these.
>
> I don't have time at the moment to go into this approach,
> but will try to write something later.
Here is the basic theory, and an associated method, for
the case of estimating one value of X ("conc") corresponding
to one given value of Y ("abs"), when data x1,...,xn and
y1,...,yn have been obtained (your data "conc" and "abs"
above, respectively).
The model

  y = a + b*x + e

is assumed, where the x-values are given (as is typical in
a calibration experiment -- e.g. measuring standard test
samples of known x-value), and the y-values are measured
according to the above, with errors e assumed to be
distributed as N(0,s^2).
For simplicity, centre the x and y observations at their
means, so that Sum[xi]=0 and Sum[yi]=0.
Let an observation Y of y be given; we want to infer the
corresponding value of X. (X is measured from the mean of
the xi, and Y from the mean of the yi.)
Let X play the role of an additional parameter in the
problem. Then the likelihood function for (a,b,s,X) is
proportional to

  (1/s^(n+1)) * exp( -( Sum[(y-a-b*x)^2] + (Y-a-b*X)^2 ) / (2*s^2) )
The MLEs a#, b#, s#, X# of a, b, s, X are then given by
  a# = 0
  b# = (Sum[x*y])/(Sum[x^2])
  X# = Y/b#
  (s#)^2 = (1/(n+1)) * Sum[(y - (b#)*x)^2]
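(Note, in passing, that b# is just the ordinary least-squares
slope of y on x, so a quick sanity check with the data above
-- a sketch only -- is

  x <- conc - mean(conc) ; y <- abs - mean(abs)  # centred data
  bh <- sum(x*y)/sum(x^2)       # identical to the slope from lm(y ~ x)
  (1.251 - mean(abs))/bh + mean(conc)  # X# back on the original scale

which reproduces the Xhat value 131.2771 found below.)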
Now suppose that X is set at a fixed value (again denoted by X).
Then the MLEs a~, b~, s~ of a, b and s are now given by

  a~ = (Y*Sum[x^2] - X*Sum[x*y])/D
  b~ = ((n+1)*Sum[x*y] + n*X*Y)/D
  (s~)^2 = (Sum[(y - a~ - (b~)*x)^2] + (Y - a~ - (b~)*X)^2)/(n+1)

where

  D = (n+1)*Sum[x^2] + n*X^2
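(These expressions can be checked numerically if you wish; a
sketch, using the data above and an arbitrary fixed X:

  x <- conc - mean(conc) ; y <- abs - mean(abs)
  n <- length(x)
  X <- 131.3813 - mean(conc) ; Y <- 1.251 - mean(abs)
  D  <- (n+1)*sum(x^2) + n*X^2
  at <- (Y*sum(x^2) - X*sum(x*y))/D
  bt <- ((n+1)*sum(x*y) + n*X*Y)/D
  # direct minimisation of the constrained sum of squares
  ss <- function(p) sum((y - p[1] - p[2]*x)^2) + (Y - p[1] - p[2]*X)^2
  optim(c(0, 0), ss)$par   # should agree with c(at, bt)

to within optim's tolerance.)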
The likelihood ratio (profile likelihood for X) of the hypothesis
that X has a fixed value (X) versus the hypothesis that X might
be anything is therefore

  ((s#)/(s~))^(n+1)

which depends only on

  R(X) = (s#)^2/(s~)^2

where s~ (but not s#) depends on X.

Now

  (n-2)*((s~)^2 - (s#)^2)/((s#)^2) = (n-2)*(1 - R(X))/R(X)

has the F distribution with 1 and (n-2) d.f. (the full model has
n+1 observations and the 3 parameters a, b, X, hence n-2 residual
degrees of freedom), quite independently of the true values of
a, b and s^2; large values of R(X) correspond to small values of
this "F ratio".
Hence the MLE of X is Y/b#, and a confidence set for X at
a given confidence level P0 is the set of all X such that

  (n-2)*(1 - R(X))/R(X) < qf(1-P0, 1, n-2)

(in the notation of R's F functions pf, qf, rf, etc.).
The following function implements the above expressions.
It is a very crude approach to solving the problem, and
I'm sure that a more thoughtful approach would lead more
directly to the answer, but at least it gets you there
eventually.
===========================================
R.calib <- function(x, y, X, Y){
  n <- length(x) ; mx <- mean(x) ; my <- mean(y)
  # centre everything at the means of the calibration data
  x <- (x - mx) ; y <- (y - my) ; X <- (X - mx) ; Y <- (Y - my)
  # unrestricted MLEs (X free): a# = 0, b# = LS slope, X# = Y/b#
  ah <- 0 ; bh <- (sum(x*y))/(sum(x^2)) ; Xh <- Y/bh
  sh2 <- (sum((y - ah - bh*x)^2))/(n+1)
  # MLEs with X held fixed at the supplied value
  D  <- (n+1)*sum(x^2) + n*X^2
  at <- (Y*sum(x^2) - X*sum(x*y))/D ; bt <- ((n+1)*sum(x*y) + n*X*Y)/D
  st2 <- (sum((y - at - bt*x)^2) + (Y - at - bt*X)^2)/(n+1)
  # likelihood ratio R(X), the F ratio, and its F(1, n-2) CDF value
  R <- sh2/st2
  F <- (n-2)*(1 - R)/R
  # undo the centring before reporting
  x <- x + mx ; y <- y + my
  X <- X + mx ; Y <- Y + my ; Xh <- Xh + mx
  PF <- pf(F, 1, (n-2))
  list(x=x, y=y, X=X, Y=Y, R=R, F=F, PF=PF,
       ahat=ah, bhat=bh, sh2=sh2,
       atil=at, btil=bt, st2=st2,
       Xhat=Xh)
}
============================================
Now let's take your original data and the first Y-value
in your list (namely Y = 1.251), and suppose you want
a 95% confidence interval for X. The X-value corresponding
to Y which you would get by regressing x (conc) on y (abs)
is X = 131.3813, so use this as a "starting value".
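(If you want to see where that number comes from, a one-line
sketch using the inverse regression is

  predict(lm(conc ~ abs), newdata=data.frame(abs=1.251))  # 131.3813

though this inverse estimate is only a convenient starting
point, not the calibration estimate itself.)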
So now run this function with x<-conc, y<-abs, and these values
of X and Y:
R.calib(x,y,131.3813,1.251)
You get a long list of stuff, amongst which
$PF
[1] 0.02711878
and
$Xhat
[1] 131.2771
So now you know that Xhat (the MLE of X for that Y) = 131.2771
and the F-ratio probability is 0.027...
You want to push $PF upwards till it reaches 0.05, so work
*outwards* in the X-value:
R.calib(x,y,131.4000,1.251)$PF
[1] 0.03198323
R.calib(x,y,131.4500,1.251)$PF
[1] 0.0449835
...
R.calib(x,y,131.4693,1.251)$PF
[1] 0.04999851
and you're there in that direction. Now go back to the MLE
and work out in the other direction:
R.calib(x,y,131.2771,1.251)$PF
[1] 1.987305e-06
R.calib(x,y,131.2000,1.251)$PF
[1] 0.02005908
R.calib(x,y,131.1000,1.251)$PF
[1] 0.04604698
...
R.calib(x,y,131.0847,1.251)$PF
[1] 0.0500181
and again you're there.
So now you have the MLE Xhat = 131.2771, and the two
limits of a 95% confidence interval (131.0847, 131.4693)
for X, corresponding to the given value 1.251 of Y.
I leave it to you (or any other interested parties)
to wrap this basic method in a proper routine which
will avoid the above groping (but at least it shows
explicitly what's going on).
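For instance, one possible wrapper is sketched below. It simply
automates the above search with uniroot; the name calib.ci, the
default level and the bracketing half-width `width` are my own
choices, and it assumes the crossing point lies within `width`
of Xhat.

===========================================
calib.ci <- function(x, y, Y, level=0.95, width=50){
  # starting value from the inverse regression of x on y
  X0 <- unname(predict(lm(x ~ y), newdata=data.frame(y=Y)))
  Xh <- R.calib(x, y, X0, Y)$Xhat
  # $PF rises from ~0 at Xhat towards 1 as X moves away from it,
  # so find where it crosses (1 - level) on each side of Xhat
  f <- function(X) R.calib(x, y, X, Y)$PF - (1 - level)
  lower <- uniroot(f, c(Xh - width, Xh))$root
  upper <- uniroot(f, c(Xh, Xh + width))$root
  c(Xhat=Xh, lower=lower, upper=upper)
}
===========================================

For Y = 1.251 this should reproduce the interval found by hand:

  calib.ci(conc, abs, 1.251)  # roughly (131.0847, 131.4693)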
As a refinement: in your original query you put up
three values of Y: 1.251, 1.324, 1.452.
Now you could, of course, simply apply the above
method to each one separately as above.
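For example, with the calib.ci sketch above:

  sapply(c(1.251, 1.324, 1.452), function(Y) calib.ci(conc, abs, Y))

giving a separate Xhat and interval for each Y.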
However, the same theoretical approach can be used
to obtain a joint confidence region for the three
corresponding X values, which is theoretically
more correct, since these are 3 extra parameters and
they will have a covariance structure. You would need
to work through the algebra for k such Y-values and the
corresponding X-values.
In practice this may not matter much -- probably not
at all for your data.
Note: the above analysis of the single-case situation
was exhibited in an address I gave to a modelling symposium
in 1985, subsequently published along with the other
presentations in a special number of The Statistician:

  Modelling: the classical approach.
  The Statistician (1986), vol. 35, pp. 115-134.
NB that the equation for alpha-tilde (corresponding to a~ above)
in that article has a misprint ("Y" missing before Sum(x^2)).
This analysis had something in common with work by Phil Brown:

  P.J. Brown (1982), Multivariate Calibration.
  JRSS Ser. B, vol. 44, pp. 287-321.

though Brown's approach did not start from the likelihood-ratio
principle.
Hoping this helps!
Best wishes,
Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 19-Apr-05 Time: 22:27:01
------------------------------ XFMail ------------------------------