[R] Fitting with error on data

(Ted Harding) ted.harding at wlandres.net
Fri Oct 1 15:11:02 CEST 2010


On 27-Sep-10 08:55:13, Maayt wrote:
> As this forum proved to be very helpful, I got another question...
> I'd like to fit data points on which I have an error, dx and dy,
> on each x and y. What would be the common procedure to fit this
> data by a linear model taking into account uncertainty on each point?
> Would weighting each point by 1/sqrt(dx^2+dy^2) (and taking dx and dy
> as relative errors) in a lm() fit do the job? I would like to
> propagate uncertainty of the points into the uncertainty of the fit,
> would that be the case?
> 
> Thanks for all the help
> -- 

It would seem that there has been no response yet to this query.

This type of problem falls under various headers, typically

[A] Fitting a linear functional relationship
[B] Regression with errors in both variables

For [A], it is envisaged that x and y are, in the real world,
related by an exact linear equation

  y = a + b*x  or  x = a' + b'*y  or  A*x + B*y = C

and that data (X1,X2,...), (Y1,Y2,...) are obtained by simultaneously
measuring the exact values (x1,x2,...), (y1,y2,...) where measurement
errors result in:

  Xi = xi + e.Xi   Yi = yi + e.Yi

where, for each i, e.Xi is (say) distributed as N(0,s.X^2) and
e.Yi as N(0,s.Y^2), where s.X and s.Y are the standard deviations
of the errors of measurement in X and Y.
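
To make the setup in [A] concrete, here is a minimal R sketch that
simulates data from this model. The values chosen for a, b, s.X, s.Y
and n are purely illustrative assumptions, not taken from the
original question.

```r
# Simulate data from the functional relationship y = a + b*x, with
# normally distributed measurement error in both coordinates.
# All numeric values below are illustrative assumptions.
set.seed(1)
n   <- 200
a   <- 1;   b   <- 2      # true intercept and slope
s.X <- 0.5; s.Y <- 1.0    # error standard deviations in X and Y

x <- seq(0, 10, length.out = n)   # exact (unobserved) x values
y <- a + b * x                    # exact (unobserved) y values

X <- x + rnorm(n, mean = 0, sd = s.X)   # Xi = xi + e.Xi
Y <- y + rnorm(n, mean = 0, sd = s.Y)   # Yi = yi + e.Yi

sim <- data.frame(X = X, Y = Y)   # what the experimenter actually sees
head(sim)
```

Only X and Y would be available in practice; x and y are kept here
just so that a fit can be compared with the truth.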

Then it is a question of estimating a and b from the data.
This can be done by Maximum Likelihood, which requires taking
as parameters not only a and b, and s.X and s.Y, but also the
unknown (only observed with error) exact values (x1,x2,...) and
(y1,y2,...).

This case will not fit into the standard lm() method of fitting.

For [B], whereas in standard regression it is taken that the
observed X values are used as they stand (i.e. taken as fixed),
here it is accepted that they too are subject to error (similar
to [A]). So, whereas (for given values of {Xi}, {Yi}) a standard
lm(Y ~ X) will give an answer, the X-values on which the result
depends will themselves be uncertain and this uncertainty has
to be taken into account, in the sense that it is uncertain which
values of X the response Y is being regressed on.
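
One well-known consequence of ignoring error in X can be seen in a
small simulation. The sketch below (all numeric values are
illustrative assumptions) compares lm() on the exact x with lm() on
the error-contaminated X; under this kind of error model the slope
from the noisy X is typically pulled towards zero.

```r
# Compare a regression on the exact x with one on the noisy X.
# True slope is 2; s.X = 2 is deliberately large to make the
# effect visible. All values are illustrative assumptions.
set.seed(42)
n   <- 500
s.X <- 2; s.Y <- 1
x <- seq(0, 10, length.out = n)        # exact x values
X <- x + rnorm(n, sd = s.X)            # observed with error
Y <- 1 + 2 * x + rnorm(n, sd = s.Y)    # observed response

slope.exact <- coef(lm(Y ~ x))[["x"]]  # uses the (normally unknown) exact x
slope.noisy <- coef(lm(Y ~ X))[["X"]]  # uses what was actually observed

c(exact = slope.exact, noisy = slope.noisy)
```

The standard errors reported by summary(lm(Y ~ X)) do not reflect
this effect, since lm() treats X as known exactly.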

The conceptual difference between [A] and [B] is that, in [A],
there is no "directional" aspect: x and y are simply being
considered as related by y = a + b*x, or x = a' + b'*y, with
no preference between either way of expressing it. The linear
relationship can be used for any appropriate purpose.

However, in [B] we are looking at a regression problem: y is
being regressed on x: lm(Y ~ X), and the primary purpose is
to predict the value of y that would result from a given value
of x. So it is "directional": x --> y. If we were interested
in predicting x from y, then we would do it the other way round:
lm(X ~ Y), so y --> x, and the respective coefficients of the two
different regression equations cannot be deduced from each other.
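
The directional point can be checked numerically. In the sketch
below (simulated data, all values illustrative assumptions) the
slope of lm(X ~ Y) is not the reciprocal of the slope of lm(Y ~ X);
for simple linear regression the product of the two slopes equals
the squared sample correlation, which is below 1 unless the points
lie exactly on a line.

```r
# Fit the regression in both directions and compare the slopes.
# Simulated data; all numeric values are illustrative assumptions.
set.seed(2)
n <- 300
x <- seq(0, 10, length.out = n)
X <- x + rnorm(n, sd = 1.5)
Y <- 1 + 2 * x + rnorm(n, sd = 3)

b.yx <- coef(lm(Y ~ X))[["X"]]   # slope for predicting y from x
b.xy <- coef(lm(X ~ Y))[["Y"]]   # slope for predicting x from y
r2   <- cor(X, Y)^2              # squared sample correlation

# Inverting the X ~ Y line does not recover the Y ~ X slope.
c(b.yx = b.yx, inverted.b.xy = 1 / b.xy, product = b.yx * b.xy, r2 = r2)
```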

So, in choosing between approach [A] and approach [B], you would
need to consider what you want to use the results for.

I think the Maximum Likelihood approach to [A] was first properly
considered by D.V. Lindley in 1947:

  D. V. Lindley.
  Regression lines and the linear functional relationship.
  Suppl. J. Roy. Statist. Soc., 9:218-244, 1947.

For this to work properly (i.e. be "consistent" in the technical
sense), you need to know the ratio of the two error standard
deviations (lambda = s.Y/s.X). From your statement of your problem,
it looks as though you would know this ratio.
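
When lambda is known, the Maximum Likelihood slope for the linear
functional relationship has a closed form, commonly known as Deming
regression. The sketch below is my own illustration of that formula
and is not claimed to follow Lindley's presentation; deming_fit is a
hypothetical helper name, the data are simulated, and all numeric
values are assumptions. Note that the formula uses the ratio of
error variances, i.e. lambda^2.

```r
# Closed-form fit of y = a + b*x with errors in both variables,
# given lambda = s.Y/s.X (ratio of error standard deviations).
# deming_fit is a hypothetical helper written for this illustration.
deming_fit <- function(X, Y, lambda) {
  delta <- lambda^2                      # ratio of error variances
  Sxx <- var(X); Syy <- var(Y); Sxy <- cov(X, Y)
  d <- Syy - delta * Sxx
  b <- (d + sqrt(d^2 + 4 * delta * Sxy^2)) / (2 * Sxy)
  a <- mean(Y) - b * mean(X)
  c(a = a, b = b)
}

# Simulated example: true a = 1, b = 2, s.X = 1, s.Y = 2.
set.seed(3)
n <- 500
x <- seq(0, 10, length.out = n)
X <- x + rnorm(n, sd = 1)
Y <- 1 + 2 * x + rnorm(n, sd = 2)

fit.dem <- deming_fit(X, Y, lambda = 2 / 1)
fit.ols <- coef(lm(Y ~ X))               # ordinary regression, for comparison
rbind(deming = fit.dem, ols = unname(fit.ols))
```

This gives point estimates only. Uncertainty in a and b would have
to be obtained separately, for example by bootstrapping the (X, Y)
pairs.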

The study of [B], regression with errors in both variables, goes
back a very long way, and many approaches have been considered.
These include several studies by J.B. Copas.

Neither [A] nor [B] is, in general, a straightforward problem!

A useful overview of approaches to both [A] and [B] can be found
in the freely downloadable:

  An historical overview of regression with errors in both variables.
  J.W. Gillard (Cardiff University)

http://www.cardiff.ac.uk/maths/resources/Gillard_Tech_Report.pdf

Now, as to what may be available in R:

I was a bit surprised to find that a full R site search on either of

  "linear functional relationship"
  "errors in both variables"

yielded nothing relevant. It may be that using different search
terms would find appropriate methods (such as considered by Gillard,
or the Lindley approach for [A]), but I'm having difficulty
thinking what such might be!

So I hope that R-help readers who have used R for this category
of problem can help!

Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <ted.harding at wlandres.net>
Fax-to-email: +44 (0)870 094 0861
Date: 01-Oct-10                                       Time: 14:10:54
------------------------------ XFMail ------------------------------


