[R] predict "interval" for lmRob?

Wed Apr 8 20:28:14 CEST 2009

Hi Greg,

Thanks for your guidance. 

In this case, the evidence is that the primary subpopulation of the data, accounting for some 80% of the points, follows the standard statistical model in the sense that Rice uses the term.  By all accounts it may be normally distributed, and a Q-Q plot shows that a large portion of the primary subpopulation behaves that way, out to 2 theoretical quantiles. But, for the measurement ranges of interest, the complement of the "normal subpopulation", accounting for some 20% of the total two million data points, behaves in other ways which are, as a matter of fact, poorly understood.  That's not likely to change soon.

The choice of a robust regression framework and of "robust" (and possibly "quantreg", as Prof Koenker suggested) was simply to fit a line to the primary subpopulation automatically, without having to make arbitrary choices as to what to keep and what to discard. Also, use of a preexisting package was pursued simply to save time and work, and to have some conceptual framework within which to proceed other than just throwing least squares at arbitrarily chosen subsets.

It sounds to me like I might use the robust regression to decide what to discard and then apply standard linear "lm" to the remainder, minding the diagnostics. Should they prove favorable, I'll proceed with the result of "lm".
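A minimal sketch of that two-step idea, on simulated data rather than the real measurements. Everything here is an assumption made for illustration: MASS::rlm stands in for lmRob, the 0.5 weight cutoff is an arbitrary hypothetical choice, and the 80/20 mixture only mimics the description above.

```r
library(MASS)  # rlm() is a stand-in; the thread itself is about robust::lmRob

set.seed(1)
n <- 1000
x <- runif(n, 0, 10)

# Simulated data: roughly 80% normal errors, roughly 20% skewed errors.
primary <- rbinom(n, 1, 0.8) == 1
y <- 1 + 2 * x + ifelse(primary, rnorm(n), rexp(n, rate = 0.1))

# Step 1: robust fit; the final IWLS weights flag points the fit distrusts.
rfit <- rlm(y ~ x, method = "MM")
keep <- rfit$w > 0.5            # hypothetical cutoff, not a recommended value

# Step 2: ordinary least squares on the retained points only.
fit <- lm(y ~ x, subset = keep)
print(coef(fit))
print(table(keep))

# Diagnostics on the retained points would follow, e.g. plot(fit).
```

One caution with this approach: the standard errors from the second step take the retained subset as given, so they do not reflect the uncertainty of the discarding step itself.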

Thanks for pointing out the limitations of "robust" and its kin for me. 

BTW, if "robust" does not adopt a normal model for the y variable, what's the proper interpretation of the standard errors for slope and intercept it yields?  A reference?

 - Jan

-----Original Message-----
From: Greg Snow [mailto:Greg.Snow at imail.org] 
Sent: Wednesday, April 08, 2009 1:20 PM
To: Galkowski, Jan; r-help at r-project.org
Subject: RE: predict "interval" for lmRob?

Your problem is related to the theory underlying linear models (and is an example as to why it is important to understand the theory, not just know how to plug numbers into a computer).

The lm function is based on theory that assumes the y variable is normally distributed, with the mean of that normal based on the model and the x values.  This allows the predict function for lm to create prediction intervals based on the normal distribution, the predicted mean of that distribution, the estimated standard deviation, and the uncertainty in the predicted mean.  Note that if your y variable is not normally distributed, but the sample size is large enough for the Central Limit Theorem to hold, then the confidence intervals will be approximately correct, but the prediction intervals will probably not be.
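For illustration, here is a small sketch of the two kinds of interval that predict gives for an lm fit. The data are simulated with normal errors (an assumption made only for this example), and the variable names are made up.

```r
set.seed(42)
x <- runif(200, 0, 10)
y <- 1 + 2 * x + rnorm(200, sd = 1.5)   # simulated data with normal errors

fit <- lm(y ~ x)
new <- data.frame(x = c(2, 5, 8))

# Confidence interval: uncertainty in the predicted mean only.
conf_int <- predict(fit, newdata = new, interval = "confidence", level = 0.95)

# Prediction interval: uncertainty in the mean plus the residual spread
# of a single new observation, using the normal assumption.
pred_int <- predict(fit, newdata = new, interval = "prediction", level = 0.95)

print(conf_int)
print(pred_int)
```

The prediction band is always the wider of the two, because it adds the estimated residual variance on top of the uncertainty in the fitted mean.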

When you switch to a robust regression approach, the assumption is that the y variable is not normal, so a prediction interval based on the normal distribution does not make sense.  To get an appropriate prediction interval you need some information on what the distribution of the y values is (conditional on the model), but most robust techniques are not based on a specific distribution, just some properties of the distribution.  Without some information (or at least an assumption) on that distribution, the predict method cannot create prediction intervals.

I know that this does not answer your question, but hopefully it helps you to understand what is happening.  Think about what your actual scientific question is; it may be that you can answer the question without prediction intervals.

If you feel that you really need the prediction intervals, then you will need to do some additional background research into what distribution you think the data comes from, then you can proceed from there.  Some options include fitting a model based on that distribution, simulating data from the distribution given the model estimates and the uncertainty in those estimates, quantile regression, mixture of regressions, and others.
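As one rough illustration of working from the data's own distribution instead of a normal one, the sketch below builds an empirical band from the residual quantiles of a robust fit. This is not one of the options listed above and not anything lmRob provides; it is a simple stand-in that assumes the residual distribution has the same shape at every x and that, with many points, the uncertainty in the fitted line is negligible. MASS::rlm is used in place of lmRob, and the data are simulated.

```r
library(MASS)  # rlm() as a stand-in for a robust fit

set.seed(7)
n <- 2000
x <- runif(n, 0, 10)

# Simulated non-normal errors: mostly normal, with a skewed upper tail.
y <- 1 + 2 * x + ifelse(runif(n) < 0.8, rnorm(n), rexp(n, rate = 0.2))

rfit <- rlm(y ~ x, method = "MM")

# Empirical 2.5% and 97.5% quantiles of the residuals.
q <- quantile(resid(rfit), probs = c(0.025, 0.975))

# Shift the fitted line by those quantiles to get a rough 95% band.
new <- data.frame(x = c(2, 5, 8))
centre <- predict(rfit, newdata = new)
band <- cbind(fit = centre, lwr = centre + q[1], upr = centre + q[2])
print(band)
```

Unlike a normal-theory interval, this band is asymmetric about the fitted line when the residuals are skewed. If the spread or shape of the residuals changes with x, quantile regression would be the more appropriate tool.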

Hope this helps,

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
> project.org] On Behalf Of Galkowski, Jan
> Sent: Wednesday, April 08, 2009 9:32 AM
> To: r-help at r-project.org
> Subject: [R] predict "interval" for lmRob?
> 
> lm's "predict" function offers an "interval" parameter to choose
> between 'confidence' and 'prediction' bands. In the package "robust"
> and for "lmRob", there is also a "predict" but it lacks such a
> parameter, and the documented "type" parameter has only "response"
> offered.  Is there some way of obtaining prediction bands from lmRob?
> Is there an alternative robust (linear) regression package that offers
> such a capability?
> 
> Thanks for any and all help.
> 
>   - Jan Galkowski, Akamai Technologies, Cambridge, MA.