[R] question about linear regression and leverage

Bert Gunter gunter.berton at gene.com
Tue Jun 21 15:40:35 CEST 2011


You really really need to consult with a local statistician for help.
You are making a valiant effort, but it is clear that you have
insufficient background and experience. Get help from an expert if you
can. It is no dishonor, you will learn a lot, and you will avoid
incorrect conclusions.

Cheers,
Bert

On Tue, Jun 21, 2011 at 4:49 AM, George Markomanolis
<george at markomanolis.com> wrote:
> Dear David,
>
> Thanks for your answer. Yes, now that you mention it, these points are at
> the beginning of a variable's range. The residual plot seems to show
> non-constant variance, which a transformation resolves. I also
> checked for interactions by using the : operator between two
> variables, and the results did not change much. I work in
> computer science, but I wanted to do an analysis from
> scratch because some previous results that I have seen are not good for
> such cases. Moreover, the data are of course not the same.
>
> Thanks,
> George
>
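For readers following the thread, here is a minimal sketch of the two checks
George describes: comparing residual spread before and after a transformation,
and testing an interaction term written with the : operator. It uses simulated
data only. The variable names (x1, x2, y), the log transformation, and the
noise model are assumptions made for illustration and do not come from the
original analysis.

```r
# Simulated data only; x1, x2 and y are hypothetical names.
set.seed(2)
n  <- 500
x1 <- runif(n, 1, 10)
x2 <- runif(n, 1, 10)
# Multiplicative noise makes the spread grow with the mean on the raw scale.
y  <- exp(0.5 + 0.2 * x1 + 0.1 * x2 + rnorm(n, sd = 0.3))
dat <- data.frame(x1, x2, y)

fit_raw <- lm(y ~ x1 + x2, data = dat)               # untransformed response
fit_log <- lm(log(y) ~ x1 + x2, data = dat)          # log-transformed response
fit_int <- lm(log(y) ~ x1 + x2 + x1:x2, data = dat)  # adds the interaction

# Residuals vs fitted values: look for a funnel shape in each panel.
op <- par(mfrow = c(1, 2))
plot(fitted(fit_raw), resid(fit_raw), main = "raw y")
plot(fitted(fit_log), resid(fit_log), main = "log(y)")
par(op)

# Crude numeric summary of the same thing: does |residual| grow with the fit?
spread_raw <- cor(abs(resid(fit_raw)), fitted(fit_raw))
spread_log <- cor(abs(resid(fit_log)), fitted(fit_log))
print(c(raw = spread_raw, log = spread_log))

# F-test comparing the models with and without the x1:x2 interaction.
int_test <- anova(fit_log, fit_int)
print(int_test)
```

A transformation that removes the funnel shape in this sketch need not be the
right one for real data; the choice should still be guided by the subject
matter, as David points out below.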
> On 06/21/2011 01:08 PM, David Winsemius wrote:
>>
>> On Jun 21, 2011, at 3:49 AM, George Markomanolis wrote:
>>
>>> Dear all,
>>>
>>> I am new to this field and I have a question about linear regression.
>>> I have a dataset of around 31000 points and I want to fit a linear
>>> regression. The R-squared is 0.9; however, when I check the diagnostic
>>> plots I can see that there are around 250 points with high leverage.
>>> As far as I know, points with high leverage strongly influence the fit.
>>> If I remove these points in order to check their influence, the
>>> R-squared for the remaining points is 0.71. So I removed less than 1% of
>>> my data and the fit is not as good. Could you please give me any advice
>>> about this? Is it right to leave these 250 points in my dataset or not?
>>> Could I do something else? The data were measured in an experiment,
>>> so even these 250 points are real values.
>>
>> You could be looking at the descriptive statistics on the points.
>> Perhaps they are at one end of a variable range, or you perhaps have
>> some other feature that is scientifically interesting. So far you have
>> only been examining one set of simple linear hypotheses and have not
>> (presumably) been looking at any non-linear possibilities or the
>> potential that interactions are affecting the outcome. The prior
>> science of your (so far undescribed) domain should be carefully
>> considered, but in your message we see no evidence of such.
>>
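As a rough illustration of the leverage check in the original question and of
David's suggestion to look at where the points sit in the predictor range, the
R sketch below uses simulated data, not the poster's data. The variable names,
the sample sizes, and the 2 * p / n cutoff (a common rule of thumb, not a fixed
rule) are all assumptions.

```r
# Simulated data only; none of these names or values come from the thread.
set.seed(1)
n_main <- 2000
n_far  <- 20
x <- c(runif(n_main, 0, 1), runif(n_far, 8, 10))  # a few points far out in x
y <- 2 + 3 * x + rnorm(n_main + n_far, sd = 1)    # same linear relation everywhere
dat <- data.frame(x = x, y = y)

fit_all <- lm(y ~ x, data = dat)

# Leverage (hat values), flagged with the assumed 2 * p / n rule of thumb.
h    <- hatvalues(fit_all)
p    <- length(coef(fit_all))
high <- h > 2 * p / nrow(dat)

fit_trim <- lm(y ~ x, data = dat[!high, ])

# Compare coefficients and R-squared with and without the flagged points.
print(rbind(all = coef(fit_all), trimmed = coef(fit_trim)))
r2 <- c(all     = summary(fit_all)$r.squared,
        trimmed = summary(fit_trim)$r.squared)
print(r2)

# Where do the flagged points sit in the predictor range?
print(summary(dat$x[high]))

# Influence combines leverage with residual size; Cook's distance is one measure.
print(summary(cooks.distance(fit_all)[high]))
```

In this simulated setup the trimmed fit should show a clearly lower R-squared
even though both fits estimate nearly the same slope, because R-squared depends
on how widely the predictor is spread. A drop in R-squared after removing
high-leverage points therefore does not by itself show that those points were
distorting the fit. Whether that explanation applies to the real data is
exactly the kind of question the descriptive statistics are meant to answer.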
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 
"Men by nature long to get on to the ultimate truths, and will often
be impatient with elementary studies or fight shy of them. If it were
possible to reach the ultimate truths without the elementary studies
usually prefixed to them, these would not be preparatory studies but
superfluous diversions."

-- Maimonides (1135-1204)

Bert Gunter
Genentech Nonclinical Biostatistics


