[R-sig-Geo] Alternate statistical test to linear regression?

rain1290 using aim.com
Wed Oct 23 22:52:34 CEST 2019


Hi Greg and others,

Thank you for these explanations and clarifications; they are much appreciated!
Indeed, some of my datasets are distinctly skewed. Simple scatter plots do show at least some linearity between my x and y variables (albeit weak, given how scattered the data points are). Is that sufficient to try simple linear regression? Also, if the data are heavily skewed, could transforming them (for example, logarithmically) justify using simple linear regression and/or correlation, provided the transformed data are only mildly skewed? I have large sample sizes for all of my datasets, and the variables are continuous.
That would pretty much cover all of my questions on this!
Thank you, once again, for your time!
-----Original Message-----
From: Greg Snow <538280 using gmail.com>
To: rain1290 <rain1290 using aim.com>
Cc: r-sig-geo <r-sig-geo using r-project.org>
Sent: Wed, Oct 23, 2019 3:49 pm
Subject: Re: [R-sig-Geo] Alternate statistical test to linear regression?

First, please expunge the "(N>30)" concept from your mind.  This is an
oversimplified rule of thumb used in introductory statistics courses
(I am guilty of doing this in intro stat as well, but I try to
emphasize to my students that it is only a rule of thumb for that
class and the truth is more complex once you are in the real world, so
consult with a statistician).  There is nothing magical about a sample
size of 30; I have seen cases where n=6 is large enough for the CLT
and cases where n=10,000 was not big enough.
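
To illustrate the point about sample size, here is a minimal simulation
sketch.  The two distributions (uniform and a lognormal with sdlog = 2),
the sample sizes, and the helper name `coverage` are assumptions chosen
for illustration; they are not from the thread.  The sketch estimates
how often a nominal 95% t-interval for the mean actually contains the
true mean.

```r
# Estimated coverage of the nominal 95% t-interval for the mean.
# Distributions, sample sizes, and `coverage` are illustrative assumptions.
set.seed(1)

coverage <- function(rdist, true_mean, n, reps = 2000) {
  hits <- replicate(reps, {
    ci <- t.test(rdist(n))$conf.int
    ci[1] <= true_mean && true_mean <= ci[2]
  })
  mean(hits)  # proportion of intervals that contain the true mean
}

# Symmetric, light-tailed data with a very small sample (n = 6)
cov_unif <- coverage(function(n) runif(n), true_mean = 0.5, n = 6)

# Heavily right-skewed data with n = 30; the mean of a lognormal with
# meanlog = 0 and sdlog = 2 is exp(2^2 / 2) = exp(2)
cov_lnorm <- coverage(function(n) rlnorm(n, meanlog = 0, sdlog = 2),
                      true_mean = exp(2), n = 30)

print(c(uniform_n6 = cov_unif, lognormal_n30 = cov_lnorm))
```

In a simulation like this, the uniform case typically lands close to
the nominal 95% even at n = 6, while the skewed lognormal case tends to
fall well short of it at n = 30.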

If the data are not overly skewed and your sample size is large, then
you can just use regression as is and the inference will be
approximately correct (with a really good approximation).  But with
skewness we often prefer the median over the mean.  Least squares
regression is equivalent to fitting a mean, while some of the robust
regression options are equivalent to fitting a median, so they may be
preferable on that count.
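
The mean-versus-median distinction can be sketched on simulated data.
Everything below (the data, the coefficients, the exponential errors) is
an assumption made up for illustration.  The median fit is done here by
minimizing the sum of absolute residuals with base R's `optim`, only to
keep the sketch free of extra packages; in practice a dedicated tool
such as `rq(y ~ x, tau = 0.5)` from the quantreg package would be the
more usual route.

```r
# Least squares (conditional mean) vs. least absolute deviations
# (conditional median) on simulated data with right-skewed errors.
# All data and parameter values are illustrative assumptions.
set.seed(42)
n <- 2000
x <- runif(n, 0, 10)
e <- rexp(n, rate = 0.5)   # skewed errors: mean 2, median 2 * log(2)
y <- 1 + 0.5 * x + e

# Ordinary least squares: estimates the conditional mean of y
ols <- lm(y ~ x)

# Median regression: minimize the sum of absolute residuals,
# starting the search from the least squares coefficients
lad_loss <- function(b) sum(abs(y - b[1] - b[2] * x))
lad <- optim(coef(ols), lad_loss)

print(rbind(ols = coef(ols), lad = lad$par))
```

Both fits should recover a similar slope, but the intercepts differ
because the mean of the skewed errors is larger than their median: the
least squares line tracks the conditional mean, the other the
conditional median.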

Note that Pearson's correlation does not test linearity, it assumes
linearity (and bivariate normality).  Most issues with regression will
be the same for the correlation.
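
As a rough illustration of why Pearson's correlation presumes linearity
rather than checking it, the sketch below uses simulated data with a
monotone but strongly curved relationship.  The data and the choice of
Spearman's rank correlation as the comparison are assumptions for
illustration, not something proposed in the thread.

```r
# Pearson vs. Spearman correlation on a monotone, non-linear relationship.
# The simulated data are an illustrative assumption.
set.seed(7)
x <- runif(300, 0, 5)
y <- exp(x) + rnorm(300, sd = 1)   # y increases with x, but not linearly

pearson  <- cor.test(x, y, method = "pearson")
spearman <- cor.test(x, y, method = "spearman")

print(c(pearson  = unname(pearson$estimate),
        spearman = unname(spearman$estimate)))
```

Pearson's coefficient summarizes only the linear part of the
association, so it understates how tightly y follows x here, and
nothing in its output flags the curvature.  A scatter plot or a
residual plot from `lm` is still needed to judge linearity.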

On Wed, Oct 23, 2019 at 11:25 AM <rain1290 using aim.com> wrote:
>
> Hi Greg and others,
>
> Thank you for your very informative response! I actually made a mistake in my initial message, in that I was actually testing for the y variable, not the x. I will also look into those packages on CRAN, but even if there is some skewness on the y, because my sample size is much larger than 30 (N>30), it might be safe to apply a linear regression analysis, if we can assume linearity?
>
> A useful alternative would be to use correlation coefficients to test the degree of association between the x and y variables; specifically, the Pearson correlation coefficient, since both x and y variables are quantitative. Does that make sense?
>
> Thanks again,
>
>
> -----Original Message-----
> From: Greg Snow <538280 using gmail.com>
> To: rain1290 <rain1290 using aim.com>
> Cc: r-sig-geo <r-sig-geo using r-project.org>
> Sent: Wed, Oct 23, 2019 1:00 pm
> Subject: Re: [R-sig-Geo] Alternate statistical test to linear regression?
>
> Note that the normality assumptions are about the residuals (or about
> y conditional on x), not on the x variable(s) or all of y
> (non-conditional).  If x is highly skewed and the residuals are normal
> then diagnostics just on y will also show skewness (if there is a
> relationship between x and y).
>
> Also, the normality assumptions are about the tests and confidence
> intervals, the least squares fit is legitimate (but possibly not the
> most interesting fit) whether the residuals are normal or not.  The
> Central Limit Theorem also applies in regression, so if the residuals
> are non-normal, but you have a large sample size then the tests and
> intervals will still be approximately correct (with the quality of the
> approximation depending on the degree of non-normality and sample
> size).
>
> There are many alternative tools.  There is a task view on CRAN for
> Robust Statistical Methods that gives summaries of many packages and
> tools for robust regression (and other things as well) which does not
> depend on the normality assumptions.
>
>
> On Wed, Oct 23, 2019 at 9:21 AM rain1290--- via R-sig-Geo
> <r-sig-geo using r-project.org> wrote:
> >
> > Greetings,
> > I am testing to see if linear relationships exist between my x and y variables. I conducted various diagnoses in R to test for normality of the x variable data by using qqnorm, qqline and histograms that show the distribution of the data. If the data is shown to be normally distributed in either normal quantile plots or in the histograms (i.e. a bell curve-shaped distribution), I would assume normality and apply the linear regression model, using "lm". However, in some cases, my distributions do not satisfy the normality criteria, and so I feel that using the linear regression model, in those cases, would not be appropriate. For that reason, would you be able to suggest an alternate test to the linear regression model in R? Maybe a non-parametric counterpart to it?
> > Thank you, and any help would be greatly appreciated!
> >
> > _______________________________________________
> > R-sig-Geo mailing list
> > R-sig-Geo using r-project.org
> > https://stat.ethz.ch/mailman/listinfo/r-sig-geo
>
>
>
>
> --
> Gregory (Greg) L. Snow Ph.D.
> 538280 using gmail.com



-- 
Gregory (Greg) L. Snow Ph.D.
538280 using gmail.com