[R] A question on Statistics regarding regression

Ben Bolker
Sat Aug 24 20:21:09 CEST 2024


   This is probably better suited to Cross Validated
[https://stats.stackexchange.com]. Surprisingly, I can't quickly find an 
answered question on this topic. My "tl;dr" answer would be: "inflated" 
relative to what? Having an unbalanced sample certainly decreases the 
*power* of an analysis, but there's nothing 'incorrect' (AFAICS) with 
the estimated SEs, and no reason to try to fix them.
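
The point above can be checked with a small simulation. What follows is
only a sketch under stated assumptions, not something posted in this
thread: it uses Python's standard library, a single 0/1 dummy as the only
predictor, normal errors with a common variance, and hypothetical values
for the sample size (200), effect size (0.5) and number of replicates
(2000). For that model the OLS slope is the difference in group means,
with standard error sigma * sqrt(1/n0 + 1/n1), so the code compares that
formula with the average model-based SE and with the empirical spread of
the estimates across replicates.

```python
import math
import random
import statistics


def simulate(n_total, frac_minor, n_reps=2000, sigma=1.0, effect=0.5, seed=1):
    """Simulate y = effect * group + noise for a two-level dummy.

    Returns (theoretical SE, mean model-based SE, empirical SD of estimates).
    All default values are illustrative assumptions.
    """
    rng = random.Random(seed)
    n1 = round(n_total * frac_minor)  # size of the smaller category
    n0 = n_total - n1                 # size of the larger category
    estimates, model_ses = [], []
    for _ in range(n_reps):
        y0 = [rng.gauss(0.0, sigma) for _ in range(n0)]
        y1 = [rng.gauss(effect, sigma) for _ in range(n1)]
        m0, m1 = statistics.fmean(y0), statistics.fmean(y1)
        # With one 0/1 dummy, the OLS slope is the difference in group means
        # and the residual variance is the pooled within-group variance.
        rss = sum((v - m0) ** 2 for v in y0) + sum((v - m1) ** 2 for v in y1)
        s2 = rss / (n_total - 2)
        estimates.append(m1 - m0)
        model_ses.append(math.sqrt(s2 * (1 / n0 + 1 / n1)))
    theory = sigma * math.sqrt(1 / n0 + 1 / n1)
    return theory, statistics.fmean(model_ses), statistics.stdev(estimates)


for frac in (0.5, 0.05):
    theory, mean_se, emp_sd = simulate(200, frac)
    print(f"minor fraction {frac:.2f}: formula SE {theory:.3f}, "
          f"mean model SE {mean_se:.3f}, empirical SD {emp_sd:.3f}")
```

With these settings the formula gives an SE for the 5%/95% split that is
about 2.3 times the SE for the 50/50 split, so the estimate is less
precise and the test less powerful. If the model-based SE also tracks the
empirical spread of the estimates in both designs, as it should under
these assumptions, then the reported SE is larger but not wrong. The
formula depends only on the realized group sizes, so the same reasoning
applies whether the imbalance arose by chance or reflects the population,
provided the observations within each group are otherwise representative.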

https://stats.stackexchange.com/questions/23108/unbalanced-design-effect

https://stats.stackexchange.com/questions/347050/unbalanced-sample-in-dummy-variable-for-ols-linear-regression

On 8/24/24 14:15, Jeff Newmiller via R-help wrote:
> You say you asked elsewhere, but so many hits come up when I just search for "unbalanced sample size" that your justification for not following the posting guide does not seem honest.
>
> I also recall that various discussions of statistical power address this in basic statistics.
>
> On August 24, 2024 11:05:12 AM PDT, Christofer Bogaso <bogaso.christofer using gmail.com> wrote:
>> Hi,
>>
>> I have asked this question elsewhere but failed to get any
>> response, so I am hoping to get some insight from experts and
>> statisticians here.
>>
>> Let's say we are fitting a regression equation where one explanatory
>> variable is categorical with 2 categories. However, in the sample, one
>> category has 95% of the values and the other has just 5%. That is, the
>> categories are highly unbalanced.
>>
>> Typically, the SE of the estimate may be inflated for such a highly
>> unbalanced categorical explanatory variable.
>>
>> Such an unbalanced case may arise in 2 scenarios: 1) there is a flaw in
>> the sample, or it is just by chance that the second category has just 5%
>> of the values in the sample, or 2) in the population itself, the second
>> category has a very small number of occurrences, which is reflected in
>> the sample.
>>
>> My question is: how would the SE be impacted in the above 2 cases? Will
>> the impact be the same, i.e. would we get an incorrect estimate of the
>> SE in both cases? If yes, is there any way to prove this analytically,
>> or maybe based on simulation?
>>
>> My apologies, as this question is not directly R related. However, I
>> just wanted to get some insight on the above problem related to
>> Statistics from some of the great Statisticians in this forum.
>> Thanks for your time.
>>


