[R] logistic regression: wls and unbalanced samples

Prof Brian Ripley ripley at stats.ox.ac.uk
Wed Apr 27 12:29:02 CEST 2011

```
On Wed, 27 Apr 2011, peter dalgaard wrote:

> On Apr 27, 2011, at 00:22 , Andre Guimaraes wrote:
>
>> Greetings from Rio de Janeiro, Brazil.
>>
>> I am looking for advice / references on binary logistic regression
>> with weighted least squares (using lrm & weights), in the following
>> context:
>>
>> 1) unbalanced sample (n0=10000, n1=700);
>> 2) sampling weights used to rebalance the sample (w0=1, w1=14.29); and
>> 3) after modelling, adjust the intercept in order to reflect the
>> expected % of 1’s in the population (e.g., circa 7%, as opposed to
>> 50%).
>
> ??
>
> If the proportion of 1 in the population is about 7%, how exactly is
> the sample "unbalanced"? I don't see a reason to use weights at all
> if the sample is representative of the population. The opposite
> situation, where the sample is balanced (e.g. case-control), the
> population not, and you are interested in the population values,
> _that_ might require weighting, with some care, because case
> weighting and sample weighting are two different things, so the s.e.
> will be wrong. That sort of stuff is handled by the survey package.
>
> However, what you seem to be doing is to create results for an
> artificial 50/50 population, then project back to the population you
> were sampling from all along. I don't think this makes sense at all.

There are circumstances where it might.  It is quite common in pattern
recognition for the proportions in the training set to not reflect the
population.  And if the misclassification costs are asymmetric, you
may want to weight the fit.

The case I encountered was SGA (small-for-gestational-age) births.  By
definition there are about 10% 'successes', but false negatives are
far more important than false positives (or one would simply predict
all births as normal).  This
means that you want accurate estimation of probabilities in the right
tail of the population distribution, and plug-in estimation of
logistic regression is biased.  One of many ways to reduce that bias
is to re-weight the training set so the estimated probabilities of
marginal cases are in the middle of the range.
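
As a sketch of the intercept adjustment raised in the original
question, and assuming the logistic model is correctly specified,
re-weighting changes only the prior odds.  A probability p_w(x)
estimated under the weights can then be mapped back to the population
on the logit scale (pi1, pi0, q1, q0 and p_w are names introduced here
for illustration):

    logit p(x) = logit p_w(x) + log(pi1/pi0) - log(q1/q0)

where pi1/pi0 is the population odds of a 'success' and q1/q0 is the
odds implied by the weighted training set.  With the figures quoted
above (n0=10000, n1=700, w0=1, w1=14.29), q1/q0 = 700*14.29/10000,
which is about 1, so log(q1/q0) is about 0.  For a population rate of
7% the shift is then log(0.07/0.93), roughly -2.59, applied to the
intercept only.  If the model is misspecified, the weighted and
unweighted fits can differ in the slopes as well, and this shift alone
need not recover the population fit.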

Note that logistic regression is not normally fitted by 'weighted
least squares' (not even by 'lrm' from some unstated package).

This is not a list for tutorials in advanced statistics, but one
reference is my Pattern Recognition and Neural Networks book.

>
> --
> Peter Dalgaard
> Center for Statistics, Copenhagen Business School
> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> Phone: (+45)38153501
> Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
>