[Rd] Discourage the weights= option of lm with summarized data
peter dalgaard
pdalgd at gmail.com
Tue Nov 28 13:01:24 CET 2017
My local R-devel version now has (in ?lm):
Non-‘NULL’ ‘weights’ can be used to indicate that different
observations have different variances (with the values in
‘weights’ being inversely proportional to the variances); or
equivalently, when the elements of ‘weights’ are positive integers
w_i, that each response y_i is the mean of w_i unit-weight
observations (including the case that there are w_i observations
equal to y_i and the data have been summarized). However, in the
latter case, notice that within-group variation is not used.
Therefore, the sigma estimate and residual degrees of freedom may
be suboptimal; in the case of replication weights, even wrong.
Hence, standard errors and analysis of variance tables should be
treated with care.
OK?
-pd
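A minimal sketch in R of what the proposed text describes, reusing the small data set from the original post quoted below. With replication weights the coefficients match the fit on the expanded data, but the residual degrees of freedom, and with them sigma and the standard errors, do not. The object names (fit_full, fit_wt, se_rescaled) are illustrative only, and the final rescaling is an assumption that applies to pure replication weights, not to variance weights in general.

```r
# Summarized data: the last row stands for n identical observations.
n <- 10
x <- c(1, 2, 3, 4)
y <- c(1, 2, 5, 4)
w <- c(1, 1, 1, n)

# Expanded data: one row per original observation.
xb <- rep(x, times = w)
yb <- rep(y, times = w)

fit_full <- lm(yb ~ xb)             # fit on the expanded data
fit_wt   <- lm(y ~ x, weights = w)  # fit on the summarized data

# The coefficient estimates agree.
same_coef <- isTRUE(all.equal(unname(coef(fit_full)), unname(coef(fit_wt))))

# The residual degrees of freedom do not: 13 - 2 versus 4 - 2.
df_full <- df.residual(fit_full)
df_wt   <- df.residual(fit_wt)

# The weighted residual sum of squares equals the unweighted one
# from the expanded data, so only the divisor differs in sigma^2.
rss_full <- sum(residuals(fit_full)^2)
rss_wt   <- sum(w * residuals(fit_wt)^2)

# Standard errors as reported by summary.lm().
se_full <- coef(summary(fit_full))[, "Std. Error"]
se_wt   <- coef(summary(fit_wt))[, "Std. Error"]

# For replication weights, correcting the divisor recovers the
# standard errors of the expanded-data fit.
se_rescaled <- se_wt * sqrt(df_wt / df_full)

print(rbind(expanded = unname(se_full),
            weighted = unname(se_wt),
            rescaled = unname(se_rescaled)))
```

The rescaling works here only because every weight is a genuine replication count; when the weights express unequal variances, the standard errors from the weighted fit are the appropriate ones and no such correction applies.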
> On 12 Oct 2017, at 13:48 , Arie ten Cate <arietencate at gmail.com> wrote:
>
> OK. We now have three suggestions to repair the text:
> - remove the text
> - add "not" at the beginning of the text
> - add at the end of the text a warning; something like:
>
> "Note that in this case the standard errors of the parameter estimates
> are in general not correct, and hence neither are the t values and the
> p values. Also the number of degrees of freedom is not correct. (The
> parameter estimates themselves are correct.)"
>
> A remark about the glm example: the Reference manual says: "For a
> binomial GLM prior weights are used to give the number of trials when
> the response is the proportion of successes ....". Hence in the
> binomial case the weights are frequencies.
> With y <- 0.51 and w <- 100 you get the same result.
>
> Arie
>
> On Mon, Oct 9, 2017 at 5:22 PM, peter dalgaard <pdalgd at gmail.com> wrote:
>> AFAIR, it is a little more subtle than that.
>>
>> If you have replication weights, then the estimates are right, it is "just" that the SE from summary.lm() are wrong. Somehow, the text should reflect this.
>>
>> It is of some importance when you put glm() into the mix, because you can in fact get correct results from things like
>>
>> y <- c(0,1)
>> w <- c(49,51)
>> glm(y~1, weights=w, family=binomial)
>>
>> -pd
>>
>>> On 9 Oct 2017, at 07:58 , Arie ten Cate <arietencate at gmail.com> wrote:
>>>
>>> Yes. Thank you; I should have quoted it.
>>> I suggest removing this text or adding the word "not" at the beginning.
>>>
>>> Arie
>>>
>>> On Sun, Oct 8, 2017 at 4:38 PM, Viechtbauer Wolfgang (SP)
>>> <wolfgang.viechtbauer at maastrichtuniversity.nl> wrote:
>>>> Ah, I think you are referring to this part from ?lm:
>>>>
>>>> "(including the case that there are w_i observations equal to y_i and the data have been summarized)"
>>>>
>>>> I see; indeed, I don't think this is what 'weights' should be used for (the other part before that is correct). Sorry, I misunderstood the point you were trying to make.
>>>>
>>>> Best,
>>>> Wolfgang
>>>>
>>>> -----Original Message-----
>>>> From: R-devel [mailto:r-devel-bounces at r-project.org] On Behalf Of Arie ten Cate
>>>> Sent: Sunday, 08 October, 2017 14:55
>>>> To: r-devel at r-project.org
>>>> Subject: [Rd] Discourage the weights= option of lm with summarized data
>>>>
>>>> Indeed: Using 'weights' is not meant to indicate that the same
>>>> observation is repeated 'n' times. As I showed, this gives erroneous
>>>> results. Hence I suggested that it is discouraged rather than
>>>> encouraged in the Details section of lm in the Reference manual.
>>>>
>>>> Arie
>>>>
>>>> -----Original Message-----
>>>> On Sat, 7 Oct 2017, wolfgang.viechtbauer at maastrichtuniversity.nl wrote:
>>>>
>>>> Using 'weights' is not meant to indicate that the same observation is
>>>> repeated 'n' times. It is meant to indicate different variances (or to
>>>> be precise, that the variance of the last observation in 'x' is
>>>> sigma^2 / n, while the first three observations have variance
>>>> sigma^2).
>>>>
>>>> Best,
>>>> Wolfgang
>>>>
>>>> -----Original Message-----
>>>> From: R-devel [mailto:r-devel-bounces at r-project.org] On Behalf Of Arie ten Cate
>>>> Sent: Saturday, 07 October, 2017 9:36
>>>> To: r-devel at r-project.org
>>>> Subject: [Rd] Discourage the weights= option of lm with summarized data
>>>>
>>>> In the Details section of lm (linear models) in the Reference manual,
>>>> it is suggested to use the weights= option for summarized data. This
>>>> must be discouraged rather than encouraged. The motivation for this is
>>>> as follows.
>>>>
>>>> With summarized data the standard errors get smaller with increasing
>>>> numbers of observations. However, the standard errors in lm do not get
>>>> smaller when, for instance, all weights are multiplied by the same
>>>> constant larger than one, since the inverse weights are merely
>>>> proportional to the error variances.
>>>>
>>>> Here is an example of the estimated standard errors being too large
>>>> with the weights= option. The p value and the number of degrees of
>>>> freedom are also wrong. The parameter estimates are correct.
>>>>
>>>> n <- 10
>>>> x <- c(1,2,3,4)
>>>> y <- c(1,2,5,4)
>>>> w <- c(1,1,1,n)
>>>> xb <- c(x,rep(x[4],n-1)) # restore the original data
>>>> yb <- c(y,rep(y[4],n-1))
>>>> print(summary(lm(yb ~ xb)))
>>>> print(summary(lm(y ~ x, weights=w)))
>>>>
>>>> Compare with PROC REG in SAS, with a WEIGHT statement (like R) and a
>>>> FREQ statement (for summarized data).
>>>>
>>>> Arie
>>>>
>>>> ______________________________________________
>>>> R-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>
>> --
>> Peter Dalgaard, Professor,
>> Center for Statistics, Copenhagen Business School
>> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
>> Phone: (+45)38153501
>> Office: A 4.23
>> Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com
>>