[Rd] Discourage the weights= option of lm with summarized data

Tue Nov 28 11:15:14 CET 2017

It's on my todo list (for R-devel, it is not _that_ important), other things just keep taking priority...

-pd

> On 28 Nov 2017, at 09:29 , Arie ten Cate <arietencate at gmail.com> wrote:
> 
> Since the three posters agree (only) that there is a bug, I propose to
> file it as a bug, which is the least we can do now.
> 
> There is more to it: the only other case of a change in the Reference
> Manual which I know of, is also about the weights option! This is in
> coxph. The Reference Manual version 3.0.0 (2013) says about coxph:
> 
>   " ... If weights is a vector of integers, then the estimated
> coefficients are equivalent to estimating the model from data with the
> individual cases replicated as many times as indicated by weights."
> 
> This is not true, as can be seen from the following code, which uses
> the data from the first example in the Reference Manual of coxph:
> 
>   library(survival)
>   print(df1 <- as.data.frame(list(
>     time=c(4,3,1,1,2,2,3),
>   status=c(1,1,1,0,1,1,0),
>        x=c(0,2,1,1,1,0,0),
>      sex=c(0,0,0,0,1,1,1)
>   )))
>   print(w <- rep(2,7))
>   print(coxph(Surv(time,status) ~ x + strata(sex),data=df1,weights=w))
>      # manually doubling the data:
>   print(df2 <- rbind(df1,df1))
>   print(coxph(Surv(time,status) ~ x + strata(sex), data=df2))
> 
> This should not come as a surprise, since with coxph the computation
> of the likelihood (given the parameters) for a single observation uses
> also the other observations.
> 
> This bug has been repaired. The present Reference Manual of coxph says
> that the weights option specifies a vector of case weights, to which
> is added only: "For a thorough discussion of these see the book by
> Therneau and Grambsch."
> 
> Let us repair the other bug also.
> 
>    Arie
> 
> On Thu, Oct 12, 2017 at 1:48 PM, Arie ten Cate <arietencate at gmail.com> wrote:
>> OK. We have now three suggestions to repair the text:
>> - remove the text
>> - add "not" at the beginning of the text
>> - add at the end of the text a warning; something like:
>> 
>>  "Note that in this case the standard estimates of the parameters are
>> in general not correct, and hence also the t values and the p value.
>> Also the number of degrees of freedom is not correct. (The parameter
>> values are correct.)"
>> 
>> A remark about the glm example: the Reference manual says: "For a
>> binomial GLM prior weights are used to give the number of trials when
>> the response is the proportion of successes ....".  Hence in the
>> binomial case the weights are frequencies.
>> With y <- 0.51 and w <- 100 you get the same result.
>> 
>>   Arie
>> 
>> On Mon, Oct 9, 2017 at 5:22 PM, peter dalgaard <pdalgd at gmail.com> wrote:
>>> AFAIR, it is a little more subtle than that.
>>> 
>>> If you have replication weights, then the estimates are right, it is "just" that the SE from summary.lm() are wrong. Somehow, the text should reflect this.
>>> 
>>> It is of some importance when you put glm() into the mix, because you can in fact get correct results from things like
>>> 
>>> y <- c(0,1)
>>> w <- c(49,51)
>>> glm(y~1, weights=w, family=binomial)
>>> 
>>> -pd
>>> 
>>>> On 9 Oct 2017, at 07:58 , Arie ten Cate <arietencate at gmail.com> wrote:
>>>> 
>>>> Yes.  Thank you; I should have quoted it.
>>>> I suggest to remove this text or to add the word "not" at the beginning.
>>>> 
>>>>  Arie
>>>> 
>>>> On Sun, Oct 8, 2017 at 4:38 PM, Viechtbauer Wolfgang (SP)
>>>> <wolfgang.viechtbauer at maastrichtuniversity.nl> wrote:
>>>>> Ah, I think you are referring to this part from ?lm:
>>>>> 
>>>>> "(including the case that there are w_i observations equal to y_i and the data have been summarized)"
>>>>> 
>>>>> I see; indeed, I don't think this is what 'weights' should be used for (the other part before that is correct). Sorry, I misunderstood the point you were trying to make.
>>>>> 
>>>>> Best,
>>>>> Wolfgang
>>>>> 
>>>>> -----Original Message-----
>>>>> From: R-devel [mailto:r-devel-bounces at r-project.org] On Behalf Of Arie ten Cate
>>>>> Sent: Sunday, 08 October, 2017 14:55
>>>>> To: r-devel at r-project.org
>>>>> Subject: [Rd] Discourage the weights= option of lm with summarized data
>>>>> 
>>>>> Indeed: Using 'weights' is not meant to indicate that the same
>>>>> observation is repeated 'n' times.  As I showed, this gives erroneous
>>>>> results. Hence I suggested that it is discouraged rather than
>>>>> encouraged in the Details section of lm in the Reference manual.
>>>>> 
>>>>>  Arie
>>>>> 
>>>>> ---Original Message-----
>>>>> On Sat, 7 Oct 2017, wolfgang.viechtbauer at maastrichtuniversity.nl wrote:
>>>>> 
>>>>> Using 'weights' is not meant to indicate that the same observation is
>>>>> repeated 'n' times. It is meant to indicate different variances (or to
>>>>> be precise, that the variance of the last observation in 'x' is
>>>>> sigma^2 / n, while the first three observations have variance
>>>>> sigma^2).
>>>>> 
>>>>> Best,
>>>>> Wolfgang
>>>>> 
>>>>> -----Original Message-----
>>>>> From: R-devel [mailto:r-devel-bounces at r-project.org] On Behalf Of Arie ten Cate
>>>>> Sent: Saturday, 07 October, 2017 9:36
>>>>> To: r-devel at r-project.org
>>>>> Subject: [Rd] Discourage the weights= option of lm with summarized data
>>>>> 
>>>>> In the Details section of lm (linear models) in the Reference manual,
>>>>> it is suggested to use the weights= option for summarized data. This
>>>>> must be discouraged rather than encouraged. The motivation for this is
>>>>> as follows.
>>>>> 
>>>>> With summarized data the standard errors get smaller with increasing
>>>>> numbers of observations. However, the standard errors in lm do not get
>>>>> smaller when for instance all weights are multiplied with the same
>>>>> constant larger than one, since the inverse weights are merely
>>>>> proportional to the error variances.
>>>>> 
>>>>> Here is an example of the estimated standard errors being too large
>>>>> with the weights= option. The p value and the number of degrees of
>>>>> freedom are also wrong. The parameter estimates are correct.
>>>>> 
>>>>> n <- 10
>>>>> x <- c(1,2,3,4)
>>>>> y <- c(1,2,5,4)
>>>>> w <- c(1,1,1,n)
>>>>> xb <- c(x,rep(x[4],n-1))  # restore the original data
>>>>> yb <- c(y,rep(y[4],n-1))
>>>>> print(summary(lm(yb ~ xb)))
>>>>> print(summary(lm(y ~ x, weights=w)))
>>>>> 
>>>>> Compare with PROC REG in SAS, with a WEIGHT statement (like R) and a
>>>>> FREQ statement (for summarized data).
>>>>> 
>>>>>   Arie
>>>>> 
>>>>> ______________________________________________
>>>>> R-devel at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>> 
>>>> ______________________________________________
>>>> R-devel at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>> 
>>> --
>>> Peter Dalgaard, Professor,
>>> Center for Statistics, Copenhagen Business School
>>> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
>>> Phone: (+45)38153501
>>> Office: A 4.23
>>> Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
>>> 
>>> 
> 
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com