[R] Weighted least squares
Adaikalavan Ramasamy
ramasamy at cancer.org.uk
Tue May 8 16:12:44 CEST 2007
Sorry, you did not explain in the original post that your weights
correspond to frequencies. I assumed they were repeated measurements
with within-group variation.
I was merely responding to your query as to why the following differed:
summary(lm(y ~ x, data=df, weights=rep(2, 100)))
summary(lm(y ~ x, data=rbind(df,df)))
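The two calls give identical coefficient estimates; they differ only in
the residual degrees of freedom, because lm() counts rows with positive
weight rather than the total weight. A minimal sketch (reusing the df
from your post) that also shows the sum(weights > 0) behaviour you
noticed:

f.w <- lm(y ~ x, data=df, weights=rep(2, 100))
f.u <- lm(y ~ x, data=rbind(df, df))
all.equal(coef(f.w), coef(f.u))        # TRUE: identical point estimates
c(df.residual(f.w), df.residual(f.u))  # 98 vs 198, hence the differing standard errors

# a zero weight drops the row entirely; a tiny positive weight does not
f.0 <- lm(y ~ x, data=df, weights=rep(c(0,    2), each=50))
f.e <- lm(y ~ x, data=df, weights=rep(c(0.01, 2), each=50))
c(df.residual(f.0), df.residual(f.e))  # 48 vs 98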
Let me also clarify my statement about "artificial". If one treats
repeated observations as independent, one obtains estimates with
inflated precision. I was not calling your data artificial in any way.
Using frequencies as weights may be valid. Your data points appear to
arise from a discrete distribution, so I am not entirely sure you can
use the linear model, which assumes the errors are normally distributed.
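That said, if frequency weights are what you want, one simple fix (a
sketch only; please double-check) is to fit with the counts as weights
and then rescale the reported standard errors so the residual degrees of
freedom reflect the total count rather than the number of rows:

w   <- rep(2, 100)                             # frequency of each row
fit <- lm(y ~ x, data=df, weights=w)
p   <- length(coef(fit))
adj <- sqrt(df.residual(fit) / (sum(w) - p))   # sqrt(98/198) here
coef(summary(fit))[, "Std. Error"] * adj       # matches the rbind(df,df) fit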
Regards, Adai
hadley wickham wrote:
> On 5/8/07, Adaikalavan Ramasamy <ramasamy at cancer.org.uk> wrote:
>> See below.
>>
>> hadley wickham wrote:
>> > Dear all,
>> >
>> > I'm struggling with weighted least squares, where something that I had
>> > assumed to be true appears not to be the case. Take the following
>> > data set as an example:
>> >
>> > df <- data.frame(x = runif(100, 0, 100))
>> > df$y <- df$x + 1 + rnorm(100, sd=15)
>> >
>> > I had expected that:
>> >
>> > summary(lm(y ~ x, data=df, weights=rep(2, 100)))
>> > summary(lm(y ~ x, data=rbind(df,df)))
>>
>> You assign weights to different points according to some external
>> quality or reliability measure, not the number of times the data point
>> was measured.
>
> That is one type of weighting - but what if I have already aggregated
> data? That is a perfectly valid type of weighting too.
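> For example (a rough sketch of what I mean), the aggregated form of
> rbind(df, df) would collapse the repeated rows to unique (x, y) pairs
> plus a count, and the count is the natural weight:
>
> df2 <- rbind(df, df)   # stand-in for raw data with repeated rows
> agg <- aggregate(list(freq = rep(1, nrow(df2))), by = df2[c("x", "y")], FUN = sum)
> lm(y ~ x, data = agg, weights = freq)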
>
>> Look at the estimates and standard errors of the two models below:
>>
>> coefficients( summary(f.w <- lm(y ~ x, data=df, weights=rep(2, 100))) )
>>             Estimate Std. Error   t value     Pr(>|t|)
>> (Intercept) 1.940765 3.30348066  0.587491 5.582252e-01
>> x           0.982610 0.05893262 16.673448 2.264258e-30
>>
>> coefficients( summary( f.u <- lm(y ~ x, data=rbind(df,df) ) ) )
>>             Estimate Std. Error    t value     Pr(>|t|)
>> (Intercept) 1.940765 2.32408609  0.8350659 4.046871e-01
>> x           0.982610 0.04146066 23.6998165 1.012067e-59
>>
>> You can see that they have the same coefficient estimates but the
>> second one has smaller standard errors.
>>
>> The repeated values artificially deflate the variance and thus inflate
>> the precision. This is why you cannot treat replicated data as
>> independent observations.
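>>
>> (In fact the ratio of the two standard errors is exactly
>> sqrt(98/198), the square root of the ratio of the residual degrees of
>> freedom the two fits use; a quick check against the tables above:)
>>
>> sqrt(98/198)              # 0.7035265
>> 2.32408609 / 3.30348066   # 0.7035265 for the intercept
>> 0.04146066 / 0.05893262   # 0.7035264 for the slope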
>
> Hardly artificially - I have repeated observations.
>
>> > would be equivalent, but they are not. I suspect the difference is in
>> > how the degrees of freedom are calculated - I had expected it to be
>> > sum(weights), but it seems to be sum(weights > 0). This seems
>> > unintuitive to me:
>> >
>> > summary(lm(y ~ x, data=df, weights=rep(c(0,2), each=50)))
>> > summary(lm(y ~ x, data=df, weights=rep(c(0.01,2), each=50)))
>> >
>> > What am I missing? And what is the usual way to do a linear
>> > regression when you have aggregated data?
>>
>> It would be best to use the individual data points instead of
>> aggregated data, as that allows you to estimate the within-group
>> variation as well.
>
> There is no within-group variation - these are observations that occur
> with the same values many times in the dataset, so they have been
> aggregated into a contingency-table-like format.
>
>> If you had individual data points, you could try something like the
>> following. Please check the code as I am no expert in the area of
>> repeated measures.
>>
>> # simulated example: two measurements (y1, y2) on each of 100 subjects
>> x  <- runif(100, 0, 100)
>> y1 <- x + rnorm(100, mean=1, sd=15)
>> y2 <- y1 + rnorm(100, sd=5)
>>
>> df <- data.frame( y = c(y1, y2),
>>                   x = c(x, x),
>>                   subject = factor(rep( paste("p", 1:100, sep=""), 2 )) )
>>
>> library(nlme)
>> # random intercept per subject accounts for the repeated measurements
>> summary( lme( y ~ x, random = ~ 1 | subject, data=df ) )
>>
>> Try reading Pinheiro and Bates (http://tinyurl.com/yvvrr7) or related
>> material for more information. Hope this helps.
>
> I'm not interested in a mixed model, and I don't have individual data
> points.
>
> Hadley