[R] question about the degrees of freedom
David Winsemius
dwinsemius at comcast.net
Mon May 3 17:05:38 CEST 2010
On May 3, 2010, at 10:38 AM, Ista Zahn wrote:
> Hi Serdal,
> There is a lot of confusion here (how much is yours and how much is
> mine remains to be seen). See specific comments in line.
I have added my comments inline as well.
>
> On Mon, May 3, 2010 at 9:19 AM, serdal ozusaglam
> <saint-filth at hotmail.com> wrote:
>>
>> Dear R users,
>>
>>
>> I think I have a simple question, which I will explain with an
>> example.
>>
>> I have several 2-digit industry codes that I want to use for
>> by-industry analysis, but I think there is a problem with the
>> degrees of freedom!
>>
>> For example, when I run my analysis without any 2-digit industry
>> code, I get the following summary (I have 146574 observations in
>> total):
>>> abc<-lm(lnQ~lnC+lnM+lnL+lnE+eco+inno, data=ds)
>>> summary(abc)
>>
>> Call:
>> lm(formula = lnQ ~ lnC + lnM + lnL + lnE + eco + inno, data = ds)
>>
>> Residuals:
>>       Min        1Q    Median        3Q       Max
>> -11.01340  -0.17637  -0.02217   0.14974   7.79005
>>
>> Coefficients:
>>              Estimate Std. Error  t value Pr(>|t|)
>> (Intercept) 0.8870369  0.0050646  175.144   <2e-16 ***
>> lnC         0.0658922  0.0006549  100.614   <2e-16 ***
>> lnM         0.8027478  0.0006549 1225.764   <2e-16 ***
>> lnL         0.0173622  0.0004025   43.138   <2e-16 ***
>> lnE         0.0657710  0.0006745   97.516   <2e-16 ***
>> ecoTRUE     0.0101649  0.0045892    2.215   0.0268 *
>> innoTRUE    0.0945100  0.0030317   31.174   <2e-16 ***
>> ---
>> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>>
>> Residual standard error: 0.294 on 146160 degrees of freedom
>> (407 observations deleted due to missingness)
>> Multiple R-squared: 0.9705, Adjusted R-squared: 0.9705
>> F-statistic: 8.027e+05 on 6 and 146160 DF, p-value: < 2.2e-16
>>
>> As we can see from the last row, there are 146160 DF (407
>> observations deleted). This is OK!
>>
>>
>
> Usually it is better to make a small example that demonstrates your
> issue. I have no idea what these variables are, which makes it harder
> to diagnose your problem.
>
>>
>>
>> But when I want to use just one of the industries, let's say the
>> 11th industry, I first create a dummy for this industry:
>>
>>
>>> ind1=(ind_2d==11) # so here R is supposed to consider just the
>>> 11th industry!!
>
> This makes no sense to me. What are you trying to do here? What is
> ind_2d? Are you trying to subset your data.frame? If so, see ?subset,
> or ?"["
Serdal is just creating a logical indicator variable (TRUE where
ind_2d equals 11). That does not subset the data.
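
To illustrate the difference on made-up data (the data frame "toy" and
all of its values are invented for illustration; only the name ind_2d
comes from the post): comparing a column with 11 returns one TRUE/FALSE
value per row and drops nothing, whereas subset() keeps only the
matching rows.

```r
# Toy data, invented for illustration; only the name ind_2d is from the post.
toy <- data.frame(ind_2d = c(11, 11, 12, 13, 11, 12),
                  lnQ    = c(1.2, 0.8, 1.5, 2.1, 0.9, 1.7))

# A logical indicator: one TRUE/FALSE per row, no rows are dropped.
ind <- toy$ind_2d == 11
length(ind)   # same as nrow(toy)
sum(ind)      # number of rows in industry 11

# A subset: only the rows for industry 11 are kept.
toy11 <- subset(toy, ind_2d == 11)
nrow(toy11)
```
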
>
>>> abc<-lm(lnQ~lnC+lnM+lnL+lnE+eco+inno+ind, data=ds)
>>> summary(abc)
>>
>> Call:
>> lm(formula = lnQ ~ lnC + lnM + lnL + lnE + eco + inno + ind,
>> data = ds)
>>
>> Residuals:
>>       Min        1Q    Median        3Q       Max
>> -11.03392  -0.17647  -0.02301   0.14901   7.74957
>>
>> Coefficients:
>>               Estimate Std. Error  t value Pr(>|t|)
>> (Intercept)  0.8980397  0.0050451  178.001  < 2e-16 ***
>> lnC          0.0672255  0.0006523  103.065  < 2e-16 ***
>> lnM          0.7990819  0.0006579 1214.596  < 2e-16 ***
>> lnL          0.0171633  0.0004004   42.870  < 2e-16 ***
>> lnE          0.0670030  0.0006716   99.770  < 2e-16 ***
>> ecoTRUE      0.0162249  0.0045672    3.552 0.000382 ***
>> innoTRUE     0.0966967  0.0030160   32.062  < 2e-16 ***
>> indTRUE     -0.1251466  0.0031509  -39.717  < 2e-16 ***
>> ---
>> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>>
>> Residual standard error: 0.2924 on 146159 degrees of freedom
>> (407 observations deleted due to missingness)
>> Multiple R-squared: 0.9709, Adjusted R-squared: 0.9709
>> F-statistic: 6.957e+05 on 7 and 146159 DF, p-value: < 2.2e-16
>>
>> But as we can see, it again counted all the industries, so the DF
>> is 146159!!!
>>
>>
>> So I just wonder: where did I make a mistake? Or is there no mistake
>> at all, and I have just misunderstood the DF issue?
>
> I think the misunderstanding runs deeper than that. Try creating a
> minimal example, and clearly stating a) what you are trying to
> accomplish, b) what you tried, and c) what doesn't work as you expect.
I, too, was puzzled by the OP's reaction. Serdal added a single
logical predictor variable to a model that already had two such
variables. As a result, the model degrees of freedom increased by one
(from 6 to 7) and the residual degrees of freedom decreased by one
(from 146160 to 146159). Where is the problem? And why wasn't this
question posed earlier, when the "eco" and "inno" variables were
added? Perhaps he was expecting the model degrees of freedom to
increase by the number of records for which ind is TRUE, but that is
not the way ordinary regression works: a logical predictor adds one
coefficient (indTRUE) and therefore uses one degree of freedom, however
many rows are TRUE. Perhaps he should do some reading on mixed effects
modeling? Or perhaps that is what his professor or supervisor is hoping
he will learn by assigning this task? Or perhaps he needs to learn to
use the anova() function?
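
A minimal sketch of the point, on simulated data (every name and value
below is made up; none of it is Serdal's data). It shows the residual
degrees of freedom dropping by exactly one when a logical predictor is
added, and anova() comparing the two nested fits. The last fit assumes,
as a guess on my part, that the real goal was to estimate the model for
industry 11 only; that is done with the subset argument of lm(), not
with a dummy.

```r
set.seed(1)  # simulated data; every name and value here is made up
n <- 200
sim <- data.frame(x      = rnorm(n),
                  ind_2d = sample(c(11, 12, 13), n, replace = TRUE))
sim$ind <- sim$ind_2d == 11          # logical indicator, one value per row
sim$y   <- 1 + 0.5 * sim$x - 0.2 * sim$ind + rnorm(n, sd = 0.3)

fit0 <- lm(y ~ x,       data = sim)  # without the indicator
fit1 <- lm(y ~ x + ind, data = sim)  # with the indicator

df.residual(fit0)   # n - 2
df.residual(fit1)   # n - 3: one extra coefficient (indTRUE)

anova(fit0, fit1)   # F test for the added term, on 1 degree of freedom

# Fitting industry 11 only: here the number of rows used does change.
fit11 <- lm(y ~ x, data = sim, subset = ind_2d == 11)
df.residual(fit11)  # (number of industry-11 rows) - 2
```
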
>
> Best,
> Ista
>
--
David Winsemius, MD
West Hartford, CT