[R] Where to find the source code for survey::SE
Iulia Dumitru
|u||@dmtru @end|ng |rom gm@||@com
Tue Jul 5 11:00:03 CEST 2022
Hello! I want to port some functionality from the Survey package to Julia and I am stuck on standard error calculation. How is the standard error calculated in the Survey package?
I searched in the can/survey repo on GitHub and I couldn’t find the source code of the SE function. I also looked through Thomas Lumley’s book Complex Surveys A Guide to Analysis Using R and from there I understood that the standard error is calculated by first calculating the Horvitz-Thompson variance estimator and then taking the square root of that, but I get completely different results.
This is the R code I used:
> library(survey)
> data(api)
> dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)
> svyby(~api00, by = ~cname, design = dclus1, svymean)
And this is what I obtain:
cname api00 se
Alameda Alameda 669.0000 1.264044e-15
Fresno Fresno 472.0000 0.000000e+00
Kern Kern 452.5000 0.000000e+00
Los Angeles Los Angeles 647.2667 1.724959e+01
Mendocino Mendocino 623.2500 0.000000e+00
Merced Merced 519.2500 0.000000e+00
Orange Orange 710.5625 0.000000e+00
Plumas Plumas 709.5556 1.248655e-13
San Diego San Diego 659.4364 2.296800e+00
San Joaquin San Joaquin 551.1892 1.354175e-13
Santa Clara Santa Clara 732.0769 1.577292e+01
With Julia this is the result I get:
11×3 DataFrame
Row │ cname mean se
│ String15 Float64 Float64
─────┼────────────────────────────────
1 │ Alameda 669.0 6469.11
2 │ Fresno 472.0 7832.61
3 │ Kern 452.5 10703.1
4 │ Los Angeles 647.267 5232.24
5 │ Mendocino 623.25 10364.8
6 │ Merced 519.25 8616.22
7 │ Orange 710.563 5534.39
8 │ Plumas 709.556 7673.15
9 │ San Diego 659.436 1882.06
10 │ San Joaquin 551.189 2254.24
11 │ Santa Clara 732.077 4000.03
And just to have the full picture, this is my code for the Horvitz-Thompson variance estimator:
function hte(y, pw, n)
p = 1 .- (1 .- 1 ./ pw) .^ n
first_sum = sum((1 .- p) ./ (p .^ 2) .* y .^ 2)
second_sum = 0
for i in 1:length(p)
for j in 1:length(p)
if i == j
continue
end
p_ij = p[i] + p[j] - (1 - (1 - 1 / pw[i] - 1 / pw[j]) ^ n)
second_sum += (p_ij - p[i] * p[j]) / (p[i] * p[j]) * y[i] * y[j]
end
end
return first_sum + second_sum
end
The standard error in the resulted data frame above is just the square root of the hte result.
[[alternative HTML version deleted]]
More information about the R-help
mailing list