Error estimation

2022-12-19

For the most part, this document will present the functionalities of the function surveysd::calc.stError() which generates point estimates and standard errors for user-supplied estimation functions.

Prerequisites

In order to use a dataset with calc.stError(), several weight columns have to be present. Each weight column corresponds to a bootstrap sample. In the following examples, we will use the data from demo.eusilc() and attach the bootstrap weights using draw.bootstrap() and recalib(). Please refer to the documentation of those functions for more detail.

library(surveysd)

set.seed(1234)
eusilc <- demo.eusilc(prettyNames = TRUE)
dat_boot <- draw.bootstrap(eusilc, REP = 10, hid = "hid", weights = "pWeight",
                           strata = "region", period = "year")
dat_boot_calib <- recalib(dat_boot, conP.var = "gender", conH.var = "region",
                          epsP = 1e-2, epsH = 2.5e-2, verbose = FALSE)
dat_boot_calib[, onePerson := nrow(.SD) == 1, by = .(year, hid)]

## print part of the dataset
dat_boot_calib[1:5, .(year, povertyRisk, eqIncome, onePerson, pWeight, w1, w2, w3, w4, w5)]
year povertyRisk eqIncome onePerson pWeight w1 w2 w3 w4 w5
2010 FALSE 16090.69 FALSE 504.5696 0.4477743 1007.5651 0.4466992 0.4489062 1019.721
2010 FALSE 16090.69 FALSE 504.5696 0.4477743 1007.5651 0.4466992 0.4489062 1019.721
2010 FALSE 16090.69 FALSE 504.5696 0.4477743 1007.5651 0.4466992 0.4489062 1019.721
2011 FALSE 16090.69 FALSE 504.5696 0.4237016 968.2711 0.4468813 0.4654705 1005.100
2011 FALSE 16090.69 FALSE 504.5696 0.4237016 968.2711 0.4468813 0.4654705 1005.100

Estimator functions

The parameters fun and var in calc.stError() define the estimator to be used in the error analysis. There are two built-in estimator functions weightedSum() and weightedRatio() which can be used as follows.

povertyRate <- calc.stError(dat_boot_calib, var = "povertyRisk", fun = weightedRatio)
totalIncome <- calc.stError(dat_boot_calib, var = "eqIncome", fun = weightedSum)

Those functions calculate the ratio of persons at risk of poverty (in percent) and the total income. By default, the results are calculated separately for each reference period.

povertyRate$Estimates
year n N val_povertyRisk stE_povertyRisk
2010 14827 8182222 14.44422 0.5838199
2011 14827 8182222 14.77393 0.6771848
2012 14827 8182222 15.04515 0.4907275
2013 14827 8182222 14.89013 0.4756386
2014 14827 8182222 15.14556 0.5195149
2015 14827 8182222 15.53640 0.4808834
2016 14827 8182222 15.08315 0.3566648
2017 14827 8182222 15.42019 0.5592589
totalIncome$Estimates
year n N val_eqIncome stE_eqIncome
2010 14827 8182222 162750998071 960851034
2011 14827 8182222 161926931417 901915958
2012 14827 8182222 162576509628 1198230549
2013 14827 8182222 163199507862 1474893389
2014 14827 8182222 163986275009 1528829971
2015 14827 8182222 163416275447 1338217239
2016 14827 8182222 162706205137 1207314432
2017 14827 8182222 164314959107 1322744977

Columns that use the val_ prefix denote the point estimate belonging to the “main weight” of the dataset, which is pWeight in case of the dataset used here.

Columns with the stE_ prefix denote standard errors calculated with bootstrap replicates. The replicates result in using w1, w2, …, w10 instead of pWeight when applying the estimator.

n denotes the number of observations for the year and N denotes the total weight of those persons.

Custom estimators

In order to define a custom estimator function to be used in fun, the function needs to have at least two arguments like the example below.

## define custom estimator
myWeightedSum <- function(x, w) {
  sum(x*w)
}

## check if results are equal to the one using `surveysd::weightedSum()`
totalIncome2 <- calc.stError(dat_boot_calib, var = "eqIncome", fun = myWeightedSum)
all.equal(totalIncome$Estimates, totalIncome2$Estimates)
## [1] TRUE

The parameters x and w can be assumed to be vectors with equal length with w being numeric weight vector and x being the column defined in the var argument. It will be called once for each period (in this case year) and for each weight column (in this case pWeight, w1, w2, …, w10).

Custom estimators using additional parameters can also be supplied and parameter add.arg can be used to set the additional arguments for the custom estimator.

## use add.arg-argument
fun <- function(x, w, b) {
  sum(x*w*b)
}
add.arg = list(b="onePerson")

err.est <- calc.stError(dat_boot_calib, var = "povertyRisk", fun = fun,
                        period.mean = 0, add.arg=add.arg)
err.est$Estimates
year n N val_povertyRisk stE_povertyRisk
2010 14827 8182222 273683.9 12521.133
2011 14827 8182222 261883.6 9478.316
2012 14827 8182222 243083.9 8099.185
2013 14827 8182222 238004.4 12366.440
2014 14827 8182222 218572.1 16077.990
2015 14827 8182222 219984.1 14706.184
2016 14827 8182222 201753.9 14324.090
2017 14827 8182222 196881.2 9721.776
# compare with direct computation
compare.value <- dat_boot_calib[,fun(povertyRisk,pWeight,b=onePerson),
                                 by=c("year")]
all((compare.value$V1-err.est$Estimates$val_povertyRisk)==0)
## [1] TRUE

The above chunk computes the weighted poverty ratio for single person households.

Adjust variable depending on bootstrap weights

In our example the variable povertyRisk is a boolean and is TRUE if the income is less than 60% of the weighted median income. Thus it directly depends on the original weight vector pWeight. To further reduce the estimated error one should calculate for each bootstrap replicate weight \(w\) the weighted median income \(medIncome_{w}\) and then define \(povertyRisk_w\) as

\[ povertyRisk_w = \cases{1 \quad\text{if Income}<0.6\cdot medIncome_{w}\\ 0 \quad\text{else}} \]

The estimator can then be applied to the new variable \(povertyRisk_w\). This can be realized using a custom estimator function.

# custom estimator to first derive poverty threshold 
# and then estimate a weighted ratio
povmd <- function(x, w) {
 md <- laeken::weightedMedian(x, w)*0.6
 pmd60 <- x < md
 # weighted ratio is directly estimated inside the function
 return(sum(w[pmd60])/sum(w)*100)
}

err.est <- calc.stError(
  dat_boot_calib, var = "povertyRisk", fun = weightedRatio,
  fun.adjust.var = povmd, adjust.var = "eqIncome")
err.est$Estimates
year n N val_povertyRisk stE_povertyRisk
2010 14827 8182222 14.44422 0
2011 14827 8182222 14.77393 0
2012 14827 8182222 15.04515 0
2013 14827 8182222 14.89013 0
2014 14827 8182222 15.14556 0
2015 14827 8182222 15.53640 0
2016 14827 8182222 15.08315 0
2017 14827 8182222 15.42019 0

The approach shown above is only valid if not grouping variables are supplied (parameter group). If grouping variables are supplied one should use parameters fun.adjust.var and adjust.var such that the \(povertyRisk_w\) is first calculated for each period and then used for each grouping in group.

# using fun.adjust.var and adjust.var to estimate povmd60 indicator
# for each period and bootstrap weight before applying the weightedRatio
povmd2 <- function(x, w) {
 md <- laeken::weightedMedian(x, w)*0.6
 pmd60 <- x < md
 return(as.integer(pmd60))
}

# set adjust.var="eqIncome" so the income vector is used to estimate
# the povmd60 indicator for each bootstrap weight
# and the resulting indicators are passed to function weightedRatio
group <- "gender"
err.est <- calc.stError(
  dat_boot_calib, var = "povertyRisk", fun = weightedRatio, group = "gender",
  fun.adjust.var = povmd2, adjust.var = "eqIncome")
err.est$Estimates
year n N gender val_povertyRisk stE_povertyRisk
2010 7267 3979572 male 12.02660 0.6237242
2010 7560 4202650 female 16.73351 0.7234573
2010 14827 8182222 NA 14.44422 0.6330581
2011 7267 3979572 male 12.81921 0.5870065
2011 7560 4202650 female 16.62488 0.6907626
2011 14827 8182222 NA 14.77393 0.6103274
2012 7267 3979572 male 13.76065 0.5105388
2012 7560 4202650 female 16.26147 0.7726815
2012 14827 8182222 NA 15.04515 0.6236907
2013 7267 3979572 male 13.88962 0.5193635
2013 7560 4202650 female 15.83754 0.6342728
2013 14827 8182222 NA 14.89013 0.5654875
2014 7267 3979572 male 14.50351 0.5988502
2014 7560 4202650 female 15.75353 0.5824421
2014 14827 8182222 NA 15.14556 0.5411586
2015 7267 3979572 male 15.12289 0.4298710
2015 7560 4202650 female 15.92796 0.5197222
2015 14827 8182222 NA 15.53640 0.3928424
2016 7267 3979572 male 14.57968 0.3692397
2016 7560 4202650 female 15.55989 0.4607921
2016 14827 8182222 NA 15.08315 0.2784766
2017 7267 3979572 male 14.94816 0.3491172
2017 7560 4202650 female 15.86717 0.4099472
2017 14827 8182222 NA 15.42019 0.3168769

Multiple estimators

In case an estimator should be applied to several columns of the dataset, var can be set to a vector containing all necessary columns.

multipleRates <- calc.stError(dat_boot_calib, var = c("povertyRisk", "onePerson"), fun = weightedRatio)
multipleRates$Estimates
year n N val_povertyRisk stE_povertyRisk val_onePerson stE_onePerson
2010 14827 8182222 14.44422 0.5838199 14.85737 0.2572088
2011 14827 8182222 14.77393 0.6771848 14.85737 0.2778282
2012 14827 8182222 15.04515 0.4907275 14.85737 0.2909612
2013 14827 8182222 14.89013 0.4756386 14.85737 0.3458258
2014 14827 8182222 15.14556 0.5195149 14.85737 0.4394058
2015 14827 8182222 15.53640 0.4808834 14.85737 0.3810024
2016 14827 8182222 15.08315 0.3566648 14.85737 0.3105411
2017 14827 8182222 15.42019 0.5592589 14.85737 0.3030293

Here we see the relative number of persons at risk of poverty and the relative number of one-person households.

Grouping

The groups argument can be used to calculate estimators for different subsets of the data. This argument can take the grouping variable as a string that refers to a column name (usually a factor) in dat. If set, all estimators are not only split by the reference period but also by the grouping variable. For simplicity, only one reference period of the above data is used.

dat2 <- subset(dat_boot_calib, year == 2010)
for (att  in c("period", "weights", "b.rep"))
  attr(dat2, att) <- attr(dat_boot_calib, att)

To calculate the ratio of persons at risk of poverty for each federal state of Austria, group = "region" can be used.

povertyRates <- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio, group = "region")
povertyRates$Estimates
year n N region val_povertyRisk stE_povertyRisk
2010 549 260564 Burgenland 19.53984 1.8042764
2010 733 377355 Vorarlberg 16.53731 3.2896891
2010 924 535451 Salzburg 13.78734 2.2585866
2010 1078 563648 Carinthia 13.08627 1.7469304
2010 1317 701899 Tyrol 15.30819 1.3943715
2010 2295 1167045 Styria 14.37464 1.3500431
2010 2322 1598931 Vienna 17.23468 1.0428496
2010 2804 1555709 Lower Austria 13.84362 1.7034375
2010 2805 1421620 Upper Austria 10.88977 0.8709790
2010 14827 8182222 NA 14.44422 0.5838199

The last row with region = NA denotes the aggregate over all regions. Note that the columns N and n now show the weighted and unweighted number of persons in each region.

Several grouping variables

In case more than one grouping variable is used, there are several options of calling calc.stError() depending on whether combinations of grouping levels should be regarded or not. We will consider the variables gender and region as our grouping variables and show three options on how calc.stError() can be called.

Option 1: All regions and all genders

Calculate the point estimate and standard error for each region and each gender. The number of rows in the output is therefore

\[n_\text{periods}\cdot(n_\text{regions} + n_\text{genders} + 1) = 1\cdot(9 + 2 + 1) = 12.\]

The last row is again the estimate for the whole period.

povertyRates <- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio, 
                             group = c("gender", "region"))
povertyRates$Estimates
year n N gender region val_povertyRisk stE_povertyRisk
2010 549 260564 NA Burgenland 19.53984 1.8042764
2010 733 377355 NA Vorarlberg 16.53731 3.2896891
2010 924 535451 NA Salzburg 13.78734 2.2585866
2010 1078 563648 NA Carinthia 13.08627 1.7469304
2010 1317 701899 NA Tyrol 15.30819 1.3943715
2010 2295 1167045 NA Styria 14.37464 1.3500431
2010 2322 1598931 NA Vienna 17.23468 1.0428496
2010 2804 1555709 NA Lower Austria 13.84362 1.7034375
2010 2805 1421620 NA Upper Austria 10.88977 0.8709790
2010 7267 3979572 male NA 12.02660 0.6303407
2010 7560 4202650 female NA 16.73351 0.6300745
2010 14827 8182222 NA NA 14.44422 0.5838199

Option 2: All combinations of state and gender

Split the data by all combinations of the two grouping variables. This will result in a larger output-table of the size

\[n_\text{periods}\cdot(n_\text{regions} \cdot n_\text{genders} + 1) = 1\cdot(9\cdot2 + 1)= 19.\]

povertyRates <- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio, 
                             group = list(c("gender", "region")))
povertyRates$Estimates
year n N gender region val_povertyRisk stE_povertyRisk
2010 261 122741.8 male Burgenland 17.414524 2.2332286
2010 288 137822.2 female Burgenland 21.432598 2.0899871
2010 359 182732.9 male Vorarlberg 12.973259 3.1093204
2010 374 194622.1 female Vorarlberg 19.883637 3.7576712
2010 440 253143.7 male Salzburg 9.156964 1.9028612
2010 484 282307.3 female Salzburg 17.939382 2.5374889
2010 517 268581.4 male Carinthia 10.552148 2.0614209
2010 561 295066.6 female Carinthia 15.392924 1.9756282
2010 650 339566.5 male Tyrol 12.857542 1.0371712
2010 667 362332.5 female Tyrol 17.604861 2.2145631
2010 1128 571011.7 male Styria 11.671247 1.5111686
2010 1132 774405.4 male Vienna 15.590616 1.3348673
2010 1167 596033.3 female Styria 16.964539 1.3758548
2010 1190 824525.6 female Vienna 18.778813 0.9304295
2010 1363 684272.5 male Upper Austria 9.074690 1.1711956
2010 1387 772593.2 female Lower Austria 16.372949 1.8366058
2010 1417 783115.8 male Lower Austria 11.348283 1.6437512
2010 1442 737347.5 female Upper Austria 12.574205 0.7710579
2010 14827 8182222.0 NA NA 14.444218 0.5838199

Option 3: Cobination of Option 1 and Option 2

In this case, the estimates and standard errors are calculated for

  • every gender,
  • every state and
  • every combination of state and gender.

The number of rows in the output is therefore

\[n_\text{periods}\cdot(n_\text{regions} \cdot n_\text{genders} + n_\text{regions} + n_\text{genders} + 1) = 1\cdot(9\cdot2 + 9 + 2 + 1) = 30.\]

povertyRates <- calc.stError(dat2, var = "povertyRisk", fun = weightedRatio, 
                             group = list("gender", "region", c("gender", "region")))
povertyRates$Estimates
year n N gender region val_povertyRisk stE_povertyRisk
2010 261 122741.8 male Burgenland 17.414524 2.2332286
2010 288 137822.2 female Burgenland 21.432598 2.0899871
2010 359 182732.9 male Vorarlberg 12.973259 3.1093204
2010 374 194622.1 female Vorarlberg 19.883637 3.7576712
2010 440 253143.7 male Salzburg 9.156964 1.9028612
2010 484 282307.3 female Salzburg 17.939382 2.5374889
2010 517 268581.4 male Carinthia 10.552148 2.0614209
2010 549 260564.0 NA Burgenland 19.539836 1.8042764
2010 561 295066.6 female Carinthia 15.392924 1.9756282
2010 650 339566.5 male Tyrol 12.857542 1.0371712
2010 667 362332.5 female Tyrol 17.604861 2.2145631
2010 733 377355.0 NA Vorarlberg 16.537310 3.2896891
2010 924 535451.0 NA Salzburg 13.787343 2.2585866
2010 1078 563648.0 NA Carinthia 13.086268 1.7469304
2010 1128 571011.7 male Styria 11.671247 1.5111686
2010 1132 774405.4 male Vienna 15.590616 1.3348673
2010 1167 596033.3 female Styria 16.964539 1.3758548
2010 1190 824525.6 female Vienna 18.778813 0.9304295
2010 1317 701899.0 NA Tyrol 15.308191 1.3943715
2010 1363 684272.5 male Upper Austria 9.074690 1.1711956
2010 1387 772593.2 female Lower Austria 16.372949 1.8366058
2010 1417 783115.8 male Lower Austria 11.348283 1.6437512
2010 1442 737347.5 female Upper Austria 12.574205 0.7710579
2010 2295 1167045.0 NA Styria 14.374637 1.3500431
2010 2322 1598931.0 NA Vienna 17.234683 1.0428496
2010 2804 1555709.0 NA Lower Austria 13.843623 1.7034375
2010 2805 1421620.0 NA Upper Austria 10.889773 0.8709790
2010 7267 3979571.7 male NA 12.026600 0.6303407
2010 7560 4202650.3 female NA 16.733508 0.6300745
2010 14827 8182222.0 NA NA 14.444218 0.5838199