Title: | Statistical Inference for Weighted Data |
Version: | 0.1.0 |
Description: | Analyzes and models data subject to sampling biases. Provides functions to estimate the density and cumulative distribution functions from biased samples of continuous distributions. Includes the estimators proposed by Bhattacharyya et al. (1988) <doi:10.1080/03610928808829825> and Jones (1991) <doi:10.2307/2337020> for density, and by Cox (2005, ISBN:052184939X) and Bose and Dutta (2022) <doi:10.1007/s00184-021-00824-3> for distribution, with different bandwidth selectors. Also includes a real length-biased dataset on shrub width from Muttlak (1988) https://www.proquest.com/openview/3dd74592e623cdbcfa6176e85bd3d390/1?cbl=18750&diss=y&pq-origsite=gscholar. |
License: | GPL-3 |
URL: | https://github.com/noeliasanchmrt/WData |
BugReports: | https://github.com/noeliasanchmrt/WData/issues |
Depends: | R (≥ 3.5.0) |
Imports: | bayesmeta, evmix, KScorrect, latex2exp, progress, Rcpp, Rdpack |
Suggests: | roxygen2, testthat (≥ 3.0.0), vdiffr |
Enhances: | stats |
RdMacros: | Rdpack |
Config/testthat/edition: | 3 |
Encoding: | UTF-8 |
Language: | en-US |
LazyData: | true |
RoxygenNote: | 7.3.2 |
NeedsCompilation: | no |
Packaged: | 2025-07-30 16:43:21 UTC; noeliasanchezmartinez |
Author: | Sánchez Martínez Noelia
|
Maintainer: | Sánchez Martínez Noelia <noelia.sanchez.martinez@rai.usc.es> |
Repository: | CRAN |
Date/Publication: | 2025-08-18 14:40:08 UTC |
WData: Statistical Inference for Weighted Data
Description
Analyzes and models data subject to sampling biases. Provides functions to estimate the density and cumulative distribution functions from biased samples of continuous distributions. Includes the estimators proposed by Bhattacharyya et al. (1988) doi:10.1080/03610928808829825 and Jones (1991) doi:10.2307/2337020 for density, and by Cox (2005, ISBN:052184939X) and Bose and Dutta (2022) doi:10.1007/s00184-021-00824-3 for distribution, with different bandwidth selectors. Also includes a real length-biased dataset on shrub width from Muttlak (1988) https://www.proquest.com/openview/3dd74592e623cdbcfa6176e85bd3d390/1?cbl=18750&diss=y&pq-origsite=gscholar.
Author(s)
Maintainer: Sánchez Martínez Noelia noelia.sanchez.martinez@rai.usc.es (ORCID)
Authors:
Borrajo Garcia Maria Isabel mariaisabel.borrajo@usc.es (ORCID)
Conde Amboage Mercedes mercedes.amboage@usc.es (ORCID)
See Also
Useful links:
Report bugs at https://github.com/noeliasanchmrt/WData/issues
Bose and Dutta (2022) local bandwidth selector for Bose and Dutta (2022) kernel distribution estimator
Description
This function implements the local bandwidth selector proposed by Bose and Dutta (2022) for their own kernel distribution estimator.
Usage
bw.F.BD(
y,
w = function(y) {
ifelse(y >= 0, y, NA)
},
y.seq,
cy.seq
)
Arguments
y |
A numeric vector containing the biased sample. |
w |
A function representing the bias function to be used. It must be evaluable and positive in each point of the sample |
y.seq |
A numeric vector containing the points on which the local bandwidth is estimated. |
cy.seq |
A numeric vector representing the constants to be used in the bandwidth estimation for each point of |
Details
Local bandwidths selectors are estimated using the formula:
\widehat{h}_{F, \mathrm{BD}, C(y)} (y) = \frac{C(y) \widehat{\sigma}_{w}}{(n w(y))^{1/3}},
where C(y)
is a positive parameter that depends on the point y
and \widehat{\sigma}_{w}
is an estimation of the standard deviation of the distribution given by
\widehat{\sigma}_w=\sqrt{\left(\frac{1}{n} \sum_{i=1}^n \frac{1}{w(Y_i)}\right)^{-1}
\left[\left(\frac{1}{n} \sum_{i=1}^n w(Y_i)\right)-\left(\frac{1}{n} \sum_{i=1}^n \frac{1}{w(Y_i)}\right)^{-1}\right]}.
The parameter C(y)
is provided to the function by using the argument cy.seq
, which is a vector of positive values that is used to compute the bandwidth for each point in y.seq
. Alternatively, a single numeric value can be provided, which will be used for all points in the y.seq
vector.
The simulations carried out by Bose and Dutta (2022) suggest that choosing C(y)=0.25
or C(y)=0.5
provides good results in the tail region of the distribution, with tails defined as points below the 5th percentile or above the 95th percentile. On the other hand, C(y)=1.3
provides good results for the remaining points.
If some bandwidths are not positive, they are replaced by the mean of the neighbors.
Value
A numeric vector containing the bandwidths for each point in y.seq
.
References
Bose A, Dutta S (2022). “Kernel based estimation of the distribution function for length biased data.” Metrika, 85, 269–287.
See Also
Examples
bw.F.BD(shrub.data$Width, y.seq = seq(0, 1, length.out = 512), cy.seq = rep(1, 512))
Cross-validation bandwidth selector for Bose and Dutta (2022) kernel distribution estimator
Description
This function performs bandwidth selection for Bose and Dutta (2022) kernel distribution estimator using cross-validation criteria. It iterates through a range of bandwidth values and computes the cross-validation score for each bandwidth. The bandwidth that minimizes the cross-validation function is selected as the optimal bandwidth.
Usage
bw.F.SBC.cv(
y,
w = function(y) {
ifelse(y >= 0, y, NA)
},
kernel = c("gaussian", "epanechnikov", "rectangular", "triangular", "biweight",
"cosine", "optcosine"),
lower = IQR(y) * length(y)^(-1/3) * 0.05,
upper = IQR(y) * length(y)^(-1/3) * 5,
nh = 200L,
tol = 0.1 * lower,
plot = TRUE
)
Arguments
y |
A numeric vector containing the biased sample. |
w |
A function representing the bias function applied to the data points. It must be evaluable and positive in each point of the sample the sample |
kernel |
A character string specifying the kernel function. Available options: |
lower |
Numeric value specifying the lower bound for bandwidth selection. Default is computed based on the interquartile range (IQR) and sample size. |
upper |
Numeric value specifying the upper bound for bandwidth selection. Default is computed based on the interquartile range (IQR) and sample size. |
nh |
An integer specifying the number of points to evaluate the cross-validation function. Default is 200. |
tol |
Tolerance value used to check whether the minimum found lies at the boundaries of the interval; that is, the function will return a warning if the window minimizing the cross-validation function lies within |
plot |
A logical value indicating whether to plot the cross-validation function. Default is |
Details
The optimal bandwidth is obtained as the one that minimizes the cross-validation function, that is,
\widehat{h}_{F, \mathrm{CV}} = \arg \min_{h_{F}>0} \frac{1}{n} \sum_{j=1}^n \left( \frac{\widehat{\mu}_w}{w(Y_j)} \mathbb{I} (y \geq Y_j) - \widehat{F}_{h_{F}, -j}(y)\right)^2 \!\!,
\quad \text{with} \quad \widehat{\mu}_w=n \left(\sum_{i=1}^{n} \frac{1}{w(Y_i)} \right)^{-1}
and \widehat{F}_{h_{F}, -j}
is the Bose and Dutta (2022) kernel distribution estimator without the observation Y_j
.
Value
The optimal bandwidth based on cross-validation criteria.
References
Bose A, Dutta S (2022). “Kernel based estimation of the distribution function for length biased data.” Metrika, 85, 269–287.
See Also
Examples
bw.F.SBC.cv(shrub.data$Width)
bw.F.SBC.cv(shrub.data$Width, kernel = "epanechnikov")
Plug-in bandwidth selector for Bose and Dutta (2022) kernel distribution estimator
Description
This function computes the bandwidth selector for Bose and Dutta (2022) kernel distribution estimator using the plug-in method.
Usage
bw.F.SBC.pi(
y,
w = function(y) {
ifelse(y >= 0, y, NA)
},
kernel = c("gaussian", "epanechnikov", "rectangular", "triangular", "biweight",
"cosine", "optcosine")
)
Arguments
y |
A numeric vector containing the biased sample. |
w |
A function representing the bias function to be used. It must be evaluable and positive in each point of the sample the sample |
kernel |
A character string specifying the kernel function. Available options: |
Details
The bandwidth is given by:
\widehat{h}_{F, \mathrm{PI}} = \left(\frac{\widehat{\mu}_w \widehat{\bar{\mu}}_w \kappa(K) }{n \eta(K)^2 R \left(\widehat{f}_{\mathrm{J},\widehat{h}_{F,0, \mathrm{opt}}}^{(1)}\right)}\right)^{1/3},
where \kappa(K)
and \eta(K)
depend only on the kernel and are defined as
\kappa(K) = \int_{-\infty}^{+\infty} 2 u W (u) K(u) du
\quad \text{and} \quad
\eta(K) = \int_{-\infty}^{+\infty} u^2 K(u) du,
where W
is the kernel distribution function associated with the kernel density function K
.
The estimators \widehat{\mu}_w
and \widehat{\bar{\mu}}_w
are given by
\widehat{\mu}_w = n \left(\sum_{i=1}^{n} \frac{1}{w(Y_i)} \right)^{-1}
\quad \text{and} \quad
\widehat{\bar{\mu}}_w = \frac{\widehat{\mu}_w}{n} \sum_{i=1}^{n} \frac{1}{w(Y_i)^2}.
\widehat{h}_{F,0, \mathrm{opt}}
is an estimator of
h_{F,0, \mathrm{opt}} = \left(\frac{3 \mu_w \bar{\mu}_w R\left(L^{(1)}\right)}{2 n \eta(L) R\left( f^{(2)} \right)^2}\right)^{1/5},
where R\left(f^{(2)}\right)
is estimated assuming that f
follows a gaussian distribution and \mu_w
and \bar{\mu}_w
are estimated by \widehat{\mu}_w
and \widehat{\bar{\mu}}_w
as defined above.
Value
The bandwidth value using the plug-in method.
References
Bose A, Dutta S (2022). “Kernel based estimation of the distribution function for length biased data.” Metrika, 85, 269–287.
See Also
Examples
bw.F.SBC.pi(shrub.data$Width, kernel = "epanechnikov")
Rule of thumb bandwidth selector for Bose and Dutta (2022) kernel distribution estimator
Description
This function computes the bandwidth selector for Bose and Dutta (2022) kernel distribution estimator using the rule of thumb.
Usage
bw.F.SBC.rt(
y,
w = function(y) {
ifelse(y >= 0, y, NA)
},
kernel = c("gaussian", "epanechnikov", "rectangular", "triangular", "biweight",
"cosine", "optcosine")
)
Arguments
y |
A numeric vector containing the biased sample. |
w |
A function representing the bias function applied to the data points. It must be evaluable and positive in each point of the sample |
kernel |
A character string specifying the kernel function. Available options: |
Details
The bandwidth is computed as follows:
\widehat{h}_{F, \mathrm{RT}} =
\left(\frac{4 \sqrt{\pi} \widehat{\mu}_{w} \widehat{\bar{\mu}}_{w}\kappa(K)}{n \eta(K)^{2} }\right)^{1/3} \widehat{\sigma},
where both \kappa(K)
and \eta(K)
depend only on the kernel and are defined as
\kappa(K) = \int_{-\infty}^{+\infty} 2 u W (u) K(u) du
\quad \text{and} \quad
\eta(K) = \int_{-\infty}^{+\infty} u^2 K(u) du,
where W
is the kernel distribution function associated with the kernel density function K
.
The estimators \widehat{\mu}_w
and \widehat{\bar{\mu}}_w
are given by
\widehat{\mu}_w = n \left(\sum_{i=1}^{n} \frac{1}{w(Y_i)} \right)^{-1}
\quad \text{and} \quad
\widehat{\bar{\mu}}_w = \frac{\widehat{\mu}_w}{n} \sum_{i=1}^{n} \frac{1}{w(Y_i)^2}.
\widehat{\sigma}
is an estimation of the standard deviation of the distribution given by
\widehat{\sigma}_w=\sqrt{\left(\frac{1}{n} \sum_{i=1}^n \frac{1}{w(Y_i)}\right)^{-1}
\left[\left(\frac{1}{n} \sum_{i=1}^n w(Y_i)\right)-\left(\frac{1}{n} \sum_{i=1}^n \frac{1}{w(Y_i)}\right)^{-1}\right]}.
Value
The optimal bandwidth value for Bose and Dutta (2022) kernel distribution estimator based on the rule of thumb.
References
Bose A, Dutta S (2022). “Kernel based estimation of the distribution function for length biased data.” Metrika, 85, 269–287.
See Also
Examples
bw.F.SBC.rt(shrub.data$Width)
bw.F.SBC.rt(shrub.data$Width, kernel = "epanechnikov")
Borrajo et al. (2017) bootstrap bandwidth selector for Jones (1991) kernel density estimator
Description
This function computes the bandwidth selector for Jones (1991) kernel density estimator using the bias-corrected bootstrap method developed by Borrajo et al. (2017).
Usage
bw.f.BGM.boot1(
y,
w = function(y) {
ifelse(y >= 0, y, NA)
},
kernel = c("gaussian", "epanechnikov", "rectangular", "triangular", "biweight",
"cosine", "optcosine"),
bw0 = c("RT", "PI")
)
Arguments
y |
A numeric vector containing the biased sample. |
w |
A function representing the bias function applied to the data points. It must be evaluable and positive in each point of the sample |
kernel |
A character string specifying the kernel function. Available options: |
bw0 |
A character string specifying the method to determine the pilot bandwidth. Options are |
Details
When bw0="RT"
, the bandwidth is given by
\widehat{h}_{f, \mathrm{B}_{\mathrm{RT}}}= \left( \frac{R(K^{\ast})\widehat{\mu}_w\widehat{\bar{\mu}}_w}{n \eta(K^{\ast})^{2} R\left(\widehat{f}_{J,\widehat{h}_{f, 0,\mathrm{RT}}}^{(2)}\right)}\right)^{1/5},
\quad
\text{where} \quad
\widehat{h}_{f, 0, \mathrm{RT}}= \frac{n^{1/5}}{n^{1/7}} \widehat{h}_{f, \mathrm{RT}}.
\widehat{h}_{f, \mathrm{RT}}
is the value returned by bw.f.BGM.rt
. An alternative is to consider the following pilot bandwidth:
\widehat{h}_{f, \mathrm{B}_{\mathrm{opt}}}= \left( \frac{R(K^{\ast})\widehat{\mu}_w\widehat{\bar{\mu}}_w}{n \eta(K^{\ast})^{2} R\left(\widehat{f}_{J,\widehat{h}_{f, 0,\mathrm{opt}}}^{(2)}\right)}\right)^{1/5},
\quad
\text{where} \quad
\widehat{h}_{f, 0, \mathrm{opt}}= \left(\frac{5 \widehat{\mu}_w\widehat{\bar{\mu}}_w R\left(L^{(2)}\right) }{2 n \eta(L) R\left(f^{(3)}\right) } \right)^{1 / 7}
and R(f^{(3)})
is estimated under the assumption that f
is gaussian, which is implemented by setting bw0="PI"
. The quantities R(K^{\ast})
and \eta(K^{\ast})^{2}
depend only on the kernel and are defined as
R(K^{\ast}) = \int_{-\infty}^{+\infty} K^{\ast}(u)^2 du
\quad \text{and} \quad
\eta(K^{\ast}) = \int_{-\infty}^{+\infty} u^2 K^{\ast}(u) du.
The estimators \widehat{\mu}_w
and \widehat{\bar{\mu}}_w
are given by
\widehat{\mu}_w = n \left(\sum_{i=1}^{n} \frac{1}{w(Y_i)} \right)^{-1}
\quad \text{and} \quad
\widehat{\bar{\mu}}_w = \frac{\widehat{\mu}_w}{n} \sum_{i=1}^{n} \frac{1}{w(Y_i)^2}.
Value
The bootstrap bandwidth value.
References
Borrajo MI, González-Manteiga W, Martínez-Miranda MD (2017).
“Bandwidth selection for kernel density estimation with length-biased data.”
Journal of Nonparametric Statistics, 29(3), 636–668.
Jones MC (1991).
“Kernel density estimation for length biased data.”
Biometrika, 78(3), 511–519.
See Also
Examples
# Bandwidth value using bootstrap method with "RT" as pilot bandwidth
bw.f.BGM.boot1(y = shrub.data$Width)
# Bandwidth value using bootstrap method with "PI" as pilot bandwidth
bw.f.BGM.boot1(y = shrub.data$Width, bw0 = "PI")
Borrajo et al. (2017) bootstrap bandwidth selector for Jones (1991) kernel density estimator
Description
This function computes the bandwidth selector for Jones (1991) kernel density estimator using the bias-corrected bootstrap method developed by Borrajo et al. (2017).
Usage
bw.f.BGM.boot2(
y,
w = function(y) {
ifelse(y >= 0, y, NA)
},
kernel = c("gaussian", "epanechnikov", "rectangular", "triangular", "biweight",
"cosine", "optcosine"),
bw0 = 1/8 * n^(-1/9),
lower = IQR(y) * n^(-0.2) * 2000^(-1),
upper = IQR(y) * (log(n)/n)^(0.2) * 500,
nh = 200L,
tol = 0.1 * lower,
from = min(y) - (sort(y)[5] - min(y)),
to = max(y) + (max(y) - sort(y, decreasing = TRUE)[5]),
plot = TRUE
)
Arguments
y |
A numeric vector containing the biased sample. |
w |
A function representing the bias function applied to the data points. It must be evaluable and positive in each point of the sample |
kernel |
A character string specifying the kernel function. Available options: |
bw0 |
The bandwidth value to be used in |
lower |
Numeric value specifying the lower bound for bandwidth selection. Default is computed based on the interquartile range (IQR) and sample size. |
upper |
Numeric value specifying the upper bound for bandwidth selection. Default is computed based on the interquartile range (IQR) and sample size. |
nh |
An integer specifying the number of points in the grid to evaluate the mean integrated squared error function. Default is 200. |
tol |
Tolerance value used to check whether the minimum found lies at the boundaries of the interval; that is, the function will return a warning if the window minimizing the cross-validation function lies within |
from |
Numeric value specifying the lower bound to be used in |
to |
Numeric value specifying the upper bound to be used in |
plot |
Logical value indicating whether to plot the mean integrated squared error function. Default is |
Details
The bandwidth returned is the one minimizing \mathrm{MISE}^{\ast}
over a compact interval [h_1,h_2]
(determined by arguments lower
and upper
), i.e.,
\widehat{h}_{f, \mathrm{B}} = \arg \min_{h_{f} \in [h_1,h_2]} \int_{-\infty}^{+ \infty} \mathrm{MSE}^{\ast}\left(\widehat{f}^{\ast}_{\mathrm{J}, h_{f}}(y)\right)dy.
\mathrm{MISE}^{\ast}
and \mathrm{MSE}^{\ast}
correspond with the expression of the mean integrated squared error and the mean squared error of the bootstrap estimator \widehat{f}^{\ast}_{\mathrm{J}, h_{f}}
provided by Borrajo et al. (2017).
Value
The bootstrap bandwidth value.
References
Borrajo MI, González-Manteiga W, Martínez-Miranda MD (2017).
“Bandwidth selection for kernel density estimation with length-biased data.”
Journal of Nonparametric Statistics, 29(3), 636–668.
Jones MC (1991).
“Kernel density estimation for length biased data.”
Biometrika, 78(3), 511–519.
See Also
Examples
bw.f.BGM.boot2(shrub.data$Width, nh = 50L)
Cross-validation bandwidth selector for Jones (1991) kernel density estimator
Description
This function estimates the bandwidth for Jones (1991) kernel density estimator using cross-validation criteria from Guillamón et al. (1998). It iterates through a range of bandwidth values and computes the cross-validation score for each bandwidth. The bandwidth that minimizes the cross-validation function is selected as the optimal bandwidth.
Usage
bw.f.BGM.cv(
y,
w = function(y) {
ifelse(y >= 0, y, NA)
},
kernel = c("gaussian", "epanechnikov", "rectangular", "triangular", "biweight",
"cosine", "optcosine"),
lower = IQR(y) * n^(-0.2) * 2000^(-1),
upper = IQR(y) * (log(n))^(0.2) * n^(-0.2) * 500,
nh = 200L,
tol = 0.1 * lower,
plot = TRUE
)
Arguments
y |
A numeric vector containing the biased sample. |
w |
A function representing the bias function applied to the data points. It must be evaluable and positive in each point of the sample the sample |
kernel |
A character string specifying the kernel function. Available options: |
lower |
Numeric value specifying the lower bound for bandwidth selection. Default is computed based on the interquartile range (IQR) and sample size. |
upper |
Numeric value specifying the upper bound for bandwidth selection. Default is computed based on the interquartile range (IQR) and sample size. |
nh |
An integer specifying the number of points to evaluate the cross-validation function. Default is 200. |
tol |
Tolerance value used to check whether the minimum found lies at the boundaries of the interval; that is, the function will return a warning if the window minimizing the cross-validation function lies within |
plot |
Logical value indicating whether to plot the cross-validation function. Default is |
Details
The optimal bandwidth is the one that minimizes the cross-validation function, i.e.,
\widehat{h}_{f, \mathrm{CV}}= \mathrm{arg} \min_{h_{f}>0} \mathrm{CV}(h_{f}) = \mathrm{arg} \min_{h_{f}>0} \int_{-\infty}^{+\infty} \widehat{f}_{\mathrm{J},h_{f}}(y)^2 d y-2 \widehat{\mathbb{E}}[\widehat{f}_{\mathrm{J},h_{f}}].
It holds that
\int_{-\infty}^{+\infty} \widehat{f}_{\mathrm{J}, h_{f}}(y)^2 dy = \frac{\widehat{\mu}_w^2}{n^{2} h_{f}} \sum_{i=1}^{n} \sum_{\substack{j=1}}^{n}
\frac{1}{w(Y_i)} \frac{1}{w(Y_j)}
(K \circ K) \left(\frac{Y_i-Y_j}{h_{f}} \right),
where \circ
denotes convolution between two functions and \widehat{\mathbb{E}}[\widehat{f}_{\mathrm{J}, h_{f}}]
is computed as
\widehat{\mathbb{E}}[\widehat{f}_{\mathrm{J}, h_{f}}] = \frac{\widehat{\mu}_w}{n} \sum_{i=1}^{n} \frac{1}{w(Y_i)} \left( \sum_{j\neq i} \frac{1}{w(Y_j)} \right)^{-1} \left(\sum_{j\neq i} \frac{1}{w(Y_j)} K_{h_{f}}(Y_i-Y_j)\right).
This function computes the bandwidth that minimizes the cross validation function, \mathrm{CV}
, on the interval I
determined by lower
and upper
. By default,
I
is the one suggested by Borrajo et al. (2017):
I = \left[\frac{\mathrm{IQR}}{2000n^{1/5}}, \frac{500 \mathrm{IQR} \log(n)^{1/5}}{n^{1/5}}\right],
where IQR is the interquartile range.
Value
The optimal bandwidth value based on cross-validation criteria.
References
Borrajo MI, González-Manteiga W, Martínez-Miranda MD (2017).
“Bandwidth selection for kernel density estimation with length-biased data.”
Journal of Nonparametric Statistics, 29(3), 636–668.
Guillamón A, Navarro J, Ruiz JM (1998).
“Kernel density estimation using weighted data.”
Communications in Statistics - Theory and Methods, 27(9), 2123-2135.
Jones MC (1991).
“Kernel density estimation for length biased data.”
Biometrika, 78(3), 511–519.
See Also
Examples
bw.f.BGM.cv(shrub.data$Width)
bw.f.BGM.cv(shrub.data$Width, kernel = "epanechnikov")
Rule of thumb bandwidth selector for Jones (1991) kernel density estimator
Description
This function computes the bandwidth selector for Jones (1991) density estimator using the rule of thumb.
Usage
bw.f.BGM.rt(
y,
w = function(y) {
ifelse(y >= 0, y, NA)
},
kernel = c("gaussian", "epanechnikov", "rectangular", "triangular", "biweight",
"cosine", "optcosine")
)
Arguments
y |
A numeric vector containing the biased sample. |
w |
A function representing the bias function applied to the data points. It must be evaluable and positive in each point of the sample |
kernel |
A character string specifying the kernel function. Available options: |
Details
The bandwidth is given by
\widehat{h}_{f,\mathrm{RT}}
= \left(\frac{8 \sqrt{\pi} \widehat{\mu}_{w} \widehat{\bar{\mu}}_{w} R(K)}{3n \eta(K)^{2}}\right)^{1 / 5}\widehat{\sigma}_{w},
where R(K)
and \eta(K)
depend only on the kernel and are defined as
R(K) = \int_{-\infty}^{+\infty} K(u)^2 du
\quad \text{and} \quad
\eta(K) = \int_{-\infty}^{+\infty} u^2 K(u) du.
The estimators \widehat{\mu}_w
and \widehat{\bar{\mu}}_w
are given by
\widehat{\mu}_w = n \left(\sum_{i=1}^{n} \frac{1}{w(Y_i)} \right)^{-1}
\quad \text{and} \quad
\widehat{\bar{\mu}}_w = \frac{\widehat{\mu}_w}{n} \sum_{i=1}^{n} \frac{1}{w(Y_i)^2}.
\widehat{\sigma}_{w}
is an estimation of the standard deviation of the distribution given by
\widehat{\sigma}_w=\sqrt{\left(\frac{1}{n} \sum_{i=1}^n \frac{1}{w(Y_i)}\right)^{-1}
\left[\left(\frac{1}{n} \sum_{i=1}^n w(Y_i)\right)-\left(\frac{1}{n} \sum_{i=1}^n \frac{1}{w(Y_i)}\right)^{-1}\right]}.
Value
The rule of thumb bandwidth value.
References
Jones MC (1991). “Kernel density estimation for length biased data.” Biometrika, 78(3), 511–519.
See Also
Examples
bw.f.BGM.rt(shrub.data$Width)
bw.f.BGM.rt(shrub.data$Width, kernel = "epanechnikov")
Bose and Dutta (2022) kernel distribution estimator
Description
This function computes Bose and Dutta (2022) kernel distribution estimator given a sample and the corresponding biased function.
Usage
cdf.bd(
y,
w = function(y) {
ifelse(y >= 0, y, NA)
},
y.seq,
bw = "bw.F.SBC.rt",
kernel = c("gaussian", "epanechnikov", "rectangular", "triangular", "biweight",
"cosine", "optcosine"),
from,
to,
nb = 512L,
plot = TRUE,
correction = c("none", "left", "right", "both"),
...
)
Arguments
y |
A numeric vector containing the biased sample. |
w |
A function representing the bias function applied to the data points. It must be evaluable and positive in each point of the sample |
y.seq |
A numeric vector specifying the points where the distribution is estimated. Alternatively, |
bw |
The bandwidth to be used in the distribution estimation. |
kernel |
A character string specifying the kernel function. Available options: |
from |
Numeric value specifying the lower bound of the grid where the estimator is computed when |
to |
Numeric value specifying the upper bound of the grid where the estimator is computed when |
nb |
An integer specifying the number of points at which the estimator is computed when |
plot |
A logical value indicating whether to plot the estimation. Default is |
correction |
A character string specifying the boundary correction to be applied. Options are "none", "left", "right" and "both". Default is "none". |
... |
Additional arguments to be passed to bandwidth selection functions. |
Details
Bose and Dutta (2022) kernel distribution estimator is expressed as
\widehat{F}_{h_{F}}(y) = \frac{\widehat{\mu}_w}{n} \sum_{i=1}^n \frac{1}{w(Y_i)} W_{h_{F}}\left(y-Y_i\right),
\quad
\text{where}\quad \widehat{\mu}_w=n \left(\sum_{i=1}^{n}\frac{1}{w(Y_i)}\right)^{-1},
h_{F}
is the bandwidth, W
is the kernel distribution function and W_{h_{F}}(u) = W(u/{h_{F}})
.
Bose and Dutta (2022) propose a truncation correction for variables with compact support [a,b]
for the estimator as follows:
\widehat{F}_{[a,b], h_{F}}(y)=\left\{
\begin{array}{ll}
0, & y<a \\
\dfrac{\widehat{F}_{h_{F}}(y)-\widehat{F}_{h_{F}}(a)}{\widehat{F}_{h_{F}}(b)-\widehat{F}_{h_{F}}(a)}, & a \leq y<b \\
1, & y \geq b.
\end{array}
\right.
The truncation correction is also valid for variables supported on [a, +\infty)
or (-\infty, b]
, replacing \widehat{F}_{h_{F}}(b)
by 1 or \widehat{F}_{h_{F}}(a)
by 0, respectively, in the above expression. This correction is implemented in the correction
argument, which can take values "none", "left", "right" or "both". If "left", the estimator is corrected to 0 for values less than the minimum of y.seq
; if "right", it is corrected to 1 for values greater than the maximum of y.seq
; if "both", it applies both corrections simultaneously.
Value
A list with the following components:
y.seq |
The points where the distribution is estimated. |
F.hat |
The estimated distribution values. |
bw |
The bandwidth value. |
n |
The sample size after removal of |
call |
The call which produced the result. |
has.na |
Logical; indicates whether the original vector |
References
Bose A, Dutta S (2022). “Kernel based estimation of the distribution function for length biased data.” Metrika, 85, 269–287.
See Also
bw.F.SBC.rt
, bw.F.BD
, bw.F.SBC.cv
, bw.F.SBC.pi
Examples
cdf.bd(shrub.data$Width, kernel = "epanechnikov")
cdf.bd(shrub.data$Width, bw = "bw.F.SBC.cv")
Cox (2005) distribution estimator
Description
This function computes Cox (2005) distribution estimator given a sample and the corresponding biased function.
Usage
cdf.cox(
y,
w = function(y) {
ifelse(y >= 0, y, NA)
}
)
Arguments
y |
A numeric vector containing the biased sample. |
w |
A function representing the bias function applied to the data points. It must be evaluable and positive in each point of the sample |
Details
Cox (2005) distribution estimator is expressed as
\widehat{F}_n(y) = \frac{\widehat{\mu}_w}{n}\sum_{i=1}^{n} \frac{1}{w(Y_i)} \mathbb{I}(Y_i \leq y),
\quad \text{where} \quad
\widehat{\mu}_w = n \left(\sum_{i=1}^{n} \frac{1}{w(Y_i)} \right)^{-1}.
Value
A function of class ecdf
, inheriting from the stepfun
class, and hence inheriting a knots
method.
References
Cox D (2005). “Some sampling problems in technology.” In Hand D, Herzberg A (eds.), Selected Statistical Papers of Sir David Cox, volume 1, 81–92. Cambridge University Press.
Examples
cdf.cox(y = shrub.data$Width)
Bhattacharyya et al. (1988) density estimator
Description
This function computes Bhattacharyya et al. (1988) density estimator given a sample and the corresponding biased function.
Usage
df.bhatta(
y,
w = function(y) {
ifelse(y >= 0, y, NA)
},
plot = TRUE,
...
)
Arguments
y |
A numeric vector containing the biased sample. |
w |
A function representing the bias function to be used. It must be evaluable and positive in each point of the sample |
plot |
Logical indicating whether to plot the estimated density. Default is |
... |
Additional arguments to be passed to
|
Details
Bhattacharyya et al. (1988) density estimator is computed as follows:
\widehat{f}_{\mathrm{B}, h_{g}}(y)= \widehat{\mu}_w w(y)^{-1} \widehat{g}_{h_{g}}(y),
\quad \text{where} \quad \widehat{\mu}_w=n \left(\sum_{i=1}^{n} \frac{1}{w(Y_i)}\right)^{-1},
and \widehat{g}_{h_{g}}(y)
is the kernel density estimate of the given data y
using density
function with main arguments bw
and kernel
.
Value
A list with the following components:
y.seq |
The points where the density is estimated. |
f.hat |
The estimated density values. |
bw |
The bandwidth value. |
n |
The sample size after removal of |
call |
The call which produced the result. |
has.na |
Logical; indicates whether the original vector |
References
Bhattacharyya BB, Franklin LA, Richardson GD (1988). “A comparison of nonparametric unweighted and length-biased density estimation of fibres.” Communications in Statistics - Theory and Methods, 17(11), 3629–3644.
Examples
# Rule of thumb
df.bhatta(shrub.data$Width, bw = "nrd0")
# Cross Validation
df.bhatta(shrub.data$Width, bw = "ucv")
# Sheather & Jones
bhata_sj <- df.bhatta(shrub.data$Width, bw = "SJ-ste")
# Rectangular kernel
df.bhatta(shrub.data$Width, bw = "nrd0", kernel = "epanechnikov")
Jones (1991) kernel density estimator
Description
This function computes Jones (1991) kernel density estimator given a sample and the corresponding biased function.
Usage
df.jones(
y,
w = function(y) {
ifelse(y >= 0, y, NA)
},
y.seq,
bw = "bw.f.BGM.rt",
kernel = c("gaussian", "epanechnikov", "rectangular", "triangular", "biweight",
"cosine", "optcosine"),
from,
to,
nb = 512L,
plot = TRUE,
...
)
Arguments
y |
A numeric vector containing the biased sample. |
w |
A function representing the bias function applied to the data points.
It must be evaluable and positive in each point of the sample |
y.seq |
A numeric vector specifying the points where the density is estimated. Alternatively, |
bw |
The smoothing bandwidth to be used in the density estimation. |
kernel |
A character string specifying the kernel function. Available options: |
from |
Numeric value specifying the lower bound of the grid where the estimator is computed when |
to |
Numeric value specifying the upper bound of the grid where the estimator is computed when |
nb |
An integer specifying the number of points at which the estimator is computed when |
plot |
A logical value indicating whether to plot the density estimation. Default is |
... |
Additional arguments to be passed to bandwidth selection functions. |
Details
Jones (1991) kernel density estimator is expressed as
\widehat{f}_{\mathrm{J}, h_{f}}(y)=\frac{\widehat{\mu}_w}{n}\sum_{i=1}^{n} \frac{1}{w(Y_i)} K_{h_{f}}(y-Y_i),\quad
\text{where}\quad \widehat{\mu}_w=n \left(\sum_{i=1}^{n} \frac{1}{w(Y_i)}\right)^{-1},
h_{f}
is the bandwidth, K
is the kernel density function and K_{h_{f}}(u)=1/h_{f} K\left(u / h_{f}\right)
.
Value
A list with the following components:
y.seq |
The points where the density is estimated. |
f.hat |
The estimated density values. |
bw |
The bandwidth value. |
n |
The sample size after removal of |
call |
The call which produced the result. |
has.na |
Logical; indicates whether the original vector |
References
Jones MC (1991). “Kernel density estimation for length biased data.” Biometrika, 78(3), 511–519.
See Also
bw.f.BGM.rt
, bw.f.BGM.cv
, bw.f.BGM.boot1
, bw.f.BGM.boot2
Examples
# Rule of thumb
df.jones(y = shrub.data$Width, kernel = "epanechnikov", bw = "bw.f.BGM.rt")
# Cross Validation
df.jones(y = shrub.data$Width, kernel = "epanechnikov", bw = "bw.f.BGM.cv")
# Bootstrap
df.jones(y = shrub.data$Width, kernel = "epanechnikov", bw = "bw.f.BGM.boot1", bw0 = "RT")
df.jones(y = shrub.data$Width, kernel = "epanechnikov", bw = "bw.f.BGM.boot1", bw0 = "PI")
df.jones(y = shrub.data$Width, kernel = "epanechnikov", bw = "bw.f.BGM.boot2", nh = 50L)
Generate a biased sample using Neumann (1951) acceptance-rejection method
Description
This function generates a biased sample of size n
using Neumann (1951) acceptance-rejection method. The generated sample is biased according to the provided bias function w
, with respect to the unbiased density function fx
.
Usage
rbiased(
n,
w = function(y) {
ifelse(y >= 0, y, NA)
},
fx,
lim = 0.01,
plot = TRUE,
stop = TRUE,
shape1,
shape2,
location,
scale,
df,
ncp,
rate,
df1,
df2,
shape,
meanlog,
sdlog,
min,
max,
mgshape,
mgscale,
mgweight,
pro,
mean,
sd
)
Arguments
n |
Sample size. |
w |
A function representing the bias function applied to the data points. It must be evaluable and positive in each point of the sample |
fx |
Unbiased density function. Values allowed are
Beta ( |
lim |
Lower and upper limits for the range where the bias is significant and, hence, where |
plot |
Logical value indicating whether to generate a plot of the biased sample. Default is |
stop |
Logical value indicating whether to stop when bias function can not be evaluated in a generated value. Default is |
shape1 , shape2 |
Additional arguments to be passed to the unbiased density function |
df , ncp |
Additional arguments to be passed to the unbiased density function |
df1 , df2 |
Additional arguments to be passed to the unbiased density function |
shape , rate , scale , location |
Additional arguments to be passed to the unbiased density function |
min , max |
Additional arguments to be passed to the unbiased density function |
mgshape , mgscale , mgweight |
Additional arguments to be passed to the unbiased density function |
mean , sd , pro , meanlog , sdlog |
Additional arguments to be passed to the unbiased density function |
Details
This function implements Neumann (1951) acceptance-rejection method to generate a biased sample given an unbiased density function fx
and a bias function w
.
Value
A numeric vector containing a biased sample from density fx
and bias function w
.
References
Neumann V (1951). “Various techniques used in connection with random digits.” Notes by G. E. Forsythe, Journal of Research of the National Bureau of Standards, Applied Math Series, 12(3), 36–38.
Examples
# Generate a length-biased sample of size 100 from an exponential distribution
rbiased(n = 100, fx = "exp", rate = 2, plot = FALSE)
# Generate a length-biased sample from a gamma distribution
rbiased(n = 100, fx = "gamma", rate = 1.5^2, shape = 1.5)
# Generate a biased sample from a gaussian distribution
custom_bias <- function(y) {
y^2
}
rbiased(n = 100, w = custom_bias, fx = "norm", mean = 3, sd = 10, plot = TRUE)
# Generate a biased sample from a mixture of gaussian distributions
custom_bias <- function(y) {
sqrt(abs(y)) + 5
}
rbiased(
n = 100, w = custom_bias, fx = "mixnorm", pro = rep(1 / 3, 3), mean = c(0.25, 0.5, 0.75),
sd = rep(0.075, 3)
)
Shrub data
Description
Dataset containing the size of the Cercocarpus montanus species in an ancient quarry.
Usage
data(shrub.data)
Format
A data.frame
with the following variables:
Replica | Replica identifier (I or II). |
Transect | Transect identifier (1,2 or 3). |
Number | Shrub/clump identifier. |
Intercept | Length of the intersection of the clump of overlapping shrubs with the transect. |
Width | Width between two lines tangent to the shrub and parallel to the transect. |
Height | Maximum height of the shrub encountered by the transect. |
Stems | Number of stems on the shrub encountered by the transect. |
Details
During the fall semester of 1986, students in a graduate course on biological sampling techniques, taught by Lyman L. McDonald at the University of Wyoming, conducted a field study using the linear transect method to measure the size of Cercocarpus montanus shrubs in an old limestone quarry located just east of Laramie, Wyoming. In this area, rock fissures run predominantly north to south, and vegetation is denser within them. To align with this structure, a baseline was established approximately parallel to the fissures, and six transects were placed perpendicular to this baseline, grouped into two independent replicates (I and II), each comprising three equally spaced transects.
Students walked along the transects and recorded all Cercocarpus montanus individuals intersected. For each shrub, they measured maximum height, width (defined as the greatest distance between two tangents to the shrub’s contour, parallel to the transect), and the number of stems. Given the rhizomatous nature of the species and possible interconnections between neighboring shrubs, an individual was defined as a group of stems at the base separated by at least fifteen centimeters from the nearest neighbor. Additionally, the length of intersection with the transect line was recorded for each shrub cluster.
Due to the nature of the sampling method, wider shrubs had a higher probability of being intersected by a transect. As a result, the sample of shrub widths exhibits length bias. Measurements of height and number of stems are also affected by sampling bias, although the associated bias function is more complex, as it depends on the relationship between width and these other morphological features.
For further details on the sampling protocol and data structure, see Muttlak (1988).
Source
References
Muttlak HA (1988). Some aspects of ranked set sampling with size biased probability of selection. Ph.D. thesis, University of Wyoming.
Examples
data(shrub.data)
names(shrub.data)
str(shrub.data)
class(shrub.data)