Data sometimes use special values to indicate specific reasons for missingness. For example, “9999” is sometimes used in weather data, such as the Global Historical Climatology Network (GHCN) data, to flag a specific type of missingness, for instance instrument failure.
You might be interested in creating your own special missing values so that you can mark specific, known reasons for missingness. Examples include an individual dropping out of a study, a known instrument failure in weather instruments, or values being censored in analysis. In these cases, the data is missing, but we have information about why it is missing. Coding these cases as NA would cause us to lose this valuable information. Other statistical programming languages like Stata, SAS, and SPSS have this capacity, but currently R does not. So, we need a way to create these special missing values.
We can use recode_shadow to record a special missing value under a label such as NA_broken_machine. naniar records these values in the shadow part of nabular data, which is a special dataframe that contains missingness information.
This vignette describes how to add special missing values using the recode_shadow() function. First, we introduce some terminology to explain these ideas, in case you are not familiar with the workflows in naniar.
Missing data can be represented as a binary matrix of “missing” or “not missing”, which in naniar we call a “shadow matrix”, a term borrowed from Swayne and Buja (1998).
library(naniar)
as_shadow(oceanbuoys)
#> # A tibble: 736 x 8
#>    year_NA latitude_NA longitude_NA sea_temp_c_NA air_temp_c_NA humidity_NA
#>    <fct>   <fct>       <fct>        <fct>         <fct>         <fct>
#>  1 !NA     !NA         !NA          !NA           !NA           !NA
#>  2 !NA     !NA         !NA          !NA           !NA           !NA
#>  3 !NA     !NA         !NA          !NA           !NA           !NA
#>  4 !NA     !NA         !NA          !NA           !NA           !NA
#>  5 !NA     !NA         !NA          !NA           !NA           !NA
#>  6 !NA     !NA         !NA          !NA           !NA           !NA
#>  7 !NA     !NA         !NA          !NA           !NA           !NA
#>  8 !NA     !NA         !NA          !NA           !NA           !NA
#>  9 !NA     !NA         !NA          !NA           !NA           !NA
#> 10 !NA     !NA         !NA          !NA           !NA           !NA
#> # … with 726 more rows, and 2 more variables: wind_ew_NA <fct>,
#> #   wind_ns_NA <fct>
The shadow matrix has three key features to facilitate analysis:
Coordinated names: Variables in the shadow matrix gain the same name as in the data, with the suffix "_NA".
Special missing values: Values in the shadow matrix can be “special” missing values, indicated as NA_suffix, where “suffix” is a very short message describing the type of missingness.
Cohesiveness: Binding the shadow matrix column-wise to the original data creates a cohesive “nabular” data form, useful for visualization and summaries.
Create nabular data by binding the shadow to the data:
bind_shadow(oceanbuoys)
#> # A tibble: 736 x 16
#>     year latitude longitude sea_temp_c air_temp_c humidity wind_ew wind_ns
#>    <dbl>    <dbl>     <dbl>      <dbl>      <dbl>    <dbl>   <dbl>   <dbl>
#>  1  1997        0      -110       27.6       27.1     79.6   -6.40    5.40
#>  2  1997        0      -110       27.5       27.0     75.8   -5.30    5.30
#>  3  1997        0      -110       27.6       27       76.5   -5.10    4.5
#>  4  1997        0      -110       27.6       26.9     76.2   -4.90    2.5
#>  5  1997        0      -110       27.6       26.8     76.4   -3.5     4.10
#>  6  1997        0      -110       27.8       26.9     76.7   -4.40    1.60
#>  7  1997        0      -110       28.0       27.0     76.5   -2       3.5
#>  8  1997        0      -110       28.0       27.1     78.3   -3.70    4.5
#>  9  1997        0      -110       28.0       27.2     78.6   -4.20    5
#> 10  1997        0      -110       28.0       27.2     76.9   -3.60    3.5
#> # … with 726 more rows, and 8 more variables: year_NA <fct>, latitude_NA <fct>,
#> #   longitude_NA <fct>, sea_temp_c_NA <fct>, air_temp_c_NA <fct>,
#> #   humidity_NA <fct>, wind_ew_NA <fct>, wind_ns_NA <fct>
This keeps the data values tied to their missingness, and has great benefits for exploring missing and imputed values in data. See the vignettes Getting Started with naniar and Exploring Imputations with naniar for more details.
To demonstrate recoding of missing values, we use a toy dataset.
To recode the value -99 as a missing value “broken_machine”, we first create nabular data with bind_shadow:
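As a minimal sketch of this step, assume a small hypothetical dataset df with a wind column that uses -99 to mark a broken machine and a temp column with one ordinary NA. The values are illustrative assumptions, not real measurements. Binding the shadow should add one _NA column per variable:

```r
library(naniar)

# Hypothetical toy data (an assumption for illustration):
# -99 in wind stands for a reading lost to a broken machine.
df <- data.frame(
  wind = c(-99, 68, 72),
  temp = c(45, NA, 25)
)

# Create nabular data: the original columns plus one shadow column each
dfs <- bind_shadow(df)

# Shadow columns take the variable name plus the "_NA" suffix
names(dfs)

# Each shadow column is a factor with levels "!NA" and "NA"
levels(dfs$wind_NA)
```

Note that at this stage the -99 in wind is still treated as an observed value, so its shadow entry is "!NA".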
Special types of missingness are encoded in the shadow part of nabular data. Using the recode_shadow function, we can recode the missing values like so:
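A sketch of such a call, assuming a hypothetical toy dataset in which -99 in wind marks a broken machine (the data values and object names are illustrative assumptions):

```r
library(naniar)

# Hypothetical toy data; -99 is the special value to recode
df <- data.frame(
  wind = c(-99, 68, 72),
  temp = c(45, NA, 25)
)

# Nabular data: data columns plus their shadow columns
dfs <- bind_shadow(df)

# Recode the shadow of wind wherever wind equals -99
dfs_recode <- recode_shadow(
  dfs,
  wind = .where(wind == -99 ~ "broken_machine")
)

# The first shadow entry for wind now carries the special label
dfs_recode$wind_NA
```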
This reads as “recode shadow for wind where wind is equal to -99, and give it the label ‘broken_machine’”. The .where function is used to help make our intent clearer, and reads very much like the dplyr::case_when() function, but takes care of encoding extra factor levels into the missing data.
The extra types of missingness are recoded in the shadow part of the nabular data as additional factor levels:
All additional types of missingness are recorded across all shadow variables, even if those variables don’t contain that special missing value. This ensures all flavours of missingness are known.
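One way to check this, sketched under the same assumptions as before (a hypothetical toy dataset where -99 in wind marks a broken machine), is to compare the factor levels of the shadow columns after recoding:

```r
library(naniar)

# Hypothetical toy data; only wind contains the special value -99
df <- data.frame(
  wind = c(-99, 68, 72),
  temp = c(45, NA, 25)
)

dfs_recode <- recode_shadow(
  bind_shadow(df),
  wind = .where(wind == -99 ~ "broken_machine")
)

# The new level is expected on the recoded variable's shadow ...
levels(dfs_recode$wind_NA)

# ... and also on the shadow of temp, which has no broken_machine values
levels(dfs_recode$temp_NA)
```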
To summarise, to use recode_shadow, the user provides the following information:
The variable to affect: recode_shadow(var = ...)
The condition to apply: .where(condition ~ ...)
The suffix for the new type of missing value: .where(condition ~ suffix)
Under the hood, this special missing value is recoded as a new factor level in the shadow matrix, so that every column is aware of all possible new values of missingness.
Some examples of using recode_shadow in a workflow will be discussed in more detail in the near future. For the moment, here is a recommended workflow:
Use recode_shadow() with the actual data
Replace the special values with NA using replace_with_na() (see the vignette on replacing values with NA)
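The workflow above might look like the following sketch. The dataset, the -99 code, and the object names are hypothetical, and this is only one plausible ordering of the steps rather than a definitive recipe:

```r
library(naniar)

# Hypothetical toy data; -99 in wind marks a broken machine
df <- data.frame(
  wind = c(-99, 68, 72),
  temp = c(45, NA, 25)
)

# Step 1: record the reason for missingness in the shadow
# while the special value is still present in the data
dfs_recode <- recode_shadow(
  bind_shadow(df),
  wind = .where(wind == -99 ~ "broken_machine")
)

# Step 2: replace the special value in the data itself with NA
dfs_clean <- replace_with_na(dfs_recode, replace = list(wind = -99))

# The data value is now NA, and the shadow still records why
dfs_clean$wind
dfs_clean$wind_NA
```

Recoding before replacing matters in this sketch: once -99 has been replaced with NA, the condition wind == -99 can no longer identify which values came from the broken machine.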