---
title: "Special Missing Values"
author: "Nicholas Tierney"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Special Missing Values}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

Data sometimes have special missing values to indicate specific reasons for missingness. For example, "9999" is sometimes used in weather data, such as the [Global Historical Climate Network (GHCN) data](https://www.ncei.noaa.gov/products/land-based-station/global-historical-climatology-network-daily), to indicate specific types of missingness, such as instrument failure.

You might be interested in creating your own special missing values so that you can mark specific, known reasons for missingness. Examples include an individual dropping out of a study, known instrument failure in weather instruments, or values being censored in analysis. In these cases, the data is missing, but we have information about _why_ it is missing. Coding these cases as `NA` would cause us to lose this valuable information. Other statistical programming languages such as Stata, SAS, and SPSS have this capacity, but currently `R` does not. So, we need a way to create these special missing values.

We can use `recode_shadow()` to record a special missing value as something like `NA_reason`. `naniar` records these values in the `shadow` part of `nabular` data, which is a special dataframe that contains missingness information.

This vignette describes how to add special missing values using the `recode_shadow()` function. In case you are not familiar with the workflows in `naniar`, we first introduce some terminology to explain these ideas.

# Terminology

Missing data can be represented as a binary matrix of "missing" or "not missing", which in `naniar` we call a "shadow matrix", a term borrowed from [Swayne and Buja, 1998](https://www.researchgate.net/publication/2758672_Missing_Data_in_Interactive_High-Dimensional_Data_Visualization).

```{r show-shadow}
library(naniar)
as_shadow(oceanbuoys)
```

The `shadow matrix` has three key features to facilitate analysis:

1. Coordinated names: Variables in the shadow matrix gain the same name as in the data, with the suffix "_NA".

2. Special missing values: Values in the shadow matrix can be "special" missing values, indicated as `NA_suffix`, where "suffix" is a very short message of the type of missings.

3. Cohesiveness: Binding the shadow matrix column-wise to the original data creates a cohesive "nabular" data form, useful for visualization and summaries.

We create `nabular` data by `bind`ing the shadow to the data:

```{r show-bind-shadow}
bind_shadow(oceanbuoys)
```

This keeps the data values tied to their missingness, and has great benefits for exploring missing and imputed values in data. See the vignettes [Getting Started with naniar](https://naniar.njtierney.com/articles/naniar.html) and [Exploring Imputations with naniar](http://naniar.njtierney.com/articles/exploring-imputed-values.html) for more details.

# Recoding missing values

To demonstrate recoding of missing values, we use a toy dataset, `df`:

```{r create-toy-dataset}
df <- tibble::tribble(
  ~wind, ~temp,
  -99,    45,
  68,     NA,
  72,     25
)

df
```

To recode the value -99 as a missing value "broken_machine", we first create nabular data with `bind_shadow`:

```{r create-nab}
dfs <- bind_shadow(df)

dfs
```

Special types of missingness are encoded in the shadow part of nabular data. Using the `recode_shadow` function, we can recode the missing values like so:

```{r example-recode-shadow}
dfs_recode <- dfs %>%
  recode_shadow(wind = .where(wind == -99 ~ "broken_machine"))
```

This reads as "recode shadow for wind where wind is equal to -99, and give it the label 'broken_machine'". The `.where` function is used to help make our intent clearer, and reads very much like the `dplyr::case_when()` function, but takes care of encoding extra factor levels into the missing data.

The extra types of missingness are recoded in the shadow part of the nabular data as additional factor levels:

```{r show-additional-factor-levels}
levels(dfs_recode$wind_NA)
levels(dfs_recode$temp_NA)
```

All additional types of missingness are recorded across all shadow variables, even if those variables don't contain that special missing value. This ensures all flavours of missingness are known.

To summarise, to use `recode_shadow`, the user provides the following information:

* A variable that they want to affect (`recode_shadow(var = ...)`)
* A condition that they want to implement (`.where(condition ~ ...)`)
* A suffix for the new type of missing value (`.where(condition ~ suffix)`)

Under the hood, this special missing value is recoded as a new factor level in the shadow matrix, so that every column is aware of all possible new values of missingness.

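To make the "new factor level on every shadow column" idea concrete, here is a minimal base R sketch. It is an illustration only and not how `naniar` implements `recode_shadow()`; the names `toy`, `toy_shadow`, and `new_level` are made up for this example, and the block is shown for reading rather than evaluated in the vignette.

```r
# A hand-rolled sketch in base R; this is NOT naniar's implementation.
toy <- data.frame(wind = c(-99, 68, 72), temp = c(45, NA, 25))

# Build a shadow by hand: one factor per variable, marking "!NA" or "NA".
toy_shadow <- as.data.frame(lapply(toy, function(x) {
  factor(ifelse(is.na(x), "NA", "!NA"), levels = c("!NA", "NA"))
}))
names(toy_shadow) <- paste0(names(toy), "_NA")

# Register the new level on every shadow column, so that all columns
# share the same set of possible missingness labels.
new_level <- "NA_broken_machine"
toy_shadow[] <- lapply(toy_shadow, function(x) {
  levels(x) <- c(levels(x), new_level)
  x
})

# Label only the affected cell(s) in the wind shadow.
toy_shadow$wind_NA[which(toy$wind == -99)] <- new_level

levels(toy_shadow$temp_NA)
toy_shadow
```

Adding the level to every column, rather than only to `wind_NA`, is what lets summaries across shadow variables use one common set of categories.
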
Examples of using `recode_shadow` in a workflow will be discussed in more detail in the near future. For the moment, here is a recommended workflow:

* Use `recode_shadow()` with the actual data
* Replace the previous actual values using `replace_with_na()` (see the vignette on [replacing values with NA](https://naniar.njtierney.com/articles/replace-with-na.html))
* Explore missings where special cases are considered
* Explore imputed values, looking at these special cases
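
The first steps of this workflow might look like the sketch below. It reuses the toy data from this vignette under the hypothetical names `toy` and `toy_nab`, and it assumes that `replace_with_na()` changes only the data column and leaves the shadow columns untouched, so treat it as a starting point rather than a tested recipe. The block is shown for reading and is not evaluated.

```r
library(naniar)

toy <- tibble::tribble(
  ~wind, ~temp,
  -99,    45,
  68,     NA,
  72,     25
)

# 1. Record the reason for missingness while the sentinel value (-99)
#    is still present in the data.
toy_nab <- recode_shadow(
  bind_shadow(toy),
  wind = .where(wind == -99 ~ "broken_machine")
)

# 2. Replace the sentinel value in the data column with NA.
toy_nab <- replace_with_na(toy_nab, replace = list(wind = -99))

# 3. Explore missings, keeping the special case visible.
table(toy_nab$wind_NA)
toy_nab
```

Recoding before replacing matters here: once -99 has become `NA`, the condition `wind == -99` can no longer identify which values were caused by the broken machine.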