Package 'naniar' reference manual

Title:	Data Structures, Summaries, and Visualisations for Missing Data
Description:	Missing values are ubiquitous in data and need to be explored and handled in the initial stages of analysis. 'naniar' provides data structures and functions that facilitate the plotting of missing values and examination of imputations. This allows missing data dependencies to be explored with minimal deviation from the common work patterns of 'ggplot2' and tidy data. The work is fully discussed at Tierney & Cook (2023) <doi:10.18637/jss.v105.i07>.
Authors:	Nicholas Tierney [aut, cre] , Di Cook [aut] , Miles McBain [aut] , Colin Fay [aut] , Mitchell O'Hara-Wild [ctb], Jim Hester [ctb], Luke Smith [ctb], Andrew Heiss [ctb]
Maintainer:	Nicholas Tierney <[email protected]>
License:	MIT + file LICENSE
Version:	1.1.0.9000
Built:	2025-03-13 05:19:50 UTC
Source:	https://github.com/njtierney/naniar

Add a column describing presence of any missing values

Description

This adds a column named "any_miss" (by default) that describes whether there are any missings in all of the variables (default), or whether any of the specified columns, specified using variables names or dplyr verbs, starts_with, contains, ends_with, etc. By default the added column will be called "any_miss_all", if no variables are specified, otherwise, if variables are specified, the label will be "any_miss_vars" to indicate that not all variables have been used to create the labels.

Usage

add_any_miss(
  data,
  ...,
  label = "any_miss",
  missing = "missing",
  complete = "complete"
)
add_any_miss(
  data,
  ...,
  label = "any_miss",
  missing = "missing",
  complete = "complete"
)

Arguments

`data`	data.frame
`...`	Variable names to use instead of the whole dataset. By default this looks at the whole dataset. Otherwise, this is one or more unquoted expressions separated by commas. These also respect the dplyr verbs `starts_with`, `contains`, `ends_with`, etc. By default will add "_all" to the label if left blank, otherwise will add "_vars" to distinguish that it has not been used on all of the variables.
`label`	label for the column, defaults to "any_miss". By default if no additional variables are listed the label col is "any_miss_all", otherwise it is "any_miss_vars", if variables are specified.
`missing`	character a label for when values are missing - defaults to "missing"
`complete`	character character a label for when values are complete - defaults to "complete"

Details

By default the prefix "any_miss" is used, but this can be changed in the label argument.

Value

data.frame with data and the column labelling whether that row (for those variables) has any missing values - indicated by "missing" and "complete".

Examples


airquality %>% add_any_miss()
airquality %>% add_any_miss(Ozone, Solar.R)

airquality %>% add_any_miss()
airquality %>% add_any_miss(Ozone, Solar.R)

Add a column describing if there are any missings in the dataset

Description

Add a column describing if there are any missings in the dataset

Usage

add_label_missings(data, ..., missing = "Missing", complete = "Not Missing")
add_label_missings(data, ..., missing = "Missing", complete = "Not Missing")

Arguments

`data`	data.frame
`...`	extra variable to label
`missing`	character a label for when values are missing - defaults to "Missing"
`complete`	character character a label for when values are complete - defaults to "Not Missing"

Value

data.frame with a column "any_missing" that is either "Not Missing" or "Missing" for the purposes of plotting / exploration / nice print methods

Examples


airquality %>% add_label_missings()
airquality %>% add_label_missings(Ozone, Solar.R)
airquality %>% add_label_missings(Ozone, Solar.R, missing = "yes", complete = "no")

airquality %>% add_label_missings()
airquality %>% add_label_missings(Ozone, Solar.R)
airquality %>% add_label_missings(Ozone, Solar.R, missing = "yes", complete = "no")

Add a column describing whether there is a shadow

Description

Instead of focussing on labelling whether there are missings, we instead focus on whether there have been any shadows created. This can be useful when data has been imputed and you need to determine which rows contained missing values when the shadow was bound to the dataset.

Usage

add_label_shadow(data, ..., missing = "Missing", complete = "Not Missing")
add_label_shadow(data, ..., missing = "Missing", complete = "Not Missing")

Arguments

`data`	data.frame
`...`	extra variable to label
`missing`	character a label for when values are missing - defaults to "Missing"
`complete`	character character a label for when values are complete - defaults to "Not Missing"

Value

data.frame with a column, "any_missing", which describes whether or not there are any rows that have a shadow value.

Examples


airquality %>%
  add_shadow(Ozone, Solar.R) %>%
  add_label_shadow()

airquality %>%
  add_shadow(Ozone, Solar.R) %>%
  add_label_shadow()

Add a column that tells us which "missingness cluster" a row belongs to

Description

A way to extract the cluster of missingness that a group belongs to. For example, if you use vis_miss(airquality, cluster = TRUE), you can see some clustering in the data, but you do not have a way to identify the cluster. Future work will incorporate the seriation package to allow for better control over the clustering from the user.

Usage

add_miss_cluster(data, cluster_method = "mcquitty", n_clusters = 2)
add_miss_cluster(data, cluster_method = "mcquitty", n_clusters = 2)

Arguments

`data`	a dataframe
`cluster_method`	character vector of the agglomeration method to use, the default is "mcquitty". Options are taken from `stats::hclust` helpfile, and options include: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).
`n_clusters`	numeric the number of clusters you expect. Defaults to 2.

Examples


add_miss_cluster(airquality)
add_miss_cluster(airquality, n_clusters = 3)
add_miss_cluster(airquality, cluster_method = "ward.D", n_clusters = 3)
add_miss_cluster(airquality)
add_miss_cluster(airquality, n_clusters = 3)
add_miss_cluster(airquality, cluster_method = "ward.D", n_clusters = 3)

Add column containing number of missing data values

Description

It can be useful when doing data analysis to add the number of missing data points into your dataframe. add_n_miss adds a column named "n_miss", which contains the number of missing values in that row.

Usage

add_n_miss(data, ..., label = "n_miss")
add_n_miss(data, ..., label = "n_miss")

Arguments

`data`	a dataframe
`...`	Variable names to use instead of the whole dataset. By default this looks at the whole dataset. Otherwise, this is one or more unquoted expressions separated by commas. These also respect the dplyr verbs `starts_with`, `contains`, `ends_with`, etc. By default will add "_all" to the label if left blank, otherwise will add "_vars" to distinguish that it has not been used on all of the variables.
`label`	character default is "n_miss".

Value

a dataframe

Examples


airquality %>% add_n_miss()
airquality %>% add_n_miss(Ozone, Solar.R)
airquality %>% add_n_miss(dplyr::contains("o"))


airquality %>% add_n_miss()
airquality %>% add_n_miss(Ozone, Solar.R)
airquality %>% add_n_miss(dplyr::contains("o"))

Add column containing proportion of missing data values

Description

It can be useful when doing data analysis to add the proportion of missing data values into your dataframe. add_prop_miss adds a column named "prop_miss", which contains the proportion of missing values in that row. You can specify the variables that you would like to show the missingness for.

Usage

add_prop_miss(data, ..., label = "prop_miss")
add_prop_miss(data, ..., label = "prop_miss")

Arguments

`data`	a dataframe
`...`	Variable names to use instead of the whole dataset. By default this looks at the whole dataset. Otherwise, this is one or more unquoted expressions separated by commas. These also respect the dplyr verbs `starts_with`, `contains`, `ends_with`, etc. By default will add "_all" to the label if left blank, otherwise will add "_vars" to distinguish that it has not been used on all of the variables.
`label`	character string of what you need to name variable

Value

a dataframe

Examples


airquality %>% add_prop_miss()
airquality %>% add_prop_miss(Solar.R, Ozone)
airquality %>% add_prop_miss(Solar.R, Ozone, label = "testing")

# this can be applied to model the proportion of missing data
# as in Tierney et al \doi{10.1136/bmjopen-2014-007450}
# see "Modelling missingness" in vignette "Getting Started with naniar"
# for details
airquality %>% add_prop_miss()
airquality %>% add_prop_miss(Solar.R, Ozone)
airquality %>% add_prop_miss(Solar.R, Ozone, label = "testing")

# this can be applied to model the proportion of missing data
# as in Tierney et al \doi{10.1136/bmjopen-2014-007450}
# see "Modelling missingness" in vignette "Getting Started with naniar"
# for details

Add a shadow column to dataframe

Description

As an alternative to bind_shadow(), you can add specific individual shadow columns to a dataset. These also respect the dplyr verbs starts_with, contains, ends_with, etc.

Usage

add_shadow(data, ...)
add_shadow(data, ...)

Arguments

`data`	data.frame
`...`	One or more unquoted variable names, separated by commas. These also respect the dplyr verbs `starts_with`, `contains`, `ends_with`, etc.

Value

data.frame

Examples


airquality %>% add_shadow(Ozone)
airquality %>% add_shadow(Ozone, Solar.R)

airquality %>% add_shadow(Ozone)
airquality %>% add_shadow(Ozone, Solar.R)

Add a shadow shifted column to a dataset

Description

Shadow shift missing values using only the selected variables in a dataset, by specifying variable names or use dplyr vars and dplyr verbs starts_with, contains, ends_with, etc.

Usage

add_shadow_shift(data, ..., suffix = "shift")
add_shadow_shift(data, ..., suffix = "shift")

Arguments

`data`	data.frame
`...`	One or more unquoted variable names separated by commas. These also respect the dplyr verbs `starts_with`, `contains`, `ends_with`, etc.
`suffix`	suffix to add to variable, defaults to "shift"

Value

data with the added variable shifted named as var_suffix

Examples


airquality %>% add_shadow_shift(Ozone, Solar.R)

airquality %>% add_shadow_shift(Ozone, Solar.R)

Add a counter variable for a span of dataframe

Description

Adds a variable, span_counter to a dataframe. Used internally to facilitate counting of missing values over a given span.

Usage

add_span_counter(data, span_size)
add_span_counter(data, span_size)

Arguments

`data`	data.frame
`span_size`	integer

Value

data.frame with extra variable "span_counter".

Examples

## Not run: 
# add_span_counter(pedestrian, span_size = 100)

## End(Not run)
## Not run: 
# add_span_counter(pedestrian, span_size = 100)

## End(Not run)

Helper function to determine whether there are any missings

Description

Helper function to determine whether there are any missings

Usage

any_row_miss(x)
any_row_miss(x)

Arguments

x

a vector

Value

logical vector TRUE = missing FALSE = complete

Identify if there are any or all missing or complete values

Description

It is useful when exploring data to search for cases where there are any or all instances of missing or complete values. For example, these can help you identify and potentially remove or keep columns in a data frame that are all missing, or all complete.

For the any case, we provide two functions: any_miss and any_complete. Note that any_miss has an alias, any_na. These both under the hood call anyNA. any_complete is the complement to any_miss - it returns TRUE if there are any complete values. Note that in a dataframe any_complete will look for complete cases, which are complete rows, which is different to complete variables.

For the all case, there are two functions: all_miss, and all_complete.

Usage

any_na(x)

any_miss(x)

any_complete(x)

all_na(x)

all_miss(x)

all_complete(x)
any_na(x)

any_miss(x)

any_complete(x)

all_na(x)

all_miss(x)

all_complete(x)

Arguments

`x`	an object to explore missings/complete values

Examples


# for vectors
misses <- c(NA, NA, NA)
complete <- c(1, 2, 3)
mixture <- c(NA, 1, NA)

all_na(misses)
all_na(complete)
all_na(mixture)
all_complete(misses)
all_complete(complete)
all_complete(mixture)

any_na(misses)
any_na(complete)
any_na(mixture)

# for data frames
all_na(airquality)
# an alias of all_na
all_miss(airquality)
all_complete(airquality)

any_na(airquality)
any_complete(airquality)

# use in identifying columns with all missing/complete

library(dplyr)
# for printing
aq <- as_tibble(airquality)
aq
# select variables with all missing values
aq %>% select(where(all_na))
# there are none!
#' # select columns with any NA values
aq %>% select(where(any_na))
# select only columns with all complete data
aq %>% select(where(all_complete))

# select columns where there are any complete cases (all the data)
aq %>% select(where(any_complete))

# for vectors
misses <- c(NA, NA, NA)
complete <- c(1, 2, 3)
mixture <- c(NA, 1, NA)

all_na(misses)
all_na(complete)
all_na(mixture)
all_complete(misses)
all_complete(complete)
all_complete(mixture)

any_na(misses)
any_na(complete)
any_na(mixture)

# for data frames
all_na(airquality)
# an alias of all_na
all_miss(airquality)
all_complete(airquality)

any_na(airquality)
any_complete(airquality)

# use in identifying columns with all missing/complete

library(dplyr)
# for printing
aq <- as_tibble(airquality)
aq
# select variables with all missing values
aq %>% select(where(all_na))
# there are none!
#' # select columns with any NA values
aq %>% select(where(any_na))
# select only columns with all complete data
aq %>% select(where(all_complete))

# select columns where there are any complete cases (all the data)
aq %>% select(where(any_complete))

Create shadows

Description

Return a tibble in shadow matrix form, where the variables are the same but have a suffix _NA attached to distinguish them.

Usage

as_shadow(data, ...)
as_shadow(data, ...)

Arguments

`data`	dataframe
`...`	selected variables to use

Details

Representing missing data structure is achieved using the shadow matrix, introduced in Swayne and Buja. The shadow matrix is the same dimension as the data, and consists of binary indicators of missingness of data values, where missing is represented as "NA", and not missing is represented as "!NA". Although these may be represented as 1 and 0, respectively.

Value

appended shadow with column names

Examples


as_shadow(airquality)
as_shadow(airquality)

Convert data into shadow format for doing an upset plot

Description

Upset plots are a way of visualising common sets, this function transforms the data into a format that feeds directly into an upset plot

Usage

as_shadow_upset(data)
as_shadow_upset(data)

Arguments

data

a data.frame

Value

a data.frame

Examples


## Not run: 

library(UpSetR)
airquality %>%
  as_shadow_upset() %>%
  upset()

## End(Not run)

## Not run: 

library(UpSetR)
airquality %>%
  as_shadow_upset() %>%
  upset()

## End(Not run)

Bind a shadow dataframe to original data

Description

Binding a shadow matrix to a regular dataframe helps visualise and work with missing data.

Usage

bind_shadow(data, only_miss = FALSE, ...)
bind_shadow(data, only_miss = FALSE, ...)

Arguments

`data`	a dataframe
`only_miss`	logical - if FALSE (default) it will bind a dataframe with all of the variables duplicated with their shadow. Setting this to TRUE will bind variables only those variables that contain missing values. See the examples for more details.
`...`	extra options to pass to `recode_shadow()` - a work in progress.

Value

data with the added variable shifted and the suffix ⁠_NA⁠

Examples


bind_shadow(airquality)

# bind only the variables that contain missing values
bind_shadow(airquality, only_miss = TRUE)

aq_shadow <- bind_shadow(airquality)

## Not run: 
# explore missing data visually
library(ggplot2)

# using the bounded shadow to visualise Ozone according to whether Solar
# Radiation is missing or not.

ggplot(data = aq_shadow,
       aes(x = Ozone)) +
       geom_histogram() +
       facet_wrap(~Solar.R_NA,
       ncol = 1)

## End(Not run)

bind_shadow(airquality)

# bind only the variables that contain missing values
bind_shadow(airquality, only_miss = TRUE)

aq_shadow <- bind_shadow(airquality)

## Not run: 
# explore missing data visually
library(ggplot2)

# using the bounded shadow to visualise Ozone according to whether Solar
# Radiation is missing or not.

ggplot(data = aq_shadow,
       aes(x = Ozone)) +
       geom_histogram() +
       facet_wrap(~Solar.R_NA,
       ncol = 1)

## End(Not run)

Add a shadow column to a dataset

Description

Casting a shadow shifted column performs the equivalent pattern to data %>% select(var) %>% impute_below(). This is a convenience function that makes it easy to perform certain visualisations, in line with the principle that the user should have a way to flexibly return data formats containing information about the missing data. It forms the base building block for the functions cast_shadow_shift, and cast_shadow_shift_label. It also respects the dplyr verbs starts_with, contains, ends_with, etc. to select variables.

Usage

cast_shadow(data, ...)
cast_shadow(data, ...)

Arguments

`data`	data.frame
`...`	One or more unquoted variable names separated by commas. These respect the dplyr verbs `starts_with`, `contains`, `ends_with`, etc.

Value

data with the added variable shifted and the suffix ⁠_NA⁠

Examples


airquality %>% cast_shadow(Ozone, Solar.R)
## Not run: 
library(ggplot2)
library(magrittr)
airquality  %>%
  cast_shadow(Ozone,Solar.R) %>%
  ggplot(aes(x = Ozone,
             colour = Solar.R_NA)) +
        geom_density()

## End(Not run)

airquality %>% cast_shadow(Ozone, Solar.R)
## Not run: 
library(ggplot2)
library(magrittr)
airquality  %>%
  cast_shadow(Ozone,Solar.R) %>%
  ggplot(aes(x = Ozone,
             colour = Solar.R_NA)) +
        geom_density()

## End(Not run)

Add a shadow and a shadow_shift column to a dataset

Description

Shift the values and add a shadow column. It also respects the dplyr verbs starts_with, contains, ends_with, etc.

Usage

cast_shadow_shift(data, ...)
cast_shadow_shift(data, ...)

Arguments

`data`	data.frame
`...`	One or more unquoted variable names separated by commas. These respect the dplyr verbs `starts_with`, `contains`, `ends_with`, etc.

Value

data.frame with the shadow and shadow_shift vars

Examples


airquality %>% cast_shadow_shift(Ozone,Temp)

airquality %>% cast_shadow_shift(dplyr::contains("o"))

airquality %>% cast_shadow_shift(Ozone,Temp)

airquality %>% cast_shadow_shift(dplyr::contains("o"))

Add a shadow column and a shadow shifted column to a dataset

Description

Shift the values, add shadow, add missing label

Usage

cast_shadow_shift_label(data, ...)
cast_shadow_shift_label(data, ...)

Arguments

`data`	data.frame
`...`	One or more unquoted expressions separated by commas. These also respect the dplyr verbs "starts_with", "contains", "ends_with", etc.

Value

data.frame with the shadow and shadow_shift vars, and missing labels

Examples


airquality %>% cast_shadow_shift_label(Ozone, Solar.R)

# replicate the plot generated by geom_miss_point()
## Not run: 
library(ggplot2)

airquality %>%
  cast_shadow_shift_label(Ozone,Solar.R) %>%
  ggplot(aes(x = Ozone_shift,
             y = Solar.R_shift,
             colour = any_missing)) +
        geom_point()

## End(Not run)

airquality %>% cast_shadow_shift_label(Ozone, Solar.R)

# replicate the plot generated by geom_miss_point()
## Not run: 
library(ggplot2)

airquality %>%
  cast_shadow_shift_label(Ozone,Solar.R) %>%
  ggplot(aes(x = Ozone_shift,
             y = Solar.R_shift,
             colour = any_missing)) +
        geom_point()

## End(Not run)

Common number values for NA

Description

This vector contains common number values of NA (missing), which is aimed to be used inside naniar functions miss_scan_count() and replace_with_na(). The current list of numbers can be found by printing out common_na_numbers. It is a useful way to explore your data for possible missings, but I strongly warn against using this to replace NA values without very carefully looking at the incidence for each of the cases. Common NA strings are in the data object common_na_strings.

Usage

common_na_numbers
common_na_numbers

Format

An object of class numeric of length 8.

Note

original discussion here https://github.com/njtierney/naniar/issues/168

Examples


dat_ms <- tibble::tribble(~x,  ~y,    ~z,
                          1,   "A",   -100,
                          3,   "N/A", -99,
                          NA,  NA,    -98,
                          -99, "E",   -101,
                          -98, "F",   -1)

miss_scan_count(dat_ms, -99)
miss_scan_count(dat_ms, c("-99","-98","N/A"))
common_na_numbers
miss_scan_count(dat_ms, common_na_numbers)
dat_ms <- tibble::tribble(~x,  ~y,    ~z,
                          1,   "A",   -100,
                          3,   "N/A", -99,
                          NA,  NA,    -98,
                          -99, "E",   -101,
                          -98, "F",   -1)

miss_scan_count(dat_ms, -99)
miss_scan_count(dat_ms, c("-99","-98","N/A"))
common_na_numbers
miss_scan_count(dat_ms, common_na_numbers)

Common string values for NA

Description

This vector contains common values of NA (missing), which is aimed to be used inside naniar functions miss_scan_count() and replace_with_na(). The current list of strings used can be found by printing out common_na_strings. It is a useful way to explore your data for possible missings, but I strongly warn against using this to replace NA values without very carefully looking at the incidence for each of the cases. Please note that common_na_strings uses ⁠\\⁠ around the "?", "." and "*" characters to protect against using their wildcard features in grep. Common NA numbers are in the data object common_na_numbers.

Usage

common_na_strings
common_na_strings

Format

An object of class character of length 26.

Note

original discussion here https://github.com/njtierney/naniar/issues/168

Examples


dat_ms <- tibble::tribble(~x,  ~y,    ~z,
                          1,   "A",   -100,
                          3,   "N/A", -99,
                          NA,  NA,    -98,
                          -99, "E",   -101,
                          -98, "F",   -1)

miss_scan_count(dat_ms, -99)
miss_scan_count(dat_ms, c("-99","-98","N/A"))
common_na_strings
miss_scan_count(dat_ms, common_na_strings)
replace_with_na(dat_ms, replace = list(y = common_na_strings))
dat_ms <- tibble::tribble(~x,  ~y,    ~z,
                          1,   "A",   -100,
                          3,   "N/A", -99,
                          NA,  NA,    -98,
                          -99, "E",   -101,
                          -98, "F",   -1)

miss_scan_count(dat_ms, -99)
miss_scan_count(dat_ms, c("-99","-98","N/A"))
common_na_strings
miss_scan_count(dat_ms, common_na_strings)
replace_with_na(dat_ms, replace = list(y = common_na_strings))

Long form representation of a shadow matrix

Description

gather_shadow is a long-form representation of binding the shadow matrix to your data, producing variables named case, variable, and missing, where missing contains the missing value representation.

Usage

gather_shadow(data)
gather_shadow(data)

Arguments

data

a dataframe

Value

dataframe in long, format, containing information about the missings

Examples


gather_shadow(airquality)

gather_shadow(airquality)

Plot Missing Data Points

Description

geom_miss_point provides a way to transform and plot missing values in ggplot2. To do so it uses methods from ggobi to display missing data points 10\ the same axis.

Usage

geom_miss_point(
  mapping = NULL,
  data = NULL,
  prop_below = 0.1,
  jitter = 0.05,
  stat = "miss_point",
  position = "identity",
  colour = ..missing..,
  na.rm = FALSE,
  show.legend = NA,
  inherit.aes = TRUE,
  ...
)
geom_miss_point(
  mapping = NULL,
  data = NULL,
  prop_below = 0.1,
  jitter = 0.05,
  stat = "miss_point",
  position = "identity",
  colour = ..missing..,
  na.rm = FALSE,
  show.legend = NA,
  inherit.aes = TRUE,
  ...
)

Arguments

`mapping`	Set of aesthetic mappings created by `ggplot2::aes()` or `ggplot2::aes_()`. If specified and `inherit.aes = TRUE` (the default), is combined with the default mapping at the top level of the plot. You only need to supply mapping if there isn't a mapping defined for the plot.
`data`	A data frame. If specified, overrides the default data frame defined at the top level of the plot.
`prop_below`	the degree to shift the values. The default is 0.1
`jitter`	the amount of jitter to add. The default is 0.05
`stat`	The statistical transformation to use on the data for this layer, as a string.
`position`	Position adjustment, either as a string, or the result of a call to a position adjustment function.
`colour`	the colour chosen for the aesthetic
`na.rm`	If `FALSE` (the default), removes missing values with a warning. If `TRUE` silently removes missing values.
`show.legend`	logical. Should this layer be included in the legends? `NA`, the default, includes if any aesthetics are mapped. `FALSE` never includes, and `TRUE` always includes.
`inherit.aes`	If `FALSE`, overrides the default aesthetics, rather than combining with them. This is most useful for helper functions that define both data and aesthetics and shouldn't inherit behaviour from the default plot specification, e.g. borders.
`...`	other arguments passed on to `ggplot2::layer()`. There are three types of arguments you can use here: Aesthetics: to set an aesthetic to a fixed value, like `color = "red"` or `size = 3.` Other arguments to the layer, for example you override the default `stat` associated with the layer. Other arguments passed on to the stat.

Note

Warning message if na.rm = T is supplied.

Examples

## Not run: 
library(ggplot2)

# using regular geom_point()
ggplot(airquality,
       aes(x = Ozone,
           y = Solar.R)) +
geom_point()

# using  geom_miss_point()
ggplot(airquality,
       aes(x = Ozone,
           y = Solar.R)) +
 geom_miss_point()

 # using facets

ggplot(airquality,
       aes(x = Ozone,
           y = Solar.R)) +
 geom_miss_point() +
 facet_wrap(~Month)

## End(Not run)
## Not run: 
library(ggplot2)

# using regular geom_point()
ggplot(airquality,
       aes(x = Ozone,
           y = Solar.R)) +
geom_point()

# using  geom_miss_point()
ggplot(airquality,
       aes(x = Ozone,
           y = Solar.R)) +
 geom_miss_point()

 # using facets

ggplot(airquality,
       aes(x = Ozone,
           y = Solar.R)) +
 geom_miss_point() +
 facet_wrap(~Month)

## End(Not run)

naniar-ggproto

Description

These are the stat and geom overrides using ggproto from ggplot2 that make naniar work.

Usage

StatMissPoint
StatMissPoint

Format

An object of class StatMissPoint (inherits from Stat, ggproto, gg) of length 6.

Plot the number of missings per case (row)

Description

This is a visual analogue to miss_case_summary. It draws a ggplot of the number of missings in each case (row). A default minimal theme is used, which can be customised as normal for ggplot.

Usage

gg_miss_case(x, facet, order_cases = TRUE, show_pct = FALSE)
gg_miss_case(x, facet, order_cases = TRUE, show_pct = FALSE)

Arguments

`x`	data.frame
`facet`	(optional) a single bare variable name, if you want to create a faceted plot.
`order_cases`	logical Order the rows by missingness (default is FALSE - no order).
`show_pct`	logical Show the percentage of cases

Value

a ggplot object depicting the number of missings in a given case.

Examples


gg_miss_case(airquality)
## Not run: 
library(ggplot2)
gg_miss_case(airquality) + labs(x = "Number of Cases")
gg_miss_case(airquality, show_pct = TRUE)
gg_miss_case(airquality, order_cases = FALSE)
gg_miss_case(airquality, facet = Month)
gg_miss_case(airquality, facet = Month, order_cases = FALSE)
gg_miss_case(airquality, facet = Month, show_pct = TRUE)

## End(Not run)
gg_miss_case(airquality)
## Not run: 
library(ggplot2)
gg_miss_case(airquality) + labs(x = "Number of Cases")
gg_miss_case(airquality, show_pct = TRUE)
gg_miss_case(airquality, order_cases = FALSE)
gg_miss_case(airquality, facet = Month)
gg_miss_case(airquality, facet = Month, order_cases = FALSE)
gg_miss_case(airquality, facet = Month, show_pct = TRUE)

## End(Not run)

Plot of cumulative sum of missing for cases

Description

A plot showing the cumulative sum of missing values for cases, reading the rows from the top to bottom. A default minimal theme is used, which can be customised as normal for ggplot.

Usage

gg_miss_case_cumsum(x, breaks = 20)
gg_miss_case_cumsum(x, breaks = 20)

Arguments

`x`	a dataframe
`breaks`	the breaks for the x axis default is 20

Value

a ggplot object depicting the number of missings

Examples


gg_miss_case_cumsum(airquality)
gg_miss_case_cumsum(airquality)

Plot the number of missings for each variable, broken down by a factor

Description

This function draws a ggplot plot of the number of missings in each column, broken down by a categorical variable from the dataset. A default minimal theme is used, which can be customised as normal for ggplot.

Usage

gg_miss_fct(x, fct)
gg_miss_fct(x, fct)

Arguments

`x`	data.frame
`fct`	column containing the factor variable to visualise

Value

ggplot object depicting the % missing of each factor level for each variable.

Examples


gg_miss_fct(x = riskfactors, fct = marital)
## Not run: 
library(ggplot2)
gg_miss_fct(x = riskfactors, fct = marital) + labs(title = "NA in Risk Factors and Marital status")

## End(Not run)

gg_miss_fct(x = riskfactors, fct = marital)
## Not run: 
library(ggplot2)
gg_miss_fct(x = riskfactors, fct = marital) + labs(title = "NA in Risk Factors and Marital status")

## End(Not run)

Plot the number of missings in a given repeating span

Description

gg_miss_span is a replacement function to imputeTS::plotNA.distributionBar(tsNH4, breaksize = 100), which shows the number of missings in a given span, or breaksize. A default minimal theme is used, which can be customised as normal for ggplot.

Usage

gg_miss_span(data, var, span_every, facet)
gg_miss_span(data, var, span_every, facet)

Arguments

`data`	data.frame
`var`	a bare unquoted variable name from `data`.
`span_every`	integer describing the length of the span to be explored
`facet`	(optional) a single bare variable name, if you want to create a faceted plot.

Value

ggplot2 showing the number of missings in a span (window, or breaksize)

Examples


miss_var_span(pedestrian, hourly_counts, span_every = 3000)
## Not run: 
library(ggplot2)
gg_miss_span(pedestrian, hourly_counts, span_every = 3000)
gg_miss_span(pedestrian, hourly_counts, span_every = 3000, facet = sensor_name)
# works with the rest of ggplot
gg_miss_span(pedestrian, hourly_counts, span_every = 3000) + labs(x = "custom")
gg_miss_span(pedestrian, hourly_counts, span_every = 3000) + theme_dark()

## End(Not run)
miss_var_span(pedestrian, hourly_counts, span_every = 3000)
## Not run: 
library(ggplot2)
gg_miss_span(pedestrian, hourly_counts, span_every = 3000)
gg_miss_span(pedestrian, hourly_counts, span_every = 3000, facet = sensor_name)
# works with the rest of ggplot
gg_miss_span(pedestrian, hourly_counts, span_every = 3000) + labs(x = "custom")
gg_miss_span(pedestrian, hourly_counts, span_every = 3000) + theme_dark()

## End(Not run)

Plot the pattern of missingness using an upset plot.

Description

Upset plots are a way of visualising common sets, gg_miss_upset shows the number of missing values for each of the sets of data. The default option of gg_miss_upset is taken from UpSetR::upset - which is to use up to 5 sets and up to 40 interactions. We also set the ordering to by the frequency of the intersections. Setting nsets = 5 means to look at 5 variables and their combinations. The number of combinations or rather intersections is controlled by nintersects. If there are 40 intersections, there will be 40 combinations of variables explored. The number of sets and intersections can be changed by passing arguments nsets = 10 to look at 10 sets of variables, and nintersects = 50 to look at 50 intersections.

Usage

gg_miss_upset(data, order.by = "freq", ...)
gg_miss_upset(data, order.by = "freq", ...)

Arguments

`data`	data.frame
`order.by`	(from UpSetR::upset) How the intersections in the matrix should be ordered by. Options include frequency (entered as "freq"), degree, or both in any order. See `?UpSetR::upset` for more options
`...`	arguments to pass to upset plot - see `?UpSetR::upset`

Value

a ggplot visualisation of missing data

Examples


## Not run: 
gg_miss_upset(airquality)
gg_miss_upset(riskfactors)
gg_miss_upset(riskfactors, nsets = 10)
gg_miss_upset(riskfactors, nsets = 10, nintersects = 10)

## End(Not run)
## Not run: 
gg_miss_upset(airquality)
gg_miss_upset(riskfactors)
gg_miss_upset(riskfactors, nsets = 10)
gg_miss_upset(riskfactors, nsets = 10, nintersects = 10)

## End(Not run)

Plot the number of missings for each variable

Description

This is a visual analogue to miss_var_summary. It draws a ggplot of the number of missings in each variable, ordered to show which variables have the most missing data. A default minimal theme is used, which can be customised as normal for ggplot.

Usage

gg_miss_var(x, facet, show_pct = FALSE)
gg_miss_var(x, facet, show_pct = FALSE)

Arguments

`x`	a dataframe
`facet`	(optional) bare variable name, if you want to create a faceted plot.
`show_pct`	logical shows the number of missings (default), but if set to TRUE, it will display the proportion of missings.

Value

a ggplot object depicting the number of missings in a given column

Examples


gg_miss_var(airquality)
## Not run: 
library(ggplot2)
gg_miss_var(airquality) + labs(y = "Look at all the missing ones")
gg_miss_var(airquality, Month)
gg_miss_var(airquality, Month, show_pct = TRUE)
gg_miss_var(airquality, Month, show_pct = TRUE) + ylim(0, 100)

## End(Not run)
gg_miss_var(airquality)
## Not run: 
library(ggplot2)
gg_miss_var(airquality) + labs(y = "Look at all the missing ones")
gg_miss_var(airquality, Month)
gg_miss_var(airquality, Month, show_pct = TRUE)
gg_miss_var(airquality, Month, show_pct = TRUE) + ylim(0, 100)

## End(Not run)

Plot of cumulative sum of missing value for each variable

Description

A plot showing the cumulative sum of missing values for each variable, reading columns from the left to the right of the initial dataframe. A default minimal theme is used, which can be customised as normal for ggplot.

Usage

gg_miss_var_cumsum(x)
gg_miss_var_cumsum(x)

Arguments

`x`	a data.frame

Value

a ggplot object showing the cumulative sum of missings over the variables

Examples


gg_miss_var_cumsum(airquality)
gg_miss_var_cumsum(airquality)

Plot which variables contain a missing value

Description

This plot produces a set of rectangles indicating whether there is a missing element in a column or not. A default minimal theme is used, which can be customised as normal for ggplot.

Usage

gg_miss_which(x)
gg_miss_which(x)

Arguments

`x`	a dataframe

Value

a ggplot object of which variables contains missing values

Examples


gg_miss_which(airquality)
gg_miss_which(airquality)

Impute data with values shifted 10 percent below range.

Description

It can be useful in exploratory graphics to impute data outside the range of the data. impute_below imputes variables with missings to have values 10 percent below the range for numeric values, plus some jittered noise, to separate repeated values, so that missing values can be visualised along with the rest of the data. For character or factor values, it adds a new string or label.

Usage

impute_below(x, ...)
impute_below(x, ...)

Arguments

`x`	a variable of interest to shift
`...`	extra arguments to pass

Examples

library(dplyr)
vec <- rnorm(10)

vec[sample(1:10, 3)] <- NA

impute_below(vec)
impute_below(vec, prop_below = 0.25)
impute_below(vec,
            prop_below = 0.25,
            jitter = 0.2)

dat <- tibble(
 num = rnorm(10),
 int = as.integer(rpois(10, 5)),
 fct = factor(LETTERS[1:10])
) %>%
 mutate(
   across(
     everything(),
     \(x) set_prop_miss(x, prop = 0.25)
   )
 )

dat

dat %>%
 nabular() %>%
 mutate(
   num = impute_below(num),
   int = impute_below(int),
   fct = impute_below(fct),
 )

dat %>%
 nabular() %>%
 mutate(
   across(
     where(is.numeric),
     impute_below
   )
 )

dat %>%
 nabular() %>%
 mutate(
   across(
     c("num", "int"),
     impute_below
   )
 )


library(dplyr)
vec <- rnorm(10)

vec[sample(1:10, 3)] <- NA

impute_below(vec)
impute_below(vec, prop_below = 0.25)
impute_below(vec,
            prop_below = 0.25,
            jitter = 0.2)

dat <- tibble(
 num = rnorm(10),
 int = as.integer(rpois(10, 5)),
 fct = factor(LETTERS[1:10])
) %>%
 mutate(
   across(
     everything(),
     \(x) set_prop_miss(x, prop = 0.25)
   )
 )

dat

dat %>%
 nabular() %>%
 mutate(
   num = impute_below(num),
   int = impute_below(int),
   fct = impute_below(fct),
 )

dat %>%
 nabular() %>%
 mutate(
   across(
     where(is.numeric),
     impute_below
   )
 )

dat %>%
 nabular() %>%
 mutate(
   across(
     c("num", "int"),
     impute_below
   )
 )

Impute data with values shifted 10 percent below range.

Description

It can be useful in exploratory graphics to impute data outside the range of the data. impute_below_all imputes all variables with missings to have values 10\ values adds a new string or label.

Usage

impute_below_all(.tbl, prop_below = 0.1, jitter = 0.05, ...)
impute_below_all(.tbl, prop_below = 0.1, jitter = 0.05, ...)

Arguments

`.tbl`	a data.frame
`prop_below`	the degree to shift the values. default is
`jitter`	the amount of jitter to add. default is 0.05
`...`	additional arguments

Details

Value

an dataset with values imputed

Examples


# you can impute data like so:
airquality %>%
  impute_below_all()

# However, this does not show you WHERE the missing values are.
# to keep track of them, you want to use `bind_shadow()` first.

airquality %>%
  bind_shadow() %>%
  impute_below_all()

# This identifies where the missing values are located, which means you
# can do things like this:

## Not run: 
library(ggplot2)
airquality %>%
  bind_shadow() %>%
  impute_below_all() %>%
  # identify where there are missings across rows.
  add_label_shadow() %>%
  ggplot(aes(x = Ozone,
             y = Solar.R,
             colour = any_missing)) +
  geom_point()
# Note that this ^^ is a long version of `geom_miss_point()`.

## End(Not run)

# you can impute data like so:
airquality %>%
  impute_below_all()

# However, this does not show you WHERE the missing values are.
# to keep track of them, you want to use `bind_shadow()` first.

airquality %>%
  bind_shadow() %>%
  impute_below_all()

# This identifies where the missing values are located, which means you
# can do things like this:

## Not run: 
library(ggplot2)
airquality %>%
  bind_shadow() %>%
  impute_below_all() %>%
  # identify where there are missings across rows.
  add_label_shadow() %>%
  ggplot(aes(x = Ozone,
             y = Solar.R,
             colour = any_missing)) +
  geom_point()
# Note that this ^^ is a long version of `geom_miss_point()`.

## End(Not run)

Scoped variants of `impute_below`

Description

impute_below imputes missing values to a set percentage below the range of the data. To impute many variables at once, we recommend that you use the across function workflow, shown in the examples for impute_below(). impute_below_all operates on all variables. To only impute variables that satisfy a specific condition, use the scoped variants, impute_below_at, and impute_below_if. To use ⁠_at⁠ effectively, you must know that ⁠_at`` affects variables selected with a character vector, or with ⁠vars()'.

Usage

impute_below_at(.tbl, .vars, prop_below = 0.1, jitter = 0.05, ...)
impute_below_at(.tbl, .vars, prop_below = 0.1, jitter = 0.05, ...)

Arguments

`.tbl`	a data.frame
`.vars`	variables to impute
`prop_below`	the degree to shift the values. default is
`jitter`	the amount of jitter to add. default is 0.05
`...`	extra arguments

Details

Value

an dataset with values imputed

Examples

# select variables starting with a particular string.
impute_below_at(airquality,
                .vars = c("Ozone", "Solar.R"))

impute_below_at(airquality, .vars = 1:2)

## Not run: 
library(dplyr)
impute_below_at(airquality,
                .vars = vars(Ozone))

library(ggplot2)
airquality %>%
  bind_shadow() %>%
  impute_below_at(vars(Ozone, Solar.R)) %>%
  add_label_shadow() %>%
  ggplot(aes(x = Ozone,
             y = Solar.R,
             colour = any_missing)) +
         geom_point()

## End(Not run)

# select variables starting with a particular string.
impute_below_at(airquality,
                .vars = c("Ozone", "Solar.R"))

impute_below_at(airquality, .vars = 1:2)

## Not run: 
library(dplyr)
impute_below_at(airquality,
                .vars = vars(Ozone))

library(ggplot2)
airquality %>%
  bind_shadow() %>%
  impute_below_at(vars(Ozone, Solar.R)) %>%
  add_label_shadow() %>%
  ggplot(aes(x = Ozone,
             y = Solar.R,
             colour = any_missing)) +
         geom_point()

## End(Not run)

Scoped variants of `impute_below`

Description

impute_below operates on all variables. To only impute variables that satisfy a specific condition, use the scoped variants, impute_below_at, and impute_below_if.

Usage

impute_below_if(.tbl, .predicate, prop_below = 0.1, jitter = 0.05, ...)
impute_below_if(.tbl, .predicate, prop_below = 0.1, jitter = 0.05, ...)

Arguments

`.tbl`	data.frame
`.predicate`	A predicate function (such as is.numeric)
`prop_below`	the degree to shift the values. default is
`jitter`	the amount of jitter to add. default is 0.05
`...`	extra arguments

Value

an dataset with values imputed

Examples


airquality %>%
  impute_below_if(.predicate = is.numeric)

airquality %>%
  impute_below_if(.predicate = is.numeric)

Impute numeric values below a range for graphical exploration

Description

Impute numeric values below a range for graphical exploration

Usage

## S3 method for class 'numeric'
impute_below(
  x,
  prop_below = 0.1,
  jitter = 0.05,
  seed_shift = 2017 - 7 - 1 - 1850,
  ...
)
## S3 method for class 'numeric'
impute_below(
  x,
  prop_below = 0.1,
  jitter = 0.05,
  seed_shift = 2017 - 7 - 1 - 1850,
  ...
)

Arguments

`x`	a variable of interest to shift
`prop_below`	the degree to shift the values. default is
`jitter`	the amount of jitter to add. default is 0.05
`seed_shift`	a random seed to set, if you like
`...`	extra arguments to pass

Impute a factor value into a vector with missing values

Description

For imputing fixed factor levels. It adds the new imputed value to the end of the levels of the vector. We generally recommend to impute using other model based approaches. See the simputation package, for example simputation::impute_lm().

Usage

impute_factor(x, value)

## Default S3 method:
impute_factor(x, value)

## S3 method for class 'factor'
impute_factor(x, value)

## S3 method for class 'character'
impute_factor(x, value)

## S3 method for class 'shade'
impute_factor(x, value)
impute_factor(x, value)

## Default S3 method:
impute_factor(x, value)

## S3 method for class 'factor'
impute_factor(x, value)

## S3 method for class 'character'
impute_factor(x, value)

## S3 method for class 'shade'
impute_factor(x, value)

Arguments

`x`	vector
`value`	factor to impute

Value

vector with a factor values replaced

Examples


vec <- factor(LETTERS[1:10])

vec[sample(1:10, 3)] <- NA

vec

impute_factor(vec, "wat")

library(dplyr)

dat <- tibble(
  num = rnorm(10),
  int = rpois(10, 5),
  fct = factor(LETTERS[1:10])
) %>%
  mutate(
    across(
      everything(),
      \(x) set_prop_miss(x, prop = 0.25)
    )
  )

dat

dat %>%
  nabular() %>%
  mutate(
    num = impute_fixed(num, -9999),
    int = impute_zero(int),
    fct = impute_factor(fct, "out")
  )

vec <- factor(LETTERS[1:10])

vec[sample(1:10, 3)] <- NA

vec

impute_factor(vec, "wat")

library(dplyr)

dat <- tibble(
  num = rnorm(10),
  int = rpois(10, 5),
  fct = factor(LETTERS[1:10])
) %>%
  mutate(
    across(
      everything(),
      \(x) set_prop_miss(x, prop = 0.25)
    )
  )

dat

dat %>%
  nabular() %>%
  mutate(
    num = impute_fixed(num, -9999),
    int = impute_zero(int),
    fct = impute_factor(fct, "out")
  )

Impute a fixed value into a vector with missing values

Description

This can be useful if you are imputing specific values, however we would generally recommend to impute using other model based approaches. See the simputation package, for example simputation::impute_lm().

Usage

impute_fixed(x, value)

## Default S3 method:
impute_fixed(x, value)
impute_fixed(x, value)

## Default S3 method:
impute_fixed(x, value)

Arguments

`x`	vector
`value`	value to impute

Value

vector with a fixed values replaced

Examples


vec <- rnorm(10)

vec[sample(1:10, 3)] <- NA

vec

impute_fixed(vec, -999)

library(dplyr)

dat <- tibble(
  num = rnorm(10),
  int = rpois(10, 5),
  fct = factor(LETTERS[1:10])
) %>%
  mutate(
    across(
      everything(),
      \(x) set_prop_miss(x, prop = 0.25)
    )
  )

dat

dat %>%
  nabular() %>%
  mutate(
    num = impute_fixed(num, -9999),
    int = impute_zero(int),
    fct = impute_factor(fct, "out")
  )

vec <- rnorm(10)

vec[sample(1:10, 3)] <- NA

vec

impute_fixed(vec, -999)

library(dplyr)

dat <- tibble(
  num = rnorm(10),
  int = rpois(10, 5),
  fct = factor(LETTERS[1:10])
) %>%
  mutate(
    across(
      everything(),
      \(x) set_prop_miss(x, prop = 0.25)
    )
  )

dat

dat %>%
  nabular() %>%
  mutate(
    num = impute_fixed(num, -9999),
    int = impute_zero(int),
    fct = impute_factor(fct, "out")
  )

Impute the mean value into a vector with missing values

Description

Usage

impute_mean(x)

## Default S3 method:
impute_mean(x)

## S3 method for class 'factor'
impute_mean(x)
impute_mean(x)

## Default S3 method:
impute_mean(x)

## S3 method for class 'factor'
impute_mean(x)

Arguments

x

vector

Value

vector with mean values replaced

Examples


library(dplyr)
vec <- rnorm(10)

vec[sample(1:10, 3)] <- NA

impute_mean(vec)

dat <- tibble(
  num = rnorm(10),
  int = as.integer(rpois(10, 5)),
  fct = factor(LETTERS[1:10])
) %>%
  mutate(
    across(
      everything(),
      \(x) set_prop_miss(x, prop = 0.25)
    )
  )

dat

dat %>%
  nabular() %>%
  mutate(
    num = impute_mean(num),
    int = impute_mean(int),
    fct = impute_mean(fct),
  )

dat %>%
  nabular() %>%
  mutate(
    across(
      where(is.numeric),
      impute_mean
    )
  )

dat %>%
  nabular() %>%
  mutate(
    across(
      c("num", "int"),
      impute_mean
    )
  )

library(dplyr)
vec <- rnorm(10)

vec[sample(1:10, 3)] <- NA

impute_mean(vec)

dat <- tibble(
  num = rnorm(10),
  int = as.integer(rpois(10, 5)),
  fct = factor(LETTERS[1:10])
) %>%
  mutate(
    across(
      everything(),
      \(x) set_prop_miss(x, prop = 0.25)
    )
  )

dat

dat %>%
  nabular() %>%
  mutate(
    num = impute_mean(num),
    int = impute_mean(int),
    fct = impute_mean(fct),
  )

dat %>%
  nabular() %>%
  mutate(
    across(
      where(is.numeric),
      impute_mean
    )
  )

dat %>%
  nabular() %>%
  mutate(
    across(
      c("num", "int"),
      impute_mean
    )
  )

Impute the median value into a vector with missing values

Description

Impute the median value into a vector with missing values

Usage

impute_median(x)

## Default S3 method:
impute_median(x)

## S3 method for class 'factor'
impute_median(x)
impute_median(x)

## Default S3 method:
impute_median(x)

## S3 method for class 'factor'
impute_median(x)

Arguments

x

vector

Value

vector with median values replaced

Examples


vec <- rnorm(10)

vec[sample(1:10, 3)] <- NA

impute_median(vec)

library(dplyr)

dat <- tibble(
  num = rnorm(10),
  int = as.integer(rpois(10, 5)),
  fct = factor(LETTERS[1:10])
) %>%
  mutate(
    across(
      everything(),
      \(x) set_prop_miss(x, prop = 0.25)
    )
  )

dat

dat %>%
  nabular() %>%
  mutate(
    num = impute_median(num),
    int = impute_median(int),
  )

dat %>%
  nabular() %>%
  mutate(
    across(
      where(is.numeric),
      impute_median
    )
  )

dat %>%
  nabular() %>%
  mutate(
    across(
      c("num", "int"),
      impute_median
    )
 )

vec <- rnorm(10)

vec[sample(1:10, 3)] <- NA

impute_median(vec)

library(dplyr)

dat <- tibble(
  num = rnorm(10),
  int = as.integer(rpois(10, 5)),
  fct = factor(LETTERS[1:10])
) %>%
  mutate(
    across(
      everything(),
      \(x) set_prop_miss(x, prop = 0.25)
    )
  )

dat

dat %>%
  nabular() %>%
  mutate(
    num = impute_median(num),
    int = impute_median(int),
  )

dat %>%
  nabular() %>%
  mutate(
    across(
      where(is.numeric),
      impute_median
    )
  )

dat %>%
  nabular() %>%
  mutate(
    across(
      c("num", "int"),
      impute_median
    )
 )

Impute the mode value into a vector with missing values

Description

Impute the mode value into a vector with missing values

Usage

impute_mode(x)

## Default S3 method:
impute_mode(x)

## S3 method for class 'integer'
impute_mode(x)

## S3 method for class 'factor'
impute_mode(x)
impute_mode(x)

## Default S3 method:
impute_mode(x)

## S3 method for class 'integer'
impute_mode(x)

## S3 method for class 'factor'
impute_mode(x)

Arguments

x

vector

This approach adapts examples provided from stack overflow, and for the integer case, just rounds the value. While this can be useful if you are imputing specific values, however we would generally recommend to impute using other model based approaches. See the simputation package, for example simputation::impute_lm().

Value

vector with mode values replaced

Examples


vec <- rnorm(10)

vec[sample(1:10, 3)] <- NA

impute_mode(vec)

library(dplyr)

dat <- tibble(
  num = rnorm(10),
  int = rpois(10, 5),
  fct = factor(LETTERS[1:10])
) %>%
  mutate(
    across(
      everything(),
      \(x) set_prop_miss(x, prop = 0.25)
    )
  )

dat


dat %>%
  nabular() %>%
  mutate(
    num = impute_mode(num),
    int = impute_mode(int),
    fct = impute_mode(fct)
  )


vec <- rnorm(10)

vec[sample(1:10, 3)] <- NA

impute_mode(vec)

library(dplyr)

dat <- tibble(
  num = rnorm(10),
  int = rpois(10, 5),
  fct = factor(LETTERS[1:10])
) %>%
  mutate(
    across(
      everything(),
      \(x) set_prop_miss(x, prop = 0.25)
    )
  )

dat


dat %>%
  nabular() %>%
  mutate(
    num = impute_mode(num),
    int = impute_mode(int),
    fct = impute_mode(fct)
  )

Impute zero into a vector with missing values

Description

Usage

impute_zero(x)
impute_zero(x)

Arguments

x

vector

Value

vector with a fixed values replaced

Examples


vec <- rnorm(10)

vec[sample(1:10, 3)] <- NA

vec

impute_zero(vec)

library(dplyr)

dat <- tibble(
  num = rnorm(10),
  int = rpois(10, 5),
  fct = factor(LETTERS[1:10])
) %>%
  mutate(
    across(
      everything(),
      \(x) set_prop_miss(x, prop = 0.25)
    )
  )

dat

dat %>%
  nabular() %>%
  mutate(
    num = impute_fixed(num, -9999),
    int = impute_zero(int),
    fct = impute_factor(fct, "out")
  )

vec <- rnorm(10)

vec[sample(1:10, 3)] <- NA

vec

impute_zero(vec)

library(dplyr)

dat <- tibble(
  num = rnorm(10),
  int = rpois(10, 5),
  fct = factor(LETTERS[1:10])
) %>%
  mutate(
    across(
      everything(),
      \(x) set_prop_miss(x, prop = 0.25)
    )
  )

dat

dat %>%
  nabular() %>%
  mutate(
    num = impute_fixed(num, -9999),
    int = impute_zero(int),
    fct = impute_factor(fct, "out")
  )

Detect if this is a shade

Description

This tells us if this column is a shade

Usage

is_shade(x)

are_shade(x)

any_shade(x)
is_shade(x)

are_shade(x)

any_shade(x)

Arguments

`x`	a vector you want to test if is a shade

Value

logical - is this a shade?

Examples


xs <- shade(c(NA, 1, 2, "3"))

is_shade(xs)
are_shade(xs)
any_shade(xs)

aq_s <- as_shadow(airquality)

is_shade(aq_s)
are_shade(aq_s)
any_shade(aq_s)
any_shade(airquality)


xs <- shade(c(NA, 1, 2, "3"))

is_shade(xs)
are_shade(xs)
any_shade(xs)

aq_s <- as_shadow(airquality)

is_shade(aq_s)
are_shade(aq_s)
any_shade(aq_s)
any_shade(airquality)

Label a missing from one column

Description

Label whether a value is missing in a row of one columns.

Usage

label_miss_1d(x1)
label_miss_1d(x1)

Arguments

`x1`	a variable of a dataframe

Value

a vector indicating whether any of these rows had missing values

Note

can we generalise label_miss to work for any number of variables?

Examples


label_miss_1d(airquality$Ozone)

label_miss_1d(airquality$Ozone)

label_miss_2d

Description

Label whether a value is missing in either row of two columns.

Usage

label_miss_2d(x1, x2)
label_miss_2d(x1, x2)

Arguments

`x1`	a variable of a dataframe
`x2`	another variable of a dataframe

Value

a vector indicating whether any of these rows had missing values

Examples


label_miss_2d(airquality$Ozone, airquality$Solar.R)

label_miss_2d(airquality$Ozone, airquality$Solar.R)

Is there a missing value in the row of a dataframe?

Description

Creates a character vector describing presence/absence of missing values

Usage

label_missings(data, ..., missing = "Missing", complete = "Not Missing")
label_missings(data, ..., missing = "Missing", complete = "Not Missing")

Arguments

`data`	a dataframe or set of vectors of the same length
`...`	extra variable to label
`missing`	character a label for when values are missing - defaults to "Missing"
`complete`	character character a label for when values are complete - defaults to "Not Missing"

Value

character vector of "Missing" and "Not Missing".

Examples


label_missings(airquality)

## Not run: 
library(dplyr)

airquality %>%
  mutate(is_missing = label_missings(airquality)) %>%
  head()

airquality %>%
  mutate(is_missing = label_missings(airquality,
                                     missing = "definitely missing",
                                     complete = "absolutely complete")) %>%
  head()

## End(Not run)
label_missings(airquality)

## Not run: 
library(dplyr)

airquality %>%
  mutate(is_missing = label_missings(airquality)) %>%
  head()

airquality %>%
  mutate(is_missing = label_missings(airquality,
                                     missing = "definitely missing",
                                     complete = "absolutely complete")) %>%
  head()

## End(Not run)

Little's missing completely at random (MCAR) test

Description

Use Little's (1988) test statistic to assess if data is missing completely at random (MCAR). The null hypothesis in this test is that the data is MCAR, and the test statistic is a chi-squared value. The example below shows the output of mcar_test(airquality). Given the high statistic value and low p-value, we can conclude the airquality data is not missing completely at random.

Usage

mcar_test(data)
mcar_test(data)

Arguments

data

A data frame

Value

A tibble::tibble() with one row and four columns:

`statistic`	Chi-squared statistic for Little's test
`df`	Degrees of freedom used for chi-squared statistic
`p.value`	P-value for the chi-squared statistic
`missing.patterns`	Number of missing data patterns in the data

Note

Code is adapted from LittleMCAR() in the now-orphaned BaylorEdPsych package: https://rdrr.io/cran/BaylorEdPsych/man/LittleMCAR.html. Some of code is adapted from Eric Stemmler: https://web.archive.org/web/20201120030409/https://stats-bayes.com/post/2020/08/14/r-function-for-little-s-test-for-data-missing-completely-at-random/ using Maximum likelihood estimation from norm.

Author(s)

Andrew Heiss, [email protected]

References

Little, Roderick J. A. 1988. "A Test of Missing Completely at Random for Multivariate Data with Missing Values." Journal of the American Statistical Association 83 (404): 1198–1202. doi:10.1080/01621459.1988.10478722.

Examples

mcar_test(airquality)
mcar_test(oceanbuoys)

# If there are non-numeric columns, there will be a warning
mcar_test(riskfactors)

mcar_test(airquality)
mcar_test(oceanbuoys)

# If there are non-numeric columns, there will be a warning
mcar_test(riskfactors)

Summarise the missingness in each case

Description

Provide a data.frame containing each case (row), the number and percent of missing values in each case.

Usage

miss_case_cumsum(data)
miss_case_cumsum(data)

Arguments

data

a dataframe

Value

a tibble containing the number and percent of missing data in each case

Examples


miss_case_cumsum(airquality)

## Not run: 
library(dplyr)

airquality %>%
  group_by(Month) %>%
  miss_case_cumsum()

## End(Not run)
miss_case_cumsum(airquality)

## Not run: 
library(dplyr)

airquality %>%
  group_by(Month) %>%
  miss_case_cumsum()

## End(Not run)

Summarise the missingness in each case

Description

Provide a summary for each case in the data of the number, percent missings, and cumulative sum of missings of the order of the variables. By default, it orders by the most missings in each variable.

Usage

miss_case_summary(data, order = TRUE, add_cumsum = FALSE, ...)
miss_case_summary(data, order = TRUE, add_cumsum = FALSE, ...)

Arguments

`data`	a data.frame
`order`	a logical indicating whether or not to order the result by n_miss. Defaults to TRUE. If FALSE, order of cases is the order input.
`add_cumsum`	logical indicating whether or not to add the cumulative sum of missings to the data. This can be useful when exploring patterns of nonresponse. These are calculated as the cumulative sum of the missings in the variables as they are first presented to the function.
`...`	extra arguments

Value

a tibble of the percent of missing data in each case.

Examples


miss_case_summary(airquality)

## Not run: 
# works with group_by from dplyr
library(dplyr)
airquality %>%
  group_by(Month) %>%
  miss_case_summary()

## End(Not run)

miss_case_summary(airquality)

## Not run: 
# works with group_by from dplyr
library(dplyr)
airquality %>%
  group_by(Month) %>%
  miss_case_summary()

## End(Not run)

Tabulate missings in cases.

Description

Provide a tidy table of the number of cases with 0, 1, 2, up to n, missing values and the proportion of the number of cases those cases make up.

Usage

miss_case_table(data)
miss_case_table(data)

Arguments

data

a dataframe

Value

a dataframe

Examples


miss_case_table(airquality)
## Not run: 
library(dplyr)
airquality %>%
  group_by(Month) %>%
  miss_case_table()

## End(Not run)
miss_case_table(airquality)
## Not run: 
library(dplyr)
airquality %>%
  group_by(Month) %>%
  miss_case_table()

## End(Not run)

Proportions of missings in data, variables, and cases.

Description

Return missing data info about the dataframe, the variables, and the cases. Specifically, returning how many elements in a dataframe contain a missing value, how many elements in a variable contain a missing value, and how many elements in a case contain a missing.

Usage

miss_prop_summary(data)
miss_prop_summary(data)

Arguments

data

a dataframe

Value

a dataframe

Examples


miss_prop_summary(airquality)
## Not run: 
library(dplyr)
# respects dplyr::group_by
airquality %>% group_by(Month) %>% miss_prop_summary()

## End(Not run)

miss_prop_summary(airquality)
## Not run: 
library(dplyr)
# respects dplyr::group_by
airquality %>% group_by(Month) %>% miss_prop_summary()

## End(Not run)

Search and present different kinds of missing values

Description

Searching for different kinds of missing values is really annoying. If you have values like -99 in your data, when they shouldn't be there, or they should be encoded as missing, it can be difficult to ascertain if they are there, and if so, where they are. miss_scan_count makes it easier for users to search for particular occurrences of these values across their variables. Note that the searches are done with regular expressions, which are special ways of searching for text. See the example below to see how to look for characters like ⁠?⁠.

Usage

miss_scan_count(data, search)
miss_scan_count(data, search)

Arguments

`data`	data
`search`	values to search for

Value

a dataframe of the occurrences of the values you searched for

Examples


dat_ms <- tibble::tribble(~x,  ~y,    ~z,  ~specials,
                         1,   "A",   -100, "?",
                         3,   "N/A", -99,  "!",
                         NA,  NA,    -98,  ".",
                         -99, "E",   -101, "*",
                         -98, "F",   -1,  "-")

miss_scan_count(dat_ms,-99)
miss_scan_count(dat_ms,c(-99,-98))
miss_scan_count(dat_ms,c("-99","-98","N/A"))
miss_scan_count(dat_ms, "\\?")
miss_scan_count(dat_ms, "\\!")
miss_scan_count(dat_ms, "\\.")
miss_scan_count(dat_ms, "\\*")
miss_scan_count(dat_ms, "-")
miss_scan_count(dat_ms,common_na_strings)


dat_ms <- tibble::tribble(~x,  ~y,    ~z,  ~specials,
                         1,   "A",   -100, "?",
                         3,   "N/A", -99,  "!",
                         NA,  NA,    -98,  ".",
                         -99, "E",   -101, "*",
                         -98, "F",   -1,  "-")

miss_scan_count(dat_ms,-99)
miss_scan_count(dat_ms,c(-99,-98))
miss_scan_count(dat_ms,c("-99","-98","N/A"))
miss_scan_count(dat_ms, "\\?")
miss_scan_count(dat_ms, "\\!")
miss_scan_count(dat_ms, "\\.")
miss_scan_count(dat_ms, "\\*")
miss_scan_count(dat_ms, "-")
miss_scan_count(dat_ms,common_na_strings)

Collate summary measures from naniar into one tibble

Description

miss_summary performs all of the missing data helper summaries and puts them into lists within a tibble

Usage

miss_summary(data, order = TRUE)
miss_summary(data, order = TRUE)

Arguments

`data`	a dataframe
`order`	whether or not to order the result by n_miss

Value

a tibble of missing data summaries

Examples


s_miss <- miss_summary(airquality)
s_miss$miss_df_prop
s_miss$miss_case_table
s_miss$miss_var_summary
# etc, etc, etc.

## Not run: 
library(dplyr)
s_miss_group <- group_by(airquality, Month) %>% miss_summary()
s_miss_group$miss_df_prop
s_miss_group$miss_case_table
# etc, etc, etc.

## End(Not run)

s_miss <- miss_summary(airquality)
s_miss$miss_df_prop
s_miss$miss_case_table
s_miss$miss_var_summary
# etc, etc, etc.

## Not run: 
library(dplyr)
s_miss_group <- group_by(airquality, Month) %>% miss_summary()
s_miss_group$miss_df_prop
s_miss_group$miss_case_table
# etc, etc, etc.

## End(Not run)

Cumulative sum of the number of missings in each variable

Description

Calculate the cumulative sum of number & percentage of missingness for each variable.

Usage

miss_var_cumsum(data)
miss_var_cumsum(data)

Arguments

data

a data.frame

Value

a tibble of the cumulative sum of missing data in each variable

Examples


miss_var_cumsum(airquality)
## Not run: 
library(dplyr)

# respects dplyr::group_by

airquality %>%
  group_by(Month) %>%
  miss_var_cumsum()

## End(Not run)
miss_var_cumsum(airquality)
## Not run: 
library(dplyr)

# respects dplyr::group_by

airquality %>%
  group_by(Month) %>%
  miss_var_cumsum()

## End(Not run)

Find the number of missing and complete values in a single run

Description

It us useful to find the number of missing values that occur in a single run. The function, miss_var_run(), returns a dataframe with the column names "run_length" and "is_na", which describe the length of the run, and whether that run describes a missing value.

Usage

miss_var_run(data, var)
miss_var_run(data, var)

Arguments

`data`	data.frame
`var`	a bare variable name

Value

dataframe with column names "run_length" and "is_na", which describe the length of the run, and whether that run describes a missing value.

Examples


miss_var_run(pedestrian, hourly_counts)

## Not run: 
# find the number of runs missing/complete for each month
library(dplyr)


pedestrian %>%
  group_by(month) %>%
  miss_var_run(hourly_counts)

library(ggplot2)

# explore the number of missings in a given run
miss_var_run(pedestrian, hourly_counts) %>%
  filter(is_na == "missing") %>%
  count(run_length) %>%
  ggplot(aes(x = run_length,
             y = n)) +
      geom_col()

# look at the number of missing values and the run length of these.
miss_var_run(pedestrian, hourly_counts) %>%
  ggplot(aes(x = is_na,
             y = run_length)) +
      geom_boxplot()

# using group_by
 pedestrian %>%
   group_by(month) %>%
   miss_var_run(hourly_counts)

## End(Not run)

miss_var_run(pedestrian, hourly_counts)

## Not run: 
# find the number of runs missing/complete for each month
library(dplyr)


pedestrian %>%
  group_by(month) %>%
  miss_var_run(hourly_counts)

library(ggplot2)

# explore the number of missings in a given run
miss_var_run(pedestrian, hourly_counts) %>%
  filter(is_na == "missing") %>%
  count(run_length) %>%
  ggplot(aes(x = run_length,
             y = n)) +
      geom_col()

# look at the number of missing values and the run length of these.
miss_var_run(pedestrian, hourly_counts) %>%
  ggplot(aes(x = is_na,
             y = run_length)) +
      geom_boxplot()

# using group_by
 pedestrian %>%
   group_by(month) %>%
   miss_var_run(hourly_counts)

## End(Not run)

Summarise the number of missings for a given repeating span on a variable

Description

To summarise the missing values in a time series object it can be useful to calculate the number of missing values in a given time period. miss_var_span takes a data.frame object, a variable, and a span_every argument and returns a dataframe containing the number of missing values within each span. When the number of observations isn't a perfect multiple of the span length, the final span is whatever the last remainder is. For example, the pedestrian dataset has 37,700 rows. If the span is set to 4000, then there will be 1700 rows remaining. This can be provided using modulo (%%): nrow(data) %% 4000. This remainder number is provided in n_in_span.

Usage

miss_var_span(data, var, span_every)
miss_var_span(data, var, span_every)

Arguments

`data`	data.frame
`var`	bare unquoted variable name of interest.
`span_every`	integer describing the length of the span to be explored

Value

dataframe with variables n_miss, n_complete, prop_miss, and prop_complete, which describe the number, or proportion of missing or complete values within that given time span. The final variable, n_in_span states how many observations are in the span.

Examples


miss_var_span(data = pedestrian,
             var = hourly_counts,
             span_every = 168)

## Not run: 
 library(dplyr)
 pedestrian %>%
   group_by(month) %>%
     miss_var_span(var = hourly_counts,
                   span_every = 168)

## End(Not run)
miss_var_span(data = pedestrian,
             var = hourly_counts,
             span_every = 168)

## Not run: 
 library(dplyr)
 pedestrian %>%
   group_by(month) %>%
     miss_var_span(var = hourly_counts,
                   span_every = 168)

## End(Not run)

Summarise the missingness in each variable

Description

Provide a summary for each variable of the number, percent missings, and cumulative sum of missings of the order of the variables. By default, it orders by the most missings in each variable.

Usage

miss_var_summary(data, order = FALSE, add_cumsum = FALSE, digits, ...)
miss_var_summary(data, order = FALSE, add_cumsum = FALSE, digits, ...)

Arguments

`data`	a data.frame
`order`	a logical indicating whether to order the result by `n_miss`. Defaults to TRUE. If FALSE, order of variables is the order input.
`add_cumsum`	logical indicating whether or not to add the cumulative sum of missings to the data. This can be useful when exploring patterns of nonresponse. These are calculated as the cumulative sum of the missings in the variables as they are first presented to the function.
`digits`	how many digits to display in `pct_miss` column. Useful when you are working with small amounts of missing data.
`...`	extra arguments

Value

a tibble of the percent of missing data in each variable

Note

n_miss_cumsum is calculated as the cumulative sum of missings in the variables in the order that they are given in the data when entering the function

Examples


miss_var_summary(airquality)
miss_var_summary(oceanbuoys, order = TRUE)

## Not run: 
# works with group_by from dplyr
library(dplyr)
airquality %>%
  group_by(Month) %>%
  miss_var_summary()

## End(Not run)
miss_var_summary(airquality)
miss_var_summary(oceanbuoys, order = TRUE)

## Not run: 
# works with group_by from dplyr
library(dplyr)
airquality %>%
  group_by(Month) %>%
  miss_var_summary()

## End(Not run)

Tabulate the missings in the variables

Description

Provide a tidy table of the number of variables with 0, 1, 2, up to n, missing values and the proportion of the number of variables those variables make up.

Usage

miss_var_table(data)
miss_var_table(data)

Arguments

data

a dataframe

Value

a dataframe

Examples


miss_var_table(airquality)
## Not run: 
library(dplyr)
airquality %>%
  group_by(Month) %>%
  miss_var_table()

## End(Not run)
miss_var_table(airquality)
## Not run: 
library(dplyr)
airquality %>%
  group_by(Month) %>%
  miss_var_table()

## End(Not run)

Which variables contain missing values?

Description

It can be helpful when writing other functions to just return the names of the variables that contain missing values. miss_var_which returns a vector of variable names that contain missings. It will return NULL when there are no missings.

Usage

miss_var_which(data)
miss_var_which(data)

Arguments

data

a data.frame

Value

character vector of variable names

Examples

miss_var_which(airquality)

miss_var_which(mtcars)

miss_var_which(airquality)

miss_var_which(mtcars)

Proportion of variables containing missings or complete values

Description

Defunct. Please see prop_miss_var(), prop_complete_var(), pct_miss_var(), pct_complete_var(), prop_miss_case(), prop_complete_case(), pct_miss_case(), pct_complete_case().

Usage

miss_var_prop(...)

complete_var_prop(...)

miss_var_pct(...)

complete_var_pct(...)

miss_case_prop(...)

complete_case_prop(...)

miss_case_pct(...)

complete_case_pct(...)
miss_var_prop(...)

complete_var_prop(...)

miss_var_pct(...)

complete_var_pct(...)

miss_case_prop(...)

complete_case_prop(...)

miss_case_pct(...)

complete_case_pct(...)

Arguments

...

arguments

Return the number of complete values

Description

A complement to n_miss

Usage

n_complete(x)
n_complete(x)

Arguments

x

a vector

Value

numeric number of complete values

Examples


n_complete(airquality)
n_complete(airquality$Ozone)

n_complete(airquality)
n_complete(airquality$Ozone)

Return a vector of the number of complete values in each row

Description

Substitute for rowSums(!is.na(data)) but it also checks if input is NULL or is a dataframe

Usage

n_complete_row(data)
n_complete_row(data)

Arguments

data

a dataframe

Value

numeric vector of the number of complete values in each row

Examples


n_complete_row(airquality)

n_complete_row(airquality)

Return the number of missing values

Description

Substitute for sum(is.na(data))

Usage

n_miss(x)
n_miss(x)

Arguments

x

a vector

Value

numeric the number of missing values

Examples


n_miss(airquality)
n_miss(airquality$Ozone)

n_miss(airquality)
n_miss(airquality$Ozone)

Return a vector of the number of missing values in each row

Description

Substitute for rowSums(is.na(data)), but it also checks if input is NULL or is a dataframe

Usage

n_miss_row(data)
n_miss_row(data)

Arguments

data

a dataframe

Value

numeric vector of the number of missing values in each row

Examples


n_miss_row(airquality)

n_miss_row(airquality)

The number of variables with complete values

Description

This function calculates the number of variables that contain a complete value

Usage

n_var_complete(data)

n_case_complete(data)
n_var_complete(data)

n_case_complete(data)

Arguments

data

data.frame

Value

integer number of complete values

Examples


# how many variables contain complete values?
n_var_complete(airquality)
n_case_complete(airquality)

# how many variables contain complete values?
n_var_complete(airquality)
n_case_complete(airquality)

The number of variables or cases with missing values

Description

This function calculates the number of variables or cases that contain a missing value

Usage

n_var_miss(data)

n_case_miss(data)
n_var_miss(data)

n_case_miss(data)

Arguments

data

data.frame

Value

integer, number of missings

Examples

# how many variables contain missing values?
n_var_miss(airquality)
n_case_miss(airquality)

# how many variables contain missing values?
n_var_miss(airquality)
n_case_miss(airquality)

Convert data into nabular form by binding shade to it

Description

Binding a shadow matrix to a regular dataframe converts it into nabular data, which makes it easier to visualise and work with missing data.

Usage

nabular(data, only_miss = FALSE, ...)
nabular(data, only_miss = FALSE, ...)

Arguments

`data`	a dataframe
`only_miss`	logical - if FALSE (default) it will bind a dataframe with all of the variables duplicated with their shadow. Setting this to TRUE will bind variables only those variables that contain missing values. See the examples for more details.
`...`	extra options to pass to `recode_shadow()` - a work in progress.

Value

data with the added variable shifted and the suffix ⁠_NA⁠

Examples


aq_nab <- nabular(airquality)
aq_s <- bind_shadow(airquality)

all.equal(aq_nab, aq_s)

aq_nab <- nabular(airquality)
aq_s <- bind_shadow(airquality)

all.equal(aq_nab, aq_s)

naniar: Data Structures, Summaries, and Visualisations for Missing Data

Description

naniar is a package to make it easier to summarise and handle missing values in R. It strives to do this in a way that is as consistent with tidyverse principles as possible. The work is fully discussed at Tierney & Cook (2023) doi:10.18637/jss.v105.i07.

Author(s)

Maintainer: Nicholas Tierney [email protected] (ORCID)

Authors:

Di Cook [email protected] (ORCID)
Miles McBain [email protected] (ORCID)
Colin Fay [email protected] (ORCID)

Other contributors:

Mitchell O'Hara-Wild [contributor]
Jim Hester [email protected] [contributor]
Luke Smith [contributor]
Andrew Heiss [email protected] (ORCID) [contributor]

West Pacific Tropical Atmosphere Ocean Data, 1993 & 1997.

Description

Real-time data from moored ocean buoys for improved detection, understanding and prediction of El Ni'o and La Ni'a. The data is collected by the Tropical Atmosphere Ocean project (https://www.pmel.noaa.gov/gtmba/pmel-theme/pacific-ocean-tao).

Usage

data(oceanbuoys)
data(oceanbuoys)

Format

An object of class tbl_df (inherits from tbl, data.frame) with 736 rows and 8 columns.

Details

Format: a data frame with 736 observations on the following 8 variables.

year: A numeric with levels 1993 1997.
latitude: A numeric with levels -5 -2 0.
longitude: A numeric with levels -110 -95.
sea_temp_c: Sea surface temperature(degree Celsius), measured by the TAO buoys at one meter below the surface.
air_temp_c: Air temperature(degree Celsius), measured by the TAO buoys three meters above the sea surface.
humidity: Relative humidity(%), measured by the TAO buoys 3 meters above the sea surface.
wind_ew: The East-West wind vector components(M/s). TAO buoys measure the wind speed and direction four meters above the sea surface. If it is positive, the East-West component of the wind is blowing towards the East. If it is negative, this component is blowing towards the West.
wind_ns: The North-South wind vector components(M/s). TAO buoys measure the wind speed and direction four meters above the sea surface. If it is positive, the North-South component of the wind is blowing towards the North. If it is negative, this component is blowing towards the South.

Source

https://www.pmel.noaa.gov/tao/drupal/disdel/

Examples


vis_miss(oceanbuoys)

# Look at the missingness in the variables
miss_var_summary(oceanbuoys)
## Not run: 
# Look at the missingness in air temperature and humidity
library(ggplot2)
p <-
ggplot(oceanbuoys,
       aes(x = air_temp_c,
           y = humidity)) +
     geom_miss_point()

 p

 # for each year?
 p + facet_wrap(~year)

 # this shows that there are more missing values in humidity in 1993, and
 # more air temperature missing values in 1997

 # see more examples in the vignette, "getting started with naniar".

## End(Not run)
vis_miss(oceanbuoys)

# Look at the missingness in the variables
miss_var_summary(oceanbuoys)
## Not run: 
# Look at the missingness in air temperature and humidity
library(ggplot2)
p <-
ggplot(oceanbuoys,
       aes(x = air_temp_c,
           y = humidity)) +
     geom_miss_point()

 p

 # for each year?
 p + facet_wrap(~year)

 # this shows that there are more missing values in humidity in 1993, and
 # more air temperature missing values in 1997

 # see more examples in the vignette, "getting started with naniar".

## End(Not run)

Return the percent of complete values

Description

The complement to pct_miss

Usage

pct_complete(x)
pct_complete(x)

Arguments

`x`	vector or data.frame

Value

numeric percent of complete values

Examples


pct_complete(airquality)
pct_complete(airquality$Ozone)

pct_complete(airquality)
pct_complete(airquality$Ozone)

Return the percent of missing values

Description

This is shorthand for mean(is.na(x)) * 100

Usage

pct_miss(x)
pct_miss(x)

Arguments

`x`	vector or data.frame

Value

numeric the percent of missing values in x

Examples


pct_miss(airquality)
pct_miss(airquality$Ozone)

pct_miss(airquality)
pct_miss(airquality$Ozone)

Percentage of cases that contain a missing or complete values.

Description

Calculate the percentage of cases (rows) that contain a missing or complete value.

Usage

pct_miss_case(data)

pct_complete_case(data)
pct_miss_case(data)

pct_complete_case(data)

Arguments

data

a dataframe

Value

numeric the percentage of cases that contain a missing or complete value

Examples


pct_miss_case(airquality)
pct_complete_case(airquality)

pct_miss_case(airquality)
pct_complete_case(airquality)

Percentage of variables containing missings or complete values

Description

Calculate the percentage of variables that contain a single missing or complete value.

Usage

pct_miss_var(data)

pct_complete_var(data)
pct_miss_var(data)

pct_complete_var(data)

Arguments

data

a dataframe

Value

numeric the percent of variables that contain missing or complete data

Examples


prop_miss_var(airquality)
prop_complete_var(airquality)

prop_miss_var(airquality)
prop_complete_var(airquality)

Pedestrian count information around Melbourne for 2016

Description

This dataset contains hourly counts of pedestrians from 4 sensors around Melbourne: Birrarung Marr, Bourke Street Mall, Flagstaff station, and Spencer St-Collins St (south), recorded from January 1st 2016 at 00:00:00 to December 31st 2016 at 23:00:00. The data is made free and publicly available from https://data.melbourne.vic.gov.au/explore/dataset/pedestrian-counting-system-monthly-counts-per-hour/information/

Usage

data(pedestrian)
data(pedestrian)

Format

A tibble with 37,700 rows and 9 variables:

hourly_counts: (integer) the number of pedestrians counted at that sensor at that time
date_time: (POSIXct, POSIXt) The time that the count was taken
year: (integer) Year of record
month: (factor) Month of record as an ordered factor (1 = January, 12 = December)
month_day: (integer) Full day of the month
week_day: (factor) Full day of the week as an ordered factor (1 = Sunday, 7 = Saturday)
hour: (integer) The hour of the day in 24 hour format
sensor_id: (integer) the id of the sensor
sensor_name: (character) the full name of the sensor

Source

https://data.melbourne.vic.gov.au/explore/dataset/pedestrian-counting-system-monthly-counts-per-hour/information/

Examples

# explore the missingness with vis_miss

vis_miss(pedestrian)

# Look at the missingness in the variables
miss_var_summary(pedestrian)

## Not run: 
# There is only missingness in hourly_counts
# Look at the missingness over a rolling window
library(ggplot2)
gg_miss_span(pedestrian, hourly_counts, span_every = 3000)

## End(Not run)
# explore the missingness with vis_miss

vis_miss(pedestrian)

# Look at the missingness in the variables
miss_var_summary(pedestrian)

## Not run: 
# There is only missingness in hourly_counts
# Look at the missingness over a rolling window
library(ggplot2)
gg_miss_span(pedestrian, hourly_counts, span_every = 3000)

## End(Not run)

Return the proportion of complete values

Description

The complement to prop_miss

Usage

prop_complete(x)
prop_complete(x)

Arguments

`x`	vector or data.frame

Value

numeric proportion of complete values

Examples


prop_complete(airquality)
prop_complete(airquality$Ozone)

prop_complete(airquality)
prop_complete(airquality$Ozone)

Return a vector of the proportion of missing values in each row

Description

Substitute for rowMeans(!is.na(data)), but it also checks if input is NULL or is a dataframe

Usage

prop_complete_row(data)
prop_complete_row(data)

Arguments

data

a dataframe

Value

numeric vector of the proportion of missing values in each row

Examples


prop_complete_row(airquality)

prop_complete_row(airquality)

Return the proportion of missing values

Description

This is shorthand for mean(is.na(x))

Usage

prop_miss(x)
prop_miss(x)

Arguments

`x`	vector or data.frame

Value

numeric the proportion of missing values in x

Examples


prop_miss(airquality)
prop_miss(airquality$Ozone)

prop_miss(airquality)
prop_miss(airquality$Ozone)

Return a vector of the proportion of missing values in each row

Description

Substitute for rowMeans(is.na(data)), but it also checks if input is NULL or is a dataframe

Usage

prop_miss_row(data)
prop_miss_row(data)

Arguments

data

a dataframe

Value

numeric vector of the proportion of missing values in each row

Examples


prop_miss_row(airquality)

prop_miss_row(airquality)

Proportion of cases that contain a missing or complete values.

Description

Calculate the proportion of cases (rows) that contain missing or complete values.

Usage

prop_miss_case(data)

prop_complete_case(data)
prop_miss_case(data)

prop_complete_case(data)

Arguments

data

a dataframe

Value

numeric the proportion of cases that contain a missing or complete value

Examples


prop_miss_case(airquality)
prop_complete_case(airquality)

prop_miss_case(airquality)
prop_complete_case(airquality)

Proportion of variables containing missings or complete values

Description

Calculate the proportion of variables that contain a single missing or complete values.

Usage

prop_miss_var(data)

prop_complete_var(data)
prop_miss_var(data)

prop_complete_var(data)

Arguments

data

a dataframe

Value

numeric the proportion of variables that contain missing or complete data

Examples


prop_miss_var(airquality)
prop_complete_var(airquality)

prop_miss_var(airquality)
prop_complete_var(airquality)

Add special missing values to the shadow matrix

Description

It can be useful to add special missing values, naniar supports this with the recode_shadow function.

Usage

recode_shadow(data, ...)

## S3 method for class 'data.frame'
recode_shadow(data, ...)

## S3 method for class 'grouped_df'
recode_shadow(data, ...)
recode_shadow(data, ...)

## S3 method for class 'data.frame'
recode_shadow(data, ...)

## S3 method for class 'grouped_df'
recode_shadow(data, ...)

Arguments

`data`	data.frame
`...`	A sequence of two-sided formulas as in dplyr::case_when, but when a wrapper function `.where` written around it.

Value

a dataframe with altered shadows

Examples


df <- tibble::tribble(
~wind, ~temp,
-99,    45,
68,    NA,
72,    25
)

dfs <- bind_shadow(df)

dfs

recode_shadow(dfs, temp = .where(wind == -99 ~ "bananas"))

recode_shadow(dfs,
              temp = .where(wind == -99 ~ "bananas")) %>%
recode_shadow(wind = .where(wind == -99 ~ "apples"))

df <- tibble::tribble(
~wind, ~temp,
-99,    45,
68,    NA,
72,    25
)

dfs <- bind_shadow(df)

dfs

recode_shadow(dfs, temp = .where(wind == -99 ~ "bananas"))

recode_shadow(dfs,
              temp = .where(wind == -99 ~ "bananas")) %>%
recode_shadow(wind = .where(wind == -99 ~ "apples"))

Replace NA value with provided value

Description

This function helps you replace NA values with a single provided value. This can be classed as a kind of imputation, and is powered by impute_fixed(). However, we would generally recommend to impute using other model based approaches. See the simputation package, for example simputation::impute_lm(). See tidyr::replace_na() for a slightly different approach, dplyr::coalesce() for replacing NAs with values from other vectors, and dplyr::na_if() to replace specified values with NA.

Usage

replace_na_with(x, value)
replace_na_with(x, value)

Arguments

`x`	vector
`value`	value to replace

Value

vector with replaced values

Examples


library(naniar)
x <- c(1:5, NA, NA, NA)
x
replace_na_with(x, 0L)
replace_na_with(x, "unknown")

library(dplyr)
dat <- tibble(
  ones = c(NA,1,1),
  twos = c(NA,NA, 2),
  threes = c(NA, NA, NA)
)

dat

dat %>%
  mutate(
    ones = replace_na_with(ones, 0),
    twos = replace_na_with(twos, -99),
    threes = replace_na_with(threes, "unknowns")
  )

dat %>%
  mutate(
    across(
      everything(),
      \(x) replace_na_with(x, -99)
    )
  )

library(naniar)
x <- c(1:5, NA, NA, NA)
x
replace_na_with(x, 0L)
replace_na_with(x, "unknown")

library(dplyr)
dat <- tibble(
  ones = c(NA,1,1),
  twos = c(NA,NA, 2),
  threes = c(NA, NA, NA)
)

dat

dat %>%
  mutate(
    ones = replace_na_with(ones, 0),
    twos = replace_na_with(twos, -99),
    threes = replace_na_with(threes, "unknowns")
  )

dat %>%
  mutate(
    across(
      everything(),
      \(x) replace_na_with(x, -99)
    )
  )

Replace values with missings

Description

This function is Defunct, please see replace_with_na().

Usage

replace_to_na(...)
replace_to_na(...)

Arguments

...

additional arguments for methods.

Value

values replaced by NA

Replace values with missings

Description

Specify variables and their values that you want to convert to missing values. This is a complement to tidyr::replace_na.

Usage

replace_with_na(data, replace = list(), ...)
replace_with_na(data, replace = list(), ...)

Arguments

`data`	A data.frame
`replace`	A named list given the NA to replace values for each column
`...`	additional arguments for methods. Currently unused

Value

Dataframe with values replaced by NA.

Examples


dat_ms <- tibble::tribble(~x,  ~y,    ~z,
                         1,   "A",   -100,
                         3,   "N/A", -99,
                         NA,  NA,    -98,
                         -99, "E",   -101,
                         -98, "F",   -1)

replace_with_na(dat_ms,
               replace = list(x = -99))

replace_with_na(dat_ms,
             replace = list(x = c(-99, -98)))

replace_with_na(dat_ms,
             replace = list(x = c(-99, -98),
                          y = c("N/A"),
                          z = c(-101)))
dat_ms <- tibble::tribble(~x,  ~y,    ~z,
                         1,   "A",   -100,
                         3,   "N/A", -99,
                         NA,  NA,    -98,
                         -99, "E",   -101,
                         -98, "F",   -1)

replace_with_na(dat_ms,
               replace = list(x = -99))

replace_with_na(dat_ms,
             replace = list(x = c(-99, -98)))

replace_with_na(dat_ms,
             replace = list(x = c(-99, -98),
                          y = c("N/A"),
                          z = c(-101)))

Replace all values with NA where a certain condition is met

Description

This function takes a dataframe and replaces all values that meet the condition specified as an NA value, following a special syntax.

Usage

replace_with_na_all(data, condition)
replace_with_na_all(data, condition)

Arguments

`data`	A dataframe
`condition`	A condition required to be TRUE to set NA. Here, the condition is specified with a formula, following the syntax: `⁠~.x {condition}⁠`. For example, writing `~.x < 20` would mean "where a variable value is less than 20, replace with NA".

Examples

dat_ms <- tibble::tribble(~x,  ~y,    ~z,
                          1,   "A",   -100,
                          3,   "N/A", -99,
                          NA,  NA,    -98,
                          -99, "E",   -101,
                          -98, "F",   -1)

dat_ms
#replace all instances of -99 with NA
replace_with_na_all(data = dat_ms,
                    condition = ~.x == -99)

# replace all instances of -99 or -98, or "N/A" with NA
replace_with_na_all(dat_ms,
                    condition = ~.x %in% c(-99, -98, "N/A"))
# replace all instances of common na strings
replace_with_na_all(dat_ms,
                    condition = ~.x %in% common_na_strings)

# where works with functions
replace_with_na_all(airquality, ~ sqrt(.x) < 5)

dat_ms <- tibble::tribble(~x,  ~y,    ~z,
                          1,   "A",   -100,
                          3,   "N/A", -99,
                          NA,  NA,    -98,
                          -99, "E",   -101,
                          -98, "F",   -1)

dat_ms
#replace all instances of -99 with NA
replace_with_na_all(data = dat_ms,
                    condition = ~.x == -99)

# replace all instances of -99 or -98, or "N/A" with NA
replace_with_na_all(dat_ms,
                    condition = ~.x %in% c(-99, -98, "N/A"))
# replace all instances of common na strings
replace_with_na_all(dat_ms,
                    condition = ~.x %in% common_na_strings)

# where works with functions
replace_with_na_all(airquality, ~ sqrt(.x) < 5)

Replace specified variables with NA where a certain condition is met

Description

Replace specified variables with NA where a certain condition is met

Usage

replace_with_na_at(data, .vars, condition)
replace_with_na_at(data, .vars, condition)

Arguments

`data`	dataframe
`.vars`	A character string of variables to replace with NA values
`condition`	A condition required to be TRUE to set NA. Here, the condition is specified with a formula, following the syntax: `⁠~.x {condition}⁠`. For example, writing `~.x < 20` would mean "where a variable value is less than 20, replace with NA".

Value

a dataframe

Examples


dat_ms <- tibble::tribble(~x,  ~y,    ~z,
                          1,   "A",   -100,
                          3,   "N/A", -99,
                          NA,  NA,    -98,
                          -99, "E",   -101,
                          -98, "F",   -1)

dat_ms

replace_with_na_at(data = dat_ms,
                 .vars = "x",
                 condition = ~.x == -99)

replace_with_na_at(data = dat_ms,
                 .vars = c("x","z"),
                 condition = ~.x == -99)

# replace using values in common_na_strings
replace_with_na_at(data = dat_ms,
                 .vars = c("x","z"),
                 condition = ~.x %in% common_na_strings)


dat_ms <- tibble::tribble(~x,  ~y,    ~z,
                          1,   "A",   -100,
                          3,   "N/A", -99,
                          NA,  NA,    -98,
                          -99, "E",   -101,
                          -98, "F",   -1)

dat_ms

replace_with_na_at(data = dat_ms,
                 .vars = "x",
                 condition = ~.x == -99)

replace_with_na_at(data = dat_ms,
                 .vars = c("x","z"),
                 condition = ~.x == -99)

# replace using values in common_na_strings
replace_with_na_at(data = dat_ms,
                 .vars = c("x","z"),
                 condition = ~.x %in% common_na_strings)

Replace values with NA based on some condition, for variables that meet some predicate

Description

Replace values with NA based on some condition, for variables that meet some predicate

Usage

replace_with_na_if(data, .predicate, condition)
replace_with_na_if(data, .predicate, condition)

Arguments

`data`	Dataframe
`.predicate`	A predicate function to be applied to the columns or a logical vector.
`condition`	A condition required to be TRUE to set NA. Here, the condition is specified with a formula, following the syntax: `⁠~.x {condition}⁠`. For example, writing `~.x < 20` would mean "where a variable value is less than 20, replace with NA".

Value

Dataframe

Examples


dat_ms <- tibble::tribble(~x,  ~y,    ~z,
                          1,   "A",   -100,
                          3,   "N/A", -99,
                          NA,  NA,    -98,
                          -99, "E",   -101,
                          -98, "F",   -1)

dat_ms

replace_with_na_if(data = dat_ms,
                 .predicate = is.character,
                 condition = ~.x == "N/A")
replace_with_na_if(data = dat_ms,
                   .predicate = is.character,
                   condition = ~.x %in% common_na_strings)

replace_with_na(dat_ms,
              to_na = list(x = c(-99, -98),
                           y = c("N/A"),
                           z = c(-101)))


dat_ms <- tibble::tribble(~x,  ~y,    ~z,
                          1,   "A",   -100,
                          3,   "N/A", -99,
                          NA,  NA,    -98,
                          -99, "E",   -101,
                          -98, "F",   -1)

dat_ms

replace_with_na_if(data = dat_ms,
                 .predicate = is.character,
                 condition = ~.x == "N/A")
replace_with_na_if(data = dat_ms,
                   .predicate = is.character,
                   condition = ~.x %in% common_na_strings)

replace_with_na(dat_ms,
              to_na = list(x = c(-99, -98),
                           y = c("N/A"),
                           z = c(-101)))

The Behavioral Risk Factor Surveillance System (BRFSS) Survey Data, 2009.

Description

The data is a subset of the 2009 survey from BRFSS, an ongoing data collection program designed to measure behavioral risk factors for the adult population (18 years of age or older) living in households.

Usage

data(riskfactors)
data(riskfactors)

Format

An object of class tbl_df (inherits from tbl, data.frame) with 245 rows and 34 columns.

Source

https://www.cdc.gov/brfss/annual_data/annual_2009.htm

Examples


vis_miss(riskfactors)

# Look at the missingness in the variables
miss_var_summary(riskfactors)

# and now as a plot
gg_miss_var(riskfactors)

## Not run: 
# Look at the missingness in bmi and poor health
library(ggplot2)
p <-
ggplot(riskfactors,
       aes(x = health_poor,
           y = bmi)) +
     geom_miss_point()

 p

 # for each sex?
 p + facet_wrap(~sex)
 # for each education bracket?
 p + facet_wrap(~education)

## End(Not run)
vis_miss(riskfactors)

# Look at the missingness in the variables
miss_var_summary(riskfactors)

# and now as a plot
gg_miss_var(riskfactors)

## Not run: 
# Look at the missingness in bmi and poor health
library(ggplot2)
p <-
ggplot(riskfactors,
       aes(x = health_poor,
           y = bmi)) +
     geom_miss_point()

 p

 # for each sex?
 p + facet_wrap(~sex)
 # for each education bracket?
 p + facet_wrap(~education)

## End(Not run)

Scoped variants of `impute_mean`

Description

impute_mean imputes the mean for a vector. To get it to work on all variables, use impute_mean_all. To only impute variables that satisfy a specific condition, use the scoped variants, impute_below_at, and impute_below_if. To use ⁠_at⁠ effectively, you must know that ⁠_at`` affects variables selected with a character vector, or with ⁠vars()'.

Usage

impute_mean_all(.tbl)

impute_mean_at(.tbl, .vars)

impute_mean_if(.tbl, .predicate)
impute_mean_all(.tbl)

impute_mean_at(.tbl, .vars)

impute_mean_if(.tbl, .predicate)

Arguments

`.tbl`	a data.frame
`.vars`	variables to impute
`.predicate`	variables to impute

Details

Value

an dataset with values imputed

Examples

# select variables starting with a particular string.
impute_mean_all(airquality)

impute_mean_at(airquality,
               .vars = c("Ozone", "Solar.R"))

## Not run: 
library(dplyr)
impute_mean_at(airquality,
                .vars = vars(Ozone))

impute_mean_if(airquality,
                .predicate = is.numeric)

library(ggplot2)
airquality %>%
  bind_shadow() %>%
  impute_mean_all() %>%
  add_label_shadow() %>%
  ggplot(aes(x = Ozone,
             y = Solar.R,
             colour = any_missing)) +
         geom_point()

## End(Not run)

# select variables starting with a particular string.
impute_mean_all(airquality)

impute_mean_at(airquality,
               .vars = c("Ozone", "Solar.R"))

## Not run: 
library(dplyr)
impute_mean_at(airquality,
                .vars = vars(Ozone))

impute_mean_if(airquality,
                .predicate = is.numeric)

library(ggplot2)
airquality %>%
  bind_shadow() %>%
  impute_mean_all() %>%
  add_label_shadow() %>%
  ggplot(aes(x = Ozone,
             y = Solar.R,
             colour = any_missing)) +
         geom_point()

## End(Not run)

Scoped variants of `impute_median`

Description

impute_median imputes the median for a vector. To only impute many variables at once, we recommend that you use the across function workflow, shown in the examples for impute_median(). You can use the scoped variants, impute_median_all.impute_below_at, and impute_below_if to impute all, some, or just those variables meeting some condition, respectively. To use ⁠_at⁠ effectively, you must know that ⁠_at⁠ affects variables selected with a character vector, or with vars().

Usage

impute_median_all(.tbl)

impute_median_at(.tbl, .vars)

impute_median_if(.tbl, .predicate)
impute_median_all(.tbl)

impute_median_at(.tbl, .vars)

impute_median_if(.tbl, .predicate)

Arguments

`.tbl`	a data.frame
`.vars`	variables to impute
`.predicate`	variables to impute

Details

Value

an dataset with values imputed

Examples

# select variables starting with a particular string.
impute_median_all(airquality)

impute_median_at(airquality,
               .vars = c("Ozone", "Solar.R"))
library(dplyr)
impute_median_at(airquality,
                .vars = vars(Ozone))

impute_median_if(airquality,
                .predicate = is.numeric)

library(ggplot2)
airquality %>%
  bind_shadow() %>%
  impute_median_all() %>%
  add_label_shadow() %>%
  ggplot(aes(x = Ozone,
             y = Solar.R,
             colour = any_missing)) +
         geom_point()

# select variables starting with a particular string.
impute_median_all(airquality)

impute_median_at(airquality,
               .vars = c("Ozone", "Solar.R"))
library(dplyr)
impute_median_at(airquality,
                .vars = vars(Ozone))

impute_median_if(airquality,
                .predicate = is.numeric)

library(ggplot2)
airquality %>%
  bind_shadow() %>%
  impute_median_all() %>%
  add_label_shadow() %>%
  ggplot(aes(x = Ozone,
             y = Solar.R,
             colour = any_missing)) +
         geom_point()

Set a proportion or number of missing values

Description

Set a proportion or number of missing values

Usage

set_prop_miss(x, prop = 0.1)

set_n_miss(x, n = 1)
set_prop_miss(x, prop = 0.1)

set_n_miss(x, n = 1)

Arguments

`x`	vector of values to set missing
`prop`	proportion of values between 0 and 1 to set as missing
`n`	number of values to set missing

Value

vector with missing values added

Examples

vec <- rnorm(5)
set_prop_miss(vec, 0.2)
set_prop_miss(vec, 0.4)
set_n_miss(vec, 1)
set_n_miss(vec, 4)
vec <- rnorm(5)
set_prop_miss(vec, 0.2)
set_prop_miss(vec, 0.4)
set_n_miss(vec, 1)
set_n_miss(vec, 4)

Create new levels of missing

Description

Returns (at least) factors of !NA and NA, where !NA indicates a datum that is not missing, and NA indicates missingness. It also allows you to specify some new missings, if you like. This function is what powers the factor levels in as_shadow().

Usage

shade(x, ..., extra_levels = NULL)
shade(x, ..., extra_levels = NULL)

Arguments

`x`	a vector
`...`	additional levels of missing to add
`extra_levels`	extra levels you might to specify for the factor.

Examples

df <- tibble::tribble(
  ~wind, ~temp,
  -99,    45,
  68,    NA,
  72,    25
  )

shade(df$wind)

shade(df$wind, inst_fail = -99)

df <- tibble::tribble(
  ~wind, ~temp,
  -99,    45,
  68,    NA,
  72,    25
  )

shade(df$wind)

shade(df$wind, inst_fail = -99)

Reshape shadow data into a long format

Description

Once data is in nabular form, where the shadow is bound to the data, it can be useful to reshape it into a long format with the shadow columns in a separate grouping - so you have variable, value, and variable_NA and value_NA.

Usage

shadow_long(shadow_data, ..., fn_value_transform = NULL, only_main_vars = TRUE)
shadow_long(shadow_data, ..., fn_value_transform = NULL, only_main_vars = TRUE)

Arguments

`shadow_data`	a data.frame
`...`	bare name of variables that you want to focus on
`fn_value_transform`	function to transform the "value" column. Default is NULL, which defaults to `as.character`. Be aware that `as.numeric` may fail for some instances if it cannot coerce the value into numeric. See the examples.
`only_main_vars`	logical - do you want to filter down to main variables?

Value

data in long format, with columns variable, value, variable_NA, and value_NA.

Examples


aq_shadow <- nabular(airquality)

shadow_long(aq_shadow)

# then filter only on Ozone
shadow_long(aq_shadow, Ozone)

shadow_long(aq_shadow, Ozone, Solar.R)

# ensure `value` is numeric
shadow_long(aq_shadow, fn_value_transform = as.numeric)
shadow_long(aq_shadow, Ozone, Solar.R, fn_value_transform = as.numeric)


aq_shadow <- nabular(airquality)

shadow_long(aq_shadow)

# then filter only on Ozone
shadow_long(aq_shadow, Ozone)

shadow_long(aq_shadow, Ozone, Solar.R)

# ensure `value` is numeric
shadow_long(aq_shadow, fn_value_transform = as.numeric)
shadow_long(aq_shadow, Ozone, Solar.R, fn_value_transform = as.numeric)

Shift missing values to facilitate missing data exploration/visualisation

Description

shadow_shift transforms missing values to facilitate visualisation, and has different behaviour for different types of variables. For numeric variables, the values are shifted to 10% below the minimum value for a given variable plus some jittered noise, to separate repeated values, so that missing values can be visualised along with the rest of the data.

Usage

shadow_shift(...)
shadow_shift(...)

Arguments

...

arguments to impute_below().

Details

Examples

airquality$Ozone
shadow_shift(airquality$Ozone)
## Not run: 
library(dplyr)
airquality %>%
    mutate(Ozone_shift = shadow_shift(Ozone))

## End(Not run)
airquality$Ozone
shadow_shift(airquality$Ozone)
## Not run: 
library(dplyr)
airquality %>%
    mutate(Ozone_shift = shadow_shift(Ozone))

## End(Not run)

stat_miss_point

Description

stat_miss_point adds a geometry for displaying missingness to geom_point

Usage

stat_miss_point(
  mapping = NULL,
  data = NULL,
  prop_below = 0.1,
  jitter = 0.05,
  geom = "point",
  position = "identity",
  na.rm = FALSE,
  show.legend = NA,
  inherit.aes = TRUE,
  ...
)
stat_miss_point(
  mapping = NULL,
  data = NULL,
  prop_below = 0.1,
  jitter = 0.05,
  geom = "point",
  position = "identity",
  na.rm = FALSE,
  show.legend = NA,
  inherit.aes = TRUE,
  ...
)

Arguments

`mapping`	Set of aesthetic mappings created by `ggplot2::aes()` or `ggplot2::aes_()`. If specified and `inherit.aes = TRUE` (the default), is combined with the default mapping at the top level of the plot. You only need to supply mapping if there isn't a mapping defined for the plot.
`data`	A data frame. If specified, overrides the default data frame defined at the top level of the plot.
`prop_below`	the degree to shift the values. The default is 0.1
`jitter`	the amount of jitter to add. The default is 0.05
`geom`	stat Override the default connection between `geom_point` and `stat_point`.
`position`	Position adjustment, either as a string, or the result of a call to a position adjustment function
`na.rm`	If `FALSE` (the default), removes missing values with a warning. If `TRUE` silently removes missing values.
`show.legend`	logical. Should this layer be included in the legends? `NA`, the default, includes if any aesthetics are mapped. `FALSE` never includes, and `TRUE` always includes.
`inherit.aes`	If `FALSE`, overrides the default aesthetics, rather than combining with them. This is most useful for helper functions that define both data and aesthetics and shouldn't inherit behaviour from the default plot specification, e.g. borders.
`...`	other arguments passed on to `ggplot2::layer()`. There are three types of arguments you can use here: Aesthetics: to set an aesthetic to a fixed value, like `color = "red"` or `size = 3.` Other arguments to the layer, for example you override the default `stat` associated with the layer. Other arguments passed on to the stat.

Unbind (remove) shadow from data, and vice versa

Description

Remove the shadow variables (which end in ⁠_NA⁠) from the data, or vice versa. This will also remove the nabular class from the data.

Usage

unbind_shadow(data)

unbind_data(data)
unbind_shadow(data)

unbind_data(data)

Arguments

data

data.frame containing shadow columns (created by bind_shadow())

Value

data.frame without shadow columns if using unbind_shadow(), or without the original data, if using unbind_data().

Examples


# bind shadow columns
aq_sh <- bind_shadow(airquality)

# print data
aq_sh

# remove shadow columns
unbind_shadow(aq_sh)

# remove data
unbind_data(aq_sh)

# errors when you don't use data with shadows
## Not run: 
 unbind_data(airquality)
 unbind_shadow(airquality)

## End(Not run)

# bind shadow columns
aq_sh <- bind_shadow(airquality)

# print data
aq_sh

# remove shadow columns
unbind_shadow(aq_sh)

# remove data
unbind_data(aq_sh)

# errors when you don't use data with shadows
## Not run: 
 unbind_data(airquality)
 unbind_shadow(airquality)

## End(Not run)

Split a call into two components with a useful verb name

Description

This function is used inside recode_shadow to help evaluate the formula call effectively. .where is a special function designed for use in recode_shadow, and you shouldn't use it outside of it

Usage

.where(...)
.where(...)

Arguments

...

case_when style formula

Value

a list of "condition" and "suffix" arguments

Examples


## Not run: 
df <- tibble::tribble(
~wind, ~temp,
-99,    45,
68,    NA,
72,    25
)

dfs <- bind_shadow(df)

recode_shadow(dfs,
              temp = .where(wind == -99 ~ "bananas"))


## End(Not run)

## Not run: 
df <- tibble::tribble(
~wind, ~temp,
-99,    45,
68,    NA,
72,    25
)

dfs <- bind_shadow(df)

recode_shadow(dfs,
              temp = .where(wind == -99 ~ "bananas"))


## End(Not run)

Which rows and cols contain missings?

Description

Internal function that is short for which(is.na(x), arr.ind = TRUE). Creates array index locations of missing values in a dataframe.

Usage

where_na(x)
where_na(x)

Arguments

`x`	a dataframe

Value

a matrix with columns "row" and "col", which refer to the row and column that identify the position of a missing value in a dataframe

Examples


where_na(airquality)
where_na(oceanbuoys$sea_temp_c)

where_na(airquality)
where_na(oceanbuoys$sea_temp_c)

Which variables are shades?

Description

This function tells us which variables contain shade information

Usage

which_are_shade(.tbl)
which_are_shade(.tbl)

Arguments

.tbl

a data.frame or tbl

Value

numeric - which column numbers contain shade information

Examples


df_shadow <- bind_shadow(airquality)

which_are_shade(df_shadow)

df_shadow <- bind_shadow(airquality)

which_are_shade(df_shadow)

Which elements contain missings?

Description

Equivalent to which(is.na()) - returns integer locations of missing values.

Usage

which_na(x)
which_na(x)

Arguments

`x`	a dataframe

Value

integer locations of missing values.

Examples


which_na(airquality)

which_na(airquality)

Package 'naniar'

Help Index

Add a column describing presence of any missing values

Description

Usage

Arguments

Details

Value

See Also

Examples

Add a column describing if there are any missings in the dataset

Description

Usage

Arguments

Value

See Also

Examples

Add a column describing whether there is a shadow

Description

Usage

Arguments

Value

See Also

Examples

Add a column that tells us which "missingness cluster" a row belongs to

Description

Usage

Arguments

See Also

Examples

Add column containing number of missing data values

Description

Usage

Arguments

Value

See Also

Examples

Add column containing proportion of missing data values

Description

Usage

Arguments

Value

See Also

Examples

Add a shadow column to dataframe

Description

Usage

Arguments

Value

See Also

Examples

Add a shadow shifted column to a dataset

Description

Usage

Arguments

Value

See Also

Examples

Add a counter variable for a span of dataframe

Description

Usage

Arguments

Value

Examples

Helper function to determine whether there are any missings

Description

Usage

Arguments

Value

Identify if there are any or all missing or complete values

Description

Usage

Arguments

See Also

Examples

Create shadows

Description

Usage

Arguments

Details