Title: | Data Structures, Summaries, and Visualisations for Missing Data |
---|---|
Description: | Missing values are ubiquitous in data and need to be explored and handled in the initial stages of analysis. 'naniar' provides data structures and functions that facilitate the plotting of missing values and examination of imputations. This allows missing data dependencies to be explored with minimal deviation from the common work patterns of 'ggplot2' and tidy data. The work is fully discussed at Tierney & Cook (2023) <doi:10.18637/jss.v105.i07>. |
Authors: | Nicholas Tierney [aut, cre] , Di Cook [aut] , Miles McBain [aut] , Colin Fay [aut] , Mitchell O'Hara-Wild [ctb], Jim Hester [ctb], Luke Smith [ctb], Andrew Heiss [ctb] |
Maintainer: | Nicholas Tierney <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.1.0.9000 |
Built: | 2024-12-11 03:26:52 UTC |
Source: | https://github.com/njtierney/naniar |
This adds a column describing whether each row contains any missing values,
looking either across all variables (the default) or across only the
specified columns. Columns can be specified using variable names or the
dplyr verbs starts_with, contains, ends_with, etc. By default the added
column is called "any_miss_all" if no variables are specified; if variables
are specified, it is called "any_miss_vars" to indicate that not all
variables were used to create the labels.
add_any_miss( data, ..., label = "any_miss", missing = "missing", complete = "complete" )
data |
data.frame |
... |
Variable names to use instead of the whole dataset. By default this
looks at the whole dataset. Otherwise, this is one or more unquoted
expressions separated by commas. These also respect the dplyr verbs
starts_with, contains, ends_with, etc. |
label |
prefix for the name of the new column, defaults to "any_miss". If no variables are listed the column is named "any_miss_all"; if variables are specified it is named "any_miss_vars". |
missing |
character a label for when values are missing - defaults to "missing" |
complete |
character a label for when values are complete - defaults to "complete" |
By default the prefix "any_miss" is used, but this can be changed with the label argument.
data.frame with data and the column labelling whether that row (for those variables) has any missing values - indicated by "missing" and "complete".
bind_shadow()
add_any_miss()
add_label_missings()
add_label_shadow()
add_miss_cluster()
add_n_miss()
add_prop_miss()
add_shadow_shift()
cast_shadow()
airquality %>% add_any_miss()
airquality %>% add_any_miss(Ozone, Solar.R)
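The labelling described above can be pictured with a small base R sketch. This is an illustration only, not naniar's implementation: `label_any_miss` is a hypothetical helper, and it takes quoted column names rather than the unquoted selection that `add_any_miss()` accepts.

```r
# Base R sketch: label each row by whether it contains any NA.
# `label_any_miss` is a hypothetical helper, not part of naniar.
label_any_miss <- function(data, vars = names(data),
                           missing = "missing", complete = "complete") {
  # TRUE for rows with at least one NA among the chosen columns
  has_na <- rowSums(is.na(data[, vars, drop = FALSE])) > 0
  ifelse(has_na, missing, complete)
}

labels_all  <- label_any_miss(airquality)
labels_vars <- label_any_miss(airquality, vars = c("Ozone", "Solar.R"))
table(labels_all)
```

In `airquality` only Ozone and Solar.R contain missing values, so the two calls give the same labels.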
Add a column describing if there are any missings in the dataset
add_label_missings(data, ..., missing = "Missing", complete = "Not Missing")
data |
data.frame |
... |
extra variable to label |
missing |
character a label for when values are missing - defaults to "Missing" |
complete |
character a label for when values are complete - defaults to "Not Missing" |
data.frame with a column "any_missing" that is either "Not Missing" or "Missing" for the purposes of plotting / exploration / nice print methods
bind_shadow()
add_any_miss()
add_label_missings()
add_label_shadow()
add_miss_cluster()
add_n_miss()
add_prop_miss()
add_shadow_shift()
cast_shadow()
airquality %>% add_label_missings()
airquality %>% add_label_missings(Ozone, Solar.R)
airquality %>% add_label_missings(Ozone, Solar.R, missing = "yes", complete = "no")
Instead of focussing on labelling whether there are missings, we instead focus on whether there have been any shadows created. This can be useful when data has been imputed and you need to determine which rows contained missing values when the shadow was bound to the dataset.
add_label_shadow(data, ..., missing = "Missing", complete = "Not Missing")
data |
data.frame |
... |
extra variable to label |
missing |
character a label for when values are missing - defaults to "Missing" |
complete |
character a label for when values are complete - defaults to "Not Missing" |
data.frame with a column, "any_missing", which describes whether or not each row has any shadow value marked as missing.
bind_shadow()
add_any_miss()
add_label_missings()
add_label_shadow()
add_miss_cluster()
add_n_miss()
add_prop_miss()
add_shadow_shift()
cast_shadow()
airquality %>% add_shadow(Ozone, Solar.R) %>% add_label_shadow()
A way to extract the cluster of missingness that a group belongs to.
For example, if you use vis_miss(airquality, cluster = TRUE), you can
see some clustering in the data, but you do not have a way to identify
the cluster. Future work will incorporate the seriation package to
allow for better control over the clustering from the user.
add_miss_cluster(data, cluster_method = "mcquitty", n_clusters = 2)
data |
a dataframe |
cluster_method |
character vector of the agglomeration method to use,
the default is "mcquitty". Options are taken from |
n_clusters |
numeric the number of clusters you expect. Defaults to 2. |
bind_shadow()
add_any_miss()
add_label_missings()
add_label_shadow()
add_miss_cluster()
add_n_miss()
add_prop_miss()
add_shadow_shift()
cast_shadow()
add_miss_cluster(airquality)
add_miss_cluster(airquality, n_clusters = 3)
add_miss_cluster(airquality, cluster_method = "ward.D", n_clusters = 3)
It can be useful when doing data analysis to add the number of missing data
points into your dataframe. add_n_miss adds a column named "n_miss",
which contains the number of missing values in that row.
add_n_miss(data, ..., label = "n_miss")
data |
a dataframe |
... |
Variable names to use instead of the whole dataset. By default this
looks at the whole dataset. Otherwise, this is one or more unquoted
expressions separated by commas. These also respect the dplyr verbs
starts_with, contains, ends_with, etc. |
label |
character default is "n_miss". |
a dataframe
bind_shadow()
add_any_miss()
add_label_missings()
add_label_shadow()
add_miss_cluster()
add_prop_miss()
add_shadow_shift()
cast_shadow()
airquality %>% add_n_miss()
airquality %>% add_n_miss(Ozone, Solar.R)
airquality %>% add_n_miss(dplyr::contains("o"))
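As a rough picture of what the "n_miss" column holds, the per-row count can be computed in base R. This is a sketch of the idea, not naniar's code, and the object names are illustrative.

```r
# Base R sketch: count the missing values in each row and append the result.
n_miss <- rowSums(is.na(airquality))
aq_n <- cbind(airquality, n_miss = n_miss)
head(aq_n)
```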
It can be useful when doing data analysis to add the proportion of missing
data values into your dataframe. add_prop_miss adds a column named
"prop_miss", which contains the proportion of missing values in that row.
You can specify the variables that you would like to show the missingness
for.
add_prop_miss(data, ..., label = "prop_miss")
data |
a dataframe |
... |
Variable names to use instead of the whole dataset. By default this
looks at the whole dataset. Otherwise, this is one or more unquoted
expressions separated by commas. These also respect the dplyr verbs
starts_with, contains, ends_with, etc. |
label |
character string giving the name of the new variable, default is "prop_miss". |
a dataframe
bind_shadow()
add_any_miss()
add_label_missings()
add_label_shadow()
add_miss_cluster()
add_prop_miss()
add_shadow_shift()
cast_shadow()
airquality %>% add_prop_miss()
airquality %>% add_prop_miss(Solar.R, Ozone)
airquality %>% add_prop_miss(Solar.R, Ozone, label = "testing")
# this can be applied to model the proportion of missing data
# as in Tierney et al. (doi:10.1136/bmjopen-2014-007450)
# see "Modelling missingness" in vignette "Getting Started with naniar"
# for details
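The row-wise proportion can be sketched in base R as the mean of the missingness indicators. This only illustrates the quantity stored in "prop_miss"; it is not taken from naniar's source, and the object names are hypothetical.

```r
# Base R sketch: proportion of missing values in each row.
prop_all  <- rowMeans(is.na(airquality))
# Restricting to two columns changes the denominator from 6 to 2.
prop_vars <- rowMeans(is.na(airquality[, c("Solar.R", "Ozone")]))
head(cbind(airquality, prop_miss = prop_all))
```

Row 5 of `airquality` is missing both Ozone and Solar.R, so its proportion is 2/6 over all columns and 1 over the two selected columns.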
As an alternative to bind_shadow(), you can add specific individual shadow
columns to a dataset. These also respect the dplyr verbs starts_with,
contains, ends_with, etc.
add_shadow(data, ...)
data |
data.frame |
... |
One or more unquoted variable names, separated by commas. These also
respect the dplyr verbs starts_with, contains, ends_with, etc. |
data.frame
bind_shadow()
add_any_miss()
add_label_missings()
add_label_shadow()
add_miss_cluster()
add_n_miss()
add_prop_miss()
add_shadow_shift()
cast_shadow()
airquality %>% add_shadow(Ozone)
airquality %>% add_shadow(Ozone, Solar.R)
Shadow shift missing values using only the selected variables in a dataset.
Variables are selected by specifying variable names, or by using the dplyr
verbs starts_with, contains, ends_with, etc.
add_shadow_shift(data, ..., suffix = "shift")
data |
data.frame |
... |
One or more unquoted variable names separated by commas. These also
respect the dplyr verbs starts_with, contains, ends_with, etc. |
suffix |
suffix to add to variable, defaults to "shift" |
data with the shifted variable added, named as var_suffix
bind_shadow()
add_any_miss()
add_label_missings()
add_label_shadow()
add_miss_cluster()
add_n_miss()
add_prop_miss()
add_shadow_shift()
cast_shadow()
airquality %>% add_shadow_shift(Ozone, Solar.R)
Adds a variable, span_counter, to a dataframe. Used internally to facilitate
counting of missing values over a given span.
add_span_counter(data, span_size)
data |
data.frame |
span_size |
integer |
data.frame with extra variable "span_counter".
## Not run:
# add_span_counter(pedestrian, span_size = 100)
## End(Not run)
Helper function to determine whether there are any missings
any_row_miss(x)
x |
a vector |
logical vector: TRUE = missing, FALSE = complete
It is useful when exploring data to search for cases where there are any or all instances of missing or complete values. For example, these can help you identify and potentially remove or keep columns in a data frame that are all missing, or all complete.
For the any case, we provide two functions: any_miss and any_complete.
Note that any_miss has an alias, any_na. Both call anyNA under the hood.
any_complete is the complement to any_miss: it returns TRUE if there are
any complete values. Note that in a dataframe any_complete will look for
complete cases, which are complete rows, which is different to complete
variables.
For the all case, there are two functions: all_miss (with the alias all_na)
and all_complete.
any_na(x) any_miss(x) any_complete(x) all_na(x) all_miss(x) all_complete(x)
x |
an object to explore missings/complete values |
# for vectors
misses <- c(NA, NA, NA)
complete <- c(1, 2, 3)
mixture <- c(NA, 1, NA)
all_na(misses)
all_na(complete)
all_na(mixture)
all_complete(misses)
all_complete(complete)
all_complete(mixture)
any_na(misses)
any_na(complete)
any_na(mixture)
# for data frames
all_na(airquality)
# an alias of all_na
all_miss(airquality)
all_complete(airquality)
any_na(airquality)
any_complete(airquality)
# use in identifying columns with all missing/complete
library(dplyr)
# for printing
aq <- as_tibble(airquality)
aq
# select variables with all missing values
aq %>% select(where(all_na))
# there are none!
# select columns with any NA values
aq %>% select(where(any_na))
# select only columns with all complete data
aq %>% select(where(all_complete))
# select columns where there are any complete cases (all the data)
aq %>% select(where(any_complete))
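For vectors, the any and all predicates can be approximated with base R alone. The sketch below uses hypothetical helper names, not naniar's functions, and it does not reproduce the data frame behaviour of any_complete described above.

```r
# Hypothetical base R predicates for vectors; names are illustrative only.
has_any_na     <- function(x) anyNA(x)
is_all_na      <- function(x) all(is.na(x))
is_all_present <- function(x) !anyNA(x)

misses   <- c(NA, NA, NA)
complete <- c(1, 2, 3)
mixture  <- c(NA, 1, NA)

# One column per input vector, one row per predicate.
checks <- sapply(
  list(misses = misses, complete = complete, mixture = mixture),
  function(x) c(any_na = has_any_na(x),
                all_na = is_all_na(x),
                all_complete = is_all_present(x))
)
checks
```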
Return a tibble in shadow matrix form, where the variables are the same but have a suffix _NA attached to distinguish them.
as_shadow(data, ...)
data |
dataframe |
... |
selected variables to use |
Representing missing data structure is achieved using the shadow matrix, introduced in Swayne and Buja. The shadow matrix has the same dimensions as the data, and consists of binary indicators of missingness of data values, where missing is represented as "NA", and not missing is represented as "!NA". These may also be represented as 1 and 0, respectively.
a shadow matrix whose column names carry the _NA suffix
as_shadow(airquality)
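A shadow matrix of the kind described above can be built by hand in base R. This is a sketch under the assumption that each shadow column is a factor with levels "!NA" and "NA"; the construction is illustrative and not naniar's implementation.

```r
# Base R sketch of a shadow matrix: same shape as the data, with "NA"
# where a value is missing and "!NA" where it is present.
shadow <- as.data.frame(
  lapply(airquality, function(x) {
    factor(ifelse(is.na(x), "NA", "!NA"), levels = c("!NA", "NA"))
  })
)
# Distinguish the shadow columns from the originals with the _NA suffix.
names(shadow) <- paste0(names(airquality), "_NA")
head(shadow)
```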
Upset plots are a way of visualising common sets. This function transforms the data into a format that feeds directly into an upset plot.
as_shadow_upset(data)
data |
a data.frame |
a data.frame
## Not run:
library(UpSetR)
airquality %>%
  as_shadow_upset() %>%
  upset()
## End(Not run)
Binding a shadow matrix to a regular dataframe helps visualise and work with missing data.
bind_shadow(data, only_miss = FALSE, ...)
data |
a dataframe |
only_miss |
logical - if FALSE (default) it will bind a dataframe with all of the variables duplicated with their shadow. Setting this to TRUE will bind only those variables that contain missing values. See the examples for more details. |
... |
extra options to pass to |
data with the shadow variables added, named with the suffix _NA
bind_shadow(airquality)
# bind only the variables that contain missing values
bind_shadow(airquality, only_miss = TRUE)
aq_shadow <- bind_shadow(airquality)
## Not run:
# explore missing data visually
library(ggplot2)
# using the bound shadow to visualise Ozone according to whether Solar
# Radiation is missing or not.
ggplot(data = aq_shadow, aes(x = Ozone)) +
  geom_histogram() +
  facet_wrap(~Solar.R_NA, ncol = 1)
## End(Not run)
Casting a shadow shifted column performs the equivalent pattern to
data %>% select(var) %>% impute_below(). This is a convenience function
that makes it easy to perform certain visualisations, in line with the
principle that the user should have a way to flexibly return data formats
containing information about the missing data. It forms the base building
block for the functions cast_shadow_shift and cast_shadow_shift_label.
It also respects the dplyr verbs starts_with, contains, ends_with, etc.
to select variables.
cast_shadow(data, ...)
data |
data.frame |
... |
One or more unquoted variable names separated by commas. These
respect the dplyr verbs starts_with, contains, ends_with, etc. |
data with the added variable shifted and the suffix _NA
cast_shadow_shift(), cast_shadow_shift_label()
bind_shadow()
add_any_miss()
add_label_missings()
add_label_shadow()
add_miss_cluster()
add_prop_miss()
add_shadow_shift()
airquality %>% cast_shadow(Ozone, Solar.R)
## Not run:
library(ggplot2)
library(magrittr)
airquality %>%
  cast_shadow(Ozone, Solar.R) %>%
  ggplot(aes(x = Ozone, colour = Solar.R_NA)) +
  geom_density()
## End(Not run)
Shift the values and add a shadow column. It also respects the dplyr
verbs starts_with, contains, ends_with, etc.
cast_shadow_shift(data, ...)
data |
data.frame |
... |
One or more unquoted variable names separated by commas. These
respect the dplyr verbs starts_with, contains, ends_with, etc. |
data.frame with the shadow and shadow_shift vars
cast_shadow_shift(), cast_shadow_shift_label()
bind_shadow()
add_any_miss()
add_label_missings()
add_label_shadow()
add_miss_cluster()
add_prop_miss()
add_shadow_shift()
airquality %>% cast_shadow_shift(Ozone, Temp)
airquality %>% cast_shadow_shift(dplyr::contains("o"))
Shift the values, add shadow, add missing label
cast_shadow_shift_label(data, ...)
data |
data.frame |
... |
One or more unquoted expressions separated by commas. These also respect the dplyr verbs "starts_with", "contains", "ends_with", etc. |
data.frame with the shadow and shadow_shift vars, and missing labels
cast_shadow_shift(), cast_shadow_shift_label()
bind_shadow()
add_any_miss()
add_label_missings()
add_label_shadow()
add_miss_cluster()
add_prop_miss()
add_shadow_shift()
airquality %>% cast_shadow_shift_label(Ozone, Solar.R)
# replicate the plot generated by geom_miss_point()
## Not run:
library(ggplot2)
airquality %>%
  cast_shadow_shift_label(Ozone, Solar.R) %>%
  ggplot(aes(x = Ozone_shift, y = Solar.R_shift, colour = any_missing)) +
  geom_point()
## End(Not run)
This vector contains common number values of NA (missing), which is aimed to
be used inside naniar functions miss_scan_count() and replace_with_na().
The current list of numbers can be found by printing out common_na_numbers.
It is a useful way to explore your data for possible missings, but I
strongly warn against using this to replace NA values without very carefully
looking at the incidence for each of the cases. Common NA strings are in the
data object common_na_strings.
common_na_numbers
An object of class numeric of length 8.
original discussion here https://github.com/njtierney/naniar/issues/168
dat_ms <- tibble::tribble(
  ~x,  ~y,    ~z,
  1,   "A",   -100,
  3,   "N/A", -99,
  NA,  NA,    -98,
  -99, "E",   -101,
  -98, "F",   -1
)
miss_scan_count(dat_ms, -99)
miss_scan_count(dat_ms, c("-99","-98","N/A"))
common_na_numbers
miss_scan_count(dat_ms, common_na_numbers)
This vector contains common values of NA (missing), which is aimed to
be used inside naniar functions miss_scan_count() and replace_with_na().
The current list of strings used can be found by printing out
common_na_strings. It is a useful way to explore your data for possible
missings, but I strongly warn against using this to replace NA values
without very carefully looking at the incidence for each of the cases.
Please note that common_na_strings uses \\ around the "?", "." and "*"
characters to protect against using their wildcard features in grep.
Common NA numbers are in the data object common_na_numbers.
common_na_strings
An object of class character of length 26.
original discussion here https://github.com/njtierney/naniar/issues/168
dat_ms <- tibble::tribble(
  ~x,  ~y,    ~z,
  1,   "A",   -100,
  3,   "N/A", -99,
  NA,  NA,    -98,
  -99, "E",   -101,
  -98, "F",   -1
)
miss_scan_count(dat_ms, -99)
miss_scan_count(dat_ms, c("-99","-98","N/A"))
common_na_strings
miss_scan_count(dat_ms, common_na_strings)
replace_with_na(dat_ms, replace = list(y = common_na_strings))
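Before replacing anything, it helps to count how often each candidate value occurs per column. The base R sketch below does this with exact matching on a hand-picked vector, `na_candidates`, which is a hypothetical stand-in rather than the contents of common_na_strings. Because the note above mentions grep wildcards, naniar's own scan presumably matches patterns, so its counts may differ from an exact match.

```r
# Base R sketch: count exact matches of candidate "missing" codes per column.
dat_ms <- data.frame(
  x = c(1, 3, NA, -99, -98),
  y = c("A", "N/A", NA, "E", "F"),
  z = c(-100, -99, -98, -101, -1)
)

# Hypothetical candidate codes, chosen for this example only.
na_candidates <- c("N/A", "-99", "-98")

# Compare as character so numeric and string columns are treated alike;
# real NA values never match a candidate.
counts <- vapply(
  dat_ms,
  function(col) sum(as.character(col) %in% na_candidates),
  integer(1)
)
counts
```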
gather_shadow is a long-form representation of binding the shadow matrix to
your data, producing variables named case, variable, and missing, where
missing contains the missing value representation.
gather_shadow(data)
data |
a dataframe |
dataframe in long format, containing information about the missings
gather_shadow(airquality)
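The long form can be pictured with a base R sketch that produces one row per case and variable. The column names follow the description above, but the exact values naniar stores in `variable` (for example, whether they carry the _NA suffix) are not confirmed here, so treat this as an illustration only.

```r
# Base R sketch: one row per (case, variable) pair with a missingness flag.
miss <- is.na(airquality)

long <- data.frame(
  # as.vector() unrolls the matrix column by column, so the case index
  # repeats once per variable and each variable name repeats once per row.
  case     = rep(seq_len(nrow(miss)), times = ncol(miss)),
  variable = rep(colnames(miss), each = nrow(miss)),
  missing  = ifelse(as.vector(miss), "NA", "!NA")
)
head(long)
```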
geom_miss_point provides a way to transform and plot missing
values in ggplot2. To do so it uses methods from ggobi to display missing
data points 10% below the minimum value, so that they can be shown on
the same axis.
geom_miss_point( mapping = NULL, data = NULL, prop_below = 0.1, jitter = 0.05, stat = "miss_point", position = "identity", colour = ..missing.., na.rm = FALSE, show.legend = NA, inherit.aes = TRUE, ... )
mapping |
Set of aesthetic mappings created by |
data |
A data frame. If specified, overrides the default data frame defined at the top level of the plot. |
prop_below |
the degree to shift the values. The default is 0.1 |
jitter |
the amount of jitter to add. The default is 0.05 |
stat |
The statistical transformation to use on the data for this layer, as a string. |
position |
Position adjustment, either as a string, or the result of a call to a position adjustment function. |
colour |
the colour chosen for the aesthetic |
na.rm |
If |
show.legend |
logical. Should this layer be included in the legends?
|
inherit.aes |
If |
... |
other arguments passed on to
|
Warning message if na.rm = T is supplied.
gg_miss_case()
gg_miss_case_cumsum()
gg_miss_fct()
gg_miss_span()
gg_miss_var()
gg_miss_var_cumsum()
gg_miss_which()
## Not run:
library(ggplot2)
# using regular geom_point()
ggplot(airquality, aes(x = Ozone, y = Solar.R)) +
  geom_point()
# using geom_miss_point()
ggplot(airquality, aes(x = Ozone, y = Solar.R)) +
  geom_miss_point()
# using facets
ggplot(airquality, aes(x = Ozone, y = Solar.R)) +
  geom_miss_point() +
  facet_wrap(~Month)
## End(Not run)
These are the stat and geom overrides using ggproto from ggplot2 that make naniar work.
StatMissPoint
An object of class StatMissPoint (inherits from Stat, ggproto, gg) of length 6.
This is a visual analogue to miss_case_summary. It draws a ggplot of the
number of missings in each case (row). A default minimal theme is used, which
can be customised as normal for ggplot.
gg_miss_case(x, facet, order_cases = TRUE, show_pct = FALSE)
x |
data.frame |
facet |
(optional) a single bare variable name, if you want to create a faceted plot. |
order_cases |
logical Order the rows by missingness (default is TRUE, as in the usage above; set to FALSE for no ordering). |
show_pct |
logical Show the percentage of cases |
a ggplot object depicting the number of missings in a given case.
geom_miss_point()
gg_miss_case_cumsum()
gg_miss_fct()
gg_miss_span()
gg_miss_var()
gg_miss_var_cumsum()
gg_miss_which()
gg_miss_case(airquality)
## Not run:
library(ggplot2)
gg_miss_case(airquality) + labs(x = "Number of Cases")
gg_miss_case(airquality, show_pct = TRUE)
gg_miss_case(airquality, order_cases = FALSE)
gg_miss_case(airquality, facet = Month)
gg_miss_case(airquality, facet = Month, order_cases = FALSE)
gg_miss_case(airquality, facet = Month, show_pct = TRUE)
## End(Not run)
A plot showing the cumulative sum of missing values for cases, reading the rows from the top to bottom. A default minimal theme is used, which can be customised as normal for ggplot.
gg_miss_case_cumsum(x, breaks = 20)
x |
a dataframe |
breaks |
the breaks for the x axis default is 20 |
a ggplot object depicting the number of missings
geom_miss_point()
gg_miss_case()
gg_miss_fct()
gg_miss_span()
gg_miss_var()
gg_miss_var_cumsum()
gg_miss_which()
gg_miss_case_cumsum(airquality)
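The quantity behind this plot, a running total of missing values read from the first row to the last, can be sketched in base R. The object names are illustrative, and this is not the plotting code itself.

```r
# Base R sketch: cumulative number of missing values, row by row.
n_miss_by_row  <- rowSums(is.na(airquality))
cumulative_row <- cumsum(n_miss_by_row)
# The final value equals the total number of missing cells in the data.
tail(cumulative_row, 1)
```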
This function draws a ggplot plot of the number of missings in each column, broken down by a categorical variable from the dataset. A default minimal theme is used, which can be customised as normal for ggplot.
gg_miss_fct(x, fct)
x |
data.frame |
fct |
column containing the factor variable to visualise |
ggplot object depicting the % missing of each factor level for each variable.
geom_miss_point()
gg_miss_case()
gg_miss_case_cumsum()
gg_miss_span()
gg_miss_var()
gg_miss_var_cumsum()
gg_miss_which()
gg_miss_fct(x = riskfactors, fct = marital)
## Not run:
library(ggplot2)
gg_miss_fct(x = riskfactors, fct = marital) +
  labs(title = "NA in Risk Factors and Marital status")
## End(Not run)
gg_miss_span is a replacement for
imputeTS::plotNA.distributionBar(tsNH4, breaksize = 100), which shows the
number of missings in a given span, or breaksize. A default minimal theme
is used, which can be customised as normal for ggplot.
gg_miss_span(data, var, span_every, facet)
data |
data.frame |
var |
a bare unquoted variable name from |
span_every |
integer describing the length of the span to be explored |
facet |
(optional) a single bare variable name, if you want to create a faceted plot. |
a ggplot object showing the number of missings in a span (window, or breaksize)
geom_miss_point()
gg_miss_case()
gg_miss_case_cumsum()
gg_miss_fct()
gg_miss_var()
gg_miss_var_cumsum()
gg_miss_which()
miss_var_span(pedestrian, hourly_counts, span_every = 3000)
## Not run:
library(ggplot2)
gg_miss_span(pedestrian, hourly_counts, span_every = 3000)
gg_miss_span(pedestrian, hourly_counts, span_every = 3000, facet = sensor_name)
# works with the rest of ggplot
gg_miss_span(pedestrian, hourly_counts, span_every = 3000) + labs(x = "custom")
gg_miss_span(pedestrian, hourly_counts, span_every = 3000) + theme_dark()
## End(Not run)
Upset plots are a way of visualising common sets; gg_miss_upset shows the
number of missing values for each of the sets of data. The default options
of gg_miss_upset are taken from UpSetR::upset, which uses up to 5 sets and
up to 40 intersections. The ordering is set to the frequency of the
intersections. Setting nsets = 5 means looking at 5 variables and their
combinations. The number of combinations, or rather intersections, is
controlled by nintersects. If there are 40 intersections, there will be 40
combinations of variables explored. The number of sets and intersections
can be changed by passing arguments: nsets = 10 to look at 10 sets of
variables, and nintersects = 50 to look at 50 intersections.
gg_miss_upset(data, order.by = "freq", ...)
data |
data.frame |
order.by |
(from UpSetR::upset) How the intersections in the matrix should be ordered by. Options include frequency (entered as "freq"), degree, or both in any order. See |
... |
arguments to pass to upset plot - see |
a ggplot visualisation of missing data
## Not run:
gg_miss_upset(airquality)
gg_miss_upset(riskfactors)
gg_miss_upset(riskfactors, nsets = 10)
gg_miss_upset(riskfactors, nsets = 10, nintersects = 10)
## End(Not run)
This is a visual analogue to miss_var_summary. It draws a ggplot of the
number of missings in each variable, ordered to show which variables have
the most missing data. A default minimal theme is used, which can be
customised as normal for ggplot.
gg_miss_var(x, facet, show_pct = FALSE)
x |
a dataframe |
facet |
(optional) bare variable name, if you want to create a faceted plot. |
show_pct |
logical shows the number of missings (default), but if set to TRUE, it will display the proportion of missings. |
a ggplot object depicting the number of missings in a given column
geom_miss_point()
gg_miss_case()
gg_miss_case_cumsum()
gg_miss_fct()
gg_miss_span()
gg_miss_var()
gg_miss_var_cumsum()
gg_miss_which()
gg_miss_var(airquality)
## Not run:
library(ggplot2)
gg_miss_var(airquality) + labs(y = "Look at all the missing ones")
gg_miss_var(airquality, Month)
gg_miss_var(airquality, Month, show_pct = TRUE)
gg_miss_var(airquality, Month, show_pct = TRUE) + ylim(0, 100)
## End(Not run)
A plot showing the cumulative sum of missing values for each variable, reading columns from the left to the right of the initial dataframe. A default minimal theme is used, which can be customised as normal for ggplot.
gg_miss_var_cumsum(x)
x |
a data.frame |
a ggplot object showing the cumulative sum of missings over the variables
geom_miss_point()
gg_miss_case()
gg_miss_case_cumsum()
gg_miss_fct()
gg_miss_span()
gg_miss_var()
gg_miss_which()
gg_miss_var_cumsum(airquality)
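The numbers behind this plot can be sketched in base R as a running total of per-column missing counts, taken in the column order of the data. This illustrates the quantity only; the object names are hypothetical.

```r
# Base R sketch: cumulative number of missing values across columns,
# reading from the leftmost column to the rightmost.
n_miss_by_var  <- colSums(is.na(airquality))
cumulative_var <- cumsum(n_miss_by_var)
cumulative_var
```

In `airquality` only the first two columns contain missing values, so the running total stops growing after Solar.R.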
This plot produces a set of rectangles indicating whether there is a missing element in a column or not. A default minimal theme is used, which can be customised as normal for ggplot.
gg_miss_which(x)
x |
a dataframe |
a ggplot object showing which variables contain missing values
geom_miss_point()
gg_miss_case()
gg_miss_case_cumsum()
gg_miss_fct()
gg_miss_span()
gg_miss_var()
gg_miss_var_cumsum()
gg_miss_which()
gg_miss_which(airquality)
It can be useful in exploratory graphics to impute data outside the range of
the data. impute_below imputes variables with missings to have values
10 percent below the range for numeric values, plus some jittered noise
to separate repeated values, so that missing values can be visualised
along with the rest of the data. For character or factor
values, it adds a new string or label.
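The shifting idea for numeric values can be sketched in base R. This is only an illustration of the description above, not naniar's implementation: the helper name `shift_below()` is hypothetical, and the exact way naniar scales the shift and the jitter is an assumption here.

```r
# Illustrative base R sketch; NOT naniar's actual implementation.
# `shift_below()` is a hypothetical helper name.
shift_below <- function(x, prop_below = 0.1, jitter = 0.05) {
  rng <- range(x, na.rm = TRUE)
  width <- diff(rng)
  n_na <- sum(is.na(x))
  # Place missings `prop_below` of the range under the minimum ...
  base <- rng[1] - prop_below * width
  # ... plus a little uniform noise so repeated values do not overlap.
  noise <- stats::runif(n_na, -jitter * width, jitter * width)
  x[is.na(x)] <- base + noise
  x
}

set.seed(1)
vec <- c(2.1, NA, 3.5, 4.0, NA, 5.2)
shifted <- shift_below(vec)
shifted
```

With these defaults every imputed value lands strictly below the observed minimum, so the imputed points form their own band when plotted.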
impute_below(x, ...)
x |
a variable of interest to shift |
... |
extra arguments to pass |
add_shadow_shift()
cast_shadow_shift()
cast_shadow_shift_label()
library(dplyr)

vec <- rnorm(10)
vec[sample(1:10, 3)] <- NA

impute_below(vec)
impute_below(vec, prop_below = 0.25)
impute_below(vec, prop_below = 0.25, jitter = 0.2)

dat <- tibble(
  num = rnorm(10),
  int = as.integer(rpois(10, 5)),
  fct = factor(LETTERS[1:10])
) %>%
  mutate(across(everything(), \(x) set_prop_miss(x, prop = 0.25)))

dat

dat %>%
  nabular() %>%
  mutate(
    num = impute_below(num),
    int = impute_below(int),
    fct = impute_below(fct),
  )

dat %>%
  nabular() %>%
  mutate(across(where(is.numeric), impute_below))

dat %>%
  nabular() %>%
  mutate(across(c("num", "int"), impute_below))
It can be useful in exploratory graphics to impute data outside the range of
the data. impute_below_all imputes all variables with missings to have
values 10% below the range for numeric values, and for character or factor
values adds a new string or label.
impute_below_all(.tbl, prop_below = 0.1, jitter = 0.05, ...)
.tbl |
a data.frame |
prop_below |
the degree to shift the values. default is 0.1 |
jitter |
the amount of jitter to add. default is 0.05 |
... |
additional arguments |
a dataset with values imputed
# you can impute data like so:
airquality %>%
  impute_below_all()

# However, this does not show you WHERE the missing values are.
# to keep track of them, you want to use `bind_shadow()` first.
airquality %>%
  bind_shadow() %>%
  impute_below_all()

# This identifies where the missing values are located, which means you
# can do things like this:
## Not run:
library(ggplot2)
airquality %>%
  bind_shadow() %>%
  impute_below_all() %>%
  # identify where there are missings across rows.
  add_label_shadow() %>%
  ggplot(aes(x = Ozone, y = Solar.R, colour = any_missing)) +
  geom_point()
# Note that this ^^ is a long version of `geom_miss_point()`.
## End(Not run)
impute_below imputes missing values to a set percentage below the range
of the data. To impute many variables at once, we recommend that you use the
across function workflow, shown in the examples for impute_below().
impute_below_all operates on all variables. To only impute variables
that satisfy a specific condition, use the scoped variants,
impute_below_at and impute_below_if. To use _at effectively,
you must know that _at affects variables selected with a character vector, or with
vars().
impute_below_at(.tbl, .vars, prop_below = 0.1, jitter = 0.05, ...)
.tbl |
a data.frame |
.vars |
variables to impute |
prop_below |
the degree to shift the values. default is 0.1 |
jitter |
the amount of jitter to add. default is 0.05 |
... |
extra arguments |
a dataset with values imputed
# select variables by name or by position
impute_below_at(airquality, .vars = c("Ozone", "Solar.R"))
impute_below_at(airquality, .vars = 1:2)

## Not run:
library(dplyr)
impute_below_at(airquality, .vars = vars(Ozone))

library(ggplot2)
airquality %>%
  bind_shadow() %>%
  impute_below_at(vars(Ozone, Solar.R)) %>%
  add_label_shadow() %>%
  ggplot(aes(x = Ozone, y = Solar.R, colour = any_missing)) +
  geom_point()
## End(Not run)
impute_below_all operates on all variables. To only impute variables
that satisfy a specific condition, use the scoped variants,
impute_below_at and impute_below_if.
impute_below_if(.tbl, .predicate, prop_below = 0.1, jitter = 0.05, ...)
.tbl |
data.frame |
.predicate |
A predicate function (such as is.numeric) |
prop_below |
the degree to shift the values. default is 0.1 |
jitter |
the amount of jitter to add. default is 0.05 |
... |
extra arguments |
a dataset with values imputed
airquality %>% impute_below_if(.predicate = is.numeric)
Impute numeric values below a range for graphical exploration
## S3 method for class 'numeric'
impute_below(
  x,
  prop_below = 0.1,
  jitter = 0.05,
  seed_shift = 2017 - 7 - 1 - 1850,
  ...
)
x |
a variable of interest to shift |
prop_below |
the degree to shift the values. default is 0.1 |
jitter |
the amount of jitter to add. default is 0.05 |
seed_shift |
a random seed to set, if you like |
... |
extra arguments to pass |
For imputing fixed factor levels. It adds the new imputed value to the end
of the levels of the vector. We generally recommend imputing using other
model-based approaches. See the simputation package, for example
simputation::impute_lm().
impute_factor(x, value)

## Default S3 method:
impute_factor(x, value)

## S3 method for class 'factor'
impute_factor(x, value)

## S3 method for class 'character'
impute_factor(x, value)

## S3 method for class 'shade'
impute_factor(x, value)
x |
vector |
value |
factor to impute |
vector with missing values replaced by the given factor level
vec <- factor(LETTERS[1:10])
vec[sample(1:10, 3)] <- NA
vec

impute_factor(vec, "wat")

library(dplyr)

dat <- tibble(
  num = rnorm(10),
  int = rpois(10, 5),
  fct = factor(LETTERS[1:10])
) %>%
  mutate(across(everything(), \(x) set_prop_miss(x, prop = 0.25)))

dat

dat %>%
  nabular() %>%
  mutate(
    num = impute_fixed(num, -9999),
    int = impute_zero(int),
    fct = impute_factor(fct, "out")
  )
This can be useful if you are imputing specific values. However, we would
generally recommend imputing using other model-based approaches. See
the simputation package, for example simputation::impute_lm().
impute_fixed(x, value)

## Default S3 method:
impute_fixed(x, value)
x |
vector |
value |
value to impute |
vector with missing values replaced by the fixed value
vec <- rnorm(10)
vec[sample(1:10, 3)] <- NA
vec

impute_fixed(vec, -999)

library(dplyr)

dat <- tibble(
  num = rnorm(10),
  int = rpois(10, 5),
  fct = factor(LETTERS[1:10])
) %>%
  mutate(across(everything(), \(x) set_prop_miss(x, prop = 0.25)))

dat

dat %>%
  nabular() %>%
  mutate(
    num = impute_fixed(num, -9999),
    int = impute_zero(int),
    fct = impute_factor(fct, "out")
  )
This can be useful if you are imputing specific values. However, we would
generally recommend imputing using other model-based approaches. See
the simputation package, for example simputation::impute_lm().
impute_mean(x)

## Default S3 method:
impute_mean(x)

## S3 method for class 'factor'
impute_mean(x)
x |
vector |
vector with missing values replaced by the mean
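For a numeric vector, the idea of mean imputation can be sketched in a couple of lines of base R. The helper name `impute_mean_sketch()` is hypothetical and this is not naniar's code; in particular it does not cover the factor method listed in the usage above.

```r
# Illustrative base R sketch of mean imputation for a numeric vector.
# `impute_mean_sketch()` is a hypothetical name, not part of naniar.
impute_mean_sketch <- function(x) {
  # Replace each NA with the mean of the observed values.
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

vec <- c(1, NA, 3, NA, 5)
imputed <- impute_mean_sketch(vec)
imputed
```

Because every missing value receives the same number, the variance of the imputed vector shrinks, which is one reason model-based imputation is usually preferred.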
library(dplyr)

vec <- rnorm(10)
vec[sample(1:10, 3)] <- NA

impute_mean(vec)

dat <- tibble(
  num = rnorm(10),
  int = as.integer(rpois(10, 5)),
  fct = factor(LETTERS[1:10])
) %>%
  mutate(across(everything(), \(x) set_prop_miss(x, prop = 0.25)))

dat

dat %>%
  nabular() %>%
  mutate(
    num = impute_mean(num),
    int = impute_mean(int),
    fct = impute_mean(fct),
  )

dat %>%
  nabular() %>%
  mutate(across(where(is.numeric), impute_mean))

dat %>%
  nabular() %>%
  mutate(across(c("num", "int"), impute_mean))
Impute the median value into a vector with missing values
impute_median(x)

## Default S3 method:
impute_median(x)

## S3 method for class 'factor'
impute_median(x)
x |
vector |
vector with missing values replaced by the median
vec <- rnorm(10)
vec[sample(1:10, 3)] <- NA

impute_median(vec)

library(dplyr)

dat <- tibble(
  num = rnorm(10),
  int = as.integer(rpois(10, 5)),
  fct = factor(LETTERS[1:10])
) %>%
  mutate(across(everything(), \(x) set_prop_miss(x, prop = 0.25)))

dat

dat %>%
  nabular() %>%
  mutate(
    num = impute_median(num),
    int = impute_median(int),
  )

dat %>%
  nabular() %>%
  mutate(across(where(is.numeric), impute_median))

dat %>%
  nabular() %>%
  mutate(across(c("num", "int"), impute_median))
Impute the mode value into a vector with missing values
impute_mode(x)

## Default S3 method:
impute_mode(x)

## S3 method for class 'integer'
impute_mode(x)

## S3 method for class 'factor'
impute_mode(x)
x |
vector |
This approach adapts examples provided from Stack Overflow and, for the integer
case, just rounds the value. While this can be useful if you are
imputing specific values, we would generally recommend imputing
using other model-based approaches. See the simputation package, for example simputation::impute_lm().
vector with missing values replaced by the mode
vec <- rnorm(10)
vec[sample(1:10, 3)] <- NA

impute_mode(vec)

library(dplyr)

dat <- tibble(
  num = rnorm(10),
  int = rpois(10, 5),
  fct = factor(LETTERS[1:10])
) %>%
  mutate(across(everything(), \(x) set_prop_miss(x, prop = 0.25)))

dat

dat %>%
  nabular() %>%
  mutate(
    num = impute_mode(num),
    int = impute_mode(int),
    fct = impute_mode(fct)
  )
This can be useful if you are imputing specific values. However, we would
generally recommend imputing using other model-based approaches. See
the simputation package, for example simputation::impute_lm().
impute_zero(x)
x |
vector |
vector with missing values replaced by zero
vec <- rnorm(10)
vec[sample(1:10, 3)] <- NA
vec

impute_zero(vec)

library(dplyr)

dat <- tibble(
  num = rnorm(10),
  int = rpois(10, 5),
  fct = factor(LETTERS[1:10])
) %>%
  mutate(across(everything(), \(x) set_prop_miss(x, prop = 0.25)))

dat

dat %>%
  nabular() %>%
  mutate(
    num = impute_fixed(num, -9999),
    int = impute_zero(int),
    fct = impute_factor(fct, "out")
  )
This tells us if this column is a shade
is_shade(x)
are_shade(x)
any_shade(x)
x |
a vector you want to test for being a shade |
logical - is this a shade?
xs <- shade(c(NA, 1, 2, "3"))
is_shade(xs)
are_shade(xs)
any_shade(xs)

aq_s <- as_shadow(airquality)
is_shade(aq_s)
are_shade(aq_s)
any_shade(aq_s)
any_shade(airquality)
Label whether a value is missing in each row of one column.
label_miss_1d(x1)
x1 |
a variable of a dataframe |
a vector indicating whether any of these rows had missing values
can we generalise label_miss to work for any number of variables?
add_any_miss()
add_label_missings()
add_label_shadow()
label_miss_1d(airquality$Ozone)
Label whether a value is missing in each row of either of two columns.
label_miss_2d(x1, x2)
x1 |
a variable of a dataframe |
x2 |
another variable of a dataframe |
a vector indicating whether any of these rows had missing values
label_miss_2d(airquality$Ozone, airquality$Solar.R)
Creates a character vector describing presence/absence of missing values
label_missings(data, ..., missing = "Missing", complete = "Not Missing")
data |
a dataframe or set of vectors of the same length |
... |
extra variable to label |
missing |
character a label for when values are missing - defaults to "Missing" |
complete |
character a label for when values are complete - defaults to "Not Missing" |
character vector of "Missing" and "Not Missing".
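The labelling described above can be sketched in base R as a row-wise check for any missing value. This sketch assumes a row is labelled "Missing" when at least one of its variables is missing; the helper name `label_rows()` is hypothetical and is not naniar's implementation.

```r
# Illustrative base R sketch; `label_rows()` is a hypothetical helper.
# Assumption: a row counts as "Missing" if ANY of its values is NA.
label_rows <- function(data,
                       missing = "Missing",
                       complete = "Not Missing") {
  any_na <- rowSums(is.na(data)) > 0
  ifelse(any_na, missing, complete)
}

# airquality ships with base R (package datasets).
labels <- label_rows(airquality)
table(labels)
```

The `missing` and `complete` arguments mirror the ones in the usage above, so custom labels can be swapped in the same way.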
bind_shadow()
add_any_miss()
add_label_missings()
add_label_shadow()
add_miss_cluster()
add_n_miss()
add_prop_miss()
add_shadow_shift()
cast_shadow()
label_missings(airquality)

## Not run:
library(dplyr)

airquality %>%
  mutate(is_missing = label_missings(airquality)) %>%
  head()

airquality %>%
  mutate(is_missing = label_missings(airquality,
                                     missing = "definitely missing",
                                     complete = "absolutely complete")) %>%
  head()
## End(Not run)
Use Little's (1988) test statistic to assess if data is missing completely
at random (MCAR). The null hypothesis in this test is that the data is
MCAR, and the test statistic is a chi-squared value. The example below
shows the output of mcar_test(airquality). Given the high statistic
value and low p-value, we can conclude the airquality data is not
missing completely at random.
mcar_test(data)
data |
A data frame |
A tibble::tibble() with one row and four columns:
statistic |
Chi-squared statistic for Little's test |
df |
Degrees of freedom used for chi-squared statistic |
p.value |
P-value for the chi-squared statistic |
missing.patterns |
Number of missing data patterns in the data |
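To make the relationship between the columns concrete, the p-value of a chi-squared test is the upper tail probability of the statistic at the given degrees of freedom. The numbers below are made up for illustration and are not asserted to be the output of mcar_test() on any dataset.

```r
# Illustrative only: hypothetical statistic and df, not real output.
statistic <- 35.1
df <- 14

# Upper-tail probability of a chi-squared distribution.
p_value <- pchisq(statistic, df = df, lower.tail = FALSE)
p_value

# A small p-value is evidence against the null hypothesis of MCAR.
reject_mcar <- p_value < 0.05
reject_mcar
```

Failing to reject the null does not prove that data are MCAR; it only means the test found no evidence against it.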
Code is adapted from LittleMCAR() in the now-orphaned BaylorEdPsych
package: https://rdrr.io/cran/BaylorEdPsych/man/LittleMCAR.html. Some of
the code is adapted from Eric Stemmler: https://web.archive.org/web/20201120030409/https://stats-bayes.com/post/2020/08/14/r-function-for-little-s-test-for-data-missing-completely-at-random/
using maximum likelihood estimation from norm.
Andrew Heiss, [email protected]
Little, Roderick J. A. 1988. "A Test of Missing Completely at Random for Multivariate Data with Missing Values." Journal of the American Statistical Association 83 (404): 1198–1202. doi:10.1080/01621459.1988.10478722.
mcar_test(airquality)
mcar_test(oceanbuoys)

# If there are non-numeric columns, there will be a warning
mcar_test(riskfactors)
Provide a data.frame containing each case (row), the number and percent of missing values in each case.
miss_case_cumsum(data)
data |
a dataframe |
a tibble containing the number and percent of missing data in each case
miss_case_cumsum(airquality)

## Not run:
library(dplyr)

airquality %>%
  group_by(Month) %>%
  miss_case_cumsum()
## End(Not run)
Provide a summary for each case in the data of the number and percent of missings, and optionally the cumulative sum of missings in the order the cases are given. By default, it orders by the most missings in each case.
miss_case_summary(data, order = TRUE, add_cumsum = FALSE, ...)
data |
a data.frame |
order |
a logical indicating whether or not to order the result by n_miss. Defaults to TRUE. If FALSE, order of cases is the order input. |
add_cumsum |
logical indicating whether or not to add the cumulative sum of missings to the data. This can be useful when exploring patterns of nonresponse. These are calculated as the cumulative sum of the missings in the variables as they are first presented to the function. |
... |
extra arguments |
a tibble of the percent of missing data in each case.
pct_miss_case()
prop_miss_case()
pct_miss_var()
prop_miss_var()
pct_complete_case()
prop_complete_case()
pct_complete_var()
prop_complete_var()
miss_prop_summary()
miss_case_summary()
miss_case_table()
miss_summary()
miss_var_prop()
miss_var_run()
miss_var_span()
miss_var_summary()
miss_var_table()
n_complete()
n_complete_row()
n_miss()
n_miss_row()
pct_complete()
pct_miss()
prop_complete()
prop_complete_row()
prop_miss()
miss_case_summary(airquality)

## Not run:
# works with group_by from dplyr
library(dplyr)

airquality %>%
  group_by(Month) %>%
  miss_case_summary()
## End(Not run)
Provide a tidy table of the number of cases with 0, 1, 2, up to n, missing values and the proportion of the number of cases those cases make up.
miss_case_table(data)
data |
a dataframe |
a dataframe
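The tabulation this function describes can be approximated in base R by counting missings per row and then tabulating those counts. This is a sketch of the idea only: the column names in the resulting data frame are illustrative assumptions, not necessarily those returned by miss_case_table().

```r
# Illustrative base R sketch of tabulating missings per case.
# Column names below are assumptions chosen for readability.
n_miss_per_case <- rowSums(is.na(airquality))
counts <- table(n_miss_per_case)

case_table <- data.frame(
  n_miss_in_case = as.integer(names(counts)),
  n_cases = as.vector(counts),
  pct_cases = as.vector(counts) / nrow(airquality) * 100
)
case_table
```

For airquality this gives 111 cases with no missings, 40 with one, and 2 with two.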
pct_miss_case()
prop_miss_case()
pct_miss_var()
prop_miss_var()
pct_complete_case()
prop_complete_case()
pct_complete_var()
prop_complete_var()
miss_prop_summary()
miss_case_summary()
miss_case_table()
miss_summary()
miss_var_prop()
miss_var_run()
miss_var_span()
miss_var_summary()
miss_var_table()
n_complete()
n_complete_row()
n_miss()
n_miss_row()
pct_complete()
pct_miss()
prop_complete()
prop_complete_row()
prop_miss()
miss_case_table(airquality)

## Not run:
library(dplyr)

airquality %>%
  group_by(Month) %>%
  miss_case_table()
## End(Not run)
Return missing data info about the dataframe, the variables, and the cases. Specifically, it returns the proportion of elements in the dataframe that are missing, the proportion of variables that contain a missing value, and the proportion of cases that contain a missing value.
miss_prop_summary(data)
data |
a dataframe |
a dataframe
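The three quantities can be sketched in base R from a single logical matrix of missingness. This assumes the summary reports proportions (as the function name suggests); the names `df`, `var`, and `case` below are illustrative labels, not a claim about the exact column names returned.

```r
# Illustrative base R sketch; names df/var/case are assumptions.
is_na <- is.na(airquality)

prop_summary <- c(
  df   = mean(is_na),                 # share of all cells that are NA
  var  = mean(colSums(is_na) > 0),    # share of variables with any NA
  case = mean(rowSums(is_na) > 0)     # share of cases with any NA
)
prop_summary
```

For airquality, 44 of 918 cells are missing, 2 of 6 variables contain missings, and 42 of 153 cases contain missings.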
pct_miss_case()
prop_miss_case()
pct_miss_var()
prop_miss_var()
pct_complete_case()
prop_complete_case()
pct_complete_var()
prop_complete_var()
miss_prop_summary()
miss_case_summary()
miss_case_table()
miss_summary()
miss_var_run()
miss_var_span()
miss_var_summary()
miss_var_table()
miss_prop_summary(airquality)

## Not run:
library(dplyr)
# respects dplyr::group_by
airquality %>%
  group_by(Month) %>%
  miss_prop_summary()
## End(Not run)
Searching for different kinds of missing values is really annoying. If
you have values like -99 in your data, when they shouldn't be there,
or they should be encoded as missing, it can be difficult to ascertain
if they are there, and if so, where they are. miss_scan_count makes it
easier for users to search for particular occurrences of these values
across their variables. Note that the searches are done with regular
expressions, which are special ways of searching for text. See the
example below to see how to look for characters like ?.
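Because the search is regex-based, characters such as `?`, `.`, and `*` have special meaning and need escaping with a double backslash in an R string. The base R sketch below shows the idea with `grepl()`; the helper `count_matches()` is hypothetical and is not how naniar implements the search.

```r
# Illustrative base R sketch; `count_matches()` is a hypothetical helper.
dat <- data.frame(
  x = c(1, 3, NA, -99, -98),
  specials = c("?", "!", ".", "*", "-"),
  stringsAsFactors = FALSE
)

count_matches <- function(data, pattern) {
  # Count, per column, how many values match the regular expression.
  # grepl() returns FALSE for NA input, so NAs are never counted.
  vapply(
    data,
    function(col) sum(grepl(pattern, as.character(col))),
    integer(1)
  )
}

question_marks <- count_matches(dat, "\\?")  # escaped literal "?"
minus_99 <- count_matches(dat, "-99")
question_marks
minus_99
```

Note that a pattern like "-" also matches any negative number once values are converted to text, so narrow patterns are usually safer.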
miss_scan_count(data, search)
data |
data |
search |
values to search for |
a dataframe of the occurrences of the values you searched for
pct_miss_case()
prop_miss_case()
pct_miss_var()
prop_miss_var()
pct_complete_case()
prop_complete_case()
pct_complete_var()
prop_complete_var()
miss_prop_summary()
miss_case_summary()
miss_case_table()
miss_summary()
miss_var_prop()
miss_var_run()
miss_var_span()
miss_var_summary()
miss_var_table()
dat_ms <- tibble::tribble(
  ~x,  ~y,    ~z,   ~specials,
  1,   "A",   -100, "?",
  3,   "N/A", -99,  "!",
  NA,  NA,    -98,  ".",
  -99, "E",   -101, "*",
  -98, "F",   -1,   "-"
)

miss_scan_count(dat_ms, -99)
miss_scan_count(dat_ms, c(-99, -98))
miss_scan_count(dat_ms, c("-99", "-98", "N/A"))
miss_scan_count(dat_ms, "\\?")
miss_scan_count(dat_ms, "\\!")
miss_scan_count(dat_ms, "\\.")
miss_scan_count(dat_ms, "\\*")
miss_scan_count(dat_ms, "-")
miss_scan_count(dat_ms, common_na_strings)
miss_summary performs all of the missing data helper summaries and puts
them into lists within a tibble
miss_summary(data, order = TRUE)
data |
a dataframe |
order |
whether or not to order the result by n_miss |
a tibble of missing data summaries
pct_miss_case()
prop_miss_case()
pct_miss_var()
prop_miss_var()
pct_complete_case()
prop_complete_case()
pct_complete_var()
prop_complete_var()
miss_prop_summary()
miss_case_summary()
miss_case_table()
miss_summary()
miss_var_prop()
miss_var_run()
miss_var_span()
miss_var_summary()
miss_var_table()
n_complete()
n_complete_row()
n_miss()
n_miss_row()
pct_complete()
pct_miss()
prop_complete()
prop_complete_row()
prop_miss()
s_miss <- miss_summary(airquality)
s_miss$miss_df_prop
s_miss$miss_case_table
s_miss$miss_var_summary
# etc, etc, etc.

## Not run:
library(dplyr)
s_miss_group <- group_by(airquality, Month) %>% miss_summary()
s_miss_group$miss_df_prop
s_miss_group$miss_case_table
# etc, etc, etc.
## End(Not run)
Calculate the cumulative sum of number & percentage of missingness for each variable.
miss_var_cumsum(data)
data |
a data.frame |
a tibble of the cumulative sum of missing data in each variable
pct_miss_case()
prop_miss_case()
pct_miss_var()
prop_miss_var()
pct_complete_case()
prop_complete_case()
pct_complete_var()
prop_complete_var()
miss_prop_summary()
miss_case_summary()
miss_case_table()
miss_summary()
miss_var_prop()
miss_var_run()
miss_var_span()
miss_var_summary()
miss_var_table()
miss_var_cumsum(airquality)

## Not run:
library(dplyr)
# respects dplyr::group_by
airquality %>%
  group_by(Month) %>%
  miss_var_cumsum()
## End(Not run)
It is useful to find the number of missing values that occur in a single run.
The function, miss_var_run(), returns a dataframe with the column names
"run_length" and "is_na", which describe the length of the run, and
whether that run describes a missing value.
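The notion of a run can be illustrated with base R's run-length encoding, `rle()`, applied to the missingness indicator. This is a sketch of the concept only, not naniar's implementation; the "missing"/"complete" labels follow the values used in the examples for this function.

```r
# Illustrative base R sketch of runs of missing/complete values.
x <- c(1, NA, NA, 4, 5, NA, 7)

# Run-length encode the logical missingness indicator.
runs <- rle(is.na(x))

run_table <- data.frame(
  run_length = runs$lengths,
  is_na = ifelse(runs$values, "missing", "complete")
)
run_table
```

Here the vector splits into five runs: one complete value, two missing, two complete, one missing, and one complete.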
miss_var_run(data, var)
data |
data.frame |
var |
a bare variable name |
dataframe with column names "run_length" and "is_na", which describe the length of the run, and whether that run describes a missing value.
pct_miss_case()
prop_miss_case()
pct_miss_var()
prop_miss_var()
pct_complete_case()
prop_complete_case()
pct_complete_var()
prop_complete_var()
miss_prop_summary()
miss_case_summary()
miss_case_table()
miss_summary()
miss_var_prop()
miss_var_run()
miss_var_span()
miss_var_summary()
miss_var_table()
n_complete()
n_complete_row()
n_miss()
n_miss_row()
pct_complete()
pct_miss()
prop_complete()
prop_complete_row()
prop_miss()
miss_var_run(pedestrian, hourly_counts)

## Not run:
# find the number of runs missing/complete for each month
library(dplyr)

pedestrian %>%
  group_by(month) %>%
  miss_var_run(hourly_counts)

library(ggplot2)

# explore the number of missings in a given run
miss_var_run(pedestrian, hourly_counts) %>%
  filter(is_na == "missing") %>%
  count(run_length) %>%
  ggplot(aes(x = run_length, y = n)) +
  geom_col()

# look at the number of missing values and the run length of these.
miss_var_run(pedestrian, hourly_counts) %>%
  ggplot(aes(x = is_na, y = run_length)) +
  geom_boxplot()

# using group_by
pedestrian %>%
  group_by(month) %>%
  miss_var_run(hourly_counts)
## End(Not run)
To summarise the missing values in a time series object it can be useful to
calculate the number of missing values in a given time period.
miss_var_span takes a data.frame object, a variable, and a span_every
argument and returns a dataframe containing the number of missing values
within each span. When the number of observations isn't a perfect
multiple of the span length, the final span contains only the
remaining observations. For example, the pedestrian dataset has 37,700 rows. If
the span is set to 4000, then the final span will contain the remaining 1700 rows. This can
be computed using modulo (%%): nrow(data) %% 4000. This remainder
is reported in n_in_span.
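The span arithmetic from the example above can be checked directly in base R using integer division and modulo. The row count is taken from the text; the variable names are illustrative.

```r
# Span arithmetic for 37,700 rows split into spans of 4000.
n_rows <- 37700L
span_every <- 4000L

n_full_spans <- n_rows %/% span_every  # number of complete spans
remainder <- n_rows %% span_every      # rows left for the final span

# Number of observations in each span, including the shorter last one.
n_in_span <- c(
  rep(span_every, n_full_spans),
  if (remainder > 0) remainder
)
n_in_span
```

This yields nine spans of 4000 rows followed by a final span of 1700 rows.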
miss_var_span(data, var, span_every)
data |
data.frame |
var |
bare unquoted variable name of interest. |
span_every |
integer describing the length of the span to be explored |
dataframe with variables n_miss, n_complete, prop_miss, and
prop_complete, which describe the number or proportion of missing or
complete values within that given time span. The final variable,
n_in_span, states how many observations are in the span.
pct_miss_case()
prop_miss_case()
pct_miss_var()
prop_miss_var()
pct_complete_case()
prop_complete_case()
pct_complete_var()
prop_complete_var()
miss_prop_summary()
miss_case_summary()
miss_case_table()
miss_summary()
miss_var_prop()
miss_var_run()
miss_var_span()
miss_var_summary()
miss_var_table()
miss_var_span(data = pedestrian,
              var = hourly_counts,
              span_every = 168)

## Not run:
library(dplyr)
pedestrian %>%
  group_by(month) %>%
  miss_var_span(var = hourly_counts,
                span_every = 168)
## End(Not run)
Provide a summary for each variable of the number and percent of missings, and optionally the cumulative sum of missings in the order of the variables. By default, it orders by the most missings in each variable.
miss_var_summary(data, order = FALSE, add_cumsum = FALSE, digits, ...)
data |
a data.frame |
order |
a logical indicating whether to order the result by n_miss |
add_cumsum |
logical indicating whether or not to add the cumulative sum of missings to the data. This can be useful when exploring patterns of nonresponse. These are calculated as the cumulative sum of the missings in the variables as they are first presented to the function. |
digits |
how many digits to display in |
... |
extra arguments |
a tibble of the percent of missing data in each variable
n_miss_cumsum is calculated as the cumulative sum of missings in the
variables in the order that they are given in the data when entering
the function
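The cumulative sum described here can be reproduced in base R by counting missings per column and accumulating them in column order. This is a sketch of the calculation, not naniar's code; the output column names are chosen to echo those mentioned in the text.

```r
# Illustrative base R sketch of the cumulative sum of missings.
n_miss <- colSums(is.na(airquality))

var_cumsum <- data.frame(
  variable = names(n_miss),
  n_miss = unname(n_miss),
  # Accumulate in the order the columns appear in the data.
  n_miss_cumsum = unname(cumsum(n_miss))
)
var_cumsum
```

Since only Ozone (37) and Solar.R (7) have missings in airquality, the cumulative sum reaches 44 at the second column and stays there.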
pct_miss_case()
prop_miss_case()
pct_miss_var()
prop_miss_var()
pct_complete_case()
prop_complete_case()
pct_complete_var()
prop_complete_var()
miss_prop_summary()
miss_case_summary()
miss_case_table()
miss_summary()
miss_var_prop()
miss_var_run()
miss_var_span()
miss_var_summary()
miss_var_table()
n_complete()
n_complete_row()
n_miss()
n_miss_row()
pct_complete()
pct_miss()
prop_complete()
prop_complete_row()
prop_miss()
miss_var_summary(airquality)
miss_var_summary(oceanbuoys, order = TRUE)

## Not run: 
# works with group_by from dplyr
library(dplyr)
airquality %>%
  group_by(Month) %>%
  miss_var_summary()
## End(Not run)
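To make the contents of this summary concrete, here is a minimal base R sketch that computes the same kinds of quantities for airquality. It is not naniar's implementation, and the object names (n_miss_by_var, var_summary) are chosen here purely for illustration.

```r
# Base R sketch of a per-variable missingness summary (illustrative only).
n_miss_by_var <- colSums(is.na(airquality))

var_summary <- data.frame(
  variable = names(n_miss_by_var),
  n_miss   = as.integer(n_miss_by_var),
  pct_miss = as.numeric(n_miss_by_var) / nrow(airquality) * 100
)

# Cumulative sum of missings, in the order the variables appear in the data
var_summary$n_miss_cumsum <- cumsum(var_summary$n_miss)

# Order by the most missings in each variable
var_summary <- var_summary[order(-var_summary$n_miss), ]
var_summary
```

The cumulative sum is computed before sorting, which mirrors the note above that it follows the order in which the variables are given.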
Provide a tidy table of the number of variables with 0, 1, 2, up to n, missing values and the proportion of the number of variables those variables make up.
miss_var_table(data)
data |
a dataframe |
a dataframe
pct_miss_case()
prop_miss_case()
pct_miss_var()
prop_miss_var()
pct_complete_case()
prop_complete_case()
pct_complete_var()
prop_complete_var()
miss_prop_summary()
miss_case_summary()
miss_case_table()
miss_summary()
miss_var_prop()
miss_var_run()
miss_var_span()
miss_var_summary()
miss_var_table()
n_complete()
n_complete_row()
n_miss()
n_miss_row()
pct_complete()
pct_miss()
prop_complete()
prop_complete_row()
prop_miss()
miss_var_table(airquality)

## Not run: 
library(dplyr)
airquality %>%
  group_by(Month) %>%
  miss_var_table()
## End(Not run)
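The tabulation described above can be sketched in base R as follows. This is an illustration of the idea rather than naniar's code, and the column names used here are assumptions made for the example.

```r
# Count how many variables have 0, 1, 2, ... missing values (illustrative only).
n_miss_by_var <- colSums(is.na(airquality))
counts <- table(n_miss_by_var)

var_table <- data.frame(
  n_miss_in_var = as.integer(names(counts)),  # number of missings in a variable
  n_vars        = as.integer(counts),         # how many variables have that many
  pct_vars      = as.integer(counts) / ncol(airquality) * 100
)
var_table
```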
It can be helpful when writing other functions to just return the names of the variables that contain missing values. miss_var_which returns a vector of variable names that contain missings. It will return NULL when there are no missings.
miss_var_which(data)
data |
a data.frame |
character vector of variable names
miss_var_which(airquality)
miss_var_which(mtcars)
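A rough base R equivalent of this behaviour, including the NULL return when nothing is missing, might look like the sketch below. The helper name vars_with_missing is hypothetical and is not part of naniar.

```r
# Hypothetical helper: names of the columns that contain at least one NA.
vars_with_missing <- function(data) {
  has_missing <- colSums(is.na(data)) > 0
  if (!any(has_missing)) {
    return(NULL)  # no missings anywhere
  }
  names(data)[has_missing]
}

vars_with_missing(airquality)
vars_with_missing(mtcars)
```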
Defunct. Please see prop_miss_var(), prop_complete_var(), pct_miss_var(), pct_complete_var(), prop_miss_case(), prop_complete_case(), pct_miss_case(), pct_complete_case().
miss_var_prop(...)
complete_var_prop(...)
miss_var_pct(...)
complete_var_pct(...)
miss_case_prop(...)
complete_case_prop(...)
miss_case_pct(...)
complete_case_pct(...)
... |
arguments |
A complement to n_miss
n_complete(x)
x |
a vector |
numeric number of complete values
n_complete(airquality)
n_complete(airquality$Ozone)
Substitute for rowSums(!is.na(data)), but it also checks if input is NULL or is a dataframe
n_complete_row(data)
data |
a dataframe |
numeric vector of the number of complete values in each row
pct_miss_case()
prop_miss_case()
pct_miss_var()
prop_miss_var()
pct_complete_case()
prop_complete_case()
pct_complete_var()
prop_complete_var()
miss_prop_summary()
miss_case_summary()
miss_case_table()
miss_summary()
miss_var_prop()
miss_var_run()
miss_var_span()
miss_var_summary()
miss_var_table()
n_complete()
n_complete_row()
n_miss()
n_miss_row()
pct_complete()
pct_miss()
prop_complete()
prop_complete_row()
prop_miss()
n_complete_row(airquality)
Substitute for sum(is.na(data))
n_miss(x)
x |
a vector |
numeric the number of missing values
n_miss(airquality)
n_miss(airquality$Ozone)
Substitute for rowSums(is.na(data)), but it also checks if input is NULL or is a dataframe
n_miss_row(data)
data |
a dataframe |
numeric vector of the number of missing values in each row
pct_miss_case()
prop_miss_case()
pct_miss_var()
prop_miss_var()
pct_complete_case()
prop_complete_case()
pct_complete_var()
prop_complete_var()
miss_prop_summary()
miss_case_summary()
miss_case_table()
miss_summary()
miss_var_prop()
miss_var_run()
miss_var_span()
miss_var_summary()
miss_var_table()
n_complete()
n_complete_row()
n_miss()
n_miss_row()
pct_complete()
pct_miss()
prop_complete()
prop_complete_row()
prop_miss()
n_miss_row(airquality)
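The base R expressions that these row-wise helpers stand in for can be checked directly. The sketch below uses only base R and the built-in airquality data; the variable names are illustrative.

```r
# Row-wise counts of missing and complete values using the base R expressions.
row_miss     <- rowSums(is.na(airquality))
row_complete <- rowSums(!is.na(airquality))

# Every row's missing and complete counts add up to the number of columns
all(row_miss + row_complete == ncol(airquality))

head(row_miss)
```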
This function calculates the number of variables or cases that contain a complete value
n_var_complete(data)
n_case_complete(data)
data |
data.frame |
integer number of complete values
# how many variables contain complete values?
n_var_complete(airquality)
n_case_complete(airquality)
This function calculates the number of variables or cases that contain a missing value
n_var_miss(data)
n_case_miss(data)
data |
data.frame |
integer, number of missings
# how many variables contain missing values?
n_var_miss(airquality)
n_case_miss(airquality)
Binding a shadow matrix to a regular dataframe converts it into nabular data, which makes it easier to visualise and work with missing data.
nabular(data, only_miss = FALSE, ...)
data |
a dataframe |
only_miss |
logical - if FALSE (default) it will bind a dataframe with all of the variables duplicated with their shadow. Setting this to TRUE will bind only those variables that contain missing values. See the examples for more details. |
... |
extra options to pass to |
data with the added variable shifted and the suffix _NA
aq_nab <- nabular(airquality)
aq_s <- bind_shadow(airquality)

all.equal(aq_nab, aq_s)
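To show what "binding a shadow matrix" means structurally, the following base R sketch builds a shadow for airquality by hand. It is only an approximation of the idea: the level labels "!NA" and "NA" and the exact classes naniar attaches are assumptions here, and the object names are illustrative.

```r
# Build one shadow column per variable, marking each value as missing or not.
shadow <- as.data.frame(
  lapply(airquality, function(col) {
    factor(ifelse(is.na(col), "NA", "!NA"), levels = c("!NA", "NA"))
  })
)

# Shadow columns carry the _NA suffix
names(shadow) <- paste0(names(airquality), "_NA")

# Bind the shadow alongside the original data
aq_nabular_sketch <- cbind(airquality, shadow)
str(aq_nabular_sketch[, c("Ozone", "Ozone_NA")])
```

Keeping the shadow next to the data means a plot or summary can group by, say, Ozone_NA without losing the original values.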
naniar is a package to make it easier to summarise and handle missing values in R. It strives to do this in a way that is as consistent with tidyverse principles as possible. The work is fully discussed at Tierney & Cook (2023) doi:10.18637/jss.v105.i07.
Maintainer: Nicholas Tierney [email protected] (ORCID)
Authors:
Di Cook [email protected] (ORCID)
Miles McBain [email protected] (ORCID)
Colin Fay [email protected] (ORCID)
Other contributors:
Mitchell O'Hara-Wild [contributor]
Jim Hester [email protected] [contributor]
Luke Smith [contributor]
Andrew Heiss [email protected] (ORCID) [contributor]
Useful links:
Report bugs at https://github.com/njtierney/naniar/issues
Real-time data from moored ocean buoys for improved detection, understanding and prediction of El Niño and La Niña. The data is collected by the Tropical Atmosphere Ocean project (https://www.pmel.noaa.gov/gtmba/pmel-theme/pacific-ocean-tao).
data(oceanbuoys)
An object of class tbl_df (inherits from tbl, data.frame) with 736 rows and 8 columns.
Format: a data frame with 736 observations on the following 8 variables.
year
A numeric with levels 1993 and 1997.
latitude
A numeric with levels -5, -2, and 0.
longitude
A numeric with levels -110 and -95.
sea_temp_c
Sea surface temperature (degrees Celsius), measured by the TAO buoys one meter below the surface.
air_temp_c
Air temperature (degrees Celsius), measured by the TAO buoys three meters above the sea surface.
humidity
Relative humidity (%), measured by the TAO buoys three meters above the sea surface.
wind_ew
The East-West wind vector component (m/s). TAO buoys measure the wind speed and direction four meters above the sea surface. If it is positive, the East-West component of the wind is blowing towards the East. If it is negative, this component is blowing towards the West.
wind_ns
The North-South wind vector component (m/s). TAO buoys measure the wind speed and direction four meters above the sea surface. If it is positive, the North-South component of the wind is blowing towards the North. If it is negative, this component is blowing towards the South.
https://www.pmel.noaa.gov/tao/drupal/disdel/
library(MissingDataGUI) (data named "tao")
vis_miss(oceanbuoys)

# Look at the missingness in the variables
miss_var_summary(oceanbuoys)

## Not run: 
# Look at the missingness in air temperature and humidity
library(ggplot2)
p <- ggplot(oceanbuoys, aes(x = air_temp_c, y = humidity)) +
  geom_miss_point()
p

# for each year?
p + facet_wrap(~year)

# this shows that there are more missing values in humidity in 1993, and
# more air temperature missing values in 1997
# see more examples in the vignette, "getting started with naniar".
## End(Not run)
The complement to pct_miss
pct_complete(x)
x |
vector or data.frame |
numeric percent of complete values
pct_complete(airquality)
pct_complete(airquality$Ozone)
This is shorthand for mean(is.na(x)) * 100
pct_miss(x)
x |
vector or data.frame |
numeric the percent of missing values in x
pct_miss(airquality)
pct_miss(airquality$Ozone)
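The shorthand can be verified with base R alone. This sketch applies the stated expression, mean(is.na(x)) * 100, to a vector and to a whole data frame; the variable names are illustrative.

```r
# Percent missing and percent complete for a single vector
x <- airquality$Ozone
pct_missing  <- mean(is.na(x)) * 100
pct_complete <- mean(!is.na(x)) * 100

# The same expression works on a data frame, because is.na() returns a matrix
pct_missing_df <- mean(is.na(airquality)) * 100

c(vector = pct_missing, data_frame = pct_missing_df)
```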
Calculate the percentage of cases (rows) that contain a missing or complete value.
pct_miss_case(data)
pct_complete_case(data)
data |
a dataframe |
numeric the percentage of cases that contain a missing or complete value
pct_miss_case()
prop_miss_case()
pct_miss_var()
prop_miss_var()
pct_complete_case()
prop_complete_case()
pct_complete_var()
prop_complete_var()
miss_prop_summary()
miss_case_summary()
miss_case_table()
miss_summary()
miss_var_prop()
miss_var_run()
miss_var_span()
miss_var_summary()
miss_var_table()
pct_miss_case(airquality)
pct_complete_case(airquality)
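For comparison, the case-level percentages can be approximated in base R with complete.cases(). This is a sketch of the calculation the description implies, not naniar's own code.

```r
# A case (row) "contains a missing value" if it is not a complete case.
case_has_missing <- !complete.cases(airquality)

pct_case_missing  <- mean(case_has_missing) * 100
pct_case_complete <- mean(!case_has_missing) * 100

c(missing = pct_case_missing, complete = pct_case_complete)
```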
Calculate the percentage of variables that contain a single missing or complete value.
pct_miss_var(data)
pct_complete_var(data)
data |
a dataframe |
numeric the percent of variables that contain missing or complete data
pct_miss_case()
prop_miss_case()
pct_miss_var()
prop_miss_var()
pct_complete_case()
prop_complete_case()
pct_complete_var()
prop_complete_var()
miss_prop_summary()
miss_case_summary()
miss_case_table()
miss_summary()
miss_var_prop()
miss_var_run()
miss_var_span()
miss_var_summary()
miss_var_table()
pct_miss_var(airquality)
pct_complete_var(airquality)
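The variable-level percentage can be sketched in base R in the same way: flag each column that has at least one NA, then take the mean of the flags. The names below are illustrative and this is not naniar's implementation.

```r
# TRUE for each variable that contains at least one missing value
has_missing <- colSums(is.na(airquality)) > 0

pct_var_missing  <- mean(has_missing) * 100
pct_var_complete <- mean(!has_missing) * 100

c(missing = pct_var_missing, complete = pct_var_complete)
```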
This dataset contains hourly counts of pedestrians from 4 sensors around Melbourne: Birrarung Marr, Bourke Street Mall, Flagstaff station, and Spencer St-Collins St (south), recorded from January 1st 2016 at 00:00:00 to December 31st 2016 at 23:00:00. The data is made free and publicly available from https://data.melbourne.vic.gov.au/explore/dataset/pedestrian-counting-system-monthly-counts-per-hour/information/
data(pedestrian)
A tibble with 37,700 rows and 9 variables:
(integer) the number of pedestrians counted at that sensor at that time
(POSIXct, POSIXt) The time that the count was taken
(integer) Year of record
(factor) Month of record as an ordered factor (1 = January, 12 = December)
(integer) Full day of the month
(factor) Full day of the week as an ordered factor (1 = Sunday, 7 = Saturday)
(integer) The hour of the day in 24 hour format
(integer) the id of the sensor
(character) the full name of the sensor
# explore the missingness with vis_miss
vis_miss(pedestrian)

# Look at the missingness in the variables
miss_var_summary(pedestrian)

## Not run: 
# There is only missingness in hourly_counts
# Look at the missingness over a rolling window
library(ggplot2)
gg_miss_span(pedestrian, hourly_counts, span_every = 3000)
## End(Not run)
The complement to prop_miss
prop_complete(x)
x |
vector or data.frame |
numeric proportion of complete values
prop_complete(airquality)
prop_complete(airquality$Ozone)
Substitute for rowMeans(!is.na(data)), but it also checks if input is NULL or is a dataframe
prop_complete_row(data)
data |
a dataframe |
numeric vector of the proportion of complete values in each row
pct_miss_case()
prop_miss_case()
pct_miss_var()
prop_miss_var()
pct_complete_case()
prop_complete_case()
pct_complete_var()
prop_complete_var()
miss_prop_summary()
miss_case_summary()
miss_case_table()
miss_summary()
miss_var_prop()
miss_var_run()
miss_var_span()
miss_var_summary()
miss_var_table()
n_complete()
n_complete_row()
n_miss()
n_miss_row()
pct_complete()
pct_miss()
prop_complete()
prop_complete_row()
prop_miss()
prop_complete_row(airquality)
This is shorthand for mean(is.na(x))
prop_miss(x)
x |
vector or data.frame |
numeric the proportion of missing values in x
prop_miss(airquality)
prop_miss(airquality$Ozone)
Substitute for rowMeans(is.na(data)), but it also checks if input is NULL or is a dataframe
prop_miss_row(data)
data |
a dataframe |
numeric vector of the proportion of missing values in each row
pct_miss_case()
prop_miss_case()
pct_miss_var()
prop_miss_var()
pct_complete_case()
prop_complete_case()
pct_complete_var()
prop_complete_var()
miss_prop_summary()
miss_case_summary()
miss_case_table()
miss_summary()
miss_var_prop()
miss_var_run()
miss_var_span()
miss_var_summary()
miss_var_table()
n_complete()
n_complete_row()
n_miss()
n_miss_row()
pct_complete()
pct_miss()
prop_complete()
prop_complete_row()
prop_miss()
prop_miss_row(airquality)
Calculate the proportion of cases (rows) that contain missing or complete values.
prop_miss_case(data)
prop_complete_case(data)
data |
a dataframe |
numeric the proportion of cases that contain a missing or complete value
pct_miss_case()
prop_miss_case()
pct_miss_var()
prop_miss_var()
pct_complete_case()
prop_complete_case()
pct_complete_var()
prop_complete_var()
miss_prop_summary()
miss_case_summary()
miss_case_table()
miss_summary()
miss_var_prop()
miss_var_run()
miss_var_span()
miss_var_summary()
miss_var_table()
prop_miss_case(airquality)
prop_complete_case(airquality)
Calculate the proportion of variables that contain a single missing or complete value.
prop_miss_var(data)
prop_complete_var(data)
data |
a dataframe |
numeric the proportion of variables that contain missing or complete data
pct_miss_case()
prop_miss_case()
pct_miss_var()
prop_miss_var()
pct_complete_case()
prop_complete_case()
pct_complete_var()
prop_complete_var()
miss_prop_summary()
miss_case_summary()
miss_case_table()
miss_summary()
miss_var_prop()
miss_var_run()
miss_var_span()
miss_var_summary()
miss_var_table()
prop_miss_var(airquality)
prop_complete_var(airquality)
It can be useful to add special missing values; naniar supports this with the recode_shadow function.
recode_shadow(data, ...)

## S3 method for class 'data.frame'
recode_shadow(data, ...)

## S3 method for class 'grouped_df'
recode_shadow(data, ...)
data |
data.frame |
... |
A sequence of two-sided formulas as in dplyr::case_when,
but when a wrapper function |
a dataframe with altered shadows
df <- tibble::tribble(
  ~wind, ~temp,
  -99,   45,
  68,    NA,
  72,    25
)

dfs <- bind_shadow(df)

dfs

recode_shadow(dfs, temp = .where(wind == -99 ~ "bananas"))

recode_shadow(dfs, temp = .where(wind == -99 ~ "bananas")) %>%
  recode_shadow(wind = .where(wind == -99 ~ "apples"))
This function helps you replace NA values with a single provided value. This can be classed as a kind of imputation, and is powered by impute_fixed(). However, we would generally recommend imputing using other model-based approaches. See the simputation package, for example simputation::impute_lm(). See tidyr::replace_na() for a slightly different approach, dplyr::coalesce() for replacing NAs with values from other vectors, and dplyr::na_if() to replace specified values with NA.
replace_na_with(x, value)
x |
vector |
value |
value to replace |
vector with replaced values
library(naniar)
x <- c(1:5, NA, NA, NA)
x
replace_na_with(x, 0L)
replace_na_with(x, "unknown")

library(dplyr)
dat <- tibble(
  ones = c(NA, 1, 1),
  twos = c(NA, NA, 2),
  threes = c(NA, NA, NA)
)

dat

dat %>%
  mutate(
    ones = replace_na_with(ones, 0),
    twos = replace_na_with(twos, -99),
    threes = replace_na_with(threes, "unknowns")
  )

dat %>%
  mutate(
    across(
      everything(),
      \(x) replace_na_with(x, -99)
    )
  )
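The core operation, replacing every NA in a vector with one fixed value, can be sketched in base R as below. The helper name replace_na_sketch is hypothetical; the actual function is described above as being powered by impute_fixed(), and this sketch does not reproduce any input checks it may perform.

```r
# Hypothetical helper: replace all NA entries of x with a single value.
replace_na_sketch <- function(x, value) {
  x[is.na(x)] <- value
  x
}

x <- c(1:5, NA, NA, NA)
filled <- replace_na_sketch(x, 0L)
filled
```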
This function is defunct; please see replace_with_na().
replace_to_na(...)
... |
additional arguments for methods. |
values replaced by NA
Specify variables and their values that you want to convert to missing values. This is a complement to tidyr::replace_na.
replace_with_na(data, replace = list(), ...)
data |
A data.frame |
replace |
A named list giving, for each column, the values to replace with NA |
... |
additional arguments for methods. Currently unused |
Dataframe with values replaced by NA.
replace_with_na()
replace_with_na_all()
replace_with_na_at()
replace_with_na_if()
dat_ms <- tibble::tribble(
  ~x,  ~y,    ~z,
  1,   "A",   -100,
  3,   "N/A", -99,
  NA,  NA,    -98,
  -99, "E",   -101,
  -98, "F",   -1
)

replace_with_na(dat_ms, replace = list(x = -99))

replace_with_na(dat_ms, replace = list(x = c(-99, -98)))

replace_with_na(dat_ms,
                replace = list(x = c(-99, -98),
                               y = c("N/A"),
                               z = c(-101)))
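The effect of a named replacement list can be illustrated with a small base R loop. This is a sketch of the behaviour described above, not the package's implementation; dat and to_replace are names made up for the example.

```r
# Example data mirroring dat_ms, built with base R only
dat <- data.frame(
  x = c(1, 3, NA, -99, -98),
  y = c("A", "N/A", NA, "E", "F"),
  z = c(-100, -99, -98, -101, -1),
  stringsAsFactors = FALSE
)

# For each named column, set the listed values to NA
to_replace <- list(x = c(-99, -98), y = "N/A")
for (col in names(to_replace)) {
  dat[[col]][dat[[col]] %in% to_replace[[col]]] <- NA
}
dat
```

Columns that are not named in the list, such as z here, are left untouched.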
This function takes a dataframe and replaces all values that meet the condition specified as an NA value, following a special syntax.
replace_with_na_all(data, condition)
data |
A dataframe |
condition |
A condition required to be TRUE to set NA. Here, the condition
is specified with a formula, following the syntax: |
dat_ms <- tibble::tribble(
  ~x,  ~y,    ~z,
  1,   "A",   -100,
  3,   "N/A", -99,
  NA,  NA,    -98,
  -99, "E",   -101,
  -98, "F",   -1
)

dat_ms

# replace all instances of -99 with NA
replace_with_na_all(data = dat_ms, condition = ~.x == -99)

# replace all instances of -99 or -98, or "N/A" with NA
replace_with_na_all(dat_ms, condition = ~.x %in% c(-99, -98, "N/A"))

# replace all instances of common na strings
replace_with_na_all(dat_ms, condition = ~.x %in% common_na_strings)

# where works with functions
replace_with_na_all(airquality, ~ sqrt(.x) < 5)
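Conceptually, the condition is evaluated on every column and the values for which it is TRUE become NA. A base R sketch of that idea, using an ordinary function in place of the formula syntax, is shown below. The names dat and condition are illustrative, and this is not how naniar evaluates the formula internally.

```r
dat <- data.frame(
  x = c(1, 3, NA, -99, -98),
  y = c("A", "N/A", NA, "E", "F"),
  z = c(-100, -99, -98, -101, -1),
  stringsAsFactors = FALSE
)

# Plain-function stand-in for the condition ~.x %in% c(-99, -98, "N/A")
condition <- function(col) col %in% c(-99, -98, "N/A")

# Apply the condition to every column and blank out the matching values
dat[] <- lapply(dat, function(col) {
  col[condition(col)] <- NA
  col
})
dat
```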
Replace specified variables with NA where a certain condition is met
replace_with_na_at(data, .vars, condition)
data |
dataframe |
.vars |
A character string of variables to replace with NA values |
condition |
A condition required to be TRUE to set NA. Here, the condition
is specified with a formula, following the syntax: |
a dataframe
dat_ms <- tibble::tribble(
  ~x,  ~y,    ~z,
  1,   "A",   -100,
  3,   "N/A", -99,
  NA,  NA,    -98,
  -99, "E",   -101,
  -98, "F",   -1
)

dat_ms

replace_with_na_at(data = dat_ms,
                   .vars = "x",
                   condition = ~.x == -99)

replace_with_na_at(data = dat_ms,
                   .vars = c("x", "z"),
                   condition = ~.x == -99)

# replace using values in common_na_strings
replace_with_na_at(data = dat_ms,
                   .vars = c("x", "z"),
                   condition = ~.x %in% common_na_strings)
Replace values with NA based on some condition, for variables that meet some predicate
replace_with_na_if(data, .predicate, condition)
data |
Dataframe |
.predicate |
A predicate function to be applied to the columns or a logical vector. |
condition |
A condition required to be TRUE to set NA. Here, the condition
is specified with a formula, following the syntax: |
Dataframe
dat_ms <- tibble::tribble(
  ~x,  ~y,    ~z,
  1,   "A",   -100,
  3,   "N/A", -99,
  NA,  NA,    -98,
  -99, "E",   -101,
  -98, "F",   -1
)

dat_ms

replace_with_na_if(data = dat_ms,
                   .predicate = is.character,
                   condition = ~.x == "N/A")

replace_with_na_if(data = dat_ms,
                   .predicate = is.character,
                   condition = ~.x %in% common_na_strings)

# the replace argument takes the named list of values to convert to NA
replace_with_na(dat_ms,
                replace = list(x = c(-99, -98),
                               y = c("N/A"),
                               z = c(-101)))
The data is a subset of the 2009 survey from BRFSS, an ongoing data collection program designed to measure behavioral risk factors for the adult population (18 years of age or older) living in households.
data(riskfactors)
An object of class tbl_df (inherits from tbl, data.frame) with 245 rows and 34 columns.
https://www.cdc.gov/brfss/annual_data/annual_2009.htm
the codebook: https://www.cdc.gov/brfss/annual_data/annual_2009.htm
Format: a data frame with 245 observations on the following 34 variables.
state
A factor with 52 levels. The labels and the states corresponding to the labels are as follows: 1:Alabama, 2:Alaska, 4:Arizona, 5:Arkansas, 6:California, 8:Colorado, 9:Connecticut, 10:Delaware, 11:District of Columbia, 12:Florida, 13:Georgia, 15:Hawaii, 16:Idaho, 17:Illinois, 18:Indiana, 19:Iowa, 20:Kansas, 21:Kentucky, 22:Louisiana, 23:Maine, 24:Maryland, 25:Massachusetts, 26:Michigan, 27:Minnesota, 28:Mississippi, 29:Missouri, 30:Montana, 31:Nebraska, 32:Nevada, 33:New Hampshire, 34:New Jersey, 35:New Mexico, 36:New York, 37:North Carolina, 38:North Dakota, 39:Ohio, 40:Oklahoma, 41:Oregon, 42:Pennsylvania, 44:Rhode Island, 45:South Carolina, 46:South Dakota, 47:Tennessee, 48:Texas, 49:Utah, 50:Vermont, 51:Virginia, 53:Washington, 54:West Virginia, 55:Wisconsin, 56:Wyoming, 66:Guam, 72:Puerto Rico, 78:Virgin Islands
sex
A factor with levels Male and Female.
age
A numeric vector from 7 to 97.
weight_lbs
The weight without shoes in pounds.
height_inch
The height without shoes in inches.
bmi
Body Mass Index (BMI). Computed as weight in kilograms / (height in meters * height in meters). Missing if either weight or height is missing.
marital
A factor with levels Married, Divorced, Widowed, Separated, NeverMarried, and UnmarriedCouple.
pregnant
Whether pregnant now, with two levels Yes and No.
children
A numeric vector giving the number of children less than 18 years of age in household.
education
A factor with the education levels 1, 2, 3, 4, 5, and 6, where 1: Never attended school or only kindergarten; 2: Grades 1 through 8 (Elementary); 3: Grades 9 through 11 (Some high school); 4: Grade 12 or GED (High school graduate); 5: College 1 year to 3 years (Some college or technical school); 6: College 4 years or more (College graduate).
employment
A factor showing the employment status with levels 1, 2, 3, 4, 5, 7, and 8. The labels mean 1: Employed for wages; 2: Self-employed; 3: Out of work for more than 1 year; 4: Out of work for less than 1 year; 5: A homemaker; 6: A student; 7: Retired; 8: Unable to work.
income
The annual household income from all sources, with levels <10k, 10-15k, 15-20k, 20-25k, 25-35k, 35-50k, 50-75k, >75k, Dontknow, and Refused.
veteran
A factor with levels 1, 2, 3, 4, and 5. The question for this variable is: Have you ever served on active duty in the United States Armed Forces, either in the regular military or in a National Guard or military reserve unit? Active duty does not include training for the Reserves or National Guard, but DOES include activation, for example, for the Persian Gulf War. The labels mean 1: Yes, now on active duty; 2: Yes, on active duty during the last 12 months, but not now; 3: Yes, on active duty in the past, but not during the last 12 months; 4: No, training for Reserves or National Guard only; 5: No, never served in the military.
hispanic
A factor with levels Yes and No, corresponding to the question: are you Hispanic or Latino?
health_general
Answer to the question "in general your health is", with levels Excellent, VeryGood, Good, Fair, Poor, and Refused.
health_physical
The number of days during the last 30 days that the respondent's physical health was not good. -7 is for "Don't know/Not sure", and -9 is for "Refused".
health_mental
The number of days during the last 30 days that the respondent's mental health was not good. -7 is for "Don't know/Not sure", and -9 is for "Refused".
health_poor
The number of days during the last 30 days that poor physical or mental health kept the respondent from doing usual activities, such as self-care, work, or recreation. -7 is for "Don't know/Not sure", and -9 is for "Refused".
health_cover
Whether the respondent has any kind of health care coverage, including health insurance, prepaid plans such as HMOs, or government plans such as Medicare. The answer has two levels: Yes and No.
provide_care
Whether the respondent provided any such care or assistance to a friend or family member during the past month, with levels Yes and No.
activity_limited
Whether the respondent is limited in any way in any activities because of physical, mental, or emotional problems, with levels Yes and No.
drink_any
Whether the respondent has had at least one drink of any alcoholic beverage such as beer, wine, a malt beverage or liquor during the past 30 days, with levels Yes and No.
drink_days
The number of days during the past 30 days that the respondent had at least one drink of any alcoholic beverage. -7 is for "Don't know/Not sure", and -9 is for "Refused".
drink_avg
The number of drinks on the average the respondent had on the days when he/she drank, during the past 30 days. -7 is for "Don't know/Not sure", and -9 is for "Refused".
smoke_100
Whether the respondent has smoked at least 100 cigarettes in their entire life, with levels Yes and No.
smoke_days
The frequency of days now smoking, with levels Everyday, Somedays, and NotAtAll (not at all).
smoke_stop
Whether the respondent has stopped smoking for one day or longer during the past 12 months because they were trying to quit smoking, with levels Yes and No.
smoke_last
A factor with levels 3, 4, 5, 6, 7, and 8, corresponding to the question: how long has it been since last smoking cigarettes regularly? The labels mean 3: Within the past 6 months (3 months but less than 6 months ago); 4: Within the past year (6 months but less than 1 year ago); 5: Within the past 5 years (1 year but less than 5 years ago); 6: Within the past 10 years (5 years but less than 10 years ago); 7: 10 years or more; 8: Never smoked regularly.
diet_fruit
The number of fruit the respondent eats every year, not counting juice. -7 is for "Don't know/Not sure", and -9 is for "Refused".
diet_salad
The number of servings of green salad the respondent eats every year. -7 is for "Don't know/Not sure", and -9 is for "Refused".
diet_potato
The number of servings of potatoes, not including french fries, fried potatoes, or potato chips, that the respondent eats every year. -7 is for "Don't know/Not sure", and -9 is for "Refused".
diet_carrot
The number of carrots the respondent eats every year. -7 is for "Don't know/Not sure", and -9 is for "Refused".
diet_vegetable
The number of servings of vegetables the respondent eats every year, not counting carrots, potatoes, or salad. -7 is for "Don't know/Not sure", and -9 is for "Refused".
diet_juice
The number of fruit juices such as orange, grapefruit, or tomato that the respondent drinks every year. -7 is for "Don't know/Not sure", and -9 is for "Refused".
library(MissingDataGUI) (named brfss)
vis_miss(riskfactors)

# Look at the missingness in the variables
miss_var_summary(riskfactors)

# and now as a plot
gg_miss_var(riskfactors)

## Not run: 
# Look at the missingness in bmi and poor health
library(ggplot2)
p <- ggplot(riskfactors, aes(x = health_poor, y = bmi)) +
  geom_miss_point()
p

# for each sex?
p + facet_wrap(~sex)

# for each education bracket?
p + facet_wrap(~education)
## End(Not run)
impute_mean
impute_mean imputes the mean for a vector. To get it to work on all variables, use impute_mean_all. To only impute variables that satisfy a specific condition, use the scoped variants, impute_mean_at and impute_mean_if. To use _at effectively, you must know that _at affects variables selected with a character vector, or with vars().
impute_mean_all(.tbl)
impute_mean_at(.tbl, .vars)
impute_mean_if(.tbl, .predicate)
.tbl |
a data.frame |
.vars |
variables to impute |
.predicate |
variables to impute |
a dataset with values imputed
# select variables starting with a particular string.
impute_mean_all(airquality)

impute_mean_at(airquality, .vars = c("Ozone", "Solar.R"))

## Not run: 
library(dplyr)
impute_mean_at(airquality, .vars = vars(Ozone))

impute_mean_if(airquality, .predicate = is.numeric)

library(ggplot2)
airquality %>%
  bind_shadow() %>%
  impute_mean_all() %>%
  add_label_shadow() %>%
  ggplot(aes(x = Ozone, y = Solar.R, colour = any_missing)) +
  geom_point()
## End(Not run)
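Mean imputation itself is a small operation, and the scoped "if" idea amounts to applying it only to columns that pass a predicate. The base R sketch below illustrates both under those assumptions; impute_mean_sketch is a hypothetical helper, not the package function.

```r
# Hypothetical helper: replace NA values with the mean of the observed values.
impute_mean_sketch <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Apply it only to the columns that satisfy a predicate (here, is.numeric)
aq_imputed <- airquality
numeric_cols <- vapply(aq_imputed, is.numeric, logical(1))
aq_imputed[numeric_cols] <- lapply(aq_imputed[numeric_cols], impute_mean_sketch)

head(aq_imputed)
```

Every imputed cell receives the same value, which is one reason the surrounding text pairs imputation with a shadow so the imputed points can still be identified.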
impute_median
`impute_median` imputes the median for a vector. To impute many variables at
once, we recommend that you use the `across` function workflow, shown in the
examples for `impute_median()`. You can use the scoped variants,
`impute_median_all`, `impute_median_at`, and `impute_median_if`, to impute
all, some, or just those variables meeting some condition, respectively. To
use `_at` effectively, you must know that `_at` affects variables selected
with a character vector, or with `vars()`.
impute_median_all(.tbl)
impute_median_at(.tbl, .vars)
impute_median_if(.tbl, .predicate)
.tbl |
a data.frame |
.vars |
variables to impute |
.predicate |
variables to impute |
a dataset with values imputed
# select variables starting with a particular string.
impute_median_all(airquality)
impute_median_at(airquality, .vars = c("Ozone", "Solar.R"))
library(dplyr)
impute_median_at(airquality, .vars = vars(Ozone))
impute_median_if(airquality, .predicate = is.numeric)
library(ggplot2)
airquality %>%
  bind_shadow() %>%
  impute_median_all() %>%
  add_label_shadow() %>%
  ggplot(aes(x = Ozone, y = Solar.R, colour = any_missing)) +
  geom_point()
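The `across` workflow mentioned in the description can be sketched as follows. This is an illustration under stated assumptions: `impute_median_sketch` is a hypothetical stand-in for a vector-level median imputer, so the example depends only on dplyr.

```r
library(dplyr)

# Hypothetical stand-in for a vector-level median imputer.
impute_median_sketch <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Impute every numeric column (the role played by an "_all"/"_if" variant).
aq_imputed <- airquality %>%
  mutate(across(where(is.numeric), impute_median_sketch))

# Impute only named columns (the role played by an "_at" variant).
aq_some <- airquality %>%
  mutate(across(c(Ozone, Solar.R), impute_median_sketch))
```

The same pattern should work with any function that takes a vector and returns a vector of the same length.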
Set a proportion or number of missing values
set_prop_miss(x, prop = 0.1)
set_n_miss(x, n = 1)
x |
vector of values to set missing |
prop |
proportion of values between 0 and 1 to set as missing |
n |
number of values to set missing |
vector with missing values added
vec <- rnorm(5)
set_prop_miss(vec, 0.2)
set_prop_miss(vec, 0.4)
set_n_miss(vec, 1)
set_n_miss(vec, 4)
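One way to picture what these functions do is the base R sketch below. Both `*_sketch` functions are hypothetical, and rounding the proportion down with `floor()` is an assumption made here for illustration, not a documented guarantee.

```r
# Hypothetical helper: set `n` randomly chosen positions to NA.
set_n_miss_sketch <- function(x, n = 1) {
  x[sample(seq_along(x), n)] <- NA
  x
}

# Hypothetical helper: convert a proportion to a count (rounding down is an
# assumption) and reuse the count-based helper.
set_prop_miss_sketch <- function(x, prop = 0.1) {
  stopifnot(prop >= 0, prop <= 1)
  set_n_miss_sketch(x, floor(prop * length(x)))
}

set.seed(1)
vec <- rnorm(20)
sum(is.na(set_n_miss_sketch(vec, 4)))       # number of values set missing
sum(is.na(set_prop_miss_sketch(vec, 0.25))) # 25% of 20 values
```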
Returns (at least) factors of !NA and NA, where !NA indicates a datum that is
not missing, and NA indicates missingness. It also allows you to specify
some new missings, if you like. This function is what powers the factor
levels in `as_shadow()`.
shade(x, ..., extra_levels = NULL)
x |
a vector |
... |
additional levels of missing to add |
extra_levels |
extra levels you might want to specify for the factor. |
df <- tibble::tribble(
  ~wind, ~temp,
  -99,   45,
  68,    NA,
  72,    25
)
shade(df$wind)
shade(df$wind, inst_fail = -99)
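The labelling idea can be sketched in base R as below. `shade_sketch` is a hypothetical function, and the `NA_<name>` naming of custom levels is an assumption made for this illustration; the real `shade()` may differ in class and level details.

```r
# Hypothetical sketch: label each value as "!NA" or "NA", with optional
# custom missing codes passed as named arguments (e.g. inst_fail = -99).
shade_sketch <- function(x, ...) {
  custom <- list(...)
  labels <- ifelse(is.na(x), "NA", "!NA")
  for (nm in names(custom)) {
    # Values matching a custom code get their own "NA_<name>" label.
    labels[!is.na(x) & x %in% custom[[nm]]] <- paste0("NA_", nm)
  }
  factor(labels, levels = c("!NA", "NA", paste0("NA_", names(custom))))
}

wind <- c(-99, 68, 72)
shade_sketch(wind)
shade_sketch(wind, inst_fail = -99)
```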
Once data is in `nabular` form, where the shadow is bound to the data, it
can be useful to reshape it into a long format with the shadow columns
in a separate grouping - so you have `variable`, `value`, and
`variable_NA` and `value_NA`.
shadow_long(shadow_data, ..., fn_value_transform = NULL, only_main_vars = TRUE)
shadow_data |
a data.frame |
... |
bare name of variables that you want to focus on |
fn_value_transform |
function to transform the "value" column. Default
is NULL, which defaults to |
only_main_vars |
logical - do you want to filter down to main variables? |
data in long format, with columns `variable`, `value`, `variable_NA`, and `value_NA`.
aq_shadow <- nabular(airquality)
shadow_long(aq_shadow)
# then filter only on Ozone
shadow_long(aq_shadow, Ozone)
shadow_long(aq_shadow, Ozone, Solar.R)
# ensure `value` is numeric
shadow_long(aq_shadow, fn_value_transform = as.numeric)
shadow_long(aq_shadow, Ozone, Solar.R, fn_value_transform = as.numeric)
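The target shape can be illustrated with a small base R sketch. It builds the four columns by hand from a toy data frame; it does not reproduce how `shadow_long()` works internally, and the row order here is an artefact of the sketch.

```r
# Toy data (hypothetical values) with one missing entry.
df <- data.frame(wind = c(-99, 68, 72), temp = c(45, NA, 25))

# Stack each variable with its value, the name of its shadow column, and a
# missingness label.
long <- do.call(rbind, lapply(names(df), function(v) {
  data.frame(
    variable = v,
    value = df[[v]],
    variable_NA = paste0(v, "_NA"),
    value_NA = ifelse(is.na(df[[v]]), "NA", "!NA")
  )
}))
long
```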
`shadow_shift` transforms missing values to facilitate visualisation, and has
different behaviour for different types of variables. For numeric
variables, the values are shifted to 10% below the minimum value for a given
variable plus some jittered noise, to separate repeated values, so that
missing values can be visualised along with the rest of the data.
shadow_shift(...)
... |
arguments to |
add_shadow_shift()
cast_shadow_shift()
cast_shadow_shift_label()
airquality$Ozone
shadow_shift(airquality$Ozone)
## Not run:
library(dplyr)
airquality %>%
  mutate(Ozone_shift = shadow_shift(Ozone))
## End(Not run)
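For numeric vectors, the shift described above can be sketched in base R. This is an approximation under assumptions, not the package's code: `shift_sketch` is a hypothetical function, "10% below the minimum" is interpreted as 10% of the observed range, and the argument names and defaults are borrowed from `stat_miss_point()`.

```r
# Hypothetical sketch: move missing values to a band just below the observed
# minimum, with uniform jitter so repeated points do not overlap.
shift_sketch <- function(x, prop_below = 0.1, jitter = 0.05) {
  x_min <- min(x, na.rm = TRUE)
  x_range <- max(x, na.rm = TRUE) - x_min
  shifted <- x_min - prop_below * x_range
  noise <- (runif(length(x)) - 0.5) * x_range * jitter
  # Observed values are returned unchanged.
  ifelse(is.na(x), shifted + noise, x)
}

set.seed(2024)
oz <- shift_sketch(airquality$Ozone)
range(oz[is.na(airquality$Ozone)])  # all below min(Ozone, na.rm = TRUE)
```

With these defaults the jitter is at most 2.5% of the range in either direction, so shifted values always stay below the observed minimum.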
stat_miss_point adds a geometry for displaying missingness to geom_point
stat_miss_point( mapping = NULL, data = NULL, prop_below = 0.1, jitter = 0.05, geom = "point", position = "identity", na.rm = FALSE, show.legend = NA, inherit.aes = TRUE, ... )
mapping |
Set of aesthetic mappings created by |
data |
A data frame. If specified, overrides the default data frame defined at the top level of the plot. |
prop_below |
the degree to shift the values. The default is 0.1 |
jitter |
the amount of jitter to add. The default is 0.05 |
geom |
stat Override the default connection between |
position |
Position adjustment, either as a string, or the result of a call to a position adjustment function |
na.rm |
If |
show.legend |
logical. Should this layer be included in the legends? |
inherit.aes |
If |
... |
other arguments passed on to
|
Remove the shadow variables (which end in `_NA`) from the data, or vice versa.
This will also remove the `nabular` class from the data.
unbind_shadow(data)
unbind_data(data)
data |
data.frame containing shadow columns (created by |
`data.frame` without shadow columns if using `unbind_shadow()`, or
without the original data, if using `unbind_data()`.
# bind shadow columns
aq_sh <- bind_shadow(airquality)
# print data
aq_sh
# remove shadow columns
unbind_shadow(aq_sh)
# remove data
unbind_data(aq_sh)
# errors when you don't use data with shadows
## Not run:
unbind_data(airquality)
unbind_shadow(airquality)
## End(Not run)
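Because shadow columns are identified by the `_NA` suffix, the split can be sketched in base R with a name match. This is an illustration on a hand-built data frame with hypothetical values; it does not handle the `nabular` class or the error checking that the real functions perform.

```r
# Hand-built stand-in for data with shadow columns bound to it.
df_sh <- data.frame(
  wind = c(-99, 68),
  temp = c(45, NA),
  wind_NA = c("!NA", "!NA"),
  temp_NA = c("!NA", "NA")
)

# Shadow columns are those whose names end in "_NA".
is_shadow <- grepl("_NA$", names(df_sh))

data_only <- df_sh[, !is_shadow, drop = FALSE]   # like unbind_shadow()
shadow_only <- df_sh[, is_shadow, drop = FALSE]  # like unbind_data()
```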
This function is used inside `recode_shadow` to help evaluate the formula
call effectively. `.where` is a special function designed for use in
`recode_shadow`, and you shouldn't use it outside of it.
.where(...)
... |
case_when style formula |
a list of "condition" and "suffix" arguments
## Not run:
df <- tibble::tribble(
  ~wind, ~temp,
  -99,   45,
  68,    NA,
  72,    25
)
dfs <- bind_shadow(df)
recode_shadow(dfs, temp = .where(wind == -99 ~ "bananas"))
## End(Not run)
Internal function that is short for `which(is.na(x), arr.ind = TRUE)`.
Creates array index locations of missing values in a dataframe.
where_na(x)
x |
a dataframe |
a matrix with columns "row" and "col", which refer to the row and column that identify the position of a missing value in a dataframe
where_na(airquality)
where_na(oceanbuoys$sea_temp_c)
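Since the description gives the base R equivalent, the returned structure can be inspected without the package. The sketch below uses only base R and the built-in `airquality` data; the object name `na_pos` is chosen here for illustration.

```r
# Base R equivalent named in the description above.
na_pos <- which(is.na(airquality), arr.ind = TRUE)

# One row per missing cell, with "row" and "col" index columns.
head(na_pos)
colnames(na_pos)

# The index matrix can be used to pull out exactly the missing cells.
all(is.na(airquality[na_pos]))
```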
This function tells us which variables contain shade information
which_are_shade(.tbl)
.tbl |
a data.frame or tbl |
numeric - which column numbers contain shade information
df_shadow <- bind_shadow(airquality)
which_are_shade(df_shadow)
Equivalent to `which(is.na())` - returns integer locations of missing values.
which_na(x)
x |
a dataframe |
integer locations of missing values.
which_na(airquality)