Getting Started with naniar

Introduction

Missing values are ubiquitous in data and need to be carefully explored and handled in the initial stages of analysis. In this vignette we describe the tools in the package naniar for exploring missing data structures with minimal deviation from the common workflows of ggplot and tidy data (Wickham, 2014; Wickham, 2009).

Sometimes researchers or analysts will introduce or describe a mechanism for missingness. For example, they might explain that a weather station malfunctions during extreme weather events and does not record temperature data when gust speeds are high. This seems like a nice, simple, logical explanation. However, although a good explanation like this one is simple, the process of arriving at it probably was not, and likely involved more time than you would have liked spent on exploratory data analysis and modelling.

So when someone presents a really nice plot and a nice sensible explanation, the initial thought might be:

They worked it out themselves so quickly, so easy!

As if the problem was so easy to solve, they could accidentally solve it - they couldn’t not solve it.

However, I think that if you manage to get that on the first go, that is more like turning around and throwing a rock into a lake and it landing in a cup in a boat. Unlikely.

With that thought in mind, this vignette aims to work with the following three questions, using the tools developed in naniar and another package, visdat. Namely, how do we:

  1. Start looking at missing data?
  2. Explore missingness mechanisms?
  3. Model missingness?

How do we start looking at missing data?

When you start with a dataset, you might look at a general summary, using functions such as summary() and str().

These work really well when you’ve got a small amount of data, but when you have more data, you are generally limited by how much you can read.
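As a minimal sketch, the base R functions str() and summary() give this kind of overview for the built-in airquality data. They are chosen here purely for illustration; other summary functions would serve equally well.

```r
# Base R summaries of the built-in airquality data
str(airquality)      # column types and the first few values of each column
summary(airquality)  # numeric summaries, including NA counts per column

# Number of rows and columns
dims <- dim(airquality)
dims
```

Even with only six columns the printed output is already fairly long, which is the limitation described above.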

So before you start looking at missing data, you’ll need to look at the data, but what does that even mean?

The package visdat helps you get a handle on this. visdat provides a visualisation of an entire data frame at once, and was heavily inspired by csv-fingerprint, and functions like missmap, from Amelia.

There are two main functions in the visdat package:

  • vis_dat, and
  • vis_miss

vis_dat

library(visdat)
vis_dat(airquality)

vis_dat visualises the whole dataframe at once, and provides information about the class of the data input into R, as well as whether the data is missing or not.

vis_miss

The function vis_miss provides a summary of whether the data is missing or not. It also provides the percentage of missing values in each column.

vis_miss(airquality)

So here, Ozone and Solar.R are the only variables with missing data, with Ozone having 24.2% missing data and Solar.R having 4.6%. The other variables do not have any missing data.
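These percentages can be checked numerically. The following is a minimal sketch using base R only; it is not how vis_miss computes its labels internally, just an independent check on the same data.

```r
# Count and percentage of missing values per column, using base R only
n_miss   <- colSums(is.na(airquality))
pct_miss <- round(100 * colMeans(is.na(airquality)), 1)

# Combine into a small table, one row per variable
data.frame(n_miss, pct_miss)
```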

To read more about the functions available in visdat, see the vignette “Using visdat”.

Exploring missingness relationships

We can identify key variables that are missing using vis_miss, but for further exploration, we need to explore the relationship amongst the variables in this data:

  • Ozone
  • Solar.R
  • Wind
  • Temp
  • Month
  • Day

Typically, when exploring this data, you might want to explore the variables Solar.R and Ozone, and so plot a scatterplot of solar radiation and ozone, doing something like this:

library(ggplot2)
ggplot(airquality, 
       aes(x = Solar.R, 
           y = Ozone)) + 
  geom_point()
## Warning: Removed 42 rows containing missing values or values outside the scale range
## (`geom_point()`).

The problem with this is that ggplot does not handle missings by default, and removes the missing values. This makes them hard to explore. It also presents the strange question of “how do you visualise something that is not there?”. One approach to visualising missing data comes from ggobi and MANET, where we replace “NA” values with values 10% lower than the minimum value in that variable.
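The shifting idea can be sketched in base R. The helper shift_missing() below is hypothetical and not part of naniar, and it assumes that “10% lower” means 10% of the variable's observed range below its minimum; the implementation used by naniar may differ in detail.

```r
# Hypothetical helper (not part of naniar): replace each NA with a value
# placed a fixed proportion of the observed range below the observed minimum.
shift_missing <- function(x, prop_below = 0.1) {
  rng <- range(x, na.rm = TRUE)                    # observed min and max
  x[is.na(x)] <- rng[1] - prop_below * diff(rng)   # placeholder below the min
  x
}

shifted <- shift_missing(airquality$Ozone)

min(airquality$Ozone, na.rm = TRUE)  # smallest observed value
min(shifted)                         # placeholder value now standing in for NA
```

Because every placeholder sits below the smallest observed value, the missing cases form a separate band along the axis when plotted, rather than being dropped.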

This process is performed and visualised for you with the geom_miss_point() ggplot2 geom. Here, we illustrate by exploring the relationship between Ozone and Solar radiation from the airquality dataset.

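# For comparison: geom_point() drops the rows with missing values (with a warning)
ggplot(airquality, 
       aes(x = Solar.R, 
           y = Ozone)) + 
  geom_point()
## Warning: Removed 42 rows containing missing values or values outside the scale range
## (`geom_point()`).

library(naniar)

# geom_miss_point() keeps those rows, shifting the missing values below the minimum
ggplot(airquality, 
       aes(x = Solar.R, 
           y = Ozone)) + 
  geom_miss_point()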

Being a proper ggplot geom, it supports all of the standard features of ggplot2, such as facets:

ggplot(airquality, 
       aes(x = Solar.R, 
           y = Ozone)) + 
  geom_miss_point() + 
  facet_wrap(~Month)

and different themes:

ggplot(airquality, 
       aes(x = Solar.R, 
           y = Ozone)) + 
  geom_miss_point() + 
  facet_wrap(~Month) + 
  theme_dark()

Visualising missings in variables

Another approach to visualising the missings in a dataset is to use the gg_miss_var plot:

gg_miss_var(airquality)

The plots created with the gg_miss family all have a basic theme, but you can customise them by adding ggplot2 components such as themes and labels, like so:

gg_miss_var(airquality) + theme_bw() 

gg_miss_var(airquality) + labs(y = "Look at all the missing ones")

To add facets in these plots, you can use the facet argument:

gg_miss_var(airquality, facet = Month)

There are more visualisations available in naniar (each starting with gg_miss_) - you can see these in the “Gallery of Missing Data Visualisations” vignette.

It is important to note that for every visualisation of missing data in naniar, there is an accompanying function to get the dataframe of the plot out. This is important as the plot should not return a dataframe - but we also need to make the data available for use by the user so that it isn’t locked into a plot. For example, miss_var_summary() provides the dataframe that gg_miss_var() is based on.
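As a minimal sketch, the summary behind gg_miss_var() can be retrieved for the airquality data like this:

```r
library(naniar)

# One row per variable, with the number and percentage of missing values
var_summary <- miss_var_summary(airquality)
var_summary
```

The result is an ordinary data frame, so it can be filtered, joined, or passed to other plotting code like any other table.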

Replacing existing values with NA

When you are dealing with missing values, you might want to replace values with missing values (NA). This is useful in cases when you know the origin of the data and can be certain which values should be missing. For example, you might know that all values of “N/A”, “N A”, and “Not Available”, or -99, or -1 are supposed to be missing.

naniar provides the function replace_with_na to work specifically on this type of problem. This function is the complement to tidyr::replace_na, which replaces an NA value with a specified value, whereas naniar::replace_with_na replaces a value with an NA:

  • tidyr::replace_na: A missing value becomes a value (NA -> -99)
  • naniar::replace_with_na: A value becomes a missing value (-99 -> NA)

You can read more about this in the vignette “Replacing values with NA”.
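The two directions can be sketched on a small example. The data frame df below is made up for illustration, and the choice of -99 and "N/A" as missing-value codes is an assumption about that toy data.

```r
library(naniar)

# Toy data (made up for illustration) where -99 and "N/A" encode missing values
df <- data.frame(
  x = c(1, -99, 3),
  y = c("a", "N/A", "c")
)

# Value -> NA: name each column and the value(s) to treat as missing
df_na <- replace_with_na(df, replace = list(x = -99, y = "N/A"))
df_na

# NA -> value: the reverse direction, here for column x only
df_back <- tidyr::replace_na(df_na, list(x = -99))
df_back
```

Replacement is specified per column, so a code such as -99 in one variable can be turned into NA while the same number in another variable is left alone.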

Tidy Missing Data: The Shadow Matrix

Representing missing data structure in a tidy format is achieved using the shadow matrix, introduced in Swayne and Buja. The shadow matrix has the same dimensions as the data, and consists of binary indicators of the missingness of data values, where missing is represented as “NA” and not missing is represented as “!NA” (these may also be represented as 1 and 0, respectively). This representation can be seen in the figure below, where the suffix “_NA” is added to the variable names. The structure can also be extended with additional factor levels. For example, 0 might indicate data presence, 1 a missing value, 2 an imputed value, and 3 a particular type or class of missingness, where reasons for missingness might be known or inferred.

The data matrix can also be augmented to include the shadow matrix, which facilitates univariate and bivariate visualisations of missing data. Another format is the long form, which facilitates heatmap-style visualisations. This approach can be very helpful for giving an overview of which variables contain the most missingness. Methods can also be applied to rearrange rows and columns to find clusters and identify other interesting features of the data that may have previously been hidden or unclear.
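As a minimal sketch of these structures, naniar's as_shadow() builds the shadow matrix and bind_shadow() attaches it to the data. The example uses the airquality data from earlier in this vignette and does not cover the long form or the extended factor levels.

```r
library(naniar)

# Shadow matrix: same dimensions as the data, with "_NA"-suffixed columns
shadow <- as_shadow(airquality)
dim(shadow)
levels(shadow$Ozone_NA)  # the two missingness indicators

# Data augmented with its shadow matrix, side by side
aq_shadow <- bind_shadow(airquality)
names(aq_shadow)
```

With the augmented form, a shadow column such as Ozone_NA can be used as a grouping or colour variable when plotting or summarising another column.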