Title: | Preliminary Visualisation of Data |
---|---|
Description: | Create preliminary exploratory data visualisations of an entire dataset to identify problems or unexpected features using 'ggplot2'. |
Authors: | Nicholas Tierney [aut, cre], Sean Hughes [rev] (<https://orcid.org/0000-0002-9409-9405>, Sean Hughes reviewed the package for rOpenSci, see https://github.com/ropensci/onboarding/issues/87), Mara Averick [rev] (Mara Averick reviewed the package for rOpenSci, see https://github.com/ropensci/onboarding/issues/87), Stuart Lee [ctb], Earo Wang [ctb], Nic Crane [ctb], Christophe Regouby [ctb] |
Maintainer: | Nicholas Tierney <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.6.0.9000 |
Built: | 2024-11-15 05:23:02 UTC |
Source: | https://github.com/ropensci/visdat |
It can be useful to abbreviate variable names in a data set to make them easier to plot. This function takes a data set and a minimum length to abbreviate the variable names down to.
abbreviate_vars(data, min_length = 10)
data | data.frame |
min_length | minimum number of characters to abbreviate down to |
data frame with abbreviated variable names
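As a rough illustration of what abbreviating variable names involves, the sketch below uses base R's abbreviate() on the column names. This is an assumption made for illustration only; abbreviate_vars() may be implemented differently.

```r
# Hypothetical illustration using base R only; not the package's own code.
long_data <- data.frame(
  really_really_long_name = c(NA, NA, 1:8),
  very_quite_long_name = c(-1:-8, NA, NA)
)
original_names <- names(long_data)

# abbreviate() shortens each string down towards `minlength` characters
# while keeping the results unique.
short_names <- abbreviate(original_names, minlength = 10)
names(long_data) <- short_names
names(long_data)
```

Shorter names leave more room for axis labels, which is the motivation given above.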
long_data <- data.frame(
  really_really_long_name = c(NA, NA, 1:8),
  very_quite_long_name = c(-1:-8, NA, NA),
  this_long_name_is_something_else = c(NA, NA, seq(from = 0, to = 1, length.out = 8))
)

vis_miss(long_data)

long_data %>%
  abbreviate_vars() %>%
  vis_miss()
A dataset containing binary values and missing values. It is created to illustrate the usage of vis_binary().
dat_bin
A data frame with 100 rows and 3 variables:
a binary variable with missing values.
a binary variable with missing values.
a binary variable with no missing values.
Return data used to create vis_cor plot
Create a tidy dataframe of correlations suitable for plotting
data_vis_cor(x, ...)

## Default S3 method:
data_vis_cor(x, ...)

## S3 method for class 'data.frame'
data_vis_cor(
  x,
  cor_method = "pearson",
  na_action = "pairwise.complete.obs",
  ...
)

## S3 method for class 'grouped_df'
data_vis_cor(x, ...)
x | data.frame |
... | extra arguments (currently unused) |
cor_method | correlation method to use, from |
na_action | The method for computing covariances when there are missing values present. This can be "everything", "all.obs", "complete.obs", "na.or.complete", or "pairwise.complete.obs" (default). This option is taken from the |
data frame
tidy dataframe of correlations
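To make "a tidy dataframe of correlations" concrete, here is a minimal base R sketch that builds one row per pair of variables. The column names row_1, row_2 and value are assumptions chosen for illustration, and data_vis_cor() itself may differ in detail.

```r
# Illustrative sketch in base R; not the package's implementation.
cor_mat <- cor(
  airquality,
  method = "pearson",
  use = "pairwise.complete.obs"
)

# as.table() followed by as.data.frame() reshapes the square matrix
# into long form: one row per (variable, variable) pair.
tidy_cor <- as.data.frame(as.table(cor_mat), stringsAsFactors = FALSE)
names(tidy_cor) <- c("row_1", "row_2", "value")  # hypothetical names
head(tidy_cor)
```

A long layout like this is convenient for heatmaps, since each row maps to one tile.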
data_vis_cor(airquality)

## Not run:
# return data_vis_cor data for each group
library(dplyr)
airquality %>%
  group_by(Month) %>%
  data_vis_cor()
## End(Not run)
Return data used to create vis_dat plot
data_vis_dat(x, ...)

## Default S3 method:
data_vis_dat(x, ...)

## S3 method for class 'data.frame'
data_vis_dat(x, ...)

## S3 method for class 'grouped_df'
data_vis_dat(x, ...)
x | data.frame |
... | extra arguments (currently unused) |
data frame
data_vis_dat(airquality)

## Not run:
# return vis_dat data for each group
library(dplyr)
airquality %>%
  group_by(Month) %>%
  data_vis_dat()
## End(Not run)
Return data used to create vis_miss plot
Create a tidy dataframe of missing data suitable for plotting
data_vis_miss(x, ...)

## Default S3 method:
data_vis_miss(x, ...)

## S3 method for class 'data.frame'
data_vis_miss(x, cluster = FALSE, ...)

## S3 method for class 'grouped_df'
data_vis_miss(x, ...)
x | data.frame |
... | extra arguments (currently unused) |
cluster | logical - whether to cluster missingness. Default is FALSE. |
data frame
tidy dataframe of missing data
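The idea of a tidy dataframe of missing data can be sketched in base R as one row per cell, recording whether that cell is missing. The column names below are hypothetical, and the structure returned by data_vis_miss() may differ.

```r
# Illustrative sketch in base R; not the package's implementation.
miss_mat <- is.na(airquality)  # logical matrix, same shape as the data

# as.vector() unrolls the matrix column by column, so row indices cycle
# fastest and each variable name is repeated once per row.
tidy_miss <- data.frame(
  rows = rep(seq_len(nrow(miss_mat)), times = ncol(miss_mat)),
  variable = rep(colnames(miss_mat), each = nrow(miss_mat)),
  missing = as.vector(miss_mat),
  stringsAsFactors = FALSE
)
head(tidy_miss)
```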
data_vis_miss(airquality)

## Not run:
# return data_vis_miss data for each group
library(dplyr)
airquality %>%
  group_by(Month) %>%
  data_vis_miss()
## End(Not run)
A dataset containing information about some randomly generated people, created using the excellent wakefield package. It is created as a deliberately messy dataset.
typical_data
A data frame with 5000 rows and 11 variables:
Unique identifier for each individual, a sequential character vector of zero-padded identification numbers (IDs). see ?wakefield::id
Race for each individual, "Black", "White", "Hispanic", "Asian", "Other", "Bi-Racial", "Native", and "Hawaiian", see ?wakefield::race
Age of each individual, see ?wakefield::age
Male or female, see ?wakefield::sex
Height in centimeters, see ?wakefield::height
vector of intelligence quotients (IQ), see ?wakefield::iq
whether or not this person smokes, see ?wakefield::smokes
Yearly income in dollars, see ?wakefield::income
Whether or not this person has died yet, see ?wakefield::died
A wider dataset than typical_data containing information about some randomly generated people, created using the excellent wakefield package. It is created as a deliberately odd / eclectic dataset.
typical_data_large
A data frame with 300 rows and 49 variables:
Age of each individual, see ?wakefield::age for more info
A vector of animals, see ?wakefield::animal
A vector of "Yes" or "No"
A vector of living areas "Suburban", "Urban", "Rural"
names of cars - see ?mtcars
vector of number of children - see ?wakefield::children
character vector of "heads" and "tails"
vector of vectors from "colors()"
vector of "important" dates for an individual
TRUE / FALSE for whether this person died
6 sided dice result
vector of GATC nucleobases
birth dates
a 0/1 dummy var
education attainment level
employee status
eye colour
percent grades
favorite school grade
control or treatment
hair colours - "brown", "black", "blonde", or "red"
height in cm
yearly income
choice of internet browser
intelligence quotient
random language of the world
levels between 1 and 4
likert response - "strongly agree", "agree", and so on
lorem ipsum text
marital status- "married", "divorced", "widowed", "separated", etc
military branch they are in
their favorite month
their name
a random normal number
their favorite political party
their race
their religion
their SAT score
an uttered sentence
sex of their first child
sex of their second child
do they smoke
their median speed travelled in a car
the last state they visited in the USA
a random string they smashed out on the keyboard
the last key they hit in upper case
TRUE FALSE answer to a question
a year significant to that individual
a zip code they have visited
Visualise binary values
vis_binary(
  data,
  col_zero = "salmon",
  col_one = "steelblue2",
  col_na = "grey90",
  order = NULL
)
data | a data.frame |
col_zero | colour for zeroes, default is "salmon" |
col_one | colour for ones, default is "steelblue2" |
col_na | colour for NA, default is "grey90" |
order | optional character vector of the order of variables |
a ggplot plot of the binary values
vis_binary(dat_bin)

# changing order of variables
# create numeric names
df <- setNames(dat_bin, c("1.1", "8.9", "10.4"))
df

# not ideal
vis_binary(df)

# good - specify the original order
vis_binary(df, order = names(df))
vis_compare, like the other vis_* families, gives an at-a-glance ggplot of a dataset, but in this case focuses on visualising two different dataframes of the same dimension, so it takes two dataframes as arguments.
vis_compare(df1, df2)
df1 | The first dataframe to compare |
df2 | The second dataframe to compare to the first. |
ggplot2 object displaying which values in each data frame are present in each other, and which are not.
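The cell-by-cell comparison behind a plot like this can be sketched in base R. The helper cells_match() is hypothetical and only illustrates the idea of flagging which cells of two same-sized data frames agree; it is not how vis_compare() is necessarily written.

```r
# Illustrative sketch in base R; not the package's implementation.
aq_diff <- airquality
aq_diff[1:10, 1:2] <- NA

# TRUE where two cells agree: both missing, or both present and equal.
cells_match <- function(a, b) {
  (is.na(a) & is.na(b)) | (!is.na(a) & !is.na(b) & a == b)
}

# mapply() walks the two data frames column by column and returns
# a logical matrix with one column per variable.
same <- mapply(cells_match, airquality, aq_diff)
dim(same)
sum(!same)  # number of cells that differ
```

Treating two missing values as a match is a design choice made here for the sketch; a stricter comparison could flag them separately.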
vis_miss()
vis_dat()
vis_guess()
vis_expect()
vis_cor()
# make a new dataset of airquality that contains some NA values
aq_diff <- airquality
aq_diff[1:10, 1:2] <- NA
vis_compare(airquality, aq_diff)
Visualise correlations amongst variables in your data as a heatmap
vis_cor(
  data,
  cor_method = "pearson",
  na_action = "pairwise.complete.obs",
  facet,
  ...
)
data | data.frame |
cor_method | correlation method to use, from |
na_action | The method for computing covariances when there are missing values present. This can be "everything", "all.obs", "complete.obs", "na.or.complete", or "pairwise.complete.obs" (default). This option is taken from the |
facet | bare unquoted variable to use for facetting |
... | extra arguments you may want to pass to |
ggplot2 object
vis_cor(airquality)
vis_cor(airquality, facet = Month)
vis_cor(mtcars)

## Not run:
# this will error
vis_cor(iris)
## End(Not run)
vis_dat gives you an at-a-glance ggplot object of what is inside a dataframe. Cells are coloured according to what class they are and whether the values are missing. As vis_dat returns a ggplot object, it is very easy to change labels and customize the plot.
vis_dat(
  x,
  sort_type = TRUE,
  palette = "default",
  warn_large_data = TRUE,
  large_data_size = 9e+05,
  facet
)
x | a data.frame object |
sort_type | logical TRUE/FALSE. When TRUE (default), it sorts by the type in the column to make it easier to see what is in the data |
palette | character "default", "qual" or "cb_safe". "default" (the default) provides the stock ggplot scale for separating the colours. "qual" uses an experimental qualitative colour scheme for providing distinct colours for each Type. "cb_safe" is a set of colours that are appropriate for those with colourblindness. "qual" and "cb_safe" are drawn from http://colorbrewer2.org/. |
warn_large_data | logical - warn if there is large data? Default is TRUE. See note for more details. |
large_data_size | integer. Default is 900000 (given by nrow(data.frame) * ncol(data.frame)). This can be changed. See note for more details. |
facet | bare variable name for a variable you would like to facet by. By default there is no facetting. Only one variable can be facetted. You can get the data structure using |
ggplot2 object displaying the type of values in the data frame and the position of any missing values.
Some datasets might be too large to plot, sometimes creating a blank plot - if this happens, I would recommend downsampling the data, either looking at the first 1,000 rows or by taking a random sample. This means that you won't get the same "look" at the data, but it is better than a blank plot! See example code for suggestions on doing this.
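The size check and downsampling described in this note can be sketched in base R as follows. The data frame big_data is a hypothetical stand-in, and the threshold simply reuses the documented default of large_data_size; how the package performs its own check is not shown here.

```r
# Illustrative sketch; big_data is a made-up stand-in for a large dataset.
set.seed(2024)
big_data <- as.data.frame(matrix(rnorm(1e6), ncol = 10))

# The documented threshold is expressed in cells: rows times columns.
large_data_size <- 9e5
n_cells <- nrow(big_data) * ncol(big_data)

if (n_cells > large_data_size) {
  # take a random sample of 1,000 rows before plotting
  keep <- sample(nrow(big_data), size = 1000)
  small_data <- big_data[keep, ]
} else {
  small_data <- big_data
}
dim(small_data)
```

Using head(big_data, 1000) instead of a random sample gives the "first 1,000 rows" alternative mentioned above.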
vis_miss()
vis_guess()
vis_expect()
vis_cor()
vis_compare()
vis_dat(airquality)

# experimental colourblind safe palette
vis_dat(airquality, palette = "cb_safe")
vis_dat(airquality, palette = "qual")

# if you have a large dataset, you might want to try downsampling:
## Not run:
library(nycflights13)
library(dplyr)
flights %>%
  sample_n(1000) %>%
  vis_dat()

flights %>%
  slice(1:1000) %>%
  vis_dat()
## End(Not run)
vis_expect visualises certain conditions or values in your data. For example, if you are not sure whether to expect -1 in your data, you could write vis_expect(data, ~.x == -1), and you can see if there are times where the values in your data are equal to -1. You could also, for example, explore a set of bad strings, or possible NA values, and visualise where they are using vis_expect(data, ~.x %in% bad_strings), where bad_strings is a character vector containing bad strings like N A, N/A, etc.
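The kind of cell-wise check an expectation such as ~.x %in% bad_strings describes can be sketched in base R. The data frame dat and the contents of bad_strings are made up for illustration, and this is not a description of how vis_expect() evaluates the formula internally.

```r
# Illustrative sketch in base R; dat and bad_strings are hypothetical.
bad_strings <- c("NA", "N A", "N/A")
dat <- data.frame(
  x = c("1", "N/A", "NA"),
  y = c("A", "N A", "E"),
  stringsAsFactors = FALSE
)

# Apply the condition to every column; the result is a logical matrix
# with TRUE wherever a cell meets the expectation.
flagged <- vapply(
  dat,
  function(.x) .x %in% bad_strings,
  logical(nrow(dat))
)
flagged
mean(flagged) * 100  # percentage of cells where the condition is TRUE
```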
vis_expect(data, expectation, show_perc = TRUE)
data | a data.frame |
expectation | a formula following the syntax: |
show_perc | logical. TRUE adds the percentage of TRUE or FALSE values in the whole dataset to the legend. Default value is TRUE. |
a ggplot2 object
vis_miss()
vis_dat()
vis_guess()
vis_cor()
vis_compare()
dat_test <- tibble::tribble(
  ~x, ~y,
  -1, "A",
  0, "B",
  1, "C",
  NA, NA
)

vis_expect(dat_test, ~.x == -1)

vis_expect(airquality, ~.x == 5.1)

# explore some common NA strings
common_nas <- c(
  "NA",
  "N A",
  "N/A",
  "na",
  "n a",
  "n/a"
)

dat_ms <- tibble::tribble(
  ~x, ~y, ~z,
  "1", "A", -100,
  "3", "N/A", -99,
  "NA", NA, -98,
  "N A", "E", -101,
  "na", "F", -1
)

vis_expect(dat_ms, ~.x %in% common_nas)
vis_guess visualises the class of every single individual cell in a dataframe and displays it as a ggplot object, similar to vis_dat. Cells are coloured according to what class they are and whether the values are missing. vis_guess estimates the class of individual elements using readr::guess_parser. It may currently be slow on larger datasets.
vis_guess(x, palette = "default")
x | a data.frame |
palette | character "default", "qual" or "cb_safe". "default" (the default) provides the stock ggplot scale for separating the colours. "qual" uses an experimental qualitative colour scheme for providing distinct colours for each Type. "cb_safe" is a set of colours that are appropriate for those with colourblindness. "qual" and "cb_safe" are drawn from http://colorbrewer2.org/. |
ggplot2 object displaying the guess of the type of values in the data frame and the position of any missing values.
vis_miss()
vis_dat()
vis_expect()
vis_cor()
vis_compare()
messy_vector <- c(
  TRUE, "TRUE", "T", "01/01/01", "01/01/2001",
  NA, NaN, "NA", "Na", "na",
  "10", 10, "10.1", 10.1, "abc", "$%TG"
)
set.seed(1114)
messy_df <- data.frame(
  var1 = messy_vector,
  var2 = sample(messy_vector),
  var3 = sample(messy_vector)
)
vis_guess(messy_df)
vis_histogram visualises the distribution of every numeric column in a dataframe and displays it using a faceted ggplot object.
vis_histogram(x, ...)
x | a data.frame |
... | Other arguments are passed as geom_histogram arguments. |
ggplot2 object displaying a histogram of each numeric column in the data frame, one facet per column.
vis_histogram(airquality, bins = 30)
vis_miss provides an at-a-glance ggplot of the missingness inside a dataframe, colouring cells according to missingness, where black indicates a missing cell and grey indicates a present cell. As it returns a ggplot object, it is very easy to customize and change labels.
vis_miss(
  x,
  cluster = FALSE,
  sort_miss = FALSE,
  show_perc = TRUE,
  show_perc_col = TRUE,
  large_data_size = 9e+05,
  warn_large_data = TRUE,
  facet
)
x | a data.frame |
cluster | logical. TRUE specifies that you want to use hierarchical clustering (mcquitty method) to arrange rows according to missingness. FALSE specifies that you want to leave it as is. Default value is FALSE. |
sort_miss | logical. TRUE arranges the columns in order of missingness. Default value is FALSE. |
show_perc | logical. TRUE adds the percentage of missing and present values in the whole dataset to the legend. Default value is TRUE. |
show_perc_col | logical. TRUE adds the percentage of missing values in each column to the x axis. Can be disabled with FALSE. Default value is TRUE. No missingness percentage column information will be presented when |
large_data_size | integer. Default is 900000 (given by nrow(data.frame) * ncol(data.frame)). This can be changed. See note for more details. |
warn_large_data | logical - warn if there is large data? Default is TRUE. See note for more details. |
facet | (optional) bare variable name, if you want to create a faceted plot, with one plot per level of the variable. No missingness percentage column information will be presented when |
The missingness summaries in the columns are rounded to the nearest integer. For more detailed summaries, please see the summaries in the naniar R package, specifically, naniar::miss_var_summary().
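To see the effect of rounding the column summaries, the per-column percentage of missing values can be computed directly in base R. This is a sketch of the arithmetic only, not the package's own code.

```r
# Illustrative sketch in base R; not the package's implementation.
# colMeans() of a logical matrix gives the proportion of TRUE per column.
pct_missing <- colMeans(is.na(airquality)) * 100

round(pct_missing)     # rounded to the nearest integer
round(pct_missing, 2)  # more detail, closer to a numeric summary table
```

For airquality, Ozone has 37 missing values out of 153 rows (about 24.18%), which rounds to 24.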
ggplot2 object displaying the position of missing values in the dataframe, and the percentage of values missing and present.
Some datasets might be too large to plot, sometimes creating a blank plot - if this happens, I would recommend downsampling the data, either looking at the first 1,000 rows or by taking a random sample. This means that you won't get the same "look" at the data, but it is better than a blank plot! See example code for suggestions on doing this.
vis_dat()
vis_guess()
vis_expect()
vis_cor()
vis_compare()
vis_miss(airquality)
vis_miss(airquality, cluster = TRUE)
vis_miss(airquality, sort_miss = TRUE)
vis_miss(airquality, facet = Month)

## Not run:
# if you have a large dataset, you might want to try downsampling:
library(nycflights13)
library(dplyr)
flights %>%
  sample_n(1000) %>%
  vis_miss()

flights %>%
  slice(1:1000) %>%
  vis_miss()
## End(Not run)
Visualise all of the values in the data on a 0 to 1 scale. Only works on numeric data - see examples for how to subset to only numeric data.
vis_value(data, na_colour = "grey90", viridis_option = "D")
data | a data.frame |
na_colour | a character vector of length one describing what colour you want the NA values to be. Default is "grey90" |
viridis_option | A character string indicating the colormap option to use. Five options are available: "magma" (or "A"), "inferno" (or "B"), "plasma" (or "C"), "viridis" (or "D", the default option) and "cividis" (or "E"). |
a ggplot plot of the values
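One way to picture putting values "on a 0 to 1 scale" is min-max rescaling of each numeric column, sketched below in base R. Both the helper rescale_01() and the choice of min-max scaling are assumptions made for illustration; vis_value() may rescale differently.

```r
# Illustrative sketch in base R; rescale_01() is a hypothetical helper.
# Keep only the numeric columns, since rescaling needs numbers.
numeric_cols <- vapply(iris, is.numeric, logical(1))
iris_num <- iris[numeric_cols]

# Min-max rescaling: the smallest value maps to 0, the largest to 1.
rescale_01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

iris_scaled <- as.data.frame(lapply(iris_num, rescale_01))
summary(iris_scaled)
```

Rescaling per column makes variables with very different units comparable on one colour scale.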
vis_value(airquality)
vis_value(airquality, viridis_option = "A")
vis_value(airquality, viridis_option = "B")
vis_value(airquality, viridis_option = "C")
vis_value(airquality, viridis_option = "E")

## Not run:
library(dplyr)
library(ggplot2)  # provides the diamonds dataset
diamonds %>%
  select_if(is.numeric) %>%
  vis_value()
## End(Not run)