Title: | Browse Over Longitudinal Data Graphically and Analytically in R |
---|---|
Description: | Provides a framework of tools to summarise, visualise, and explore longitudinal data. It builds upon the tidy time series data frames used in the 'tsibble' package, and is designed to integrate within the 'tidyverse', and 'tidyverts' (for time series) ecosystems. The methods implemented include calculating features for understanding longitudinal data, including calculating summary statistics such as quantiles, medians, and numeric ranges, sampling individual series, identifying individual series representative of a group, and extending the facet system in 'ggplot2' to facilitate exploration of samples of data. These methods are fully described in the paper "brolgar: An R package to Browse Over Longitudinal Data Graphically and Analytically in R", Nicholas Tierney, Dianne Cook, Tania Prvan (2020) <doi:10.32614/RJ-2022-023>. |
Authors: | Nicholas Tierney [aut, cre], Di Cook [aut], Tania Prvan [aut], Stuart Lee [ctb], Earo Wang [ctb] |
Maintainer: | Nicholas Tierney <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.0.1 |
Built: | 2024-12-06 05:13:37 UTC |
Source: | https://github.com/njtierney/brolgar |
Here, we are not counting the number of rows in the dataset, but rather we are counting the number of observations for each key in the tsibble.
add_n_obs(.data, ...)
.data | tsibble
... | extra arguments
tsibble with n_obs, the number of observations per key added.
library(dplyr)

# you can explore the data to see those cases that have exactly two
# observations:
heights %>%
  add_n_obs() %>%
  filter(n_obs == 2)
Customised summaries of vectors with appropriate defaults for longitudinal data. The functions are prefixed with b_ to assist with autocomplete. They all use na.rm = TRUE, and calculations involving quantiles use type = 8 and names = FALSE. Summaries include:
* b_min: The minimum
* b_max: The maximum
* b_median: The median
* b_mean: The mean
* b_q25: The 25th quantile
* b_q75: The 75th quantile
* b_range: The range
* b_range_diff: The difference in range (max - min)
* b_sd: The standard deviation
* b_var: The variance
* b_mad: The median absolute deviation
* b_iqr: The interquartile range
* b_diff_var: The variance of diff()
* b_diff_sd: The standard deviation of diff()
* b_diff_mean: The mean of diff()
* b_diff_median: The median of diff()
* b_diff_q25: The q25 of diff()
* b_diff_q75: The q75 of diff()
* b_diff_max: The maximum of diff()
* b_diff_min: The minimum of diff()
* b_diff_iqr: The interquartile range of diff()
b_min(x, ...)
b_max(x, ...)
b_median(x, ...)
b_mean(x, ...)
b_q25(x, ...)
b_q75(x, ...)
b_range(x, ...)
b_range_diff(x, ...)
b_sd(x, ...)
b_var(x, ...)
b_mad(x, ...)
b_iqr(x, ...)
b_diff_var(x, ...)
b_diff_sd(x, ...)
b_diff_mean(x, ...)
b_diff_median(x, ...)
b_diff_q25(x, ...)
b_diff_q75(x, ...)
b_diff_max(x, ...)
b_diff_min(x, ...)
b_diff_iqr(x, ...)
x | a vector
... | other arguments to pass
x <- c(1:5, NA, 5:1)

min(x)
b_min(x)
max(x)
b_max(x)
median(x)
b_median(x)
mean(x)
b_mean(x)
range(x)
b_range(x)
var(x)
b_var(x)
sd(x)
b_sd(x)
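As a quick check of those defaults, b_q25() should agree with base quantile() once the same settings are supplied explicitly (a minimal sketch, reusing x from above):

b_q25(x)
quantile(x, probs = 0.25, type = 8, names = FALSE, na.rm = TRUE)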
Calculate features of a tsibble object in conjunction with features().
You can calculate a series of summary statistics (features) of a given variable for a dataset. For example, a three number summary, the minimum, median, and maximum, can be calculated for a given variable. This is designed to work with the features() function shown in the examples. Other available features in brolgar include:

* feat_three_num() - minimum, median, maximum
* feat_five_num() - minimum, q25, median, q75, maximum.
* feat_ranges() - min, max, range difference, interquartile range.
* feat_spread() - variance, standard deviation, median absolute deviation, and interquartile range.
* feat_monotonic() - is it always increasing, decreasing, or unvarying?
* feat_diff_summary() - the summary statistics of the differences amongst a value, including the five number summary, as well as the standard deviation and variance. Returns NA if there is only one observation, as we can't take the difference of one observation, and a difference of 0 in these cases would be misleading.
* feat_brolgar() - all features in brolgar.
feat_three_num(x, ...)
feat_five_num(x, ...)
feat_ranges(x, ...)
feat_spread(x, ...)
feat_monotonic(x, ...)
feat_brolgar(x, ...)
feat_diff_summary(x, ...)
x | A vector to extract features from.
... | Further arguments passed to other functions.
# You can use any of the features `feat_*` in conjunction with `features`
# like so:
heights %>%
  features(height_cm,      # variable you want to explore
           feat_three_num) # the feature summarisation you want to perform
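The other features listed above slot in the same way; for instance, a brief sketch with feat_monotonic and feat_diff_summary on the same variable:

heights %>%
  features(height_cm, feat_monotonic)

heights %>%
  features(height_cm, feat_diff_summary)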
This function requires a tbl_ts object, which can be created with tsibble::as_tsibble(). Under the hood, facet_sample is powered by stratify_keys() and sample_n_keys().
facet_sample(
  n_per_facet = 3,
  n_facets = 12,
  nrow = NULL,
  ncol = NULL,
  scales = "fixed",
  shrink = TRUE,
  strip.position = "top"
)
n_per_facet | Number of keys per facet you want to plot. Default is 3.
n_facets | Number of facets to create. Default is 12.
nrow, ncol | Number of rows and columns.
scales | Should scales be fixed ("fixed", the default), free ("free"), or free in one dimension ("free_x", "free_y")?
shrink | If TRUE, will shrink scales to fit output of statistics, not raw data. If FALSE, will be range of raw data before statistical summary.
strip.position | By default, the labels are displayed on the top of the plot. Using strip.position it is possible to place the labels on any of the four sides by setting strip.position = c("top", "bottom", "left", "right").
a ggplot object
library(ggplot2)

ggplot(heights, aes(x = year, y = height_cm, group = country)) +
  geom_line() +
  facet_sample()

ggplot(heights, aes(x = year, y = height_cm, group = country)) +
  geom_line() +
  facet_sample(n_per_facet = 1, n_facets = 12)
This function requires a tbl_ts object, which can be created with tsibble::as_tsibble(). Under the hood, facet_strata is powered by stratify_keys().
facet_strata(
  n_strata = 12,
  along = NULL,
  fun = mean,
  nrow = NULL,
  ncol = NULL,
  scales = "fixed",
  shrink = TRUE,
  strip.position = "top"
)
n_strata | number of groups to create
along | variable to stratify along. This groups by each key and then takes a summary statistic (given by fun). It then arranges by that summary value and assigns keys to each of the n_strata groups.
fun | summary function. Default is mean.
nrow, ncol | Number of rows and columns.
scales | Should scales be fixed ("fixed", the default), free ("free"), or free in one dimension ("free_x", "free_y")?
shrink | If TRUE, will shrink scales to fit output of statistics, not raw data. If FALSE, will be range of raw data before statistical summary.
strip.position | By default, the labels are displayed on the top of the plot. Using strip.position it is possible to place the labels on any of the four sides by setting strip.position = c("top", "bottom", "left", "right").
a ggplot object
library(ggplot2)

ggplot(heights, aes(x = year, y = height_cm, group = country)) +
  geom_line() +
  facet_strata()

ggplot(heights, aes(x = year, y = height_cm, group = country)) +
  geom_line() +
  facet_wrap(~continent)

ggplot(heights, aes(x = year, y = height_cm, group = country)) +
  geom_line() +
  facet_strata(along = year)

library(dplyr)

heights %>%
  key_slope(height_cm ~ year) %>%
  right_join(heights, ., by = "country") %>%
  ggplot(aes(x = year, y = height_cm)) +
  geom_line(aes(group = country)) +
  geom_smooth(method = "lm") +
  facet_strata(along = .slope_year)
Average male heights in 144 countries from 1810-1989, with a smaller number of countries from 1500-1800. Data has been filtered to only include countries with more than one observation.
heights
An object of class tbl_ts (inherits from tbl_df, tbl, data.frame) with 1490 rows and 4 columns.
heights is stored as a time series tsibble object. It contains the variables:

* country: The country. This forms the identifying key.
* year: Year. This forms the time index.
* height_cm: Average male height in centimeters.
* continent: The continent, extracted from the country name using the countrycode package (https://joss.theoj.org/papers/10.21105/joss.00848).
For more information, see the article: "Why are you tall while others are short? Agricultural production and other proximate determinants of global heights", Joerg Baten and Matthias Blum, European Review of Economic History 18 (2014), 144–165. Data available from https://datasets.iisg.amsterdam/dataset.xhtml?persistentId=hdl:10622/IAEKLA, accessed via the Clio Infra website.
# show the data
heights

# show the spaghetti plot (ugh!)
library(ggplot2)
ggplot(heights, aes(x = year, y = height_cm, group = country)) +
  geom_line()

# Explore all samples with `facet_strata()`
ggplot(heights, aes(x = year, y = height_cm, group = country)) +
  geom_line() +
  facet_strata()

# Explore the heights over each continent
ggplot(heights, aes(x = year, y = height_cm, group = country)) +
  geom_line() +
  facet_wrap(~continent)

# explore the five number summary of height_cm with `features`
heights %>%
  features(height_cm, feat_five_num)
These functions check if the index is regular (index_regular()), and summarise the index variable (index_summary()). This can be useful to check your index variables.
index_regular(.data, ...)

## S3 method for class 'tbl_ts'
index_regular(.data, ...)

## S3 method for class 'data.frame'
index_regular(.data, index, ...)

index_summary(.data, ...)

## S3 method for class 'tbl_ts'
index_summary(.data, ...)

## S3 method for class 'data.frame'
index_summary(.data, index, ...)
.data | data.frame or tsibble
... | extra arguments
index | the proposed index variable
logical: TRUE means the index is regular, FALSE means it is not.
# a tsibble
index_regular(heights)

# some data frames
index_regular(pisa, year)
index_regular(airquality, Month)

# a tsibble
index_summary(heights)

# some data frames
index_summary(pisa, year)
index_summary(airquality, Month)
index_summary(airquality, Day)
Using key_slope you can fit a linear model to each key in the tsibble. add_key_slope adds this slope information back to the data, and returns the full dimension tsibble.
key_slope(.data, formula, ...)

add_key_slope(.data, formula)

add_key_slope.default(.data, formula)
.data | tsibble
formula | formula
... | extra arguments
tibble with coefficient information
key_slope(heights, height_cm ~ year)
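add_key_slope() follows the same interface but returns the full tsibble with the coefficient columns joined back on; a brief sketch, following the usage above:

add_key_slope(heights, height_cm ~ year)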
Return keys nearest to a given statistic or summary.
keys_near(.data, ...)

## Default S3 method:
keys_near(.data, ...)
.data | tsibble
... | extra arguments to pass on to the underlying methods
data.frame containing keys closest to a given statistic.
keys_near(heights, height_cm)
Return keys nearest to a given statistic or summary.
## S3 method for class 'data.frame'
keys_near(.data, key, var, top_n = 1, funs = l_five_num, ...)
.data | data.frame
key | key, which identifies unique observations.
var | variable to summarise
top_n | top number of closest observations to return - default is 1, which will also return ties.
funs | named list of functions to summarise by. Default is a given list of the five number summary, l_five_num.
... | extra arguments to pass on to the summary functions in funs
heights %>%
  key_slope(height_cm ~ year) %>%
  keys_near(key = country,
            var = .slope_year)

# Specify your own list of summaries
l_ranges <- list(min = b_min,
                 range_diff = b_range_diff,
                 max = b_max,
                 iqr = b_iqr)

heights %>%
  key_slope(formula = height_cm ~ year) %>%
  keys_near(key = country,
            var = .slope_year,
            funs = l_ranges)
Return keys nearest to a given statistic or summary.
## S3 method for class 'tbl_ts'
keys_near(.data, var, top_n = 1, funs = l_five_num, stat_as_factor = TRUE, ...)
.data | tsibble
var | variable to summarise
top_n | top number of closest observations to return - default is 1, which will also return ties.
funs | named list of functions to summarise by. Default is a given list of the five number summary, l_five_num.
stat_as_factor | should the returned statistic be coerced into a factor? Default is TRUE.
... | extra arguments to pass on to the summary functions in funs
# Return observations closest to the five number summary of height_cm
heights %>%
  keys_near(var = height_cm)
Designed for use with the keys_near() function.
l_five_num

l_three_num
l_five_num: an object of class list of length 5.

l_three_num: an object of class list of length 3.
# Specify your own list of summaries
l_ranges <- list(min = b_min,
                 range_diff = b_range_diff,
                 max = b_max,
                 iqr = b_iqr)

heights %>%
  key_slope(formula = height_cm ~ year) %>%
  keys_near(key = country,
            var = .slope_year,
            funs = l_ranges)
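Printing the lists shows which b_* summaries they bundle, and l_three_num can be swapped in as funs in the same way (a short sketch):

l_five_num
l_three_num

heights %>%
  key_slope(height_cm ~ year) %>%
  keys_near(key = country,
            var = .slope_year,
            funs = l_three_num)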
These provide three functions to tell you if values are always increasing, decreasing, or unvarying: increasing(), decreasing(), and unvarying(). Under the hood they use diff() to find differences, so if you like you can pass extra arguments to diff().
increasing(x, ...)

decreasing(x, ...)

unvarying(x, ...)

monotonic(x, ...)
x | numeric or integer
... | extra arguments to pass to diff
logical TRUE or FALSE
vec_inc <- c(1:10)
vec_dec <- c(10:1)
vec_ran <- c(sample(1:10))
vec_flat <- rep.int(1, 10)

increasing(vec_inc)
increasing(vec_dec)
increasing(vec_ran)
increasing(vec_flat)

decreasing(vec_inc)
decreasing(vec_dec)
decreasing(vec_ran)
decreasing(vec_flat)

unvarying(vec_inc)
unvarying(vec_dec)
unvarying(vec_ran)
unvarying(vec_flat)

library(ggplot2)
library(gghighlight)
library(dplyr)

heights_mono <- heights %>%
  features(height_cm, feat_monotonic) %>%
  left_join(heights, by = "country")

ggplot(heights_mono, aes(x = year, y = height_cm, group = country)) +
  geom_line() +
  gghighlight(increase)

ggplot(heights_mono, aes(x = year, y = height_cm, group = country)) +
  geom_line() +
  gghighlight(decrease)

heights_mono %>%
  filter(monotonic) %>%
  ggplot(aes(x = year, y = height_cm, group = country)) +
  geom_line()

heights_mono %>%
  filter(increase) %>%
  ggplot(aes(x = year, y = height_cm, group = country)) +
  geom_line()
Returns the number of observations of a vector or data.frame. It uses vctrs::vec_size() under the hood.
n_obs(x, names = TRUE)
x | vector or data.frame
names | logical; if TRUE the result is a vector named "n_obs", else it is just the number of observations.
number of observations
You cannot use n_obs with features counting the key variable, like so: features(heights, country, n_obs). Instead, use any other variable.
n_obs(iris)
n_obs(1:10)
add_n_obs(heights)
heights %>%
  features(height_cm, n_obs) # can be any variable except the key (country)
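As a quick illustration of the names argument (a minimal sketch; iris has 150 rows):

n_obs(iris)                # a vector named "n_obs": c(n_obs = 150)
n_obs(iris, names = FALSE) # the bare number: 150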
Return x percent to y percent of values
near_between(x, from, to)
x | numeric vector
from | the lower bound of percentage
to | the upper bound of percentage
logical vector
x <- runif(20)

near_middle(x = x, middle = 0.5, within = 0.2)

library(dplyr)
heights %>%
  features(height_cm, list(min = min)) %>%
  filter(near_between(min, 0.1, 0.9))

near_quantile(x = x, probs = 0.5, tol = 0.01)
near_quantile(x, c(0.25, 0.5, 0.75), 0.05)

heights %>%
  features(height_cm, l_five_num) %>%
  mutate_at(vars(min:max), .funs = near_quantile, 0.5, 0.01) %>%
  filter(min)

heights %>%
  features(height_cm, list(min = min)) %>%
  mutate(min_near_q3 = near_quantile(min, c(0.25, 0.5, 0.75), 0.01)) %>%
  filter(min_near_q3)

heights %>%
  features(height_cm, list(min = min)) %>%
  filter(near_between(min, 0.1, 0.9))

heights %>%
  features(height_cm, list(min = min)) %>%
  filter(near_middle(min, 0.5, 0.1))
Return the middle x percent of values
near_middle(x, middle, within)
x | numeric vector
middle | percentage you want to center around
within | percentage around center
logical vector
x <- runif(20)

near_middle(x = x, middle = 0.5, within = 0.2)

library(dplyr)
heights %>%
  features(height_cm, list(min = min)) %>%
  filter(near_middle(min, 0.5, 0.1))
Which values are nearest to any given quantiles
near_quantile(x, probs, tol = 0.01)
x | vector
probs | quantiles to calculate
tol | tolerance in terms of x that you will accept near to the quantile. Default is 0.01.
logical vector of TRUE/FALSE if number is close to a quantile
x <- runif(20)

near_quantile(x, 0.5, 0.05)
near_quantile(x, c(0.25, 0.5, 0.75), 0.05)

library(dplyr)
heights %>%
  features(height_cm, list(min = min)) %>%
  mutate(min_near_median = near_quantile(min, 0.5, 0.01)) %>%
  filter(min_near_median)

heights %>%
  features(height_cm, list(min = min)) %>%
  mutate(min_near_q3 = near_quantile(min, c(0.25, 0.5, 0.75), 0.01)) %>%
  filter(min_near_q3)
Returns TRUE if x is nearest to y. There are two implementations. nearest_lgl() returns a logical vector when an element of the first argument is nearest to an element of the second argument. nearest_qt_lgl() is similar to nearest_lgl(), but instead determines if an element of the first argument is nearest to some value of the given quantile probabilities. See the examples for more detail.
nearest_lgl(x, y)

nearest_qt_lgl(y, ...)
x | a numeric vector
y | a numeric vector
... | (if used) arguments to pass to quantile()
a logical vector the same length as y
x <- 1:10
y <- 5:14
z <- 16:25
a <- -1:-5
b <- -1

nearest_lgl(x, y)
nearest_lgl(y, x)

nearest_lgl(x, z)
nearest_lgl(z, x)

nearest_lgl(x, a)
nearest_lgl(a, x)

nearest_lgl(x, b)
nearest_lgl(b, x)

library(dplyr)
heights_near_min <- heights %>%
  filter(nearest_lgl(min(height_cm), height_cm))

heights_near_fivenum <- heights %>%
  filter(nearest_lgl(fivenum(height_cm), height_cm))

heights_near_qt_1 <- heights %>%
  filter(nearest_qt_lgl(height_cm, c(0.5)))

heights_near_qt_3 <- heights %>%
  filter(nearest_qt_lgl(height_cm, c(0.1, 0.5, 0.9)))
A subset of PISA data, containing scores and other information from the triennial testing of 15-year-olds around the globe. Original data available from https://www.oecd.org/pisa/data/. Data derived from https://github.com/kevinwang09/learningtower.
pisa
A tibble of the following variables:

* year: the year of measurement
* country: the three letter country code. This data contains Australia, New Zealand, and Indonesia. The full data from learningtower contains 99 countries.
* school_id: the unique school identification number
* student_id: the student identification number
* gender: recorded gender - 1 = female, 2 = male, or missing
* math: simulated score in mathematics
* read: simulated score in reading
* science: simulated score in science
* stu_wgt: the final survey weight for the student
Understanding a bit more about the PISA data: the school_id and student_id are not unique across time. This means the longitudinal element is the country within a given year.

We can cast pisa as a tsibble, but we need to aggregate the data to each year and country. In doing so, it is important that we provide some summary statistics of each of the scores - we want to include the mean, and minimum and maximum of the math, reading, and science scores, so that we do not lose the information of the individuals.

The example code below does this, first grouping by year and country, then calculating the weighted mean for math, reading, and science. This can be done using the student weight variable stu_wgt, to get the survey weighted mean. The minimum and maximum are then calculated.
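Here is a minimal sketch of that aggregation, assuming dplyr is loaded; the pisa_agg name is illustrative, and the column names follow the variable list above:

library(dplyr)

# One row per country and year: survey-weighted means via stu_wgt,
# plus the minimum and maximum of the math score for the ribbon plot below
pisa_agg <- pisa %>%
  group_by(year, country) %>%
  summarise(
    math_mean    = weighted.mean(math, w = stu_wgt, na.rm = TRUE),
    math_min     = min(math, na.rm = TRUE),
    math_max     = max(math, na.rm = TRUE),
    read_mean    = weighted.mean(read, w = stu_wgt, na.rm = TRUE),
    science_mean = weighted.mean(science, w = stu_wgt, na.rm = TRUE),
    .groups = "drop"
  )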
pisa

library(dplyr)

# Let's identify
# 1. The **key**, the individual, who would have repeated measurements.
# 2. The **index**, the time component.
# 3. The **regularity** of the time interval (index).

# Here it looks like the key is the student_id, which is nested within
# school_id and country, and the index is year. Using the aggregated data
# (pisa_agg, above), we would write the following
as_tsibble(pisa_agg, key = country, index = year)

# We can assess the regularity of the year like so:
index_regular(pisa_agg, year)
index_summary(pisa_agg, year)

# We can now convert this into a `tsibble`:
pisa_ts <- as_tsibble(pisa_agg, key = country, index = year, regular = TRUE)

pisa_ts

pisa_ts_sub <- pisa_ts %>%
  filter(country %in% c("AUS", "NZL", "IDN"))

library(ggplot2)
ggplot(pisa_ts_sub,
       aes(x = year, y = math_mean, group = country, colour = country)) +
  geom_ribbon(aes(ymin = math_min, ymax = math_max), fill = "grey70") +
  geom_line(size = 1) +
  lims(y = c(0, 1000)) +
  labs(y = "math") +
  facet_wrap(~country)
Sample a number or fraction of keys to explore
sample_n_keys(.data, size)

sample_frac_keys(.data, size)
.data | tsibble object
size | The number or fraction of keys to sample, depending on the function used: in sample_n_keys() it is the number of keys, and in sample_frac_keys() it is the fraction of keys.
a tsibble containing only the sampled keys
library(ggplot2)

sample_n_keys(heights, size = 10) %>%
  ggplot(aes(x = year, y = height_cm, group = country)) +
  geom_line()

sample_frac_keys(wages, 0.1) %>%
  ggplot(aes(x = xp, y = unemploy_rate, group = id)) +
  geom_line()
To look at as much of the raw data as possible, it can be helpful to stratify the data into groups for plotting. You can stratify the keys using the stratify_keys() function, which adds the column .strata. This allows the user to create facetted plots showing more of the raw data.
stratify_keys(.data, n_strata, along = NULL, fun = mean, ...)
.data | data.frame to explore
n_strata | number of groups to create
along | variable to stratify along. This groups by each key and then takes a summary statistic (given by fun). It then arranges by that summary value and assigns keys to each of the n_strata groups.
fun | summary function. Default is mean.
... | extra arguments
data.frame with column .strata, containing n_strata groups
library(ggplot2)
library(brolgar)

heights %>%
  sample_frac_keys(size = 0.1) %>%
  stratify_keys(10) %>%
  ggplot(aes(x = height_cm, y = year, group = country)) +
  geom_line() +
  facet_wrap(~.strata)

# now facet along some feature
library(dplyr)

heights %>%
  key_slope(height_cm ~ year) %>%
  right_join(heights, ., by = "country") %>%
  stratify_keys(n_strata = 12, along = .slope_year, fun = median) %>%
  ggplot(aes(x = year, y = height_cm, group = country)) +
  geom_line() +
  facet_wrap(~.strata)

heights %>%
  stratify_keys(n_strata = 12, along = height_cm) %>%
  ggplot(aes(x = year, y = height_cm, group = country)) +
  geom_line() +
  facet_wrap(~.strata)
This data contains measurements on hourly wages by years in the workforce, with education and race as covariates. The population measured was male high-school dropouts, aged between 14 and 17 years when first measured. wages is a time series tsibble. It comes from J. D. Singer and J. B. Willett. Applied Longitudinal Data Analysis. Oxford University Press, Oxford, UK, 2003. https://stats.idre.ucla.edu/stat/r/examples/alda/data/wages_pp.txt
wages
A tsibble data frame with 6402 rows and 9 variables:

* id: 1-888, one for each subject. This forms the key of the data.
* ln_wages: natural log of wages, adjusted for inflation, to 1990 dollars.
* xp: experience - the length of time in the workforce (in years). This is treated as the time variable, with t0 for each subject starting on their first day at work. The number of time points and values of time points for each subject can differ. This forms the index of the data.
* ged: when/if a graduate equivalency diploma is obtained.
* xp_since_ged: change in experience since getting a GED (if they get one).
* black: categorical indicator of race = black.
* hispanic: categorical indicator of race = hispanic.
* high_grade: highest grade completed.
* unemploy_rate: unemployment rates in the local geographic region at each measurement time.
# show the data
wages

library(ggplot2)

# set seed so that the plots stay the same
set.seed(2019-7-15-1300)

# explore a sample of five individuals
wages %>%
  sample_n_keys(size = 5) %>%
  ggplot(aes(x = xp, y = ln_wages, group = id)) +
  geom_line()

# Explore many samples with `facet_sample()`
ggplot(wages, aes(x = xp, y = ln_wages, group = id)) +
  geom_line() +
  facet_sample()

# explore the five number summary of ln_wages with `features`
wages %>%
  features(ln_wages, feat_five_num)