brolgar supports two ways to explore longitudinal data: first by looking at the raw data, then by using summaries. This vignette shows a variety of techniques built around these two ideas.
When you first receive your data, you want to look at as much of the raw data as possible. This section covers a few techniques for exploring raw data without too much overplotting.
Sample n random individuals to explore (Note: Possibly not representative)
For example, we can sample 20 random individuals, and then plot them.
(perhaps change sample_n_keys into sample_id.)
library(brolgar)
# Draw 20 individuals (keys) at random, keeping all rows for each sampled id
wages %>%
  sample_n_keys(size = 20)
#> # A tsibble: 132 x 9 [!]
#> # Key: id [20]
#> id ln_wages xp ged xp_since_ged black hispanic high_grade
#> <int> <dbl> <dbl> <int> <dbl> <int> <int> <int>
#> 1 7173 1.58 0.247 0 0 1 0 10
#> 2 7173 1.96 0.542 0 0 1 0 10
#> 3 7173 1.68 1.41 0 0 1 0 10
#> 4 7173 1.75 1.47 0 0 1 0 10
#> 5 7173 1.48 1.93 0 0 1 0 10
#> 6 9613 1.60 0.375 0 0 0 0 11
#> 7 9613 1.69 1.38 0 0 0 0 11
#> 8 9613 1.48 2.74 0 0 0 0 11
#> 9 9613 1.37 3.68 0 0 0 0 11
#> 10 9613 1.30 4.01 0 0 0 0 11
#> # ℹ 122 more rows
#> # ℹ 1 more variable: unemploy_rate <dbl>
library(ggplot2)
wages %>%
  sample_n_keys(size = 20) %>%
  ggplot(aes(x = xp,
             y = ln_wages,
             group = id)) +
  geom_line()
The number of observations varies between individuals: some have only a few, and some have many. To filter on this, use add_n_obs(), which adds a new column, n_obs, giving the number of observations for each key.
wages %>%
  add_n_obs()
#> # A tsibble: 6,402 x 10 [!]
#> # Key: id [888]
#> id xp n_obs ln_wages ged xp_since_ged black hispanic high_grade
#> <int> <dbl> <int> <dbl> <int> <dbl> <int> <int> <int>
#> 1 31 0.015 8 1.49 1 0.015 0 1 8
#> 2 31 0.715 8 1.43 1 0.715 0 1 8
#> 3 31 1.73 8 1.47 1 1.73 0 1 8
#> 4 31 2.77 8 1.75 1 2.77 0 1 8
#> 5 31 3.93 8 1.93 1 3.93 0 1 8
#> 6 31 4.95 8 1.71 1 4.95 0 1 8
#> 7 31 5.96 8 2.09 1 5.96 0 1 8
#> 8 31 6.98 8 2.13 1 6.98 0 1 8
#> 9 36 0.315 10 1.98 1 0.315 0 0 9
#> 10 36 0.983 10 1.80 1 0.983 0 0 9
#> # ℹ 6,392 more rows
#> # ℹ 1 more variable: unemploy_rate <dbl>
We can then filter the data on the number of observations and combine this with the previous step, sampling the filtered data with sample_n_keys().
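The combination described above can be sketched as follows. This is a minimal sketch, assuming dplyr is loaded for filter(); the cutoff of more than 5 observations, the seed, and the name wages_sample are illustrative choices rather than anything fixed by the text.

```r
library(brolgar)
library(dplyr)
library(ggplot2)

set.seed(2019)  # arbitrary seed, only so the random sample is repeatable

wages_sample <- wages %>%
  add_n_obs() %>%               # adds n_obs: observations per key
  filter(n_obs > 5) %>%         # illustrative cutoff (an assumption)
  sample_n_keys(size = 20)      # then sample 20 of the remaining keys

# Plot the sampled individuals as one line per id
ggplot(wages_sample,
       aes(x = xp,
           y = ln_wages,
           group = id)) +
  geom_line()
```

Filtering before sampling means all 20 sampled individuals meet the observation threshold, so no sampled key is lost to the filter.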
facet_strata
brolgar provides some clever facets to help make it easier to explore your data. facet_strata() splits the data into 12 groups by default:
set.seed(2019-07-23-1936)
library(ggplot2)
ggplot(wages,
       aes(x = xp,
           y = ln_wages,
           group = id)) +
  geom_line() +
  facet_strata()
You can control the number with n_strata: