---
title: "Finding Features in Data"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Finding Features in Data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
options(rmarkdown.html_vignette.check_title = FALSE)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = FALSE,
  message = FALSE
)
```

When you are presented with longitudinal data, it is useful to summarise the data into a format where you have one row per key. That means one row per unique identifier of the data. If you aren't sure what this means, see the vignette ["Longitudinal Data Structures"](https://brolgar.njtierney.com/articles/longitudinal-data-structures.html).

For example, say you wanted to find features in the `wages` data, which looks like this:

```{r print-wages}
library(brolgar)
wages
```

You can return a dataset that has one row per key, with, say, the minimum value of `ln_wages` for each key:

```{r wages-summary, echo = FALSE}
wages_min <- wages %>%
  features(ln_wages, list(min = min))

wages_min
```

This then allows us to summarise these kinds of data, for example to find the distribution of minimum values:

```{r gg-min-wages}
library(ggplot2)
ggplot(wages_min, aes(x = min)) +
  geom_density()
```

We call these summaries `features` of the data. This vignette discusses how to calculate them.

# Calculating features

We can calculate `features` of longitudinal data using the `features` function (from [`fabletools`](https://fabletools.tidyverts.org/), made available in `brolgar`).
`features` works by specifying the data, the variable to summarise, and the feature to calculate:

```r
features(<data>, <variable>, <feature>)
```

or with the pipe:

```r
<data> %>% features(<variable>, <feature>)
```

As an example, we can calculate a five number summary (minimum, 25th percentile, median, 75th percentile, and maximum) of the data using `feat_five_num`, like so:

```{r features-fivenum}
wages_five <- wages %>%
  features(ln_wages, feat_five_num)

wages_five
```

Here we take the `wages` data, pipe it to `features`, and tell it to summarise the `ln_wages` variable using `feat_five_num`.

`brolgar` provides several handy functions for calculating features of the data. These all start with `feat_`.

You can, for example, find the keys whose values only increase or only decrease with `feat_monotonic`:

```{r features-monotonic}
wages_mono <- wages %>%
  features(ln_wages, feat_monotonic)

wages_mono
```

These could then be used to identify individuals whose values only increase, like so:

```{r wages-mono-filter}
library(dplyr)
wages_mono %>%
  filter(increase)
```

They could then be joined back to the data:

```{r wages-mono-join}
wages_mono_join <- wages_mono %>%
  filter(increase) %>%
  left_join(wages, by = "id")

wages_mono_join
```

And these could be plotted:

```{r gg-wages-mono}
ggplot(wages_mono_join,
       aes(x = xp,
           y = ln_wages,
           group = id)) +
  geom_line()
```

To get a sense of the data and where it came from, we could create a plot with `gghighlight` to highlight those that only increase, by using `gghighlight(increase)`. Since `increase` is a logical, this tells `gghighlight` to highlight those that are `TRUE`.
```{r gg-high-mono}
library(gghighlight)
wages_mono %>%
  left_join(wages, by = "id") %>%
  ggplot(aes(x = xp,
             y = ln_wages,
             group = id)) +
  geom_line() +
  gghighlight(increase)
```

To explore the available features, see the function [reference](https://brolgar.njtierney.com/reference/index.html).

# Creating your own Features

To create your own features or summaries to pass to `features`, you provide a named list of functions. For example:

```{r create-three}
library(brolgar)
feat_three <- list(min = min,
                   med = median,
                   max = max)

feat_three
```

These are then passed to `features` like so:

```{r demo-feat-three}
wages %>%
  features(ln_wages, feat_three)

heights %>%
  features(height_cm, feat_three)
```

Inside `brolgar`, the features are created with the following syntax:

```{r demo-feat-five-num, eval = FALSE}
feat_five_num <- function(x, ...) {
  list(
    min = b_min(x, ...),
    q25 = b_q25(x, ...),
    med = b_median(x, ...),
    q75 = b_q75(x, ...),
    max = b_max(x, ...)
  )
}
```

Here the `b_` functions have a default of `na.rm = TRUE`, and the quantile functions also use `type = 8` and `names = FALSE`.

# Accessing sets of features

If you want to run many or all features from a package on your data, you can collect them all with `feature_set`. For example:

```{r show-features-set}
library(fabletools)
feat_brolgar <- feature_set(pkgs = "brolgar")
length(feat_brolgar)
```

You could then run these like so:

```{r run-features-set}
wages %>%
  features(ln_wages, feat_brolgar)
```

For more information, see `?fabletools::feature_set`.

# Registering a feature in a package

If you create features in your own package and want to make them accessible with `feature_set`, you need to register them. Functions can be registered via `fabletools::register_feature()`. To register features in a package, I create a file called `zzz.R`, and use the `.onLoad(...)` function to set this up on loading the package:

```{r show-register-feature, eval = FALSE}
.onLoad <- function(...) {
  fabletools::register_feature(feat_three_num, c("summary"))
  # ... and as many as you want here!
}
```
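
Before registering a feature, it can help to check on a plain vector that the function returns named values. The following is a minimal sketch in base R only. The name `feat_spread` is hypothetical and not part of `brolgar` or `fabletools`, and returning a named numeric vector (rather than a list) is an assumption about what suits your use.

```r
# Hypothetical feature function, for illustration only.
# It mirrors the conventions described above: missing values are
# dropped, and quantiles are computed with type = 8.
feat_spread <- function(x, ...) {
  c(
    # Distance between the largest and smallest observed value
    range = max(x, na.rm = TRUE) - min(x, na.rm = TRUE),
    # Interquartile range, using the same quantile type as the b_ helpers
    iqr = stats::IQR(x, na.rm = TRUE, type = 8)
  )
}

# Quick check on a small vector that contains a missing value
feat_spread(c(1, 2, 3, 4, NA))
```

If the output looks right, a function like this could be passed to `features` in the same way as the `feat_` functions above, or registered inside `.onLoad()` with `fabletools::register_feature(feat_spread, c("summary"))`.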