How to dynamically aggregate any dataset in R with purrr and dplyr

A demo of an approach I use for dynamically aggregating tables (in Shiny)
Author

teo

Published

2025-02-06

This post continues what I started last time, when I described an approach for dynamic filtering of data frames in R. The motivation and approach are very similar. In short, we want a function that takes a list of instructions describing how to aggregate a data frame. This is desirable in non-interactive or automated workflows, where we usually don't know in advance what the user might request. A flexible interface, where one can send a JSON payload with aggregation instructions, can therefore be very helpful for various {shiny} and {plumber} tasks.

Dynamic aggregation

As with the dynamic filter, we want something of the form:

.aggregate(my_data, my_list_of_instructions)

For aggregation to work, we need to specify at least one grouping variable, and then the names and aggregation functions for the columns we would like to summarize. So the instructions list should have the form:

aggr_list <- list(
    groups = list(col = "group_var"),
    aggregates = list(
        list(col = "column 1", fun = "mean"),
        list(col = "column 2", fun = "sd"),
        ...
    )
)
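
Since the instructions are just nested lists, the same structure can arrive as JSON. As a minimal sketch (assuming {jsonlite} is available), a payload like the one a {plumber} endpoint might receive parses straight into this shape:

json <- '{
  "groups": {"col": "Species"},
  "aggregates": [
    {"col": "Sepal.Length", "fun": "mean"},
    {"col": "Sepal.Length", "fun": "sd"}
  ]
}'
# simplifyVector = FALSE preserves the nested-list structure
aggr_list <- jsonlite::fromJSON(json, simplifyVector = FALSE)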

To show an example, we could get the mean, median, and standard deviation of Sepal.Length by Species in the iris dataset:

d <- iris
l <- list(
  groups = list(col = "Species"),
  aggregates = list(
    list(col = "Sepal.Length", fun = "mean"),
    list(col = "Sepal.Length", fun = "median"),
    list(col = "Sepal.Length", fun = "sd")
  )
)
.aggregate(d, l)
Joining with `by = join_by(Species)`
Joining with `by = join_by(Species)`
     Species Sepal.Length_mean Sepal.Length_median Sepal.Length_sd
1     setosa              5.01                 5.0            0.35
2 versicolor              5.94                 5.9            0.52
3  virginica              6.59                 6.5            0.64

As we can tell, in the aggregates list we can specify the same column multiple times, and the resulting table will only contain the columns for which we've specified an aggregation.

A more complex example would be to group on more than one variable and summarize multiple columns. For example, group by cylinders, gears, and carburetors, and find the mean miles per gallon, the total horsepower, and the median weight of cars in the mtcars dataset:

d <- mtcars
l <- list(
  groups = list(col = c("cyl", "gear", "carb")),
  aggregates = list(
    list(col = "mpg", fun = "mean"),
    list(col = "hp", fun = "sum"),
    list(col = "wt", fun = "median")
  )
)
.aggregate(d, l)
Joining with `by = join_by(cyl, gear, carb)`
Joining with `by = join_by(cyl, gear, carb)`
   cyl gear carb mpg_mean hp_sum wt_median
1    6    4    4    19.75    466      3.16
2    4    4    1    29.10    290      2.07
3    6    3    1    19.75    215      3.34
4    8    3    2    17.15    650      3.48
5    8    3    4    12.62   1140      5.25
6    4    4    2    24.75    318      2.96
7    8    3    3    16.30    540      3.78
8    4    3    1    21.50     97      2.46
9    4    5    2    28.20    204      1.83
10   8    5    4    15.80    264      3.17
11   6    5    6    19.70    175      2.77
12   8    5    8    15.00    335      3.57

Neat. Again, the motivation here is not to replace the sweet {dplyr} syntax, but rather to come up with a mechanism for doing complex aggregation in one step, by creating an instructions list (or JSON) which, in a {shiny} or {plumber} context, would most likely be constructed programmatically.

How does it work?

Similar to our filtering function, the code has two sections. The first is input validation, again using {checkmate} because it's awesome. We make sure that the inputs are of the correct type, that they are named, and that the columns specified as grouping and aggregation variables are present in the dataset. Likewise, we ensure that the requested aggregation functions are supported.
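
For intuition, here is one of these checks in isolation (a standalone sketch, not part of the function itself):

# passes silently, because "Species" is a column of iris
checkmate::assert_subset("Species", choices = names(iris))
# a misspelled column would instead fail with an informative error:
# checkmate::assert_subset("Speciez", choices = names(iris))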

In the second section, we first convert the string passed to fun to a function, so it can be used downstream in dplyr::summarize. We do this with a helper function that can be called at the time summarize executes.
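
The idea is similar to base R's match.fun(), which looks up a function by its name, except the helper below uses switch() so that only a whitelisted set of functions can ever be called:

f <- match.fun("mean")  # base R string-to-function lookup
f(c(1, 2, 3))           # returns 2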

Then, we loop over the contents of the aggregates sub-list and apply each aggregation independently, resulting in a list of aggregated data frames. For example, if we specified the mean, median, and sd of Sepal.Length by Species, at this stage we would have a list of three data frames, each with the grouping column and one column holding the aggregate.
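
Each element of that list comes from a single one-column summarise(); for instance, the first element for the iris instructions above is essentially:

dplyr::summarise(
  iris,
  Sepal.Length_mean = mean(Sepal.Length),
  .by = "Species"
)
# a 3-row data frame: Species plus Sepal.Length_mean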

Finally, to collate the tables we use reduce with left_join, essentially joining the three data frames by species.
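
Spelled out for two of the three frames, the collation step looks like the sketch below. This is also where the "Joining with `by = join_by(Species)`" messages in the output come from: left_join() detects the shared grouping column automatically.

a <- dplyr::summarise(iris, Sepal.Length_mean = mean(Sepal.Length), .by = "Species")
b <- dplyr::summarise(iris, Sepal.Length_sd = stats::sd(Sepal.Length), .by = "Species")
purrr::reduce(list(a, b), dplyr::left_join)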

Note the nice .by argument of dplyr::summarize. We can pass it the string from the groups$col slot of our instructions list directly, i.e. we don't need dplyr::group_by and the NSE construct it would require (converting the string to a symbol and unquoting it with !!).
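
For comparison, the group_by() route with a string column name would need the symbol-and-unquote dance:

grp <- "Species"
iris |>
  dplyr::group_by(!!rlang::sym(grp)) |>  # string -> symbol, then unquote
  dplyr::summarise(Sepal.Length_mean = mean(Sepal.Length))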

#' Aggregate a dataset
#' @param .data the data frame
#' @param .arglist a list with grouping and aggregation instructions; see the examples
#' @examples
#' d <- iris
#' l <- list(
#'   groups = list(col = "Species"),
#'   aggregates = list(
#'     list(col = "Sepal.Length", fun = "mean"),
#'     list(col = "Sepal.Width", fun = "mean")
#'   )
#' )
#' .aggregate(d, l)
#' @return a data frame with the columns aggregated
#' @export
.aggregate <- function(.data, .arglist) {
  checkmate::assert_data_frame(.data)
  checkmate::assert_list(.arglist, types = "list")
  checkmate::assert_named(.arglist)
  checkmate::assert_subset(
    names(.arglist),
    choices = c("groups", "aggregates")
  )
  checkmate::assert_list(.arglist$groups, len = 1)
  checkmate::assert_subset(
    purrr::pluck(.arglist, "groups", "col"),
    choices = names(.data)
  )
  # collect every col and fun value across the aggregates sub-lists;
  # a single pluck() would miss them, as the sub-lists are unnamed
  checkmate::assert_subset(
    purrr::map_chr(.arglist$aggregates, "col"),
    choices = names(.data)
  )
  checkmate::assert_subset(
    purrr::map_chr(.arglist$aggregates, "fun"),
    choices = c(
      "mean", "median", "sum", "min", "max", "sd", "var", "count"
    )
  )

  .get_aggr_fun <- function(x) {
    switch(x,
      "mean" = mean,
      "median" = stats::median,
      "sum" = sum,
      "min" = min,
      "max" = max,
      "sd" = sd,
      "var" = var,
      "count" = length
    )
  }

  aggr_dataset <- purrr::map(
    .arglist$aggregates,
    .f = ~ dplyr::summarise(
      .data,
      !!paste(.x$col, .x$fun, sep = "_") :=
        .get_aggr_fun(.x$fun)(!!rlang::sym(.x$col)),
      .by = .arglist$groups$col
    )
  ) |>
    purrr::reduce(dplyr::left_join) |>
    # round all numeric columns to two decimals for display
    dplyr::mutate(dplyr::across(tidyselect::where(is.numeric), \(x) round(x, 2)))

  aggr_dataset
}

In a {shiny} application where we want to let the user aggregate any column(s) with a choice of aggregation functions, capturing all the cases manually would require quite a bit of if/else logic. With this alternative, we simply harvest the user's selections from the input and construct the instructions list before plugging it into .aggregate to obtain our results.
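
A hypothetical sketch of that harvesting step; the input IDs (input$group_col, input$agg_cols, input$agg_fun) are assumptions for illustration, e.g. coming from selectInput widgets:

# hypothetical Shiny server logic; the input IDs are illustrative
aggr_list <- list(
  groups = list(col = input$group_col),
  aggregates = purrr::map(
    input$agg_cols,
    \(column) list(col = column, fun = input$agg_fun)
  )
)
result <- .aggregate(my_data, aggr_list)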

Summary

In this post, we explored a method for dynamically aggregating datasets in R using purrr and dplyr. We created a function .aggregate that takes a data frame and a list of instructions specifying the grouping variables and the aggregation functions to apply. This approach is particularly useful in non-interactive or automated workflows, such as in {shiny} applications or {plumber} APIs, where the user might specify different aggregation requirements.

We demonstrated the usage of the .aggregate function with examples using the iris and mtcars datasets, showing how to group by one or more variables and apply various aggregation functions. The function ensures input validation using {checkmate} and dynamically applies the specified aggregation functions, collating the results into a single data frame.

This method provides a flexible and efficient way to perform complex aggregations programmatically, reducing the need for extensive if/else logic and making it easier to handle user-defined aggregation instructions.