.aggregate(my_data, my_list_of_instructions)
This post continues what I started last time, when I described an approach for dynamic filtering of data frames in R. The motivation and approach are very similar. In short, we want a function that takes a list with instructions about how to aggregate a data frame. This is desirable in non-interactive or automated workflows, since in those cases we usually don’t know in advance what the user might request. Creating a flexible interface, where one can send a JSON with aggregation instructions, can therefore be very helpful for various {shiny} and {plumber} tasks.
Dynamic aggregation
As with the dynamic filter, we want something of the form:
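.aggregate(my_data, my_list_of_instructions)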
For aggregation to work, we need to specify at least one grouping variable, and then the names and aggregation functions for the columns we would like to summarize. So the instructions list should have the form:
aggr_list <- list(
  groups = list(col = "group_var"),
  aggregates = list(
    list(col = "column 1", fun = "mean"),
    list(col = "column 2", fun = "sd"),
    ...
  )
)
To show an example, we could get the mean, median, and standard deviation of Sepal.Length by species from the iris dataset:
d <- iris

l <- list(
  groups = list(col = "Species"),
  aggregates = list(
    list(col = "Sepal.Length", fun = "mean"),
    list(col = "Sepal.Length", fun = "median"),
    list(col = "Sepal.Length", fun = "sd")
  )
)

.aggregate(d, l)
Joining with `by = join_by(Species)`
Joining with `by = join_by(Species)`
Species Sepal.Length_mean Sepal.Length_median Sepal.Length_sd
1 setosa 5.01 5.0 0.35
2 versicolor 5.94 5.9 0.52
3 virginica 6.59 6.5 0.64
As we can see, in the aggregates list we can specify the same column multiple times, and the resulting table will only contain the columns for which we’ve specified an aggregation.
A more complex example would be to group on more than one variable and summarize multiple columns. For example, group by cylinder, gear and carburetor and find the mean miles per gallon, the total horsepower and the median weight of cars in the mtcars dataset:
d <- mtcars

l <- list(
  groups = list(col = c("cyl", "gear", "carb")),
  aggregates = list(
    list(col = "mpg", fun = "mean"),
    list(col = "hp", fun = "sum"),
    list(col = "wt", fun = "median")
  )
)

.aggregate(d, l)
Joining with `by = join_by(cyl, gear, carb)`
Joining with `by = join_by(cyl, gear, carb)`
cyl gear carb mpg_mean hp_sum wt_median
1 6 4 4 19.75 466 3.16
2 4 4 1 29.10 290 2.07
3 6 3 1 19.75 215 3.34
4 8 3 2 17.15 650 3.48
5 8 3 4 12.62 1140 5.25
6 4 4 2 24.75 318 2.96
7 8 3 3 16.30 540 3.78
8 4 3 1 21.50 97 2.46
9 4 5 2 28.20 204 1.83
10 8 5 4 15.80 264 3.17
11 6 5 6 19.70 175 2.77
12 8 5 8 15.00 335 3.57
Neat. Again, the motivation here is not to replace the sweet {dplyr} syntax, but rather to come up with a mechanism for doing complex aggregation in one step, by creating an instructions list (or JSON) which, in a {shiny} or {plumber} context, would most likely be constructed programmatically.
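For instance, here is a minimal sketch of that workflow, assuming the instructions arrive as a JSON string shaped like the list above; jsonlite::fromJSON() with simplifyVector = FALSE yields the nested-list structure .aggregate expects:

json <- '{
  "groups": {"col": "Species"},
  "aggregates": [
    {"col": "Sepal.Length", "fun": "mean"},
    {"col": "Sepal.Length", "fun": "sd"}
  ]
}'

# simplifyVector = FALSE stops jsonlite from collapsing the array of
# aggregates into a data frame, keeping it as a list of lists
l <- jsonlite::fromJSON(json, simplifyVector = FALSE)
.aggregate(iris, l)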
How does it work?
Similar to our filtering function, we have two sections. First is input validation, again using {checkmate}
because it’s awesome. We can make sure that the inputs are of correct type, that they are named, and that the columns specified as grouping and aggregation variables are present in the dataset. Likewise, we ensure that the requested aggregation functions are supported.
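For instance, a request for a column that isn’t in the data is rejected up front (a small sketch, using the function defined further down):

bad <- list(
  groups = list(col = "Species"),
  aggregates = list(list(col = "Petal.Size", fun = "mean"))
)

# fails the assert_subset() check on column names, since iris has no Petal.Size
.aggregate(iris, bad)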
In the second section, we first convert the string passed to fun
to a function, so it can be used downstream in dplyr::summarize
. We do this with a helper function that can be called at the time summarize
executes.
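In isolation, the idea is just a lookup that maps the string to the corresponding function object, roughly:

f <- switch("median", "mean" = mean, "median" = stats::median)
f(c(1, 3, 100))  # returns 3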
Then, we loop over the contents of the aggregates sub-list and apply each aggregation independently, resulting in a list of aggregated datasets. For example, if we specified the mean, median, and sd of Sepal.Length by Species, at this stage we would have a list of three data frames, each with the grouping column and one column representing the aggregate.
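To make that concrete, the intermediate step for the iris example above boils down to something like this, written out by hand rather than via the loop (step is just a throwaway name for this illustration):

step <- list(
  dplyr::summarise(iris, Sepal.Length_mean   = mean(Sepal.Length),          .by = "Species"),
  dplyr::summarise(iris, Sepal.Length_median = stats::median(Sepal.Length), .by = "Species"),
  dplyr::summarise(iris, Sepal.Length_sd     = stats::sd(Sepal.Length),     .by = "Species")
)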
Finally, to collate the tables we use reduce
with left_join
, essentially joining the three data frames by species.
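Continuing the hand-written sketch, the collation step is then just:

# two successive left joins on Species collapse the three tables into one
purrr::reduce(step, dplyr::left_join)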
Note the nice .by argument of dplyr::summarize. In this case we just pass the string from the groups$col slot of our instructions list, i.e. we don’t need dplyr::group_by and the NSE construct it would require (converting the string to a symbol and !!-ing it).
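As an illustration, the two routes compare roughly like this (grp_col stands in for the groups slot of an instructions list):

grp_col <- list(col = "Species")

# with .by we can pass the stored string straight through
dplyr::summarise(iris, Sepal.Length_mean = mean(Sepal.Length), .by = grp_col$col)

# the group_by() route needs the string converted to a symbol and unquoted
iris |>
  dplyr::group_by(!!rlang::sym(grp_col$col)) |>
  dplyr::summarise(Sepal.Length_mean = mean(Sepal.Length))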
#' Aggregate a dataset
#' @param .data the data frame
#' @param .arglist a list of instructions with the grouping column(s) and the aggregations to apply
#' @examples
#' d <- iris
#' l <- list(
#' groups = list(col = "Species"),
#' aggregates = list(
#' list(col = "Sepal.Length", fun = "mean"),
#' list(col = "Sepal.Width", fun = "mean")
#' )
#' )
#' .aggregate(d, l)
#' @return a data frame with the columns aggregated
#' @export
.aggregate <- function(.data, .arglist) {
  checkmate::assert_data_frame(.data)
  checkmate::assert_list(.arglist, types = "list")
  checkmate::assert_named(.arglist)
  checkmate::assert_subset(
    names(.arglist),
    choices = c("groups", "aggregates")
  )
  checkmate::assert_list(.arglist$groups, len = 1)
  checkmate::assert_subset(
    purrr::pluck(.arglist, "groups", "col"),
    choices = names(.data)
  )
  # every requested column must exist and every requested function must be supported
  checkmate::assert_subset(
    purrr::map_chr(.arglist$aggregates, "col"),
    choices = names(.data)
  )
  checkmate::assert_subset(
    purrr::map_chr(.arglist$aggregates, "fun"),
    choices = c(
      "mean", "median", "sum", "min", "max", "sd", "var", "count"
    )
  )

  # translate the requested function name into the function itself
  .get_aggr_fun <- function(x) {
    switch(x,
      "mean" = mean,
      "median" = stats::median,
      "sum" = sum,
      "min" = min,
      "max" = max,
      "sd" = stats::sd,
      "var" = stats::var,
      "count" = length
    )
  }

  # one summarised data frame per aggregation instruction,
  # collated with left joins on the grouping column(s)
  aggr_dataset <- purrr::map(
    .arglist$aggregates,
    .f = ~ dplyr::summarise(
      .data,
      !!paste(.x$col, .x$fun, sep = "_") :=
        .get_aggr_fun(.x$fun)(!!rlang::sym(.x$col)),
      .by = .arglist$groups$col
    )
  ) |>
    purrr::reduce(dplyr::left_join) |>
    dplyr::mutate_if(is.numeric, round, 2)

  aggr_dataset
}
In a {shiny} application where we want to let the user select any column(s) to aggregate, along with a choice of aggregating function, it would take quite a bit of if/else logic to capture all the cases manually. With this alternative, we simply harvest the user’s selections from the input and construct the list of instructions before plugging it into .aggregate to obtain our results.
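As a rough sketch, assuming a server function with (made-up) inputs group_col, agg_cols and agg_fun, plus a data frame my_data:

# hypothetical input IDs; the point is only that the instructions list is
# assembled from whatever the user happened to select
instructions <- reactive({
  list(
    groups = list(col = input$group_col),
    aggregates = purrr::map(
      input$agg_cols,
      ~ list(col = .x, fun = input$agg_fun)
    )
  )
})

output$aggr_table <- renderTable(.aggregate(my_data, instructions()))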
Summary
In this post, we explored a method for dynamically aggregating datasets in R using {purrr} and {dplyr}
. We created a function .aggregate
that takes a data frame and a list of instructions specifying the grouping variables and the aggregation functions to apply. This approach is particularly useful in non-interactive or automated workflows, such as in {shiny}
applications or {plumber}
APIs, where the user might specify different aggregation requirements.
We demonstrated the usage of the .aggregate
function with examples using the iris
and mtcars
datasets, showing how to group by one or more variables and apply various aggregation functions. The function validates its inputs with {checkmate}
and dynamically applies the specified aggregation functions, collating the results into a single data frame.
This method provides a flexible and efficient way to perform complex aggregations programmatically, reducing the need for extensive if/else
logic and making it easier to handle user-defined aggregation instructions.