An R AWS Lambda function to download Tidytuesday datasets
Use {r2lambda} to download Tidytuesday dataset
In this exercise, we’ll create an AWS Lambda function that downloads the tidytuesday data set for the most recent Tuesday (or most recent Tuesday from a date of interest).
Required packages
library(r2lambda)
library(jsonlite)
library(magrittr)
Runtime function
The first step is to write the runtime function. This is the function that will be
executed when we invoke the Lambda function after it has been deployed. To download
the Tidytuesday data set, we will use the {tidytuesdayR} package. In the runtime
script, we define a function called tidytuesday_lambda that takes one optional
argument, date. If date is omitted, the function returns the data set(s) for the most
recent Tuesday; otherwise, it looks up the most recent Tuesday from a date of interest
and returns the corresponding data set(s).
library(tidytuesdayR)
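tidytuesday_lambda <- function(date = NULL) {
  # Default to today's date when no date is supplied
  if (is.null(date))
    date <- Sys.Date()
  # Find the most recent Tuesday relative to `date` and load that week's data
  most_recent_tuesday <- tidytuesdayR::last_tuesday(date = date)
  tt_data <- tidytuesdayR::tt_load(x = most_recent_tuesday)
  # Return the data sets as a plain list
  data_names <- names(tt_data)
  data_list <- lapply(data_names, function(x) tt_data[[x]])
  return(data_list)
}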
tidytuesday_lambda("2022-02-02")
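To make the "most recent Tuesday" lookup concrete, here is a minimal base-R sketch of the date arithmetic. The helper name most_recent_tuesday_sketch is hypothetical and is not part of {tidytuesdayR}. tidytuesdayR::last_tuesday may treat edge cases such as time zones differently, so this is only an illustration of the idea.

```r
# Hypothetical helper, not part of {tidytuesdayR}: return the most recent
# Tuesday on or before a given date using only base R.
most_recent_tuesday_sketch <- function(date = Sys.Date()) {
  date <- as.Date(date)
  # POSIXlt weekdays are numbered 0 (Sunday) to 6 (Saturday); Tuesday is 2
  weekday <- as.POSIXlt(date)$wday
  # Days elapsed since the last Tuesday (0 when `date` is itself a Tuesday)
  days_since_tuesday <- (weekday - 2) %% 7
  date - days_since_tuesday
}

most_recent_tuesday_sketch("2022-02-02")  # a Wednesday, so this gives 2022-02-01
```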
R script to build the lambda
To build the lambda image, we need an R script that sources any required code,
loads any needed libraries, defines a runtime function, and ends with a call to
lambdr::start_lambda(). The runtime function does not have to be defined in this
file. We could, for example, source another script, or load a package and set a
loaded function as the runtime function in the subsequent call to r2lambda::build_lambda
(see below). We save this script to a file and record the path:
r_code <- "
library(tidytuesdayR)
tidytuesday_lambda <- function(date = NULL) {
  if (is.null(date))
    date <- Sys.Date()
  most_recent_tuesday <- tidytuesdayR::last_tuesday(date = date)
  tt_data <- tidytuesdayR::tt_load(x = most_recent_tuesday)
  data_names <- names(tt_data)
  data_list <- lapply(data_names, function(x) tt_data[[x]])
  return(data_list)
}
lambdr::start_lambda()
"
tmpfile <- tempfile(pattern = "ttlambda_", fileext = ".R")
write(x = r_code, file = tmpfile)
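Before building, it can be useful to confirm that the saved script is syntactically valid and really defines the runtime function. The following is a minimal sketch that only parses the file and does not run it, so it works even where {tidytuesdayR} and {lambdr} are not installed. The shortened script body and the variable names are illustrative assumptions.

```r
# A shortened stand-in for the runtime script (illustrative only)
script <- '
library(tidytuesdayR)
tidytuesday_lambda <- function(date = NULL) {
  date
}
lambdr::start_lambda()
'
path <- tempfile(pattern = "ttlambda_", fileext = ".R")
writeLines(script, path)

# parse() raises an error on syntax problems but does not evaluate the code
exprs <- as.list(parse(file = path))

# Collect the names assigned at the top level with `<-`
is_assignment <- vapply(
  exprs,
  function(e) is.call(e) && identical(e[[1]], as.name("<-")),
  logical(1)
)
defined <- vapply(exprs[is_assignment], function(e) as.character(e[[2]]), character(1))

# The last top-level expression should be the call that starts the runtime
last_call <- deparse(exprs[[length(exprs)]])

defined
last_call
```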
Build, test, and deploy the lambda function
1. Build
- We set the runtime_function argument to the name of the function we wish the docker container to run when invoked. In this case, this is tidytuesday_lambda. This adds a CMD instruction to the Dockerfile.
- We set the runtime_path argument to the path where we stored the script defining our runtime function.
- We set the dependencies argument to c("tidytuesdayR") because we need to have the tidytuesdayR package installed within the docker container if we are to download the dataset. This step adds a RUN instruction to the Dockerfile that calls install.packages to install {tidytuesdayR} from CRAN.
- Finally, the tag argument sets the name of our Lambda function, which we’ll use later to test and invoke the function. The tag argument also becomes the name of the folder that {r2lambda} will create to build the image. This folder will have two files, Dockerfile and runtime.R. runtime.R is our script from runtime_path, renamed before it is copied into the docker image with a COPY instruction.
runtime_function <- "tidytuesday_lambda"
runtime_path <- tmpfile
dependencies <- "tidytuesdayR"
r2lambda::build_lambda(
  tag = "tidytuesday3",
  runtime_function = runtime_function,
  runtime_path = runtime_path,
  dependencies = dependencies
)
2. Test
To make sure our Lambda docker container works as intended, we start it locally
and invoke it to test the response:
response <- r2lambda::test_lambda(tag = "tidytuesday3", payload = list(date = Sys.Date()))
The response is a list of three elements:
- status, which should be 0 if the test worked,
- stdout, the standard output stream of the invocation, and
- stderr, the standard error stream of the invocation.
stdout and stderr are raw vectors that we need to parse, for example:
rawToChar(response$stdout)
If the stdout slot of the response returns the correct output of our function,
we are good to deploy to AWS.
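The checks described in this step can be sketched without a running container. The example below uses a mock response with the same three elements. Its contents are made up for illustration, and the helper read_test_output is hypothetical rather than part of {r2lambda}.

```r
# Mock of the three-element response described above; the contents are
# invented for illustration and do not come from a real invocation.
mock_response <- list(
  status = 0L,
  stdout = charToRaw('[{"week": 5, "rows": 10}]'),
  stderr = raw(0)
)

# Hypothetical helper: stop on a non-zero status, otherwise return stdout as text
read_test_output <- function(response) {
  if (response$status != 0) {
    stop("Local test failed: ", rawToChar(response$stderr))
  }
  rawToChar(response$stdout)
}

output_text <- read_test_output(mock_response)
output_text
```

Checking status before parsing stdout keeps a failed invocation from being mistaken for an empty result.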
3. Deploy
The deployment step is simple, in that all we need to do is specify the name (tag) of
the Lambda function we wish to push to AWS ECR. The deploy_lambda function also
accepts ..., which are named arguments ultimately passed on to
paws.compute:::lambda_create_function. This is the function that calls the Lambda
API. To see all available arguments, run ?paws.compute:::lambda_create_function.
The most important arguments are probably Timeout and MemorySize, which set
the time our function will be allowed to run and the amount of memory it will have
available. In many cases it will make sense to increase the defaults of 3 seconds
and 128 MB.
r2lambda::deploy_lambda(tag = "tidytuesday3", Timeout = 30)
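Since Timeout and MemorySize are passed straight through to the Lambda API, it may help to validate them locally first. The sketch below assumes the AWS Lambda limits of 1 to 900 seconds for the timeout and 128 to 10240 MB for memory. The helper lambda_settings is hypothetical and not part of {r2lambda}.

```r
# Hypothetical helper: validate settings against the AWS Lambda limits
# (timeout 1-900 seconds, memory 128-10240 MB) before deploying.
lambda_settings <- function(timeout = 3, memory_size = 128) {
  stopifnot(
    timeout >= 1, timeout <= 900,
    memory_size >= 128, memory_size <= 10240
  )
  list(Timeout = as.integer(timeout), MemorySize = as.integer(memory_size))
}

settings <- lambda_settings(timeout = 30, memory_size = 512)
str(settings)

# The settings could then be passed through `...` (this needs AWS credentials,
# so it is left commented out here):
# do.call(r2lambda::deploy_lambda, c(list(tag = "tidytuesday3"), settings))
```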
4. Invoke
If all goes well, our function should now be available on the cloud awaiting requests.
We can invoke it from R using invoke_lambda. The arguments are:
- function_name – the name of the function
- invocation_type – typically RequestResponse
- include_logs – whether to print the logs of the run on the console
- payload – a named list with arguments sent to the runtime_function. In this case, the runtime function, tidytuesday_lambda, has a single argument, date, so the corresponding list is list(date = Sys.Date()). As our function can be called without any argument, we can also send an empty list as the payload.
response <- r2lambda::invoke_lambda(
  function_name = "tidytuesday3",
  invocation_type = "RequestResponse",
  payload = list(),
  include_logs = TRUE
)
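As a purely local analogy for how a payload maps onto the runtime function's arguments, the sketch below calls a stand-in function with do.call. It does not reproduce what {lambdr} or the Lambda service do internally, and runtime_stub is a hypothetical placeholder for the deployed function.

```r
# Stand-in for the deployed runtime function, with the same signature
runtime_stub <- function(date = NULL) {
  if (is.null(date)) "called without a date" else paste("called with date", date)
}

# A payload with one named element supplies the `date` argument
with_date <- do.call(runtime_stub, list(date = "2022-02-02"))

# An empty payload leaves every argument at its default
without_date <- do.call(runtime_stub, list())

with_date
without_date
```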
Just like in the local test, the response payload comes as a raw vector that needs to be parsed into a data.frame:
tidytuesday_dataset <- response$Payload %>%
  rawToChar() %>%
  jsonlite::fromJSON(simplifyDataFrame = TRUE)
tidytuesday_dataset[[1]][1:5, 1:5]
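The parsing step can be tried offline with a simulated payload. The sketch below assumes the payload is a JSON array with one element per data set, which matches the unnamed list returned by the runtime function. The data and variable names are invented for illustration.

```r
library(jsonlite)

# Simulated payload: an unnamed list of data frames, serialised to JSON and
# converted to a raw vector. This mimics the assumed shape of the response.
fake_datasets <- list(
  data.frame(id = 1:2, name = c("a", "b")),
  data.frame(x = c(0.5, 1.5, 2.5))
)
fake_payload <- charToRaw(as.character(jsonlite::toJSON(fake_datasets)))

# Same parsing steps as above: raw vector -> JSON string -> R objects
parsed <- jsonlite::fromJSON(rawToChar(fake_payload), simplifyDataFrame = TRUE)

# Each element should come back as a data frame
vapply(parsed, is.data.frame, logical(1))
```

Because the list is unnamed, the data set names are not carried in the payload, so elements have to be accessed by position, as in tidytuesday_dataset[[1]].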
Summary
In this post, we went over some details about:
- how to prepare an R script before deploying it as a Lambda function,
- what the roles of several of the key arguments are,
- how to request a longer timeout or more memory for a Lambda function, and
- how to parse the response payload returned by the Lambda function.
Stay tuned for a follow-up post where we set this Lambda function to run on a weekly schedule!