current_time <- function() {
  print(paste("CURRENT TIME:", Sys.time()))
}
A common use of the AWS Lambda service is to set a function to run on a recurring schedule, e.g. to collect logs, move data, or perform some ETL process. In this post, we’ll see how we can set up an AWS Lambda function, running R, on a schedule.
A lambda runtime function
We start with a simple function that does not require any input and does not return anything. If this example lambda is to run on a schedule, we don’t want to worry about any input arguments. Also, we want this lambda function to simply have a side effect, like printing something to the logs, without returning any data or writing to a database. This will help us greatly with the setup, in that we’ll be able to deploy and schedule the lambda with minimal involvement from other AWS services.
With this in mind, we have the following function that simply prints the system time. Printing the current time makes sense because we can easily check that the lambda runs on the correct schedule from the logs.
Build, test, and deploy
Then, we follow the procedure described in the Tidy Tuesday dataset Lambda post. We write the code to a file that we’ll use to build the lambda docker image:
<- "
r_code current_time <- function() {
print(paste('CURRENT TIME:', Sys.time()))
}
lambdr::start_lambda()
"
<- tempfile(pattern = "current_time_lambda_", fileext = ".R")
tmpfile write(x = r_code, file = tmpfile)
And then build the docker image. Note that we don’t have any dependencies other than base R.
r2lambda::build_lambda(
  tag = "current_time",
  runtime_function = "current_time",
  runtime_path = tmpfile,
  dependencies = NULL
)
We test the lambda docker container locally before deploying it. The console output should include the log messages and the standard output string showing the current time.
r2lambda::test_lambda(tag = "current_time", payload = list())
Then, we deploy the lambda to AWS, leaving the lambda environment to its defaults, as 3 seconds should be enough to get and print the current time.
r2lambda::deploy_lambda(tag = "current_time")
Finally, to make sure everything went well, we invoke the cloud instance of our function. Be sure to include the logs, as this particular function does not return anything.
r2lambda::invoke_lambda(
  function_name = "current_time",
  invocation_type = "RequestResponse",
  payload = list(),
  include_logs = TRUE
)
Schedule to run every minute
To make a lambda function run on a recurring schedule, we need to update an already deployed function. This involves three steps and two AWS services, Lambda for serverless computing and EventBridge for serverless event routing:
- creating a scheduled event rule (EventBridge, paws::eventbridge)
- granting the rule permission to invoke the lambda function (Lambda, paws::lambda)
- adding our target lambda function as a target of the event rule (EventBridge, paws::eventbridge)
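For reference, the three steps roughly map onto the following paws calls. This is a non-runnable sketch, not the exact implementation inside r2lambda: the rule name, statement id, and the lambda function ARN are placeholders, and real calls require valid AWS credentials.

```r
# Sketch only: placeholder names/ARNs; requires AWS credentials to run.
events <- paws::eventbridge()
lambda <- paws::lambda()

# 1. create a rule that fires every minute
rule <- events$put_rule(
  Name = "current_time_rule",
  ScheduleExpression = "rate(1 minute)"
)

# 2. allow EventBridge to invoke the lambda function
lambda$add_permission(
  FunctionName = "current_time",
  StatementId = "current_time_rule_permission",
  Action = "lambda:InvokeFunction",
  Principal = "events.amazonaws.com",
  SourceArn = rule$RuleArn
)

# 3. attach the lambda function as the rule's target
events$put_targets(
  Rule = "current_time_rule",
  Targets = list(list(Id = "current_time", Arn = "<lambda-function-arn>"))
)
```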
Detailed instructions are available in the AWS documentation. The function schedule_lambda abstracts these three steps in one go. To set a Lambda on a schedule, we need the name of the function we wish to update, and the rate at which we want EventBridge to invoke it. Two expression formats for setting the rate are supported, cron and rate. For example, to schedule a lambda to run every Sunday at midnight, we could use execution_rate = "cron(0 0 ? * SUN *)" (note that AWS cron expressions have six fields). Alternatively, to schedule a lambda to run every 15 minutes, we might use execution_rate = "rate(15 minutes)". The details are in this AWS article.
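As a quick illustration, the two formats are easy to tell apart by their prefix. The helper below is my own, not part of r2lambda:

```r
# Hypothetical helper: classify an EventBridge schedule expression by prefix.
schedule_type <- function(expr) {
  if (grepl("^rate\\(", expr)) return("rate")
  if (grepl("^cron\\(", expr)) return("cron")
  stop("Unrecognized schedule expression: ", expr)
}

schedule_type("rate(15 minutes)")    # "rate"
schedule_type("cron(0 0 ? * SUN *)") # "cron"
```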
r2lambda::schedule_lambda(
  lambda_function = "current_time",
  execution_rate = "rate(1 minute)"
)
Checking the AWS logs
To see if our function runs every minute, we can take a look at the AWS logs. If the function was writing to a database, or dropping files in an S3 bucket, we could also check the contents of those resources for the effects of the scheduled lambda function. But as our example function only prints the current time, the only way to know that it indeed runs every minute is to check the logs.
To do this, we’ll use paws and r2lambda::aws_connect to create a CloudWatch Logs service client locally, and fetch the recent logs to look for traces of our lambda function.
In the first step, we connect to cloudwatchlogs and fetch the names of the log groups. Inspect the logGroups object below to find the name corresponding to the lambda function whose logs we want to fetch.
logs_service <- r2lambda::aws_connect(service = "cloudwatchlogs")
logs <- logs_service$describe_log_groups()
(logGroups <- sapply(logs$logGroups, "[[", 1))
Then, we can grab only the data for our scheduled lambda function:
current_time_lambda_logs <- logs_service$filter_log_events(
  logGroupName = "/aws/lambda/current_time"
)
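As an aside, the underlying FilterLogEvents API can also filter server-side via its filter pattern syntax, which avoids pulling every event. Something like the following should work, though it is untested here (note the escaped inner quotes, required for a pattern term containing a space):

```r
# Server-side filtering: only events containing the phrase are returned.
current_time_lambda_logs <- logs_service$filter_log_events(
  logGroupName = "/aws/lambda/current_time",
  filterPattern = "\"CURRENT TIME\""
)
```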
And pull only the message printed by our R function wrapped in the lambda:
messages <- sapply(current_time_lambda_logs$events, "[[", "message")
current_time_messages <- messages[grepl("CURRENT TIME", messages)]
data.frame(Current_time_lambda = current_time_messages)
#> Current_time_lambda
#> 1 [1] "CURRENT TIME: 2023-02-26 22:53:55"\n
#> 2 [1] "CURRENT TIME: 2023-02-26 22:54:41"\n
#> 3 [1] "CURRENT TIME: 2023-02-26 22:55:41"\n
#> 4 [1] "CURRENT TIME: 2023-02-26 22:56:41"\n
#> 5 [1] "CURRENT TIME: 2023-02-26 22:57:41"\n
Evidently, the Lambda function printed the system time every minute, as we intended!
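We can also check the spacing programmatically by parsing the timestamps out of the captured messages. A small sketch, using a subset of the sample output above as hard-coded strings:

```r
# Parse the timestamps out of the log messages and check their spacing.
msgs <- c(
  '[1] "CURRENT TIME: 2023-02-26 22:54:41"\n',
  '[1] "CURRENT TIME: 2023-02-26 22:55:41"\n',
  '[1] "CURRENT TIME: 2023-02-26 22:56:41"\n'
)
stamps <- regmatches(msgs, regexpr("\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}", msgs))
stamps <- as.POSIXct(stamps, tz = "UTC")
diff(stamps)  # differences of exactly 1 minute
```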
Clean up
We don’t want to leave this lambda firing every minute; even though it is trivial, it still uses resources and incurs some cost. So it’s wise to delete the event schedule rule and maybe even the lambda function itself.
To remove the event rule, we first need to remove the associated targets. In the code below, we connect to EventBridge, look up the names of all event rules, find the rule we wish to remove (in this case the most recent one, with index 1), and then first remove its target, followed by deleting the rule itself. (I’ll probably add a function to abstract this procedure in the {r2lambda} package.)
# connect to the EventBridge service
events_service <- r2lambda::aws_connect("eventbridge")

# find the names of all rules
schedule_rules <- events_service$list_rules()[[1]] %>% sapply("[[", 1)

# find the targets associated with the rule we want to remove
rule_to_remove <- schedule_rules[[1]]
target_arn_to_remove <- events_service$list_targets_by_rule(Rule = rule_to_remove)$Targets[[1]]$Id

# remove the target, then delete the rule
events_service$remove_targets(Rule = rule_to_remove, Ids = target_arn_to_remove)
events_service$delete_rule(Name = rule_to_remove)

# confirm the rule is gone
events_service$list_rules()[[1]] %>% sapply("[[", 1)
Finally, to remove the Lambda, we do something similar: look up the names of all deployed functions on our account, and then delete the one(s) we no longer need.
<- r2lambda::aws_connect("lambda")
lambda_service $list_functions()$Functions %>% sapply("[[","FunctionName")
lambda_service$delete_function(FunctionName = "current_time") lambda_service
Summary
In this post:
- we wrote a simple lambda runtime function,
- built a docker image locally,
- tested the lambda invocation,
- deployed it to AWS Lambda,
- updated it to run on a schedule,
- checked the AWS logs to confirm it executes at the correct times, and
- cleaned up our AWS environment.
I hope you found this tutorial useful, and that it will motivate you to try the {r2lambda} package. It is available on GitHub and can be installed with remotes::install_github. I am looking for feedback on whether or not the workflows from r2lambda are working for other people – not many have tried it so far. I am also interested in suggestions on how to improve the interface, what features to add, what additional documentation to include, and so on. Try it and share your experience!