Rstudio in the cloud for those of us with old laptops part 2: automating with terraform
Setting the scene for automation
In the previous post I wrote about how to spin up an EC2 instance with Rstudio server, so that some of the more computationally heavy R processing can be moved onto AWS infrastructure.
Let’s say you’ve done this. You kept the EC2 instance for some time and then decided to terminate it since, after all, even a stopped instance still incurs some costs (its storage is still billed). Then, some time after that, you need to spin up a new instance, and have to go through all of the manual clicking through the AWS console described before.
Enter terraform. It is a tool to write infrastructure as code, or more descriptively, human-readable instructions that define cloud resources. And since these instructions live in text files, you can have them versioned with git, and keep track of any changes over time. I think this is awesome.
Additionally, terraform works with all major cloud providers, so if you prefer to use something else instead of AWS you can adapt the code.
Prerequisites for trying out terraform for configuring Rstudio server
Two things need to be installed before we can see terraform in action: the AWS command line interface (aws) and terraform itself.
I am not going to go into details here because different operating systems might have different steps on how to do it, so I suggest you follow the official documentation for your system.
You can check that both have been installed successfully by running aws --version and terraform --version in your preferred terminal.
Additionally, aws needs to be configured by typing aws configure. Then we have to enter the AWS Access Key ID, AWS Secret Access Key, and Default Region Name for the IAM user. If you don’t have an IAM user set up, you really should. Here is a guide from AWS.
Writing the first ever terraform configuration
The Terraform documentation is pretty good, so we are not really writing anything new, but copying from there. In a folder called rstudio-terraform, or something more appropriate, you can create a main.tf file and paste the following:
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.16"
    }
  }

  required_version = ">= 1.2.0"
}

provider "aws" {
  region = "us-west-2"
}

resource "aws_instance" "app_server" {
  ami           = "ami-830c94e3"
  instance_type = "t2.micro"

  tags = {
    Name = "ExampleAppServerInstance"
  }
}
It’s worth understanding the above specification, and the best place to learn more about the three blocks terraform, provider, and resource is in the docs.
The changes that I made were:
to add a different tag, so I changed ExampleAppServerInstance to RstudioTerraform,
to change the AMI. Since I was using Ubuntu before, I wanted to keep that, so I headed over to the Ubuntu Cloud Image Finder and found the AMI ID for 22.04, which is ami-03e08697c325f02ab (AMI IDs are region-specific, so pick the one for your region), and
to change the region to eu-central-1.
Additionally, I added a security group and a key name. If you did the previous manual steps you should have these ready, so just name them in the configuration. If not, create them in the AWS console. It is possible, of course, to create them with terraform, but we won’t go there in this blogpost.
Finally, to test that things work, I am requesting an output of the public IP of the resource that is going to be created. So my final main.tf file looks like this:
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.16"
    }
  }

  required_version = ">= 1.2.0"
}

provider "aws" {
  region = "eu-central-1"
}

resource "aws_instance" "app_server" {
  ami             = "ami-03e08697c325f02ab"
  instance_type   = "t2.micro"
  security_groups = ["THE_NAME_OF_THE_SECURITY_GROUP"]
  key_name        = "THE NAME OF THE KEY"

  tags = {
    Name = "RstudioTerraform"
  }
}

output "my-public-ip" {
  value = aws_instance.app_server.public_ip
}
Now, once this is saved, we open a terminal in the folder that holds the main.tf file and run terraform init, followed by terraform plan, and finally terraform apply (which asks you to confirm by typing yes).
The first two commands should be instantaneous, and the last one should take maybe 20 seconds to complete.
The EC2 instance should show up in the AWS console, and you can verify that the IP address that was printed on the terminal is the same one that the instance has in the Console listed under public IP address.
This only gets us halfway. We still need to do a bunch of stuff before we have Rstudio server running. But for now you can run terraform destroy and watch the EC2 instance being terminated. Repeating terraform apply will create a new instance, and terraform destroy will destroy it again.
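If you would rather not hard-code the key and security group names, they could be pulled out into variables. This is only a sketch under my own assumptions: the variable names key_name and security_group_name are made up for illustration, and the resource block is meant to replace the one already in main.tf, not to sit next to it.

```hcl
# Hypothetical variable names; they are not part of the original configuration.
variable "key_name" {
  description = "Name of an existing EC2 key pair"
  type        = string
}

variable "security_group_name" {
  description = "Name of an existing security group"
  type        = string
}

# Same instance as before, but reading the two names from the variables.
resource "aws_instance" "app_server" {
  ami             = "ami-03e08697c325f02ab"
  instance_type   = "t2.micro"
  security_groups = [var.security_group_name]
  key_name        = var.key_name

  tags = {
    Name = "RstudioTerraform"
  }
}
```

The values can then be passed on the command line with -var, or kept in a terraform.tfvars file in the same folder.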
Extending the terraform configuration to set up Rstudio server
Next, we need to run all those other commands in Ubuntu that update packages, install R and Rstudio server, create a user, and maybe something more.
It is possible to keep the whole configuration in one file, so adding code to main.tf would not be a problem. However, it seems more convenient to have multiple files that hold logical parts together, and terraform reads every .tf file in the folder anyway. In R terms, think of it as a package that has multiple functions in different R files.
So, create a new .tf file, maybe called remote.tf, since the code in there will do things on the remote EC2 instance.
The contents will be as follows:
resource "null_resource" "remote" {
  connection {
    type        = "ssh"
    user        = "ubuntu"
    private_key = file("/full/path/to/the/key.pem")
    host        = aws_instance.app_server.public_ip
  }

  provisioner "remote-exec" {
    inline = [
      # update indices
      "sudo apt update -qq",
      # install two helper packages we need (--yes so apt never waits for an answer)
      "sudo apt install --yes --no-install-recommends software-properties-common dirmngr",
      # add the signing key (by Michael Rutter) for these repos
      # To verify key, run gpg --show-keys /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc
      # Fingerprint: E298A3A825C0D65DFD57CBB651716619E084DAB9
      "wget -qO- https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc | sudo tee -a /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc",
      # add the R 4.0 repo from CRAN; $(lsb_release -cs) fills in the Ubuntu codename (jammy for 22.04)
      "sudo add-apt-repository --yes 'deb https://cloud.r-project.org/bin/linux/ubuntu $(lsb_release -cs)-cran40/'",
      "sudo apt install --yes --no-install-recommends r-base",
      # add the PPA with prebuilt CRAN packages and install the tidyverse from it
      "sudo add-apt-repository --yes ppa:c2d4u.team/c2d4u4.0+",
      "sudo apt install --yes --no-install-recommends r-cran-tidyverse",
      # download and install Rstudio server (-n makes gdebi non-interactive)
      "sudo apt-get install --yes gdebi-core",
      "wget https://download2.rstudio.org/server/jammy/amd64/rstudio-server-2022.12.0-353-amd64.deb",
      "sudo gdebi -n rstudio-server-2022.12.0-353-amd64.deb"
    ]
  }

  # upload the Rstudio server configuration to the home directory first
  provisioner "file" {
    source      = "/local/path/rstudio-terraform/rserver.conf"
    destination = "/home/ubuntu/rserver.conf"
  }

  provisioner "remote-exec" {
    inline = [
      # move the configuration into place and restart the service
      "sudo mv /home/ubuntu/rserver.conf /etc/rstudio/rserver.conf",
      "sudo systemctl restart rstudio-server.service",
      # setup the rstudio user
      "sudo groupadd rstudio-users",
      "sudo useradd -m -s /bin/bash -p $(perl -e 'print crypt($ARGV[0], 'password')' 'YOUR_PASSWORD') rstudio",
      "sudo usermod -a -G rstudio-users rstudio"
    ]
  }
}
That’s quite a lot of code. Let’s go through it step by step.
Understanding the sections in the additional .tf configuration file
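The file begins with resource "null_resource" "remote" {. I don’t know exactly why this is needed. The documentation is kind of unclear to me, but a lot of places mention this as the approach (and it turned out it works). As far as I can tell, a null_resource creates no infrastructure of its own and simply serves as a container for the connection and provisioner blocks.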
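One thing that helped me make sense of null_resource: provisioners run only when the resource they belong to is created. So if the EC2 instance is replaced while the null_resource stays in place, the commands would not run again. A possible way around this, sketched here as my own assumption and not as part of the configuration above, is the triggers argument, which recreates the null_resource whenever one of its values changes.

```hcl
resource "null_resource" "remote" {
  # Assumption: tie this resource to the instance ID, so the provisioners
  # run again whenever the instance is replaced by a new one.
  triggers = {
    instance_id = aws_instance.app_server.id
  }

  connection {
    type        = "ssh"
    user        = "ubuntu"
    private_key = file("/full/path/to/the/key.pem")
    host        = aws_instance.app_server.public_ip
  }

  # The provisioner blocks from remote.tf go here, unchanged.
}
```

With the destroy-and-apply workflow from this post everything is recreated anyway, so this matters mostly if you start changing the instance in place.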
Next comes the connection part, which I think is straightforward. We are just telling terraform to connect to the newly created instance using ssh with the key we provide.
Next, the remote-exec section tells terraform to execute a bunch of commands on the remote. Clever :). The commands that we are executing are copied from the official instructions for installing Ubuntu packages for R. The only changes I made were adding --yes to apt install and -n to gdebi, because we want these to run without asking something like are you sure you want to install..., and because there is no way to answer such a prompt (at least as far as I could see) once terraform is running.
Next, the file section uploads rserver.conf, the configuration for Rstudio server, to the EC2 instance. If you remember from the previous post, we need to configure which users are able to access Rstudio server. Uploading the file takes two steps here. First we upload it to the home directory of the user that is logged in (ubuntu), since that user cannot write to /etc/rstudio directly, then we move it into place on the remote EC2 with sudo.
The rserver.conf file should have these two lines:
# users allowed to access rstudio
auth-required-user-group=rstudio-users
The next remote-exec section then completes the setup by moving the file into place on the EC2 instance, restarting the service, and creating the proper user and group, also following the documentation linked in the previous post.
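Having YOUR_PASSWORD typed directly into remote.tf is not ideal. A possible alternative, sketched here with names I made up (rstudio_password and create_user_cmd are not in the original configuration), is to read the password from a variable marked as sensitive and build the useradd command from it.

```hcl
# Hypothetical variable; supply it outside the .tf files, for example
# through the TF_VAR_rstudio_password environment variable.
variable "rstudio_password" {
  description = "Password for the rstudio user"
  type        = string
  sensitive   = true
}

locals {
  # Same useradd command as in remote.tf, with the password interpolated.
  # Assumption: the password contains no single quotes.
  create_user_cmd = "sudo useradd -m -s /bin/bash -p $(perl -e 'print crypt($ARGV[0], 'password')' '${var.rstudio_password}') rstudio"
}

# In the last remote-exec block, the hard-coded useradd line would then
# be replaced by local.create_user_cmd.
```

Because the value is sensitive, terraform should also keep it out of the plan output, though I would still treat this as a starting point and not a complete solution for secrets.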
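Save remote.tf, run terraform init again (the null_resource comes from a separate provider plugin that terraform has to download first), and then go through the plan and apply steps once more. You should see the new instance created in about two minutes (because installing some of the packages will take time).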
Then log in to Rstudio server to verify that everything works as expected. Nice!
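To save some copy-pasting, an extra output could print the address to open in the browser. This is a small sketch that assumes Rstudio server is listening on its default port 8787 (our rserver.conf does not change it) and that the security group allows inbound traffic on that port. The output name rstudio-url is my own choice.

```hcl
# Prints the address of Rstudio server after terraform apply.
# Assumes the default port 8787 is open in the security group.
output "rstudio-url" {
  value = "http://${aws_instance.app_server.public_ip}:8787"
}
```

After terraform apply finishes, the URL shows up next to my-public-ip in the terminal.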
Final thoughts
There is some other stuff to discuss here. Running git init will convert the folder to a git repository (which can even be added to Github). Just make sure that any keys (if you have them in the same folder) are listed in the .gitignore file, and remember that remote.tf as written contains the Rstudio password in plain text. And the management of the tfstate file is a topic in itself, especially if you plan to share the resource within your organization.
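To give a flavour of what managing the state could look like, terraform can keep the tfstate file in an S3 bucket instead of the local folder. The sketch below is an assumption on my part and not something I set up for this post: the bucket name is hypothetical, and the bucket has to exist before running terraform init.

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.16"
    }
  }

  required_version = ">= 1.2.0"

  # Store the state remotely so it can be shared with others.
  backend "s3" {
    bucket = "my-terraform-state-bucket" # hypothetical, must already exist
    key    = "rstudio-terraform/terraform.tfstate"
    region = "eu-central-1"
  }
}
```

This would replace the terraform block in main.tf, and terraform init has to be run again after changing the backend.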