Clusters used to cost a lot of money. These days, though, you can roll your own cluster for a couple of quid. Here’s an intro to how to do that.
A cluster follows a hub-and-spoke model: a head node runs a piece of job-management software called a scheduler, which takes jobs from users and assigns them to compute nodes. The compute nodes are the machines that do the actual work.
AWS has a tool called pcluster (the CLI for AWS ParallelCluster) which you can install and use to build clusters in a VPC (Virtual Private Cloud).
Using pcluster, which relies on CloudFormation to build the cluster, you can build a cluster with a head node which will fire up the compute nodes when you submit a job to the scheduler. My personal favourite scheduler is SLURM, a free and open-source scheduler used on some of the world’s largest supercomputers.
The pcluster tool uses a configuration file to set the size of the head node, the size of the compute nodes, the job scheduler (SLURM in this case) and the networking configuration. You can also specify the AMI (Amazon Machine Image) to use for the head node and the compute nodes.
When a job is submitted, the scheduler launches the compute nodes, runs the job on them, and terminates them when the job finishes, thus saving you money. You can, of course, choose the size and number of compute nodes to use. For this example, I’m going to use the t2.micro instance type, one of the smallest instance types available on AWS. This is so we can experiment without having to fork out too much $$$.
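As a rough illustration, a minimal ParallelCluster 3 configuration along these lines might look like the sketch below. The subnet IDs, key name and queue names are placeholders, not values from this walkthrough; note MinCount: 0, which is what lets the cluster scale the compute fleet down to nothing when there are no jobs.

```yaml
# Sketch of a minimal pcluster config (placeholders throughout)
Region: eu-west-2
Image:
  Os: alinux2
HeadNode:
  InstanceType: t2.micro
  Networking:
    SubnetId: subnet-0123456789abcdef0    # public subnet
  Ssh:
    KeyName: cluster-key
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: queue1
      ComputeResources:
        - Name: t2micro
          InstanceType: t2.micro
          MinCount: 0                     # scale to zero when idle
          MaxCount: 4
      Networking:
        SubnetIds:
          - subnet-0123456789abcdef1      # private subnet
```

In practice you would let pcluster configure generate this file for you, as described below, and then tweak it.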
Let’s get started
First, we’ll get a working Ansible environment with all the AWS tooling on board: the AWS CLI and the pcluster tool. I’ve created a Docker image which has everything installed and ready to go. First, we’ll set some variables.
ANSIBLE_VER='6.7.0'
REGISTRY="cloudguyinbroadstone"
REPO="ansible-${ANSIBLE_VER}"
TAG='1.3'
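As a quick sanity check, you can echo the full image reference these variables expand to before pulling anything:

```shell
# Re-declare the variables and print the assembled image reference
ANSIBLE_VER='6.7.0'
REGISTRY="cloudguyinbroadstone"
REPO="ansible-${ANSIBLE_VER}"
TAG='1.3'
echo "${REGISTRY}/${REPO}:${TAG}"   # cloudguyinbroadstone/ansible-6.7.0:1.3
```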
Now, let’s create a project folder:
cd
mkdir cluster
cd cluster
Now let’s pull the image.
docker pull ${REGISTRY}/${REPO}:${TAG}
OK, we have the image. Now let’s run the container:
docker run -it --rm -v ~/.aws/:/root/.aws/ -v .:/ansible ${REGISTRY}/${REPO}:${TAG} /bin/bash
which should respond with something like this:
root@10f70634c7b8:/ansible#
When you use ls to list the contents of the /ansible folder inside the container, it should be empty. This is the ~/cluster folder on your host machine, bind-mounted into the container.
You should be able to get the version of pcluster by running:
pcluster version
{
"version": "3.7.2"
}
Run up a test cluster
Now we need to configure the cluster and generate a new config file. To do that, run the following:
pcluster configure --config mynewcluster.yml
Go through all the steps. If you need it to create a new VPC, by all means do; I did, producing a minimal configuration based on a new VPC with a public and a private subnet. The head node sits in the public subnet and the workers in the private one. You’ll also need an SSH key pair in the region: you can generate one in the EC2 console under ‘Key Pairs’, or just use an existing key you already have there. Docs are here.
So let’s create a test cluster called orion. The configuration file is in the cluster folder and is called pcluster-orion.yml. Let’s do a dry run first to make sure the configuration file is OK. If you were to use this configuration, you would need to change the security group ID, VPC ID and so on to suit your setup.
pcluster create-cluster --cluster-configuration ./cluster/pcluster-orion.yml --cluster-name orion --region eu-west-2 --dryrun true
should return something like this:
{
"message": "Request would have succeeded, but DryRun flag is set."
}
Nice. OK, let’s create the cluster for real, no dry run this time. N.B. this will cost some money, so make sure you are aware of that. Probably only a few dollars, but still. I’ve set all nodes to be t2.micro, which is small and dirt cheap. You can change this in the configuration later as you move to production. The command is:
pcluster create-cluster --cluster-configuration ./cluster/pcluster-orion.yml --cluster-name orion --region eu-west-2
The system will begin building your cluster and the head node will appear in the EC2 console. The compute nodes will appear when you submit a job to the scheduler.
Test the cluster
OK, so now the cluster is up and running, let’s test it. First, we need to SSH into the head node. To do that, we need to get the IP address of the head node. You can do that by running the following command:
pcluster describe-cluster --cluster-name orion --region eu-west-2
or just looking in the EC2 console for the instance called HeadNode.
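If you want to script this, the describe-cluster command returns JSON which you can parse for the head node’s address. The sketch below uses a hard-coded sample standing in for real pcluster output; the headNode.publicIpAddress field name is what recent pcluster 3.x releases return, and the sed works when the field sits on a single line (with real output you might prefer jq).

```shell
# Sample of the JSON shape describe-cluster returns (abridged, hypothetical values)
json='{"headNode": {"publicIpAddress": "35.178.204.112", "state": "running"}}'

# Extract the public IP with sed: capture whatever sits between the quotes
# after "publicIpAddress":
ip=$(printf '%s' "$json" | sed -n 's/.*"publicIpAddress": *"\([^"]*\)".*/\1/p')
echo "$ip"   # 35.178.204.112
```

Against a live cluster you would pipe `pcluster describe-cluster --cluster-name orion --region eu-west-2` into the same filter instead of the sample string.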
To SSH into the head node we need the SSH key available inside the container. Exit the old container and start it again, this time mounting the key as shown below. cluster-key.pem was generated in the Key Pairs section of the EC2 console.
docker run -it --rm -v ~/.ssh/cluster-key.pem:/root/.ssh/cluster-key.pem -v ~/.aws/:/root/.aws/ -v .:/ansible ${REGISTRY}/${REPO}:${TAG} /bin/bash
Now we can SSH into the head node, substituting your own head node’s address:
ssh -i "~/.ssh/cluster-key.pem" ec2-user@ec2-35-178-204-112.eu-west-2.compute.amazonaws.com
Run a job
Here’s a test job.
sbatch -N1 <<EOF
#!/bin/sh
srun hostname | sort
srun sleep 10
EOF
Cut and paste the whole lot into the SSH session and hit Enter. You should see a compute instance appear, build, process the job and drop the results into a file called slurm-1.out.
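The script itself is ordinary shell: srun hostname prints the hostname of each node allocated to the job (here just one, because of -N1), and the pipe sorts the lines. You can mimic that step locally, outside SLURM, where your machine plays the part of the single node:

```shell
# With one "node", sorting a single hostname just returns it unchanged
hostname | sort
```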
Delete the cluster
When you’re finished, you can delete the cluster using the following command:
pcluster delete-cluster --cluster-name orion --region eu-west-2
Conclusion
This is just a taster of what’s possible with AWS pcluster. You can configure the cluster in many different ways, exactly to your taste. Once you have the configuration file in a Git repo, you can make tracked changes and refine your installation to meet your organisation’s needs. You can then integrate it with Ansible to automate the build process, so that standing up a cluster becomes a matter of answering a few questions and letting Ansible/pcluster do the rest.