So the other day, I was browsing through Gumtree because it’s interesting to see what’s for sale within a one-mile radius of where I live.

I got to wondering: how does the item count change over time? Could I use AWS to track the number of items for sale in a given area on Gumtree?

With no clear API available, I had a look at the page and realised quite quickly that the item count sits in a single H1 tag, which is very easy to snag, returning, as it does, the text…

4018 ads in Broadstone, Dorset

Great, so I can use AWS Lambda with Python 3.11 to fetch the page, parse it for the H1 tag, and extract the item count. I’d then like it to update an RRD file with that count, so I can use RRDTool to graph the number of items for sale over time.
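The fetch-and-parse step can be sketched in a few lines. This is a hypothetical sketch, not the final Lambda code: the exact H1 markup and the need for a browser-like User-Agent are assumptions about the page.

```python
# Sketch: extract the ad count from the page's <h1> text.
# Assumes an H1 of the form "<h1 ...>4018 ads in Broadstone, Dorset</h1>".
import re
import urllib.request

H1_RE = re.compile(r"<h1[^>]*>\s*([\d,]+)\s+ads\b", re.IGNORECASE | re.DOTALL)

def extract_count(html: str) -> int:
    """Pull the leading number out of the first matching H1,
    e.g. '<h1>4018 ads in Broadstone, Dorset</h1>' -> 4018."""
    match = H1_RE.search(html)
    if not match:
        raise ValueError("No ad-count H1 found in page")
    return int(match.group(1).replace(",", ""))

def fetch_count(url: str) -> int:
    # A browser-like User-Agent may help avoid basic bot blocking (assumption).
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return extract_count(resp.read().decode("utf-8", errors="replace"))
```

A regex is fragile against markup changes, of course; BeautifulSoup would be more robust, but for a single H1 on a known page this keeps the dependencies down.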

It’s also a great opportunity to learn how to use a Docker container to run the Lambda function and learn what limitations there are compared to developing the solution in the AWS console.

There are a lot of advantages to developing the function locally, especially as AWS provide a Docker image for Lambda development. There is also the option to use AWS SAM (Serverless Application Model) to develop the function locally, but I’m going to use the Docker image for this example. SAM is great and does a lot of the heavy lifting, but I wanted to see how it would work in Docker and, if successful, whether we could maybe wire up some GitHub Actions to build and deploy the function to AWS. OK, well, let’s get started.

Get the framework

Here’s the main documentation for AWS Lambda with container images, and it’s short and to the point.

I’ll summarise the steps here, but it’s worth reading in full.

First we need the Dockerfile and related files to build the image. At minimum, that’s a lambda_function.py file, a Dockerfile and a requirements.txt file, so let’s create a directory and the files.

mkdir gumtree
cd gumtree
touch lambda_function.py && touch Dockerfile && touch requirements.txt

lambda_function.py

Paste the following into the lambda_function.py file. I’m using the Python 3.11 image (see Dockerfile) because it’s the default Python, but there are others, of course, and not just Python: base images also exist for Node.js, Java, .NET, Go, Ruby, and Rust.

import sys
def handler(event, context):
    print('This line will appear in Cloudwatch Logs : AWS Lambda using Python ' + sys.version + '!')
    return {
        'statusCode': 200,
        'body': 'This line appears in your Lambda console.'
    }

Dockerfile

Paste the following into the Dockerfile file.

FROM public.ecr.aws/lambda/python:3.11

# Copy requirements.txt
COPY requirements.txt ${LAMBDA_TASK_ROOT}

# Install the specified packages
RUN pip install -r requirements.txt

# Copy function code
COPY lambda_function.py ${LAMBDA_TASK_ROOT}

# Set the CMD to your handler (could also be done as a parameter override outside of the Dockerfile)
CMD [ "lambda_function.handler" ]

requirements.txt

The requirements.txt file starts out empty, but you’ll add packages here to install required libraries. Initially, I’d use boto3 for the AWS SDK.

boto3

Build, test, push

Now we have the files, we can build the image, test it locally, and then push it to ECR. I’ve documented the commands in the README.md file in the GitHub repo so I won’t repeat them all here. Once you have the image in ECR, you can create the Lambda function in the AWS console.
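For orientation, the standard cycle looks roughly like this; the repo’s README.md is the authoritative version, and the image/registry names here are placeholders.

```shell
# Build the image locally
docker build -t gumtree .

# Run it locally; the AWS base image includes the Runtime Interface Emulator
docker run -p 9000:8080 gumtree
# In another terminal, invoke the handler:
curl "http://localhost:9000/2015-03-31/functions/function/invocations" -d '{}'

# Push to ECR (assumes REGISTRY, REPO and REGION are set in your shell)
aws ecr get-login-password --region ${REGION} | \
  docker login --username AWS --password-stdin ${REGISTRY}
docker tag gumtree:latest ${REGISTRY}/${REPO}:latest
docker push ${REGISTRY}/${REPO}:latest
```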

Policies and Roles

No AWS work is complete without a policy and an accompanying role. They’re documented in the GitHub repo. Naturally, you’ll need to change the account number to your own.
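For reference, the trust policy on the execution role usually looks like the standard Lambda one below; this is a sketch, and the versions in the repo are the ones to use.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "lambda.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
```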

Create the Lambda function

I use the AWS CLI to delete and re-create the function like this:

aws lambda delete-function --function-name ${LAMBDA_NAME}
aws lambda create-function \
  --code ImageUri=${REGISTRY}/${REPO}:latest \
  --description "Count of number of items for sale near Broadstone" \
  --environment Variables="{CLOUDFRONT_DISTRIBUTION=${CLOUDFRONT_DISTRIBUTION},RRD_FILE=${RRD_FILE},CSV_FILE=${CSV_FILE},GUMTREE_URL=${GUMTREE_URL},S3_BUCKET=${S3_BUCKET}}" \
  --function-name ${LAMBDA_NAME} \
  --timeout 30 \
  --package-type Image \
  --role "arn:aws:iam::${ACCOUNT_ID}:role/AWSLambdaBasicExecutionRole-${LAMBDA_NAME}"

Once complete, you can test the function in the console, and you should see the output in the CloudWatch logs.
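You can also invoke it from the CLI rather than the console; something like this, reusing the LAMBDA_NAME variable from above (the --cli-binary-format flag is needed for raw JSON payloads with AWS CLI v2):

```shell
aws lambda invoke \
  --function-name ${LAMBDA_NAME} \
  --payload '{}' \
  --cli-binary-format raw-in-base64-out \
  response.json
cat response.json
```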

Now automate all of it

All automation starts with using the command line to see what works and how it works. Once that’s done, the next step is to automate it.

In this case, the most suitable automation is to use Github Actions. These are similar to BitBucket pipelines and allow you to create a workflow to automate the build, test and deploy process.

Once you clone the repo, you’ll see the .github/workflows directory and the build-and-deploy.yml file. This is the workflow that will check out the code, log in to the AWS Elastic Container Registry (ECR), build the image, push it to ECR, update the ’latest’ tag and then create/update the Lambda function.
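A condensed sketch of what such a workflow might look like is below; the repo’s build-and-deploy.yml is the real thing, and the step names, action versions and use of update-function-code here are assumptions.

```yaml
name: build-and-deploy
on:
  push:
    branches: [ main ]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ vars.REGION }}
      - uses: aws-actions/amazon-ecr-login@v2
        id: ecr
      - name: Build and push image
        run: |
          docker build -t ${{ steps.ecr.outputs.registry }}/${{ vars.REPO }}:latest .
          docker push ${{ steps.ecr.outputs.registry }}/${{ vars.REPO }}:latest
      - name: Update Lambda function
        run: |
          aws lambda update-function-code \
            --function-name ${{ vars.LAMBDA_NAME }} \
            --image-uri ${{ steps.ecr.outputs.registry }}/${{ vars.REPO }}:latest
```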

All that’s required are the appropriate secrets and variables to be added to the repo under Settings, Secrets and variables, Actions.

The secrets are

  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
  • AWS_ACCOUNT_ID

The variables are

  • CLOUDFRONT_DISTRIBUTION
  • LAMBDA_NAME
  • REGION
  • REPO
  • S3_BUCKET

Once these values are set, you can commit the changes and the workflow will run. You can watch it running in the Actions tab on GitHub.

Summary

Docker images and Python 3.11 are a great combination for AWS Lambda functions when you want to work on code locally and then deploy it to AWS. It’s particularly useful if you have packages like rrdtool that are not natively available in AWS Lambda and are (surprisingly) fiddly to compile. Captured as a Dockerfile and a requirements.txt, the build becomes much easier to manage and deploy, meaning you can focus on the Python code and not the environment.

As time goes by, your organisation will most probably move to keeping a repo of managed templates (common controls and libraries) for jobs such as this to make the development process even easier and faster.

All in all then, images are a great way to develop and deploy Lambda functions for larger blocks of code, as long as they stay under Lambda’s 15-minute execution limit. The other beauty is that images can be shared in publicly accessible repos like Docker Hub and ECR, so you can share your work with others.

Output

So, here are the final results.

[Graphs: hourly, daily, weekly, monthly and yearly item counts]