Part 2: Django, Docker, Elastic Beanstalk, RDS, SQS, Celery, and .ebextensions
Have your bottle of ibuprofen ready.
For a solution on how to make Django + Celery + Docker fully scalable on Elastic Beanstalk, see Part 3.
In Part 1 I deployed a Docker app running Django to AWS Elastic Beanstalk (EB) using Docker Compose. There, we just used the standard internal sqlite3 database for demonstration purposes and not a remote database, but if we’re using EB then it makes sense to have our database rest inside of AWS’s RDS service. At the end of the post, I commented that adding AWS RDS to the configuration would be “straightforward” and shrugged off the .ebextensions folder. As it turns out, it is not “straightforward,” and the deployment scheme we drew up completely bypassed .ebextensions!
In this post, we’re going back to the drawing board to redesign our deployment routine to include an RDS database and to launch a Celery asynchronous background task manager.
A quick note: in multi-instance EB environments, you might have problems staying logged in. What gives? Go to your EB environment dashboard, open the configuration panel, find your load balancer settings, and enable session stickiness.
An additional issue: after following through with setting up RDS and turning off Django’s debug mode, I noticed that my app was functioning but my load balancer health check was suddenly erroring out, putting my EB environment into “severe” status with a Target.ResponseCodeMismatch error code while simultaneously telling me that “ELB health is failing or not available for all instances” and not giving me which HTTP code it was unexpectedly receiving. When reading the environment logs, I also noticed that /var/log/nginx/access.log, which contains information about health check responses, had disappeared. The load balancer is responsible for sending out health checks to each instance, and it is possible that something went haywire there (this would not be the first time I broke an EB environment in an unexpected way after heavy experimentation). As frustrating as it is, the fix appears to have been to just make an entirely new environment. This is not ideal, but it is important to have EB health checks functioning properly.
Associating an RDS Database
Creating a new RDS database, either fresh or from a snapshot of an existing database, is very easy with EB. Simply navigate to your EB environment dashboard, go to the configuration panel, and either create a fresh database or create a new one from an existing snapshot name. Be careful with how you set up your deletion rules: if you terminate the environment without taking precautions, there is a chance your RDS database will be deleted too.
For those of you wanting to associate a new EB environment with an existing RDS database, I have bad news. Apparently, the underlying foundations of EB do not permit this. You will have to either expose your database to the internet and inject its information as environment variables, or somehow figure out how to do it within the VPC.
Using ebcli’s Deploy Command and Updates to docker-compose.yml
In the last post, we deployed our app by just sending EB our docker-compose.yml file, which simply contained a link to our image. This was fine for what we were doing, but we were ultimately bypassing some important steps. Most notably, the .ebextensions folder was embedded into our image and not executing on deployment! This is fine if you don’t need any pre- or post-deployment routines or hooks, but for any sophisticated service this will likely be an issue.
So we’ll have to deploy the old-fashioned way: by zipping our app and deploying it with the eb deploy <ENVIRONMENT-NAME> command in the working directory of your Django app (this must be configured with the ebcli first by running eb init!).
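For reference, the basic flow from the project root looks roughly like this (the application and environment names are placeholders):
# one-time setup: link this directory to an EB application on the Docker platform
eb init my-application --platform docker --region us-east-1
# zip the working directory and deploy it to the target environment
eb deploy my-environment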
It is very likely that once you have your deployment working, you can set up your deployment routine to more closely resemble what we did in Part 1 by telling eb deploy to only upload docker-compose.yml, .ebextensions, .platform, and any other directories you need, and to point Docker to a cloud-based image in docker-compose.yml. But that is a headache for a different day, and we’ll just send our entire app to the cloud and have docker-compose.yml build the image there. My new docker-compose.yml for this deployment routine looks like this:
version: '3.8'
services:
  web:
    build:
      context: .
      dockerfile: Dockerfile
    image: project:latest
    command: bash -c "python3 manage.py migrate && python3 manage.py runserver 0.0.0.0:8000"
    ports:
      - 80:8000
    env_file:
      - .env
There are some important differences here from Part 1. First, as I said above, we’ll just build the image locally on deployment and call it project:latest (it is not being sent to a remote repo, like Docker Hub or ECR, so don’t worry about overwriting anything unless you explicitly tell it to). The build section’s context is the folder, relative to docker-compose.yml, where the Dockerfile lives, and dockerfile is the name of the Dockerfile (useful if you have multiple Dockerfiles depending on the stage of deployment). Next, we’re going to remove our database migration from .ebextensions and just do it before running the server; to make a long story short, the formatting after the command argument is the best way to chain together multiple commands on Amazon Linux 2. Lastly, we’re including an env_file. More on this in a second.
I have also made another change in my deployment routine: a second “base” image that builds upon the first. Since we’re building the app image locally, it would otherwise have to install all of our Python libraries during every deployment, and depending on how many you have, this could take a lot of time. In my second “base” image, I have just copied the requirements.txt pip install from Part 1 and referenced the original base image (the one with all of the OS libraries). I tagged this one project:base-python:
# Use Original Base Image
FROM project:base
# copy requirements.txt to the Docker workdir and install all dependencies
COPY requirements.txt /app/requirements.txt
RUN python3 -m pip install -r requirements.txt
After generating this image and pushing it to my repos, I then go to work on the deployment Dockerfile that references it. The Dockerfile I am deploying and referencing in docker-compose.yml is just the port-opening and Django-project-copying commands from our Part 1 image. If you are confused, just know that I split the original Dockerfile from Part 1 into two different images to speed up the deployment by pre-installing the Python libraries. Here is the Dockerfile I am deploying:
FROM project:base-python
# copy project
COPY . /app/
# port where the Django app runs
EXPOSE 80
EXPOSE 8000
# Use docker-compose to run the Django app
Note that if you’re using a public Docker Hub repo for project:base-python, then the FROM reference should be fine. If you’re using a private repo, then it’s better to host project:base-python on ECR, where you won’t have permission issues. In this case, you can just change the FROM line to point to the ECR URI of your tagged image, as sketched below. This process will make production deployments lighter, but its feasibility depends on your project’s needs.
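For example, the FROM line would reference the image’s full ECR URI instead of a local tag (the account ID, region, and repository name here are placeholders):
# pull the pre-built Python base image from a private ECR repository
FROM 123456789012.dkr.ecr.us-east-1.amazonaws.com/project:base-python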
Environment Variables and RDS
It is good security practice to avoid hardcoding your database login information in code. Instead, we can scoop it up programmatically at runtime through environment variables, generally handled by the standard os library in Python. Here is what this generally looks like in settings.py when using a remote database:
DATABASES = {
    'default': {
        'ENGINE': 'django.contrib.gis.db.backends.postgis',
        'NAME': os.environ['RDS_DB_NAME'],
        'USER': os.environ['RDS_USERNAME'],
        'PASSWORD': os.environ['RDS_PASSWORD'],
        'HOST': os.environ['RDS_HOSTNAME'],
        'PORT': os.environ['RDS_PORT'],
    }
}
For those of you familiar with deploying Django to EB without Docker, this has likely never raised an issue. However, if you were to simply run this code with the configuration in Part 1, you would find that these keys are not available, causing your code to error out and your deployment to fail. What gives? At the end of Part 1, I showed you how to inspect your Docker environment after ssh’ing into your instance. Running docker inspect on our instance, we will see that these RDS environment variables (or any environment variables injected via the EB environment dashboard) are not listed. And here it is again, the sinking feeling of yet another possibly insurmountable deployment issue on EB. However, an unlikely hero, the AWS EB documentation, will save the day.
Above, in our production docker-compose.yml, we included a reference to an environment file. As it turns out, the environment variables injected into our EB environment are only available to be directly referenced during deployment, and are otherwise (specifically on the Docker platform on EB) stored in a hidden file called .env placed in the root directory of our application. This includes the RDS environment variables (hostname, username, etc.) for the database associated with our EB environment. According to the AWS documentation, we can simply scoop this file up via the env_file entry in docker-compose.yml, and its contents will be injected as environment variables that can be programmatically referenced by Python and Django. With these changes, your deployed instances should have ready access to AWS RDS. We can also add more environment variables via the environment argument; this is useful if you want to easily switch Django’s DEBUG variable in settings.py depending on whether you are debugging or deploying to production, as sketched below.
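A minimal version of that switch (DJANGO_DEBUG is a variable name of my own choosing; set it under the environment argument in docker-compose.yml or in the EB console):
import os
# settings.py: default to production behavior unless DJANGO_DEBUG is explicitly set to a truthy value
DEBUG = os.environ.get('DJANGO_DEBUG', 'False').lower() in ('true', '1')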
Django Static Files: A Pitfall of Docker + Elastic Beanstalk
Static files are all of the JS, CSS, and media used to make your web pages function. Django aggregates these into a central location when python3 manage.py collectstatic is run. In Part 1, I deployed Django in debug mode, and static files were being served as usual. However, in production, Django changes where and how it serves its static files. Generally, static files are served via URL references in your web pages, and Django treats these URLs a bit differently than those you’ve defined in your project. In non-Docker deployments running Django with DEBUG=False, you must configure your proxy server via .ebextensions to serve static files. For example, .ebextensions/01_staticfiles.config would look like:
option_settings:
  aws:elasticbeanstalk:environment:proxy:staticfiles:
    /static: staticfiles/
However, the AWS documentation shows that configuring the proxy to serve static files on Amazon Linux 2 is unavailable for Docker. Some people have hacked around this limitation by copying the files elsewhere on the system and then configuring nginx to look for them there. Others seem to have handled it by attaching volumes via docker-compose.yml. However, I’m not a fan of keeping static files on the web servers themselves, as it puts additional load on your servers and doesn’t scale well at all (what happens when a user needs to upload an image in your multi-instance environment without a central place to put it?). A better solution is to use AWS S3 to host them, which can be integrated into your project directly with django-storages without much additional work or changes to the standard Django workflow beyond configuring settings.py. Not only does this get around the problem, but you also get your own centralized CDN for all static files and media. You can also extend the storage backends to point to different static URLs for debug or production if you’re worried about messing things up. Once this is configured (getting the S3 bucket permissions appropriately accessible in read-only mode can be a frustrating but very important process), we will have successfully bypassed this limitation in serving static files. You can either send static files to S3 locally by building some switches into settings.py, or you can add collectstatic to the commands in docker-compose.yml (but this can be risky for accidentally overwriting your production static files!).
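For reference, the heart of the django-storages setup is just a handful of settings in settings.py; a minimal sketch might look like the following (the bucket name and region are placeholders, and the exact setting names depend on your django-storages version):
# settings.py: hand static and media files off to S3 via django-storages' boto3 backend
AWS_STORAGE_BUCKET_NAME = 'my-project-static'  # placeholder bucket name
AWS_S3_REGION_NAME = 'us-east-1'  # placeholder region
AWS_S3_CUSTOM_DOMAIN = f'{AWS_STORAGE_BUCKET_NAME}.s3.amazonaws.com'
STATICFILES_STORAGE = 'storages.backends.s3boto3.S3Boto3Storage'
DEFAULT_FILE_STORAGE = 'storages.backends.s3boto3.S3Boto3Storage'
STATIC_URL = f'https://{AWS_S3_CUSTOM_DOMAIN}/static/'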
Celery + Docker + Django + EB: Welcome to the Thunderdome
Any sophisticated web platform will want the ability to schedule and queue tasks. Celery is just that service: a way to schedule and execute asynchronous tasks. Using django-celery and django-celery-beat, and configuring our Django project accordingly, we can use our own database as a centralized task scheduler. Celery requires two components: the Beat, which sends out the signal that a task needs to be run, and the Workers, which execute the task. The Beat and the Workers need to be connected by some sort of messaging system so that they can talk to each other. For this, the obvious choice is AWS SQS, which integrates with django-celery and gives you so many free messages that this will likely not cost you anything to implement. There are other tutorials out there which show you how to configure SQS in settings.py as your message broker between Beat and Workers. Pro tip: be sure to set up a dedicated queue per environment and switch out the queue names as you change between production deployments and debug setups. If your production queue and your debug queue are the same, then tasks could be sent to either! I like to use the same switch in settings.py that I used for choosing between local and production static files to solve this issue.
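As a rough sketch of what those SQS settings can look like (assuming the usual celery.py app that reads CELERY_-prefixed settings; the queue name and region are placeholders you would switch per environment):
# settings.py: use SQS as the Celery broker; credentials come from the instance role or env vars
CELERY_BROKER_URL = 'sqs://'
CELERY_BROKER_TRANSPORT_OPTIONS = {
    'region': 'us-east-1',      # placeholder region
    'polling_interval': 10,     # seconds between SQS polls
}
CELERY_TASK_DEFAULT_QUEUE = 'my-project-production'  # use a different queue name for debug
# let django-celery-beat keep the task schedule in the database
CELERY_BEAT_SCHEDULER = 'django_celery_beat.schedulers:DatabaseScheduler'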
However, this is where Elastic Beanstalk becomes very unideal. First, Celery should run as a daemon in most normal configurations, and configuring/debugging this on deployment is a total nightmare. Second, you only ever want one Beat running at a time, which can be problematic on an auto-scaling service like EB. Multiple Beats mean that multiple signals to run a task are being sent, causing task duplication, which can cause all sorts of issues. Third, EB currently offers no ability to protect a specific instance from being terminated during auto-scaling. Even if you get only one Beat running, it always exists under the threat of being trimmed. EB treats all instances equally, and the leader has no protective weight. Lastly, it is difficult to decouple your Beat and Workers from your web environment: generally, you need access to your database and the Django ORM in order to execute most tasks.
A new leader instance is elected by EB when eb deploy is run. The leader is just a chosen instance in your environment that executes exclusive code. Ideally, we could leverage the leader_only flag in .ebextensions to start a Beat on only the leader (a sketch of that syntax follows this paragraph), and then allow the Workers to run on every instance. We would then want to make sure that we turned off auto-scaling by setting the same minimum and maximum number of allowed instances, as a stopgap solution to protect our Beat. There are creative solutions for trying to keep a leader identified at all times during instance adjustments, but this is too much of a headache for now. An additional issue is that you cannot tell which instance is the leader, so identifying which instance your Beat is running on would be a nightmare of ssh’ing and proc checks.
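For reference, leader_only hangs off a container_commands entry in an .ebextensions config, something like the following (the project name is a placeholder, and this is not the route we end up taking here):
container_commands:
  01_start_celery_beat:
    # runs only on the instance EB designates as the deployment leader
    command: "celery -A project beat --detach"
    leader_only: true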
The ideal setup would be to decouple the Beat and the Workers and to put them in a different environment that can be more easily managed, requiring you to open your database to the internet and deploy your Django code somewhere else so that the Beat could read the database and access your ORM. You would have to be careful to keep your models.py up-to-date between your Celery environment and your web environment though, otherwise your ORM will begin to throw errors. Unsurprisingly, there is no good documentation for how to accomplish this. Another task for another day.
For now, let’s compromise on auto-scaling and keep our instance count rigid. This is not ideal and potentially costly, but it is so, so hard to give up the ease of use of EB. In a previous project, I used Amazon Linux 1’s supervisor to run Celery Worker and Beat as daemons, with Beat only running on the leader thanks to the leader_only flag in the .ebextensions config where I was launching the service. However, supervisor does not come configured in Amazon Linux 2, and the paths to reference any install are totally different. Unless we’re down to try hours of hacking, this is not an ideal path forward. Additionally, solutions attempting to use platform hooks did not work with our configuration (after hours of trying, I forget the reason this particular attempt failed). To make a long story short, after several exhausting hours of trying, what I found to work is to take advantage of docker-compose and run Beat and Worker in separate containers alongside our web app. This does not scale as well as the leader_only option we previously used on Amazon Linux 1, but it may be possible to kill Beat containers with platform hooks if they fail the EB_IS_COMMAND_LEADER environment variable check (an untested sketch of this follows below). I have not tried this, but it is a promising path for future work. For now, we will live with a single EB instance, and we can manually kill extra Beat containers through ssh if we have to spin up a reasonable number of additional instances.
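To be clear, I have not tried this, but an untested sketch of such a hook (placed under .platform/hooks/postdeploy/ and assuming the compose service for Beat is named celery_beat, as it is below) might look something like:
#!/bin/bash
# untested sketch: stop the Beat container on any instance that was not the deployment leader
IS_LEADER=$(grep -s EB_IS_COMMAND_LEADER /opt/elasticbeanstalk/deployment/env | cut -d '=' -f 2)
if [ "$IS_LEADER" != "true" ]; then
  # the name filter matches the docker-compose service used for Beat
  docker stop $(docker ps -q --filter "name=celery_beat") || true
fi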
Ok, so let’s set up a multi-container environment to handle Celery Beat and Worker alongside our app. First problem: the Worker requires the notoriously difficult-to-install pycurl Python library (Celery’s SQS transport depends on it). Let’s go back to our Python base image (the second image we built above to handle our requirements.txt install, which references our initial base image from Part 1 in its FROM line) and add the following block to install pycurl:
RUN yum -y install openssl-static && \
export PYCURL_SSL_LIBRARY=openssl && \
pip3 install pycurl==7.43.0.5 --global-option="--with-openssl" --upgrade
Rebuild your base Python image, tag it as base-python, and push to ECR.
Now, we need to configure our Dockerfile and docker-compose.yml. Here, I will take inspiration from this blog post. In the root of your Django project, add a folder called celery and create two subfolders called beat and worker. In each, we will add a bash script simply called start. In celery/beat/start, add the following:
#!/bin/bash
set -o errexit
set -o nounset
rm -f './celerybeat.pid'
celery -A project beat -l info
Then, in celery/worker/start, add:
#!/bin/bash
set -o errexit
set -o nounset
celery -A project worker --loglevel=info
In both scripts, replace project with the name of the folder off the Django root directory where settings.py lives (i.e. from the Django root, I would find settings.py under project/settings.py). Now, in our production Dockerfile, we will add commands to copy these files to the Docker image root so that we can call them later from docker-compose.yml. Your new production Dockerfile should look like this (using the base Python image stored at the ECR URL):
# Use Base Python Image
FROM URL:base-python
# copy project
COPY . /app/
ENV DJANGO_SETTINGS_MODULE=project.settings
# copy the Celery start scripts, strip any Windows line endings, and make them executable
COPY ./celery/worker/start /start-celeryworker
RUN sed -i 's/\r$//g' /start-celeryworker
RUN chmod +x /start-celeryworker
COPY ./celery/beat/start /start-celerybeat
RUN sed -i 's/\r$//g' /start-celerybeat
RUN chmod +x /start-celerybeat
# port where the Django app runs
EXPOSE 80
EXPOSE 8000
Again, replace project.settings with your specific settings.py path. Now, let’s extend our original docker-compose.yml into the following:
version: '3.8'
services:
  web:
    build:
      context: .
      dockerfile: Dockerfile
    image: project:latest
    command: bash -c "python3 manage.py migrate && python3 manage.py runserver 0.0.0.0:8000"
    volumes:
      - .:/app
    ports:
      - 80:8000
    env_file:
      - .env
  celery_worker:
    build:
      context: .
      dockerfile: Dockerfile
    image: django_celery_worker
    command: /start-celeryworker
    volumes:
      - .:/app
    env_file:
      - .env
    environment:
      - DJANGO_SETTINGS_MODULE=project.settings
    restart: on-failure # will restart until it succeeds
    depends_on:
      - web
  celery_beat:
    build:
      context: .
      dockerfile: Dockerfile
    image: django_celery_beat
    command: /start-celerybeat
    volumes:
      - .:/app
    env_file:
      - .env
    environment:
      - DJANGO_SETTINGS_MODULE=project.settings
    depends_on:
      - web
There is much to explain here. Again, we are building three separate images, and in each we are referencing the Dockerfile that we just produced above. The volumes entry refers to the app folder that we create in our Dockerfile; it is important that the Celery images have these contents because they need access to our Django settings.py. There is likely a way to “trim the fat” from the rest of the project so that we don’t include more than we have to and make each Celery image bigger than necessary, but for now we will just deal with it. The command in each of the Celery services references the bash scripts we made above as a way to execute Beat and Worker. Additionally, we are assuming that you have settings.py set up to use RDS; otherwise, Worker and Beat might be using separate copies of the same database. If you are not using a remote database, it would be a good idea to build a fourth container to host a database and use depends_on accordingly. We are also using the .env file in each service to get our RDS credentials (note: for local testing purposes you’ll want to comment this line out, and if you make a local .env file, be sure to delete it before deploying to EB or else it will overwrite EB’s .env file!). Lastly, we add restart: on-failure to the worker service. This is important: even though we make both Celery containers depend on the web service, depends_on only controls the order in which containers start, not whether the web app inside is actually ready! I had several occasions where the Worker started before the web app was up, making the database unavailable during local testing and causing docker-compose to error out. The restart policy will just keep restarting the errored-out Worker container until the web service boots up.
You may then deploy to EB, and it should work! I use django-celery-results to store task results in the database for easy confirmation that my setup is working. You can also see the standard output of Beat and Worker in the Elastic Beanstalk environment logs by inspecting /var/log/eb-docker/containers/eb-current-app/eb-stdouterr.log, either over ssh or via eb logs.
An Imperfect Solution
I have (somewhat) successfully combined Django, Docker, Celery, and RDS on AWS Elastic Beanstalk while using SQS as my message broker. Somewhat defeating the purpose of EB, this solution does not scale to multiple instances without producing multiple Beats, but it should get you deployed in a limited capacity until you can figure that part out (leave a comment if you do!). Some potential solutions are killing the Beat containers on non-leader instances through platform hooks (see my brief description above) or trying something like redbeat. The best solution would be to remove your Beat and Worker from your web service’s EB environment entirely, perhaps to a different EB environment, and then connect them to your RDS instance manually (this may require exposing the database to the internet if you can’t get your VPC grouping right). If I get a solution like this working, I will be sure to write another post.