Part 3: Adding Scalability: Django + Celery + Docker + Elastic Beanstalk
Finally, Scaling to the Clouds
Note: we are continuing our example from Parts 1 and 2. If you have not read those posts, you may not fully understand the process I am describing below, or will miss out on configurations that I have previously made.
In Parts 1 and 2, I showed you how to combine Django + Docker into the AWS Elastic Beanstalk (EB) architecture. We also took advantage of RDS and S3 for our cloud-based database and static files CDN. By the end of Part 2, we had a Django app running in Docker on Elastic Beanstalk with Celery running for handling our asynchronous tasks, while also using the database as our Celery task scheduler. I didn’t explicitly walk through the Celery configuration in Django, but fear not…this is covered in the official Celery documentation…I have not done anything out of the ordinary. I’m also using AWS SQS as our message broker, which means that AWS will handle communications between Celery and Django.
A limitation of our configuration in Part 2 was that Django, the Celery Beat (what initiates the tasks), and the Celery Worker (what actually processes the task) were running in a Docker multi-container environment, where Django, Beat, and Worker were all in separate containers on the same machine. This is fine if your tasks aren’t compute heavy and you don’t need to take advantage of scaling, however…why use Elastic Beanstalk then? If you did scale with this configuration, you would also have multiple Beats launched, which means that you would have duplicated tasks launched for every instance that is running. This can quickly overwhelm your system and cause issues with database locking. What we ideally want is to separate Celery from the Django EB environment, so that Django can take advantage of auto-scaling while Celery can be more carefully managed to have a single Beat and any number of Workers elsewhere (you can have multiple Workers, which will take tasks as they are made available from the queue).
If you are here after scouring the internet for a solution for scaling Django and Celery on a service like EB, this post will be an oasis in a desert. There are tons of pleading StackOverflow posts asking for advice here, all with less-than-helpful solutions (especially for systems based on Amazon Linux 2, where supervisor is not an out-of-the-box configuration). Here, I will provide you with one way to do it…but the lessons learned here should be able to help you with any other scalable Celery configuration that you decide on.
Redeploying Our Django App
In Part 2 we deployed Django, Beat, and Worker in 3 separate containers in a multi-container Docker deploy to a single EB environment. However, this solution will not scale in EB for the reasons listed above. Here, we are going to separate the environments that Django and Celery run on, and use SQS and the RDS database for all of them to communicate. We'll use our EB Web environment to continue to host Django. Continuing from Part 2, we will adapt our docker-compose.yml file to serve just the Django web service container and remove the Celery Beat and Worker services:
version: '3.8'
services:
  web:
    build:
      context: .
      dockerfile: Dockerfile
    image: project:latest
    command: bash -c "python3 manage.py migrate && python3 manage.py runserver 0.0.0.0:8000"
    volumes:
      - .:/app
    ports:
      - 80:8000
    env_file:
      - .env
Use eb deploy <ENVIRONMENT NAME> from your command line to deploy the app to the environment hosting the web service only. You can now freely set instance limits in the EB configuration panel knowing that your Django app can scale according to demand, harnessing the best feature of Elastic Beanstalk! One more tip for scaling instances with Django: if you have issues with users being continuously signed out in multi-instance EB environments, you can turn on session stickiness under the load balancer settings in the EB configurations. This will make sure users are always routed to the instance that their session is attached to.
Note: As I mentioned in Part 2, I have had problems with my web environment entering unhealthy states after a number of deploys. The health check is a result of your load balancer pinging your instances at a specified path and port. However, eventually the service stops returning a health check altogether. I have not yet figured out the cause of this, but the deploys themselves are successful and the app functions normally. I even made an endpoint specifically to return a 200 status…but the issue seems to be with the load balancer itself. It is super annoying to see the red indicators around what should be a perfectly healthy environment. At least one other person has experienced this problem with the same sort of basic configuration that we are using here. If I ever find the cause or solution, I will update this post accordingly.
Alright, so using our previous configurations, we have pushed the Django web app to our web EB environment. Now, we will set up Celery in a totally different environment.
Configuring the Exclusive Celery Environment
The nice thing about Elastic Beanstalk is that it not only allows you to spin up web environments, but also worker environments! This is ideal for running Workers, since your Worker pool can dynamically adapt to traffic (remember: you can have many workers…but you only want one Beat!). So, let’s go ahead and create a new EB worker environment through the EB dashboard. You can just use their sample app initially while we get things set up.
There are a few points to understand before continuing. First, we're using our database as our centralized task manager (an approach I highly recommend for many reasons, including ease of use with scheduled tasks and the ability to execute tasks on demand through the Django admin dashboard). Since we're using RDS as our database, and Elastic Beanstalk annoyingly does not allow you to associate your environment with an existing database directly, we'll need to circumvent that problem in a different way. Secondly, Celery requires access to your Django code, especially settings.py. That means that you will need to deploy the entire Django app to your EB Celery environment in the same way that you deploy your Django app to the EB web environment. I've encountered stale code running on Celery before in EB…so make sure you keep your Celery instances up to date as your task code changes in Django. There may be more sophisticated ways to handle this, but I have not had time to test and dig in deeper.
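As an aside, pointing Beat at the database is typically done with the django-celery-beat package; here is a hedged settings.py sketch (package and setting names come from its documentation, not from my earlier configuration):

```python
# settings.py additions — assumes the django-celery-beat package is installed
INSTALLED_APPS = [
    # ... your existing apps ...
    'django_celery_beat',
]

# Store the Beat schedule in the database so tasks can be
# created and triggered from the Django admin dashboard
CELERY_BEAT_SCHEDULER = 'django_celery_beat.schedulers:DatabaseScheduler'
```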
First Things First: Integrating the RDS Database with Our Celery Environment
As I said above, we cannot associate an existing RDS database with a new EB environment via the EB dashboard. Recall that in Part 2, we constructed the database and associated it with our Django web EB environment (this is the preferable approach, rather than constructing the database and associating it with our Celery environment). Ultimately, this means that the EB environment is not automatically configured to make database connections, and the login variables used to make connections to your RDS database are not automatically injected as environment variables. We need to do two things to connect our Celery EB environment to RDS:
Safely include RDS login info as environment variables to the Celery EB environment.
Add the Celery EB security group to RDS’s inbound connections rules (otherwise, outside connections will be rejected!).
Ok, so let's work on (1) first. Make sure you have the necessary RDS environment variables in hand. Recall from our settings.py that this is what Django/Celery are looking for:
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': os.environ['RDS_DB_NAME'],
        'USER': os.environ['RDS_USERNAME'],
        'PASSWORD': os.environ['RDS_PASSWORD'],
        'HOST': os.environ['RDS_HOSTNAME'],
        'PORT': os.environ['RDS_PORT'],
    }
}
A pro tip: if you forget what these are, inspect the Docker container running on your Django EB web environment using the process described at the bottom of Part 1. There are 3 ways to inject these environment variables into your Celery environment. I will rank them from approximately most to least secure:
1. Include them in a .env file that will be packaged and uploaded in the root of your Django app. Remember that this will overwrite the existing .env in the Elastic Beanstalk instance, so make sure you don't need any of those variables for your Celery app to work. Also, make sure that this file is added to your .gitignore and .dockerignore so that it is not uploaded to the internet (ESPECIALLY if your repositories are public). If you rename the .env file, make sure that change is included in the docker-compose.yml file!
2. Include them directly in the docker-compose.yml file. If you go this route, make sure you take the same security precautions as in (1).
3. Inject them directly through the environment variables panel on the EB environment dashboard. If you use this method, make sure all users with access to your EB environments at least have MFA enabled.
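For reference, the .env file in option (1) is a plain key=value file; all of the values below are placeholders:

```
RDS_DB_NAME=mydatabase
RDS_USERNAME=myuser
RDS_PASSWORD=changeme
RDS_HOSTNAME=mydb.abc123xyz.us-east-1.rds.amazonaws.com
RDS_PORT=5432
```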
For the purposes of being more explicit, I will use (2) in this example. More on this implementation later when I construct our Celery-specific docker-compose.yml. Make sure that wherever you store these variables, everything is spelled correctly (this cost me a solid hour while testing this setup!). You will see, for example, password errors in eb logs if things are not spelled correctly.
Let's consider our environment variables problem solved for now. Next, we must allow traffic from our newly-created Celery environment. Go to your RDS console, select your database, click on the link to your VPC security group, click on the "Inbound rules" tab, then "Edit inbound rules", and you should see an inbound traffic rule related to your existing Django web EB environment's security group. We need to allow traffic from our Celery EB environment's security group. Click "Add rule", set "Type" to PostgreSQL, use the TCP protocol with port 5432, use a custom source and, with the magnifying glass, find the security group associated with your Celery EB environment (it should be listed according to your environment name and easily recognizable, but you can also find the security group name from the EB environment dashboard). Save the rules, and you should be good to go. From here, if you see connection refusal issues in eb logs, you did not set up your inbound traffic rule correctly. As of writing this, I have not tested whether the RDS database needs to have the "Publicly Accessible" option on or not. If you have issues beyond what I have described, try checking that out.
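If you would rather script the inbound rule than click through the console, the same change can be made with boto3; this is a hedged sketch, and both security group IDs are placeholders you must look up yourself:

```python
def postgres_ingress_rule(source_sg_id):
    """Build an inbound rule allowing PostgreSQL (TCP 5432) from another security group."""
    return [{
        'IpProtocol': 'tcp',
        'FromPort': 5432,
        'ToPort': 5432,
        'UserIdGroupPairs': [{'GroupId': source_sg_id}],
    }]

# With boto3 installed and AWS credentials configured:
# import boto3
# ec2 = boto3.client('ec2')
# ec2.authorize_security_group_ingress(
#     GroupId='sg-RDS-PLACEHOLDER',  # your RDS database's security group
#     IpPermissions=postgres_ingress_rule('sg-CELERY-PLACEHOLDER'),  # Celery EB env's group
# )
```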
There is one more Django-specific change we must make. If we are running in production mode, Django will give Celery an error if its current environment is not in ALLOWED_HOSTS in settings.py. Because we are not running a web app from here, and the instances may each have their own IP, it is not straightforward to add your Celery EB environment to the ALLOWED_HOSTS list. An ugly work-around is to make a Celery switch in settings.py which, when set to True, will just set ALLOWED_HOSTS = ['*'], meaning that it will run everywhere. This is very insecure for web environments, but because our Celery environment is just running in a worker setup, this should be mostly fine.
Alright, so we should now have our RDS connections set up without having to directly associate our EB environment with the existing database. It is not as clean as you might like and requires some additional security considerations, but this is how we get around EB's self-imposed limitations.
Deploying the Celery App
As I mentioned above, we need to deploy the entire Django app with our Celery containers so that they can access app code and settings.py. Here, I will again make use of the multi-container options and deploy Beat and Worker to the same EB instance, which will not allow Workers to auto-scale. My tasks do not require more than one instance, so this is not a pressing concern for me, but the ease of deployment, platform management, and the option to change if my Celery needs change in the future still make the EB platform a desirable option. If you need Workers to scale, my recommendation is this: move Beat to a single EC2 instance and only deploy the Worker container to EB. The methods described in this post should be enough to make a singular Beat deployment directly to EC2 a little less confusing, if not downright simple. But, for now, I will just bundle Beat and Worker together, much like I did in Part 2.
One consequence of bundling Beat and Worker together on the EB worker platform is that I found that anything less than a t2.small instance would result in deployment failures. Inspecting my eb logs, I found that I was getting an sqsd fault error, which stemmed from a memory allocation error. For my purposes, anything less than a t2.small instance (the instance type can be changed easily through the EB environment dashboard) was not providing enough resources. This was surprising to me, since my Django + Beat + Worker multi-container environment was working on less than this! Whether I'm doing something wrong or worker environments just have more overhead, I do not yet know.
These issues aside, let's configure our Beat + Worker multi-container deployment. Our docker-compose.yml file will look like this:
version: '3.8'
services:
  celery_beat:
    build:
      context: .
      dockerfile: Dockerfile
    image: celery_beat:latest
    command: /start-celerybeat
    volumes:
      - .:/app
    env_file:
      - .env
    environment:
      - DJANGO_SETTINGS_MODULE=project.settings
      - RDS_DB_NAME=<db name>
      - RDS_USERNAME=<username>
      - RDS_PASSWORD=<password>
      - RDS_HOSTNAME=<hostname>
      - RDS_PORT=5432
    restart: on-failure # will restart if the container fails
  celery_worker:
    build:
      context: .
      dockerfile: Dockerfile
    image: celery_worker:latest
    command: /start-celeryworker
    volumes:
      - .:/app
    env_file:
      - .env
    environment:
      - DJANGO_SETTINGS_MODULE=project.settings
      - RDS_DB_NAME=<db name>
      - RDS_USERNAME=<username>
      - RDS_PASSWORD=<password>
      - RDS_HOSTNAME=<hostname>
      - RDS_PORT=5432
    depends_on:
      - celery_beat
    ports:
      - "80"
    restart: on-failure # will restart if the container fails
A few things to note. First, I included the RDS environment variables as I discussed above. Secondly, I specified ports in the Worker service. I did this because EB load balancer health checks occur at port 80 by default (more on this later); the Worker will be bound to this port, making it unavailable to the Beat container. No two containers can be bound to the same port! Do not specify the same port for any two containers or you will get a deployment error; alternatively, leave this parameter out to let ports auto-assign. Next, I'm using the same Dockerfile and directory structure that I used in Part 2, so if you're confused about what some of this means, go check that post out. I'm also still pulling the EB default .env file, even though I'm not using it for RDS. Lastly, I added a depends_on for the Celery Worker on Celery Beat. A few times, I noticed memory issues associated with what I thought was both images being built at once. However, this was also on smaller instances than t2.small, so it may not be necessary.
Now, go ahead and eb deploy to your Celery EB environment like usual. Your Celery app should be up and running and communicating with your RDS database. Scheduled tasks should be firing as well.
Another note on EB health checks. Much like my problems with the mysterious EB load balancer health checks on the Django web environment, this setup comes with its own issue. Because we're running a worker environment, I have not set Celery up to take HTTP connections on port 80 (and I'm not even sure that's possible, since we're not running web architecture). Yet EB insists on sending HTTP health checks there anyway! The result is successful deployments, but a permanent Severe state. Once again, it's going to look ugly, but it should continue to work.
At Long Last: A Scalable Solution With Django + Docker + Celery + AWS Elastic Beanstalk
Congrats, weary internet traveler…I hope you have found what you were looking for with this post: a fairly comprehensive solution to building a Django app with Celery integration that is scalable with AWS Linux 2 and Elastic Beanstalk. As always, if you have any suggestions for my procedure, if I have made a mistake, or if you have problems of your own, please leave a comment.
Addendum: Figuring out the EB Severe Health Status Issue
After working on this deployment more, I realized that every instance entered a permanent Severe status after a couple of minutes, as reported by the load balancer's health check, even though the app was online, accessible, and totally working as intended! Figuring out why took way longer than it should have. At first, I thought it had something to do with the fact that I didn't have Nginx configured as a container in my multi-container environment, so I configured Nginx to work (if this is necessary, I will do another post; I'm wondering if it will be needed for handling HTTPS requests).
Anyway, as it turns out (thanks to this post), EB performs health checks on the instance's private IP address. Moreover, the HTTP status of the health checks is not reported on the EB health check dashboard. By inspecting the EC2 dashboard on AWS, looking at the left-most pane, and finding "Target groups" under the Load Balancing submenu, it is possible to directly inspect the health check options for the load balancer. Sure enough, it was returning 400 codes! Because Django only responds to requests on addresses listed under the ALLOWED_HOSTS list in settings.py, and because the load balancer was making health requests to my Django app on a private IP that was not listed in ALLOWED_HOSTS, Django was returning 400 codes to the load balancer health check despite functioning normally. To debug this, I deployed a version of my app with ALLOWED_HOSTS = ['*'] (i.e. all hosts allowed; don't do this in production) and, sure enough, the app returned to healthy status.
So…because we don't want ALLOWED_HOSTS = ['*'] in our production app, we need to find a way to automatically and dynamically register the EB private IP address in ALLOWED_HOSTS. Directly under your ALLOWED_HOSTS list in settings.py, add the following:
ALLOWED_HOSTS = [...]

# Determine private IP address of EC2 instance (fixes EB health checks, which occur via IP)
# Note: this requires `import requests` at the top of settings.py
try:
    IMDSv2_TOKEN = requests.put('http://169.254.169.254/latest/api/token', timeout=0.01, headers={
        'X-aws-ec2-metadata-token-ttl-seconds': '3600'
    }).text
    EC2_PRIVATE_IP = requests.get('http://169.254.169.254/latest/meta-data/local-ipv4', timeout=0.01, headers={
        'X-aws-ec2-metadata-token': IMDSv2_TOKEN
    }).text
except requests.exceptions.RequestException:
    EC2_PRIVATE_IP = None

if EC2_PRIVATE_IP:
    ALLOWED_HOSTS.append(EC2_PRIVATE_IP)
This worked for me on AWS Linux 2 running a Docker Django app at the time of writing this, as a new update to AWS EB required token authentication for grabbing the private IP.
With this last, lingering problem finally solved…we FINALLY have a stable deployment. Happy coding!