Welcome to the first post of our new blog series, where we’ll deep dive into how our tech team members plan, experiment, and resolve issues that come up as we grow and scale in the world of conversational commerce.
With each post, we hope you learn something new!
To kick off the series, I want to start with part one of how our team of engineers was able to scale our servers to handle 10x the load.
We were growing extremely frustrated with our server requests continuously timing out and servers going down and messaging piling up in rabbitmq.
To recover from these kinds of failures, we always had to manually intervene and restart servers and workers- sometimes even delete all the messages in the queue because workers just couldn’t process them.
So to prepare ourselves and our systems to handle more clients, we knew it was high time that we took on a project to scale our systems.
In this post, we’ll show you how we optimized our processes and servers to handle a 10x scale, what we did wrong while approaching the problem and how we found a way that works for us.
Let’s take a look at the earlier framework.
How Python Works on the Web
The Python app serves the web using something called a web server.
A web server cannot just integrate with your app automatically. To do so, there is a standard process built for all python web apps where you need to expose a WSGI application from your project.
This WSGI app can be given to any server built independently of our python project. Thus we can use any WSGI server out there to run our WSGI app.
Scaling with Gunicorn – Why it did not work
This is the configuration we were using to run our django app with gunicorn:
This config launches one gunicorn sync worker, and it processes each request synchronously, that is, one at a time. So if you make two requests at the same time, one request will have to wait until the other is finished.
Not a great experience for the user.
We then launched 30 Kubernetes pods (increased to 60 with autoscaling), where each of them had this configuration for the Django app. So then, with 30 gunicorn sync workers running in parallel to process the incoming requests, we could handle 30 requests in parallel.
But, if each pod can handle all the requests synchronously, then if one request takes a lot of time to execute (maybe because a query or another API call is taking too long or the connection to rabbitmq is blocked) then all other requests coming to that pod will have to wait until that request is finished executing.
Imagine this happening across the pods!
Now, this configuration also had these additional problems:
- At peak-loads server would just die (DOS)
- Latency was skyrocketing.
- As the concurrent connections increase on the server, latency starts reaching the stratosphere and plenty of requests just fail.
It became clear that this was not working. Here’s what we tried next.
Introducing Gunicorn Async Workers with Gevent and Monkey Patching
The above configuration launches 2 async workers with 500 greenlets.
Now each pod can handle 500*2=1000 connections concurrently using something called monkey patching. However, Django responded back negatively about monkey patching, as it was leaking database connections.
Launching 30 pods using this configuration to handle 30,000 concurrent requests sounds easy, but in reality, it was not.
Because of several requests that were being handled concurrently, we also needed that many database connections.
However, our DB ran out of connections because of the capping limit, and now every request was failing.
We also saw errors like:
These errors mean the database has no more connections left to give to you.
How did we fix that?
Limiting Connections Using a Pooler
Next, we introduced pgbouncer (Postgres connection pooler) for global pooling because Django does not support pooling natively.
(FYI- Pooling reduces PostgreSQL resource consumption and supports online restart/upgrade without dropping client connections.)
We then stress-tested our servers again. Here’s what we noticed:
- The connections problem was solved with no error
- RPS would never go above 125 RPS
- With only 100 concurrent connections, latency was so high that the 95th percentile of it would go above 50000 ms.
This latency is not ideal. Connection pooling is supposed to remove the hassle of establishing a new connection for each request and in turn reduce latency; that was not the case here.
We also ran into other problems while using gunicorn async workers. The main issue was the Connection leak.
Database connections would never close at random times, and this was because gunicorn async works use greenlets, and if you spawn another greenlet manually in your code or you called celery task, it would cause problems.
One of the problems it caused was that it would never call Django signal request_finished and so the database connection would never close.
Take a look here to learn how it works.
After this experiment, where we tried every permutation of configuration possible, we realized that scaling gunicorn for our use case just wasn’t working.
It was time to move on.
Hunt for a New Server
After running these experiments, we knew that the new server that we were looking for should:
- Handle around 200 concurrent connections
- Handle more than 1000 RPS without breaking a sweat. (which is 10x the current load)
- Not have monkey patching to avoid running into any possible problem with celery
We tried two servers- Bjoern and CherryPy.
Bjoern is a very lightweight single-threaded WSGI server that works with a libev event loop.
So you can scale it like you would scale your nodejs server, launching multiple processes to utilize all CPU resources.
Bjoern is great on paper and awesome based on some of the reviews we read. But when we stress-tested our app by launching it with Bjoern, the results weren’t great because latency had a big hit.
We weren’t convinced looking at the numbers and decided it was too slow for us.
CherryPy is a multithreaded thread-pooled server written purely in python. The creators claim that it is one of the high performant python servers out there.
Despite being written in pure python, it handles a large number of concurrent connections with high RPS and low latency.
We wrote a small snippet to launch our Django WSGI app with cherrypy and started stress testing:
Launch server script
We ran the server with 100 threads in the thread pool and increased it to 150, if needed, with 1000 rqs (request queue size) and launched 10 pods. We also removed pgbouncer for the Django server.
Here are the stress test results:
- It was easily maintaining 200 concurrent connections
- Was able to achieve 1200+ RPS easily
- 95th percentile latency reduced to less than 600ms
- The median latency is less than 200ms
Finally! Everything is as we wanted.
What has Celery got to do with it
The batteries included distributed task queue built for Django for background processing.
These batteries started bothering us as we started scaling. We started by moving from Redis to RabbitMq as a broker.
This is the configuration we had for launching each celery worker.
Yeah, gevent again and app.gcelery is for monkey patching. The code for this is below.
It worked smoothly until we started receiving more than usual traffic. All the celery worker pods would just die after raising this error!
If we peruse through the traceback, we will see this:
With this configuration (+ prefetch_count=100) each worker could prefetch 500*100=50000 tasks.
This means all the tasks would just go to one or two workers at most, and then those workers wouldn’t just be able to process all those tasks, they would die and re-delivery would happen.
The same scenario repeated for every worker until every worker was dead, and tasks started to pile up in the Queue, which caused rabbitmq connections to get blocked. This would lead to the Django server getting stuck often.
So the goal became to evenly distribute tasks b/w celery pods so that only one worker would not prefetch all the tasks.
To tackle this, we reduced the prefetch count to 3 and concurrency to 300, so each worker in a pod could now prefetch 900 tasks.
So in total, if we launched 5 pods, we could prefetch 900*5= 4500 tasks from rabbitmq.
But, while even distribution worked as expected, the celery pods were still dying with the same error- BrokenPipe!
To solve this error, we ditched gevent for celery and moved to native threads.
This is the configuration we used:
Each celery worker then launched with a thread pool of 100 to process all the tasks, with a prefetch count of 4, and as a result each worker could prefetch 100*4=400 tasks from rabbitmq.
We launched 10 pods for celery workers. So prefetch then became 10*400=4000 tasks.
With these many tasks in one queue and all of them making database queries, we also needed to be careful that we don’t run out of database connections, which is easily possible with the default way of obtaining a DB connection in django + celery.
To solve this problem, we introduced pgbouncer for celery workers only, because all these tasks were taking connections but not using them all the time, and the connections were just sitting idle most of the time.
So now Django would run with cherryPy and celery with native threads pool + pgbouncer (Final configuration)
It was time to stress test, and the results were:
- Same as above for Django
- No broken pipe errors
- More than 600 tasks were processed in a second
- No connections were blocked in rabbitmq
Our goal was to optimize our system and processes to manage a 10x scale. Trying these experiments and constantly iterating got us there!
Currently, we receive a max of 60 requests per second, while our server can handle 1200. We also currently receive 20 messages per second in a single queue while it has a capacity to process 600.
What about 100x? (Future Plans)
While we have optimized the system, very soon, we’ll have to build on it and work on scaling it again.
We’ll need to answer two key things-
- What do we need to do to scale to 6000 RPS and 2000 MPS?
- What is stopping the current configuration from handling that amount of load?
The only thing we know of as of now is that the service endpoint name resolution failures across Kubernetes services under a large amount of load.
We still are looking into what could cause this. One of the reasons could be because “AWS EC2 instances have a hard limit on the number of packets being transferred in a second which is 1200” but we are not sure.
That’s the next step – figuring it out and starting such experiments again. Stay tuned for how and what we learn in Part 2 of this coming soon!