Building EventMQ: An Asynchronous Job Execution System
Welcome to the first in a series of blog posts from our awesome tech team. They're constantly working to take our product and customer experience to the next level. We hope you enjoy hearing about the process!
A lot happens when a person syncs their calendar to EventBoard: subscriptions are created, the calendar is synced, and push notifications are sent out – just to name a few. In order to keep up with the growth we're experiencing while keeping our API and website load times fast, we push most of the work to the background.
These jobs (subscriptions, calendar sync, push notifications) are critical. But they don't need to happen right now, while the user is waiting for a response in their browser. This is where running these jobs asynchronously comes in. We delay their execution by deferring them to a system that will ensure these jobs are eventually executed.
Testing and Researching Existing Solutions
We tested and researched other solutions in the same space, but found all of them lacking in one way or another. Systems touted as reliable turned out to be fragile and needed constant babysitting, which nobody has time for; systems touted as simple achieved that simplicity at the cost of features the business needed.
Of the two we put into production, neither lent itself to debugging when something went wrong. For us, that broke a cardinal rule: transparency is key.
Choosing a Message Queue
Tasked once more with replacing our production asynchronous execution system with something that gave us both a richer feature set and more visibility into the system and the jobs themselves, we set out to pick our final Message Queue (MQ).
The more we researched, the more we saw that MQs can be lumped into two categories:
- Monolithic and cumbersome
- Simple and bare bones
RabbitMQ, ActiveMQ and Kafka are huge. These "everything and the kitchen sink" MQs tended to have small teams of people at large companies managing them, with people posting from here to the last page of the internet asking for help getting them to cluster.
NATS, Amazon's SQS, ZeroMQ and Redis (our current "MQ") tended to only pass messages from A to B and needed supporting code to provide the features of their older cousins.
We ended up picking ZeroMQ. (If you haven't read the ZeroMQ guide, you should. It's fantastic.) Our wish list wasn't that complicated, so writing our own solution wasn't too big of an undertaking, especially with our messaging patterns wrapped up in a handy library (ZeroMQ).
EventMQ: How It Works
With that, we set out to create something that gave us reliability, a clear scaling path, and transparency. We named this solution EventMQ, a play on the product it was designed to support (EventBoard) and a comment on the programming paradigm it implements.
EventMQ provides us with a simple solution for distributing any kind of task, scheduled or not, across as many servers as needed.
Following the Unix philosophy, EventMQ is composed of three "do one thing well" devices/daemons: the router, the scheduler and the job manager. With these devices, we can take any method or function in our code and defer its execution to a worker.
Router – The router is what most messaging systems call the "broker." We call it the router as a constant reminder that it should do very little aside from routing messages. This keeps it fast and reliable.
Scheduler – The scheduler keeps a list of tasks to run at a specific time, then queues the jobs to be executed at that time.
Job Manager – The job manager is responsible for running the jobs. Currently it only supports multiple worker processes on the same server, like a classic asynchronous job execution system, but with AWS Lambda becoming more popular you can expect another worker option in the future.
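To make the "defer any function" idea concrete, here is a minimal sketch of what turning a call into a job message can look like. The names (`defer`, `send_welcome_email`) and the JSON payload are hypothetical illustrations, not EventMQ's actual API or wire format:

```python
import json

def defer(func, *args, **kwargs):
    """Serialize a function call into a REQUEST-style job message.

    Instead of calling func now, we record its import path and
    arguments so a worker can reconstruct and run the call later.
    (Hypothetical sketch; EventMQ's real message format differs.)
    """
    return json.dumps({
        "command": "REQUEST",
        "path": f"{func.__module__}.{func.__name__}",
        "args": list(args),
        "kwargs": kwargs,
    })

def send_welcome_email(user_id, template="welcome"):
    ...  # the real work, executed later on a worker

# The caller gets a message to hand to the router instead of
# blocking on the work itself.
message = defer(send_welcome_email, 42, template="welcome")
```

The key point is that the caller returns immediately; only a small serialized description of the work travels through the queue.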
Based on the EventMQ Protocol spec, a client can send a REQUEST or SCHEDULE command to a router. When the router receives one of these messages, it forwards it to either a scheduler with capacity or a job manager with capacity. If the message cannot be delivered, it is queued in memory and sent as soon as a device becomes available.
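The router's forward-or-queue behavior can be sketched in a few lines. This is an assumed simplification (in-memory, single-threaded), not EventMQ's actual implementation:

```python
from collections import deque

class Router:
    """Sketch of a broker that forwards to peers with capacity and
    queues messages in memory when no peer is available."""

    def __init__(self):
        self.available_peers = deque()  # devices that reported spare capacity
        self.waiting = deque()          # messages held in memory
        self.delivered = []             # (peer, message) pairs sent out

    def on_peer_ready(self, peer):
        # A scheduler or job manager announced it can take work.
        self.available_peers.append(peer)
        self._flush()

    def route(self, message):
        # Forward immediately if possible; otherwise hold the message
        # until a peer becomes available.
        self.waiting.append(message)
        self._flush()

    def _flush(self):
        while self.waiting and self.available_peers:
            peer = self.available_peers.popleft()
            self.delivered.append((peer, self.waiting.popleft()))
```

Keeping the router this dumb is what lets it stay fast: it never inspects job payloads, it only matches messages to capacity.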
When the scheduler receives a SCHEDULE command it saves it in the database and waits for the scheduled time to come. When it does, the scheduler acts as a client and sends a REQUEST to the router.
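That store-then-replay loop can be sketched as follows. The real scheduler persists to a database; this sketch keeps jobs in an in-memory heap purely for illustration:

```python
import heapq

class Scheduler:
    """Sketch of the scheduler: hold (run_at, job) pairs, and when a
    job's time arrives, act as a client and send a REQUEST back
    through the router (here, via a send_request callback)."""

    def __init__(self, send_request):
        self.send_request = send_request
        self.pending = []  # min-heap ordered by run_at

    def on_schedule(self, run_at, job):
        heapq.heappush(self.pending, (run_at, job))

    def tick(self, now):
        # Re-emit every job whose scheduled time has passed.
        while self.pending and self.pending[0][0] <= now:
            _, job = heapq.heappop(self.pending)
            self.send_request(job)
```

Because the scheduler re-enters the system as an ordinary client, scheduled jobs and immediate jobs follow the exact same delivery path once their time comes.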
When the job manager receives a REQUEST, it strips the metadata, reassembles the command, and passes it to a worker, which executes the code.
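A rough sketch of that worker-side step, assuming a JSON payload carrying an import path and arguments (again a hypothetical format, and executing inline rather than in a separate worker process):

```python
import importlib
import json

def execute_request(raw_message):
    """Job-manager-side sketch: strip the envelope, resolve the
    function from its import path, and run it with the recorded
    arguments. (Assumed message format, not EventMQ's spec.)"""
    payload = json.loads(raw_message)
    module_path, _, func_name = payload["path"].rpartition(".")
    func = getattr(importlib.import_module(module_path), func_name)
    return func(*payload.get("args", []), **payload.get("kwargs", {}))
```

In practice the manager hands this reassembled call to one of its worker processes instead of running it in its own process.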
When important messages pass through the chain, they are tracked by their respective devices, and the receiver replies with an ACK command to acknowledge receipt.
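The tracking side of that handshake amounts to an outstanding-message table: anything not yet acknowledged stays eligible for retransmission. A minimal sketch of the assumed behavior:

```python
class AckTracker:
    """Sketch of ACK-based tracking: important messages are held
    until the receiver acknowledges them, so unacknowledged ones
    can be resent. (Assumed behavior, not EventMQ's internals.)"""

    def __init__(self):
        self.outstanding = {}  # msg_id -> message

    def sent(self, msg_id, message):
        self.outstanding[msg_id] = message

    def on_ack(self, msg_id):
        self.outstanding.pop(msg_id, None)

    def needs_resend(self):
        return list(self.outstanding.values())
```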
In addition to keeping our load times fast, EventMQ has also provided insight into trends that were previously harder to pinpoint.
By using Grafana and Graphite together, we've been able to watch real-time trends displayed on a screen in our work area.
Once that data was on-screen, it helped make several things more visible. For example, we saw huge spikes in the number of broker messages at 10 and 40 minutes after every hour.
That data aligned with meeting behavior that EventBoard has highlighted at many companies: a high percentage of scheduled meetings are cancelled by our check-in feature, simply because no one shows up.
We have released an early preview of EventMQ on GitHub, licensed under the LGPL v2. Feel free to check it out, use it in your own projects, or give us feedback.