Presentation: Nonconformist Resilience: DB-Backed Job Queues
What You’ll Learn
- Discover the hidden complexity implicit in common message-bus-based approaches to background work.
- Reset expectations of what your platform can bring to correctness and resilience at high velocity and team scale.
- Understand the qualities that might make a database-backed job queue right for your next app.
Abstract
Resilience in the face of chaos is a tall order. As a vertically integrated financial institution where rapidly delivered features with complete data consistency and scrupulous correctness are all non-negotiable, Betterment had its work cut out for it. So we moved the goalposts - inward. By eliminating complexity that many teams consider table stakes, we’ve built a distributed software ecosystem that empowers engineers to do their best work with a minimum of high-wire distributed systems thinking.
One of the complexity-obliterating weapons in our arsenal is our approach to background work. I’ll present how we use, deploy, and even love Delayed::Job (yes, a database-backed job queue) at Betterment for its transactional enqueue semantics, safe retry with exponential backoff, and its storage model, which lends itself to simple but powerful SLA-based monitoring and alerting. DJ enables engineers to pour their creativity into their features and get resilience by default.
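To make the retry and monitoring ideas above concrete, here is a minimal Python sketch; it is not Delayed::Job's code or API. The retry delay of `attempts ** 4 + 5` seconds follows the default described in the delayed_job README and is used here only for illustration. The helper names (`next_run_at`, `queue_age_seconds`, `breaches_sla`) and the idea of measuring an SLA as the age of the oldest due job are assumptions made for this example.

```python
from datetime import datetime, timedelta


def next_run_at(now: datetime, attempts: int) -> datetime:
    """Return when a failed job should be retried.

    The delay grows steeply with the number of attempts so far
    (attempts ** 4 + 5 seconds), so repeated failures back off quickly.
    """
    return now + timedelta(seconds=attempts ** 4 + 5)


def queue_age_seconds(run_ats, now: datetime) -> float:
    """Age in seconds of the oldest job that is already due (run_at <= now).

    Jobs scheduled for the future are ignored. Returns 0.0 if nothing is due.
    """
    due = [t for t in run_ats if t <= now]
    return max(((now - t).total_seconds() for t in due), default=0.0)


def breaches_sla(run_ats, now: datetime, sla_seconds: float) -> bool:
    """True if the oldest due job has waited longer than the SLA allows."""
    return queue_age_seconds(run_ats, now) > sla_seconds
```

Because each job row carries its own scheduled time, an alert of this kind reduces to a single query over the jobs table, which is one reading of the "simple but powerful" monitoring the abstract mentions.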
QCon: What is the focus of your work today?
John: I lead software architecture at Betterment, which means I work with people throughout Betterment’s engineering team: staying apprised of new developments and challenges across the org, cross-pollinating best practices and a shared vision for our platform, and regularly diving deep into the code alongside domain owners.
QCon: What’s the motivation for your talk?
John: A lot of companies end up selecting patterns based on industry norms, but sometimes the accepted patterns have rough edges that can permanently leak into your app layer and cause pain. There’s a strong sense in the industry currently that you should never use a database as a job queue, and should instead delegate to a product that’s called a queue. There are valid reasons to prefer a dedicated queue, but there are also reasons not to, which often get short shrift. At Betterment, we build a suite of products whose correctness and consistency people rightly care a great deal about. Folks don’t generally realize that a queue is a second datastore, and that once you coordinate across two datastores it is a hard problem to perform a transaction that also enqueues background work and then ensure that the work definitely runs if and only if the transaction commits. Many folks end up addressing the edge cases in their business logic on a per-feature basis rather than simply eliminating the problem by unfashionably using the database as a work queue.
There will definitely be pushback from some folks on the basis of scalability and throughput - and those are real concerns for some applications, but certainly not all, and in many cases, there are other levers you should be thinking about pulling to alleviate those concerns rather than switching jobs to a dedicated queue. I’ll be presenting an honest warts-and-all accounting of the tradeoffs so that ideally folks in the room can apply them to their distinctive problem spaces and come away with better outcomes, fully aware of the pros and cons of the choices they make.
QCon: How would you describe the persona of the target audience of this talk?
John: Engineers building new platforms or evaluating technology for future revs of their platforms would be the sweet spot. Background work is something that most grown-up apps need to perform, and there doesn’t seem to be much info out there about the pros and cons of different approaches.