Countering microservice disasters
Last night, while reading “Disasters I’ve seen in a microservices world”, I realized it was a great opportunity to put many of the engineering decisions we’ve made at Productboard into context for the team. Many of them are direct counters to the pitfalls João Alves writes about. I fully agree with all the concerns outlined in the article, and I believe avoiding them can’t be left to chance.
At Productboard, we’re adopting a service-oriented architecture for two reasons: to enable faster decision-making in empowered teams, and to let us staff faster by adding two more stacks to the mix we develop in.
Our Core Engineering Principles
At the heart of our Engineering Strategy is a rather simple document. A manifesto, if you like. We call it the Core Engineering Principles. In short, opinionated statements, it establishes a common approach shared by the whole team. There is a lot of room for teams to make decisions within their apps, but the Principles define the basic boundaries.
To name a few that are relevant in the context of the article mentioned above:
- A service in production is clearly owned by a single product tribe, ideally a product team.
- Teams are responsible for the smooth operation of their applications in the production environment.
- Sharing persistence is not allowed.
- Copy-pasting of code and data is encouraged; not having to align the whole company on every change offsets the duplication it creates.
- Services should communicate asynchronously over the shared message bus, unless that’s impractical for good reasons (see the sketch below).
- We focus on three backend languages, enough to keep the hiring pool large without fragmenting across too many stacks: Node.js with TypeScript, Ruby, and Kotlin.
- Technical decisions made within applications in product teams are up to the product teams themselves: library choices, linting rules, intra-app architectural patterns.
There are about 30 more, touching on stacks, architecture, and org structure. They might not be a great fit for every kind of engineering team, but for a team doubling every year, they provide a flexible yet safe blueprint.
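To make the asynchronous-communication principle above more concrete, here is a minimal TypeScript sketch of a service publishing a domain event to the shared message bus. The event shape, topic name, and EventBus interface are illustrative assumptions, not our actual contracts.

```typescript
// Hypothetical event shape and bus client, for illustration only.
import { randomUUID } from "crypto";

interface NoteCreatedEvent {
  eventId: string;    // unique per event, lets consumers deduplicate
  occurredAt: string; // ISO-8601 timestamp
  noteId: string;
  text: string;
}

// Thin abstraction over whatever broker client the shared message bus uses.
interface EventBus {
  publish(topic: string, payload: unknown): Promise<void>;
}

async function onNoteCreated(bus: EventBus, noteId: string, text: string): Promise<void> {
  const event: NoteCreatedEvent = {
    eventId: randomUUID(),
    occurredAt: new Date().toISOString(),
    noteId,
    text,
  };

  // Downstream services react on their own schedule; there is no synchronous
  // call from this service to any other.
  await bus.publish("feedback.note.created", event);
}
```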
How our Principles counter the Disasters
Let’s walk through the disasters one by one and look at how the principles help us avoid the traps and pitfalls.
Disaster #1: too small services
First of all, we aim to make sure each new service is clearly owned by one team. Even if there’s a burst of new services, there is a clear group of people accountable for their maintenance and stability.
We encourage keeping the number of services within a tribe in the single digits; one or two is actually a reasonable number. Each service should be able to provide value on its own.
Disaster #2: development environments
One of the main reasons the Principles a) encourage copy-pasting data, b) forbid shared persistence, and c) push a strong bias towards asynchronous communication is that services built this way can typically be spun up on their own.
If we’re building a new service and need to run the whole stack to develop against it, it’s a huge architectural smell.
We still have to operate a docker-compose setup provisioning the original monolithic RoR app, but we realize that maintaining it indefinitely is a futile effort.
Disaster #3: end-to-end tests
The message bus defines our contracts, and we maintain those contracts in a separate git repo. Combined with the fact that services should provide value running on their own and keep a copy of the data they need, we can successfully bank on testing services as units and on contract testing.
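As a flavor of what that looks like in practice, here is a hedged sketch of a contract test in TypeScript. It assumes the contracts repo ships JSON Schemas and that Ajv is used as the validator; the file path and event payload are illustrative, not our real contracts.

```typescript
// Illustrative contract test: the event a service publishes must match the
// schema checked into the shared contracts repository (path is hypothetical).
import Ajv from "ajv";
import { readFileSync } from "fs";

const ajv = new Ajv({ allErrors: true });
const schema = JSON.parse(
  readFileSync("contracts/feedback.note.created.schema.json", "utf-8")
);
const validateNoteCreated = ajv.compile(schema);

// Build the event exactly the way the production code would.
const event = {
  eventId: "7b0c1a9e-4f2d-4c3b-9a6e-2d8f5b1c0e4a",
  occurredAt: new Date().toISOString(),
  noteId: "note-42",
  text: "Customers keep asking for a dark mode",
};

if (!validateNoteCreated(event)) {
  throw new Error(
    `Contract violation: ${JSON.stringify(validateNoteCreated.errors)}`
  );
}
console.log("feedback.note.created payload matches the shared contract");
```

In a real suite there would be one such check per event type a service publishes, failing CI whenever a payload drifts from the shared contract.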
Services typically enrich data owned by other services (our machine learning model suggests features related to customer feedback, and integration services bring in entities from third parties). We avoid anything that resembles a transaction spanning multiple services, which would require an E2E test.
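To make the enrichment pattern concrete, here is a rough sketch of how such a flow might look, using the same kind of hypothetical bus interface as above; the topic names and the suggestFeatures call are made up for illustration.

```typescript
// Illustrative enrichment flow: consume another service's event, compute
// something extra, and publish the result as a new event. Nothing here waits
// on another service synchronously.
import { randomUUID } from "crypto";

interface EventBus {
  publish(topic: string, payload: unknown): Promise<void>;
  subscribe(topic: string, handler: (payload: unknown) => Promise<void>): void;
}

// Hypothetical ML call mapping feedback text to likely related features.
declare function suggestFeatures(text: string): Promise<string[]>;

function startSuggestionConsumer(bus: EventBus): void {
  bus.subscribe("feedback.note.created", async (payload) => {
    const note = payload as { noteId: string; text: string };
    const featureIds = await suggestFeatures(note.text);

    // Publish the enrichment as a new fact; the team owning notes decides
    // how to use it. We never reach into their database.
    await bus.publish("ml.features.suggested", {
      eventId: randomUUID(),
      noteId: note.noteId,
      featureIds,
    });
  });
}
```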
Disaster #4: huge, shared database
This is outright banned by one of our principles: persistence shall not be shared. We still have quite some room to scale our historically largest database vertically, but major new functionality is built horizontally, in services with their own persistence.
Disaster #5: API gateways
The article points out that the risk is a gateway with business logic embedded inside it (a Spring Boot app is mentioned).
We do operate a Kong API Gateway for exactly the reasons mentioned: to simplify and ensure consistent authentication, rate limiting, and logging.
However, our Kong setup is configured declaratively through a git repo, which makes it hard to sneak in any rogue business logic.
Disaster #6: Timeouts, retries, and resilience
“I’ve seen teams using circuit breakers and then increase the timeouts of an HTTP call to a service downstream”
Going back to our bias towards async communication and the absence of direct dependencies between services, this problem goes away completely.
There’s a whole article to be written about the perils of asynchronous communication, and we still need to make sure we’re resilient to cloud network jitter and failovers of Multi-AZ services. That’s where another component of our Engineering Strategy kicks in: our Production Ready Checklist. Maybe more on that next time.
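To give a flavor of what consumer-side resilience means in an async world, here is a rough sketch of an idempotent handler with bounded retries. The in-memory dedup set and the backoff numbers are placeholders for illustration, not what the checklist prescribes.

```typescript
// Illustrative consumer-side resilience: skip events we've already processed
// (the bus may deliver at least once) and retry transient failures with
// exponential backoff before letting the event be dead-lettered.
const processedEventIds = new Set<string>(); // in practice: a persistent store

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function handleWithRetries(
  eventId: string,
  handler: () => Promise<void>,
  maxAttempts = 5
): Promise<void> {
  if (processedEventIds.has(eventId)) return; // idempotency

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await handler();
      processedEventIds.add(eventId);
      return;
    } catch (error) {
      if (attempt === maxAttempts) throw error; // let the bus dead-letter it
      await sleep(2 ** attempt * 100); // 200 ms, 400 ms, 800 ms, ...
    }
  }
}
```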
Interested in joining our growing team? Well, we’re hiring across the board! Check out our careers page for the latest vacancies.