Problems with lock-step deployment

Overview

No matter what software you create, in the end you must deploy it somehow for it to reach the end customer. When you start building microservices, or any other kind of service for that matter, you'll see that as soon as they start to pile up it's crucial to deploy them independently. It's so much simpler to reason about your system architecture if you let each service have its own responsibility and be deployed alone. I'm sure there are more pros and cons, but this post is based on my experiences so far. My intention is not to be exhaustive and mention everything that's bad with a lock-step deployment, but to mention the problems I've had experience with. But before we dig deeper into details, let's take a step back and start with what lock-step deployment even means.

What's a lock-step deploy?

A lock-step deploy is when you have two or more services that are deployed at the same time, in the same CI/CD setup. It could be a web app and a backend API. It could be two backend APIs. It could be a backend API and a background job that does some processing and talks to the API. The exact shape of the services is less important. What they do is less important too. The important thing (the problem) is that they are deployed at the same time. They are kinda wired together, as the below picture illustrates. It's hard to update one without affecting the others.

Photo by Bill Oxford / Unsplash

So now that we know what a lock-step deploy is, let's consider why it's bad.

Problems with lock-step deploy

Deployment failure

Let's take a scenario where you're deploying two services but one of them fails. Some transient network error caused one of the deploys to fail, so now only one service is up and running with the new version. Is this bad? Well, it could be. If the new version changed the contract between them in a breaking way, then sure it's bad: the updated service and the old one no longer understand each other. If the contract change was not breaking, it might still work. Either way, it's not a very robust way of deploying.

Cycle time

If you make a very small change to one of the services, you'll have to wait for all of the services to be rebuilt, tested and deployed in order to get just your little fix out in production. This makes the cycle time from code change to production longer than it needs to be.
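
To put some numbers on it, here's a minimal sketch in Python. The service names and the durations are made up for illustration, and I'm assuming the lock-step pipeline runs the services one after another. If yours runs them in parallel, the lock-step time is closer to the slowest service than to the sum.

```python
# Hypothetical pipeline durations in minutes (build + test + deploy) per service.
PIPELINE_MINUTES = {
    "web-app": 12,
    "backend-api": 18,
    "background-job": 9,
}


def lock_step_cycle_time(durations: dict[str, int]) -> int:
    """Every service goes through its pipeline, one after another."""
    return sum(durations.values())


def independent_cycle_time(durations: dict[str, int], changed: str) -> int:
    """Only the service that actually changed goes through its pipeline."""
    return durations[changed]


print(lock_step_cycle_time(PIPELINE_MINUTES))                      # 39
print(independent_cycle_time(PIPELINE_MINUTES, "background-job"))  # 9
```

With these made-up numbers, a one-line fix in the background job waits 39 minutes in lock-step, compared to 9 minutes when it's deployed on its own.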

Why do you even have lock-step deploy?

Knowledge and/or laziness

Things happen for a reason, and in my experience people weren't even aware that lock-step was an issue. To be honest - I didn't know that myself a few years ago. But if we're gonna find the reason why we have lock-step deploys in the first place, it's worth taking some time to see what people usually think about when building microservices. I think it comes down to one of two things: either you 1) modeled your services wrong or 2) deployed them at once since it was easy to set up. Today I find it very hard to find any compelling reason to deploy two services tightly together.

Functionality scattered across services

Let's say you're building a blog platform, and with your current architecture you must update three services to allow a customer to start uploading pictures. This is a case where DDD people say your services talk across bounded contexts. The services are not limited enough in their responsibility. So in this case we have a lock-step deploy because we have sliced our services along the wrong responsibilities. Adding new features to this blog platform can be tedious and take a lot of time, even for small changes.

Simple deployments become monolithic over time

Consider when we have ten services but only two of them must be deployed at the same time (lock-step). All is fine for now. Day-to-day features are developed and most things work very well. The team is aligned and you're continuously improving the code. The problem I see with this is that over time, with new people coming onboard, there's a blurry line between the lock-step being eliminated and 2 lock-step services becoming 3 or 4. All of a sudden you're essentially deploying a distributed monolith. I do agree that on a high level it might sound nice to press one button to deploy everything; who doesn't love that idea? But remember what we just talked about with deployment failure. How fun is it when one button triggers deployment of 7 services and 3 of them fail?
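
A rough back-of-the-envelope sketch of why the one-button deploy gets riskier as it grows. I'm assuming here that each deploy fails independently with the same probability, and the 2% failure rate is a number I made up, not a measurement.

```python
def at_least_one_failure(services: int, failure_rate: float) -> float:
    """Probability that at least one of `services` deploys fails.

    Assumes every deploy fails independently with the same `failure_rate`.
    """
    return 1 - (1 - failure_rate) ** services


# Hypothetical 2% failure rate per deploy.
print(round(at_least_one_failure(1, 0.02), 3))  # 0.02
print(round(at_least_one_failure(7, 0.02), 3))  # 0.132
```

So under these assumptions, a single deploy goes wrong 2% of the time, while the 7-service button leaves you with a failed or half-finished release roughly 13% of the time.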

A better approach

Single Responsibility

It's usually said that a microservice should have responsibility for a small set of features, corresponding to a subset of the functionality of the entire business domain. We want this microservice to live its own life. Usually one microservice talks to another, and they have a contract between them. This contract could be JSON over HTTP, gRPC, asynchronous messages through Kafka or Azure Service Bus etc. Which communication protocol you use is less important for lock-step itself. But the thing is that it's much easier to reason about the microservices if they have their own lifecycle, because let's be honest: sometimes you realize that you must rebuild something, or that you have built the wrong thing. Or you have built the right thing, but there's a bug. In all of these cases you'd ideally want that part of the system to be isolated, so you can do the fix and then redeploy it with zero impact on the rest of the system. That's why it must have its single responsibility. It's important that you can make a code change in the service, and let it be built and deployed independently.

Avoid breaking change

Let's say you have two backend APIs talking with each other. One reason why you might have a lock-step deployment is that there are breaking changes between them. But the problem here is not the lock-step itself - it's that you're allowing breaking changes to take place at all. Instead, try to add new functionality in a backward compatible manner, and things will get easier. Add the new properties first, then remove the old ones later. More code, you say? Sure. But then the other APIs can choose when they do their deploy, and pick the correct version. You clean up the code for the old version when everyone has migrated. So then this is not an issue anymore. I've been part of a team that had this exact problem. We had to deploy two services at the same time, practically speaking pressing two buttons in Octopus Deploy at the same time. It's guaranteed that there will be some amount of time where the two APIs are out of sync, and some requests will fail. Maybe this is okay, depending on the purpose of the system and your user base.
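
Here's a minimal sketch of the "add first, remove later" idea, assuming a JSON contract. The field names ("author" being replaced by "author_name") and the function names are hypothetical, picked only to illustrate the pattern.

```python
import json


def render_post(title: str, author_name: str) -> str:
    """Producer during the migration: emit the new field and keep the old one."""
    payload = {
        "title": title,
        "author": author_name,       # old field, removed once every consumer has migrated
        "author_name": author_name,  # new field, added first
    }
    return json.dumps(payload)


def read_author(raw: str) -> str:
    """Consumer: prefer the new field, fall back to the old one."""
    data = json.loads(raw)
    if "author_name" in data:
        return data["author_name"]
    return data["author"]


# Works against a producer that has not been upgraded yet...
print(read_author('{"title": "Hello", "author": "Ada"}'))  # Ada
# ...and against one that has.
print(read_author(render_post("Hello", "Ada")))            # Ada
```

Since both sides understand both shapes for a while, it doesn't matter which of the two services gets deployed first, and neither deploy has to wait for the other.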

CI/CD

This aligns well with the Single Responsibility I just mentioned. Don't build everything together, and then deploy everything together. Prefer to have each service isolated, with its own CI/CD setup. Allow each service to go from a single code change to production without any involvement whatsoever of the rest of your services. This way each service will be deployed independently, and you've essentially eliminated the problems with lock-step deploy.
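
If your services happen to live in the same repository, one way to get there is to let the pipeline figure out which services a change actually touches. Below is a minimal sketch, assuming a hypothetical layout where each service has its own folder under "services/". If every service already has its own repository, you get this for free and don't need it.

```python
from pathlib import PurePosixPath


def services_to_deploy(changed_files: list[str], root: str = "services") -> set[str]:
    """Return the services whose own folder contains a changed file.

    Assumes the hypothetical layout <root>/<service-name>/<files...>.
    """
    affected = set()
    for path in changed_files:
        parts = PurePosixPath(path).parts
        # Only count files that live inside a service folder.
        if len(parts) >= 3 and parts[0] == root:
            affected.add(parts[1])
    return affected


changed = ["services/backend-api/src/app.py", "README.md"]
print(services_to_deploy(changed))  # {'backend-api'}
```

The list of changed files would typically come from your version control system, and each service returned would then trigger only its own pipeline.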

Summary

Lock-step deployment is a nasty thing. You'll almost always want to avoid it.