Getting Used To Microservices
TL;DR: My new employer has built their software as many separate services, communicating through well-defined networked interfaces. I have spent the past few years working on monoliths, so this takes some adjustment.
I have noticed four big differences so far. With the service-based design,
- The data is all over the place;
- It is difficult to diagnose functionality that spans many services;
- Refactoring system design becomes expensive; and
- There is dead code nobody knows is dead.
While these sound negative, I don’t mean to say this is a negative experience. There are many positive things I’ve noticed too, but I expected those, because proponents of microservices have already done a good job of marketing the idea to me.
The things above are those I didn’t expect. But first, let’s kill a myth, and then go into these points in a little more detail.1 Please note that these are the very early observations of a microservice beginner, and drawn mainly from a specific system that may or may not be well-architected. I’ve tried to tease out more general lessons from this specific system, but I still have a lot to learn and it’s likely that one year from now I will re-read this article and wonder what I was smoking.
Monoliths can be well designed
A monolith is not synonymous with a bowl of spaghetti. We can have separate projects, well-delineated modules, stable interfaces, etc. If we build all of that up into one huge binary deployed as a single unit, it is a monolith. There are some benefits to doing that:
- Deployment is extraordinarily simple. We just copy the binary to the server and restart the service. Done. That’s the entire product deployed with all components on their newest versions.
- Synchronising changes across the codebase becomes possible. Since everything is built together, we can make atomic changes across the entire product.
- With local rather than remote procedure calls, it’s easy to follow execution, modify behaviour for testing purposes, and so on.
Some of these are, of course, also the drawbacks of deploying software as a monolith. Maybe it’s unrealistic to deploy the entire product every time a component is updated. Maybe the organisation is large enough that synchronising changes across the entire codebase is undesirable.
Data is all over the place
One of the first things I noticed as I tried to make a change is how incredibly denormalised data is. Each domain object is owned by one service, yes, but then every other service that wants to deal with that type of object has its own internal representation of that object.
Imagine a software system that runs all the libraries2 You know those big houses that give you books for free in exchange for your promise that you’ll give the book back later. in an area. There’s this concept of an inter-library loan, where Alice who lives in Townsville wants to borrow a book on ancient auction houses, but the Townsville library does not have that book. However, the library in Citypolis has the book, so the Townsville library borrows the book from Citypolis and then Alice loans it from Townsville. Great.
Now, there’s an InterlibraryLoan service that owns the inter-library loan object, containing all the details of the transaction between the libraries. However, the software system also wants to present some of this information to Alice, so the regular Loan service needs some of the information from the inter-library loan object. Oh, and the Scheduling service also needs some of that information because it affects when people in Citypolis can loan the book again. And Insurance needs it because it affects the risk of no return. And so on, and so forth.3 Note that none of those services need to store information on inter-library loans, but when they need it, they must request it from InterlibraryLoan, and convert the result to their internal representations.
Which hints at the next layer of denormalisation: the InterlibraryLoan service must not expose the schema of its internal object to any other services.4 If it did, they would be coupled to the point where they could just use the same object anyway. So instead, there’s an InterlibraryLoanDescription object that contains most of the information from the domain object, except with a more stable api. Oh, and if we are separating queries and commands (cqrs), then in the worst case we might also want different representations of the object for those two types of interface.
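To make the proliferation concrete, here is a minimal sketch in Python of how a single inter-library loan might end up with three different shapes: the owning service’s domain object, the description object it exposes over its api, and the internal representation a consuming service converts that description into. All of the class and field names here are hypothetical, invented for illustration rather than taken from any real system.

```python
from dataclasses import dataclass
from datetime import date

# Inside the InterlibraryLoan service: the full domain object, never exposed directly.
@dataclass
class InterlibraryLoan:
    loan_id: str
    lending_library: str       # e.g. "Citypolis"
    borrowing_library: str     # e.g. "Townsville"
    book_isbn: str
    requested_on: date
    due_back_on: date
    courier_tracking_ref: str  # internal detail no other service should see

# What the InterlibraryLoan service exposes over its api: a thinner, stabler view.
@dataclass
class InterlibraryLoanDescription:
    loan_id: str
    lending_library: str
    book_isbn: str
    due_back_on: date

# Inside the Scheduling service: yet another representation, shaped for its own needs.
@dataclass
class BookAvailabilityWindow:
    book_isbn: str
    unavailable_at: str        # the library whose copy is out on loan
    available_again_on: date

def to_availability_window(desc: InterlibraryLoanDescription) -> BookAvailabilityWindow:
    """Convert the api representation into Scheduling's internal one."""
    return BookAvailabilityWindow(
        book_isbn=desc.book_isbn,
        unavailable_at=desc.lending_library,
        available_again_on=desc.due_back_on,
    )
```

Every consuming service repeats the last two steps with shapes of its own, which is where the arithmetic in the next paragraph comes from.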
Thus, backing up, if five services deal with the inter-library loan object, there will be 5×4=20 representations of the same data. If we make a change to the domain object, that change might need propagation to 19 other objects.
Or it doesn’t! Maybe only six out of those 19 other objects actually need the change to the schema, and then the representations diverge in what information they carry.5 I think that is supposed to be fine in a microservice design, but coming from a monolith world where there was generally only one representation of each thing – it hurts! It takes some getting used to. Especially when the objects are almost the same but not quite.
Making services bigger would help
Obviously, one way out of this is to make services bigger. The underlying problem is that one use case (borrowing a book) is represented through multiple services. That means those services are coupled – not in the code, but through the functionality they jointly implement.
If we make changes to the feature, that change will have a blast radius that spans multiple services. If we made services more coarse-grained, we could have one service per use case, and then changes to use cases would have their blast radius limited to that service. But maybe those services would be decoupled to a fault on a technical level instead, so let’s explore alternatives.
Domain driven design suggests a way out
On a technical level, domain driven design builds quite heavily on the idea of a bounded context, which is a vertical slice of functionality that runs completely independently from others. Bounded contexts talk to each other through well-defined interfaces, etc. It’s their idea of a module, or a service, or however we want to deploy it.
Also in ddd, bounded contexts are not supposed to share domain models with other bounded contexts6 That would work against their independence. However, if a bounded context needs to reference something from another context, it is not supposed to embed a thick denormalised version of that context’s data. Rather, it lightly references the data by id.
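As a rough sketch of what referencing by id might look like, again with hypothetical names: the Scheduling context stores only the identifier of the inter-library loan and asks the owning context for details at the moment it needs them.

```python
from dataclasses import dataclass
from datetime import date

# In the Scheduling bounded context: no embedded copy of the loan's data,
# just an identifier pointing into the InterlibraryLoan context.
@dataclass
class ScheduledReturn:
    book_isbn: str
    interlibrary_loan_id: str  # a light reference, not a thick denormalised object

def earliest_next_loan(scheduled: ScheduledReturn, loan_client) -> date:
    """Look the loan up by id only when the data is actually needed.

    `loan_client` stands in for whatever client talks to the InterlibraryLoan
    context; its `get_due_date` method is made up for this sketch.
    """
    return loan_client.get_due_date(scheduled.interlibrary_loan_id)
```

The trade-off is a lookup at read time instead of a locally stored copy, which is exactly where the aggregation worry in the next paragraph comes in.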
Maybe that’s insufficient for some contexts that truly need to aggregate data from other contexts – I haven’t thought it through yet.7 The point of this article is to capture my thoughts while they’re fresh and I’m still naïve, so I can learn from how my feelings around the subject have changed over time.
The ability to poke at the system is limited
If we have a bug that we don’t quite know how to fix, we could try to make a change, push to an experimental branch, wait for ci/cd pipelines, deploy it to a development environment, run through the test protocol, etc. Repeat until fixed. This might take a few hours in total.
Alternatively – we could inherit some utility class and hack some instrumentation into it and replace the original temporarily, make a direct call from one part of the code to another, and get into a 10-second trial-and-error loop on our local machine. We might fix it in 30 minutes.
In essence, the skilled programmer knows they’re allowed to do anything they want with the code to test their solutions as effectively as possible, and they’re really good at finding those shortcuts.8 If you want ideas, read Working Effectively With Legacy Code by Feathers. It has a whole list of seams which are places you can easily inject functionality without changing much existing code. They set up super-tight feedback loops for testing things out. That is known as job smarts in the research on expertise.9 Job smarts involves other things as well, such as combining or sequencing steps differently to a documented process to increase efficiency.
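As one concrete example of the kind of seam Feathers describes: in a monolith we can subclass a collaborator, override a method to add instrumentation, and temporarily wire the subclass in wherever the original was constructed. The classes below are invented for illustration, not taken from any real codebase.

```python
# A hypothetical utility class buried somewhere in the monolith.
class LoanFeeCalculator:
    def fee_for_late_return(self, days_late: int) -> float:
        return 5.0 + 1.5 * days_late

# An object seam: subclass and override to inject tracing,
# without changing the original code at all.
class InstrumentedLoanFeeCalculator(LoanFeeCalculator):
    def fee_for_late_return(self, days_late: int) -> float:
        fee = super().fee_for_late_return(days_late)
        print(f"fee_for_late_return({days_late}) -> {fee}")  # quick-and-dirty instrumentation
        return fee

# Swap the instrumented version in, run the failing scenario locally,
# observe what happens, then revert.
calculator = InstrumentedLoanFeeCalculator()
assert calculator.fee_for_late_return(3) == 9.5
```

The whole loop runs in seconds on one machine, which is exactly what becomes hard when the behaviour under investigation is spread across separately deployed services.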
My experience with microservices so far has been that it’s much harder to do that. I can’t just swap out a class for something else and change the behaviour of the system, because the service needs to be deployed in a network with all other services. I can’t call directly from one thing to another. By design, of course, but it makes it much harder to find shortcuts for testing.
One might counter that microservices are so small and isolated that they’re easy to test on their own, but that misses the point. The meaningful end-user functionality spans multiple services. There are messages being passed around and workflows involving several stops at different services to implement a single use case. That is the thing I want to hook into, not the isolated operation of a service.
Again, this will definitely be exacerbated by a suboptimal drawing of service boundaries, and if there’s anything my fellow software engineers and I seem to have a tough time with, it’s precisely drawing module boundaries.
Whole-system refactoring is hard, too
Both the complicated interconnectedness of services and the desire to keep their apis stable mean that system design calcifies easily. The code inside the services can be refactored very easily, but the structure between the services is rarely changed.
The consequence of this is that when functionality needs to change, developers are more likely to add new code paths, rather than change existing ones. I have seen api endpoints that are virtually clones of each other, except with small tweaks to behaviour. In a monolith, it would have been cheap to unify them into more general behaviour, but microservices seem to become somewhat more append-only.
There is dead code nobody knows is dead
When code is built as a monolith, finding dead code is easy. It’s baby mode static analysis. Many editors give some sort of visual indication these days. And we should delete it with no fear.10 The version control system knows exactly what it was, and it’s trivial to recover if we change our minds. Kevlin Henney has spoken at length about why, although I don’t have a specific reference at hand.
The main problem I see with dead code is that it gets in the way of our understanding of the system. I think of the code as the formal specification of what the system is meant to do11 I want to reference a comic here, but I can’t remember exactly which one it is. It depicts a manager-type person dreaming of a day when they can automate the job of programmers. They’re imagining handing to a robot a detailed specification of what the software should do, and then the robot goes and writes the code. The person they are talking to asks “And who do you think writes that detailed specification? They’re called programmers.” and if that contains a bunch of behaviour that never happens, it becomes difficult to reconcile what’s actually part of the intended system and what’s not.
The problem with dead code in microservices is that most of the code is behind a public api. We don’t know what can call into that api, so static analysis has to assume all code is alive.12 Fortunately, if we can detect a dead api endpoint, the handler code can be deleted and then static analysis will light up a christmas tree of all the newly-detected dead codepaths behind that endpoint. I don’t yet know how to detect dead code in microservices. There are two heuristics I’ve used:
- Grep for the url of the api endpoint in all other services’ code. This is not a perfect proxy, because some services could dynamically construct the url they call.13 Some services also seem to rewrite paths, so it becomes a transitive grep which has to be performed manually.
- Introduce a call counter into every endpoint and see how the value of that changes over time. If it stays very low, it’s unlikely the endpoint is used. But at what level should the threshold for “very low” be? I don’t know. A rough sketch of such a counter follows this list.
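As a minimal sketch of that second heuristic, here is what a per-endpoint call counter could look like. It is framework-agnostic and the endpoint names are made up; the only real idea is to start every known endpoint at zero so that never-called ones are visible too.

```python
from collections import Counter

class EndpointCallCounter:
    """Counts calls per endpoint so rarely-hit ones can be flagged as possibly dead."""

    def __init__(self, known_endpoints):
        # Start every known endpoint at zero so never-called ones show up as well.
        self.calls = Counter({endpoint: 0 for endpoint in known_endpoints})

    def record(self, method: str, path: str) -> None:
        self.calls[(method, path)] += 1

    def possibly_dead(self, threshold: int = 0):
        """Endpoints whose call count is at or below the (arbitrary) threshold."""
        return sorted(e for e, count in self.calls.items() if count <= threshold)

# Usage sketch: list the service's routes, call `record` from whatever request hook
# the web framework offers, and inspect `possibly_dead()` after a while.
counter = EndpointCallCounter({
    ("GET", "/interlibrary-loans/{id}"),
    ("POST", "/interlibrary-loans"),
    ("GET", "/interlibrary-loans/{id}/insurance"),
})
counter.record("GET", "/interlibrary-loans/{id}")
counter.record("POST", "/interlibrary-loans")
print(counter.possibly_dead())  # -> [('GET', '/interlibrary-loans/{id}/insurance')]
```

Even with a counter like this, the threshold question remains open: a count of zero over one week says little about an endpoint that is only hit during, say, an annual reporting run.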
And at this point, I stumbled over a remark from Kevlin Henney which I would have liked to leave as the last words of the article.
If you’re unsure whether or not code is dead, and that’s why you don’t want to remove it, then that uncertainty is already telling you a great deal about your architecture and the developer relationship with it.
I’m learning a lot
Leaving with that remark, though, would have ended this article on a much sadder note than I intended. I am, for the first time in a couple of years, learning a lot about software engineering again. I don’t yet understand what all of it is for, but I know that knowledge is again compounding at a high rate. It’s fun.