A while ago I wrote an article entitled The Inner Loop which commented on the challenges of splitting up a codebase when the resulting codebases are evolving together. In that article I skirted around the mono vs. micro repository debate and instead focused on the kinds of development activity that can lead to pain when development loops are separated.

Since writing that article I joined the Azure SDK team, which builds the libraries developers use to access the Azure platform. My role is focused on engineering systems, and I've spent much of the last 6+ months (along with others on our team) setting up build and release pipelines to support building, testing, and shipping our libraries.

The Azure SDK is broken up across multiple repositories along language/runtime lines, so we have a .NET repository, a Java repository, a Python repository, a JavaScript repository, and so on. Each repository contains numerous libraries which ship on separate cadences but otherwise share a common engineering system.

In effect we are managing multiple mono-repos, so we get plenty of experience dealing with the challenges associated with them.

Non-monolithic monoliths

One of the key things I mentioned above is that each library ships on a separate cadence. That means the version numbers for libraries rev at different paces - and that makes sense. Let's say we made a major update to our App Configuration library and bumped its major version number; it wouldn't make sense to also bump the major version number of the Storage library. After all, it is possible that nothing in Storage changed at all.
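To make that concrete, here is a minimal sketch of per-library versioning under a simple semver scheme. The library names and version numbers are illustrative, not the real ones:

```python
from dataclasses import dataclass

@dataclass
class Library:
    name: str
    version: str  # semver: MAJOR.MINOR.PATCH

def bump_major(lib: Library) -> Library:
    """Bump only this library's major version; sibling libraries are untouched."""
    major = int(lib.version.split(".")[0])
    return Library(lib.name, f"{major + 1}.0.0")

app_config = Library("app-configuration", "1.4.2")
storage = Library("storage-blobs", "12.3.0")

app_config = bump_major(app_config)  # breaking change ships here...
print(app_config.version, storage.version)  # 2.0.0 12.3.0 (...Storage unchanged)
```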

We have a mono-repository per language, but we operate the build and release pipelines for each logical grouping of libraries as independently as possible (we could do a better job here).

This generally works well until you run into dependencies between the libraries themselves. For example, the Event Hubs library takes a dependency on the Storage library to support state synchronization in scaled-out event processing scenarios.

So if Event Hubs depends on Storage, how can we ship them independently? The answer lies in the way that we manage dependencies.

Source composition vs. binary composition

We realized pretty early on in this journey that if we always built against the latest version of an internal dependency (e.g. the Storage library that Event Hubs depends on), we could end up silently taking a dependency on unreleased features and not detect it until we shipped (leading to broken customers).

Within the team we use the language of source composition vs. binary composition to differentiate between dependencies that we take against a version of a library in source vs. a version that has been built into a package and published to a public registry (such as NuGet or NPM).
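As an illustration, here is a toy resolver contrasting the two modes. The manifest shape, package name, and registry URL are all invented for this example; real package managers (NuGet, NPM, pip) each have their own equivalents of the same distinction:

```python
from pathlib import Path

REGISTRY = "https://example-registry.invalid/packages"  # placeholder URL

def resolve(name: str, spec) -> str:
    """Return where a dependency's bits come from under each mode."""
    if isinstance(spec, dict) and "path" in spec:
        # Source composition: build against whatever code is in the repo
        # right now, including unreleased changes.
        return str(Path(spec["path"]).resolve())
    # Binary composition: fetch an explicit, immutable published version,
    # so you can only ever depend on what has actually shipped.
    return f"{REGISTRY}/{name}/{spec}"

print(resolve("storage", "12.3.0"))                # binary: registry + pinned version
print(resolve("storage", {"path": "../storage"}))  # source: local repo path
```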

Whilst binary composition does introduce some rigor to dependency management (every dependency you take must have an explicit, immutable version), it does start to create problems around inner loop productivity, such as when you are working on some of our core HTTP pipeline logic and on downstream consumers of that library at the same time. To try to make this easier we publish nightly builds from each of our build pipelines into a package registry. Those "nightly" packages can then be used as dependencies.
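The key property of a nightly package is that it carries a unique pre-release version. The "-dev." format below is an assumption for illustration; the real pipelines may stamp versions differently:

```python
from datetime import date

def nightly_version(base_version: str, build_number: int) -> str:
    """Turn 5.1.0 into a unique pre-release version like 5.1.0-dev.20200401.3."""
    stamp = date.today().strftime("%Y%m%d")
    return f"{base_version}-dev.{stamp}.{build_number}"

print(nightly_version("5.1.0", 3))  # e.g. 5.1.0-dev.20200401.3
```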

Even with the nightly package builds the friction can still be too high, and so in some scenarios a particular pull request will operate in source composition mode; but that leads to problems downstream when you go to land a pull request that spans libraries shipping from different pipelines.

Shippable unit and cadence define everything

One of the things that my recent experience has taught me is that the nature of what and when you ship has an outsize impact on the way you structure your build and release pipelines, and on how the tooling those pipelines drive must function.

If you have a mono-repository with one large component, or a set of components that all ship together, then you will generally have one set of pipelines, and tooling will tend to operate in aggregate across the codebase.

However, if you have a mono-repository with multiple independent components, then you are going to have a pipeline per component, and your tooling is best designed to scope its operation to just the components shipping through that pipeline. This applies not only to building and testing, but also to things like static analysis and report generation.
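In practice, scoping can be as simple as filtering a change set against the directories each pipeline owns. The layout and service names below are hypothetical, not our actual repo structure:

```python
from pathlib import Path

# One pipeline per logical grouping, each owning a subtree of the repo.
PIPELINE_SCOPES = {
    "storage": ["sdk/storage"],
    "eventhubs": ["sdk/eventhubs"],
}

def files_in_scope(pipeline: str, changed_files: list[str]) -> list[str]:
    """Filter a change set down to the paths this pipeline cares about."""
    roots = PIPELINE_SCOPES[pipeline]
    return [f for f in changed_files
            if any(Path(f).is_relative_to(r) for r in roots)]

changed = ["sdk/storage/blobs/client.py", "sdk/eventhubs/producer.py"]
print(files_in_scope("storage", changed))  # ['sdk/storage/blobs/client.py']
```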

Micro or mono? The pain is the same

Ultimately I've learned that the mono-repo vs. micro-repo debate is pointless. Previously I would have put myself firmly in the micro-repo camp, but I now believe the number of repos is largely irrelevant; what actually matters is how you ship your components.

If you have a mono-repo with many independently shipping components you are going to end up treating it like a set of micro-repos with the only upside being the reduced number of actual repositories - but the integration pain remains the same.

Micro-repositories allow for local variation (if that is even desirable) and can make things like servicing a bit easier to manage, but once again if the aggregate codebase is moving quickly you are going to have integration pain.

Considerations for repo structuring

I think when it comes to picking mono-repo vs. micro-repo you need to consider the following:

  1. Granularity of shippable unit
  2. Level of interdependence
  3. Stability over time
  4. Servicing & support model

If you have a huge lump of code that ships all at the same time, then a mono-repo is going to be a lot less work. If, on the other hand, you have lots of small libraries that ship independently, a mono-repo doesn't provide you with any substantial benefits (since you end up needing to do binary composition and deal with the same kind of integration pain you would have with micro-repositories).

Whether you use that as a catalyst to break out from a mono-repo probably depends on the level of interdependence. If you have a lot of independently shippable libraries but a comparatively low number of internal dependencies then you could more easily adopt a micro-repo model.

Sometimes internal dependencies start off being painful but lessen over time (for example if you are developing a core library which later becomes stable).

Finally, the way that your team wants to support the libraries matters. On the Azure SDK team we really want Java developers to be able to come to one place and log issues with the SDK without us having to bounce them around repositories. On the flip side, if you do need to create a servicing release, that can be more challenging in a mono-repo because you effectively share the engineering system across many components and have to support all the different iterations of that engineering system in order to ship hotfixes.

There isn't a simple answer here unfortunately.

One final thought

One thing that I would love to see is much better tooling for working across multiple repositories, to ease the integration challenges they create.

I imagine a system where a developer can have a logical workspace with multiple changes in flight across repos, and as those changes are made, builds occur transparently in the background, producing binary packages that can then be shared. When the work is complete, the independent changes can be landed in order and dependency lists updated to the official release versions.
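Purely as a thought experiment, such a workspace might be described by a manifest like the one below; every repo name, branch, and field here is hypothetical, since no tool I know of offers this today:

```python
# Hypothetical "logical workspace" manifest - nothing like this exists yet.
workspace = {
    "changes": [
        # Upstream change: builds in the background and publishes a dev package.
        {"repo": "repo-core", "branch": "feature/pipeline-tweak",
         "produces": {"core-lib": "1.1.0-dev.1"}},
        # Downstream change: consumes that dev package while both are in flight.
        {"repo": "repo-client", "branch": "feature/use-tweak",
         "consumes": {"core-lib": "1.1.0-dev.1"}},
    ],
}

# Landing order is a topological sort of the produces/consumes edges; once
# the upstream change ships for real, the dev version is swapped for the
# official release version in the downstream dependency list.
```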

Different developer ecosystems tackle this in different ways, and some do it quite nicely, but it would be great if this were something that collaboration platforms could provide as a first-class feature.