DevOps Lessons from Lean: Small Batches Improve Flow
Update: This and other related topics will be in the upcoming DevOps Cookbook.
DevOps problems are fundamentally flow problems. Work doesn’t flow properly from one end of the lifecycle (Dev) to the other end of the lifecycle (Ops).
While spirited discussions on tools are a regular occurrence in DevOps circles, there are other simple, yet profound, techniques that have nothing to do with technology but have proven to have a huge impact on improving flow.
Top of that list? Work in small batches.
It seems so simple that it couldn’t possibly make that big of a difference, but it does. And there is historical precedent for it as well. The principle of working in small batches has proved its merit in Agile software development and, on an even larger stage, during the manufacturing revolutions of the 1970s and 1980s.
The reasons why working in small batches has such a strong net positive impact on flow might seem a bit counterintuitive at first. Rather than relying on “because I told you so”, below are the best explanations I could find as to why this works.
What is a “batch size”?
A batch is the unit of work that passes from one stage to the next stage in a process. The batch size is the scale of that work product.
What are the benefits of reducing batch sizes?
Reduces cycle time and gets you quicker feedback – With a small batch size, each batch makes it through the full lifecycle quicker. Since work on a feature isn’t complete until it is successfully running in production and getting feedback from users, large batch sizes simply delay that feedback. This means the larger the batch the longer you wait to find out if you did it right. It’s easier to make business and technical decisions and easier to recover from a mistake if you are working on shorter time horizons.
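To put rough numbers on the feedback delay, here is a minimal sketch in Python. It rests on simplifying assumptions that are mine, not from any measured process: every item takes one unit of time to finish, a batch is released the moment its last item is done, and there is no per-release overhead. The function name average_feedback_wait is made up for illustration.

```python
def average_feedback_wait(num_items: int, batch_size: int) -> float:
    """Average time a finished item sits waiting before its batch is released.

    Assumptions (illustrative only): each item takes 1 time unit, and a batch
    is released as soon as its last item is finished.
    """
    total_wait = 0
    for start in range(0, num_items, batch_size):
        # The batch covering items start+1..end is released at time `end`.
        end = min(start + batch_size, num_items)
        for finished_at in range(start + 1, end + 1):
            total_wait += end - finished_at
    return total_wait / num_items


if __name__ == "__main__":
    for size in (1, 4, 12):
        wait = average_feedback_wait(12, size)
        print(f"batch size {size:>2}: average wait before release = {wait}")
```

Under these assumptions the average wait grows roughly in proportion to the batch size, which is the "larger batch, longer wait" effect described above. Real release overhead would shift the numbers but not the direction.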
Reduces risk of an error or outage – With a small batch size, you are reducing the amount of complexity that has to be dealt with at any one time by the people working on the batch. The reduction in complexity comes not only from the number and size of the moving parts that are touched while working on the batch, but also in the amount of person-to-person communication that needs to happen (due to smaller teams). This is just acknowledging the natural limitations of human beings. The more complexity people have to deal with, the more mistakes there will be. Smaller batch size also leads to quicker feedback, so if there is an error in the batch it will be caught sooner. A small batch size lends itself well to quicker problem detection and resolution (the field of focus in addressing the problem can be contained to the footprint of that small batch and the work that is still fresh in everyone’s mind).
Reduces product risk – This builds on the idea of faster feedback. The sooner you can put an individual feature in front of your target audience, the sooner you will know if you’ve achieved the right product and market fit. The larger the batch size, the greater the product risk when you finally release that batch. Statistics shows us that it’s beneficial to decompose a large risk into a series of small risks. For example, bet all of your money on a single coin flip and you have a 50% chance of losing all of your money. Break that bet into 4 smaller, equal bets and it would take 4 losing bets in a row to result in financial ruin (1 in 16 or 6.25% chance of losing all of your money).
Large batch sizes also often lead to compounding schedule delays and cost overruns. The larger the batch, the more likely it is that a mistake was made in estimating or during the work itself. The chance and potential impact of these mistakes compound as the batch size grows, increasing the delay in getting that all-important feedback from the users and increasing your product risk.
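The coin-flip arithmetic above can be checked with a few lines of Python. This is only a sketch of that one example: it assumes fair, independent flips and equal stakes, and the function name ruin_probability is made up for illustration.

```python
from fractions import Fraction


def ruin_probability(num_bets: int, p_lose: Fraction = Fraction(1, 2)) -> Fraction:
    """Chance of losing every one of `num_bets` independent, equal-stake bets."""
    return p_lose ** num_bets


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        p = ruin_probability(n)
        print(f"{n} bet(s): chance of losing everything = {p} ({float(p):.2%})")
```

Splitting the bet does not change the expected outcome of a fair coin flip. What shrinks is the chance of the worst case, which is the point of decomposing one large risk into several small ones.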
Improves efficiency and lowers overhead – Conventional wisdom holds that large batches allow greater productivity (i.e. you get more done with large uninterrupted periods of work) and lower overhead (fewer batches = lower transaction costs). As has been proven in the manufacturing world (Lean) and now software development (Agile), this simply isn’t the case. The larger the scope of the batch, the more complexity the individual has to deal with. The complexity of a debug task grows as 2ⁿ when n things are changed in one batch. In knowledge work, a longer uninterrupted period of work leads to greater change complexity, a greater volume of debug work, and more handoff complexity. That is all added overhead. But even assuming the individual was still being more efficient by working in a large batch, you would still be creating greater inefficiency for the end-to-end process.
For a large batch of changes, especially those made to an even larger system, the handoff to the next step in the process is going to be highly inefficient for the receiving party to deal with (think: Development to Operations “toss it over the wall” handoff of a major release). And if something goes wrong, the time between when the error was introduced and when it will be discovered is so long that it is no longer fresh in the mind of the person who introduced the error. Small batches also have been proven to actually reduce transaction costs because of a curious fact of human nature… people get better at, and find ways to keep improving, the things they are forced to do more often.
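One way to read the 2ⁿ claim above: when a batch of n changes fails, any non-empty combination of those changes could be responsible, and there are 2ⁿ − 1 such combinations. The Python sketch below counts them. That reading, the helper names, and the sample change names are my own assumptions for illustration, not a formal model of debugging effort.

```python
from itertools import combinations


def suspect_sets(changes):
    """Every non-empty combination of changes that could be behind a failure."""
    return [
        combo
        for size in range(1, len(changes) + 1)
        for combo in combinations(changes, size)
    ]


def total_suspects(num_changes: int, batch_size: int) -> int:
    """Suspect combinations across all batches when changes ship `batch_size` at a time."""
    full_batches, remainder = divmod(num_changes, batch_size)
    return full_batches * (2 ** batch_size - 1) + (2 ** remainder - 1)


if __name__ == "__main__":
    # Hypothetical change names, purely for illustration.
    changes = ["config tweak", "schema change", "new endpoint"]
    print(len(suspect_sets(changes)), "suspect combinations for 3 changes")
    for size in (1, 2, 5, 10):
        print(f"10 changes in batches of {size:>2}: {total_suspects(10, size)} combinations")
```

Shipping the same ten changes one at a time leaves ten things to check in total, while shipping them together leaves over a thousand combinations to rule out.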
Improves management visibility and control – Reducing batch sizes gives you a greater number of instrumentation points by which you can visualize and measure the flow of work through your organization. It’s notoriously difficult to accurately determine progress of in-flight work. You are largely going to be limited to the subjective analysis of project managers and the biased opinion of the person doing the work. The only points where you can have certainty are when the work has just started and when the work has just been completed (and accepted by the next step in the process). With large batch sizes you have to wait long periods of time between those start and completion points, making it difficult to see how things are flowing, providing little guarantee that you will have adequate warning if things are going wrong, and allowing for few opportunities to make adjustments to optimize or triage. With small batch sizes you can see work move through the lifecycle with certainty, spot problems early, and make ongoing adjustments to optimize the flow of delivery.
Encourages decoupled architectures with fewer dependency issues – Smaller batch sizes can also have a positive impact on architecture. Most IT systems are built from within the context of large projects. Large projects create them and then large projects are undertaken to change them. The result is a built-in tolerance for monolithic architectures with complex dependencies. As you move to small batch sizes you are naturally limiting the work in progress on a particular segment of your code/infrastructure. While initially this might seem like it will slow the organization down, the principles of flow show that this will actually give you greater throughput over time. But in order to speed things up even further, you will end up looking for ways to increasingly decouple and isolate (including making fault tolerant) your architecture to allow for greater parallelization of work.
What are the economic benefits of reducing batch size?
In manufacturing and in software development, reducing batch sizes has been shown to have a significant impact on the economics of the production process. The diagram below (scanned from Donald G. Reinertsen’s “The Principles of Product Development Flow”, pg 121) lays out the direct links between smaller batch sizes and improved economics. I think the logic speaks for itself.
What are your control points for reducing batch sizes?
Reducing batch sizes is a policy decision that needs to be implemented at multiple levels:
Project Initiation and Funding – How projects are formed and funded tends to have a strong correlation to batch size. The definition of requirements and success criteria, in addition to the allocation of budget, is usually done in a large batch that corresponds to a specific business goal, or set of business goals, created at the quarterly or yearly scale. The inertia of this large batch is often carried throughout the rest of the lifecycle, becoming a pacemaker of sorts that encourages large batch sizes. Positive work done to break down these large initial batches into smaller batches can turn that inertia back into a net positive effect for the company. Reduction in the time horizon for the expected results of a project is usually a good way to force the issue (e.g. try scoping and budgeting projects to single month size rather than quarter/multi-quarter size).
Project management – When creating projects, consider the smallest amount of change that can be undertaken in the shortest amount of time and still achieve a measurable result. This will naturally lead to smaller teams working on smaller batches of work that can flow independently through the lifecycle with faster feedback and lower risk to the overall system.
Testing – Demand that individual pieces of work are tested as soon as those pieces of work are completed (rather than waiting for the entire project/release to be code complete). Continuous integration and its built-in unit/smoke tests are a crude example of this principle. Carry that further. Ensure that full deployment and testing efforts are ongoing during any project. This will automatically force engineers to think about their work in small units that can be completed and handed off for testing at regular intervals (naturally creating the urge to reduce batch sizes).
Release management – Break down large releases into small units of deployment that employ standardized packaging and configuration management mechanisms. These units of deployment should be aligned towards the things that are changed (i.e. application services) rather than large project releases that change many things. In addition to reducing deployment and configuration woes, this also has the effect of standardizing batch sizing across the lifecycle by determining the appropriate unit of change for your infrastructure.
I’m standing on the shoulders of people a lot smarter than me in this post. If you are interested in these ideas please check out: