On the first day of November 2019, my inbox contained a message
from Unmesh, one of our senior developers in India. He had observed developers
struggling with core distributed systems concepts that they needed to
understand in order to work effectively with modern tools like Kafka,
Cassandra, and ZooKeeper. He had tried teaching the theory behind key
concepts in distributed systems, but found that his colleagues struggled
to fully grasp the consequences. So he tried a more code-centric
approach. He explored the code driving these core open-source systems, and built
simplified implementations, designed to highlight and teach the
theoretical concepts. This was more successful, and his email was about
how to take this training further.
We decided that developing a series of patterns would be a good
direction to go and set out on what turned out to be a four-year
journey. More than most aspects of software development, distributed system
design often requires the kind of mathematical analysis provided by
tools such as formal methods. But as challenging as it is to understand
how the theory works, there’s still a considerable jump between what appears in a
paper and what can be implemented in a practical system. By studying
the code of the software that runs our online systems every day (which often
required learning new languages and frameworks), Unmesh was
able to formulate the common solutions embedded in this code into more
general patterns. Building skeletal implementations of these patterns
ensured he properly understood the oft-subtle
behaviors and trade-offs.
To communicate what he’d learned, he then drafted patterns, sent them
to me and other interested Thoughtworkers, reflected on reviews, and
published developed drafts here for wider consumption. As the pattern
collection took shape, he contacted Pearson to turn this into a book,
which I am proud to add to my signature series.
The final book contains thirty patterns, each illustrated with
explanatory text, many with sequence diagrams to explain the complex
interactions, and all with code samples that clarify the all-important
details. Understanding these patterns provides a solid foundation for
understanding how distributed systems work. In particular, they
illuminate the gnarliest problem that these systems face: how to
ensure that data can be distributed in order to increase availability
and resilience, without running into paradoxes when multiple writers try
to update at the same time.