Distributed Systems Fundamentals Every Developer Should Know
Every system you use daily - from Google Search to your team’s Slack workspace - is a distributed system. Understanding how they work isn’t optional for modern software engineers.
Here’s what you need to know.
What Makes a System “Distributed”
A distributed system is any system where components on networked computers communicate and coordinate to achieve a common goal. The key challenge: these components can fail independently.
Your laptop is not a distributed system. A web app with a database, cache, and message queue? That’s distributed.
The CAP Theorem
The CAP theorem states that a distributed system can provide at most two of three guarantees:
- Consistency - Every read receives the most recent write or an error
- Availability - Every request receives a response, though not necessarily the most recent write
- Partition tolerance - The system continues operating despite network failures
Since network partitions are inevitable, you’re really choosing between consistency and availability.
In practice, most systems make nuanced tradeoffs rather than a binary choice. DynamoDB, for example, lets you choose between eventually consistent and strongly consistent reads on a per-request basis.
Consensus: Getting Nodes to Agree
When multiple nodes need to agree on a value, you need a consensus algorithm. The two most important ones:
Raft
Raft is designed to be understandable. It elects a leader who handles all writes, and followers replicate the leader’s log.
Key properties:
- Leader election - if the leader fails, a new one is elected
- Log replication - the leader sends entries to followers
- Safety - if a log entry is committed, it won’t be lost
Paxos
Paxos is theoretically elegant but notoriously difficult to implement. Most production systems use Raft or a Raft derivative (etcd, CockroachDB, TiKV).
Event-Driven Architecture
Instead of services calling each other directly (request-response), event-driven systems communicate through events:
Service A --publishes--> Event Bus --delivers--> Service B
                                   --delivers--> Service C
Benefits:
- Loose coupling - services don’t know about each other
- Resilience - if Service B is down, events queue up
- Scalability - add consumers without changing producers
Tools: Kafka, RabbitMQ, Amazon EventBridge, NATS.
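The publish/subscribe shape those tools share can be sketched in a few lines. This is a toy in-process version for illustration only (a real broker adds persistence, delivery guarantees, and network transport):

```python
from collections import defaultdict

class EventBus:
    """Toy in-process event bus: producers publish, subscribers react."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        # Every subscriber gets the event; the producer never knows
        # who (or how many) consumed it - that's the loose coupling.
        for handler in self._subscribers[event_type]:
            handler(payload)

bus = EventBus()
received = []
bus.subscribe("order.created", lambda e: received.append(("billing", e)))
bus.subscribe("order.created", lambda e: received.append(("shipping", e)))
bus.publish("order.created", {"order_id": 42})
```

Note that adding a third consumer requires no change to the publisher, which is exactly the scalability benefit listed above.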
Practical Patterns
Circuit Breaker
Stop calling a failing service. After a threshold of failures, “open” the circuit and return a fallback response. Periodically test if the service has recovered.
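A minimal circuit breaker can be sketched like this (the threshold and timeout values are illustrative; production libraries add half-open trial limits, per-endpoint state, and metrics):

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, fallback=None):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.recovery_timeout:
                return fallback  # open: fail fast, don't hit the service
            self.opened_at = None  # timeout elapsed: allow a trial call
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback
        self.failures = 0  # success closes the circuit again
        return result
```

The key property: once the breaker trips, the failing service gets breathing room to recover instead of being hammered by retries.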
Saga Pattern
Manage distributed transactions as a sequence of local transactions. Each step has a compensating action to undo it if a later step fails.
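The core of a saga executor fits in a few lines: run each step, remember its compensation, and on failure undo completed steps in reverse order. A minimal sketch (step names are illustrative):

```python
def run_saga(steps):
    """steps: list of (action, compensation) callable pairs.
    Returns True if all steps succeed, False after rolling back."""
    completed = []
    for action, compensate in steps:
        try:
            action()
            completed.append(compensate)
        except Exception:
            # Undo what already succeeded, newest first.
            for undo in reversed(completed):
                undo()
            return False
    return True
```

Unlike a database rollback, compensations are ordinary forward actions (a refund, a cancellation email), so each one must be safe to run after the original step was visible to other services.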
CQRS
Separate read and write models. Writes go to an optimized write store, reads go to a denormalized read store. This lets you scale reads and writes independently.
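A stripped-down sketch of the split, assuming an event-sourced write side feeding a denormalized read side (all names here are made up for illustration):

```python
class OrderWriteModel:
    """Write side: validates commands and appends events."""

    def __init__(self):
        self.events = []

    def place_order(self, order_id, amount):
        self.events.append(("OrderPlaced", order_id, amount))

class OrderReadModel:
    """Read side: a denormalized view built from events."""

    def __init__(self):
        self.totals_by_order = {}

    def apply(self, event):
        kind, order_id, amount = event
        if kind == "OrderPlaced":
            self.totals_by_order[order_id] = amount

writes = OrderWriteModel()
reads = OrderReadModel()
writes.place_order("o-1", 99.0)
for e in writes.events:  # in production this replication is usually async
    reads.apply(e)
```

Because the read model is derived, you can rebuild it from the event history or maintain several differently shaped views for different query patterns. The cost is eventual consistency between the two sides.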
Where to Go From Here
Start with these resources:
- “Designing Data-Intensive Applications” by Martin Kleppmann - the definitive guide
- MIT 6.824 - distributed systems course with labs
- The Raft paper - readable and well-illustrated
The best way to learn distributed systems is to build one. Start with a simple key-value store that replicates across three nodes. You’ll encounter every fundamental challenge within a few hours.
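As a starting point for that exercise, here is a toy primary that synchronously replicates writes to followers and acknowledges only after reaching a quorum. The in-process classes stand in for real networked nodes, so everything here is illustrative:

```python
class Replica:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

class Primary:
    def __init__(self, followers, quorum):
        self.data = {}
        self.followers = followers
        self.quorum = quorum  # acks required, counting the primary itself

    def put(self, key, value):
        self.data[key] = value
        acks = 1  # the primary's own write
        for follower in self.followers:
            try:
                follower.apply(key, value)
                acks += 1
            except Exception:
                pass  # a follower may be down; keep going
        # The write is committed only if a quorum acknowledged it.
        return acks >= self.quorum

followers = [Replica(), Replica()]
primary = Primary(followers, quorum=2)
```

Turning those method calls into real network requests is where the fundamental challenges show up: timeouts, partial failures, and deciding what a client should see when a quorum isn't reached.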
Mohan
Software engineer writing about AI, distributed systems, and the craft of building great software.