Distributed Systems Fundamentals Every Developer Should Know
Every system you use daily - from Google Search to your team’s Slack workspace - is a distributed system. Understanding how they work isn’t optional for modern software engineers.
Here’s what you need to know.
What Makes a System “Distributed”
A distributed system is any system where components on networked computers communicate and coordinate to achieve a common goal. The key challenge: these components can fail independently.
Your laptop is not a distributed system. A web app with a database, cache, and message queue? That’s distributed.
The CAP Theorem
The CAP theorem states that a distributed system can provide at most two of three guarantees:
- Consistency - Every read receives the most recent write or an error
- Availability - Every request receives a response, though not necessarily the most recent write
- Partition tolerance - The system continues operating despite network failures
Since network partitions are inevitable, you’re really choosing between consistency and availability.
In practice, most systems make nuanced tradeoffs rather than a binary choice. DynamoDB, for example, lets you choose between eventually consistent and strongly consistent reads on a per-request basis.
Consensus: Getting Nodes to Agree
When multiple nodes need to agree on a value, you need a consensus algorithm. The two most important ones:
Raft
Raft is designed to be understandable. It elects a leader who handles all writes, and followers replicate the leader’s log.
Key properties:
- Leader election - if the leader fails, a new one is elected
- Log replication - the leader sends entries to followers
- Safety - if a log entry is committed, it won’t be lost
Paxos
Paxos is theoretically elegant but notoriously difficult to implement. Most production systems use Raft or a Raft derivative (etcd, CockroachDB, TiKV).
Event-Driven Architecture
Instead of services calling each other directly (request-response), event-driven systems communicate through events:
Service A --publishes--> Event Bus --delivers--> Service B
                                   --delivers--> Service C
Benefits:
- Loose coupling - services don’t know about each other
- Resilience - if Service B is down, events queue up
- Scalability - add consumers without changing producers
Tools: Kafka, RabbitMQ, Amazon EventBridge, NATS.
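The publish/subscribe shape those tools share can be sketched in a few lines. This is a toy in-process version for illustration only (a real broker adds persistence, delivery guarantees, and network transport):

```python
from collections import defaultdict

class EventBus:
    """Toy in-process event bus: producers publish, subscribers react."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        # Every subscriber gets the event; the producer never knows
        # who (or how many) consumed it - that's the loose coupling.
        for handler in self._subscribers[event_type]:
            handler(payload)

bus = EventBus()
received = []
bus.subscribe("order.created", lambda e: received.append(("billing", e)))
bus.subscribe("order.created", lambda e: received.append(("shipping", e)))
bus.publish("order.created", {"order_id": 42})
```

Note that adding a third consumer requires no change to the publisher, which is exactly the scalability benefit listed above.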
Practical Patterns
Circuit Breaker
Stop calling a failing service. After a threshold of failures, “open” the circuit and return a fallback response. Periodically test if the service has recovered.
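A minimal circuit breaker can be sketched like this (the threshold and timeout values are illustrative; production libraries add half-open trial limits, per-endpoint state, and metrics):

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, fallback=None):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.recovery_timeout:
                return fallback  # open: fail fast, don't hit the service
            self.opened_at = None  # timeout elapsed: allow a trial call
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback
        self.failures = 0  # success closes the circuit again
        return result
```

The key property: once the breaker trips, the failing service gets breathing room to recover instead of being hammered by retries.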
Saga Pattern
Manage distributed transactions as a sequence of local transactions. Each step has a compensating action to undo it if a later step fails.
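The core of a saga executor fits in a few lines: run each step, remember its compensation, and on failure undo completed steps in reverse order. A minimal sketch (step names are illustrative):

```python
def run_saga(steps):
    """steps: list of (action, compensation) callable pairs.
    Returns True if all steps succeed, False after rolling back."""
    completed = []
    for action, compensate in steps:
        try:
            action()
            completed.append(compensate)
        except Exception:
            # Undo what already succeeded, newest first.
            for undo in reversed(completed):
                undo()
            return False
    return True
```

Unlike a database rollback, compensations are ordinary forward actions (a refund, a cancellation email), so each one must be safe to run after the original step was visible to other services.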
CQRS
Separate read and write models. Writes go to an optimized write store, reads go to a denormalized read store. This lets you scale reads and writes independently.
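A stripped-down sketch of the split, assuming an event-sourced write side feeding a denormalized read side (all names here are made up for illustration):

```python
class OrderWriteModel:
    """Write side: validates commands and appends events."""

    def __init__(self):
        self.events = []

    def place_order(self, order_id, amount):
        self.events.append(("OrderPlaced", order_id, amount))

class OrderReadModel:
    """Read side: a denormalized view built from events."""

    def __init__(self):
        self.totals_by_order = {}

    def apply(self, event):
        kind, order_id, amount = event
        if kind == "OrderPlaced":
            self.totals_by_order[order_id] = amount

writes = OrderWriteModel()
reads = OrderReadModel()
writes.place_order("o-1", 99.0)
for e in writes.events:  # in production this replication is usually async
    reads.apply(e)
```

Because the read model is derived, you can rebuild it from the event history or maintain several differently shaped views for different query patterns. The cost is eventual consistency between the two sides.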
Where to Go From Here
Start with these resources:
- “Designing Data-Intensive Applications” by Martin Kleppmann - the definitive guide
- MIT 6.824 - distributed systems course with labs
- The Raft paper - readable and well-illustrated
The best way to learn distributed systems is to build one. Start with a simple key-value store that replicates across three nodes. You’ll encounter every fundamental challenge within a few hours.
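As a starting point for that exercise, here is a toy primary that synchronously replicates writes to followers and acknowledges only after reaching a quorum. The in-process classes stand in for real networked nodes, so everything here is illustrative:

```python
class Replica:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

class Primary:
    def __init__(self, followers, quorum):
        self.data = {}
        self.followers = followers
        self.quorum = quorum  # acks required, counting the primary itself

    def put(self, key, value):
        self.data[key] = value
        acks = 1  # the primary's own write
        for follower in self.followers:
            try:
                follower.apply(key, value)
                acks += 1
            except Exception:
                pass  # a follower may be down; keep going
        # The write is committed only if a quorum acknowledged it.
        return acks >= self.quorum

followers = [Replica(), Replica()]
primary = Primary(followers, quorum=2)
```

Turning those method calls into real network requests is where the fundamental challenges show up: timeouts, partial failures, and deciding what a client should see when a quorum isn't reached.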
Mohan
Software engineer writing about AI, distributed systems, and the craft of building great software.