Distributed Systems Theory — Staff Interview Handbook
The theory that separates senior from staff in design rounds: why distributed systems are hard, CAP and PACELC, the full consistency spectrum, time and ordering (Lamport & vector clocks), replication and quorums, consensus (Raft/Paxos at a usable depth), delivery semantics and idempotency, partitioning, CRDTs, and the failure modes (split-brain, clock skew, fencing) that interviewers love — plus a Q&A bank you can answer out loud.
1. Why Distributed Systems Are Hard
Section titled “1. Why Distributed Systems Are Hard”A distributed system is a set of independent machines that appear to users as one coherent system. The hard part is not the happy path — it’s that, at scale, partial failure is the normal state: some node is always slow, restarting, GC-pausing, or unreachable, and you can’t tell which from the outside.
The two facts everything follows from:
- The network is unreliable. Messages can be lost, delayed, duplicated, or reordered. A timeout tells you nothing — you can’t distinguish a crashed node from a slow node from a dropped reply. This is the single most important idea in the field.
- There is no global clock. Each machine has its own clock, and they drift. “What happened first?” has no cheap, exact answer.
The 8 Fallacies of Distributed Computing (worth naming): the network is reliable; latency is zero; bandwidth is infinite; the network is secure; topology doesn’t change; there is one administrator; transport cost is zero; the network is homogeneous. All false. Every one is a bug waiting to happen.
Senior answer: “I design for partial failure first. The core constraint is that a timeout is ambiguous — I can’t tell a dead node from a slow one — so every remote call needs timeouts, retries with idempotency, and a plan for ‘I don’t know if that succeeded.‘“
2. CAP Theorem — What It Actually Says
Section titled “2. CAP Theorem — What It Actually Says”During a network partition (P), a distributed system must choose between Consistency (C — every read sees the latest write, i.e. linearizability) and Availability (A — every request gets a non-error response).
CAP is a choice made only when a partition is happening. It is not “pick 2 of 3” as a permanent property — that framing is wrong and interviewers notice.
- CP (consistency over availability): on a partition, refuse requests that can’t be made safe. Example: a leader-based store (etcd, ZooKeeper, HBase) — the minority side stops serving.
- AP (availability over consistency): on a partition, keep serving and reconcile later (eventual consistency). Example: Dynamo-style stores (Cassandra, DynamoDB in some modes).
Trap: Saying a system is “CA”. You don’t get to opt out of partitions — they will happen, so the real choice is CP vs AP. “CA” only describes a single-node system.
Trap: Treating CAP as binary/global. Real systems choose per-operation (a bank may be CP for balance transfers, AP for “recently viewed”).
3. PACELC — The More Honest Version
Section titled “3. PACELC — The More Honest Version”CAP only talks about partitions. PACELC extends it:
If there’s a Partition, choose A or C; Else (normal operation), choose between Latency and Consistency.
This captures the everyday tradeoff CAP ignores: even with no partition, strong consistency costs latency (you must coordinate replicas before answering).
- PC/EC: always consistent, pay latency (e.g. spanner-like, traditional RDBMS replication done synchronously).
- PA/EL: favor availability and low latency, accept weaker consistency (Cassandra, Dynamo).
Senior answer: “I reach for PACELC because the interesting tradeoff isn’t just partitions — it’s that synchronous coordination costs latency on every request. Most of the time there’s no partition, so ‘L vs C in the normal case’ is what actually shapes the design.”
4. The Consistency Spectrum
Section titled “4. The Consistency Spectrum”“Consistency” is not one thing — it’s a spectrum from strong (expensive, intuitive) to weak (cheap, surprising). Know these by name and tradeoff:
| Model | Guarantee | Cost |
|---|---|---|
| Linearizable (strong) | Reads see the latest committed write; system behaves like one copy | Highest — needs coordination/consensus |
| Sequential | All clients see operations in the same order (not necessarily real-time) | High |
| Causal | Causally-related operations are seen in order; concurrent ops may differ | Medium — the sweet spot for many apps |
| Read-your-writes | You always see your own latest write | Low (session guarantee) |
| Monotonic reads | You never see time go backwards | Low (session guarantee) |
| Eventual | If writes stop, replicas eventually converge | Lowest — most surprising |
Session guarantees (read-your-writes, monotonic reads/writes) are the practical middle ground — cheap to provide, and they kill the most user-visible anomalies (“I posted a comment and it vanished on refresh”).
Senior answer: “Most products don’t need linearizability everywhere. I default to causal + session guarantees and reserve strong consistency for the few invariants that truly need it — money, uniqueness, locks.”
5. Time & Ordering: Clocks
Section titled “5. Time & Ordering: Clocks”Because there’s no global clock, we order events with logical clocks.
Physical clocks (wall time) drift and can jump (NTP corrections, leap seconds). Never use wall-clock timestamps to order events or build locks — clock skew between machines causes silent data loss (“last write wins” picks the wrong write).
Lamport clocks — a single counter per node, incremented on each event; on send, attach it; on
receive, set clock = max(local, received) + 1. Gives a total order consistent with causality:
if A → B (A causally precedes B), then L(A) < L(B). But the converse fails — L(A) < L(B)
does not prove causality, so Lamport clocks can’t detect concurrency.
Vector clocks — a vector of counters, one per node. They can detect concurrency: by comparing vectors you learn whether A happened-before B, B before A, or they’re concurrent (a true conflict needing resolution). Cost: size grows with the number of nodes.
Hybrid Logical Clocks (HLC) — combine physical time (so timestamps are human-meaningful and roughly track real time) with a logical component (so causality is preserved despite skew). Used by CockroachDB and others.
Nice to know: Google Spanner uses TrueTime — GPS/atomic clocks expose time as an interval
[earliest, latest], and it waits out the uncertainty before committing, achieving external consistency. It buys consistency with hardware + a few ms of commit-wait.
6. Replication
Section titled “6. Replication”Keeping copies of data on multiple nodes — for durability, availability, and read scaling. Three shapes:
- Single-leader (primary/replica): all writes go to the leader, which streams a replication log to followers. Simple; reads can scale on followers. Replication lag breaks read-your-writes on followers. Failover requires electing a new leader (and risks split-brain).
- Multi-leader: multiple nodes accept writes (e.g. multi-region, or offline-capable clients). Great for write availability and latency, but write conflicts are now possible and must be resolved (LWW, app merge, or CRDTs).
- Leaderless (Dynamo-style): clients write to several replicas directly; consistency comes from quorums + read repair + anti-entropy. Highly available, eventually consistent.
Sync vs async replication:
- Synchronous — the write waits for replicas to ack → durable, no data loss on failover, but slower and the leader stalls if a replica is down.
- Asynchronous — fast, but a failover can lose the last writes that didn’t replicate.
- Semi-sync (common default): wait for one replica synchronously, the rest async — a balance.
7. Quorums
Section titled “7. Quorums”In leaderless/quorum systems with N replicas, require W acks on write and R replicas on read. If:
R + W > Nthen any read quorum overlaps any write quorum by at least one node → a read is guaranteed to see the latest write (given versioning to pick the newest).
W = N, R = 1: fast reads, slow/fragile writes.R = N, W = 1: fast writes, slow reads.- Common:
N = 3, W = 2, R = 2→ tolerate one node down and still be consistent.
Supporting machinery: read repair (fix stale replicas during a read), hinted handoff (a healthy node temporarily holds writes for a down node), anti-entropy / Merkle trees (background sync of divergent replicas).
Trap: Quorums give you strong-ish consistency, not linearizability by default — without care (e.g. no read repair on the coordination path), you can still observe anomalies. Cassandra’s
QUORUMis tunable consistency, not a serializable transaction.
8. Consensus
Section titled “8. Consensus”The problem: get a group of nodes to agree on a single value (or an ordered log of values) despite failures and message loss. This underpins leader election, distributed locks, config stores, and replicated state machines.
Why it’s needed: “just use a leader” begs the question — who picks the leader, and how do they agree after the old one dies? That agreement is consensus.
Safety properties: all nodes decide the same value; only a value that was actually proposed; the decision is final. Liveness: they eventually decide (when the network behaves).
- Paxos — the classic, correct, famously hard to understand and implement.
- Raft — designed for understandability; same guarantees. Three ideas: leader election
(randomized timeouts elect one leader per term), log replication (leader appends entries,
replicates to followers, commits once a majority acks), and safety (a new leader must have
all committed entries). Majority quorum = tolerate
floor(N/2)failures (so run odd numbers: 3 tolerates 1, 5 tolerates 2).
Where you’ve used it: etcd, Consul (Raft); ZooKeeper (ZAB, Raft-like); Spanner (Paxos); Kafka’s KRaft controller.
FLP impossibility: in a fully asynchronous network, no consensus protocol can guarantee termination if even one node may fail. Real systems sidestep this with timeouts / partial synchrony — i.e. they trade guaranteed liveness for “makes progress when the network is well-behaved.” Naming FLP signals depth.
9. Failure Detection
Section titled “9. Failure Detection”You can’t know a node is dead; you suspect it. Heartbeats + timeouts are the basic tool, but a fixed timeout is wrong: too short → false positives (healthy node declared dead → needless failover, split-brain risk); too long → slow detection.
Phi-accrual failure detectors (Cassandra, Akka) output a suspicion level based on the recent distribution of heartbeat arrival times, instead of a hard yes/no — letting callers choose their own threshold.
Senior answer: “Failure detection is probabilistic. I tune timeouts to the network’s real latency distribution, and I make failover safe under false positives with fencing tokens — so a wrongly-suspected node can’t corrupt state when it comes back.”
10. Delivery Semantics & Idempotency
Section titled “10. Delivery Semantics & Idempotency”What a messaging/RPC system promises about duplicates and loss:
- At-most-once: may lose messages, never duplicates (fire-and-forget). Rarely acceptable.
- At-least-once: never loses, may duplicate (retries after an ambiguous timeout). The common real-world default.
- Exactly-once: the holy grail — and “exactly-once delivery” is essentially impossible over an unreliable network. What systems actually provide is exactly-once processing = at-least-once delivery + idempotent consumers / dedup.
Idempotency = applying an operation twice has the same effect as once. The practical fix for
at-least-once. Techniques: idempotency keys (client sends a unique ID; server dedups), natural
idempotency (SET x = 5 vs x += 5), or an inbox/dedup table.
Trap: Claiming a system is “exactly-once” without qualification. The senior version: “at-least-once delivery plus idempotent processing — Kafka’s exactly-once is transactions + idempotent producers within Kafka, not magic across external side effects.”
11. Distributed Transactions
Section titled “11. Distributed Transactions”You want atomicity across services/shards, but there’s no shared lock manager.
- 2PC (Two-Phase Commit): a coordinator asks all participants to prepare, then commit if all vote yes. Problems: it’s a blocking protocol — if the coordinator dies after prepare, participants hold locks indefinitely; poor availability; doesn’t scale. Used inside some databases, rarely across microservices.
- Saga: model a transaction as a sequence of local transactions, each with a compensating action to undo it. Coordinated by choreography (events) or an orchestrator. Trades atomicity/isolation for availability — you get eventual consistency and must handle intermediate states. (See the Microservices handbook for depth.)
Senior answer: “Across service/shard boundaries I avoid 2PC because it’s blocking and availability-hostile. I use Sagas with compensations and design the business flow to tolerate intermediate, eventually-consistent states.”
12. Partitioning & Consistent Hashing
Section titled “12. Partitioning & Consistent Hashing”To scale beyond one node, partition (shard) data by key.
- Range partitioning: keys in ranges → efficient range scans, but hot spots if keys are skewed (e.g. timestamps).
- Hash partitioning: hash the key → even spread, but loses range-scan locality.
Consistent hashing solves the re-sharding problem: naive hash(key) % N remaps almost
everything when N changes. Consistent hashing maps nodes and keys onto a ring so adding/removing a
node only moves ~1/N of keys. Virtual nodes smooth out imbalance and heterogeneous hardware.
Used by Dynamo, Cassandra, and many caches.
Trap: A celebrity/hot key still overwhelms one partition even with consistent hashing — handle with key splitting, replication of the hot key, or request coalescing.
13. CRDTs — Conflict-Free Replicated Data Types
Section titled “13. CRDTs — Conflict-Free Replicated Data Types”Data structures designed so that concurrent updates on different replicas merge automatically without coordination — the merge is commutative, associative, and idempotent, so replicas converge to the same state regardless of order. This is Strong Eventual Consistency.
- Examples: G-Counter / PN-Counter (counters), OR-Set (add/remove sets), LWW-Register, sequence CRDTs for collaborative text (the tech behind Google-Docs-style editing, Redis CRDTs, Riak, automerge/Yjs).
- Tradeoff: not every problem maps to a CRDT, and metadata can grow; but where they fit, you get multi-leader writes with no conflicts to resolve manually.
Senior answer: “When I need offline/multi-region writes with automatic merge — counters, presence, collaborative editing — I reach for CRDTs instead of last-write-wins, which silently drops data.”
14. Classic Failure Modes (Name These)
Section titled “14. Classic Failure Modes (Name These)”- Split-brain: a partition makes two nodes both think they’re leader → divergent writes. Prevent with majority quorum (a minority can’t elect a leader) and fencing tokens.
- Fencing tokens: a monotonically increasing number handed out with a lock; downstream storage rejects any write carrying an older token → a paused-then-resumed old leader can’t clobber newer state. The canonical fix for “the lock holder GC-paused.”
- Thundering herd / cache stampede: a popular key expires and thousands of requests hit the DB at
once → use request coalescing, jittered TTLs, or
stale-while-revalidate. - Retry storms / metastable failures: retries amplify load during a hiccup and keep the system down even after the trigger passes → backoff with jitter, circuit breakers, load shedding.
- Clock skew bugs: ordering or expiring by wall clock across machines → use logical clocks / server-assigned versions.
- Gray failure: a node is “up” (passes health checks) but degraded (slow disk) → detect with end-to-end SLO-based health, not just liveness pings.
15. Interview Q&A Bank
Section titled “15. Interview Q&A Bank”Q: State the CAP theorem precisely.
When a network partition occurs, a system must choose between consistency (linearizable reads) and availability (every request gets a non-error response). It’s a per-partition, often per-operation choice — not “pick 2 of 3” permanently. “CA” isn’t a real option because partitions are inevitable.
Q: Why is PACELC better than CAP?
CAP only addresses partitions. PACELC adds the everyday case: even with no partition (Else), you trade Latency vs Consistency, because synchronous coordination costs latency on every request.
Q: Difference between linearizability and eventual consistency?
Linearizability: the system behaves like a single copy; a read always sees the latest committed write (needs coordination). Eventual: replicas converge if writes stop, but reads can see stale or out-of-order data meanwhile.
Q: Why can’t you use wall-clock timestamps to order events across machines?
Clocks drift and jump; skew means “latest timestamp” can pick the wrong write, silently losing data. Use logical clocks (Lamport/vector) or server-assigned versions.
Q: Lamport vs vector clocks?
Lamport gives a total order consistent with causality but can’t detect concurrency. Vector clocks can distinguish happened-before from concurrent (true conflicts), at the cost of size O(nodes).
Q: What does R + W > N give you?
Read and write quorums overlap by at least one replica, so a read sees the latest write (with versioning). N=3, W=2, R=2 tolerates one node down while staying consistent. It’s strong-ish, not automatically linearizable.
Q: What problem does consensus solve, and name a protocol?
Getting nodes to agree on one value / an ordered log despite failures — the basis of leader election, locks, replicated state machines. Raft (understandable) or Paxos; both need a majority quorum, so use odd cluster sizes.
Q: What is FLP impossibility?
In a fully asynchronous network, no consensus algorithm can guarantee termination if one node may fail. Real systems use timeouts/partial synchrony to make progress when the network behaves.
Q: Is exactly-once delivery possible?
Not over an unreliable network. You get exactly-once processing = at-least-once delivery + idempotent consumers (idempotency keys / dedup). Kafka’s “exactly-once” is transactions + idempotent producers within Kafka.
Q: Why avoid 2PC across microservices?
It’s blocking: a coordinator crash after prepare leaves participants holding locks, hurting availability and scale. Use Sagas with compensating transactions and accept eventual consistency.
Q: What is consistent hashing and why use it?
Map nodes and keys onto a ring so adding/removing a node moves only ~1/N of keys (vs
% Nremapping almost everything). Virtual nodes balance load. Used by Dynamo/Cassandra/caches.
Q: How do you prevent split-brain?
Require a majority quorum to elect a leader (a minority can’t), and use fencing tokens so a resurrected old leader’s stale writes are rejected by downstream storage.
Q: What’s a fencing token?
A monotonically increasing number issued with a lock; storage rejects writes with an older token — protecting against a lock holder that paused (GC/network) and resumed after losing the lock.
Q: When would you use a CRDT?
When you need conflict-free concurrent writes across replicas/offline — counters, sets, presence, collaborative text — and want automatic convergence instead of lossy last-write-wins.
Q: How do you make an unreliable remote call safe?
Timeout (a timeout is ambiguous), retry with backoff+jitter, make the operation idempotent (so retries don’t double-apply), circuit-break to avoid retry storms, and have a fallback/degraded mode.
16. Cheat Sheet
Section titled “16. Cheat Sheet”- A timeout is ambiguous — can’t tell dead from slow. Design for partial failure first.
- CAP: on a partition, choose C or A (per-operation). “CA” isn’t real. Use PACELC: Else, Latency vs Consistency.
- Consistency spectrum: linearizable → sequential → causal → read-your-writes/monotonic → eventual. Default to causal + session guarantees; reserve strong for money/uniqueness/locks.
- No global clock. Lamport = order, can’t detect concurrency; vector clocks detect concurrency; HLC/TrueTime track real time safely. Never order by wall clock.
- Replication: single-leader (simple, lag), multi-leader (conflicts), leaderless (quorums). Sync = durable+slow, async = fast+can-lose-writes, semi-sync = balance.
- Quorum:
R + W > N→ overlap → consistent reads. N=3,W=2,R=2 common. Read repair, hinted handoff, Merkle anti-entropy. - Consensus = agree on a value/log. Raft (election + log replication + majority). Odd sizes: 3→tolerate 1, 5→tolerate 2. FLP: no guaranteed termination if async + failures.
- Delivery: at-most / at-least (default) / exactly-once processing via idempotency.
- Distributed txns: avoid 2PC (blocking) across services → Saga + compensations.
- Partition by hash/range; consistent hashing + vnodes for cheap re-sharding; beware hot keys.
- CRDTs = automatic conflict-free merge (counters/sets/text) → strong eventual consistency.
- Failure modes: split-brain (quorum + fencing tokens), thundering herd (jitter/coalesce), retry storms (backoff+jitter, circuit breaker), clock skew, gray failure.
End of handbook. The staff signal: reason from first principles — the network is unreliable and there is no global clock — then reach for the right tool (quorums, consensus, idempotency, CRDTs, fencing) and state the tradeoff out loud. Precision about CAP, exactly-once, and consistency models is what separates staff from senior.
© 2026 Saeid Darvishghazvini · Open source for the community · Privacy · · CC BY-NC 4.0