Raft Protocol
In-Depth Analysis of the Raft Algorithm - Part One: The Origin of the Problem
Why Raft? The Painful Journey from Single-Node to Distributed Systems
1. Starting with a Single Node: Everything Was Simple
Imagine you’re developing a simple counter service:
// Single-machine version - works perfectly
type Counter struct {
value int
mu sync.Mutex
}
func (c *Counter) Increment() {
c.mu.Lock()
c.value++
c.mu.Unlock()
}
func (c *Counter) Get() int {
c.mu.Lock()
defer c.mu.Unlock()
return c.value
}The Beautiful World of Single-Node Systems:
- Data Consistency: Only one copy of data exists, eliminating conflicts
- Atomic Operations: Locking mechanisms ensure operations aren’t interrupted
- Simple Failures: Either everything works, or the entire service crashes
2. The Cruel Reality: Limitations of Single-Server Architecture
But reality quickly slaps you in the face:
Issue 1: Single Point of Failure
User Request → [Single-Server Service] → Database
                      ↓
Server Crash = Entire System Unavailable
Issue 2: Performance Bottlenecks
1000 concurrent requests → [Single-server service] → Overwhelmed, response slows down
Issue 3: Data Loss Risk
Server hard drive failure → All data lost → Business collapses
3. The Naive Solution: Simple Replication
You think: “Why not just set up multiple servers?”
User Request → Load Balancer → [Server1] [Server2] [Server3]
                                   ↓         ↓         ↓
                              Database1 Database2 Database3
It looks promising, but problems quickly emerge…
Problem 1: Data Inconsistency
Time T1: User A sends Increment() to Server 1 → Server 1's counter = 1
Time T2: User B sends Get() to Server 2 → Server 2's counter = 0 (not yet synchronized)
Result: User B sees outdated data!
Problem 2: Concurrency Conflicts
Time T1: Server 1 and Server 2 simultaneously receive Increment() requests
Server 1: Reads counter=5, calculates 5+1=6, writes 6
Server 2: Reads counter=5, calculates 5+1=6, writes 6
Expected result: 7
Actual result: 6 (one increment lost)
4. Further Attempt: Master-Slave Replication
You think: “Then I’ll set up one master server, with the rest as slave servers.”
Write requests → [Master Server] → Synchronized to → [Slave Server 1] [Slave Server 2]
Read requests → [Slave Server 1] [Slave Server 2] (distributes read load)
This seems promising, but new problems arise…
Problem 1: What if the master server crashes?
Master server crashes → All write operations fail → System becomes read-only
Problem 2: How to choose a new master server?
Slave Server 1: "I should be the new master!"
Slave Server 2: "No, I should be the new master!"
Slave Server 3: "You're both useless, I'll take it!"
Result: Three master servers coexist → Data becomes completely chaotic
Problem 3: Network Partitioning
Network failure causes:
[Master Server] ←→ [Slave Server 1]     Network normal
        ✕ (link broken)
[Slave Server 2] ←→ [Slave Server 3]    Network disconnected from the master
Slave servers 2 and 3 assume the master server failed and elect a new master
Now two master servers are operating simultaneously!
5. The Core Issue: Fundamental Challenges in Distributed Systems
Through the above experiments, we uncover the core challenges facing distributed systems:
5.1 Consistency Problem
- How to maintain data consistency across multiple nodes?
- How to handle concurrent updates?
- How to guarantee atomicity of operations?
5.2 Availability
- How does the system continue functioning when some nodes fail?
- How to quickly detect and handle failures?
- How to avoid single points of failure?
5.3 Partition Tolerance
- How to handle network partitions?
- How to prevent split-brain scenarios?
- How to restore consistency after network recovery?
6. CAP Theorem: The Brutal Reality
The CAP Theorem states: In distributed systems, consistency (C), availability (A), and partition tolerance (P) cannot be achieved simultaneously. At most, only two can be guaranteed at any given time.
When a network partition occurs, you must choose:
Choose Consistency (C):
- Stop service and wait for network recovery
- Ensure data remains consistent
- But the system becomes unavailable
Choose Availability (A):
- Continue providing service
- But data inconsistencies may occur
- Conflicts must be resolved after network recovery
7. What Solution Do We Need?
Based on the above analysis, we require an algorithm that can:
- Automatically elect a leader: When the primary node fails, it can automatically select a new primary node
- Ensure data consistency: All nodes eventually maintain consistent data
- Handle network partitions: Correctly manage partitions to prevent split-brain scenarios
- Fault tolerance: System continues functioning normally with minor node failures
- Simplicity and clarity: Logical algorithm design for easy implementation and verification
This is precisely the problem the Raft algorithm solves.
Preview of the Next Section
In Part Two, we will explore how Raft ingeniously addresses these challenges:
- The core concept of Raft: the strong leader model
- Why the “majority” mechanism was chosen
- The sophisticated design of the election process
- How log consistency is guaranteed
In-Depth Analysis of the Raft Algorithm - Part Two: Core Design Principles of Raft
How Raft Ingeniously Solves Distributed Challenges
1. Raft’s Core Insight: The Strong Leader Model
Facing the chaos described in Part 1, Raft introduced a simple yet powerful concept:
“At any given moment, there can be only one leader in the cluster, and all write operations must be processed through the leader.”
Chaos in Traditional Multi-Master Models:
[Master1] ↔ [Master2] ↔ [Master3] (Each can accept write requests, prone to conflicts)
Raft's Strong Leader Model:
[Leader] → [Follower1], [Follower2] (Only the Leader accepts write requests)
Why is this design so crucial?
Problem recap: In Part 1, we saw how multiple master servers working simultaneously leads to data chaos.
Raft’s solution:
- Only one Leader at any given time
- All writes are serialized through the Leader
- Followers only replicate the Leader’s actions
This fundamentally prevents concurrent write conflicts!
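To make "all writes go through one Leader" concrete, here is a minimal, self-contained Go sketch (not Raft code; the counterLeader type and its channel are purely illustrative) showing how funneling every operation through a single owner removes the lost-update race from Part One:
package main

import "fmt"

// counterLeader is a purely illustrative stand-in for "all writes go through
// one Leader": a single goroutine owns the value, so every operation is
// applied in one serial order and increments can never be lost.
type counterLeader struct {
	ops chan func(*int)
}

func newCounterLeader() *counterLeader {
	l := &counterLeader{ops: make(chan func(*int))}
	go func() {
		value := 0
		for op := range l.ops {
			op(&value) // operations are applied strictly one at a time
		}
	}()
	return l
}

func (l *counterLeader) Increment() {
	done := make(chan struct{})
	l.ops <- func(v *int) { *v++; close(done) }
	<-done
}

func (l *counterLeader) Get() int {
	result := make(chan int, 1)
	l.ops <- func(v *int) { result <- *v }
	return <-result
}

func main() {
	l := newCounterLeader()
	l.Increment()
	l.Increment()
	fmt.Println(l.Get()) // 2 — no lost updates, because writes are serialized
}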
2. Why Choose the “Majority” Mechanism?
You might ask: “Why does Raft always emphasize ‘majority’? Why not require unanimous agreement?”
Let’s understand this through concrete examples:
Scenario 1: Requiring All Nodes to Agree
5-node cluster: [A] [B] [C] [D] [E]
Write operation flow:
1. Leader A receives write request
2. A sends replication requests to B, C, D, E
3. Waits for confirmation from all B, C, D, E
4. If node E experiences network failure → Entire write operation fails
Result: A single node failure renders the entire system unavailable
Scenario 2: Raft's Majority Mechanism
5-node cluster: [A] [B] [C] [D] [E]
Write operation flow:
1. Leader A receives write request
2. A sends replication requests to B, C, D, E
3. Success if 3 confirmations received (including itself)
4. Even if D and E fail, A, B, and C still provide confirmations
Result: Failure of a minority of nodes does not impact system availability
Mathematical Principle of the Majority
Key Insight: Any two majority sets must intersect!
5 nodes: [A] [B] [C] [D] [E]
Majority set 1: {A, B, C}
Majority set 2: {A, D, E}
Intersection: {A}
Majority set 1: {B, C, D}
Majority set 2: {C, D, E}
Intersection: {C, D}
What does this mean?
- If an operation is confirmed by a majority of nodes, any subsequent majority decisions will “see” that operation
- This guarantees consistency: two disjoint majority sets cannot make conflicting decisions
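The quorum arithmetic above can be checked mechanically. Below is a small Go sketch; the quorum and intersects helpers are illustrative names, not Raft APIs:
package main

import "fmt"

// quorum returns the smallest number of nodes that forms a majority.
func quorum(clusterSize int) int {
	return clusterSize/2 + 1
}

// intersects reports whether two node sets share at least one member.
func intersects(a, b []string) bool {
	seen := make(map[string]bool)
	for _, n := range a {
		seen[n] = true
	}
	for _, n := range b {
		if seen[n] {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(quorum(5)) // 3: a write confirmed by any 3 of 5 nodes is decided

	// Two possible majorities of the 5-node cluster {A, B, C, D, E}:
	q1 := []string{"A", "B", "C"}
	q2 := []string{"A", "D", "E"}
	fmt.Println(intersects(q1, q2)) // true — any two majorities must overlap
}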
3. Split-Brain Protection During Network Partitions
Now let's examine how Raft resolves the most challenging issue of all: split-brain.
Scenario: Network Partition
Original cluster: [A] [B] [C] [D] [E], with A as Leader
After network partition:
Partition 1: [A] [B] (2 nodes)
Partition 2: [C] [D] [E] (3 nodes)
Problems with Traditional Approaches
Partition 1: A believes it remains Leader and continues processing write requests
Partition 2: C, D, E elect a new Leader (e.g., C) and also process write requests
Result: Two Leaders operating simultaneously → Split-brain!
Raft's Solution
Partition 1: [A] [B] (2 nodes, less than majority)
- A cannot obtain majority confirmation
- A automatically demotes to Follower
- Partition 1 becomes read-only
Partition 2: [C] [D] [E] (3 nodes, majority)
- Can elect a new Leader
- Continues processing write requests
- System remains available
Key Points:
- Only partitions with majority nodes can elect a Leader
- Minority partitions automatically become read-only to prevent split-brain
- After network recovery, nodes in minority partitions synchronize data from majority partitions
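As a rough sketch of how the read-only demotion can be detected in practice (an implementation choice, not spelled out above): on each heartbeat round the Leader can count how many followers still answer it. The reachablePeers and stepDown helpers below are assumed, not defined in the earlier snippets:
// Illustrative sketch: a Leader stranded in a minority partition notices that
// it has lost its quorum and stops accepting writes. reachablePeers() is
// assumed to return the number of followers that acknowledged the latest
// heartbeat; stepDown() reverts the node to Follower.
func (l *Leader) checkQuorum(clusterSize int) {
	acked := l.reachablePeers() + 1 // +1 for the Leader itself
	if acked <= clusterSize/2 {
		// Fewer than a majority are reachable: this Leader may be on the
		// minority side of a partition, so it steps down, exactly as
		// Partition 1 does in the example above.
		l.stepDown()
	}
}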
4. Raft’s Three Role States
Raft designs each node to be in one of three states:
4.1 Follower
// Pseudocode
type Follower struct {
currentTerm int64
votedFor string
log []LogEntry
}
func (f *Follower) HandleMessage(msg Message) {
switch msg.Type {
case AppendEntries:
// Receive log replication from Leader
f.appendEntries(msg)
case RequestVote:
// Handle vote request
f.handleVoteRequest(msg)
case Timeout:
// Timed out waiting for Leader heartbeat, transition to Candidate
f.becomeCandidate()
}
}
Responsibilities of a Follower:
- Passively receive log replication from the Leader
- Respond to vote requests
- Detect Leader failures (heartbeat timeouts)
4.2 Candidate
type Candidate struct {
currentTerm int64
votedFor string
log []LogEntry
votes map[string]bool
}
func (c *Candidate) StartElection() {
c.currentTerm++ // Increment term number
c.votedFor = c.nodeID // Vote for self
c.votes[c.nodeID] = true // Record own vote
// Send vote requests to all other nodes
for _, peer := range c.peers {
go c.sendVoteRequest(peer)
}
}
Candidate Responsibilities:
- Initiate elections
- Collect votes
- Transition to Leader or Follower based on results
4.3 Leader
type Leader struct {
currentTerm int64
log []LogEntry
nextIndex map[string]int64 // Next log index for each Follower
matchIndex map[string]int64 // Highest log index replicated by each Follower
}
func (l *Leader) HandleClientRequest(req ClientRequest) {
// 1. Add the request to the local log
entry := LogEntry{
Term: l.currentTerm,
Index: len(l.log) + 1,
Command: req.Command,
}
l.log = append(l.log, entry)
// 2. Replicate in parallel to all Followers
for _, peer := range l.peers {
go l.replicateLog(peer)
}
}
Responsibilities of the Leader:
- Handle client requests
- Replicate logs to Followers
- Send heartbeats to maintain authority
- Determine when to commit logs
5. The Ingenious Design of State Transitions
At startup, all nodes are Followers
[Follower]  --(timeout / no heartbeat received)-------------> [Candidate]
[Candidate] --(gains majority of votes)---------------------> [Leader]
[Candidate] --(election failure / detects higher term)------> [Follower]
[Leader]    --(detects higher term / loses majority support)-> [Follower]
State Transition Triggers
Follower → Candidate:
- No Leader heartbeat received within election timeout
- Indicates Leader may have failed
Candidate → Leader:
- Received majority votes
- Becomes new Leader
Candidate → Follower:
- Election failed (other node became Leader)
- Discovered higher term number
Leader → Follower:
- Discovers a higher term number
- Loses majority support due to network partition
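These triggers map naturally onto a small transition function. The sketch below is illustrative; the State and Event names are mine, not part of the pseudocode above:
type State int

const (
	FollowerState State = iota
	CandidateState
	LeaderState
)

type Event int

const (
	ElectionTimeout Event = iota // no heartbeat / no election result within the timeout
	WonMajority                  // a Candidate collected votes from a majority
	SawHigherTerm                // any message carrying a larger term arrived
	LostQuorum                   // a Leader can no longer reach a majority of followers
)

// transition encodes the trigger table above; events that do not apply in the
// current state leave the state unchanged.
func transition(s State, e Event) State {
	switch {
	case e == SawHigherTerm:
		return FollowerState // every role yields to a higher term
	case s == FollowerState && e == ElectionTimeout:
		return CandidateState // start an election
	case s == CandidateState && e == WonMajority:
		return LeaderState
	case s == CandidateState && e == ElectionTimeout:
		return CandidateState // election failed, run a new one
	case s == LeaderState && e == LostQuorum:
		return FollowerState // e.g., stranded in a minority partition
	default:
		return s
	}
}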
6. Term: Raft’s Logical Clock
Raft introduces the “term number” concept to resolve timing issues:
Timeline:
Term 1: [A is Leader] ──────────── A fails
Term 2: [Elections in progress...] ──────────── B becomes Leader
Term 3: [B is Leader] ──────────── Network partition
Term 4: [Elections in progress...] ──────────── C becomes Leader
Role of the Term Number
1. Detecting Expired Messages:
if msg.Term < currentTerm {
// This message is from an expired Leader; ignore it
return
}
2. Discovering a New Leader:
if msg.Term > currentTerm {
// Detect higher term, immediately transition to Follower
currentTerm = msg.Term
becomeFollower()
}
3. Prevent duplicate voting:
if msg.Term == currentTerm && votedFor != "" {
// This term has already been voted on
return false
}
7. Why is this design so elegant?
Review the problems mentioned in Part 1 and see how Raft solves them:
Problem 1: How to elect a new leader?
Raft Solution:
- Automatic election mechanism
- Majority voting guarantees uniqueness
- Term number prevents conflicts
Problem 2: How to avoid split-brain?
Raft Solution:
- Only the partition holding a majority of nodes can elect a Leader
- Minority partitions automatically become read-only
- Mathematically guarantees no dual leadership
Problem 3: How to ensure data consistency?
Raft Solution:
- Strong leader model serializes all writes
- Majority replication guarantees durability
- Log replication ensures sequential consistency
Preview of the Next Section
In Part Three, we will delve into two core algorithms of Raft:
- The detailed process of leader election
- The sophisticated mechanism of log replication
- How to handle various edge cases
By now, you should grasp Raft’s core design philosophyโit elegantly resolves fundamental issues in distributed systems through a strong leader model and majority voting mechanism.
In-Depth Analysis of the Raft Algorithm - Part 3A: Leader Election Algorithm
Election Algorithms: How to Produce a Single Leader Amid Chaos
1. Election Triggering Conditions
Elections do not occur arbitrarily but are triggered by specific conditions:
Trigger Condition 1: No Leader at Startup
Cluster Startup: [A] [B] [C] [D] [E]
All nodes are Followers, waiting for Leader heartbeat
Timeout detected with no Leader → Election begins
Trigger Condition 2: Leader Failure
Normal operation: Leader A → [B] [C] [D] [E]
A suddenly crashes → B, C, D, E wait for heartbeat timeout → Election begins
Trigger Condition 3: Network Partition
Before partition: Leader A → [B] [C] [D] [E]
After partition: [A] [B] | [C] [D] [E]
Right partition detects no heartbeat from A → Start election
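All three triggers reduce to the same local mechanism: each Follower runs a timer that is reset by every message from a valid Leader and fires if nothing arrives. A minimal sketch, assuming a heartbeatCh channel field on Node and the randomElectionTimeout helper shown in the next section:
// runElectionTimer (illustrative) waits for Leader traffic; if the randomized
// timeout expires without any heartbeat, the node starts an election.
func (n *Node) runElectionTimer() {
	timer := time.NewTimer(randomElectionTimeout())
	defer timer.Stop()
	for {
		select {
		case <-n.heartbeatCh: // AppendEntries (heartbeat) received from the Leader
			timer.Reset(randomElectionTimeout())
		case <-timer.C: // no Leader heard from within the election timeout
			n.becomeCandidate()
			return
		}
	}
}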
2. Election Timeout: A Clever Design to Prevent Simultaneous Elections
Problem: What if all Followers start elections simultaneously?
At time T: A, B, C, D, E all time out simultaneously
A: "I want to be Leader!" โ Sends vote requests to B, C, D, E
B: "I want to be Leader!" โ Sends vote requests to A, C, D, E
C: "I want to be Leader!" โ Sends vote requests to A, B, D, E
...
Result: Everyone votes for themselves, no one gains majority โ Election failsRaft’s Solution: Randomized Election Timeout
// Each node has a different election timeout
func randomElectionTimeout() time.Duration {
// Random value between 150ms - 300ms
return time.Duration(150 + rand.Intn(150)) * time.Millisecond
}
Effect:
Time T: All nodes start timing
Time T+180ms: Node B times out first and initiates election
Time T+200ms: Node C receives B's vote request before its own timeout and votes for B
Time T+250ms: Node A times out, but B has already become Leader
3. Detailed Process of Voting Requests
When a Follower decides to initiate an election:
Step 1: Transition to Candidate state
func (n *Node) becomeCandidate() {
n.state = Candidate
n.currentTerm++ // Increment term number
n.votedFor = n.nodeID // Vote for self
n.voteCount = 1 // Self-vote
n.resetElectionTimer() // Reset election timer
}
Step 2: Send vote request
type VoteRequest struct {
Term int64 // Candidate's term number
CandidateID string // Candidate ID
LastLogIndex int64 // Index of candidate's last log entry
LastLogTerm int64 // Term number of candidate's last log entry
}
func (n *Node) sendVoteRequests() {
req := VoteRequest{
Term: n.currentTerm,
CandidateID: n.nodeID,
LastLogIndex: n.getLastLogIndex(),
LastLogTerm: n.getLastLogTerm(),
}
for _, peer := range n.peers {
go n.sendVoteRequest(peer, req)
}
}
4. Voting Decision: Not All Vote Requests Are Accepted
Nodes receiving vote requests must perform multiple checks:
Check 1: Term Number Verification
func (n *Node) handleVoteRequest(req VoteRequest) VoteResponse {
if req.Term < n.currentTerm {
// Expired candidate, reject vote
return VoteResponse{
Term: n.currentTerm,
VoteGranted: false,
}
}
if req.Term > n.currentTerm {
// Detect higher term, update own state
n.currentTerm = req.Term
n.votedFor = ""
n.becomeFollower()
}
}Check 2: Duplicate Voting Check
if n.votedFor != "" && n.votedFor != req.CandidateID {
// Already voted for someone else in this term
return VoteResponse{
Term: n.currentTerm,
VoteGranted: false,
}
}
Check 3: Log Freshness Check (Critical!)
// The candidate's log must be at least as up-to-date as the voter's own log
lastLogIndex := n.getLastLogIndex()
lastLogTerm := n.getLastLogTerm()
if req.LastLogTerm < lastLogTerm {
// Candidate's last logged term is older, reject
return VoteResponse{VoteGranted: false}
}
if req.LastLogTerm == lastLogTerm && req.LastLogIndex < lastLogIndex {
// Same term but smaller index, reject
return VoteResponse{VoteGranted: false}
}
// Passed all checks, grant vote
n.votedFor = req.CandidateID
return VoteResponse{
Term: n.currentTerm,
VoteGranted: true,
}
5. Why is log freshness verification necessary?
This is one of Raft’s most ingenious designs:
Scenario: What happens without log verification?
Initial state:
Leader A: [log1, log2, log3] (latest)
Node B: [log1, log2] (lagging)
Node C: [log1, log2, log3] (latest)
After A crashes:
B initiates election, C votes for B → B becomes new Leader
But B's log is older than C's! B overwrites C's log3 → Data loss!
Raft's Safeguard Mechanism:
B initiates election, sending vote request to C:
B's LastLogIndex=2, LastLogTerm=1
C's LastLogIndex=3, LastLogTerm=1
C checks: B's log is older than mine → Rejects vote
B fails to obtain majority votes → Election fails
Result: Only the node with the latest log can become Leader, ensuring no data loss!
6. Three Election Outcomes
Case 1: Obtains majority votes and becomes Leader
func (n *Node) handleVoteResponse(resp VoteResponse) {
if resp.VoteGranted {
n.voteCount++
if n.voteCount > (len(n.peers)+1)/2 { // majority of the full cluster (peers + self)
n.becomeLeader()
}
}
}
func (n *Node) becomeLeader() {
n.state = Leader
n.initializeLeaderState()
n.sendHeartbeats() // Immediately send heartbeats to establish authority
}Case 2: Discovering a higher term, transition to Follower
if resp.Term > n.currentTerm {
n.currentTerm = resp.Term
n.becomeFollower()
}
Case 3: Election Timeout, Restart Election
func (n *Node) onElectionTimeout() {
if n.state == Candidate {
// Election failed, restart
n.becomeCandidate()
n.sendVoteRequests()
}
}
7. Election Liveness Guarantee
Problem: Could an infinite loop occur where no Leader is ever elected?
Theoretically possible:
Round 1: A and B run elections simultaneously, tied vote count → Failed
Round 2: A and B run elections simultaneously again, tied vote count → Failed
...Infinite loop
Raft's Solution:
- Randomized Timeouts: Reduces probability of simultaneous elections
- Exponential Backoff: Increases wait time after election failures
- Mathematical Proof: Under the assumption of eventually reliable networks, a Leader will ultimately be elected
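As a sketch of what "retry with a fresh random timeout plus backoff" might look like (the retryElection name and the 50ms backoff step are illustrative, and a real node would use a timer and channel select rather than Sleep):
// retryElection (illustrative): each failed election starts again with a newly
// randomized timeout plus a small backoff, so two candidates are unlikely to
// keep colliding forever. It reuses the Node fields and helpers shown earlier.
func (n *Node) retryElection() {
	backoff := time.Duration(0)
	for {
		n.becomeCandidate() // increments the term and votes for itself
		n.sendVoteRequests()
		time.Sleep(randomElectionTimeout() + backoff) // real code: timer + select
		if n.state != Candidate {
			return // either won (Leader) or yielded to a higher term (Follower)
		}
		backoff += 50 * time.Millisecond // election failed: back off a little and retry
	}
}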
8. Complete Flowchart of the Election Algorithm
[Follower]
  | Election timeout
  v
[Candidate]
  |- Increment term number
  |- Vote for self
  |- Reset election timer
  |- Send vote requests
  |
  v
Wait for vote responses
  |- Obtain majority votes       --> [Leader]
  |- Discover higher term number --> [Follower]
  |- Election timeout            --> Re-elect
Summary
The ingenuity of the election algorithm lies in:
- Randomized Timeouts: Avoids conflicts during simultaneous elections
- Term Number Mechanism: Resolves timing and conflict issues
- Log Freshness Check: Ensures the new Leader possesses the most complete data
- Majority Voting: Mathematically guarantees uniqueness
Next, we’ll explain the Log-Replication Algorithm, demonstrating how the Leader ensures data consistency across all nodes.
In-Depth Analysis of the Raft Algorithm - Part 3B: The Log Replication Algorithm
Log Replication: Ensuring Data Consistency Across All Nodes
1. Why are logs needed?
In distributed systems, “logs” are not debugging logs but ordered records of operation sequences:
Problems with the traditional approach
Client request: SET x=1
Leader updates directly: x=1
Then notifies Follower: x=1
Issue: If the notification is lost, the Follower's x remains the old value
Raft's Logging Approach
Client request: SET x=1
Leader logs: [Index=1, Term=1, Command="SET x=1"]
Replicates log to Followers
Executes command only after majority replication confirmed
Key Insight: Log operations before executing them, ensuring replayability and consistency.
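To make "log first, execute later" concrete, here is a minimal sketch of the apply step that later snippets refer to as applyCommittedEntries: committed entries are replayed into the state machine strictly in index order, which is what makes the log replayable and every node's state identical.
// applyCommittedEntries replays every committed-but-not-yet-applied entry,
// in log order, into the state machine. Since all nodes apply the same
// entries in the same order, they compute the same state.
func (n *Node) applyCommittedEntries() {
	for n.lastApplied < n.commitIndex {
		n.lastApplied++
		entry := n.log[n.lastApplied-1] // log indices are 1-based
		n.stateMachine.Apply(entry.Command)
	}
}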
2. Log Entry Structure
type LogEntry struct {
Index int64 // Log index (starting from 1)
Term int64 // Term number when this log was created
Command interface{} // Client command (e.g., "SET x=1")
}
// Example log sequence
log := []LogEntry{
{Index: 1, Term: 1, Command: "SET x=1"},
{Index: 2, Term: 1, Command: "SET y=2"},
{Index: 3, Term: 2, Command: "DELETE x"},
{Index: 4, Term: 2, Command: "SET z=3"},
}
Purpose of each field:
- Index: Determines operation sequence, ensuring all nodes execute in the same order
- Term: Identifies the Leader at creation time, used for conflict detection
- Command: The actual operation to execute
3. Basic Log Replication Process
Step 1: Leader Receives Client Request
func (l *Leader) HandleClientRequest(cmd Command) {
// 1. Create a new log entry
entry := LogEntry{
Index: len(l.log) + 1,
Term: l.currentTerm,
Command: cmd,
}
// 2. Append to local log (without execution)
l.log = append(l.log, entry)
// 3. Initiate replication to Followers
l.replicateToFollowers(entry)
}
Step 2: Parallel replication to all Followers
func (l *Leader) replicateToFollowers(entry LogEntry) {
for _, follower := range l.followers {
go l.sendAppendEntries(follower, entry)
}
}
Step 3: Wait for Majority Confirmation
func (l *Leader) waitForMajority(entry LogEntry) {
confirmCount := 1 // Leader counts itself as one vote
for response := range l.responseChan {
if response.Success {
confirmCount++
if confirmCount > len(l.cluster)/2 {
// Majority confirmed, commit entry
l.commitEntry(entry)
break
}
}
}
}
Step 4: Commit and Execute
func (l *Leader) commitEntry(entry LogEntry) {
// 1. Mark as committed
l.commitIndex = entry.Index
// 2. Execute command
result := l.stateMachine.Apply(entry.Command)
// 3. Return result to client
l.respondToClient(result)
// 4. Notify Followers to commit
l.notifyFollowersToCommit(entry.Index)
}
4. AppendEntries RPC: Core of Log Replication
This is the most critical RPC call in Raft:
AppendEntries Request Structure
type AppendEntriesRequest struct {
Term int64 // Leader's term number
LeaderID string // Leader's ID
PrevLogIndex int64 // Index of the log entry preceding the new entry
PrevLogTerm int64 // Term number of the log entry preceding the new entry
Entries []LogEntry // Log entries to replicate (empty during heartbeat)
LeaderCommit int64 // Leader's commit index
}
Why are PrevLogIndex and PrevLogTerm required?
Key Question: How to ensure log continuity?
Incorrect replication approach:
Leader: [1, 2, 3, 4, 5]
Follower: [1, 2, ?, ?, ?]
Leader sends entries 4 and 5 directly → Follower becomes [1, 2, 4, 5]
Entry 3 is missing, log continuity is broken!
Raft's Solution:
// When Leader sends entry 4
req := AppendEntriesRequest{
PrevLogIndex: 3, // Entry 4's predecessor is 3
PrevLogTerm: 2, // Entry 3's term number is 2
Entries: [entry4], // Entry to append
}
// Follower checks
if follower.log[3].Term != req.PrevLogTerm {
// Entry 3 mismatches, reject adding entry 4
return AppendEntriesResponse{Success: false}
}
5. Follower Log Consistency Check
Processing logic when a Follower receives an AppendEntries request:
func (f *Follower) handleAppendEntries(req AppendEntriesRequest) AppendEntriesResponse {
// 1. Term check
if req.Term < f.currentTerm {
return AppendEntriesResponse{
Term: f.currentTerm,
Success: false,
}
}
// 2. Update term and Leader information
if req.Term > f.currentTerm {
f.currentTerm = req.Term
f.votedFor = ""
}
f.leaderID = req.LeaderID
f.resetElectionTimer() // Reset election timer upon receiving Leader message
// 3. Log Consistency Check
if req.PrevLogIndex > 0 {
if len(f.log) < req.PrevLogIndex {
// Log is too short, missing preceding entries
return AppendEntriesResponse{Success: false}
}
if f.log[req.PrevLogIndex-1].Term != req.PrevLogTerm {
// Term of preceding entry does not match
return AppendEntriesResponse{Success: false}
}
}
// 4. Add new entries
for i, entry := range req.Entries {
index := req.PrevLogIndex + i + 1
if len(f.log) >= index {
// Check for conflicts
if f.log[index-1].Term != entry.Term {
// Conflict detected, delete this position and all subsequent entries
f.log = f.log[:index-1]
}
}
if len(f.log) < index {
// Add new entry
f.log = append(f.log, entry)
}
}
// 5. Update commit index
if req.LeaderCommit > f.commitIndex {
f.commitIndex = min(req.LeaderCommit, len(f.log))
f.applyCommittedEntries()
}
return AppendEntriesResponse{
Term: f.currentTerm,
Success: true,
}
}
6. Log Recovery: Handling Inconsistencies
How does Raft recover when a Follower’s log becomes inconsistent with the Leader’s?
Scenario: Log Divergence Caused by Network Partition
State before partition:
Leader A: [1, 2, 3]
Node B: [1, 2, 3]
Node C: [1, 2, 3]
Network partition: A | B,C
A continues receiving requests: [1, 2, 3, 4, 5]
B becomes new Leader, receiving requests: [1, 2, 3, 6, 7]
After partition recovery:
A: [1, 2, 3, 4, 5] (Old Leader)
B: [1, 2, 3, 6, 7] (New Leader)
Raft's Recovery Process
Step 1: After B becomes Leader, attempt to replicate logs
// B sends AppendEntries to A
req := AppendEntriesRequest{
PrevLogIndex: 5, // B assumes A has 5 logs
PrevLogTerm: 2, // The 5th log's term should be 2
Entries: [], // Sends heartbeat probe first
}
// A checks: My 5th log term is 1, not 2
// A replies: Success: false
Step 2: B decrements the index to find a consistent point
// B receives a failure response and decrements nextIndex[A]
B.nextIndex[A] = 4
// Retry
req := AppendEntriesRequest{
PrevLogIndex: 4,
PrevLogTerm: 2, // B's 4th entry (command 6) was written in term 2
Entries: [],
}
// A checks: My 4th log's term is 1, not 2 → still a mismatch
// A replies: Success: false
Step 3: Continue decrementing until a match is found
// B continues decrementing
B.nextIndex[A] = 3
req := AppendEntriesRequest{
PrevLogIndex: 3,
PrevLogTerm: 1, // Term 3's term is 1
Entries: [],
}
// A checks: My Term 3 log's term is indeed 1
// A replies: Success: true
Step 4: Overwrite from the consistent point
// B finds the consistent point and starts sending correct logs
req := AppendEntriesRequest{
PrevLogIndex: 3,
PrevLogTerm: 1,
Entries: [entry6, entry7], // B's 4th and 5th logs
}
// After A receives:
// Delete its own logs 4 and 5 (conflicting 4 and 5)
// Add logs 4 and 5 sent by B (6 and 7)
// Final result: A's log becomes [1, 2, 3, 6, 7]
7. Commit Rules: When Can Commands Be Executed?
Key Principle: Only log entries replicated by a majority of nodes can be committed for execution.
Commit Conditions
func (l *Leader) canCommit(index int64) bool {
// 1. Must be a log entry from the current term
if l.log[index-1].Term != l.currentTerm {
return false
}
// 2. Must be replicated by a majority of nodes
replicaCount := 1 // Leader itself
for _, follower := range l.followers {
if l.matchIndex[follower] >= index {
replicaCount++
}
}
return replicaCount > len(l.cluster)/2
}Why can’t logs from old terms be committed?
Dangerous Scenario:
Initial state: Leader A, logs [1, 2]
A crashes, B becomes Leader, adds log 3: [1, 2, 3]
B crashes, A recovers as Leader, logs still [1, 2]
If A commits log 2 directly then crashes
C becomes Leader and may overwrite the committed log 2!
Raft's Safety Rules:
- Leaders can only commit logs from the current term
- Committing logs from the current term implicitly commits all previous logs
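A sketch of how a Leader can advance its commit index under these two rules, reusing the matchIndex bookkeeping from the Leader struct shown earlier (the advanceCommitIndex name is illustrative):
// advanceCommitIndex finds the highest index that is stored on a majority of
// nodes AND belongs to the current term, then commits everything up to it.
func (l *Leader) advanceCommitIndex() {
	for idx := int64(len(l.log)); idx > l.commitIndex; idx-- {
		if l.log[idx-1].Term != l.currentTerm {
			continue // rule 1: never directly commit an entry from an older term
		}
		count := 1 // the Leader's own copy
		for _, follower := range l.followers {
			if l.matchIndex[follower] >= idx {
				count++
			}
		}
		if count > len(l.cluster)/2 {
			l.commitIndex = idx // rule 2: this also commits every earlier entry
			break
		}
	}
}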
8. Log Replication Performance Optimization
Optimization 1: Batch Transmission
// Transmit in batches instead of sequentially
func (l *Leader) batchAppendEntries(follower string) {
entries := l.getPendingEntries(follower)
if len(entries) > 0 {
l.sendAppendEntries(follower, entries)
}
}
Optimization 2: Parallel Replication
// Send to all Followers in parallel
func (l *Leader) replicateInParallel() {
var wg sync.WaitGroup
for _, follower := range l.followers {
wg.Add(1)
go func(f string) {
defer wg.Done()
l.replicateToFollower(f)
}(follower)
}
wg.Wait()
}
Optimization 3: Rapid Rollback
// Rapidly locate the consistent point when inconsistencies are detected
type AppendEntriesResponse struct {
Success bool
Term int64
ConflictTerm int64 // Term of conflicting entries
FirstIndex int64 // First index within this term
}
Summary
The ingenuity of the log replication algorithm lies in:
- Log entry structure: Index ensures order, Term detects conflicts
- Consistency checks: PrevLogIndex/PrevLogTerm guarantee continuity
- Conflict Resolution: Automatically roll back to a consistent point, then overwrite
- Commit Rule: Only commit logs replicated by a majority of nodes in the current term
This mechanism ensures all nodes eventually possess identical log sequences, thereby guaranteeing state machine consistency.
Next, we will cover Part Four: Safety Proofs and Boundary Condition Handling.
In-Depth Analysis of the Raft Algorithm - Part 4A: Safety Guarantees
Raft’s Safety: Why It Ensures Data Is Neither Lost Nor Corrupted
1. Raft’s Safety Commitments
The Raft algorithm makes several key safety guarantees:
Guarantee 1: Election Safety
Commitment: At most one Leader is elected during any given term
Why is this important?
If two Leaders exist simultaneously:
Leader A: Processes SET x=1
Leader B: Processes SET x=2
Result: Data conflict, uncertain system state
Guarantee 2: Leader Append-Only
Guarantee: A Leader never overwrites or deletes entries in its own log; it only appends new entries
Guarantee 3: Log Matching
Commitment: If two logs contain an entry with the same index and term, then the logs are identical in all entries up through that index.
Guarantee 4: Leader Completeness
Commitment: If a log entry is committed during a certain term, it will appear in the logs of all Leaders from that term onwards.
Guarantee 5: State Machine Safety
Guarantee: If a server applies a log entry at a certain index position, no other server will apply a different log entry at that index position.
2. Mathematical Proof of Election Safety
Theorem: At most one Leader exists within any given term T.
Proof:
Suppose two Leaders exist within term T: A and B.
To become Leader, A must obtain a majority vote: |votes_A| > n/2
To become Leader, B must obtain a majority vote: |votes_B| > n/2
Therefore: |votes_A| + |votes_B| > n
However, each node can vote at most once during term T
Thus: votes_A ∩ votes_B = ∅
This implies: |votes_A| + |votes_B| ≤ n
Contradiction! Hence, there exists at most one Leader during term T.
3. Proof of Leader Completeness
Theorem: If a log entry is committed during term T, it will appear in the logs of all Leaders for terms > T.
Proof Approach:
- Committing a log entry means it is replicated by a majority of nodes.
- Becoming a new Leader requires obtaining a majority vote.
- Any two majority sets must intersect
- Log freshness checks during voting ensure the new Leader possesses the most complete log
4. Why Can’t Logs from Previous Terms Be Committed?
Hazardous Scenario
S1: [1, 2] (Term=2, Leader)
S2: [1, 2]
S3: [1]
S1 crashes, S3 becomes Leader (Term=3), adding entry 3:
S3: [1, 3] (Term=3, Leader)
S1 recovers and becomes Leader (Term=4):
If S1 directly commits entry 2 and then crashes,
S3 re-becomes Leader and overwrites the already committed entry 2!
Raft's Solution
// Only commit logs from the current term
if l.log[index-1].Term != l.currentTerm {
return false
}5. Safety During Network Partitions
Partition: [A,B] | [C,D,E]
Partition 1 (Minority):
- Cannot obtain majority confirmation
- Enters read-only state
Partition 2 (Majority):
- Can elect a new Leader
- Continues processing write requests
Network Recovery:
- A automatically demoted to Follower
- Synchronizes logs from the new LeaderSummary
Raft guarantees safety through rigorous mathematical proofs:
- Majority mechanism prevents split-brain
- Log inspection ensures integrity
- Commit rules guarantee consistency
The next section will cover boundary condition handling.
In-Depth Analysis of the Raft Algorithm - Part 4B: Boundary Conditions and Implementation Details
The Devil Is in the Details: Boundary Conditions in Raft Implementation
1. Timing Issues: The Ingenious Design of Election Timeouts
Problem: How to Avoid Election Deadlocks?
Scenario: Two nodes simultaneously initiate elections
Time T: A and B both time out and initiate elections
A → C: Request vote (Term=5)
B → C: Request vote (Term=5)
C can only vote for one; assume C votes for A
A gains 2 votes (self + C), B gains 1 vote (self)
Neither achieves a majority → Election fails
Next round: A and B may time out again...
Solution: Randomize timeout durations
func (n *Node) resetElectionTimer() {
// Random timeout between 150-300ms
timeout := time.Duration(150+rand.Intn(150)) * time.Millisecond
n.electionTimer.Reset(timeout)
}
Why does this work?
- Mathematically, randomization breaks synchronization
- Even if simultaneous timeouts occur occasionally, the next timeout duration will differ
- Ultimately, one node will always "sneak ahead" successfully
Timeout Selection Principle
Election Timeout >> Heartbeat Interval >> Network Round-Trip Time
Typical Configuration:
- Heartbeat Interval: 50ms
- Election timeout: 150-300ms
- Network RTT: 1-5ms
Reasons:
- Avoid false elections due to network jitter
- Allow sufficient time for the Leader to send heartbeats
- Fault detection must occur faster than the election timeout2. Log Compression: Snapshot Mechanism
Problem: How to handle infinitely growing logs?
As the system runs, logs grow increasingly long:
[1, 2, 3, 4, 5, ..., 1000000]
Issues:
- Excessive memory consumption
- Prolonged synchronization time for new nodes
- Extended recovery time during restarts
Solution: Snapshot Mechanism
type Snapshot struct {
LastIncludedIndex int64 // Index of the last log entry included in the snapshot
LastIncludedTerm int64 // Term of the last log entry included in the snapshot
Data []byte // State machine snapshot data
}
func (n *Node) createSnapshot() {
// 1. Create state machine snapshot
snapshotData := n.stateMachine.CreateSnapshot()
// 2. Record snapshot metadata
snapshot := Snapshot{
LastIncludedIndex: n.lastApplied,
LastIncludedTerm: n.log[n.lastApplied-1].Term,
Data: snapshotData,
}
// 3. Save the snapshot
n.saveSnapshot(snapshot)
// 4. Delete logged entries included in the snapshot
n.log = n.log[n.lastApplied:]
// 5. Adjust indices
n.adjustIndices(n.lastApplied)
}
Snapshot Transmission and Installation
// Leader sends snapshot to lagging Follower
type InstallSnapshotRequest struct {
Term int64
LeaderID string
LastIncludedIndex int64
LastIncludedTerm int64
Data []byte
}
func (f *Follower) handleInstallSnapshot(req InstallSnapshotRequest) {
// 1. Check term
if req.Term < f.currentTerm {
return InstallSnapshotResponse{Term: f.currentTerm}
}
// 2. Install snapshot
f.stateMachine.InstallSnapshot(req.Data)
// 3. Update log state
f.log = []LogEntry{} // Clear logs
f.lastApplied = req.LastIncludedIndex
f.commitIndex = req.LastIncludedIndex
// 4. Save snapshot metadata
f.lastIncludedIndex = req.LastIncludedIndex
f.lastIncludedTerm = req.LastIncludedTerm
}
3. Member Changes: Dynamically Adjusting Cluster Size
Problem: How to safely add/remove nodes?
Naive Approach: Directly switch configurations
Old configuration: [A, B, C] (3 nodes)
New configuration: [A, B, C, D, E] (5 nodes)
Issue: Different nodes may observe configuration changes at different times
A, B see old config: 2 votes needed to become Leader
C, D, E see new config: 3 votes needed to become Leader
Potential for two concurrent Leaders!
Raft's Solution: Joint Consensus
Phase 1: Enter joint configuration C_old,new
- Requires majority votes in both C_old and C_new
- Example: C_old=[A,B,C], C_new=[A,B,C,D,E]
- Becoming Leader requires: 2 votes in C_old AND 3 votes in C_new
Phase 2: Switch to new configuration C_new
- Only requires majority in C_new
- Safely completes configuration change
type Configuration struct {
Old []string // Old configuration node list
New []string // New configuration node list
}
func (l *Leader) addServer(newServer string) {
// 1. Create joint configuration
jointConfig := Configuration{
Old: l.config.Servers,
New: append(l.config.Servers, newServer),
}
// 2. Submit joint configuration
l.proposeConfigChange(jointConfig)
// 3. Submit new configuration after joint config commits
go func() {
<-l.jointConfigCommitted
newConfig := Configuration{
Old: nil,
New: jointConfig.New,
}
l.proposeConfigChange(newConfig)
}()
}
4. Network Partition Recovery: Data Synchronization
Scenario: Recovery after a prolonged partition
Before partitioning: [A, B, C, D, E], where A is Leader
Long-term partitioning:
Partition 1: [A, B] (processed 100 requests)
Partition 2: [C, D, E] (C became Leader, processed 200 requests)
Partition recovery: How to synchronize data?
Recovery Process
func (f *Follower) handleAppendEntries(req AppendEntriesRequest) {
// 1. Detect higher term, immediately transition to Follower
if req.Term > f.currentTerm {
f.currentTerm = req.Term
f.votedFor = ""
f.state = Follower
}
// 2. Log consistency check failed
if req.PrevLogIndex > len(f.log) {
return AppendEntriesResponse{
Success: false,
ConflictIndex: len(f.log) + 1,
}
}
// 3. Conflict detected; delete conflicting log and all subsequent logs
if f.log[req.PrevLogIndex-1].Term != req.PrevLogTerm {
conflictTerm := f.log[req.PrevLogIndex-1].Term
conflictIndex := req.PrevLogIndex
// Find the first index of the conflicting term
for i := req.PrevLogIndex - 1; i >= 0; i-- {
if f.log[i].Term != conflictTerm {
break
}
conflictIndex = i + 1
}
// Delete conflicting logs
f.log = f.log[:conflictIndex-1]
return AppendEntriesResponse{
Success: false,
ConflictTerm: conflictTerm,
ConflictIndex: conflictIndex,
}
}
}
5. Client Interaction: Idempotency and Duplicate Detection
Issue: Duplicate Execution Due to Network Retransmission
Client sends: SET x=1
Leader executes: x=1
Response lost, client retries: SET x=1
Leader executes again: x=1 (duplicate execution)
Solution: Client Session and Sequence ID
type ClientRequest struct {
ClientID string // Client unique identifier
SequenceID int64 // Request sequence ID
Command interface{}
}
type ClientSession struct {
LastSequenceID int64 // Last processed sequence ID
LastResponse interface{} // Last response result
}
func (l *Leader) handleClientRequest(req ClientRequest) {
session := l.clientSessions[req.ClientID]
// Check for duplicate requests
if req.SequenceID <= session.LastSequenceID {
// Return cached response
return session.LastResponse
}
// Process new request
entry := LogEntry{
Index: len(l.log) + 1,
Term: l.currentTerm,
Command: req,
ClientID: req.ClientID,
SequenceID: req.SequenceID,
}
l.log = append(l.log, entry)
l.replicateToFollowers(entry)
}
6. Performance Optimization: Batch Processing and Pipelining
Batch Processing: Reducing Network Overhead
func (l *Leader) batchReplication() {
ticker := time.NewTicker(10 * time.Millisecond)
for {
select {
case <-ticker.C:
for _, follower := range l.followers {
entries := l.getPendingEntries(follower)
if len(entries) > 0 {
l.sendBatchAppendEntries(follower, entries)
}
}
}
}
}
Pipeline: Parallel Processing of Multiple Requests
func (l *Leader) pipelineReplication(follower string) {
// Send the next batch without waiting for the previous response
for {
entries := l.getNextBatch(follower)
go l.sendAppendEntries(follower, entries)
// Adjust sending rate based on network conditions
time.Sleep(l.calculateDelay(follower))
}
}
7. Failure Detection: Heartbeat and Timeout
Precise Failure Detection
type FailureDetector struct {
heartbeatInterval time.Duration
timeoutThreshold time.Duration
lastHeartbeat map[string]time.Time
}
func (fd *FailureDetector) isNodeAlive(nodeID string) bool {
lastSeen := fd.lastHeartbeat[nodeID]
return time.Since(lastSeen) < fd.timeoutThreshold
}
// Adaptive timeout: adjusts based on network conditions
func (fd *FailureDetector) updateTimeout(nodeID string, rtt time.Duration) {
// Timeout = average RTT * multiplier + safety margin
fd.timeoutThreshold = rtt*4 + 50*time.Millisecond
}
8. Persistence: WAL and State Recovery
Write-Ahead Log (WAL)
func (n *Node) persistState() error {
state := PersistentState{
CurrentTerm: n.currentTerm,
VotedFor: n.votedFor,
Log: n.log,
}
// Atomic write: write to temporary file, then rename
tempFile := n.dataDir + "/state.tmp"
if err := writeToFile(tempFile, state); err != nil {
return err
}
return os.Rename(tempFile, n.dataDir+"/state.dat")
}
func (n *Node) recoverState() error {
data, err := ioutil.ReadFile(n.dataDir + "/state.dat")
if err != nil {
return err
}
var state PersistentState
if err := json.Unmarshal(data, &state); err != nil {
return err
}
n.currentTerm = state.CurrentTerm
n.votedFor = state.VotedFor
n.log = state.Log
return nil
}
Summary
Raft’s implementation details reveal the complexity of distributed systems:
- Timing Control: Randomization resolves synchronization issues
- Resource Management: Snapshot mechanism controls memory growth
- Dynamic Configuration: Union consensus ensures safe changes
- Fault Tolerance & Recovery: Intelligent log repair mechanism
- Performance Optimization: Batch processing and pipelining techniques
- Persistence: WAL guarantees data integrity
These details transform Raft from a theoretical algorithm into a practical engineering implementation.