Raft Protocol

In-Depth Analysis of the Raft Algorithm - Part One: The Origin of the Problem

Why Raft? The Painful Journey from Single-Node to Distributed Systems

1. Starting with a Single Node: Everything Was Simple

Imagine you’re developing a simple counter service:

// Single-machine version - works perfectly
type Counter struct {
    value int
    mu    sync.Mutex
}

func (c *Counter) Increment() {
    c.mu.Lock()
    c.value++
    c.mu.Unlock()
}

func (c *Counter) Get() int {
    c.mu.Lock()
    defer c.mu.Unlock()
    return c.value
}

The Beautiful World of Single-Node Systems:

  • Data Consistency: Only one copy of data exists, eliminating conflicts
  • Atomic Operations: Locking mechanisms ensure operations aren’t interrupted
  • Simple Failures: Either everything works, or the entire service crashes

2. The Cruel Reality: Limitations of Single-Server Architecture

But reality quickly slaps you in the face:

Issue 1: Single Point of Failure

User Request → [Single-Server Service] → Database
                        ↑
            Server Crash = Entire System Unavailable

Issue 2: Performance Bottlenecks

1000 concurrent requests → [Single-Server Service] → Overwhelmed, response slows down

Issue 3: Data Loss Risk

Server hard drive failure → All data lost → Business collapses

3. The Naive Solution: Simple Replication

You think: “Why not just set up multiple servers?”

User Request → Load Balancer → [Server1] [Server2] [Server3]
                                   ↓         ↓         ↓
                               Database1  Database2  Database3

It looks promising, but problems quickly emerge…

Problem 1: Data Inconsistency

Time T1: User A sends Increment() to Server 1 → Server 1's counter = 1
Time T2: User B sends Get() to Server 2 → Server 2's counter = 0 (not yet synchronized)

Result: User B sees outdated data!

Problem 2: Concurrency Conflicts

Time T1: Server 1 and Server 2 simultaneously receive Increment() requests
Server 1: Reads counter=5, calculates 5+1=6, writes 6
Server 2: Reads counter=5, calculates 5+1=6, writes 6

Expected result: 7
Actual result: 6 (One increment lost)

4. Further Attempt: Master-Slave Replication

You think: “Then I’ll set up one master server, with the rest as slave servers.”

Write requests → [Master Server] → Synchronized to → [Slave Server 1] [Slave Server 2]
Read requests  → [Slave Server 1] [Slave Server 2] (Distributes read load)

This seems promising, but new problems arise…

Problem 1: What if the master server crashes?

Master server crashes → All write operations fail → System becomes read-only

Problem 2: How to choose a new master server?

Slave Server 1: "I should be the new master!"
Slave Server 2: "No, I should be the new master!"
Slave Server 3: "You're both useless, I'll take it!"

Result: Three master servers coexist → Data becomes completely chaotic

Problem 3: Network Partitioning

Network failure causes:
[Master Server] ←→ [Slave Server 1]     Link normal
     ↕  (link down)
[Slave Server 2] ←→ [Slave Server 3]    Link normal within this pair, but cut off from the master

Slave servers 2 and 3 assume the master server failed and elect a new master
Now two master servers are operating simultaneously!

5. The Core Issue: Fundamental Challenges in Distributed Systems

Through the above experiments, we uncover the core challenges facing distributed systems:

5.1 Consistency Problem

  • How to maintain data consistency across multiple nodes?
  • How to handle concurrent updates?
  • How to guarantee atomicity of operations?

5.2 Availability

  • How does the system continue functioning when some nodes fail?
  • How to quickly detect and handle failures?
  • How to avoid single points of failure?

5.3 Partition Tolerance

  • How to handle network partitions?
  • How to prevent split-brain scenarios?
  • How to restore consistency after network recovery?

6. CAP Theorem: The Brutal Reality

The CAP Theorem states: In distributed systems, consistency (C), availability (A), and partition tolerance (P) cannot be achieved simultaneously. At most, only two can be guaranteed at any given time.

When a network partition occurs, you must choose:

Choose Consistency (C):
- Stop service and wait for network recovery
- Ensure data remains consistent
- But the system becomes unavailable

Choose Availability (A):
- Continue providing service
- But data inconsistencies may occur
- Conflicts must be resolved after network recovery

7. What Solution Do We Need?

Based on the above analysis, we require an algorithm that can:

  1. Automatically elect a leader: When the primary node fails, it can automatically select a new primary node
  2. Ensure data consistency: All nodes eventually maintain consistent data
  3. Handle network partitions: Correctly manage partitions to prevent split-brain scenarios
  4. Fault tolerance: System continues functioning normally with minor node failures
  5. Simplicity and clarity: Logical algorithm design for easy implementation and verification

This is precisely the problem the Raft algorithm solves.

Preview of the Next Section

In Part Two, we will explore how Raft ingeniously addresses these challenges:

  • The core concept of Raft: the strong leader model
  • Why the “majority” mechanism was chosen
  • The sophisticated design of the election process
  • How log consistency is guaranteed

In-Depth Analysis of the Raft Algorithm - Part Two: Core Design Principles of Raft

How Raft Ingeniously Solves Distributed Challenges

1. Raft’s Core Insight: The Strong Leader Model

Facing the chaos described in Part 1, Raft introduced a simple yet powerful concept:

“At any given moment, there can be only one leader in the cluster, and all write operations must be processed through the leader.”

Chaos in Traditional Multi-Master Models:
[Master1] โ†โ†’ [Master2] โ†โ†’ [Master3]  (Each can accept write requests, prone to conflicts)

Raft's Strong Leader Model:
[Leader] → [Follower1], [Follower2]  (Only the Leader accepts write requests)

Why is this design so crucial?

Problem recap: In Part 1, we saw how multiple master servers working simultaneously leads to data chaos.

Raft’s solution:

  • Only one Leader at any given time
  • All writes are serialized through the Leader
  • Followers only replicate the Leader’s actions

This fundamentally prevents concurrent write conflicts!

2. Why Choose the “Majority” Mechanism?

You might ask: “Why does Raft always emphasize ‘majority’? Why not require unanimous agreement?”

Let’s understand this through concrete examples:

Scenario 1: Requiring All Nodes to Agree

5-node cluster: [A] [B] [C] [D] [E]

Write operation flow:
1. Leader A receives write request
2. A sends replication requests to B, C, D, E
3. Waits for confirmation from all B, C, D, E
4. If node E experiences network failure → Entire write operation fails

Result: A single node failure renders the entire system unavailable

Scenario 2: Raft’s Majority Mechanism

5-node cluster: [A] [B] [C] [D] [E]

Write operation flow:
1. Leader A receives write request
2. A sends replication requests to B, C, D, E  
3. Success if 3 confirmations received (including itself)
4. Even if D and E fail, A, B, and C still provide confirmations

Result: Minor node failures do not impact system availability

Mathematical Principle of Majority

Key Insight: Any two majority sets must intersect!

5 nodes: [A] [B] [C] [D] [E]

Majority set 1: {A, B, C}
Majority set 2: {A, D, E}
Intersection: {A}

Majority set 1: {B, C, D}  
Majority set 2: {C, D, E}
Intersection: {C, D}

What does this mean?

  • If an operation is confirmed by a majority of nodes, any subsequent majority decisions will “see” that operation
  • This guarantees consistency: because any two majority sets must overlap, they cannot independently make conflicting decisions (see the sketch below)
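
To make the counting concrete, here is a minimal Go sketch of the quorum arithmetic (quorumSize is an illustrative helper, not taken from any particular Raft implementation):

// quorumSize is the smallest number of nodes that forms a majority.
func quorumSize(clusterSize int) int {
    return clusterSize/2 + 1
}

// Two groups of at least quorumSize(n) nodes drawn from the same n nodes
// must share at least one member: their combined size 2*(n/2+1) is always
// greater than n, so by the pigeonhole principle they cannot be disjoint.
// For n = 5: quorumSize(5) = 3, and 3 + 3 = 6 > 5.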

3. Split-Brain Protection During Network Partitions

Now let’s examine how Raft resolves the most challenging split-brain issue:

Scenario: Network Partition

Original cluster: [A] [B] [C] [D] [E], with A as Leader

After network partition:
Partition 1: [A] [B]        (2 nodes)
Partition 2: [C] [D] [E]    (3 nodes)

Problems with Traditional Approaches

Partition 1: A believes it remains Leader and continues processing write requests
Partition 2: C, D, E elect a new Leader (e.g., C) and also process write requests

Result: Two Leaders operating simultaneously → Split-Brain!

Raft’s Solution

Partition 1: [A] [B] (2 nodes, less than majority)
- A cannot obtain majority confirmation
- A automatically demotes to Follower
- Partition 1 becomes read-only

Partition 2: [C] [D] [E] (3 nodes, majority)  
- Can elect a new Leader
- Continues processing write requests
- System remains available

Key Points:

  • Only partitions with majority nodes can elect a Leader
  • Minority partitions automatically become read-only to prevent split-brain
  • After network recovery, nodes in minority partitions synchronize data from majority partitions

4. Raft’s Three Role States

Raft designs each node to be in one of three states:

4.1 Follower

// Pseudocode
type Follower struct {
    currentTerm int64
    votedFor    string
    log         []LogEntry
}

func (f *Follower) HandleMessage(msg Message) {
    switch msg.Type {
    case AppendEntries:
        // Receive log replication from Leader
        f.appendEntries(msg)
    case RequestVote:
        // Handle vote request
        f.handleVoteRequest(msg)
    case Timeout:
        // Timed out waiting for Leader heartbeat, transition to Candidate
        f.becomeCandidate()
    }
}

Responsibilities of a Follower:

  • Passively receive log replication from the Leader
  • Respond to vote requests
  • Detect Leader failures (heartbeat timeouts)

4.2 Candidate

type Candidate struct {
    currentTerm int64
    votedFor    string
    log         []LogEntry
    votes       map[string]bool
}

func (c *Candidate) StartElection() {
    c.currentTerm++           // Increment term number
    c.votedFor = c.nodeID     // Vote for self
    c.votes[c.nodeID] = true  // Record own vote
    
    // Send vote requests to all other nodes
    for _, peer := range c.peers {
        go c.sendVoteRequest(peer)
    }
}

Candidate Responsibilities:

  • Initiate elections
  • Collect votes
  • Transition to Leader or Follower based on results

4.3 Leader

type Leader struct {
    currentTerm int64
    log         []LogEntry
    nextIndex   map[string]int64  // Next log index for each Follower
    matchIndex  map[string]int64  // Highest log index replicated by each Follower
}

func (l *Leader) HandleClientRequest(req ClientRequest) {
    // 1. Add the request to the local log
    entry := LogEntry{
        Term:    l.currentTerm,
        Index:   len(l.log) + 1,
        Command: req.Command,
    }
    l.log = append(l.log, entry)
    
    // 2. Replicate in parallel to all Followers
    for _, peer := range l.peers {
        go l.replicateLog(peer)
    }
}

Responsibilities of the Leader:

  • Handle client requests
  • Replicate logs to Followers
  • Send heartbeats to maintain authority (see the sketch after this list)
  • Determine when to commit logs
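
A heartbeat in Raft is just an AppendEntries RPC that carries no log entries. As a rough sketch in the same pseudocode style used throughout this article (runHeartbeats and the 50ms interval are illustrative, not from a specific implementation), the Leader might drive heartbeats like this:

func (l *Leader) runHeartbeats() {
    ticker := time.NewTicker(50 * time.Millisecond) // well below the election timeout
    defer ticker.Stop()

    for range ticker.C {
        for _, peer := range l.peers {
            // An empty Entries slice makes this a pure heartbeat; it still
            // carries the Leader's Term and commit index, so Followers reset
            // their election timers and learn how far they may apply entries.
            go l.sendAppendEntries(peer, nil)
        }
    }
}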

5. The Ingenious Design of State Transitions

At startup, all nodes are Followers
          │
          ▼
     [Follower] ◄──────────────────────────────┐
          │                                    │
          │ Timeout / no heartbeat received    │
          ▼                                    │
     [Candidate] ──────────────────────────────┤
          │         Election failure /         │
          │         detects a higher term      │
          │ Gains majority votes               │
          ▼                                    │
      [Leader] ────────────────────────────────┘
                Detects a higher term /
                loses majority support

State Transition Triggers

Follower โ†’ Candidate:

  • No Leader heartbeat received within election timeout
  • Indicates Leader may have failed

Candidate โ†’ Leader:

  • Received majority votes
  • Becomes new Leader

Candidate โ†’ Follower:

  • Election failed (other node became Leader)
  • Discovered higher term number

Leader โ†’ Follower:

  • Discovers a higher term number
  • Loses majority support due to network partition

6. Term: Raft’s Logical Clock

Raft introduces the “term number” concept to resolve timing issues:

Timeline:
Term 1: [A is Leader] ─────────────────→ A fails
Term 2: [Elections in progress...] ────→ B becomes Leader
Term 3: [B is Leader] ─────────────────→ Network partition
Term 4: [Elections in progress...] ────→ C becomes Leader

Role of Term Number

1. Detecting Expired Messages:

if msg.Term < currentTerm {
    // This message is from an expired Leader; ignore it
    return
}

2. Discovering a New Leader:

if msg.Term > currentTerm {
    // Detect higher term, immediately transition to Follower
    currentTerm = msg.Term
    becomeFollower()
}

3. Preventing Duplicate Voting:

if msg.Term == currentTerm && votedFor != "" {
    // This term has already been voted on
    return false
}
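
The first two rules are usually folded into a single check that runs before any RPC is handled. A minimal sketch in the same pseudocode style (checkTerm is an illustrative name):

// checkTerm applies the term rules to any incoming message and reports
// whether the message should be processed further.
func (n *Node) checkTerm(msgTerm int64) bool {
    if msgTerm < n.currentTerm {
        return false // stale message from an old term: ignore it
    }
    if msgTerm > n.currentTerm {
        // A newer term exists somewhere: adopt it and step down to Follower.
        n.currentTerm = msgTerm
        n.votedFor = ""
        n.becomeFollower()
    }
    return true // same (or newly adopted) term: safe to handle the message
}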

7. Why is this design so elegant?

Review the problems mentioned in Part 1 and see how Raft solves them:

Problem 1: How to elect a new leader?

Raft Solution:

  • Automatic election mechanism
  • Majority voting guarantees uniqueness
  • Term number prevents conflicts

Problem 2: How to avoid split-brain?

Raft Solution:

  • Only a majority of partitions can elect a Leader
  • Minority partitions automatically become read-only
  • Mathematically guarantees no dual leadership

Problem 3: How to ensure data consistency?

Raft Solution:

  • Strong leader model serializes all writes
  • Majority replication guarantees durability
  • Log replication ensures sequential consistency

Preview of the Next Section

In Part Three, we will delve into two core algorithms of Raft:

  • The detailed process of leader election
  • The sophisticated mechanism of log replication
  • How to handle various edge cases

By now, you should grasp Raft’s core design philosophyโ€”it elegantly resolves fundamental issues in distributed systems through a strong leader model and majority voting mechanism.

In-Depth Analysis of the Raft Algorithm - Part Three A: The Leader Election Algorithm

Election Algorithms: How to Produce a Single Leader Amid Chaos

1. Election Triggering Conditions

Elections do not occur arbitrarily but are triggered by specific conditions:

Trigger Condition 1: No Leader at Startup

Cluster Startup: [A] [B] [C] [D] [E]
All nodes are Followers, waiting for Leader heartbeat
Timeout detected with no Leader → Election begins

Trigger Condition 2: Leader Failure

Normal operation: Leader A → [B] [C] [D] [E]
A suddenly crashes → B, C, D, E wait for heartbeat timeout → Election begins

Trigger Condition 3: Network Partition

Before partition: Leader A → [B] [C] [D] [E]
After partition: [A] [B] | [C] [D] [E]
Right partition detects no heartbeat from A → Start election

2. Election Timeout: A Clever Design to Prevent Simultaneous Elections

Problem: What if all Followers start elections simultaneously?

At time T: A, B, C, D, E all time out simultaneously
A: "I want to be Leader!" → Sends vote requests to B, C, D, E
B: "I want to be Leader!" → Sends vote requests to A, C, D, E
C: "I want to be Leader!" → Sends vote requests to A, B, D, E
...

Result: Everyone votes for themselves, no one gains a majority → Election fails

Raft’s Solution: Randomized Election Timeout

// Each node has a different election timeout
func randomElectionTimeout() time.Duration {
    // Random value between 150ms - 300ms
    return time.Duration(150 + rand.Intn(150)) * time.Millisecond
}

Effect:

Time T: All nodes start timing
Time T+180ms: Node B times out first and initiates election
Time T+200ms: Node C times out but detects B already electing, votes for B
Time T+250ms: Node A times out, but B has already become Leader

3. Detailed Process of Voting Requests

When a Follower decides to initiate an election:

Step 1: Transition to Candidate state

func (n *Node) becomeCandidate() {
    n.state = Candidate
    n.currentTerm++           // Increment term number
    n.votedFor = n.nodeID     // Vote for self
    n.voteCount = 1           // Self-vote
    n.resetElectionTimer()    // Reset election timer
}

Step 2: Send vote request

type VoteRequest struct {
    Term         int64  // Candidate's term number
    CandidateID  string // Candidate ID
    LastLogIndex int64  // Index of candidate's last log entry
    LastLogTerm  int64  // Term number of candidate's last log entry
}

func (n *Node) sendVoteRequests() {
    req := VoteRequest{
        Term:         n.currentTerm,
        CandidateID:  n.nodeID,
        LastLogIndex: n.getLastLogIndex(),
        LastLogTerm:  n.getLastLogTerm(),
    }

    for _, peer := range n.peers {
        go n.sendVoteRequest(peer, req)
    }
}

4. Voting Decision: Not All Vote Requests Are Accepted

Nodes receiving vote requests must perform multiple checks:

Check 1: Term Number Verification

func (n *Node) handleVoteRequest(req VoteRequest) VoteResponse {
    if req.Term < n.currentTerm {
        // Expired candidate, reject vote
        return VoteResponse{
            Term:        n.currentTerm,
            VoteGranted: false,
        }
    }
    
    if req.Term > n.currentTerm {
        // Detect higher term, update own state
        n.currentTerm = req.Term
        n.votedFor = ""
        n.becomeFollower()
    }
}

Check 2: Duplicate Voting Check

if n.votedFor != "" && n.votedFor != req.CandidateID {
    // Already voted for someone else in this term
    return VoteResponse{
        Term:        n.currentTerm,
        VoteGranted: false,
    }
}

Check 3: Log Freshness Check (Critical!)

// The candidate's log must be at least as up-to-date as the voter's own log
lastLogIndex := n.getLastLogIndex()
lastLogTerm := n.getLastLogTerm()

if req.LastLogTerm < lastLogTerm {
    // Candidate's last logged term is older, reject
    return VoteResponse{VoteGranted: false}
}

if req.LastLogTerm == lastLogTerm && req.LastLogIndex < lastLogIndex {
    // Same term but smaller index, reject
    return VoteResponse{VoteGranted: false}
}

// Passed all checks, grant vote
n.votedFor = req.CandidateID
return VoteResponse{
    Term:        n.currentTerm,
    VoteGranted: true,
}

5. Why is log freshness verification necessary?

This is one of Raft’s most ingenious designs:

Scenario: What happens without log verification?

Initial state:
Leader A: [log1, log2, log3]  (latest)
Node B:   [log1, log2]        (lagging)
Node C:   [log1, log2, log3]  (latest)

After A crashes:
B initiates election, C votes for B → B becomes new Leader
But B's logs are older than C's! B overwrites C's log3 → Data loss!

Raft’s Safeguard Mechanism:

B initiates election, sending vote request to C:
B's LastLogIndex=2, LastLogTerm=1
C's LastLogIndex=3, LastLogTerm=1

C checks: B's log is older than mine → Rejects vote
B fails to obtain majority votes → Election fails

Result: Only the node with the latest log can become Leader, ensuring no data loss!
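
The two checks from section 4 are often folded into a single helper, which is effectively what C evaluates before rejecting B's request above. A hedged sketch reusing the VoteRequest fields and getLastLog* helpers shown earlier (candidateLogIsCurrent is an illustrative name):

// candidateLogIsCurrent reports whether the candidate's log is at least as
// up-to-date as ours: compare the term of the last entry first, and fall
// back to the log length only when those terms are equal.
func (n *Node) candidateLogIsCurrent(req VoteRequest) bool {
    myLastTerm := n.getLastLogTerm()
    myLastIndex := n.getLastLogIndex()

    if req.LastLogTerm != myLastTerm {
        return req.LastLogTerm > myLastTerm
    }
    return req.LastLogIndex >= myLastIndex
}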

6. Three Election Outcomes

Case 1: Obtains majority votes and becomes Leader

func (n *Node) handleVoteResponse(resp VoteResponse) {
    if resp.VoteGranted {
        n.voteCount++
        // Majority of the full cluster (peers plus this node itself)
        if n.voteCount > (len(n.peers)+1)/2 {
            n.becomeLeader()
        }
    }
}

func (n *Node) becomeLeader() {
    n.state = Leader
    n.initializeLeaderState()
    n.sendHeartbeats() // Immediately send heartbeats to establish authority
}

Case 2: Discovering a higher term, transition to Follower

if resp.Term > n.currentTerm {
    n.currentTerm = resp.Term
    n.becomeFollower()
}

Case 3: Election Timeout, Restart Election

func (n *Node) onElectionTimeout() {
    if n.state == Candidate {
        // Election failed, restart
        n.becomeCandidate()
        n.sendVoteRequests()
    }
}

7. Election Liveness Guarantee

Problem: Could an infinite loop occur where no Leader is ever elected?

Theoretically possible:

Round 1: A and B run elections simultaneously, tied vote count โ†’ Failed
Round 2: A and B run elections simultaneously again, tied vote count โ†’ Failed
...Infinite loop

Raft’s Solution:

  1. Randomized Timeouts: Reduces probability of simultaneous elections
  2. Exponential Backoff: Increases wait time after election failures
  3. Mathematical Proof: Under the assumption of eventually reliable networks, a Leader will ultimately be elected

8. Complete Flowchart of the Election Algorithm

[Follower]
    │
    │ Election timeout
    ▼
[Candidate]
    │
    ├─ Increment term number
    ├─ Vote for self
    ├─ Reset election timer
    └─ Send vote request
    │
    ▼
Wait for vote response
    │
    ├─ Obtain majority votes ─────────→ [Leader]
    ├─ Discover higher term number ───→ [Follower]
    └─ Election timeout ──────────────→ Re-elect

Summary

The ingenuity of the election algorithm lies in:

  1. Randomized Timeouts: Avoids conflicts during simultaneous elections
  2. Term Number Mechanism: Resolves timing and conflict issues
  3. Log Freshness Check: Ensures the new Leader possesses the most complete data
  4. Majority Voting: Mathematically guarantees uniqueness

Next, we’ll explain the Log-Replication Algorithm, demonstrating how the Leader ensures data consistency across all nodes.

In-Depth Analysis of the Raft Algorithm - Part Three B: The Log Replication Algorithm

Log Replication: Ensuring Data Consistency Across All Nodes

1. Why are logs needed?

In distributed systems, “logs” are not debugging logs but ordered records of operation sequences:

Problems with the traditional approach

Client request: SET x=1
Leader updates directly: x=1
Then notifies Follower: x=1

Issue: If the notification is lost, the Follower's x remains the old value

Raft’s Logging Approach

Client request: SET x=1
Leader logs: [Index=1, Term=1, Command="SET x=1"]
Replicates log to Followers
Executes command only after majority replication confirmed

Key Insight: Log operations before executing them, ensuring replayability and consistency.

2. Log Entry Structure

type LogEntry struct {
    Index   int64       // Log index (starting from 1)
    Term    int64       // Term number when this log was created
    Command interface{} // Client command (e.g., "SET x=1")
}

// Example log sequence
log := []LogEntry{
    {Index: 1, Term: 1, Command: "SET x=1"},
    {Index: 2, Term: 1, Command: "SET y=2"},
    {Index: 3, Term: 2, Command: "DELETE x"},
    {Index: 4, Term: 2, Command: "SET z=3"},
}

Purpose of each field:

  • Index: Determines operation sequence, ensuring all nodes execute in the same order
  • Term: Identifies the Leader at creation time, used for conflict detection
  • Command: The actual operation to execute

3. Basic Log Replication Process

Step 1: Leader Receives Client Request

func (l *Leader) HandleClientRequest(cmd Command) {
    // 1. Create a new log entry
    entry := LogEntry{
        Index:   len(l.log) + 1,
        Term:    l.currentTerm,
        Command: cmd,
    }
    
    // 2. Append to local log (without execution)
    l.log = append(l.log, entry)
    
    // 3. Initiate replication to Followers
    l.replicateToFollowers(entry)
}

Step 2: Parallel replication to all Followers

func (l *Leader) replicateToFollowers(entry LogEntry) {
    for _, follower := range l.followers {
        go l.sendAppendEntries(follower, entry)
    }
}

Step 3: Wait for Majority Confirmation

func (l *Leader) waitForMajority(entry LogEntry) {
    confirmCount := 1 // The Leader counts itself as one confirmation

    for response := range l.responseChan {
        if response.Success {
            confirmCount++
            if confirmCount > len(l.cluster)/2 {
                // Majority confirmed, commit entry
                l.commitEntry(entry)
                break
            }
        }
    }
}

Step 4: Commit and Execute

func (l *Leader) commitEntry(entry LogEntry) {
    // 1. Mark as committed
    l.commitIndex = entry.Index
    
    // 2. Execute command
    result := l.stateMachine.Apply(entry.Command)
    
    // 3. Return result to client
    l.respondToClient(result)
    
    // 4. Notify Followers to commit
    l.notifyFollowersToCommit(entry.Index)
}

4. AppendEntries RPC: Core of Log Replication

This is the most critical RPC call in Raft:

AppendEntries Request Structure

type AppendEntriesRequest struct {
    Term         int64      // Leader's term number
    LeaderID     string     // Leader's ID
    PrevLogIndex int64      // Index of the log entry preceding the new entry
    PrevLogTerm  int64      // Term number of the log entry preceding the new entry
    Entries      []LogEntry // Log entries to replicate (empty during heartbeat)
    LeaderCommit int64      // Leader's commit index
}

Why are PrevLogIndex and PrevLogTerm required?

Key Question: How to ensure log continuity?

Incorrect replication approach:
Leader: [1, 2, 3, 4, 5]
Follower: [1, 2, ?, ?, ?]

Leader sends entries 4 and 5 directly → Follower becomes [1, 2, 4, 5]
Entry 3 is missing, log continuity is broken!

Raft’s Solution:

// When Leader sends entry 4
req := AppendEntriesRequest{
    PrevLogIndex: 3,        // Entry 4's predecessor is 3
    PrevLogTerm:  2,        // Entry 3's term number is 2
    Entries:      [entry4], // Entry to append
}

// Follower checks
if follower.log[req.PrevLogIndex-1].Term != req.PrevLogTerm {
    // Entry 3 (at slice position PrevLogIndex-1) does not match, reject appending entry 4
    return AppendEntriesResponse{Success: false}
}

5. Follower Log Consistency Check

Processing logic when a Follower receives an AppendEntries request:

func (f *Follower) handleAppendEntries(req AppendEntriesRequest) AppendEntriesResponse {
    // 1. Term check
    if req.Term < f.currentTerm {
        return AppendEntriesResponse{
            Term:    f.currentTerm,
            Success: false,
        }
    }
    
    // 2. Update term and Leader information
    if req.Term > f.currentTerm {
        f.currentTerm = req.Term
        f.votedFor = ""
    }
    f.leaderID = req.LeaderID
    f.resetElectionTimer() // Reset election timer upon receiving Leader message
    
    // 3. Log Consistency Check
    if req.PrevLogIndex > 0 {
        if len(f.log) < req.PrevLogIndex {
            // Log is too short, missing preceding entries
            return AppendEntriesResponse{Success: false}
        }
        
        if f.log[req.PrevLogIndex-1].Term != req.PrevLogTerm {
            // Term of preceding entry does not match
            return AppendEntriesResponse{Success: false}
        }
    }
    
    // 4. Add new entries
    for i, entry := range req.Entries {
        index := req.PrevLogIndex + i + 1
        
        if len(f.log) >= index {
            // Check for conflicts
            if f.log[index-1].Term != entry.Term {
                // Conflict detected, delete this position and all subsequent entries
                f.log = f.log[:index-1]
            }
        }
        
        if len(f.log) < index {
            // Add new entry
            f.log = append(f.log, entry)
        }
    }
    
    // 5. Update commit index
    if req.LeaderCommit > f.commitIndex {
        f.commitIndex = min(req.LeaderCommit, len(f.log))
        f.applyCommittedEntries()
    }
    
    return AppendEntriesResponse{
        Term:    f.currentTerm,
        Success: true,
    }
}
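
The applyCommittedEntries call in step 5 is where a Follower finally executes commands against its state machine. A minimal sketch of what it might look like (lastApplied and stateMachine follow the naming used elsewhere in this article, not any specific library):

// applyCommittedEntries feeds every newly committed entry to the state
// machine, in log order, exactly once.
func (f *Follower) applyCommittedEntries() {
    for f.lastApplied < f.commitIndex {
        f.lastApplied++
        entry := f.log[f.lastApplied-1] // log indices start at 1
        f.stateMachine.Apply(entry.Command)
    }
}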

6. Log Recovery: Handling Inconsistencies

How does Raft recover when a Follower’s log becomes inconsistent with the Leader’s?

Scenario: Log Divergence Caused by Network Partition

State before partition:
Leader A: [1, 2, 3]
Node B:   [1, 2, 3]
Node C:   [1, 2, 3]

Network partition: A | B,C
A continues receiving requests: [1, 2, 3, 4, 5]
B becomes new Leader, receiving requests: [1, 2, 3, 6, 7]

After partition recovery:
A: [1, 2, 3, 4, 5]  (Old Leader)
B: [1, 2, 3, 6, 7]  (New Leader)

Raft’s Recovery Process

Step 1: After B becomes Leader, attempt to replicate logs

// B sends AppendEntries to A
req := AppendEntriesRequest{
    PrevLogIndex: 5,     // B assumes A has 5 logs
    PrevLogTerm:  2,     // The 5th log's term should be 2
    Entries:      [],    // Sends heartbeat probe first
}

// A checks: My 5th log term is 1, not 2
// A replies: Success: false

Step 2: B decrements the index to find a consistent point

// B receives a failure response and decrements nextIndex[A] from 6 to 5
B.nextIndex[A] = 5

// Retry
req := AppendEntriesRequest{
    PrevLogIndex: 4,
    PrevLogTerm:  2,     // B's 4th entry (the one holding 6) was created in term 2
    Entries:      [],
}

// A checks: my 4th log entry's term is 1, not 2
// A replies: Success: false

Step 3: Continue decrementing until a match is found

// B continues decrementing
B.nextIndex[A] = 4

req := AppendEntriesRequest{
    PrevLogIndex: 3,
    PrevLogTerm:  1,     // B's 3rd entry was created in term 1
    Entries:      [],
}

// A checks: my 3rd log entry's term is indeed 1
// A replies: Success: true

Step 4: Overwrite from the consistent point

// B finds the consistent point and starts sending correct logs
req := AppendEntriesRequest{
    PrevLogIndex: 3,
    PrevLogTerm:  1,
    Entries:      [entry6, entry7], // B's 4th and 5th logs
}

// After A receives this:
// A deletes its own conflicting entries 4 and 5
// and appends the entries sent by B (6 and 7)
// Final result: A's log becomes [1, 2, 3, 6, 7]

7. Commit Rules: When Can Commands Be Executed?

Key Principle: Only log entries replicated by a majority of nodes can be committed for execution.

Commit Conditions

func (l *Leader) canCommit(index int64) bool {
    // 1. Must be a log entry from the current term
    if l.log[index-1].Term != l.currentTerm {
        return false
    }
    
    // 2. Must be replicated by a majority of nodes
    replicaCount := 1 // Leader itself
    for _, follower := range l.followers {
        if l.matchIndex[follower] >= index {
            replicaCount++
        }
    }
    
    return replicaCount > len(l.cluster)/2
}

Why can’t logs from old terms be committed?

Dangerous Scenario:

Initial state: Leader A, logs [1, 2]
A crashes, B becomes Leader, adds log 3: [1, 2, 3]
B crashes, A recovers as Leader, logs still [1, 2]

If A commits entry 2 directly and then crashes,
a node whose log never received entry 2 (call it C) could later become Leader and overwrite the already committed entry 2!

Raft’s Safety Rules:

  • Leaders can only commit logs from the current term
  • Committing logs from the current term implicitly commits all previous logs (see the sketch below)
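
Putting the two rules together, the Leader's commit-index advancement might look like the following sketch (built on the canCommit helper above; advanceCommitIndex is an illustrative name):

// advanceCommitIndex moves commitIndex forward to the highest index that is
// both replicated on a majority and created in the current term. Committing
// that entry implicitly commits every entry before it as well.
func (l *Leader) advanceCommitIndex() {
    for index := l.commitIndex + 1; index <= int64(len(l.log)); index++ {
        if l.canCommit(index) {
            l.commitIndex = index
        }
    }
}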

8. Log Replication Performance Optimization

Optimization 1: Batch Transmission

// Transmit in batches instead of sequentially
func (l *Leader) batchAppendEntries(follower string) {
    entries := l.getPendingEntries(follower)
    if len(entries) > 0 {
        l.sendAppendEntries(follower, entries)
    }
}

Optimization 2: Parallel Replication

// Send to all Followers in parallel
func (l *Leader) replicateInParallel() {
    var wg sync.WaitGroup
    for _, follower := range l.followers {
        wg.Add(1)
        go func(f string) {
            defer wg.Done()
            l.replicateToFollower(f)
        }(follower)
    }
    wg.Wait()
}

Optimization 3: Rapid Rollback

// Rapidly locate the consistent point when inconsistencies are detected
type AppendEntriesResponse struct {
    Success      bool
    Term         int64
    ConflictTerm int64  // Term of the conflicting entry
    FirstIndex   int64  // First index the Follower holds for that term
}
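
On the Leader side, these two extra fields let nextIndex jump back over an entire conflicting term per round trip instead of one entry at a time. A rough sketch (assuming ConflictTerm is zero when the Follower's log was simply too short, and lastIndexOfTerm is a hypothetical helper returning the Leader's last log index for a given term):

// onAppendEntriesRejected uses the conflict hints to move nextIndex back
// in large steps.
func (l *Leader) onAppendEntriesRejected(follower string, resp AppendEntriesResponse) {
    if resp.ConflictTerm == 0 {
        // The Follower's log was shorter than PrevLogIndex: retry from its end.
        l.nextIndex[follower] = resp.FirstIndex
        return
    }
    if last, ok := l.lastIndexOfTerm(resp.ConflictTerm); ok {
        // The Leader also has entries from the conflicting term:
        // retry just after its last entry of that term.
        l.nextIndex[follower] = last + 1
    } else {
        // The Leader has no entry from that term: skip the whole term.
        l.nextIndex[follower] = resp.FirstIndex
    }
}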

Summary

The ingenuity of the log replication algorithm lies in:

  1. Log entry structure: Index ensures order, Term detects conflicts
  2. Consistency checks: PrevLogIndex/PrevLogTerm guarantee continuity
  3. Conflict Resolution: Automatically roll back to a consistent point, then overwrite
  4. Commit Rule: Only commit logs replicated by a majority of nodes in the current term

This mechanism ensures all nodes eventually possess identical log sequences, thereby guaranteeing state machine consistency.

Next, we will cover Part Four: Safety Proofs and Boundary Condition Handling.

In-Depth Analysis of the Raft Algorithm - Part Four A: Safety Guarantees

Raft’s Safety: Why It Ensures Data Is Neither Lost Nor Corrupted

1. Raft’s Safety Commitments

The Raft algorithm makes several key safety guarantees:

Guarantee 1: Election Safety

Commitment: At most one Leader is elected during any given term

Why is this important?

If two Leaders exist simultaneously:
Leader A: Processes SET x=1
Leader B: Processes SET x=2
Result: Data conflict, uncertain system state

Guarantee 2: Leader Append-Only

Guarantee: A Leader never overwrites or deletes entries in its own log; it only appends new entries

Guarantee 3: Log Matching

Commitment: If entries in two logs share the same index and term number, then the two logs are identical in all entries up to and including that index.

Guarantee 4: Leader Completeness

Commitment: If a log entry is committed during a certain term, it will appear in the logs of all Leaders from that term onwards.

Guarantee 5: State Machine Safety

Guarantee: If a server applies a log entry at a certain index position, no other server will apply a different log entry at that index position.

2. Mathematical Proof of Election Safety

Theorem: At most one Leader exists within any given term T.

Proof:

Suppose two Leaders exist within term T: A and B.

To become Leader, A must obtain a majority vote: |votes_A| > n/2
To become Leader, B must obtain a majority vote: |votes_B| > n/2

Therefore: |votes_A| + |votes_B| > n

However, each node can vote at most once during term T,
so the two vote sets must be disjoint: votes_A ∩ votes_B = ∅

This implies: |votes_A| + |votes_B| ≤ n

Contradiction! Hence, there exists at most one Leader during term T.

3. Proof of Leader Completeness

Theorem: If a log entry is committed during term T, it will appear in the logs of all Leaders for terms > T.

Proof Approach:

  1. Committing a log entry means it is replicated by a majority of nodes.
  2. Becoming a new Leader requires obtaining a majority vote.
  3. Any two majority sets must intersect
  4. Log freshness checks during voting ensure the new Leader possesses the most complete log

4. Why Can’t Logs from Previous Terms Be Committed?

Hazardous Scenario

S1: [1, 2] (Term=2, Leader)
S2: [1, 2]
S3: [1]

S1 crashes, S3 becomes Leader (Term=3), adding entry 3:
S3: [1, 3] (Term=3, Leader)

S1 recovers and becomes Leader (Term=4):
If S1 directly commits entry 2 and then crashes,
S3 re-becomes Leader and overwrites the already committed entry 2!

Raft’s Solution

// Only commit logs from the current term
if l.log[index-1].Term != l.currentTerm {
    return false
}

5. Safety During Network Partitions

Partition: [A,B] | [C,D,E]

Partition 1 (Minority):
- Cannot obtain majority confirmation
- Enters read-only state

Partition 2 (Majority):
- Can elect a new Leader
- Continues processing write requests

Network Recovery:
- A automatically demoted to Follower
- Synchronizes logs from the new Leader

Summary

Raft guarantees safety through rigorous mathematical proofs:

  • Majority mechanism prevents split-brain
  • Log inspection ensures integrity
  • Commit rules guarantee consistency

The next section will cover boundary condition handling.

In-Depth Analysis of the Raft Algorithm - Part Four B: Boundary Conditions and Implementation Details

The Devil Is in the Details: Boundary Conditions in Raft Implementation

1. Timing Issues: The Ingenious Design of Election Timeouts

Problem: How to Avoid Election Deadlocks?

Scenario: Two nodes simultaneously initiate elections

Time T: A and B both time out and initiate elections
A โ†’ C: Request vote (Term=5)
B โ†’ C: Request vote (Term=5)

C can only vote for one; assume C votes for A
A gains 2 votes (self + C), B gains 1 vote (self)
Neither achieves majority โ†’ Election fails

Next round: A and B may time out again...

Solution: Randomize timeout durations

func (n *Node) resetElectionTimer() {
    // Random timeout between 150-300ms
    timeout := time.Duration(150+rand.Intn(150)) * time.Millisecond
    n.electionTimer.Reset(timeout)
}

Why does this work?

  • Mathematically, randomization breaks synchronization
  • Even if simultaneous timeouts occur occasionally, the next timeout duration will differ
  • Ultimately, one node will always “sneak ahead” successfully

Timeout Selection Principle

Election Timeout >> Heartbeat Interval >> Network Round-Trip Time

Typical Configuration:
- Heartbeat Interval: 50ms
- Election timeout: 150-300ms  
- Network RTT: 1-5ms

Reasons:
- Avoid false elections due to network jitter
- Allow sufficient time for the Leader to send heartbeats
- Fault detection must occur faster than the election timeout
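
These relationships are easy to state as a small configuration check. A sketch with illustrative names and multipliers (the concrete values mirror the typical configuration above; real implementations tune them differently):

type TimingConfig struct {
    HeartbeatInterval  time.Duration // e.g. 50 * time.Millisecond
    ElectionTimeoutMin time.Duration // e.g. 150 * time.Millisecond
    ElectionTimeoutMax time.Duration // e.g. 300 * time.Millisecond
}

// validate enforces: election timeout >> heartbeat interval >> network RTT.
func (c TimingConfig) validate(networkRTT time.Duration) error {
    if c.HeartbeatInterval < 4*networkRTT {
        return fmt.Errorf("heartbeat interval too close to network round-trip time")
    }
    if c.ElectionTimeoutMin < 3*c.HeartbeatInterval {
        return fmt.Errorf("election timeout too close to heartbeat interval")
    }
    if c.ElectionTimeoutMax <= c.ElectionTimeoutMin {
        return fmt.Errorf("election timeout range is empty")
    }
    return nil
}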

2. Log Compression: Snapshot Mechanism

Problem: How to handle infinitely growing logs?

As the system runs, logs grow increasingly long:
[1, 2, 3, 4, 5, ..., 1000000]

Issues:
- Excessive memory consumption
- Prolonged synchronization time for new nodes
- Extended recovery time during restarts

Solution: Snapshot Mechanism

type Snapshot struct {
    LastIncludedIndex int64  // Index of the last log entry included in the snapshot
    LastIncludedTerm  int64  // Term of the last log entry included in the snapshot
    Data              []byte // State machine snapshot data
}

func (n *Node) createSnapshot() {
    // 1. Create state machine snapshot
    snapshotData := n.stateMachine.CreateSnapshot()
    
    // 2. Record snapshot metadata
    snapshot := Snapshot{
        LastIncludedIndex: n.lastApplied,
        LastIncludedTerm:  n.log[n.lastApplied-1].Term,
        Data:              snapshotData,
    }
    
    // 3. Save the snapshot
    n.saveSnapshot(snapshot)
    
    // 4. Delete logged entries included in the snapshot
    n.log = n.log[n.lastApplied:]
    
    // 5. Adjust indices
    n.adjustIndices(n.lastApplied)
}

Snapshot Transmission and Installation

// Leader sends snapshot to lagging Follower
type InstallSnapshotRequest struct {
    Term              int64
    LeaderID          string
    LastIncludedIndex int64
    LastIncludedTerm  int64
    Data              []byte
}

func (f *Follower) handleInstallSnapshot(req InstallSnapshotRequest) {
    // 1. Check term
    if req.Term < f.currentTerm {
        return InstallSnapshotResponse{Term: f.currentTerm}
    }
    
    // 2. Install snapshot
    f.stateMachine.InstallSnapshot(req.Data)
    
    // 3. Update log state
    f.log = []LogEntry{} // Clear logs
    f.lastApplied = req.LastIncludedIndex
    f.commitIndex = req.LastIncludedIndex
    
    // 4. Save snapshot metadata
    f.lastIncludedIndex = req.LastIncludedIndex
    f.lastIncludedTerm = req.LastIncludedTerm
}

3. Member Changes: Dynamically Adjusting Cluster Size

Problem: How to safely add/remove nodes?

Naive Approach: Directly switch configurations

Old configuration: [A, B, C] (3 nodes)
New configuration: [A, B, C, D, E] (5 nodes)

Issue: Different nodes may observe configuration changes at different times
A, B see old config: 2 votes needed to become Leader
C, D, E see new config: 3 votes needed to become Leader
Potential for two concurrent Leaders!

Raft’s Solution: Joint Consensus

Phase 1: Enter joint configuration C_old,new
- Requires majority votes in both C_old and C_new
- Example: C_old=[A,B,C], C_new=[A,B,C,D,E]
- Becoming Leader requires: 2 votes in C_old AND 3 votes in C_new

Phase 2: Switch to new configuration C_new
- Only requires majority in C_new
- Safely completes configuration change

type Configuration struct {
    Old []string // Old configuration node list
    New []string // New configuration node list
}

func (l *Leader) addServer(newServer string) {
    // 1. Create joint configuration
    jointConfig := Configuration{
        Old: l.config.Servers,
        New: append(l.config.Servers, newServer),
    }
    
    // 2. Submit joint configuration
    l.proposeConfigChange(jointConfig)
    
    // 3. Submit new configuration after joint config commits
    go func() {
        <-l.jointConfigCommitted
        newConfig := Configuration{
            Old: nil,
            New: jointConfig.New,
        }
        l.proposeConfigChange(newConfig)
    }()
}

4. Network Partition Recovery: Data Synchronization

Scenario: Recovery after a prolonged partition

Before partitioning: [A, B, C, D, E], where A is Leader

Long-term partitioning:
Partition 1: [A, B] (processed 100 requests)
Partition 2: [C, D, E] (C became Leader, processed 200 requests)

Partition recovery: How to synchronize data?

Recovery Process

func (f *Follower) handleAppendEntries(req AppendEntriesRequest) {
    // 1. Detect higher term, immediately transition to Follower
    if req.Term > f.currentTerm {
        f.currentTerm = req.Term
        f.votedFor = ""
        f.state = Follower
    }
    
    // 2. Log consistency check failed
    if req.PrevLogIndex > len(f.log) {
        return AppendEntriesResponse{
            Success:      false,
            ConflictIndex: len(f.log) + 1,
        }
    }
    
    // 3. Conflict detected; delete conflicting log and all subsequent logs
    if f.log[req.PrevLogIndex-1].Term != req.PrevLogTerm {
        conflictTerm := f.log[req.PrevLogIndex-1].Term
        conflictIndex := req.PrevLogIndex
        
        // Find the first index of the conflicting term
        for i := req.PrevLogIndex - 1; i >= 0; i-- {
            if f.log[i].Term != conflictTerm {
                break
            }
            conflictIndex = i + 1
        }
        
        // Delete conflicting logs
        f.log = f.log[:conflictIndex-1]
        
        return AppendEntriesResponse{
            Success:       false,
            ConflictTerm:  conflictTerm,
            ConflictIndex: conflictIndex,
        }
    }
}

5. Client Interaction: Idempotency and Duplicate Detection

Issue: Duplicate Execution Due to Network Retransmission

Client sends: SET x=1
Leader executes: x=1
Response lost, client retries: SET x=1
Leader executes again: x=1 (duplicate execution)

Solution: Client Session and Sequence ID

type ClientRequest struct {
    ClientID   string // Client unique identifier
    SequenceID int64  // Request sequence ID
    Command    interface{}
}

type ClientSession struct {
    LastSequenceID int64      // Last processed sequence ID
    LastResponse   interface{} // Last response result
}

func (l *Leader) handleClientRequest(req ClientRequest) {
    session := l.clientSessions[req.ClientID]
    
    // Check for duplicate requests
    if req.SequenceID <= session.LastSequenceID {
        // Return cached response
        return session.LastResponse
    }
    
    // Process new request
    entry := LogEntry{
        Index:      len(l.log) + 1,
        Term:       l.currentTerm,
        Command:    req,
        ClientID:   req.ClientID,
        SequenceID: req.SequenceID,
    }
    
    l.log = append(l.log, entry)
    l.replicateToFollowers(entry)
}

6. Performance Optimization: Batch Processing and Pipelining

Batch Processing: Reducing Network Overhead

func (l *Leader) batchReplication() {
    ticker := time.NewTicker(10 * time.Millisecond)
    
    for {
        select {
        case <-ticker.C:
            for _, follower := range l.followers {
                entries := l.getPendingEntries(follower)
                if len(entries) > 0 {
                    l.sendBatchAppendEntries(follower, entries)
                }
            }
        }
    }
}

Pipeline: Parallel Processing of Multiple Requests

func (l *Leader) pipelineReplication(follower string) {
    // Send the next batch without waiting for the previous response
    for {
        entries := l.getNextBatch(follower)
        go l.sendAppendEntries(follower, entries)
        
        // Adjust sending rate based on network conditions
        time.Sleep(l.calculateDelay(follower))
    }
}

7. Failure Detection: Heartbeat and Timeout

Precise Failure Detection

type FailureDetector struct {
    heartbeatInterval time.Duration
    timeoutThreshold  time.Duration
    lastHeartbeat     map[string]time.Time
}

func (fd *FailureDetector) isNodeAlive(nodeID string) bool {
    lastSeen := fd.lastHeartbeat[nodeID]
    return time.Since(lastSeen) < fd.timeoutThreshold
}

// Adaptive timeout: adjusts based on network conditions
func (fd *FailureDetector) updateTimeout(nodeID string, rtt time.Duration) {
    // Timeout = average RTT * multiplier + safety margin
    fd.timeoutThreshold = rtt*4 + 50*time.Millisecond
}

8. Persistence: WAL and State Recovery

Write-Ahead Log (WAL)

func (n *Node) persistState() error {
    state := PersistentState{
        CurrentTerm: n.currentTerm,
        VotedFor:    n.votedFor,
        Log:         n.log,
    }
    
    // Atomic write: write to temporary file, then rename
    tempFile := n.dataDir + "/state.tmp"
    if err := writeToFile(tempFile, state); err != nil {
        return err
    }
    
    return os.Rename(tempFile, n.dataDir+"/state.dat")
}

func (n *Node) recoverState() error {
    data, err := ioutil.ReadFile(n.dataDir + "/state.dat")
    if err != nil {
        return err
    }

    var state PersistentState
    if err := json.Unmarshal(data, &state); err != nil {
        return err
    }

    n.currentTerm = state.CurrentTerm
    n.votedFor = state.VotedFor
    n.log = state.Log

    return nil
}

Summary

Raft’s implementation details reveal the complexity of distributed systems:

  1. Timing Control: Randomization resolves synchronization issues
  2. Resource Management: Snapshot mechanism controls memory growth
  3. Dynamic Configuration: Joint consensus ensures safe changes
  4. Fault Tolerance & Recovery: Intelligent log repair mechanism
  5. Performance Optimization: Batch processing and pipelining techniques
  6. Persistence: WAL guarantees data integrity

These details transform Raft from a theoretical algorithm into a practical engineering implementation.