Decorative rotating star animation
Back to Blog
July 28, 2025
7 min read

Raftokay: Building Raft Consensus From Scratch

A deep dive into implementing Raft consensus in Go: leader election, log replication, persistence, and chaos testing included.

Distributed SystemsRaftConsensusGoOpen Source
Raftokay: Building Raft Consensus From Scratch

Inside Raftokay


A production-ready Raft implementation built in Go, with WAL persistence, snapshots, and a chaos simulator to prove it actually works.


Why Build Raft From Scratch?


Everyone reaches for etcd or Consul when they need consensus. Fair enough, they're battle-tested.


But here's the thing: if you don't understand what's happening underneath, you're flying blind when things break. And in distributed systems, things break in ways that make you question reality.


Raftokay exists because:


  • I wanted to understand Raft deeply, not just use it
  • Most "educational" implementations skip the hard parts (persistence, snapshots, recovery)
  • Production systems need chaos testing, not just happy-path unit tests
  • Go's concurrency model makes Raft's state machine almost pleasant to implement

  • This isn't a toy. It persists to disk. It recovers from crashes. It survives network partitions.


    The Raft Algorithm (30-Second Version)


    Raft solves distributed consensus: getting multiple machines to agree on a sequence of operations even when some machines crash or networks fail.


    The core insight is brutal simplicity: one leader, everyone else follows. The leader:


    1. Accepts client requests

    2. Replicates them to followers

    3. Commits them once a majority acknowledges


    If the leader dies, followers elect a new one. That's it. No complex multi-phase commits. No Byzantine fault tolerance. Just "follow the leader" with elections when needed.


    Leader Election: Randomized Chaos Control


    The election mechanism is elegant in its paranoia. Every server runs a timer. If it doesn't hear from a leader before the timer expires, it assumes the leader is dead and starts an election.


    The trick is randomized timeouts:


    // Election timeout: 150-300ms randomized
    electionTimeout := 150 + rand.Intn(150)

    Why random? Because simultaneous elections cause split votes. If all servers timed out at exactly 200ms, they'd all start elections, all vote for themselves, and no one would win. Randomization breaks symmetry.


    The state machine is clean:


  • Follower: Receives heartbeats, votes in elections, replicates logs
  • Candidate: Requests votes, times out and retries if election fails
  • Leader: Sends heartbeats every 50ms, replicates log entries, commits when majority acknowledges

  • Transitions are explicit:


    follower → candidate (election timeout)
    candidate → leader (won election)
    candidate → follower (discovered higher term)
    leader → follower (discovered higher term)

    No ambiguous states. No edge cases hidden in boolean flags.


    Log Replication: The Real Work


    Leader election is flashy, but log replication is where consensus actually happens.


    When the leader receives a command:


    1. Appends it to its local log with the current term number

    2. Sends AppendEntries RPCs to all followers

    3. Waits for a majority to acknowledge

    4. Commits the entry and applies it to the state machine

    5. Notifies followers to commit on the next heartbeat


    The consistency check is critical:


    // Reject if log doesn't contain entry at prevLogIndex with prevLogTerm
    if prevLogIndex > 0 {
        if len(rf.log) < prevLogIndex || rf.log[prevLogIndex-1].Term != prevLogTerm {
            return false
        }
    }

    This ensures followers have an identical log prefix before accepting new entries. If they don't, the leader backtracks until it finds the point of divergence and overwrites the follower's log.


    Aggressive? Yes. But it guarantees that once an entry is committed, it's committed forever.


    Persistence: Because Memory is a Lie


    Here's where most "educational" Raft implementations cheat: they skip persistence.


    In production, servers crash. Power fails. Disks get replaced. If your consensus algorithm only works when nothing ever restarts, it's not a consensus algorithm. It's a demo.


    Raftokay uses a write-ahead log (WAL):


    // Every write is immediately fsynced
    func (w *WAL) Append(entry Entry) error {
        encoded := gob.Encode(entry)
        if _, err := w.file.Write(encoded); err != nil {
            return err
        }
        return w.file.Sync() // fsync after every write
    }

    The Sync() call is non-negotiable. Without it, your "persisted" data lives in the OS buffer cache, which dies with the power.


    On startup, the server replays the WAL to reconstruct state:


    1. Load persisted term and votedFor

    2. Replay log entries

    3. Load snapshot if available

    4. Resume operation


    Crash at any point? No problem. The WAL contains everything needed to recover.


    Snapshots: Log Compaction


    Logs grow forever. That's a problem.


    If a server ran for a year without compaction, its log would contain millions of entries. Recovery would take hours. Disk usage would explode.


    Snapshots solve this:


    1. Serialize the current state machine to disk

    2. Record the last included log index and term

    3. Truncate the log before that point

    4. Continue appending new entries


    For slow or recovering followers, the leader sends snapshots via InstallSnapshot RPC instead of replaying thousands of entries.


    // Truncate log after snapshot
    func (w *WAL) TruncateBefore(index uint64) error {
        // Keep only entries after snapshot point
        w.entries = w.entries[index:]
        return w.rewrite()
    }

    The result: bounded log size, fast recovery, and efficient catch-up for lagging nodes.


    Chaos Testing: Break Everything on Purpose


    Unit tests are necessary but insufficient. They test the algorithm in isolation. Production involves:


  • Network partitions that heal and re-partition
  • Message delays measured in seconds
  • Out-of-order delivery
  • Duplicate messages
  • Partial failures (one direction works, other doesn't)

  • Raftokay includes a chaos simulator that injects all of these:


    // Simulate network partition
    sim.Partition([]int{0, 1}, []int{2, 3, 4})
    
    // Delay messages by 100-500ms
    sim.AddLatency(100, 500)
    
    // Drop 10% of messages
    sim.SetDropRate(0.1)
    
    // Heal partition and verify consistency
    sim.HealPartition()

    The simulator runs hundreds of operations while chaos is active, then verifies:


  • All committed entries are identical across servers
  • No committed entries were lost
  • The state machine is consistent

  • If Raft can survive simulated chaos, it has a shot at surviving real infrastructure.


    Code Highlights


    A few implementation decisions worth noting:


    Single-threaded state machine: All Raft state mutations happen in one goroutine. RPCs and timers send messages to this goroutine via channels. This eliminates lock contention and makes reasoning about state trivial.


    Term as the source of truth: Whenever a server sees a higher term, it immediately becomes a follower. No exceptions. This ensures stale leaders can't cause damage.


    Commit rule safety: Leaders only commit entries from the current term. This prevents a subtle bug where a new leader could prematurely commit entries from a previous term that hadn't actually achieved consensus.


    // Only commit entries from current term
    if entry.Term == rf.currentTerm && matchCount > len(rf.peers)/2 {
        rf.commitIndex = index
    }

    Running It


    The repo includes raftd (the server daemon) and raftctl (the CLI tool).


    Start a 3-node cluster:


    raftd --id 1 --peers "localhost:8001,localhost:8002,localhost:8003"
    raftd --id 2 --peers "localhost:8001,localhost:8002,localhost:8003"
    raftd --id 3 --peers "localhost:8001,localhost:8002,localhost:8003"

    Submit commands:


    raftctl set foo bar
    raftctl get foo
    # bar

    Kill a node, submit more commands, restart it, verify it catches up. That's the whole point.


    Summary


    Raftokay is a Raft implementation that doesn't cut corners:


  • Full leader election with randomized timeouts
  • Log replication with consistency guarantees
  • WAL persistence with fsync
  • Snapshot-based log compaction
  • Chaos testing that breaks things on purpose

  • It's not meant to replace etcd. It's meant to teach you what etcd is actually doing, so when your production cluster starts misbehaving at 3 AM, you'll understand why.


    Distributed systems don't reward ignorance. Study the algorithm. Read the code. Break things in simulation before production breaks them for you.


    View the repository →