State Synchronization
When a new peer joins a group or falls too far behind the leader’s log, it needs a snapshot of the current state rather than replaying potentially thousands of log entries. Mosaik implements a distributed state sync mechanism that spreads the load across all group members.
Why not leader-only snapshots?
In standard Raft, the leader ships snapshots to lagging followers. This has two problems:
- Leader bottleneck — the leader must serialize and transmit potentially large state to each lagging follower.
- Snapshot inconsistency — the snapshot must be taken at a consistent point, which may block the leader.
Mosaik solves both by having all peers participate in snapshot creation and distribution.
The six-step process
1. Follower detects lag and sends RequestSnapshot.
2. Leader wraps the request as a log entry and replicates it.
3. All peers create a snapshot at the committed position.
4. Follower discovers which peers have snapshots.
5. Follower fetches batches from multiple peers in parallel.
6. Follower installs the snapshot and resumes normal operation.
Step 1: Detect lag
A follower realizes it is behind when it receives an AppendEntries message
referencing a log prefix it does not have. It sends a StateSync message
to the leader requesting a snapshot.
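A minimal sketch of that check, with illustrative stand-in types (the AppendEntries fields and the StateSync payload below are assumptions, not Mosaik's actual definitions):

```rust
// Sketch only: these types are illustrative stand-ins, not Mosaik's API.
struct AppendEntries { prev_log_index: u64 /* plus term, entries, ... */ }
struct StateSync { follower_id: u64, last_applied: u64 }

struct Follower { id: u64, last_log_index: u64 }

impl Follower {
    fn on_append_entries(&mut self, msg: &AppendEntries) -> Option<StateSync> {
        // The leader references a log prefix we do not have locally, so
        // request a snapshot instead of rejecting AppendEntries forever.
        if msg.prev_log_index > self.last_log_index {
            return Some(StateSync {
                follower_id: self.id,
                last_applied: self.last_log_index,
            });
        }
        None
    }
}
```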
Step 2: Replicate the request
The leader wraps the snapshot request as a special log entry and replicates it through the normal Raft consensus path. This ensures all peers see the request at the same log position.
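One way to picture this is a log entry type with a dedicated variant for the request; the names below are assumptions, not Mosaik's actual entry type:

```rust
// Illustrative sketch of a log entry with a snapshot-request variant.
enum LogEntry {
    /// A normal state-machine command.
    Command(Vec<u8>),
    /// A snapshot request wrapped by the leader. Once committed, every
    /// peer observes it at the same log index, so every peer snapshots
    /// the same logical state.
    SnapshotRequest { requester: u64 },
}
```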
Step 3: Create snapshots
When each peer commits the snapshot request entry, it creates a point-in-time snapshot of its state machine. Because all peers see the request at the same committed index, all snapshots are consistent — they represent the same logical state.
The im crate’s persistent data structures make snapshotting O(1) by
sharing structure with the live state (copy-on-write).
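As a rough illustration of why this is cheap: a map-backed state machine built on im::HashMap can snapshot by cloning, since the clone shares structure with the live map and later writes copy only the paths they touch. The type and method names here are illustrative, not Mosaik's internals:

```rust
use im::HashMap;

// Sketch: snapshotting is just a structural clone of the persistent map.
struct MapState {
    state: HashMap<String, String>,
}

impl MapState {
    fn snapshot(&self) -> HashMap<String, String> {
        // O(1): the clone shares nodes with the live map (copy-on-write).
        self.state.clone()
    }
}
```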
Step 4: Discover snapshot sources
The follower queries peers to find out which ones have a snapshot available for the requested position. Snapshots have a TTL (default 10 seconds), so they eventually expire if not fetched.
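A sketch of the expiry bookkeeping a peer might keep for a created snapshot; the struct is illustrative, and only the 10-second default comes from SyncConfig:

```rust
use std::time::{Duration, Instant};

// Illustrative bookkeeping for a snapshot held while waiting to be fetched.
struct StoredSnapshot {
    created_at: Instant,
    ttl: Duration, // default 10s, per SyncConfig::snapshot_ttl
}

impl StoredSnapshot {
    // A peer only advertises the snapshot while it has not expired.
    fn is_available(&self) -> bool {
        self.created_at.elapsed() < self.ttl
    }
}
```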
Step 5: Parallel fetching
The follower fetches snapshot data in batches from multiple peers simultaneously:
Follower ──batch request──► Peer A (items 0..2000)
         ──batch request──► Peer B (items 2000..4000)
         ──batch request──► Peer C (items 4000..6000)
         ...
Each batch contains up to fetch_batch_size (default 2000) items. By
distributing fetches across peers, the load is balanced and the follower
receives data faster.
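A sketch of the fan-out, assuming a hypothetical async fetch_batch(peer, range) helper and PeerId/SnapshotItem/FetchError types; Mosaik's actual fetch path may differ:

```rust
use futures::future::try_join_all;

// Illustrative only: fetch_batch, PeerId, SnapshotItem, and FetchError are
// assumed helpers, not Mosaik's real API.
async fn fetch_all(
    peers: &[PeerId],
    total_items: usize,
    batch_size: usize, // SyncConfig::fetch_batch_size, default 2000
) -> Result<Vec<SnapshotItem>, FetchError> {
    let mut requests = Vec::new();
    for (i, start) in (0..total_items).step_by(batch_size).enumerate() {
        let end = (start + batch_size).min(total_items);
        // Round-robin the batches over the available peers.
        let peer = &peers[i % peers.len()];
        requests.push(fetch_batch(peer, start..end));
    }
    // Run the batch requests concurrently, then flatten the results in order.
    let batches = try_join_all(requests).await?;
    Ok(batches.into_iter().flatten().collect())
}
```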
Step 6: Install and resume
Once all batches are received, the follower:
- Installs the snapshot as its new state machine state.
- Updates its committed cursor to the snapshot position.
- Replays any log entries received after the snapshot position.
- Resumes normal follower operation (including voting).
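Condensed into a sketch (every name below is an illustrative assumption rather than Mosaik's internals):

```rust
// Illustrative install path: Follower, entries_after, apply, and
// resume_normal_operation are stand-in names.
fn finish_sync(follower: &mut Follower, items: Vec<SnapshotItem>, snapshot_index: u64) {
    // Replace the state machine's contents with the fetched snapshot.
    follower.state_machine.install_snapshot(items);
    // Advance the committed cursor to the snapshot position.
    follower.committed_index = snapshot_index;
    // Re-apply any log entries that arrived after the snapshot position.
    for entry in follower.log.entries_after(snapshot_index) {
        follower.state_machine.apply(entry);
    }
    // Resume normal follower duties, including voting.
    follower.resume_normal_operation();
}
```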
Traits
SnapshotStateMachine
State machines that support sync must implement:
trait SnapshotStateMachine: StateMachine {
type Snapshot: Snapshot;
fn snapshot(&self) -> Self::Snapshot;
fn install_snapshot(&mut self, items: Vec<SnapshotItem>);
}
- snapshot() creates a serializable snapshot of the current state.
- install_snapshot() replaces the state with data received from peers.
Snapshot
The snapshot itself is iterable:
trait Snapshot {
fn len(&self) -> usize;
fn into_items(self) -> impl Iterator<Item = SnapshotItem>;
}
Each collection type implements this — for example, MapSnapshot yields
key-value pairs as serialized SnapshotItem bytes.
SnapshotItem
struct SnapshotItem {
key: Bytes,
value: Bytes,
}
The generic key/value format allows any collection type to participate in the same sync protocol.
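A hedged sketch of how these pieces could fit together for a simple byte-keyed map; the StateMachine supertrait impl is omitted, and the KvSnapshot/KvStateMachine names are made up for illustration:

```rust
use bytes::Bytes;
use std::collections::HashMap;

// Illustrative snapshot type: a flat list of key/value pairs.
struct KvSnapshot {
    entries: Vec<(Bytes, Bytes)>,
}

impl Snapshot for KvSnapshot {
    fn len(&self) -> usize {
        self.entries.len()
    }

    fn into_items(self) -> impl Iterator<Item = SnapshotItem> {
        self.entries
            .into_iter()
            .map(|(key, value)| SnapshotItem { key, value })
    }
}

struct KvStateMachine {
    data: HashMap<Bytes, Bytes>,
}

impl KvStateMachine {
    // Mirrors SnapshotStateMachine::snapshot(): capture the current entries.
    fn snapshot(&self) -> KvSnapshot {
        KvSnapshot {
            entries: self.data.iter().map(|(k, v)| (k.clone(), v.clone())).collect(),
        }
    }

    // Mirrors SnapshotStateMachine::install_snapshot(): replace local state
    // with the items fetched from peers.
    fn install_snapshot(&mut self, items: Vec<SnapshotItem>) {
        self.data = items.into_iter().map(|item| (item.key, item.value)).collect();
    }
}
```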
Configuration
State sync behavior is controlled by SyncConfig:
| Parameter | Default | Description |
|---|---|---|
| fetch_batch_size | 2000 | Maximum items per batch request |
| snapshot_ttl | 10s | How long a snapshot remains available |
| snapshot_request_timeout | 15s | Timeout for requesting a snapshot from peers |
| fetch_timeout | 5s | Timeout for each batch fetch operation |
Tuning tips
- Large state: Increase fetch_batch_size to reduce round trips, but watch memory usage.
- Slow networks: Increase fetch_timeout and snapshot_request_timeout.
- Fast churn: Increase snapshot_ttl if snapshots expire before followers can fetch them.
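For example, a configuration tuned for a large state over a slow network might look like this, assuming SyncConfig exposes these fields directly as a plain struct with Duration-typed timeouts:

```rust
use std::time::Duration;

// Hedged example values; only the field names come from the table above.
let sync_config = SyncConfig {
    fetch_batch_size: 5000,                            // fewer round trips, more memory per batch
    snapshot_ttl: Duration::from_secs(30),             // keep snapshots alive through slow fetches
    snapshot_request_timeout: Duration::from_secs(30),
    fetch_timeout: Duration::from_secs(15),
};
```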
Log replay sync
For groups that use Storage (log persistence) instead of in-memory state,
a LogReplaySync alternative exists. Instead of shipping snapshots, the
joining peer replays log entries from a replay provider — another peer
that streams its stored log entries.
This approach is simpler but only works when:
- The log fits in available storage.
- Replay is fast enough relative to new entries arriving.
The replay module provides ReplayProvider and ReplaySession types for
this path.
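Conceptually, the joining peer's side reduces to streaming entries from the provider and applying them in order; the names below (ReplaySource, next_batch, apply) are illustrative stand-ins, not the actual ReplayProvider/ReplaySession API:

```rust
// Conceptual sketch only; not Mosaik's real replay types or methods.
async fn replay_from_peer(
    state_machine: &mut impl StateMachine,
    mut source: ReplaySource,
) -> Result<(), SyncError> {
    // Pull stored log entries from the providing peer, in log order,
    // and apply each one until the stream is exhausted.
    while let Some(batch) = source.next_batch().await? {
        for entry in batch {
            state_machine.apply(entry);
        }
    }
    Ok(())
}
```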
Collections and state sync
All built-in collection types (Map, Vec, Set, Register, Once,
PriorityQueue) implement SnapshotStateMachine. Their internal state
machines produce snapshots:
MapStateMachine ──snapshot()──► MapSnapshot (HashMap entries)
VecStateMachine ──snapshot()──► VecSnapshot (indexed entries)
SetStateMachine ──snapshot()──► SetSnapshot (set members)
RegStateMachine ──snapshot()──► RegisterSnapshot (optional value)
OnceStateMachine ──snapshot()──► OnceSnapshot (optional value)
DepqStateMachine ──snapshot()──► PriorityQueueSnapshot (by_key + by_priority)
The dual-index structure of PriorityQueue is preserved across sync — both
the by_key and by_priority indices are transmitted and reconstructed.
Register and Once snapshots are trivial (at most one item), so sync
completes in a single batch.
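As a conceptual picture of what "dual-index" means here, the same entries are reachable by key and in priority order; the field names and the rebuild-from-pairs strategy below are illustrative assumptions, not the actual DepqStateMachine layout:

```rust
use std::collections::{BTreeMap, HashMap};

// Illustrative dual index over (key, priority) pairs.
struct DualIndex {
    by_key: HashMap<String, u64>,             // key -> priority
    by_priority: BTreeMap<u64, Vec<String>>,  // priority -> keys at that priority
}

impl DualIndex {
    // One way both indices could be rebuilt from flat snapshot pairs.
    fn from_items(items: Vec<(String, u64)>) -> Self {
        let mut by_key = HashMap::new();
        let mut by_priority: BTreeMap<u64, Vec<String>> = BTreeMap::new();
        for (key, priority) in items {
            by_key.insert(key.clone(), priority);
            by_priority.entry(priority).or_default().push(key);
        }
        DualIndex { by_key, by_priority }
    }
}
```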