Making Democracy Work: Fixing and Simplifying Egalitarian Paxos (Extended Version)

Alexey Gotsman; Fedor Ryabinin; Pierre Sutra

arxiv: 2511.02743 · v3 · submitted 2025-11-04 · 💻 cs.DC

Making Democracy Work: Fixing and Simplifying Egalitarian Paxos (Extended Version)

Fedor Ryabinin , Alexey Gotsman , Pierre Sutra This is my paper

Pith reviewed 2026-05-18 00:41 UTC · model grok-4.3

classification 💻 cs.DC

keywords state machine replicationleaderless consensusPaxosfailure recoverydistributed systemscrash faultsEgalitarian Paxos

0 comments

The pith

EPaxos* replaces the complex recovery in leaderless Paxos with a simpler proved-correct version and generalizes it to optimal failure thresholds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces EPaxos*, a corrected and simplified variant of Egalitarian Paxos for collaborative command ordering in distributed replication. Egalitarian Paxos avoids a single leader to prevent bottlenecks and keep throughput under crashes, while allowing fast execution when few failures occur and commands commute. The authors supply a simpler failure-recovery algorithm with a rigorous correctness proof and extend the protocol to all pairs of thresholds f and e where the replica count n meets or exceeds the maximum of 2e plus f minus 1 and 2f plus 1. This matters because the original protocol underpins many modern systems yet was hard to implement due to complexity and bugs.

Core claim

The central claim is that Egalitarian Paxos admits a simpler failure-recovery algorithm that has been rigorously proved correct, and that the protocol can be generalized to the entire spectrum of failure thresholds f and e satisfying n greater than or equal to the maximum of 2e plus f minus 1 and 2f plus 1, a number the authors show to be optimal.

What carries the argument

The redesigned failure-recovery algorithm that restores safety and liveness after crashes without a leader while supporting the generalized f and e thresholds.

Load-bearing premise

The model assumes only crash-stop failures and reliable delivery of messages between correct processes.

What would settle it

An execution with crashes that violates safety or liveness under the new recovery rules, or a working protocol that meets the fast-path and throughput goals with fewer than the claimed optimal number of processes.

Figures

Figures reproduced from arXiv: 2511.02743 by Alexey Gotsman, Fedor Ryabinin, Pierre Sutra.

read the original abstract

Classical state-machine replication protocols, such as Paxos, rely on a distinguished leader process to order commands. Unfortunately, this approach makes the leader a single point of failure and increases the latency for clients that are not co-located with it. As a response to these drawbacks, Egalitarian Paxos introduced an alternative, leaderless approach, that allows replicas to order commands collaboratively. Not relying on a single leader allows the protocol to maintain non-zero throughput with up to $f$ crashes of any processes out of a total of $n = 2f+1$. The protocol furthermore allows any process to execute a command $c$ fast, in $2$ message delays, provided no more than $e = \lceil\frac{f+1}{2}\rceil$ other processes fail, and all concurrently submitted commands commute with $c$; the latter condition is often satisfied in practical systems. Egalitarian Paxos has served as a foundation for many other replication protocols. But unfortunately, the protocol is very complex, ambiguously specified and suffers from nontrivial bugs. In this paper, we present EPaxos* -- a simpler and correct variant of Egalitarian Paxos. Our key technical contribution is a simpler failure-recovery algorithm, which we have rigorously proved correct. Our protocol also generalizes Egalitarian Paxos to cover the whole spectrum of failure thresholds $f$ and $e$ such that $n \ge \max\{2e+f-1, 2f+1\}$ -- the number of processes that we show to be optimal.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper cleans up Egalitarian Paxos with a simpler proved recovery algorithm and a broader optimal failure threshold.

read the letter

This paper gives us EPaxos*, a simpler take on Egalitarian Paxos with a new recovery algorithm that the authors prove correct. It also generalizes the protocol to a broader set of failure thresholds and shows that the bound is optimal. The original Egalitarian Paxos allowed leaderless ordering and fast execution for commuting commands, but it was complex and had bugs. This work simplifies the recovery procedure, which is often the tricky part in these protocols. A rigorous proof for the new recovery is the main advance, and the extension to different f and e combinations under the given condition on n adds flexibility without increasing the process count beyond what's necessary. The paper does well by directly tackling the usability issues that have limited adoption of the original protocol. If the proof holds up, this could make it easier for others to implement or build upon without running into the same specification ambiguities. The soft spots are minor at this stage. The abstract claims the proof but does not include a sketch, so the full manuscript needs to deliver on that. The assumptions remain the usual ones for Paxos variants: crash-stop failures and reliable message delivery among correct processes. Nothing in the description suggests a flaw in the high-level design or any circular reasoning. This kind of paper is for distributed systems researchers focused on consensus and replication. Practitioners looking for a more reliable version of leaderless Paxos would also benefit. It deserves a serious referee to go over the proof details and confirm the optimality result. I recommend sending it out for peer review.

Referee Report

0 major / 2 minor

Summary. The paper presents EPaxos*, a simplified and corrected variant of Egalitarian Paxos for leaderless state-machine replication. Its central claims are a simpler failure-recovery algorithm that has been rigorously proved correct, together with a generalization of the protocol to arbitrary f and e such that n ≥ max{2e+f-1, 2f+1}, which the authors show is optimal.

Significance. If the results hold, the work is significant for distributed computing. Egalitarian Paxos is foundational for many leaderless replication protocols; fixing its bugs and complexity while adding an optimal generalization with a rigorous proof supplies a cleaner, more reliable base for future systems. The rigorous proof of the recovery algorithm is a clear strength that the manuscript ships.

minor comments (2)

[Abstract] Abstract: the claim of a 'rigorous proof' of the recovery algorithm would be strengthened by a one-sentence high-level sketch or list of key invariants even in the abstract or introduction.
[Introduction] The optimality argument for n ≥ max{2e+f-1, 2f+1} is stated clearly but would benefit from an explicit sentence contrasting it with the original Egalitarian Paxos bounds to make the improvement immediate.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of our work, recognition of its significance for distributed computing, and recommendation of minor revision. We are pleased that the referee highlights the value of the simpler failure-recovery algorithm with its rigorous proof and the optimal generalization of the failure thresholds.

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper defines EPaxos* directly via message-exchange rules and a failure-recovery procedure, then supplies an independent rigorous proof of correctness for the recovery algorithm under standard crash-stop assumptions with reliable delivery between correct processes. The generalization to thresholds satisfying n ≥ max{2e+f-1, 2f+1} is presented as a mathematical bound that the authors separately establish as optimal; neither the protocol rules nor the optimality claim reduces to a fitted parameter, self-referential definition, or load-bearing self-citation. The central claims rest on explicit protocol text and external verification rather than any construction that equates output to input by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper relies on standard distributed-systems assumptions rather than introducing new free parameters or invented entities; the contribution lies in the protocol construction and its proof.

axioms (1)

domain assumption Crash-stop failure model with reliable message delivery between correct processes in an asynchronous network.
Invoked implicitly throughout the description of Paxos-style replication and the recovery procedure.

pith-pipeline@v0.9.0 · 5824 in / 1330 out tokens · 36634 ms · 2026-05-18T00:41:04.585917+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Our protocol also generalizes Egalitarian Paxos to cover the whole spectrum of failure thresholds f and e such that n ≥ max{2e+f-1, 2f+1} — the number of processes that we show to be optimal.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

The main technical novelty of EPaxos* is its failure recovery procedure, which significantly simplifies the one in the original EPaxos

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages

[1]

For every valuev and process p∈Π\E, there exists anE-faulty synchronous run in which onlypcallspropose(), the proposed value isv, andpdecides a value by time2∆

work page
[2]

Fix a subsetE⊆Πof size e

For every valuev and process p∈Π\E, there exists anE-faulty synchronous run in which all processes inΠ\E call propose(v)at the beginning of the first round, andp decides a value by time2∆. Fix a subsetE⊆Πof size e. Assume some valuev, a processp∈Π\E, and a run of the above construction in whichp calls propose(v)at the start of the first round and the othe...

work page 2025
[3]

Each command is pre-accepted at most once by each process

work page
[4]

Once a process moves a command to theaccepted or committed phase, it remains in one of these phases for the entire execution

work page
[5]

At every process and for every id, if abal[id] > 0, then phase[id] ∈ {accepted,committed}

work page
[6]

At any moment aftert, the processp has cmd[id] = c orcmd[id] =Nop, with the latter possible only ifphase[id]̸=preaccepted

Assume that at timet a process p pre-accepts or validates a commandid with a payloadc (i.e., p executes line 13 or line 83). At any moment aftert, the processp has cmd[id] = c orcmd[id] =Nop, with the latter possible only ifphase[id]̸=preaccepted

work page
[7]

If a process executes line 57 (or line 57t), then the condition in this line evaluates to true, and only for the dependency setD

Assume that a process sendsCommit(0,id, c, D)on the fast path (line 23). If a process executes line 57 (or line 57t), then the condition in this line evaluates to true, and only for the dependency setD

work page
[8]

If a process sends PreAcceptOK(id, D[,_])and PreAcceptOK(id′, D′[,_])for two com- mands with conflicting initial payloads, thenid∈D′orid ′∈D

work page
[9]

a.For anyAccept(b ′,id, c′, D′)sent, ifb ′≥b, thenc′=candD ′=D; b.For anyCommit(b ′,id, c′, D′)sent, ifb ′≥b, thenc′=candD ′=D

Assume that a quorum of processes has receivedAccept(b,id, c, D)and replied with AcceptOK(b,id). a.For anyAccept(b ′,id, c′, D′)sent, ifb ′≥b, thenc′=candD ′=D; b.For anyCommit(b ′,id, c′, D′)sent, ifb ′≥b, thenc′=candD ′=D

work page
[10]

a.For anyAccept(_,id, c′, D′)sent,c ′=candD ′=D; b.For anyCommit(_,id, c′, D′)sent,c ′=candD ′=D

Assume that a process sendsCommit(0,id, c, D)at line 23. a.For anyAccept(_,id, c′, D′)sent,c ′=candD ′=D; b.For anyCommit(_,id, c′, D′)sent,c ′=candD ′=D. Figure 9Additional invariants ofEPaxos*. All invariants hold for all three versions ofEPaxos*. and therefore initCoord(id) /∈R. Hence, every process inQ that pre-accepted id did so with the dependency s...

work page 2025
[11]

there exists a quorum Q such that every process q∈Q previously sent a PreAcceptOK(id, Dq[,_])message with ⋃ q∈QDq =D; or 2.a process previously sent anAccept(_,id, c, D)message from line 64 or line 73. Proof. By induction on the length of the execution. It is easy to see that the proposition holds initially. For the induction step, suppose a processp eith...

work page
[12]

there exists a quorum Q such that every process q∈Q previously sent a PreAcceptOK(id, Dq[,_])message with ⋃ q∈QDq =D; or 2.a process previously sent anAccept(_,id, c, D)message from line 64 or line 73. Proof. By induction on the length of the execution. It is easy to see that the proposition holds initially. For the induction step, suppose a processp eith...

work page 2025
[13]

Assume that both commands satisfy condition 1 of Proposition 18. Then there are two quorums Q and Q′such that every process q∈Q (respectively, q∈Q′) has sent a PreAcceptOK(id, Dq[,_])message (respectively, PreAcceptOK(id′, D′ q[,_])) with⋃ q∈QDq = D (respectively, ⋃ q∈Q′D′ q = D′). Since Q∩Q′̸= ∅, there exists a process that sends bothPreAcceptOK(id, Dq[,...

work page
[14]

Then there exists a quorumQ′such that every process q∈Q′has sent aPreAcceptOK(id′, D′ q[,_])message with ⋃ q∈Q′D′ q = D′

Assume that id′satisfies condition 1 of Proposition 18, whileid satisfies condition 2 (the symmetric case is analogous). Then there exists a quorumQ′such that every process q∈Q′has sent aPreAcceptOK(id′, D′ q[,_])message with ⋃ q∈Q′D′ q = D′. Furthermore, a process p has sent anAccept(_,id, c, D)from line 64 or line 73. LetQ be a quorum that enables the c...

work page
[15]

In this case, they both validate their dependencies at their recovery quorums at lines 60–61

Assume that both commands satisfy condition 2 of Proposition 18. In this case, they both validate their dependencies at their recovery quorums at lines 60–61. Thus, there exists a process that sendsValidateOK(_,id, I)and ValidateOK(_,id′, I′). Assume that it first sendsValidateOK(_,id′, I′)(the symmetric case is analogous). By line 84, at the F. Ryabinin,...

work page 2025
[16]

8.Processesp 1,p 2,p 3, andp 9 fail

Process p2 successfully commits id′with dep[id′] ={id}and seq[id′] = 1on the slow path using a quorum{p2, p3, p4, p8, p9}. 8.Processesp 1,p 2,p 3, andp 9 fail. 9.Processp 4 starts recoveringid. a. It broadcasts Prepare and receives f + 1 = 5replies from{p4, p5, p6, p7, p8}(page 41, step 1). b.Steps 2 on page 41 and 3–4 on page 42 are not applicable. c. Ho...

work page
[17]

Process p4 repeatedly attempts to recoverid without success (i.e., it indefinitely repeats item 9)

work page

[1] [1]

For every valuev and process p∈Π\E, there exists anE-faulty synchronous run in which onlypcallspropose(), the proposed value isv, andpdecides a value by time2∆

work page

[2] [2]

Fix a subsetE⊆Πof size e

For every valuev and process p∈Π\E, there exists anE-faulty synchronous run in which all processes inΠ\E call propose(v)at the beginning of the first round, andp decides a value by time2∆. Fix a subsetE⊆Πof size e. Assume some valuev, a processp∈Π\E, and a run of the above construction in whichp calls propose(v)at the start of the first round and the othe...

work page 2025

[3] [3]

Each command is pre-accepted at most once by each process

work page

[4] [4]

Once a process moves a command to theaccepted or committed phase, it remains in one of these phases for the entire execution

work page

[5] [5]

At every process and for every id, if abal[id] > 0, then phase[id] ∈ {accepted,committed}

work page

[6] [6]

At any moment aftert, the processp has cmd[id] = c orcmd[id] =Nop, with the latter possible only ifphase[id]̸=preaccepted

Assume that at timet a process p pre-accepts or validates a commandid with a payloadc (i.e., p executes line 13 or line 83). At any moment aftert, the processp has cmd[id] = c orcmd[id] =Nop, with the latter possible only ifphase[id]̸=preaccepted

work page

[7] [7]

If a process executes line 57 (or line 57t), then the condition in this line evaluates to true, and only for the dependency setD

Assume that a process sendsCommit(0,id, c, D)on the fast path (line 23). If a process executes line 57 (or line 57t), then the condition in this line evaluates to true, and only for the dependency setD

work page

[8] [8]

If a process sends PreAcceptOK(id, D[,_])and PreAcceptOK(id′, D′[,_])for two com- mands with conflicting initial payloads, thenid∈D′orid ′∈D

work page

[9] [9]

a.For anyAccept(b ′,id, c′, D′)sent, ifb ′≥b, thenc′=candD ′=D; b.For anyCommit(b ′,id, c′, D′)sent, ifb ′≥b, thenc′=candD ′=D

Assume that a quorum of processes has receivedAccept(b,id, c, D)and replied with AcceptOK(b,id). a.For anyAccept(b ′,id, c′, D′)sent, ifb ′≥b, thenc′=candD ′=D; b.For anyCommit(b ′,id, c′, D′)sent, ifb ′≥b, thenc′=candD ′=D

work page

[10] [10]

a.For anyAccept(_,id, c′, D′)sent,c ′=candD ′=D; b.For anyCommit(_,id, c′, D′)sent,c ′=candD ′=D

Assume that a process sendsCommit(0,id, c, D)at line 23. a.For anyAccept(_,id, c′, D′)sent,c ′=candD ′=D; b.For anyCommit(_,id, c′, D′)sent,c ′=candD ′=D. Figure 9Additional invariants ofEPaxos*. All invariants hold for all three versions ofEPaxos*. and therefore initCoord(id) /∈R. Hence, every process inQ that pre-accepted id did so with the dependency s...

work page 2025

[11] [11]

there exists a quorum Q such that every process q∈Q previously sent a PreAcceptOK(id, Dq[,_])message with ⋃ q∈QDq =D; or 2.a process previously sent anAccept(_,id, c, D)message from line 64 or line 73. Proof. By induction on the length of the execution. It is easy to see that the proposition holds initially. For the induction step, suppose a processp eith...

work page

[12] [12]

there exists a quorum Q such that every process q∈Q previously sent a PreAcceptOK(id, Dq[,_])message with ⋃ q∈QDq =D; or 2.a process previously sent anAccept(_,id, c, D)message from line 64 or line 73. Proof. By induction on the length of the execution. It is easy to see that the proposition holds initially. For the induction step, suppose a processp eith...

work page 2025

[13] [13]

Assume that both commands satisfy condition 1 of Proposition 18. Then there are two quorums Q and Q′such that every process q∈Q (respectively, q∈Q′) has sent a PreAcceptOK(id, Dq[,_])message (respectively, PreAcceptOK(id′, D′ q[,_])) with⋃ q∈QDq = D (respectively, ⋃ q∈Q′D′ q = D′). Since Q∩Q′̸= ∅, there exists a process that sends bothPreAcceptOK(id, Dq[,...

work page

[14] [14]

Then there exists a quorumQ′such that every process q∈Q′has sent aPreAcceptOK(id′, D′ q[,_])message with ⋃ q∈Q′D′ q = D′

Assume that id′satisfies condition 1 of Proposition 18, whileid satisfies condition 2 (the symmetric case is analogous). Then there exists a quorumQ′such that every process q∈Q′has sent aPreAcceptOK(id′, D′ q[,_])message with ⋃ q∈Q′D′ q = D′. Furthermore, a process p has sent anAccept(_,id, c, D)from line 64 or line 73. LetQ be a quorum that enables the c...

work page

[15] [15]

In this case, they both validate their dependencies at their recovery quorums at lines 60–61

Assume that both commands satisfy condition 2 of Proposition 18. In this case, they both validate their dependencies at their recovery quorums at lines 60–61. Thus, there exists a process that sendsValidateOK(_,id, I)and ValidateOK(_,id′, I′). Assume that it first sendsValidateOK(_,id′, I′)(the symmetric case is analogous). By line 84, at the F. Ryabinin,...

work page 2025

[16] [16]

8.Processesp 1,p 2,p 3, andp 9 fail

Process p2 successfully commits id′with dep[id′] ={id}and seq[id′] = 1on the slow path using a quorum{p2, p3, p4, p8, p9}. 8.Processesp 1,p 2,p 3, andp 9 fail. 9.Processp 4 starts recoveringid. a. It broadcasts Prepare and receives f + 1 = 5replies from{p4, p5, p6, p7, p8}(page 41, step 1). b.Steps 2 on page 41 and 3–4 on page 42 are not applicable. c. Ho...

work page

[17] [17]

Process p4 repeatedly attempts to recoverid without success (i.e., it indefinitely repeats item 9)

work page