On the Reliability of Networks of AI Agents: Density Evolution, Stopping Sets, and Architecture Optimization

Ehsan Aghazadeh; Hossein Pishro-Nik

arxiv: 2606.18121 · v1 · pith:WLFR52F4new · submitted 2026-06-16 · 💻 cs.MA · cs.IT· math.IT

On the Reliability of Networks of AI Agents: Density Evolution, Stopping Sets, and Architecture Optimization

Ehsan Aghazadeh , Hossein Pishro-Nik This is my paper

Pith reviewed 2026-06-26 21:46 UTC · model grok-4.3

classification 💻 cs.MA cs.ITmath.IT

keywords density evolutionAI agentsfactor graphsmessage passingreliabilitystopping setsBoolean verifierserasure channels

0 comments

The pith

A density-evolution theorem predicts the asymptotic fraction of unresolved subclaims in AI agent networks modeled as role-typed factor graphs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats multi-agent AI systems as message passing on sparse graphs, where agents exchange set-valued messages about binary subclaims and verifiers apply noisy Boolean functions. It adapts density-evolution analysis to track how three erasure-type failures propagate under a logical-forcing rule at each check node. This yields a recursion that forecasts the fraction of unresolved subclaims for random architectures and extends to locally tree-like deterministic graphs. The XOR specialization matches the known LDPC erasure-channel recursion, while the AND case produces an asymmetry between positive and negative certificates. The result supplies a predictive tool for when such networks succeed or fail without requiring exhaustive simulation of every configuration.

Core claim

We prove a density-evolution theorem that predicts the asymptotic fraction of unresolved subclaims on random role-typed architectures, with an extension to deterministic, locally tree-like graph sequences. The XOR case recovers the classical LDPC recursion on the binary erasure channel (BEC); the AND case exposes an asymmetry between positive and negative verifier certificates.

What carries the argument

Density evolution on sparse role-typed factor graphs whose noisy Boolean check nodes combine set-valued messages via a single logical-forcing rule.

If this is right

Random architectures admit a computable reliability threshold below which the fraction of unresolved subclaims vanishes.
The XOR verifier recovers the classical LDPC threshold on the binary erasure channel.
The AND verifier produces distinct thresholds for positive versus negative certificates.
The same recursion applies to any locally tree-like sequence of deterministic graphs.
Stopping-set analysis becomes available for finite-length performance bounds.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same message-passing model could be used to compare reliability of different role assignments before deployment.
Asymmetry results for AND suggest that certificate polarity should be balanced when designing verifier prompts.
The framework supplies a concrete way to test whether a proposed multi-agent workflow will leave a positive fraction of claims unresolved.
Stopping sets identified by the analysis point to minimal subgraphs whose repair would improve overall resolution.

Load-bearing premise

The three distinct failure modes can be modeled uniformly as erasures that propagate through set-valued messages combined by a single logical-forcing rule on the factor graph.

What would settle it

Monte Carlo simulation on large random role-typed graphs whose measured fraction of unresolved subclaims deviates from the density-evolution prediction by more than the predicted variance.

Figures

Figures reproduced from arXiv: 2606.18121 by Ehsan Aghazadeh, Hossein Pishro-Nik.

**Figure 2.** Figure 2: Bipartite factor graph for the 4-step proof scenario in Section [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Density-evolution prediction (Theorem 1, Corollary 1) vs. empirical bit-erasure rate from Monte-Carlo simulation of the role-typed configuration ensemble. Single-role (dv, dc) = (3, 6)-regular XOR ensemble with three-tier erasure (ϵ V, ϵC, η) = (swept, 0.05, 0.95) and L = 50 message-passing rounds. Empirical points are averages over 30 independent graph realizations at each of three problem sizes n ∈ {200,… view at source ↗

**Figure 4.** Figure 4: AND value-conditioned DE recursion (Proposition [PITH_FULL_IMAGE:figures/full_fig_p040_4.png] view at source ↗

**Figure 5.** Figure 5: AND Monte-Carlo concentration on the single-role [PITH_FULL_IMAGE:figures/full_fig_p040_5.png] view at source ↗

**Figure 6.** Figure 6: Deterministic-graph validation of Theorem [PITH_FULL_IMAGE:figures/full_fig_p041_6.png] view at source ↗

**Figure 7.** Figure 7: XOR residual P (L) DE on a constant-additive-sum surface: ϵ V = 0.40 fixed, side budget ϵ C +(1−η) = 0.10 split as ϵ C = t·0.10 and 1−η = (1−t)·0.10. (dv, dc) = (3, 6), L = 50. Residuals range over [0.146, 0.332] along the same additive sum. The two design points at the endpoints have the same additive effective-erasure score but different residual erasure: the DE map is sensitive to where erasures enter t… view at source ↗

**Figure 8.** Figure 8: The Hilbert architecture [1] mapped onto the role-typed Boolean-verifier-node framework. Top: the four cooperating LLM roles, decomposer, refactorer, type-checker, aggregator, each annotated with the erasure-tier parameter it controls. Bottom: the underlying bipartite factor graph that the framework analyzes. Variable agents are proposed proof steps; check agents are Lean kernel invocations performing AND-… view at source ↗

read the original abstract

Modern AI systems increasingly solve a task not with a single model call but with several imperfect agents working together: some propose pieces of a solution, others verify them, and the results are combined. These systems often outperform any single model, yet it is rarely clear why they succeed or when they will fail. We model such a system as message passing on a sparse graph, the structure that underlies low-density parity-check (LDPC) codes, and extend the density-evolution machinery of coding theory to this richer setting. In our model a task is a set of coupled binary subclaims, and an agent architecture is a sparse, role-typed factor graph whose check nodes are noisy Boolean verifier nodes, each computing a local Boolean function of the subclaims it touches. Three distinct failure modes, all modeled as erasures (an agent abstaining, a verifier returning no usable output, and a message lost between two agents), propagate as the agents exchange set-valued messages. The check agents combine these messages by a single logical-forcing rule that specializes to XOR, AND, OR, implication, and Horn constraints. This is more than a relabeling of LDPC theory: the verifier functions are nonlinear and value-asymmetric, and the three failure modes do not reduce to a single effective channel, so they require new threshold, finite-length, and converse results rather than a direct reuse of parity-check density evolution. We prove a density-evolution theorem that predicts the asymptotic fraction of unresolved subclaims on random role-typed architectures, with an extension to deterministic, locally tree-like graph sequences. The XOR case recovers the classical LDPC recursion on the binary erasure channel (BEC); the AND case exposes an asymmetry between positive and negative verifier certificates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper extends density-evolution analysis to role-typed factor graphs for AI-agent reliability with nonlinear asymmetric verifiers, but the uniform erasure treatment of three failure modes needs close checking against the claimed single recursion.

read the letter

The core contribution is a density-evolution theorem for predicting the fraction of unresolved subclaims in random architectures built from noisy Boolean verifiers on sparse graphs. It recovers the standard LDPC recursion when the verifier is XOR and shows an asymmetry in positive versus negative certificates for the AND case. That extension to value-asymmetric functions is the part that goes beyond relabeling existing coding-theory results.

The modeling choice to treat agent abstention, verifier abstention, and message loss uniformly as erasures that propagate through set-valued messages under one logical-forcing rule is the part that needs scrutiny. The abstract itself notes that the three modes do not collapse to a single effective channel and that the verifiers are nonlinear, yet still asserts one DE theorem. If the possible message sets differ by failure mode in ways that the forcing rule does not close uniformly, the single recursion would not hold for arbitrary Boolean functions. The stress-test concern lands on the abstract as written.

The work is framed as an honest extension rather than a parameter scan or self-referential claim, and the XOR recovery provides a sanity check. Without the full derivation it is impossible to confirm whether the theorem actually closes or whether extra assumptions were added post-hoc.

This is the kind of paper a reading group in multi-agent systems or coding theory might discuss to see whether the math supports the modeling. It is worth sending to referees so the proof and the uniform-erasure step can be examined directly; the idea is specific enough that a serious review would clarify its scope.

Referee Report

1 major / 1 minor

Summary. The paper models networks of AI agents solving tasks via coupled binary subclaims as message passing on sparse, role-typed factor graphs whose check nodes are noisy Boolean verifiers. Three failure modes (agent abstention, verifier abstention, message loss) are modeled uniformly as erasures that propagate via set-valued messages combined at checks by a single logical-forcing rule. The authors prove a density-evolution theorem predicting the asymptotic fraction of unresolved subclaims on random role-typed architectures (with an extension to deterministic locally tree-like sequences), recovering the classical LDPC BEC recursion for XOR and exposing positive/negative asymmetry for AND.

Significance. If the density-evolution theorem is correct, the work supplies a non-trivial extension of coding-theory tools to multi-agent reliability analysis, including new threshold and finite-length results that do not reduce to standard LDPC. The explicit recovery of the XOR case and the identification of AND asymmetry are concrete strengths; the manuscript also ships a theorem rather than only simulations.

major comments (1)

[Abstract / density-evolution theorem] Abstract / density-evolution theorem: the claimed single DE recursion requires that the three failure modes can be treated uniformly as erasures under one logical-forcing rule on set-valued messages for arbitrary Boolean verifier functions. The abstract itself states that the verifiers are nonlinear and value-asymmetric and that the modes "do not reduce to a single effective channel"; any mode-specific effect on admissible message sets would necessitate distinct alphabets or separate recursions, undermining the single-theorem claim. This assumption is load-bearing for the central result.

minor comments (1)

Clarify whether the extension to deterministic locally tree-like sequences is proved as a separate statement or follows directly as a corollary of the random-graph theorem.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful reading and for highlighting the modeling assumptions underlying the density-evolution theorem. We address the single major comment below and believe the central claim remains intact.

read point-by-point responses

Referee: [Abstract / density-evolution theorem] Abstract / density-evolution theorem: the claimed single DE recursion requires that the three failure modes can be treated uniformly as erasures under one logical-forcing rule on set-valued messages for arbitrary Boolean verifier functions. The abstract itself states that the verifiers are nonlinear and value-asymmetric and that the modes "do not reduce to a single effective channel"; any mode-specific effect on admissible message sets would necessitate distinct alphabets or separate recursions, undermining the single-theorem claim. This assumption is load-bearing for the central result.

Authors: The three erasure modes are deliberately modeled with the same message alphabet (subsets of {0,1}) and the same logical-forcing combination rule at every check node; the rule depends only on the verifier function, not on which physical mechanism produced the erasure. Consequently the message-passing dynamics remain identical across modes, permitting a single density-evolution recursion (Theorem 1) whose state is the distribution over unresolved sets. The statement that the modes “do not reduce to a single effective channel” refers to the fact that the resulting thresholds, asymmetry for AND/OR, and finite-length scaling differ from classical BEC-LDPC and therefore require fresh analysis; it does not imply that the message alphabet or recursion must be mode-specific. The derivation in Sections 3–4 makes this uniformity explicit for arbitrary Boolean verifiers. We are happy to insert one clarifying sentence in the abstract to separate the two senses of “channel.” revision: partial

Circularity Check

0 steps flagged

Density-evolution theorem extends LDPC machinery to role-typed graphs with independent content

full rationale

The paper frames its central result as a proved density-evolution theorem for the asymptotic unresolved fraction under set-valued messages and a logical-forcing rule on role-typed factor graphs. It explicitly distinguishes the setting from direct LDPC reuse by noting nonlinear asymmetric verifiers and multiple non-equivalent erasure modes, and recovers the classical BEC recursion only as the special XOR case while deriving new asymmetry for AND. No quoted step reduces the theorem to a fitted parameter, self-citation chain, or input by construction; the derivation is presented as an extension of existing coding-theory tools to a new domain with stated differences that require fresh analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard density-evolution assumptions transferred to the new setting; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

standard math Density evolution applies to large random sparse graphs that are locally tree-like
Invoked implicitly when stating the asymptotic fraction on random role-typed architectures (abstract).

pith-pipeline@v0.9.1-grok · 5856 in / 1255 out tokens · 31446 ms · 2026-06-26T21:46:03.160047+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

49 extracted references · 3 canonical work pages

[1]

Hilbert: Recursively building formal proofs with informal reasoning,

S. Varambally, T. Voice, Y. Sun, Z. Chen, R. Yu, and K. Ye, “Hilbert: Recursively building formal proofs with informal reasoning,” inThe 5th Workshop on Mathematical Reasoning and AI at NeurIPS 2025, 2025. [Online]. Available: https://openreview.net/forum?id=ljAHonPrs1 57

2025
[2]

Coder: Issue resolving with multi-agent and task graphs,

D. Chen, S. Lin, M. Zeng, D. Zan, J.-G. Wang, A. Cheshkov, J. Sun, H. Yu, G. Dong, A. Aliev, J. Wang, X. Cheng, G. Liang, Y. Ma, P. Bian, T. Xie, and Q. Wang, “Coder: Issue resolving with multi-agent and task graphs,” 2024. [Online]. Available: https://arxiv.org/abs/2406.01304

arXiv 2024
[3]

Debating with more persuasive llms leads to more truthful answers,

A. Khan, J. Hughes, D. Valentine, L. Ruis, K. Sachan, A. Radhakrishnan, E. Grefenstette, S. R. Bowman, T. Rocktäschel, and E. Perez, “Debating with more persuasive llms leads to more truthful answers,” inProceedings of the 41st International Conference on Machine Learning, ser. ICML’24. JMLR.org, 2024

2024
[4]

Towards a science of ai agent reliability,

S. Rabanser, S. Kapoor, P. Kirgis, K. Liu, S. Utpala, and A. Narayanan, “Towards a science of ai agent reliability,” 2026. [Online]. Available: https://arxiv.org/abs/2602.16666

Pith/arXiv arXiv 2026
[5]

Richardson and R

T. Richardson and R. Urbanke,Modern Coding Theory. Cambridge University Press, 2008

2008
[6]

Analysis of low density codes and improved designs using irregular graphs,

M. Luby, M. Mitzenmacher, A. Shokrollah, and D. Spielman, “Analysis of low density codes and improved designs using irregular graphs,” inProceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, ser. STOC ’98. New York, NY, USA: Association for Computing Machinery, 1998, p. 249–258. [Online]. Available: https://doi.org/10.1145/276698.276756

work page doi:10.1145/276698.276756 1998
[7]

The capacity of low-density parity-check codes under message-passing decoding,

T. Richardson and R. Urbanke, “The capacity of low-density parity-check codes under message-passing decoding,”IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 599–618, 2001

2001
[8]

Design of capacity-approaching irregular low-density parity-check codes,

T. Richardson, M. Shokrollahi, and R. Urbanke, “Design of capacity-approaching irregular low-density parity-check codes,” IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 619–637, 2001

2001
[9]

Finite-length analysis of low-density parity-check codes on the binary erasure channel,

C. Di, D. Proietti, I. E. Telatar, T. J. Richardson, and R. L. Urbanke, “Finite-length analysis of low-density parity-check codes on the binary erasure channel,”IEEE Trans. Inf. Theor., vol. 48, no. 6, p. 1570–1579, Sep. 2006. [Online]. Available: https://doi.org/10.1109/TIT.2002.1003839

work page doi:10.1109/tit.2002.1003839 2006
[10]

Nonuniform error correction using low-density parity-check codes,

H. Pishro-Nik, N. Rahnavard, and F. Fekri, “Nonuniform error correction using low-density parity-check codes,”IEEE Transactions on Information Theory, vol. 51, no. 7, pp. 2702–2714, 2005

2005
[11]

Design of ldpc codes robust to noisy message-passing decoding,

A. Tarighati, H. Farhadi, and F. Lahouti, “Design of ldpc codes robust to noisy message-passing decoding,” 2015. [Online]. Available: https://arxiv.org/abs/1501.02483

Pith/arXiv arXiv 2015
[12]

Noisy density evolution with asymmetric deviation models,

E. Dupraz and F. Leduc-Primeau, “Noisy density evolution with asymmetric deviation models,”IEEE Transactions on Communications, vol. 69, no. 3, pp. 1403–1416, 2021

2021
[13]

SWE-bench: Can language models resolve real-world github issues?

C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan, “SWE-bench: Can language models resolve real-world github issues?” inThe Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=VTF8yNQM66

2024
[14]

SWE-agent: Agent-computer interfaces enable automated software engineering,

J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. R. Narasimhan, and O. Press, “SWE-agent: Agent-computer interfaces enable automated software engineering,” inThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. [Online]. Available: https://openreview.net/forum?id=mXpq6ut8J3

2024
[15]

CLEVER: A curated benchmark for formally verified code generation,

A. Thakur, J. Lee, G. Tsoukalas, M. Sistla, M. Zhao, S. Zetzsche, G. Durrett, Y. Yue, and S. Chaudhuri, “CLEVER: A curated benchmark for formally verified code generation,” inThe Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2026. [Online]. Available: https: //openreview.net/forum?id=IbOacMF5qd

2026
[16]

Proving the coding interview: A benchmark for formally verified code generation,

Q. Dougherty and R. Mehta, “Proving the coding interview: A benchmark for formally verified code generation,” in2025 IEEE/ACM International Workshop on Large Language Models for Code (LLM4Code), 2025, pp. 72–79

2025
[17]

Solving a million-step llm task with zero errors,

E. Meyerson, G. Paolo, R. Dailey, H. Shahrzad, O. Francon, C. F. Hayes, X. Qiu, B. Hodjat, and R. Miikkulainen, “Solving a million-step llm task with zero errors,” 2025. [Online]. Available: https://arxiv.org/abs/2511.09030

arXiv 2025
[18]

Improving factuality and reasoning in language models through multiagent debate,

Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, and I. Mordatch, “Improving factuality and reasoning in language models through multiagent debate,” inProceedings of the 41st International Conference on Machine Learning, ser. ICML’24. JMLR.org, 2024

2024
[19]

On scalable oversight with weak LLMs judging strong LLMs,

Z. Kenton, N. Y. Siegel, J. Kramar, J. Brown-Cohen, S. Albanie, J. Bulian, R. Agarwal, D. Lindner, Y. Tang, N. Goodman, and R. Shah, “On scalable oversight with weak LLMs judging strong LLMs,” inThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. [Online]. Available: https://openreview.net/forum?id=O1fp9nVraj

2024
[20]

Debate or vote: Which yields better decisions in multi-agent large language models?

H. K. Choi, J. Zhu, and S. Li, “Debate or vote: Which yields better decisions in multi-agent large language models?” inThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. [Online]. Available: https://openreview.net/forum?id=iUjGNJzrF1

2026
[21]

Factor graphs and the sum-product algorithm,

F. Kschischang, B. Frey, and H.-A. Loeliger, “Factor graphs and the sum-product algorithm,”IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 498–519, 2001

2001
[22]

Low-density parity-check codes,

R. Gallager, “Low-density parity-check codes,”IRE Transactions on Information Theory, vol. 8, no. 1, pp. 21–28, 1962

1962
[23]

On decoding of low-density parity-check codes over the binary erasure channel,

H. Pishro-Nik and F. Fekri, “On decoding of low-density parity-check codes over the binary erasure channel,”IEEE Transactions on Information Theory, vol. 50, no. 3, pp. 439–454, 2004

2004
[24]

Results on punctured low-density parity-check codes and improved iterative decoding techniques,

——, “Results on punctured low-density parity-check codes and improved iterative decoding techniques,”IEEE Transactions on Information Theory, vol. 53, no. 2, pp. 599–614, 2007

2007
[25]

Generalized and doubly generalized ldpc codes with random component codes for the binary erasure channel,

E. Paolini, M. P. C. Fossorier, and M. Chiani, “Generalized and doubly generalized ldpc codes with random component codes for the binary erasure channel,”IEEE Transactions on Information Theory, vol. 56, no. 4, pp. 1651–1672, 2010

2010
[26]

Analysis of absorbing sets and fully absorbing sets of array-based ldpc codes,

L. Dolecek, Z. Zhang, V. Anantharam, M. J. Wainwright, and B. Nikolic, “Analysis of absorbing sets and fully absorbing sets of array-based ldpc codes,”IEEE Transactions on Information Theory, vol. 56, no. 1, pp. 181–201, 2010

2010
[27]

Density evolution for asymmetric memoryless channels,

C.-C. Wang, S. Kulkarni, and H. Poor, “Density evolution for asymmetric memoryless channels,”IEEE Transactions on Information Theory, vol. 51, no. 12, pp. 4216–4236, 2005

2005
[28]

Reflexion: language agents with verbal reinforcement learning,

N. Shinn, F. Cassano, A. Gopinath, K. R. Narasimhan, and S. Yao, “Reflexion: language agents with verbal reinforcement learning,” inThirty-seventh Conference on Neural Information Processing Systems, 2023. [Online]. Available: https://openreview.net/forum?id=vAElhFcKW6 58

2023
[29]

React: Synergizing reasoning and acting in language models,

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao, “React: Synergizing reasoning and acting in language models,” inThe Eleventh International Conference on Learning Representations, 2023. [Online]. Available: https://openreview.net/forum?id=WE_vluYUL-X

2023
[30]

Tree of thoughts: Deliberate problem solving with large language models,

S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. R. Narasimhan, “Tree of thoughts: Deliberate problem solving with large language models,” inThirty-seventh Conference on Neural Information Processing Systems,
[31]

Available: https://openreview.net/forum?id=5Xc1ecxO1h

[Online]. Available: https://openreview.net/forum?id=5Xc1ecxO1h
[32]

Self-consistency improves chain of thought reasoning in language models,

X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou, “Self-consistency improves chain of thought reasoning in language models,” inThe Eleventh International Conference on Learning Representations,
[33]

Available: https://openreview.net/forum?id=1PL1NIMMrw

[Online]. Available: https://openreview.net/forum?id=1PL1NIMMrw
[34]

Training verifiers to solve math word problems,

K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman, “Training verifiers to solve math word problems,” 2021. [Online]. Available: https://arxiv.org/abs/2110.14168

Pith/arXiv arXiv 2021
[35]

Let’s verify step by step,

H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe, “Let’s verify step by step,” inThe Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=v8L0pN6EOi

2024
[36]

minif2f: a cross-system benchmark for formal olympiad-level mathematics,

K. Zheng, J. M. Han, and S. Polu, “minif2f: a cross-system benchmark for formal olympiad-level mathematics,” inInternational Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id= 9ZPegFuFTFv

2022
[37]

Proofnet: Autoformalizing and formally proving undergraduate-level mathematics,

Z. Azerbayev, B. Piotrowski, H. Schoelkopf, E. W. Ayers, D. Radev, and J. Avigad, “Proofnet: Autoformalizing and formally proving undergraduate-level mathematics,” 2023. [Online]. Available: https://arxiv.org/abs/2302.12433

arXiv 2023
[38]

Lean-STar: Learning to interleave thinking and proving,

H. Lin, Z. Sun, S. Welleck, and Y. Yang, “Lean-STar: Learning to interleave thinking and proving,” inThe Thirteenth International Conference on Learning Representations, 2025. [Online]. Available: https://openreview.net/forum?id= SOWZ59UyNc

2025
[39]

On the reliability limits of llm-based multi-agent planning,

R. Ao, S. Gao, and D. Simchi-Levi, “On the reliability limits of llm-based multi-agent planning,” 2026. [Online]. Available: https://arxiv.org/abs/2603.26993

arXiv 2026
[40]

Representation in

M. Mézard and A. Montanari,Information, Physics, and Computation. Oxford University Press, 01 2009. [Online]. Available: https://doi.org/10.1093/acprof:oso/9780198570837.001.0001

work page doi:10.1093/acprof:oso/9780198570837.001.0001 2009
[41]

On the method of bounded differences,

C. McDiarmid, “On the method of bounded differences,” inSurveys in combinatorics, 1989 (Norwich, 1989), ser. London Math. Soc. Lecture Note Ser. Cambridge Univ. Press, Cambridge, 1989, vol. 141, pp. 148–188

1989
[42]

Mas-fire: Fault injection and reliability evaluation for llm-based multi-agent systems,

J. Jia, Z. Deng, Z. Chen, Y. Wang, and Z. Zheng, “Mas-fire: Fault injection and reliability evaluation for llm-based multi-agent systems,” 2026. [Online]. Available: https://arxiv.org/abs/2602.19843

arXiv 2026
[43]

The objective method: Probabilistic combinatorial optimization and local weak convergence,

D. Aldous and J. M. Steele, “The objective method: Probabilistic combinatorial optimization and local weak convergence,” inProbability on Discrete Structures, H. Kesten, Ed. Berlin: Springer, 2004, pp. 1–72

2004
[44]

Resolvent of large random graphs,

C. Bordenave and M. Lelarge, “Resolvent of large random graphs,”Random Structures and Algorithms, vol. 37, no. 3, pp. 332–352, 2010

2010
[45]

Interlacing families I: Bipartite Ramanujan graphs of all degrees,

A. W. Marcus, D. A. Spielman, and N. Srivastava, “Interlacing families I: Bipartite Ramanujan graphs of all degrees,” Annals of Mathematics, vol. 182, no. 1, pp. 307–325, 2015

2015
[46]

El Gamal and Y.-H

A. El Gamal and Y.-H. Kim,Network Information Theory. Cambridge University Press, 2011

2011
[47]

T. M. Cover and J. A. Thomas,Elements of Information Theory 2nd Edition (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, July 2006

2006
[48]

Large deviations for sums of partly dependent random variables,

S. Janson, “Large deviations for sums of partly dependent random variables,”Random Structures and Algorithms, vol. 24, no. 3, pp. 234–248, 2004

2004
[49]

Stein’s method for concentration inequalities,

S. Chatterjee, “Stein’s method for concentration inequalities,”Probability Theory and Related Fields, vol. 138, pp. 305–321, 2007

2007

[1] [1]

Hilbert: Recursively building formal proofs with informal reasoning,

S. Varambally, T. Voice, Y. Sun, Z. Chen, R. Yu, and K. Ye, “Hilbert: Recursively building formal proofs with informal reasoning,” inThe 5th Workshop on Mathematical Reasoning and AI at NeurIPS 2025, 2025. [Online]. Available: https://openreview.net/forum?id=ljAHonPrs1 57

2025

[2] [2]

Coder: Issue resolving with multi-agent and task graphs,

D. Chen, S. Lin, M. Zeng, D. Zan, J.-G. Wang, A. Cheshkov, J. Sun, H. Yu, G. Dong, A. Aliev, J. Wang, X. Cheng, G. Liang, Y. Ma, P. Bian, T. Xie, and Q. Wang, “Coder: Issue resolving with multi-agent and task graphs,” 2024. [Online]. Available: https://arxiv.org/abs/2406.01304

arXiv 2024

[3] [3]

Debating with more persuasive llms leads to more truthful answers,

A. Khan, J. Hughes, D. Valentine, L. Ruis, K. Sachan, A. Radhakrishnan, E. Grefenstette, S. R. Bowman, T. Rocktäschel, and E. Perez, “Debating with more persuasive llms leads to more truthful answers,” inProceedings of the 41st International Conference on Machine Learning, ser. ICML’24. JMLR.org, 2024

2024

[4] [4]

Towards a science of ai agent reliability,

S. Rabanser, S. Kapoor, P. Kirgis, K. Liu, S. Utpala, and A. Narayanan, “Towards a science of ai agent reliability,” 2026. [Online]. Available: https://arxiv.org/abs/2602.16666

Pith/arXiv arXiv 2026

[5] [5]

Richardson and R

T. Richardson and R. Urbanke,Modern Coding Theory. Cambridge University Press, 2008

2008

[6] [6]

Analysis of low density codes and improved designs using irregular graphs,

M. Luby, M. Mitzenmacher, A. Shokrollah, and D. Spielman, “Analysis of low density codes and improved designs using irregular graphs,” inProceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, ser. STOC ’98. New York, NY, USA: Association for Computing Machinery, 1998, p. 249–258. [Online]. Available: https://doi.org/10.1145/276698.276756

work page doi:10.1145/276698.276756 1998

[7] [7]

The capacity of low-density parity-check codes under message-passing decoding,

T. Richardson and R. Urbanke, “The capacity of low-density parity-check codes under message-passing decoding,”IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 599–618, 2001

2001

[8] [8]

Design of capacity-approaching irregular low-density parity-check codes,

T. Richardson, M. Shokrollahi, and R. Urbanke, “Design of capacity-approaching irregular low-density parity-check codes,” IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 619–637, 2001

2001

[9] [9]

Finite-length analysis of low-density parity-check codes on the binary erasure channel,

C. Di, D. Proietti, I. E. Telatar, T. J. Richardson, and R. L. Urbanke, “Finite-length analysis of low-density parity-check codes on the binary erasure channel,”IEEE Trans. Inf. Theor., vol. 48, no. 6, p. 1570–1579, Sep. 2006. [Online]. Available: https://doi.org/10.1109/TIT.2002.1003839

work page doi:10.1109/tit.2002.1003839 2006

[10] [10]

Nonuniform error correction using low-density parity-check codes,

H. Pishro-Nik, N. Rahnavard, and F. Fekri, “Nonuniform error correction using low-density parity-check codes,”IEEE Transactions on Information Theory, vol. 51, no. 7, pp. 2702–2714, 2005

2005

[11] [11]

Design of ldpc codes robust to noisy message-passing decoding,

A. Tarighati, H. Farhadi, and F. Lahouti, “Design of ldpc codes robust to noisy message-passing decoding,” 2015. [Online]. Available: https://arxiv.org/abs/1501.02483

Pith/arXiv arXiv 2015

[12] [12]

Noisy density evolution with asymmetric deviation models,

E. Dupraz and F. Leduc-Primeau, “Noisy density evolution with asymmetric deviation models,”IEEE Transactions on Communications, vol. 69, no. 3, pp. 1403–1416, 2021

2021

[13] [13]

SWE-bench: Can language models resolve real-world github issues?

C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan, “SWE-bench: Can language models resolve real-world github issues?” inThe Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=VTF8yNQM66

2024

[14] [14]

SWE-agent: Agent-computer interfaces enable automated software engineering,

J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. R. Narasimhan, and O. Press, “SWE-agent: Agent-computer interfaces enable automated software engineering,” inThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. [Online]. Available: https://openreview.net/forum?id=mXpq6ut8J3

2024

[15] [15]

CLEVER: A curated benchmark for formally verified code generation,

A. Thakur, J. Lee, G. Tsoukalas, M. Sistla, M. Zhao, S. Zetzsche, G. Durrett, Y. Yue, and S. Chaudhuri, “CLEVER: A curated benchmark for formally verified code generation,” inThe Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2026. [Online]. Available: https: //openreview.net/forum?id=IbOacMF5qd

2026

[16] [16]

Proving the coding interview: A benchmark for formally verified code generation,

Q. Dougherty and R. Mehta, “Proving the coding interview: A benchmark for formally verified code generation,” in2025 IEEE/ACM International Workshop on Large Language Models for Code (LLM4Code), 2025, pp. 72–79

2025

[17] [17]

Solving a million-step llm task with zero errors,

E. Meyerson, G. Paolo, R. Dailey, H. Shahrzad, O. Francon, C. F. Hayes, X. Qiu, B. Hodjat, and R. Miikkulainen, “Solving a million-step llm task with zero errors,” 2025. [Online]. Available: https://arxiv.org/abs/2511.09030

arXiv 2025

[18] [18]

Improving factuality and reasoning in language models through multiagent debate,

Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, and I. Mordatch, “Improving factuality and reasoning in language models through multiagent debate,” inProceedings of the 41st International Conference on Machine Learning, ser. ICML’24. JMLR.org, 2024

2024

[19] [19]

On scalable oversight with weak LLMs judging strong LLMs,

Z. Kenton, N. Y. Siegel, J. Kramar, J. Brown-Cohen, S. Albanie, J. Bulian, R. Agarwal, D. Lindner, Y. Tang, N. Goodman, and R. Shah, “On scalable oversight with weak LLMs judging strong LLMs,” inThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. [Online]. Available: https://openreview.net/forum?id=O1fp9nVraj

2024

[20] [20]

Debate or vote: Which yields better decisions in multi-agent large language models?

H. K. Choi, J. Zhu, and S. Li, “Debate or vote: Which yields better decisions in multi-agent large language models?” inThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. [Online]. Available: https://openreview.net/forum?id=iUjGNJzrF1

2026

[21] [21]

Factor graphs and the sum-product algorithm,

F. Kschischang, B. Frey, and H.-A. Loeliger, “Factor graphs and the sum-product algorithm,”IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 498–519, 2001

2001

[22] [22]

Low-density parity-check codes,

R. Gallager, “Low-density parity-check codes,”IRE Transactions on Information Theory, vol. 8, no. 1, pp. 21–28, 1962

1962

[23] [23]

On decoding of low-density parity-check codes over the binary erasure channel,

H. Pishro-Nik and F. Fekri, “On decoding of low-density parity-check codes over the binary erasure channel,”IEEE Transactions on Information Theory, vol. 50, no. 3, pp. 439–454, 2004

2004

[24] [24]

Results on punctured low-density parity-check codes and improved iterative decoding techniques,

——, “Results on punctured low-density parity-check codes and improved iterative decoding techniques,”IEEE Transactions on Information Theory, vol. 53, no. 2, pp. 599–614, 2007

2007

[25] [25]

Generalized and doubly generalized ldpc codes with random component codes for the binary erasure channel,

E. Paolini, M. P. C. Fossorier, and M. Chiani, “Generalized and doubly generalized ldpc codes with random component codes for the binary erasure channel,”IEEE Transactions on Information Theory, vol. 56, no. 4, pp. 1651–1672, 2010

2010

[26] [26]

Analysis of absorbing sets and fully absorbing sets of array-based ldpc codes,

L. Dolecek, Z. Zhang, V. Anantharam, M. J. Wainwright, and B. Nikolic, “Analysis of absorbing sets and fully absorbing sets of array-based ldpc codes,”IEEE Transactions on Information Theory, vol. 56, no. 1, pp. 181–201, 2010

2010

[27] [27]

Density evolution for asymmetric memoryless channels,

C.-C. Wang, S. Kulkarni, and H. Poor, “Density evolution for asymmetric memoryless channels,”IEEE Transactions on Information Theory, vol. 51, no. 12, pp. 4216–4236, 2005

2005

[28] [28]

Reflexion: language agents with verbal reinforcement learning,

N. Shinn, F. Cassano, A. Gopinath, K. R. Narasimhan, and S. Yao, “Reflexion: language agents with verbal reinforcement learning,” inThirty-seventh Conference on Neural Information Processing Systems, 2023. [Online]. Available: https://openreview.net/forum?id=vAElhFcKW6 58

2023

[29] [29]

React: Synergizing reasoning and acting in language models,

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao, “React: Synergizing reasoning and acting in language models,” inThe Eleventh International Conference on Learning Representations, 2023. [Online]. Available: https://openreview.net/forum?id=WE_vluYUL-X

2023

[30] [30]

Tree of thoughts: Deliberate problem solving with large language models,

S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. R. Narasimhan, “Tree of thoughts: Deliberate problem solving with large language models,” inThirty-seventh Conference on Neural Information Processing Systems,

[31] [31]

Available: https://openreview.net/forum?id=5Xc1ecxO1h

[Online]. Available: https://openreview.net/forum?id=5Xc1ecxO1h

[32] [32]

Self-consistency improves chain of thought reasoning in language models,

X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou, “Self-consistency improves chain of thought reasoning in language models,” inThe Eleventh International Conference on Learning Representations,

[33] [33]

Available: https://openreview.net/forum?id=1PL1NIMMrw

[Online]. Available: https://openreview.net/forum?id=1PL1NIMMrw

[34] [34]

Training verifiers to solve math word problems,

K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman, “Training verifiers to solve math word problems,” 2021. [Online]. Available: https://arxiv.org/abs/2110.14168

Pith/arXiv arXiv 2021

[35] [35]

Let’s verify step by step,

H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe, “Let’s verify step by step,” inThe Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=v8L0pN6EOi

2024

[36] [36]

minif2f: a cross-system benchmark for formal olympiad-level mathematics,

K. Zheng, J. M. Han, and S. Polu, “minif2f: a cross-system benchmark for formal olympiad-level mathematics,” inInternational Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id= 9ZPegFuFTFv

2022

[37] [37]

Proofnet: Autoformalizing and formally proving undergraduate-level mathematics,

Z. Azerbayev, B. Piotrowski, H. Schoelkopf, E. W. Ayers, D. Radev, and J. Avigad, “Proofnet: Autoformalizing and formally proving undergraduate-level mathematics,” 2023. [Online]. Available: https://arxiv.org/abs/2302.12433

arXiv 2023

[38] [38]

Lean-STar: Learning to interleave thinking and proving,

H. Lin, Z. Sun, S. Welleck, and Y. Yang, “Lean-STar: Learning to interleave thinking and proving,” inThe Thirteenth International Conference on Learning Representations, 2025. [Online]. Available: https://openreview.net/forum?id= SOWZ59UyNc

2025

[39] [39]

On the reliability limits of llm-based multi-agent planning,

R. Ao, S. Gao, and D. Simchi-Levi, “On the reliability limits of llm-based multi-agent planning,” 2026. [Online]. Available: https://arxiv.org/abs/2603.26993

arXiv 2026

[40] [40]

Representation in

M. Mézard and A. Montanari,Information, Physics, and Computation. Oxford University Press, 01 2009. [Online]. Available: https://doi.org/10.1093/acprof:oso/9780198570837.001.0001

work page doi:10.1093/acprof:oso/9780198570837.001.0001 2009

[41] [41]

On the method of bounded differences,

C. McDiarmid, “On the method of bounded differences,” inSurveys in combinatorics, 1989 (Norwich, 1989), ser. London Math. Soc. Lecture Note Ser. Cambridge Univ. Press, Cambridge, 1989, vol. 141, pp. 148–188

1989

[42] [42]

Mas-fire: Fault injection and reliability evaluation for llm-based multi-agent systems,

J. Jia, Z. Deng, Z. Chen, Y. Wang, and Z. Zheng, “Mas-fire: Fault injection and reliability evaluation for llm-based multi-agent systems,” 2026. [Online]. Available: https://arxiv.org/abs/2602.19843

arXiv 2026

[43] [43]

The objective method: Probabilistic combinatorial optimization and local weak convergence,

D. Aldous and J. M. Steele, “The objective method: Probabilistic combinatorial optimization and local weak convergence,” inProbability on Discrete Structures, H. Kesten, Ed. Berlin: Springer, 2004, pp. 1–72

2004

[44] [44]

Resolvent of large random graphs,

C. Bordenave and M. Lelarge, “Resolvent of large random graphs,”Random Structures and Algorithms, vol. 37, no. 3, pp. 332–352, 2010

2010

[45] [45]

Interlacing families I: Bipartite Ramanujan graphs of all degrees,

A. W. Marcus, D. A. Spielman, and N. Srivastava, “Interlacing families I: Bipartite Ramanujan graphs of all degrees,” Annals of Mathematics, vol. 182, no. 1, pp. 307–325, 2015

2015

[46] [46]

El Gamal and Y.-H

A. El Gamal and Y.-H. Kim,Network Information Theory. Cambridge University Press, 2011

2011

[47] [47]

T. M. Cover and J. A. Thomas,Elements of Information Theory 2nd Edition (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, July 2006

2006

[48] [48]

Large deviations for sums of partly dependent random variables,

S. Janson, “Large deviations for sums of partly dependent random variables,”Random Structures and Algorithms, vol. 24, no. 3, pp. 234–248, 2004

2004

[49] [49]

Stein’s method for concentration inequalities,

S. Chatterjee, “Stein’s method for concentration inequalities,”Probability Theory and Related Fields, vol. 138, pp. 305–321, 2007

2007