Verbalized Algorithms: Classical Algorithms are All You Need (Mostly)

Christian Farrell; Hari Pathanjaly; Marko Pavic; Masataro Asai; Sarvesh Chezhian; Supriya Lall

arxiv: 2509.08150 · v5 · submitted 2025-09-09 · 💻 cs.CL

Verbalized Algorithms: Classical Algorithms are All You Need (Mostly)

Supriya Lall , Christian Farrell , Hari Pathanjaly , Marko Pavic , Sarvesh Chezhian , Masataro Asai This is my paper

Pith reviewed 2026-05-18 17:15 UTC · model grok-4.3

classification 💻 cs.CL

keywords verbalized algorithmsLLM reasoningclassical algorithmselementary operationsnumerical reasoningtopic clusteringsubmodular maximizationRAG

0 comments

The pith

Verbalized algorithms limit LLMs to elementary string operations inside classical procedures that carry theoretical guarantees, improving the accuracy-runtime tradeoff on reasoning tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that free-form generation by LLMs lacks clear guarantees on soundness or optimality, so reasoning should instead be handled by verbalized algorithms. These methods decompose a task into simple operations such as binary comparisons or similarity judgments that an LLM can answer, then embed those answers inside a classical algorithm whose overall properties are already established. The authors demonstrate the idea with verbalized versions of maximum finding, sorting, clustering, and submodular maximization. When tested on numerical reasoning, topic clustering, Wi-Fi access-point placement, and multi-hop retrieval-augmented QA, the hybrid procedures advance the accuracy-runtime Pareto front relative to unconstrained generation. A reader should care because the approach replaces reliance on opaque model behavior with a narrower, more analyzable role for the LLM.

Core claim

Verbalized algorithms combine LLMs and classical algorithms with established guarantees by restricting the LLM to elementary operations on strings. For example, the model serves as a binary comparison oracle inside a sorting routine or as a similarity judge inside a clustering routine. Applying this pattern to verbalized maximum, sorting, clustering, and submodular maximization produces improved accuracy-runtime curves on numerical reasoning, topic clustering, Wi-Fi access point optimization, and multi-hop Q&A RAG.

What carries the argument

Verbalized algorithms, which decompose a reasoning task into elementary string operations answered by an LLM and then execute those answers inside a classical algorithm that supplies the overall soundness, completeness, or optimality guarantee.

If this is right

Numerical reasoning accuracy improves when verbalized maximum or sorting replaces direct generation.
Topic clustering quality rises when verbalized clustering algorithms are used.
Wi-Fi access point optimization reaches better solutions via verbalized submodular maximization.
Multi-hop Q&A RAG performance increases when the retrieval and answering steps are organized by verbalized procedures.
Improvements in LLM-based reasoning can be obtained through standard algorithmic analysis rather than solely through prompt engineering.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pattern could be applied to graph or network problems where an LLM supplies local judgments inside a global algorithm that already has proven bounds.
If elementary operation reliability improves with model scale or fine-tuning, the performance advantage of verbalized algorithms over free-form generation would be expected to grow.
The explicit algorithmic skeleton makes it easier to audit or certify the reasoning trace compared with opaque chain-of-thought output.

Load-bearing premise

The LLM can reliably and consistently answer the elementary string operations required by the chosen classical algorithms without errors that propagate through the procedure.

What would settle it

An experiment showing that the same LLM, when repeatedly asked the identical elementary comparison or similarity query under fixed conditions, returns inconsistent answers often enough to produce visibly incorrect or suboptimal output from the overall verbalized algorithm on a standard benchmark.

read the original abstract

Reasoning is a fundamentally algorithmic task. Yet current work on LLM-based reasoning relies on free-form generation whose theoretical guarantees (soundness, completeness, complexity, optimality) remain poorly understood. We argue that we should not treat them as general-purpose reasoners, and as an alternative, we propose a paradigm we call \emph{verbalized algorithms} (VAs), which combines LLMs and various algorithms with established guarantees. Instead of betting on LLM's ability to solve a reasoning task, VAs limit their scope by decomposing the task down to simple elementary operations on strings that they can answer reliably. For example, sorting a list of natural language strings could be done by using an LLM as a binary comparison oracle in a parallel or approximate sorting algorithm. We push the accuracy-runtime Pareto front with \emph{verbalized maximum}, \emph{sorting}, \emph{clustering}, and \emph{submodular maximization}, for numerical reasoning, topic clustering, Wi-Fi access point optimization, and multi-hop Q\&A RAG task. These results suggest improving LLM-based reasoning through standard algorithmic analysis is a feasible and better grounded research direction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Restricting LLMs to oracle roles inside classical algorithms is a coherent direction but the abstract leaves the error propagation problem unaddressed.

read the letter

The main thing to know is that the authors want to stop asking LLMs to do open-ended reasoning and instead limit them to answering elementary string questions inside standard algorithms. Examples include using the model as a comparison oracle inside a sorting routine for numerical reasoning, as a similarity judge inside clustering for topics, and inside submodular maximization for Wi-Fi placement and multi-hop RAG. The pitch is that this decomposition inherits some of the known properties of the algorithms while still using the LLM where it might be reliable.

Referee Report

2 major / 2 minor

Summary. The paper proposes verbalized algorithms (VAs), a paradigm that restricts LLMs to elementary string operations (such as binary comparisons or similarity judgments) inside classical algorithms with established guarantees (e.g., comparison-based sorting, clustering, submodular maximization). This decomposition is applied to numerical reasoning, topic clustering, Wi-Fi access point optimization, and multi-hop Q&A RAG, with the claim that the resulting systems advance the accuracy-runtime Pareto front relative to free-form LLM reasoning.

Significance. If the empirical claims hold under rigorous controls, the work offers a concrete route to more predictable LLM reasoning by anchoring LLM use to operations whose error propagation can in principle be analyzed via classical algorithmic tools. The emphasis on leveraging existing guarantees rather than opaque end-to-end generation is a constructive reframing that could influence hybrid neuro-symbolic research.

major comments (2)

Abstract: the abstract asserts that verbalized maximum, sorting, clustering, and submodular maximization push the accuracy-runtime Pareto front on four tasks, yet supplies no information on experimental controls, baseline implementations, number of runs, error bars, or how LLM oracle errors were quantified or mitigated; without these the central empirical claim cannot be evaluated.
Method / Approach section: the validity of the paradigm rests on the assumption that LLM errors in elementary operations (binary comparisons, similarity judgments) do not compound sufficiently to invalidate the underlying algorithmic guarantees (e.g., inversion counts in sorting or approximation ratios in submodular maximization), but no error-rate bounds, sensitivity analysis, or propagation proofs are provided.

minor comments (2)

The term 'verbalized algorithm' is introduced without a concise formal definition or pseudocode that would make the decomposition strategy immediately reproducible from the text.
Related-work discussion could more explicitly contrast VAs with prior neuro-symbolic or tool-augmented LLM frameworks that also decompose reasoning into verifiable steps.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive review and for recognizing the potential of verbalized algorithms to anchor LLM reasoning in operations with analyzable error propagation. We respond to each major comment below and indicate the revisions we will incorporate.

read point-by-point responses

Referee: Abstract: the abstract asserts that verbalized maximum, sorting, clustering, and submodular maximization push the accuracy-runtime Pareto front on four tasks, yet supplies no information on experimental controls, baseline implementations, number of runs, error bars, or how LLM oracle errors were quantified or mitigated; without these the central empirical claim cannot be evaluated.

Authors: We agree that the abstract would benefit from additional context to allow evaluation of the central claims. In the revised version we will expand the abstract with a brief clause noting that results are reported as averages over multiple independent runs with standard deviations shown in the main figures, that baselines include both free-form LLM prompting and prior hybrid methods, and that oracle errors are mitigated through repeated sampling with majority voting. Full details on controls, baseline implementations, and error quantification appear in Sections 4 and 5 together with the appendix; the abstract revision will point readers to these sections. revision: yes
Referee: Method / Approach section: the validity of the paradigm rests on the assumption that LLM errors in elementary operations (binary comparisons, similarity judgments) do not compound sufficiently to invalidate the underlying algorithmic guarantees (e.g., inversion counts in sorting or approximation ratios in submodular maximization), but no error-rate bounds, sensitivity analysis, or propagation proofs are provided.

Authors: This concern is well-taken. The present manuscript prioritizes empirical demonstration that the verbalized approach improves the accuracy-runtime frontier even when the LLM oracle is imperfect. We have added a new sensitivity analysis subsection (and accompanying figure) that injects synthetic error rates from 0% to 25% into the oracles for each algorithm and measures degradation in final task performance; these results show graceful degradation rather than catastrophic failure. We also expand the discussion to outline how classical tools (e.g., inversion-count analysis for sorting or submodular approximation-ratio proofs) could be adapted to derive formal propagation bounds, while explicitly marking full theoretical guarantees as an important direction for future work rather than a claim of the current paper. revision: partial

standing simulated objections not resolved

Deriving complete formal error-propagation bounds and proofs for all four verbalized algorithms

Circularity Check

0 steps flagged

Derivation self-contained with no circular reductions

full rationale

The paper introduces verbalized algorithms as a decomposition strategy that restricts LLMs to elementary string operations within classical algorithms possessing established theoretical guarantees. This approach does not rely on self-definitional constructs, fitted inputs presented as predictions, or load-bearing self-citations. The central claims rest on the independent properties of classical algorithms and empirical evaluations across tasks, making the derivation self-contained against external benchmarks rather than reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the unverified assumption that LLMs function as reliable oracles for the elementary operations required by the algorithms; no free parameters or invented physical entities are mentioned, but the new paradigm itself is an invented conceptual entity whose independent evidence is the reported empirical gains.

axioms (1)

domain assumption LLMs can answer elementary string operations (comparisons, similarity) reliably enough that the overall algorithm retains its theoretical properties.
Invoked when the paper states that VAs limit scope to operations LLMs can answer reliably.

invented entities (1)

Verbalized algorithm (VA) no independent evidence
purpose: A hybrid paradigm that decomposes reasoning tasks into classical algorithms plus narrow LLM oracle calls.
The paper introduces this as a new paradigm; no external falsifiable prediction beyond the reported experiments is given in the abstract.

pith-pipeline@v0.9.0 · 5747 in / 1412 out tokens · 43950 ms · 2026-05-18T17:15:53.833347+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Verbalized Algorithms... limits their scope by decomposing the task down to simple elementary operations on strings... e.g., sorting... using an LLM as a binary comparison oracle in a parallel or approximate sorting algorithm.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Robust VAs... majority voting... Hoeffding bound

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 6 internal anchors

[1]

K. M. Collins, C. Wong, J. Feng, M. Wei, and J. B. Tenenbaum. Structured, flexible, and robust: bench- marking and improving large language models towards more human-like behavior in out-of-distribution reasoning tasks.arXiv preprint arXiv:2205.05718,

work page arXiv
[2]

Fine-Morris, V

M. Fine-Morris, V . Hsiao, L. N. Smith, L. M. Hiatt, and M. Roberts. Leveraging LLMs for Generating Document-Informed Hierarchical Planning Models: A Proposal. InAAAI 2025 Workshop LM4Plan,

work page 2025
[3]

D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. Deepseek- R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.arXiv preprint arXiv:2501.12948,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Hirsch, G

E. Hirsch, G. Uziel, and A. Anaby-Tavor. What’s the Plan? Evaluating and Developing Planning-Aware Techniques for Language Models.arXiv preprint arXiv:2402.11489,

work page arXiv
[5]

Y . Hou, J. Li, Z. He, A. Yan, X. Chen, and J. McAuley. Bridging Language and Items for Retrieval and Recommendation.arXiv preprint arXiv:2403.03952,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

doi: 10.1093/biomet/33.3.239. M. G. Kendall. A New Measure of Rank Correlation.Biometrika, 30(1-2):81–93,

work page doi:10.1093/biomet/33.3.239
[7]

B. Liu, Y . Jiang, X. Zhang, Q. Liu, S. Zhang, J. Biswas, and P. Stone. LLM+P: Empowering Large Language Models with Optimal Planning Proficiency.arXiv preprint arXiv:2304.11477,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

X. Ma, X. Zhang, R. Pradeep, and J. Lin. Zero-Shot Listwise Document Reranking with a Large Language Model.arXiv preprint arXiv:2305.02156,

work page arXiv
[9]

D. V . McDermott. The 1998 AI Planning Systems Competition.AI Magazine, 21(2):35–55,

work page 1998
[10]

MTEB: Massive Text Embedding Benchmark

N. Muennighoff, N. Tazi, L. Magne, and N. Reimers. MTEB: Massive Text Embedding Benchmark.arXiv preprint arXiv:2210.07316,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

doi: 10.1609/icaps.v34i1.31502. R. L. Plackett. The Analysis of Permutations.Journal of the Royal Statistical Society Series C: Applied Statistics, 24(2):193–202,

work page doi:10.1609/icaps.v34i1.31502
[12]

Z. Qin, R. Jagerman, K. Hui, H. Zhuang, J. Wu, L. Yan, J. Shen, T. Liu, J. Liu, D. Metzler, et al. Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting. InFindings of the Association for Computational Linguistics: NAACL 2024, pages 1504–1518,

work page 2024
[13]

URLhttps://arxiv.org/abs/2505.09388. K. Valmeekam, M. Marquez, S. Sreedharan, and S. Kambhampati. On the Planning Abilities of Large Language Models - A Critical Investigation.Proc. of the Advances in Neural Information Processing Systems (Neurips), 36:75993–76005,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

B. T. Willard and R. Louf. Efficient Guided Generation for Large Language Models.arXiv preprint arXiv:2307.09702,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Y . Xie. Translating natural language to planning goals with large-language models.The International Journal of Robotics Research, 2019:1,

work page 2019
[16]

The comparisons in a bitonic sorting network are not data-dependent, allowing for more feasible parallelization and speed proportional to depth D

9 Appendix A Sorting: Experimental Details A.1 Bitonic Sorting Network Bitonic sorting networks (Batcher’s Algorithm) [Batcher, 1968] allow us to optimize the sorting of a given list. The comparisons in a bitonic sorting network are not data-dependent, allowing for more feasible parallelization and speed proportional to depth D. Given an input list, a bit...

work page 1968
[17]

{criteria} \nX: t.x\nY: t.y\nZ: t.z\n Answer either ’X’ or ’Y’

proposed a heavy-tailed kernel to measure local similarities between data points. In particular, we opt to use a Student-t kernel with α degrees of freedom by defining: pijℓ = 1 + ∥xi−xj ∥2 α − α+1 2 1 + ∥xi−xj ∥2 α − α+1 2 + 1 + ∥xi−xℓ∥2 α − α+1 2 We refer to the resulting technique as t-Distributed STE (t-STE). Our implementation sets α=D−1 , whereDis t...

work page 2008

[1] [1]

K. M. Collins, C. Wong, J. Feng, M. Wei, and J. B. Tenenbaum. Structured, flexible, and robust: bench- marking and improving large language models towards more human-like behavior in out-of-distribution reasoning tasks.arXiv preprint arXiv:2205.05718,

work page arXiv

[2] [2]

Fine-Morris, V

M. Fine-Morris, V . Hsiao, L. N. Smith, L. M. Hiatt, and M. Roberts. Leveraging LLMs for Generating Document-Informed Hierarchical Planning Models: A Proposal. InAAAI 2025 Workshop LM4Plan,

work page 2025

[3] [3]

D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. Deepseek- R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.arXiv preprint arXiv:2501.12948,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Hirsch, G

E. Hirsch, G. Uziel, and A. Anaby-Tavor. What’s the Plan? Evaluating and Developing Planning-Aware Techniques for Language Models.arXiv preprint arXiv:2402.11489,

work page arXiv

[5] [5]

Y . Hou, J. Li, Z. He, A. Yan, X. Chen, and J. McAuley. Bridging Language and Items for Retrieval and Recommendation.arXiv preprint arXiv:2403.03952,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

doi: 10.1093/biomet/33.3.239. M. G. Kendall. A New Measure of Rank Correlation.Biometrika, 30(1-2):81–93,

work page doi:10.1093/biomet/33.3.239

[7] [7]

B. Liu, Y . Jiang, X. Zhang, Q. Liu, S. Zhang, J. Biswas, and P. Stone. LLM+P: Empowering Large Language Models with Optimal Planning Proficiency.arXiv preprint arXiv:2304.11477,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

X. Ma, X. Zhang, R. Pradeep, and J. Lin. Zero-Shot Listwise Document Reranking with a Large Language Model.arXiv preprint arXiv:2305.02156,

work page arXiv

[9] [9]

D. V . McDermott. The 1998 AI Planning Systems Competition.AI Magazine, 21(2):35–55,

work page 1998

[10] [10]

MTEB: Massive Text Embedding Benchmark

N. Muennighoff, N. Tazi, L. Magne, and N. Reimers. MTEB: Massive Text Embedding Benchmark.arXiv preprint arXiv:2210.07316,

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

doi: 10.1609/icaps.v34i1.31502. R. L. Plackett. The Analysis of Permutations.Journal of the Royal Statistical Society Series C: Applied Statistics, 24(2):193–202,

work page doi:10.1609/icaps.v34i1.31502

[12] [12]

Z. Qin, R. Jagerman, K. Hui, H. Zhuang, J. Wu, L. Yan, J. Shen, T. Liu, J. Liu, D. Metzler, et al. Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting. InFindings of the Association for Computational Linguistics: NAACL 2024, pages 1504–1518,

work page 2024

[13] [13]

URLhttps://arxiv.org/abs/2505.09388. K. Valmeekam, M. Marquez, S. Sreedharan, and S. Kambhampati. On the Planning Abilities of Large Language Models - A Critical Investigation.Proc. of the Advances in Neural Information Processing Systems (Neurips), 36:75993–76005,

work page internal anchor Pith review Pith/arXiv arXiv

[14] [14]

B. T. Willard and R. Louf. Efficient Guided Generation for Large Language Models.arXiv preprint arXiv:2307.09702,

work page internal anchor Pith review Pith/arXiv arXiv

[15] [15]

Y . Xie. Translating natural language to planning goals with large-language models.The International Journal of Robotics Research, 2019:1,

work page 2019

[16] [16]

The comparisons in a bitonic sorting network are not data-dependent, allowing for more feasible parallelization and speed proportional to depth D

9 Appendix A Sorting: Experimental Details A.1 Bitonic Sorting Network Bitonic sorting networks (Batcher’s Algorithm) [Batcher, 1968] allow us to optimize the sorting of a given list. The comparisons in a bitonic sorting network are not data-dependent, allowing for more feasible parallelization and speed proportional to depth D. Given an input list, a bit...

work page 1968

[17] [17]

{criteria} \nX: t.x\nY: t.y\nZ: t.z\n Answer either ’X’ or ’Y’

proposed a heavy-tailed kernel to measure local similarities between data points. In particular, we opt to use a Student-t kernel with α degrees of freedom by defining: pijℓ = 1 + ∥xi−xj ∥2 α − α+1 2 1 + ∥xi−xj ∥2 α − α+1 2 + 1 + ∥xi−xℓ∥2 α − α+1 2 We refer to the resulting technique as t-Distributed STE (t-STE). Our implementation sets α=D−1 , whereDis t...

work page 2008