Verbalized Algorithms: Classical Algorithms are All You Need (Mostly)
Pith reviewed 2026-05-18 17:15 UTC · model grok-4.3
The pith
Verbalized algorithms limit LLMs to elementary string operations inside classical procedures that carry theoretical guarantees, improving the accuracy-runtime tradeoff on reasoning tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Verbalized algorithms combine LLMs and classical algorithms with established guarantees by restricting the LLM to elementary operations on strings. For example, the model serves as a binary comparison oracle inside a sorting routine or as a similarity judge inside a clustering routine. Applying this pattern to verbalized maximum, sorting, clustering, and submodular maximization produces improved accuracy-runtime curves on numerical reasoning, topic clustering, Wi-Fi access point optimization, and multi-hop Q&A RAG.
What carries the argument
Verbalized algorithms, which decompose a reasoning task into elementary string operations answered by an LLM and then execute those answers inside a classical algorithm that supplies the overall soundness, completeness, or optimality guarantee.
If this is right
- Numerical reasoning accuracy improves when verbalized maximum or sorting replaces direct generation.
- Topic clustering quality rises when verbalized clustering algorithms are used.
- Wi-Fi access point optimization reaches better solutions via verbalized submodular maximization.
- Multi-hop Q&A RAG performance increases when the retrieval and answering steps are organized by verbalized procedures.
- Improvements in LLM-based reasoning can be obtained through standard algorithmic analysis rather than solely through prompt engineering.
Where Pith is reading between the lines
- The same pattern could be applied to graph or network problems where an LLM supplies local judgments inside a global algorithm that already has proven bounds.
- If elementary operation reliability improves with model scale or fine-tuning, the performance advantage of verbalized algorithms over free-form generation would be expected to grow.
- The explicit algorithmic skeleton makes it easier to audit or certify the reasoning trace compared with opaque chain-of-thought output.
Load-bearing premise
The LLM can reliably and consistently answer the elementary string operations required by the chosen classical algorithms without errors that propagate through the procedure.
What would settle it
An experiment showing that the same LLM, when repeatedly asked the identical elementary comparison or similarity query under fixed conditions, returns inconsistent answers often enough to produce visibly incorrect or suboptimal output from the overall verbalized algorithm on a standard benchmark.
read the original abstract
Reasoning is a fundamentally algorithmic task. Yet current work on LLM-based reasoning relies on free-form generation whose theoretical guarantees (soundness, completeness, complexity, optimality) remain poorly understood. We argue that we should not treat them as general-purpose reasoners, and as an alternative, we propose a paradigm we call \emph{verbalized algorithms} (VAs), which combines LLMs and various algorithms with established guarantees. Instead of betting on LLM's ability to solve a reasoning task, VAs limit their scope by decomposing the task down to simple elementary operations on strings that they can answer reliably. For example, sorting a list of natural language strings could be done by using an LLM as a binary comparison oracle in a parallel or approximate sorting algorithm. We push the accuracy-runtime Pareto front with \emph{verbalized maximum}, \emph{sorting}, \emph{clustering}, and \emph{submodular maximization}, for numerical reasoning, topic clustering, Wi-Fi access point optimization, and multi-hop Q\&A RAG task. These results suggest improving LLM-based reasoning through standard algorithmic analysis is a feasible and better grounded research direction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes verbalized algorithms (VAs), a paradigm that restricts LLMs to elementary string operations (such as binary comparisons or similarity judgments) inside classical algorithms with established guarantees (e.g., comparison-based sorting, clustering, submodular maximization). This decomposition is applied to numerical reasoning, topic clustering, Wi-Fi access point optimization, and multi-hop Q&A RAG, with the claim that the resulting systems advance the accuracy-runtime Pareto front relative to free-form LLM reasoning.
Significance. If the empirical claims hold under rigorous controls, the work offers a concrete route to more predictable LLM reasoning by anchoring LLM use to operations whose error propagation can in principle be analyzed via classical algorithmic tools. The emphasis on leveraging existing guarantees rather than opaque end-to-end generation is a constructive reframing that could influence hybrid neuro-symbolic research.
major comments (2)
- Abstract: the abstract asserts that verbalized maximum, sorting, clustering, and submodular maximization push the accuracy-runtime Pareto front on four tasks, yet supplies no information on experimental controls, baseline implementations, number of runs, error bars, or how LLM oracle errors were quantified or mitigated; without these the central empirical claim cannot be evaluated.
- Method / Approach section: the validity of the paradigm rests on the assumption that LLM errors in elementary operations (binary comparisons, similarity judgments) do not compound sufficiently to invalidate the underlying algorithmic guarantees (e.g., inversion counts in sorting or approximation ratios in submodular maximization), but no error-rate bounds, sensitivity analysis, or propagation proofs are provided.
minor comments (2)
- The term 'verbalized algorithm' is introduced without a concise formal definition or pseudocode that would make the decomposition strategy immediately reproducible from the text.
- Related-work discussion could more explicitly contrast VAs with prior neuro-symbolic or tool-augmented LLM frameworks that also decompose reasoning into verifiable steps.
Simulated Author's Rebuttal
We thank the referee for the constructive review and for recognizing the potential of verbalized algorithms to anchor LLM reasoning in operations with analyzable error propagation. We respond to each major comment below and indicate the revisions we will incorporate.
read point-by-point responses
-
Referee: Abstract: the abstract asserts that verbalized maximum, sorting, clustering, and submodular maximization push the accuracy-runtime Pareto front on four tasks, yet supplies no information on experimental controls, baseline implementations, number of runs, error bars, or how LLM oracle errors were quantified or mitigated; without these the central empirical claim cannot be evaluated.
Authors: We agree that the abstract would benefit from additional context to allow evaluation of the central claims. In the revised version we will expand the abstract with a brief clause noting that results are reported as averages over multiple independent runs with standard deviations shown in the main figures, that baselines include both free-form LLM prompting and prior hybrid methods, and that oracle errors are mitigated through repeated sampling with majority voting. Full details on controls, baseline implementations, and error quantification appear in Sections 4 and 5 together with the appendix; the abstract revision will point readers to these sections. revision: yes
-
Referee: Method / Approach section: the validity of the paradigm rests on the assumption that LLM errors in elementary operations (binary comparisons, similarity judgments) do not compound sufficiently to invalidate the underlying algorithmic guarantees (e.g., inversion counts in sorting or approximation ratios in submodular maximization), but no error-rate bounds, sensitivity analysis, or propagation proofs are provided.
Authors: This concern is well-taken. The present manuscript prioritizes empirical demonstration that the verbalized approach improves the accuracy-runtime frontier even when the LLM oracle is imperfect. We have added a new sensitivity analysis subsection (and accompanying figure) that injects synthetic error rates from 0% to 25% into the oracles for each algorithm and measures degradation in final task performance; these results show graceful degradation rather than catastrophic failure. We also expand the discussion to outline how classical tools (e.g., inversion-count analysis for sorting or submodular approximation-ratio proofs) could be adapted to derive formal propagation bounds, while explicitly marking full theoretical guarantees as an important direction for future work rather than a claim of the current paper. revision: partial
- Deriving complete formal error-propagation bounds and proofs for all four verbalized algorithms
Circularity Check
Derivation self-contained with no circular reductions
full rationale
The paper introduces verbalized algorithms as a decomposition strategy that restricts LLMs to elementary string operations within classical algorithms possessing established theoretical guarantees. This approach does not rely on self-definitional constructs, fitted inputs presented as predictions, or load-bearing self-citations. The central claims rest on the independent properties of classical algorithms and empirical evaluations across tasks, making the derivation self-contained against external benchmarks rather than reducing to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption LLMs can answer elementary string operations (comparisons, similarity) reliably enough that the overall algorithm retains its theoretical properties.
invented entities (1)
-
Verbalized algorithm (VA)
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanabsolute_floor_iff_bare_distinguishability unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Verbalized Algorithms... limits their scope by decomposing the task down to simple elementary operations on strings... e.g., sorting... using an LLM as a binary comparison oracle in a parallel or approximate sorting algorithm.
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Robust VAs... majority voting... Hoeffding bound
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1]
-
[2]
M. Fine-Morris, V . Hsiao, L. N. Smith, L. M. Hiatt, and M. Roberts. Leveraging LLMs for Generating Document-Informed Hierarchical Planning Models: A Proposal. InAAAI 2025 Workshop LM4Plan,
work page 2025
-
[3]
D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. Deepseek- R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.arXiv preprint arXiv:2501.12948,
work page internal anchor Pith review Pith/arXiv arXiv
- [4]
-
[5]
Y . Hou, J. Li, Z. He, A. Yan, X. Chen, and J. McAuley. Bridging Language and Items for Retrieval and Recommendation.arXiv preprint arXiv:2403.03952,
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
doi: 10.1093/biomet/33.3.239. M. G. Kendall. A New Measure of Rank Correlation.Biometrika, 30(1-2):81–93,
-
[7]
B. Liu, Y . Jiang, X. Zhang, Q. Liu, S. Zhang, J. Biswas, and P. Stone. LLM+P: Empowering Large Language Models with Optimal Planning Proficiency.arXiv preprint arXiv:2304.11477,
work page internal anchor Pith review Pith/arXiv arXiv
- [8]
-
[9]
D. V . McDermott. The 1998 AI Planning Systems Competition.AI Magazine, 21(2):35–55,
work page 1998
-
[10]
MTEB: Massive Text Embedding Benchmark
N. Muennighoff, N. Tazi, L. Magne, and N. Reimers. MTEB: Massive Text Embedding Benchmark.arXiv preprint arXiv:2210.07316,
work page internal anchor Pith review Pith/arXiv arXiv
-
[11]
doi: 10.1609/icaps.v34i1.31502. R. L. Plackett. The Analysis of Permutations.Journal of the Royal Statistical Society Series C: Applied Statistics, 24(2):193–202,
-
[12]
Z. Qin, R. Jagerman, K. Hui, H. Zhuang, J. Wu, L. Yan, J. Shen, T. Liu, J. Liu, D. Metzler, et al. Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting. InFindings of the Association for Computational Linguistics: NAACL 2024, pages 1504–1518,
work page 2024
-
[13]
URLhttps://arxiv.org/abs/2505.09388. K. Valmeekam, M. Marquez, S. Sreedharan, and S. Kambhampati. On the Planning Abilities of Large Language Models - A Critical Investigation.Proc. of the Advances in Neural Information Processing Systems (Neurips), 36:75993–76005,
work page internal anchor Pith review Pith/arXiv arXiv
-
[14]
B. T. Willard and R. Louf. Efficient Guided Generation for Large Language Models.arXiv preprint arXiv:2307.09702,
work page internal anchor Pith review Pith/arXiv arXiv
-
[15]
Y . Xie. Translating natural language to planning goals with large-language models.The International Journal of Robotics Research, 2019:1,
work page 2019
-
[16]
9 Appendix A Sorting: Experimental Details A.1 Bitonic Sorting Network Bitonic sorting networks (Batcher’s Algorithm) [Batcher, 1968] allow us to optimize the sorting of a given list. The comparisons in a bitonic sorting network are not data-dependent, allowing for more feasible parallelization and speed proportional to depth D. Given an input list, a bit...
work page 1968
-
[17]
{criteria} \nX: t.x\nY: t.y\nZ: t.z\n Answer either ’X’ or ’Y’
proposed a heavy-tailed kernel to measure local similarities between data points. In particular, we opt to use a Student-t kernel with α degrees of freedom by defining: pijℓ = 1 + ∥xi−xj ∥2 α − α+1 2 1 + ∥xi−xj ∥2 α − α+1 2 + 1 + ∥xi−xℓ∥2 α − α+1 2 We refer to the resulting technique as t-Distributed STE (t-STE). Our implementation sets α=D−1 , whereDis t...
work page 2008
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.