ReAct: Synergizing Reasoning and Acting in Language Models
Pith reviewed 2026-05-09 01:01 UTC · model claude-opus-4-7
The pith
A language model that writes its reasoning into the same stream as its actions can plan, retrieve, and recover from mistakes better than one that does either alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper argues that reasoning and acting should not be separate prompting modes for a language model. By letting a frozen LLM emit free-form "thoughts" in the same token stream as environment actions, the model can plan, revise plans, decide what to look up next, and incorporate what it just observed, all without any fine-tuning. The authors claim this simple change closes the loop between internal deliberation and external feedback: thoughts decide what to do, observations correct the thoughts, and the next thought adjusts the plan. They report this outperforms reasoning-only and acting-only prompting on a question-answering benchmark with a Wikipedia API, and on two interactive decision-making benchmarks.
What carries the argument
An augmented action space in which the agent's policy emits, at each step, either an external action (search, click, navigate, manipulate) or a "thought" — a free-form natural-language token sequence that produces no observation but updates the context the next decision is conditioned on. The same frozen language model generates both kinds of tokens, prompted with a handful of human-written trajectories that show when to think and when to act. Reasoning is treated as an internal action rather than a separate phase.
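The augmented action space is easy to make concrete. The sketch below is a hypothetical minimal reading of that loop, not the paper's released code: `llm` stands in for a frozen model queried with the growing trajectory, `tools` for the environment API, and the `Thought:`/`Action:` prefixes for the parsing convention that separates the two kinds of tokens.

```python
def react_episode(llm, tools, question, max_steps=7):
    """Alternate free-form thoughts and tool actions in one token stream."""
    context = f"Question: {question}\n"
    for _ in range(max_steps):
        # The model decides per step whether to think or act; which one it
        # chose is recovered purely by a parsing convention on the text.
        line = llm(context)
        context += line + "\n"
        if line.startswith("Thought:"):
            continue  # a thought is a no-op action: it only updates context
        if line.startswith("Action: finish"):
            return line.removeprefix("Action: finish").strip("[] ")
        if line.startswith("Action:"):
            name, _, arg = line.removeprefix("Action:").strip().partition("[")
            observation = tools[name](arg.rstrip("]"))
            context += f"Observation: {observation}\n"  # grounds next thought
    return None  # step budget exhausted without a finish action
```

The key property is that a thought changes nothing in the environment; it only rewrites the context the next decision is conditioned on.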
If this is right
- A general-purpose agent loop can be built from prompting alone if the action vocabulary is extended to include unconstrained natural-language thoughts that do not touch the environment.
- Grounding reasoning in retrieved observations reduces fabrication: on fact verification, the interleaved variant produces fewer hallucinated supporting claims than chain-of-thought reasoning that runs without external lookup.
- Combining internal reasoning with external retrieval works best as a fallback ensemble: use reasoning-with-self-consistency when the model is confident, fall back to the acting variant when it is not, and vice versa.
- Human oversight becomes lightweight: editing one or two thoughts mid-trajectory can redirect the agent's whole plan, because the thoughts are the policy's exposed control surface.
- Fine-tuning a small model on a few thousand interleaved thought-action traces can outperform a much larger model prompted in any of the non-interleaved styles, suggesting the pattern is learnable, not just elicitable.
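The fallback ensemble in the third bullet reduces to a small amount of control logic. The sketch below is illustrative rather than the paper's implementation; `cot_sample` and `react_solve` are hypothetical callables, and the sample count and n/2 majority threshold follow the values reported for the paper (21 samples).

```python
from collections import Counter

def cot_sc_then_react(cot_sample, react_solve, question, n=21):
    """CoT-SC first; fall back to ReAct when no clear majority emerges."""
    answers = [cot_sample(question) for _ in range(n)]
    answer, votes = Counter(answers).most_common(1)[0]
    if votes > n / 2:  # internal reasoning is confident enough
        return answer
    return react_solve(question)  # otherwise ground the answer externally
```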
Where Pith is reading between the lines
- The thought channel functions as an exposed working memory; this suggests that the right primitive for agent design is not 'plan then execute' but a single token stream where deliberation and action are interchangeable, which has implications for how future agent training data should be collected.
- The reported gap on interactive benchmarks compares a 540B-parameter prompted model against far smaller imitation-trained agents, so part of what is being measured is the value of pretrained world knowledge, not only the value of interleaving — a controlled comparison at matched scale would clarify how much of the effect is the pattern itself.
- Because thoughts are human-readable and editable mid-trajectory, this style of agent admits a form of supervision that gradient-based agents cannot easily support: correcting a single sentence redirects the policy, which may matter more for deployment than the headline benchmark numbers.
- The failure mode where the model loops on repeated thoughts hints that greedy decoding plus an unbounded thought channel can trap the agent; sampling, beam search over thoughts, or an explicit 'give up and retry' action are natural next steps.
Load-bearing premise
That the improvements come from the interleaving pattern itself rather than from hand-tuned prompts, best-of-several prompt selection, and the very large base model — the head-to-head against trained agents mixes a method change with a scale change.
What would settle it
Re-run the four-way ablation (Standard, CoT, Act, ReAct) under matched, randomly drawn prompt sets across multiple seeds and across base models of varying scale, on held-out task splits not used for prompt selection. If ReAct's advantage on ALFWorld and WebShop disappears once "best-of-six" prompt selection is removed, or if a smaller base model with ReAct fails to beat a similarly sized imitation-learned agent, then the gains are attributable to scale and prompt curation rather than to interleaving reasoning with action.
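The proposed experiment can be pinned down as a run grid. The sketch below is one hypothetical way to enumerate it so that every method sees the same randomly drawn prompt sets and seeds; the method names, model scales, and demo-pool indexing are placeholders, and actually executing each run is left to the experimenter.

```python
import itertools
import random

def ablation_grid(methods=("Standard", "CoT", "Act", "ReAct"),
                  scales=("8B", "62B", "540B"),
                  n_seeds=3, n_prompt_sets=6):
    """Enumerate runs so every method sees the same seeds and prompt draws."""
    rng = random.Random(0)  # fixed so all methods share the same draws
    prompt_sets = [tuple(rng.sample(range(100), 3))  # indices into a demo pool
                   for _ in range(n_prompt_sets)]
    return [{"method": m, "scale": s, "seed": seed, "prompts": p}
            for m, s, seed, p in itertools.product(
                methods, scales, range(n_seeds), prompt_sets)]
```

Drawing the prompt sets once, before crossing them with methods, is what removes the "best-of-six" selection confound.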
Original abstract
While large language models (LLMs) have demonstrated impressive capabilities across tasks in language understanding and interactive decision making, their abilities for reasoning (e.g. chain-of-thought prompting) and acting (e.g. action plan generation) have primarily been studied as separate topics. In this paper, we explore the use of LLMs to generate both reasoning traces and task-specific actions in an interleaved manner, allowing for greater synergy between the two: reasoning traces help the model induce, track, and update action plans as well as handle exceptions, while actions allow it to interface with external sources, such as knowledge bases or environments, to gather additional information. We apply our approach, named ReAct, to a diverse set of language and decision making tasks and demonstrate its effectiveness over state-of-the-art baselines, as well as improved human interpretability and trustworthiness over methods without reasoning or acting components. Concretely, on question answering (HotpotQA) and fact verification (Fever), ReAct overcomes issues of hallucination and error propagation prevalent in chain-of-thought reasoning by interacting with a simple Wikipedia API, and generates human-like task-solving trajectories that are more interpretable than baselines without reasoning traces. On two interactive decision making benchmarks (ALFWorld and WebShop), ReAct outperforms imitation and reinforcement learning methods by an absolute success rate of 34% and 10% respectively, while being prompted with only one or two in-context examples. Project site with code: https://react-lm.github.io
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ReAct, a prompting paradigm in which a frozen LLM (PaLM-540B, with GPT-3 confirmation in Appendix A.1) interleaves free-form natural-language "thoughts" with domain-specific actions and observations within a single trajectory. The action space is augmented to Â = A ∪ L, where thoughts are no-op tokens that update context but not the environment. ReAct is evaluated on four benchmarks: HotpotQA and FEVER (with a simple Wikipedia search/lookup/finish API), ALFWorld, and WebShop. The authors report (i) on knowledge tasks, ReAct underperforms CoT on HotpotQA EM (27.4 vs 29.4) but outperforms it on FEVER (60.9 vs 56.3), and a ReAct↔CoT-SC fallback combination beats both; (ii) on ALFWorld, ReAct (best-of-6 prompt permutations) reaches 71% success vs 45% for Act and 37% for BUTLER; (iii) on WebShop, ReAct reaches 40.0% success vs 30.1% for Act and ≤29.1% for IL/IL+RL baselines; (iv) finetuning small PaLM models on 3,000 ReAct trajectories outperforms finetuning on Standard/CoT/Act trajectories. Ablations against an IM-style "inner monologue" baseline (ReAct-IM) and a human error analysis on HotpotQA are also reported.
Significance. If the central claim holds — that interleaving reasoning and acting in one decoded trajectory yields a robust prompting pattern across both knowledge-intensive QA and embodied/web decision-making — this is a useful and general contribution. The evidence is strongest where it matters most: the within-model ablations on a frozen PaLM-540B (ReAct vs Act on ALFWorld, +26 absolute best-of-6; ReAct vs Act on WebShop, +9.9 SR; ReAct vs ReAct-IM, +18 on ALFWorld) isolate the contribution of the prompting pattern from model scale. The qualitative human error analysis (Table 2) is informative and honest about ReAct's failure modes (47% reasoning errors, 23% search errors). The combined ReAct↔CoT-SC strategy and the finetuning scaling results (Fig. 3), where small ReAct-finetuned models beat much larger CoT/Standard-prompted models, give the paper additional evidentiary depth beyond a single prompt comparison. The paradigm has since been broadly adopted, and the explicit thought-edit human-in-the-loop demonstration (Fig. 5) is a genuine interpretability dividend. Prompts and code are released, supporting reproducibility within the constraints of a closed model.
major comments (5)
- [Abstract / §4, Tables 3–4] The headline framing 'outperforms imitation and RL methods by an absolute success rate of 34% and 10%' conflates the contribution of the ReAct prompting pattern with the contribution of using a 540B LLM as the policy. Decomposing Table 3, BUTLER=37, Act(best-of-6)=45, ReAct(best-of-6)=71: roughly 8 of the 34 absolute points on ALFWorld come from 'replace small IL agent with PaLM-540B + actions only', and only ~26 from the ReAct pattern itself. On WebShop the decomposition is more favorable (IL=29.1, Act=30.1, ReAct=40.0), so +10 is mostly method. The within-model ablations are reported and support the method, but the abstract should report the within-LLM gap (Act→ReAct) alongside the cross-method gap so readers do not mis-attribute scale effects.
- [§4, Table 3 (ALFWorld)] ALFWorld headline numbers are best-of-6 prompt permutations for ReAct/Act/ReAct-IM, while BUTLER is best-of-8 with beam search and ReAct uses greedy decoding. The mixture of selection criteria (best-of-K over prompt permutations vs best-of-K over beam search) makes the head-to-head comparison less clean than presented. The authors do also report ReAct(avg)=57 vs Act(avg presumably similar to or below best-of-6=45) and ReAct-IM(avg)=48, which is the most convincing comparison and should be emphasized; please also report Act(avg) explicitly and the variance/std across the 6 prompts so readers can judge prompt-selection sensitivity.
- [§3, Table 1 / Fig. 2] Variance is not reported for HotpotQA/FEVER prompting results. Given that the EM gap between ReAct (27.4) and CoT (29.4) is 2.0 points and the gap between ReAct→CoT-SC (35.1) and CoT-SC (33.4) is 1.7 points, single-run numbers without seed variance or confidence intervals make it hard to judge whether the ordering is robust. Please report at least decoding-seed variance for the non-greedy methods and, ideally, a bootstrap CI on the 500-question (or full) eval subset.
- [§3.2 'Combining Internal and External Knowledge'] The CoT-SC↔ReAct fallback heuristic introduces hyperparameters (max steps = 7/5; majority threshold n/2 over 21 samples) that are tuned on the same evaluation distribution. The text says 'we find more steps will not improve ReAct performance' but does not state whether this was determined on a held-out split. Please clarify whether these thresholds were selected on dev or test, and report sensitivity (e.g., performance vs. step budget and vs. threshold) so the combined-method gains in Table 1 are not effectively dev-tuned on test.
- [§3.3 Table 2 (error analysis)] The human-labeled success/failure analysis is one of the strongest pieces of evidence in the paper but the methodology is under-specified. Please state: how many annotators, inter-annotator agreement, whether annotators were blind to which method produced each trajectory, and the exact sampling protocol (the text says '50 correct and 50 incorrect from each method' which oversamples failures relative to base rates). The qualitative claim that CoT hallucinates 56% of the time vs ReAct 0% is a strong one and deserves a more rigorous protocol description.
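The bootstrap confidence interval requested in the third comment is straightforward to compute from per-question EM scores. The sketch below is a generic percentile bootstrap, not anything from the paper; resampling is over questions, as the comment suggests.

```python
import random

def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-example 0/1 scores."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(sum(rng.choice(scores) for _ in range(n)) / n
                   for _ in range(n_boot))
    return (means[int(n_boot * alpha / 2)],
            means[int(n_boot * (1 - alpha / 2)) - 1])
```

On a 500-question subset, a ~27% EM estimate carries an interval several points wide, which is why a 2.0-point gap needs this treatment before ordering claims are made.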
minor comments (8)
- [§2] The augmented action space Â = A ∪ L is introduced as a conceptual device, but L is unbounded and the policy π is left abstract. A sentence clarifying that in practice π is the LLM's next-token distribution conditioned on the full trajectory, and that 'thought vs action' is determined by a parsing convention on the decoded text, would help readers reproduce the system.
- [Figure 1] The figure uses encoded/garbled glyphs in the rendered PDF (visible as e.g. '$FW\u0003\u0014\u001d\u00037KLQN'); please re-embed the fonts so the example trajectories are legible. This is the paper's signature figure.
- [§3.1 Action Space] The Wikipedia API returns 'the first 5 sentences' of a page. This is an arbitrary cutoff that interacts with the search-error rate (23% of ReAct failures). Please state whether this was tuned, and whether longer returns hurt or help.
- [§4 WebShop] Score and SR are reported as point values without confidence intervals on 500 test instructions. A binomial CI would clarify whether the IL+RL→ReAct gap of ~11 SR points is comfortably outside noise.
- [Appendix B.1] Finetuning step counts (4000 for ReAct/Act, 1000–2000 for Standard/CoT 'because the latter degrade soon after') deserve a learning-curve plot rather than a one-line justification, since the comparison in Fig. 3 depends on this choice.
- [§5] Related work could more explicitly contrast with Inner Monologue beyond Section 4's qualitative remark — e.g., a single sentence noting that IM operates on real robotic affordances while ReAct operates on text environments would prevent over-claiming.
- [Throughout] 'Reasoning traces' and 'thoughts' are used interchangeably; pick one term and use it consistently in formal statements (Section 2) and tables.
- [Appendix A.2 / Fig. 4] The 'outdated label' anecdote on HotpotQA is interesting but only one example is shown. Either expand to a small audit (how many of the 50 'correct' ReAct samples disagree with gold for label-staleness reasons?) or label this clearly as anecdotal.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive report, and for the recommendation to accept. The major comments target presentation and methodological-rigor issues rather than the core claims, and we agree with all five. In the revision we will: (1) decompose the headline gains in the abstract into within-LLM (Act→ReAct) and cross-method components so scale effects and prompting-pattern effects are not conflated; (2) add Act(avg) and per-prompt standard deviations to Table 3, lead with the avg-vs-avg comparison, and footnote the BUTLER selection-protocol difference; (3) add bootstrap CIs and seed variance to Table 1 and Fig. 2, and be explicit about which orderings are within overlap; (4) clarify that step-budget and majority-threshold hyperparameters for the ReAct↔CoT-SC combination were set on dev, and add a sensitivity sweep; (5) document the Table 2 error-analysis protocol (author-labeled, non-blinded, stratified 50+50 sampling, percentages conditional within stratum) and run a blinded multi-annotator re-labeling with reported agreement for the camera-ready. We list as a standing objection that we cannot guarantee the exact 56%/0% hallucination figures will replicate under the stricter blinded protocol, though we expect the qualitative direction to hold.
Point-by-point responses
-
Referee: Headline framing conflates ReAct prompting contribution with the contribution of using PaLM-540B; on ALFWorld, ~8 of 34 absolute points come from 'replace small IL agent with PaLM-540B + actions only'. The abstract should report the within-LLM Act→ReAct gap alongside the cross-method gap.
Authors: We agree that the abstract framing should not let readers conflate scale effects with the prompting pattern. We will revise the abstract and §4 introduction to report the within-LLM Act→ReAct gap alongside the cross-method gap. Concretely, the revised abstract sentence will read approximately: 'On ALFWorld and WebShop, ReAct outperforms a same-LLM Act-only baseline by 26 and 10 absolute points in success rate respectively, and outperforms imitation/RL baselines by 34 and 10 points.' We will mirror this decomposition in the §4 results paragraph and in the captions of Tables 3 and 4. The within-model ablations are already the headline of our analysis (cf. §4 'On the value of internal reasoning vs. external feedback'), and we are happy to surface them more prominently in the abstract. The referee's decomposition is consistent with our own reading of the data. revision: yes
-
Referee: ALFWorld numbers mix selection criteria (best-of-6 prompt permutations for ReAct/Act/ReAct-IM, best-of-8 beam search for BUTLER, greedy decoding for ReAct). Please report Act(avg) explicitly and per-prompt variance/std, and emphasize the avg-vs-avg comparison.
Authors: The point is well taken. The most controlled comparison in our setup is indeed ReAct(avg) vs Act(avg) vs ReAct-IM(avg) under matched greedy decoding and matched prompt-permutation protocol; BUTLER's best-of-8 beam search is reported as taken from Shridhar et al. (2020b) and is not directly comparable in selection protocol. In the revision we will (i) add Act(avg) across the 6 prompt permutations to Table 3, (ii) add std across permutations for ReAct, Act, and ReAct-IM in each task category and overall, and (iii) rewrite the §4 narrative to lead with the avg-vs-avg comparison and present best-of-6 as a secondary, prompt-selection-sensitivity figure. We will also add a footnote clarifying the BUTLER selection protocol difference so the head-to-head is not over-claimed. Note: ReAct(avg)=57 and ReAct-IM(avg)=48 are already reported; the missing Act(avg) is a presentation gap we will close. revision: yes
-
Referee: Variance is not reported for HotpotQA/FEVER prompting results; with EM gaps of ~2 points (ReAct 27.4 vs CoT 29.4; ReAct→CoT-SC 35.1 vs CoT-SC 33.4), single-run numbers make it hard to judge robustness. Please report decoding-seed variance and ideally bootstrap CIs.
Authors: We agree that with gaps of ~2 EM points, point estimates alone are insufficient. In the revision we will add: (i) bootstrap 95% confidence intervals over the evaluation set (resampling questions) for all rows of Table 1, which is the appropriate measure of evaluation-set uncertainty for the deterministic-decoding methods (Standard, CoT, Act, ReAct under greedy); (ii) seed variance across at least 3 sampling seeds for the stochastic CoT-SC and ReAct↔CoT-SC entries, since these draw 21 samples at temperature 0.7. We will also add CI bands to Fig. 2. We will be transparent in the text about which orderings are statistically robust and which (e.g., ReAct vs CoT on HotpotQA EM) are within overlapping intervals; this is consistent with our existing claim that the two methods are complementary rather than that ReAct dominates CoT on HotpotQA. revision: yes
-
Referee: The CoT-SC↔ReAct fallback heuristic introduces hyperparameters (max steps 7/5; majority threshold n/2) that may have been tuned on the evaluation distribution. Clarify whether thresholds were selected on dev or test, and report sensitivity.
Authors: Thank you for raising this. To clarify: HotpotQA and FEVER have public dev sets, and our prompting evaluations are on dev (HotpotQA test labels are not public). The step budgets (7 for HotpotQA, 5 for FEVER) were chosen by inspecting the length distribution of correct ReAct trajectories on the training trajectories used to construct prompts (the footnote 'trajectories with 7 steps on HotpotQA and 5 steps on FEVER only take up 0.84% and 1.33%' refers to this). The n/2 majority threshold was not tuned; it is the natural 'no clear majority' cutoff. We acknowledge the text was not explicit on this point. In the revision we will (i) state explicitly which split was used to set the step budget, (ii) add a sensitivity sweep over step budget {3,5,7,10} and over the majority threshold for the combined methods, reported on dev, and (iii) note that all reported numbers correspond to a single fixed (budget, threshold) chosen before the final evaluation runs. If sensitivity is large, we will report the worst-case as well as the chosen-config result. revision: yes
-
Referee: The Table 2 human error analysis methodology is under-specified: number of annotators, inter-annotator agreement, blinding, and sampling protocol (50 correct + 50 incorrect oversamples failures). The CoT-hallucinates-56%-vs-ReAct-0% claim deserves a rigorous protocol description.
Authors: The referee is correct that Table 2 is under-documented. To be transparent about what was actually done: the labeling was performed by the authors, with the categories defined collaboratively before labeling and trajectories labeled with method identity visible (i.e., not blinded). Sampling was stratified 50 correct + 50 incorrect per method (200 total) precisely to study failure modes, not to estimate population-level rates; the percentages in Table 2 are conditional within the success or failure stratum, not marginal. We will revise the table caption and §3.3 text to (i) state the labeling protocol and stratified sampling explicitly, (ii) reframe the 56% / 0% numbers as conditional on the failure stratum and as labeler-author estimates rather than population rates, and (iii) note the absence of blinding as a limitation. For the camera-ready/extended version we will additionally run a blinded re-labeling with two independent annotators on a fresh sample and report Cohen's κ; we cannot promise the original 56%/0% figures will exactly replicate, but the qualitative direction (CoT more prone to fact hallucination, ReAct more prone to reasoning loops and uninformative searches) is robust in our reading of the trajectories. revision: partial
- A fully blinded, multi-annotator re-labeling of the HotpotQA error analysis (Table 2) with reported inter-annotator agreement was not part of the original submission, and we cannot guarantee the exact 56%/0% hallucination figures will replicate under that stricter protocol. We will run the blinded study and report results, but the original numbers should be read as author-labeled, non-blinded estimates conditional on the stratified sample.
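The sensitivity sweep promised in the fourth response amounts to a small grid. The sketch below is hypothetical: `evaluate` is a stand-in for a full combined-method run, and the particular budget and threshold values are illustrative choices around the paper's defaults (step budget 7, threshold n/2 of 21 samples).

```python
def sensitivity_sweep(evaluate, budgets=(3, 5, 7, 10), thresholds=(8, 11, 14)):
    """Grid over (step budget, vote threshold); `evaluate` is user-supplied."""
    return {(b, t): evaluate(max_steps=b, majority_threshold=t)
            for b in budgets for t in thresholds}
```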
Circularity Check
No meaningful circularity: ReAct's claims are evaluated on held-out external benchmarks against independent baselines, with within-LLM ablations doing the load-bearing work.
specific steps
-
self-citation (load-bearing?)
[Section 4 (WebShop), Table 4]
"We compare to an imitation learning (IL) method trained with 1,012 human annotated trajectories, and a imitation + reinforcement learning (IL + RL) method additionally trained with 10,587 training instructions. ... IL/IL+RL taken from Yao et al. (2022)."
WebShop environment and the IL/IL+RL baselines are from authors' own prior work (Yao et al. 2022). This is not load-bearing circularity because the metric (success rate, score) is independently defined and the baselines are reported numbers, not values fitted to make ReAct look good. Noted only for completeness; does not raise the score meaningfully.
-
other
[Section 3.2, 'Finetuning' paragraph and Figure 3]
"we consider a bootstraping approach similar to Zelikman et al. (2022), using 3,000 trajectories with correct answers generated by ReAct (also for other baselines) to finetune smaller language models (PaLM-8/62B) to decode trajectories"
Finetuning data is filtered by ReAct's own correctness on HotpotQA, so the training distribution is method-shaped. However, the same bootstrapping is applied symmetrically to Standard/CoT/Act baselines, and evaluation EM is against external HotpotQA labels, so within-method comparisons remain fair. Mild self-loop, not load-bearing for the headline claim.
full rationale
The paper's central claim — that interleaving free-form reasoning traces with environment actions improves task performance over reasoning-only or acting-only prompting — is evaluated on four external benchmarks (HotpotQA, FEVER, ALFWorld, WebShop) using metrics (EM, accuracy, success rate) that are defined independently of the method. Baselines include CoT/CoT-SC (Wei et al. 2022; Wang et al. 2022a), Standard prompting, Act-only, BUTLER (Shridhar et al. 2020b), and IL/IL+RL (Yao et al. 2022). None of these are derived from ReAct's own outputs, so there is no self-definitional loop and no fitted-input-called-prediction pattern. Self-citation exists (WebShop is from Yao et al. 2022, an author's prior work), but it provides the environment, not the evaluation criterion or the baseline numbers being beaten — the IL/IL+RL baselines and the score/SR metric come from that prior work and are used as external comparators. This is normal benchmark reuse, not load-bearing self-citation. The reader's skeptical decomposition (that the headline "+34" on ALFWorld conflates ~+26 method gain with ~+8 scale gain over BUTLER, and that ALFWorld headline is best-of-6 prompt permutations) is a valid concern about claim framing and experimental fairness, but it is not circularity. It is a confound/presentation issue: the paper does report ReAct(avg)=57 vs. Act(best-of-6)=45, which is an honest within-LLM comparison and not self-referential. The HotpotQA finetuning experiment (Section 3.2, "bootstrapping approach... using 3,000 trajectories with correct answers generated by ReAct... to finetune smaller language models") trains on ReAct's own correct trajectories and then evaluates ReAct-style decoding. 
This is a mild self-loop in that the training distribution is method-shaped, but (a) evaluation is still on held-out HotpotQA EM against the same external label set, (b) all four methods (Standard, CoT, Act, ReAct) are finetuned on their own correct trajectories symmetrically, so the comparison is internally fair, and (c) the result is not presented as the headline claim. This warrants noting but not a high score. Overall: derivation chain is empirical, baselines are external, metrics are external. Score 1.
Axiom & Free-Parameter Ledger
free parameters (3)
- Few-shot demonstrations per task: 6 (HotpotQA), 3 (FEVER), 3 per task type × 6 permutations (ALFWorld), 1–2 (WebShop)
- Max ReAct steps before fallback to CoT-SC: 7 (HotpotQA), 5 (FEVER)
- CoT-SC sample count and temperature: 21 samples, T = 0.7
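For reference, the free parameters above can be collected into a single structure; the values are those reported in this review, and the dictionary layout itself is purely illustrative.

```python
REACT_FREE_PARAMS = {
    "few_shot_examples": {
        "HotpotQA": 6,
        "FEVER": 3,
        "ALFWorld": "3 per task type x 6 permutations",
        "WebShop": "1-2",
    },
    "max_react_steps": {"HotpotQA": 7, "FEVER": 5},  # fallback trigger
    "cot_sc": {"n_samples": 21, "temperature": 0.7},
}
```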
axioms (2)
- domain assumption: A frozen pretrained LLM has sufficient latent reasoning ability that few-shot demonstrations of thought+action suffice to elicit useful agent behavior.
- domain assumption: Benchmark accuracy on HotpotQA/FEVER/ALFWorld/WebShop measures the intended capability ('reasoning + acting').
invented entities (1)
- Augmented action space Â = A ∪ L (language thoughts as no-op actions)
independent evidence
Forward citations
Cited by 60 Pith papers
-
Continual Harness: Online Adaptation for Self-Improving Foundation Agents
Continual Harness automates online self-improvement for foundation-model embodied agents by refining prompts, sub-agents, skills, and memory within one run, cutting button-press costs on Pokemon Red and Emerald and cl...
-
SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning
SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.
-
ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts
ShadowMerge poisons graph-based agent memory by creating relation-channel conflicts that get extracted and retrieved, achieving 93.8% attack success rate on Mem0 and datasets like PubMedQA while evading prior defenses.
-
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents
AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.
-
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.
-
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Tree of Thoughts enables language models to solve complex planning tasks by generating, evaluating, and searching over coherent intermediate thoughts in a tree, raising Game of 24 success from 4% to 74% with GPT-4.
-
PAL: Program-aided Language Models
PAL improves few-shot reasoning accuracy by having LLMs generate executable programs rather than text-based chains of thought, outperforming much larger models on math and logic benchmarks.
-
From Table to Cell: Attention for Better Reasoning with TABALIGN
TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy 15.76 points over strong baselines on eight table benchmarks while speeding exec...
-
ASH: Agents that Self-Hone via Embodied Learning
ASH reaches 11.2/12 milestones in Pokemon Emerald and 9.9/12 in Zelda by self-improving via an IDM trained on its own trajectories to label internet video, while baselines plateau at roughly 6/12.
-
ProtoMedAgent: Multimodal Clinical Interpretability via Privacy-Aware Agentic Workflows
ProtoMedAgent uses a privacy-aware agentic workflow with neuro-symbolic bottlenecks to achieve 91.2% faithfulness in clinical report generation, significantly outperforming standard RAG methods on a large patient cohort.
-
Harnessing Agentic Evolution
AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, outperforming baselines by 26% relative improvement on agentic benchmarks and achieving SOTA on open-ended tasks.
-
SkillOps: Managing LLM Agent Skill Libraries as Self-Maintaining Software Ecosystems
SkillOps maintains LLM skill libraries via Skill Contracts and ecosystem graphs, raising ALFWorld task success to 79.5% as a standalone agent and improving retrieval baselines by up to 2.9 points with near-zero librar...
-
Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation
PyRAG turns multi-hop reasoning into executable Python code over retrieval tools for explicit, verifiable step-by-step RAG.
-
State-Centric Decision Process
SDP constructs a task-induced state space from raw text by having agents commit to and certify natural-language predicates as states, enabling structured planning and analysis in unstructured language environments.
-
Goal-Oriented Reasoning for RAG-based Memory in Conversational Agentic LLM Systems
Goal-Mem improves RAG memory retrieval in agentic LLMs by explicit goal decomposition and backward chaining via Natural Language Logic, outperforming nine baselines on multi-hop and implicit inference tasks.
-
Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation
Checkup2Action is a new multimodal dataset and benchmark for generating patient-oriented action cards from real-world clinical check-up reports.
-
Can a Single Message Paralyze the AI Infrastructure? The Rise of AbO-DDoS Attacks through Targeted Mobius Injection
Mobius Injection exploits semantic closure in LLM agents to enable single-message AbO-DDoS attacks achieving up to 51x call amplification and 229x latency inflation.
-
Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory
Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.
-
RFAmpDesigner: A Self-Evolving Multi-Agent LLM Framework for Automated Radio Frequency Amplifier Design
RFAmpDesigner automates RF low-noise amplifier design across 10-50 GHz and 10-80% bandwidth using a multi-agent LLM system with resource-allocation middleware and retrieval-augmented self-evolution.
-
PlantMarkerBench: A Multi-Species Benchmark for Evidence-Grounded Plant Marker Reasoning
PlantMarkerBench supplies 5,550 literature sentences annotated for plant marker gene evidence validity and type across Arabidopsis, maize, rice and tomato, showing frontier LLMs handle direct expression evidence but s...
-
The Granularity Mismatch in Agent Security: Argument-Level Provenance Solves Enforcement and Isolates the LLM Reasoning Bottleneck
PACT achieves perfect security and utility under oracle provenance by enforcing argument-level trust contracts based on semantic roles and cross-step provenance tracking, outperforming invocation-level monitors in Age...
-
When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning
State-conditioned commitment depth in a vision-language policy Pareto-dominates fixed-depth baselines on Sliding Puzzle and Sokoban, raising solve rates by up to 12.5 points while using 25% fewer actions and beating l...
-
LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models
LEAD uses online adaptive mechanisms including Potential-Scaled Instability and symmetric efficiency rewards based on correct rollouts to achieve higher accuracy-efficiency scores with substantially shorter reasoning ...
-
TacoMAS: Test-Time Co-Evolution of Topology and Capability in LLM-based Multi-Agent Systems
TacoMAS performs test-time co-evolution of agent capabilities and communication topology in LLM multi-agent systems via fast capability updates and slow meta-LLM topology edits, delivering 13.3% average gains over str...
-
SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks
SearchSkill improves exact match scores and retrieval efficiency on open-domain QA by conditioning LLM actions on skills from an evolving SkillBank updated from failure patterns via two-stage SFT.
-
Skill Drift Is Contract Violation: Proactive Maintenance for LLM Agent Skill Libraries
SkillGuard extracts executable environment contracts from LLM skill documents to detect only relevant drifts, reporting zero false positives on 599 cases, 100% precision in known-drift tests, and raising one-round rep...
-
When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory
A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.
-
Can Agents Price a Reaction? Evaluating LLMs on Chemical Cost Reasoning
LLM agents reach only 50.6% accuracy on chemical cost estimation within 25% error even with tools, dropping with noise due to parsing, pack selection, and tool-use failures.
-
Do Joint Audio-Video Generation Models Understand Physics?
Current joint audio-video generation models lack robust physical commonsense, especially during transitions and when prompted for impossible behaviors.
-
Regulating Branch Parallelism in LLM Serving
TAPER regulates LLM branch parallelism by admitting extra branches opportunistically when predicted externality fits slack, delivering 1.48-1.77x higher goodput than eager or fixed-cap baselines on Qwen3-32B while kee...
-
Stateful Agent Backdoor
A stateful backdoor for LLM agents, modeled as a Mealy machine with a decomposition framework, enables incremental malicious actions across sessions and achieves an 80-95% attack success rate on four models.
-
On Time, Within Budget: Constraint-Driven Online Resource Allocation for Agentic Workflows
MCPP is a Monte Carlo simulation-based online planner that improves the probability of agentic workflows completing successfully under explicit budget and deadline constraints compared to baselines on CodeFlow and Pro...
-
WAAA! Web Adversaries Against Agentic Browsers
Agentic browsers are vulnerable to 20 web and LLM attacks (18 implemented), exposing five failure modes across four major LLM models that require redesign before safe deployment.
-
ReaComp: Compiling LLM Reasoning into Symbolic Solvers for Efficient Program Synthesis
LLM reasoning traces can be compiled into reusable symbolic solvers that achieve high accuracy on program synthesis benchmarks at zero inference cost and transfer to other domains.
-
SADE: Symptom-Aware Diagnostic Escalation for LLM-Based Network Troubleshooting
SADE encodes a Cisco-style phase-gated diagnostic policy into LLM agents, delivering a 37-point root-cause F1 gain on the NIKA benchmark with 22 points attributable to the policy itself.
-
Executor-Side Progressive Risk-Gated Actuation for Agentic AI in Wireless Supervisory Control
PRGA gates wireless intent execution with progressive evidence stages, cutting time-to-first-safe-action by 23-27% and control-plane bytes by 52-54% on 3GPP benchmarks while rejecting all stale inputs and staying with...
-
When Alignment Isn't Enough: Response-Path Attacks on LLM Agents
A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.
-
Using LLMs in Software Design: An Empirical Study of GitHub and A Practitioner Survey
Developers use LLMs like ChatGPT mainly for knowledge acquisition and code generation at the detailed design level, reporting benefits such as better technology selection and early flaw detection alongside limitations...
-
From Static Analysis to Audience Dissemination: A Training-Free Multimodal Controversy Detection Multi-Agent Framework
AuDisAgent reformulates multimodal controversy detection as a dynamic audience dissemination process using screening, panel discussion, and arbitration agents, plus comment bootstrapping, and reports outperforming pri...
-
Theory Under Construction: Orchestrating Language Models for Research Software Where the Specification Evolves
Comet-H orchestrates LLMs via deficit-scoring prompt selection and half-life task tracking to co-evolve research software components, demonstrated by a static analysis tool reaching F1=0.768 versus a 0.364 baseline.
-
RepoDoc: A Knowledge Graph-Based Framework to Automatic Documentation Generation and Incremental Updates
RepoDoc uses a repository knowledge graph with module clustering and semantic impact propagation to generate more complete documentation 3x faster with 85% fewer tokens and handle incremental updates 73% faster than p...
-
Cutscene Agent: An LLM Agent Framework for Automated 3D Cutscene Generation
Cutscene Agent uses a multi-agent LLM system and a new toolkit for game engine control to automate end-to-end 3D cutscene generation, evaluated on the introduced CutsceneBench.
-
Incisor: Ex Ante Cloud Instance Selection for HPC Jobs
Incisor uses program analysis and frontier LLMs to select working AWS EC2 instances ex ante for 100% of first-time HPC runs of C/C++/Fortran and Python codes, cutting runtime 54% and costs 44% versus an expert-constra...
-
Empowering Autonomous Debugging Agents with Efficient Dynamic Analysis
ADI equips AI debugging agents with function-level interaction via a new execution trace structure, raising SWE-bench Verified resolution to 63.8% at $1.28 per task and delivering 6-18% gains when added to existing agents.
-
GSAR: Typed Grounding for Hallucination Detection and Recovery in Multi-Agent LLMs
GSAR is a grounding-evaluation framework for multi-agent LLMs that uses a four-way claim typology, evidence-weighted asymmetric scoring, and tiered recovery decisions to detect and mitigate hallucinations.
-
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
-
The Platform Is Mostly Not a Platform: Token Economies and Agent Discourse on Moltbook
Moltbook operates as two largely separate layers: a dominant transactional token economy using protocols like MBC-20 and a thinner discursive conversation layer with only 3.6% agent overlap.
-
Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models
A novel function hijacking attack achieves 70-100% success rates in forcing specific function calls across five LLMs on the BFCL benchmark and is robust to context semantics.
-
Memory-Augmented LLM-based Multi-Agent System for Automated Feature Generation on Tabular Data
MALMAS is a memory-augmented multi-agent LLM system that generates diverse, high-quality features for tabular data via agent decomposition, routing, and iterative memory-guided refinement.
-
SceneOrchestra: Efficient Agentic 3D Scene Synthesis via Full Tool-Call Trajectory Generation
SceneOrchestra trains an orchestrator to generate full tool-call trajectories for 3D scene synthesis and uses a discriminator during training to select high-quality plans, yielding state-of-the-art results with lower runtime.
-
Exploring Agentic Visual Analytics: A Co-Evolutionary Framework of Roles and Workflows
A survey of 55 agentic VA systems proposes a co-evolutionary framework defining four agent roles (PLANNER, CREATOR, REVIEWER, CONTEXT MANAGER) mapped to visual analytics pipeline stages along with design guidelines.
-
SAT: Sequential Agent Tuning for Coordinator Free Plug and Play Multi-LLM Training with Monotonic Improvement Guarantees
SAT trains multi-LLM teams with sequential block updates to deliver monotonic gains and plug-and-play model swaps that provably improve performance bounds.
-
Feedback-Driven Execution for LLM-Based Binary Analysis
FORGE uses a reasoning-action-observation loop and Dynamic Forest of Agents to perform scalable LLM-based binary analysis, finding 1,274 vulnerabilities across 591 of 3,457 real-world firmware binaries at 72.3% precis...
-
IE as Cache: Information Extraction Enhanced Agentic Reasoning
IE-as-Cache repurposes information extraction as a dynamic cognitive cache to improve the agentic reasoning accuracy of LLMs on challenging benchmarks.
-
Credo: Declarative Control of LLM Pipelines via Beliefs and Policies
Credo proposes representing LLM agent state as beliefs and regulating pipeline behavior with declarative policies stored in a database for adaptive, auditable control.
-
IoT-Brain: Grounding LLMs for Semantic-Spatial Sensor Scheduling
IoT-Brain uses a neuro-symbolic Spatial Trajectory Graph to ground LLMs for verifiable semantic-spatial sensor scheduling, achieving 37.6% higher task success with lower resource use on a campus-scale benchmark.