FAMA: Failure-Aware Meta-Agentic Framework for Open-Source LLMs in Interactive Tool Use Environments
Pith reviewed 2026-05-07 16:21 UTC · model grok-4.3
The pith
A two-stage failure-analysis process lets open-source LLM agents correct common mistakes with targeted context, improving performance by up to 27 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FAMA first mines failure trajectories produced by ordinary agents to locate the most frequent errors, then uses an orchestration layer that selects a minimal subset of specialized agents; those agents inject focused context into the primary tool-use agent immediately before its decision step. Experiments on open-source LLMs across multiple evaluation modes report gains reaching 27 percent over the same baselines run without the meta layer.
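As a rough illustration of the first stage, the sketch below counts error labels over a set of failed trajectories and reports the most frequent ones. The record layout, the label names, and the `most_prevalent_errors` helper are assumptions made for this example; the paper's actual error taxonomy and labeling procedure are not reproduced here.

```python
from collections import Counter

# Hypothetical records: each failed trajectory already carries an error label.
# The label names are illustrative, not the paper's taxonomy.
failed_trajectories = [
    {"task_id": "t1", "error": "wrong_tool_arguments"},
    {"task_id": "t2", "error": "skipped_precondition"},
    {"task_id": "t3", "error": "wrong_tool_arguments"},
    {"task_id": "t4", "error": "premature_termination"},
    {"task_id": "t5", "error": "wrong_tool_arguments"},
    {"task_id": "t6", "error": "skipped_precondition"},
]


def most_prevalent_errors(trajectories, top_k):
    """Return the top_k (error, share) pairs, ordered by frequency."""
    counts = Counter(t["error"] for t in trajectories)
    total = sum(counts.values())
    return [(err, n / total) for err, n in counts.most_common(top_k)]


top = most_prevalent_errors(failed_trajectories, top_k=2)
for err, share in top:
    print(f"{err}: {share:.0%} of failures")
```

Reporting shares rather than raw counts makes it easier to compare error profiles across base models that fail on different numbers of tasks.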
What carries the argument
The orchestration mechanism that activates a minimal subset of specialized context-injecting agents after failure-trajectory analysis.
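One way to picture "a minimal subset" is as a set-cover problem: given the error types judged likely at the current step, pick as few agents as possible that together address them. The sketch below uses a greedy approximation, which is not guaranteed to be minimal and is not claimed to be the paper's selection rule. The agent names, the `coverage` map, and the note strings are all hypothetical.

```python
def select_agents(risks, coverage):
    """Greedy set cover: pick agents until every coverable risk is handled."""
    remaining = set(risks)
    chosen = []
    while remaining:
        # Agent that addresses the most still-uncovered risks.
        best = max(coverage, key=lambda name: len(coverage[name] & remaining))
        gained = coverage[best] & remaining
        if not gained:
            break  # no agent addresses what is left
        chosen.append(best)
        remaining -= gained
    return chosen


def build_prompt(base_prompt, chosen, notes):
    """Prepend each chosen agent's note to the tool-use agent's prompt."""
    injected = "\n".join(f"[{name}] {notes[name]}" for name in chosen)
    return f"{injected}\n\n{base_prompt}" if injected else base_prompt


# Hypothetical agents and the error types each one is meant to address.
coverage = {
    "planner": {"skipped_precondition", "premature_termination"},
    "verifier": {"wrong_tool_arguments"},
    "policy_checker": {"policy_violation"},
}
notes = {
    "planner": "List the remaining steps before acting.",
    "verifier": "Re-check argument names against the tool schema.",
    "policy_checker": "Confirm the action is allowed by policy.",
}

chosen = select_agents({"wrong_tool_arguments", "skipped_precondition"}, coverage)
prompt = build_prompt("User: please resend my message.", chosen, notes)
```

When no risk is flagged, `select_agents` returns an empty list and the base prompt passes through unchanged, so the meta layer adds no context on steps that do not need it.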
If this is right
- Multi-turn conversational success improves for smaller open-source models that normally suffer rapid error accumulation.
- The same failure-analysis step can be reused across different base models and tasks.
- Targeted context injection at the decision point reduces cascading mistakes without expanding the main model's context window.
- Performance lifts appear in both single-turn and full conversational evaluation settings.
Where Pith is reading between the lines
- The same failure-mining step could be applied to other agent settings such as web navigation or code generation.
- Automating the collection of failure trajectories might lower the cost of the initial analysis phase.
- In changing real-world environments the specialized agents would need re-validation to keep the injected context accurate.
Load-bearing premise
The errors that appear most often in baseline runs are the ones that actually drive performance loss and can be fixed by adding context from a few extra agents without introducing new mistakes or too much overhead.
What would settle it
Running the identical open-source models on the same conversational benchmarks with and without the FAMA orchestration layer and observing either no consistent accuracy gain or a performance drop.
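A test of this kind could be scored per task, pairing each task's outcome with and without the meta layer. The sketch below applies an exact McNemar test to such pairs. The outcome vectors are made-up placeholders, not results from the paper, and `mcnemar_exact` is a name chosen for this example.

```python
from math import comb


def mcnemar_exact(baseline, treated):
    """Return (success-rate gain, two-sided exact McNemar p-value)."""
    # Discordant pairs: tasks where exactly one condition succeeded.
    b = sum(1 for x, y in zip(baseline, treated) if x and not y)
    c = sum(1 for x, y in zip(baseline, treated) if y and not x)
    gain = (sum(treated) - sum(baseline)) / len(baseline)
    n = b + c
    if n == 0:
        return gain, 1.0  # the two runs agree on every task
    k = min(b, c)
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return gain, min(1.0, 2 * tail)


# Placeholder outcomes (1 = task solved), one entry per task.
baseline = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
with_fama = [1, 1, 1, 0, 1, 1, 0, 1, 1, 0]

gain, p_value = mcnemar_exact(baseline, with_fama)
print(f"gain={gain:.2f}, p={p_value:.3f}")
```

Pairing by task matters here because only the tasks where the two conditions disagree carry information about whether the layer helps or hurts.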
Original abstract
Large Language Models are being increasingly deployed as the decision-making core of autonomous agents capable of effecting change in external environments. Yet, in conversational benchmarks, which simulate real-world customer-centric issue resolution scenarios, these agents frequently fail due to the cascading effects of incorrect decision-making. These challenges are particularly pronounced for open-source LLMs with smaller parameter sizes, limited context windows, and constrained inference budgets, which contribute to increased error accumulation in agentic settings. To tackle these challenges, we present the Failure-Aware Meta-Agentic (FAMA) framework. FAMA operates in two stages: first, it analyzes failure trajectories from baseline agents to identify the most prevalent errors; second, it employs an orchestration mechanism that activates a minimal subset of specialized agents tailored to address these failures by injecting a targeted context for the tool-use agent before the decision-making step. Experiments across open-source LLMs demonstrate performance gains up to 27% across evaluation modes over standard baselines. These results highlight that targeted curation of context through specialized agents to address common failures is a valuable design principle for building reliable, multi-turn tool-use LLM agents that simulate real-world conversational scenarios.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Failure-Aware Meta-Agentic (FAMA) framework for open-source LLMs in interactive tool-use settings. FAMA first analyzes failure trajectories collected from baseline agents to identify prevalent errors, then applies an orchestration mechanism that activates a minimal subset of specialized agents to inject targeted context before the tool-use agent's decision step. Experiments across open-source LLMs report performance gains of up to 27% over standard baselines in conversational benchmarks simulating customer issue resolution.
Significance. If the reported gains hold under the described controls, the work offers a concrete design principle for improving reliability of smaller open-source LLM agents without scaling model size or context windows. The emphasis on failure-trajectory analysis followed by targeted context curation via specialized agents addresses a practical bottleneck in multi-turn tool use. Strengths include the use of held-out evaluation sets for failure collection, ablations on orchestration overhead, and explicit controls for total token budget, which support reproducibility and isolate the contribution of the meta-agentic component.
major comments (2)
- §4.2 (experimental setup): The procedure for collecting and labeling failure trajectories from baselines is described at a high level; explicit criteria for error classification, the size of the held-out set used for trajectory analysis, and any inter-annotator agreement metrics are needed to rule out post-hoc selection effects that could inflate the 27% gains.
- §3.3 (orchestration mechanism): The rules for selecting and activating the minimal subset of specialized agents are not fully formalized; without pseudocode or decision criteria, it remains unclear how the system guarantees that injected context does not introduce new error modes or exceed the stated token-budget controls.
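For the inter-annotator agreement requested in the §4.2 comment, Cohen's kappa is one common choice when two annotators assign a single error category to each trajectory. The sketch below computes it from scratch; the label names and the two annotation lists are invented for illustration, and the paper does not state which agreement metric it would report.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(labels_a)
    # Fraction of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented annotations of four failed trajectories.
annotator_1 = ["args", "args", "plan", "plan"]
annotator_2 = ["args", "args", "plan", "args"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa={kappa:.2f}")
```

Kappa corrects raw agreement for chance, which matters when one error category dominates the failure set.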
minor comments (2)
- Table 1 and Figure 3: Axis labels and legend entries use inconsistent abbreviations for evaluation modes; standardize notation and add error bars or statistical significance markers to support the cross-model claims.
- §2 (related work): The discussion of prior meta-agentic and failure-recovery methods is brief; citing at least two additional recent works on trajectory-based agent improvement would better situate the novelty of the failure-aware specialization step.
Simulated Author's Rebuttal
We thank the referee for the positive assessment, the recommendation for minor revision, and the constructive comments that will improve the clarity and reproducibility of the work. We address each major comment below.
- Referee: §4.2 (experimental setup): The procedure for collecting and labeling failure trajectories from baselines is described at a high level; explicit criteria for error classification, the size of the held-out set used for trajectory analysis, and any inter-annotator agreement metrics are needed to rule out post-hoc selection effects that could inflate the 27% gains.
  Authors: We acknowledge that the description in §4.2 is at a high level. To address this and strengthen transparency, we will expand the section in the revised manuscript with explicit error classification criteria, the precise size of the held-out set, and inter-annotator agreement metrics from the labeling process. These additions will help rule out concerns about post-hoc selection effects while preserving the separation between the held-out trajectories and the evaluation sets.
  Revision: yes
- Referee: §3.3 (orchestration mechanism): The rules for selecting and activating the minimal subset of specialized agents are not fully formalized; without pseudocode or decision criteria, it remains unclear how the system guarantees that injected context does not introduce new error modes or exceed the stated token-budget controls.
  Authors: We agree that additional formalization of the orchestration rules in §3.3 would improve clarity. In the revision, we will add pseudocode describing the selection and activation of the minimal subset of specialized agents, along with the decision criteria. We will also elaborate on the safeguards that keep injected context within the stated token budgets and avoid introducing new error modes, including validation steps prior to injection.
  Revision: yes
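A token-budget safeguard of the kind the authors promise could be as simple as a filter applied just before injection. The sketch below keeps notes in priority order and skips any that would push the total over a fixed budget. The whitespace-based token count stands in for a real tokenizer, and the function name, notes, and budget value are hypothetical rather than taken from the paper.

```python
def fit_to_budget(notes, budget, count_tokens=lambda s: len(s.split())):
    """Keep notes in priority order, skipping any that would exceed the budget."""
    kept, used = [], 0
    for note in notes:
        cost = count_tokens(note)
        if used + cost <= budget:
            kept.append(note)
            used += cost
    return kept, used


# Hypothetical notes from specialized agents, highest priority first.
candidate_notes = [
    "Check connectivity before sending.",
    "Confirm the sender name with the user first.",
    "Verify.",
]

kept, used = fit_to_budget(candidate_notes, budget=6)
print(kept, used)
```

Skipping an oversized note instead of truncating it avoids injecting half an instruction, which could itself become a new source of error.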
Circularity Check
No significant circularity; empirical claims rest on external benchmarks
The paper describes an empirical framework with two stages—failure trajectory analysis from baselines followed by orchestration of specialized agents for context injection—then reports performance gains up to 27% on held-out evaluations across open-source LLMs. No equations, fitted parameters, or self-citations appear in the load-bearing steps; the gains are measured against independent baselines with controls for token budget and ablations. The derivation chain is self-contained because the claimed improvements are falsifiable experimental outcomes rather than quantities defined or predicted from the method's own inputs.
Axiom & Free-Parameter Ledger
invented entities (1)
- specialized agents: no independent evidence