arxiv: 2605.17352 · v1 · pith:ZYZJCU5Onew · submitted 2026-05-17 · 💻 cs.CL

AMATA: Adaptive Multi-Agent Trajectory Alignment for Knowledge-Intensive Question Answering

Pith reviewed 2026-05-20 13:39 UTC · model grok-4.3

classification 💻 cs.CL
keywords multi-agent trajectory alignment · knowledge-intensive QA · preference optimization · factual grounding · LLM hallucinations · direct preference optimization · agent collaboration · external knowledge integration

The pith

AMATA aligns multi-agent trajectories to external knowledge for more factually consistent answers on complex questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AMATA to address hallucinations and knowledge gaps in LLMs during knowledge-intensive question answering. It uses six specialized agents that collaborate on structured actions while dynamically pulling in external knowledge. The core move is to treat this collaboration as a trajectory preference alignment task that includes question-aware customization of agents and harmonization of their preferences. Two techniques carry the work: intra-trajectory preference learning that prioritizes key agents for each objective, and inter-agent dependency learning that models tool-use dependencies with a dependency-aware direct preference optimization method. The result is reported outperformance over baselines and reduced token consumption on five standard benchmarks.

Core claim

AMATA is an Adaptive Multi-Agent Trajectory Alignment framework that dynamically integrates external knowledge to improve response interpretability and factual grounding. Our architecture leverages six specialized agents that collaboratively perform structured actions for complex question reasoning. We formalize multi-agent collaboration with external tools as a trajectory preference alignment problem, incorporating question-aware agent customization and inter-agent preference harmonization. AMATA introduces two principal innovations: (1) Intra-Trajectory Preference Learning, which learns objective-oriented preferences to prioritize critical agents, and (2) Inter-Agent Dependency Learning, a …

What carries the argument

Trajectory preference alignment of six specialized agents, realized through Intra-Trajectory Preference Learning and Inter-Agent Dependency Learning via dependency-aware direct preference optimization.
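
For orientation, the standard direct preference optimization loss (the NeurIPS paper listed in the reference graph as "Direct preference optimization: Your language model is secretly a reward model") can be sketched as below. This is plain DPO, not AMATA's dependency-aware variant, whose loss is not reproduced on this page. The function names, the scalar log-probability inputs, and the default `beta=0.1` are illustrative assumptions.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed default; the paper's setting is not given here
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed token log-probability of a full response
    (here it would be a full agent trajectory) under the trainable policy
    or the frozen reference model.
    """
    # Implicit rewards: how much more the policy likes a response than the reference does.
    chosen_reward = policy_logp_chosen - ref_logp_chosen
    rejected_reward = policy_logp_rejected - ref_logp_rejected
    # The loss shrinks as the chosen response's implicit reward exceeds the rejected one's.
    return -log_sigmoid(beta * (chosen_reward - rejected_reward))
```

When the policy equals the reference model the margin is zero and the loss is log 2; raising the policy's log-probability of the chosen trajectory lowers it.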

If this is right

  • Responses achieve higher factual consistency by grounding in external knowledge through aligned agent trajectories.
  • Interpretability improves because each agent's actions and dependencies are made explicit in the optimized trajectory.
  • Token consumption drops relative to unaligned or single-agent LLM systems on the same tasks.
  • The method scales to complex reasoning questions by letting agents specialize and coordinate via learned preferences.
  • Performance exceeds both plain LLMs and prior knowledge-augmented or trajectory-based systems on five established benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same preference-alignment pattern could be tested on multi-agent setups for tasks such as long-form summarization or tool-using planning.
  • Dependency learning might surface reusable patterns of agent-tool interaction that transfer across different LLM backbones.
  • If the inter-agent harmonization step is removed, performance would likely fall between the full AMATA system and simpler baselines.
  • The approach suggests a general route for reducing hallucinations by treating agent outputs as preference-ranked trajectories rather than free generation.

Load-bearing premise

Formalizing multi-agent collaboration with external tools as a trajectory preference alignment problem, combined with question-aware customization and inter-agent harmonization, will produce measurable gains in factual grounding and interpretability.
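
The paper text reproduced alongside Figure 2 defines a trajectory-wise objective as a sum of negative log-probabilities over trajectory steps, each conditioned on the preceding steps and the question Q. A minimal sketch of that sum, assuming the per-step conditional probabilities have already been obtained from a model (the function name and input format are hypothetical):

```python
import math
from typing import Sequence


def trajectory_nll(step_probs: Sequence[float]) -> float:
    """Sum of -log Pr(t_i | t_<i, Q) over the steps of one trajectory.

    step_probs[i] is assumed to be the model's probability of step i
    given all earlier steps and the question.
    """
    if any(p <= 0.0 or p > 1.0 for p in step_probs):
        raise ValueError("each step probability must lie in (0, 1]")
    return sum(-math.log(p) for p in step_probs)
```

A trajectory whose steps are all assigned probability 1 has zero loss; any less certain step adds a positive term.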

What would settle it

Running AMATA head-to-head against the listed baselines on the five QA benchmarks and finding no consistent gains in accuracy or no reduction in token use would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.17352 by Chen Chen, Chengyu Wang, Dongyang Li, Jiuheng Wan, Qizhou Chen, Richang Hong, Taolin Zhang, Xiaofeng He.

Figure 1: Comparison of multi-agent paradigms for knowledge-intensive QA.
Figure 2: Comparison between AMATA and standard SFT and DPO pipelines. AMATA optimizes intra-trajectory preferences and inter-agent dependencies through adaptive prefix scoring (left) and DA-DPO (right).
… these special tokens to coordinate agent behaviors (Kwon et al., 2024; Tang et al., 2025; Yue et al., 2025). The trajectory-wise objective function is defined as $L(T) = \sum_{i=1}^{T} -\log \Pr(t_i \mid t_{<i}, Q)$, where $t_i$ = (h…
Figure 3: Two-stage training examples of AMATA. Detailed annotation process and robustness verification for the DEPENDENCY SCORES are provided in Appendix A.1.2.
Results table excerpt. Columns: HealthQA Acc. | ARC-C Acc. | PopQA Acc. | Squad1 Acc. | ASQA Str_EM | ASQA Rouge-L | ASQA Mauve | Average.
Vanilla QA Methods
  Alpaca2 7B: 44.78 (±1.2) | 36.43 (±1.5) | 25.58 (±0.8) | 11.50 (±1.1) | 14.42 (±1.6) | 28.72 (±2.1) | 51.24 (±0.9) | 30.38 (±1.1)
  Mistral-Instruct 7B: 65.45 (±1.4) | 57…
Figure 4: Agent dependency analysis across preference …
Figure 6: Performance of LLM-based trajectory meth…
Figure 7: Examples of different preference scores for …
Figure 8: Collection of “Filter” trajectory data.
… requiring long-form responses. Fluency is assessed using Mauve, and accuracy is measured with Str_EM and Rouge-L, consistent with official evaluation settings.
A.2 Baselines
A.2.1 Vanilla QA Methods
LLMs acquire extensive factual knowledge, internalized within their model parameters through large-scale unsupervised pre-training. During both training and inference, …
Figure 10: Intra-trajectory score collection.
… Qwen2.5-7B-Instruct (Qwen Team, 2024), and Alpaca2-7B.
A.2.2 Knowledge-Augmented Methods
We implement standard knowledge augmentation approaches. When model weights are unavailable, methods are replicated using the same base models and training data. Uniform retrieval models and knowledge bases ensure experimental fairness.
  • REPLUG (Shi et al., 2024) uses frozen LLM p…
Figure 11: Inter-trajectory score collection.
User Instruction: Given some answer candidates, choose the best answer choice. You can use the following agents in trajectory including . Is the following statement correct or not? Say true if it's correct; otherwise say false. Roche's schizophrenia drug misses goal in two late-stage trials.
Intent Reconstructor: Roche's schizophrenia drug misses goal in two late-stage tr…
Figure 12: Complete response example from PubHealth.
Figure 13: Comparison of training data size and computational cost for various baselines and our …
Figure 15: Impact of selecting the …
Figure 16: Complete response example from MMAgent.
C Inference
Algorithm 1 gives an overview of inference in our AMATA framework. During inference, AMATA first analyzes the query to determine whether external knowledge is required. If not, it directly generates and verifies the answer. Otherwise, it retrieves and filters relevant documents, then generates a grounded response. The answer is subsequently verified; if …
Figure 17: Complete response example from SMART.
User Instruction: Given some answer candidates, choose the best answer choice. You can use the following agents in trajectory including . Is the following statement correct or not? Say true if it's correct; otherwise say false. Roche's schizophrenia drug misses goal in two late-stage trials.
Intent Reconstructor: Roche's schizophrenia drug misses goal in two late-stage tr…
Figure 18: Complete response example from GiGPO
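
The inference procedure described in the appendix text reproduced alongside Figure 16 (decide whether external knowledge is needed, optionally retrieve and filter, generate, then verify) can be sketched as follows. Every name here is hypothetical: the agents are passed in as plain callables, and because the source text is cut off before it says what happens when verification fails, the sketch only reports the verification result rather than guessing a retry policy.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class InferenceResult:
    answer: str
    verified: bool
    documents: List[str] = field(default_factory=list)


def run_inference(
    query: str,
    needs_knowledge: Callable[[str], bool],
    retrieve: Callable[[str], List[str]],
    filter_docs: Callable[[str, List[str]], List[str]],
    generate: Callable[[str, List[str]], str],
    verify: Callable[[str, str, List[str]], bool],
) -> InferenceResult:
    """One pass of the described control flow; all callables are stand-ins for agents."""
    documents: List[str] = []
    if needs_knowledge(query):
        # Knowledge path: retrieve candidates, then keep only the relevant ones.
        documents = filter_docs(query, retrieve(query))
    # With no documents this is direct generation; otherwise it is grounded generation.
    answer = generate(query, documents)
    # The behavior after a failed verification is elided in the source, so it is not modeled.
    return InferenceResult(answer=answer, verified=verify(query, answer, documents), documents=documents)
```

Passing the agents as callables keeps the sketch independent of any particular model or retriever; a caller can substitute stubs to trace which branch a query takes.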
Original abstract

Despite substantial advances in large language models (LLMs), generating factually consistent responses for knowledge-intensive question answering remains challenging. These difficulties are primarily due to hallucinations and the limitations of LLMs in bridging long-tail knowledge gaps. To address this, we propose AMATA, an Adaptive Multi-Agent Trajectory Alignment framework that dynamically integrates external knowledge to improve response interpretability and factual grounding. Our architecture leverages six specialized agents that collaboratively perform structured actions for complex question reasoning. We formalize multi-agent collaboration with external tools as a trajectory preference alignment problem, incorporating question-aware agent customization and inter-agent preference harmonization. AMATA introduces two principal innovations: (1) Intra-Trajectory Preference Learning, which learns objective-oriented preferences to prioritize critical agents, and (2) Inter-Agent Dependency Learning, which captures cross-agent tool dependencies through a novel dependency-aware direct preference optimization technique. Empirical results show that AMATA consistently outperforms baseline approaches, knowledge-augmented frameworks, and LLM-based trajectory systems on five established knowledge-intensive QA benchmarks. Further analysis demonstrates the efficiency of our method in reducing token consumption.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces AMATA, an Adaptive Multi-Agent Trajectory Alignment framework for knowledge-intensive question answering. It deploys six specialized agents that collaboratively execute structured actions and integrate external knowledge, formalizing multi-agent collaboration with tools as a trajectory preference alignment problem that includes question-aware agent customization and inter-agent harmonization. Two core innovations are presented: Intra-Trajectory Preference Learning to prioritize critical agents via objective-oriented preferences, and Inter-Agent Dependency Learning that models cross-agent tool dependencies with a dependency-aware direct preference optimization (DPO) variant. The central empirical claim is consistent outperformance over baseline approaches, knowledge-augmented frameworks, and LLM-based trajectory systems across five established knowledge-intensive QA benchmarks, accompanied by reduced token consumption.

Significance. If the performance margins can be rigorously attributed to the proposed preference-alignment components rather than simply to agent count or tool access, the work could advance multi-agent reasoning for factual QA by offering a structured preference-based formalization of collaboration. The emphasis on interpretability and efficiency is timely, though the absence of independent verification mechanisms or parameter-free derivations limits the immediate theoretical impact.

major comments (2)
  1. [Experimental Results] The experimental evaluation reports end-to-end gains on five QA benchmarks but provides no ablation that removes either Intra-Trajectory Preference Learning or Inter-Agent Dependency Learning while retaining the same six-agent architecture and tool budget. Without such controls it is impossible to determine whether the headline outperformance is driven by the trajectory-alignment formalization or by the increased agent count alone.
  2. [Method] The dependency-aware DPO formulation is introduced as a novel technique for capturing inter-agent tool dependencies, yet the manuscript supplies neither the explicit loss function nor the algorithmic procedure for constructing the dependency graph, preventing assessment of whether the method differs substantively from standard DPO or multi-agent RL baselines.
minor comments (2)
  1. [Abstract and Experimental Results] The five knowledge-intensive QA benchmarks are referenced only generically in the abstract and results; explicit dataset names, sizes, and splits should be stated to allow direct comparison with prior work.
  2. [Experimental Results] Performance tables lack error bars, standard deviations, or statistical significance tests, making it difficult to judge whether the reported margins are reliable.
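
One common way to address the significance question raised in minor comment 2 is a paired bootstrap over per-example scores. The sketch below is a generic illustration, not a procedure attributed to the paper; the function name, the default of 2000 resamples, and the fixed seed are assumptions.

```python
import random
from typing import Sequence


def paired_bootstrap(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 2000,
    seed: int = 0,
) -> float:
    """Fraction of resamples in which system A does NOT beat system B.

    scores_a[i] and scores_b[i] must be scores of the two systems on the
    same example i (for instance 1.0 for a correct answer, 0.0 otherwise).
    A small return value suggests A's advantage is stable under resampling.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) == 0:
        raise ValueError("score lists must be non-empty and the same length")
    rng = random.Random(seed)
    n = len(scores_a)
    not_better = 0
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping the pairing intact.
        indices = [rng.randrange(n) for _ in range(n)]
        delta = sum(scores_a[i] - scores_b[i] for i in indices)
        if delta <= 0:
            not_better += 1
    return not_better / n_resamples
```

Because the same resampled indices are applied to both systems, per-example difficulty cancels out, which is why the paired form is usually preferred when two systems are scored on an identical test set.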

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below.

Point-by-point responses
  1. Referee: [Experimental Results] The experimental evaluation reports end-to-end gains on five QA benchmarks but provides no ablation that removes either Intra-Trajectory Preference Learning or Inter-Agent Dependency Learning while retaining the same six-agent architecture and tool budget. Without such controls it is impossible to determine whether the headline outperformance is driven by the trajectory-alignment formalization or by the increased agent count alone.

    Authors: We agree that additional controls are needed to isolate the contribution of the preference-alignment components. In the revised manuscript we will add ablations that disable Intra-Trajectory Preference Learning and Inter-Agent Dependency Learning one at a time while keeping the identical six-agent architecture and tool budget. Revision: yes.

  2. Referee: [Method] The dependency-aware DPO formulation is introduced as a novel technique for capturing inter-agent tool dependencies, yet the manuscript supplies neither the explicit loss function nor the algorithmic procedure for constructing the dependency graph, preventing assessment of whether the method differs substantively from standard DPO or multi-agent RL baselines.

    Authors: We acknowledge the omission. The revised manuscript will include the full mathematical definition of the dependency-aware DPO loss and a step-by-step description (with pseudocode) of the dependency-graph construction procedure, making explicit how the approach differs from standard DPO. Revision: yes.
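
The ablation design discussed in the first exchange (switch the two learning components off while holding the six-agent architecture fixed) can be enumerated as a small configuration grid. The component keys and the dictionary layout are hypothetical, and the grid also includes the variant with both components disabled, which goes one step beyond the one-at-a-time ablations the simulated authors promise.

```python
from itertools import product
from typing import Dict, List, Union

# Hypothetical switch names for the two components named in the paper.
COMPONENTS = (
    "intra_trajectory_preference_learning",
    "inter_agent_dependency_learning",
)


def ablation_grid(num_agents: int = 6) -> List[Dict[str, Union[int, bool]]]:
    """Enumerate every on/off combination of the components at a fixed agent count."""
    configs: List[Dict[str, Union[int, bool]]] = []
    for flags in product((True, False), repeat=len(COMPONENTS)):
        config: Dict[str, Union[int, bool]] = {"num_agents": num_agents}
        config.update(zip(COMPONENTS, flags))
        configs.append(config)
    return configs
```

The first entry is the full system and the last has both components disabled; comparing all variants at the same agent count separates the effect of the alignment components from the effect of simply having six agents.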

Circularity Check

0 steps flagged

No circularity: empirical claims rest on external benchmarks, not self-referential reduction

full rationale

The paper introduces a modeling choice to formalize multi-agent tool use as trajectory preference alignment and reports end-to-end gains on five standard QA benchmarks. No equation, prediction, or central claim is shown to equal its own fitted inputs or prior self-citation by construction. The two learning objectives are presented as architectural innovations whose value is assessed against independent baselines and datasets rather than being tautological with the evaluation metric itself. This is the normal case of a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that multi-agent tool use can be usefully recast as a preference alignment problem; no free parameters or new physical entities are introduced.

axioms (1)
  • Domain assumption: Multi-agent collaboration with external tools can be formalized as a trajectory preference alignment problem that benefits from question-aware customization and inter-agent harmonization.
    Directly stated in the abstract as the formalization step underlying the two principal innovations.

pith-pipeline@v0.9.0 · 5739 in / 1136 out tokens · 39429 ms · 2026-05-20T13:39:00.594234+00:00 · methodology


Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 3 internal anchors

  1. [1]

    PubHealthTab: A public health table-based dataset for evidence-based fact checking. In NAACL. Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2024. Self-rag: Learning to retrieve, generate, and critique through self-reflection. In ICLR. Ralph Allan Bradley and Milton E. Terry. 1952. Rank analysis of incomplete block designs: I. the ...

  2. [2]

    Group-in-Group Policy Optimization for LLM Agent Training

    Group-in-group policy optimization for LLM agent training. CoRR, abs/2505.10978. Pengyu Gao, Jinming Zhao, Xinyue Chen, and Yilin Long. 2025. An efficient context-dependent memory framework for llm-centric agents. In NAACL, pages 1055–1069. Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen

  3. [3]

    In EMNLP, pages 6465–6488

    Enabling large language models to generate text with citations. In EMNLP, pages 6465–6488. Xinming Hou, Mingming Yang, Wenxiang Jiao, Xing Wang, Zhaopeng Tu, and Wayne Xin Zhao. 2024. Coact: A global-local hierarchy for autonomous agent collaboration. CoRR, abs/2406.13381. Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, ...

  4. [4]

    In ACL, pages 74–90

    Gentranslate: Large language models are generative multilingual speech and machine translators. In ACL, pages 74–90. Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. 2025. A survey on hallucination in large language models: Principles, taxonomy, challenge...

  5. [5]

    Decomposed prompting: A modular approach for solving complex tasks. In ICLR. Dorde Klisura, Astrid R. Bernaga Torres, Anna Karen Gárate-Escamilla, Rajesh Roshan Biswal, Ke Yang, Hilal Pataci, and Anthony Rios. 2025. A multi-agent framework for mitigating dialect biases in privacy policy question-answering systems. CoRR, abs/2506.02998. Bevan Koopman, Ahmed...

  6. [6]

    Agask: an agent to help answer farmer’s questions from scientific documents. Int. J. Digit. Libr., 25(4):569–584. Teyun Kwon, Norman Di Palo, and Edward Johns. 2024. Language models as zero-shot trajectory generators. IEEE Robotics Autom. Lett., 9(7):6728–6735. Dongyang Li, Junbing Yan, Taolin Zhang, Chengyu Wang, Xiaofeng He, Longtao Huang, Hui Xue, and...

  7. [7]

    In NeurIPS

    Direct preference optimization: Your language model is secretly a reward model. In NeurIPS. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. In EMNLP, pages 2383–2392. Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. Deepspeed. In KDD. Ohad Rubin, Jon...

  8. [8]

    Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG

    Learning to retrieve prompts for in-context learning. In NAACL, pages 2655–2671. Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Richard James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. 2024. REPLUG: retrieval-augmented black-box language models. In NAACL, pages 8371–8384. Aditi Singh, Abul Ehtesham, Saket Kumar, and Tala Talaei Khoei. 2...

  9. [9]

    Zhang, M

    R^2AG: Incorporating retrieval information into retrieval augmented generation. In EMNLP, pages 11584–11596. Shengbin Yue, Siyuan Wang, Wei Chen, Xuanjing Huang, and Zhongyu Wei. 2025. Synergistic multi-agent framework with trajectory learning for knowledge-intensive tasks. In AAAI, pages 25796–25804. Zhenrui Yue, Huimin Zeng, Lanyu Shang, Yifan Liu, Yang ...

  10. [10]

    Which American actor played fraternity president “Lewis Skol

    Triad: A framework leveraging a multi-role llm-based agent to solve knowledge base question answering. In EMNLP, pages 1698–1710. Longwei Zou, Qingyang Wang, Han Zhao, Jiangangkong Jiangangkong, Yi Yang, and Yangdong Deng. 2024. CQIL: inference latency optimization with concurrent computation of quasi-independent layers. In ACL, pages 7293–7307. A Detaile...

  11. [11]

    (2) SQuAD (Rajpurkar et al., 2016) includes 8,886 queries written by annotators based on documents

    contains 1,399 long-tail, rare-entity queries from Wikipedia. (2) SQuAD (Rajpurkar et al., 2016) includes 8,886 queries written by annotators based on documents. Following prior work (Asai et al., 2024), performance is evaluated using exact match (EM). • Ambiguous QA: ASQA (Gao et al., 2023) features 4,132 ambiguous factual questions Instruction Winning ...

  12. [12]

    xxxxxxxx </eor> <Filter>

  13. [13]

    Filter ” Trajectory Data Collection Figure 8: Collection of “Filter

    the entir documents <eof> <Locator> [Relevant]: [1] sentences [Irrelevant]:[2] Lacking Supporting Facts. [Irrelevant]:[3] Lacking Supporting Facts. <eol> “Filter ” Trajectory Data Collection Figure 8: Collection of “Filter” trajectory data. requiring long-form responses. Fluency is assessed using Mauve, and ...

  14. [14]

    xxxxxxxx </eor> <Filter>

  15. [15]

    Verifier

    the entir documents <eof> <Locator> [Relevant]:[1] some sentence [Irrelevant]:[2] Lacking Supporting Facts. [Irrelevant]:[3] Lacking Supporting Facts. <eol> <Generator> [Cite]: [1] </eog> <Verifier> The answer is correct. </eov> “Verifier” Trajectory Data Collection Figure 9: Collection of “Verifier” traject...

  16. [16]

    Who was born earlier, person A or person B?

    to score each step, providing intermediate human feedback as accurate as possible. The final reward setting matches that used for GiGPO. A.3.2 Evaluation Details For the two additional agents, knowledge filter (⟨Filter⟩) and verifier (⟨Verifier⟩), if ⟨Filter⟩ outputs retrieved document indices not present in ⟨Retriever⟩, we remove them. If all indices...