MARFT: Multi-Agent Reinforcement Fine-Tuning

Jun Wang; Junwei Liao; Muning Wen; Weinan Zhang

arXiv 2504.16129 v5 pith:BVQGL67Q submitted 2025-04-21 cs.MA cs.AI cs.LG cs.RO

Junwei Liao, Muning Wen, Jun Wang, Weinan Zhang

classification cs.MA cs.AI cs.LG cs.RO

keywords lamas marft multi-agent reinforcement fine-tuning agent challenges marl

Large Language Model (LLM)-based Multi-Agent Systems (LaMAS) have demonstrated strong capabilities on complex agentic tasks requiring multifaceted reasoning and collaboration, from high-quality presentation generation to scientific research. Meanwhile, Reinforcement Learning (RL) is widely recognized for enhancing agent intelligence, but limited work has studied fine-tuning LaMAS with foundational RL techniques. Directly applying conventional Multi-Agent Reinforcement Learning (MARL) to LaMAS also introduces major challenges due to the unique mechanisms of LaMAS. To address these challenges, this article presents a comprehensive study of LLM-based MARL and proposes Multi-Agent Reinforcement Fine-Tuning (MARFT). We introduce Flex-MG, a new Markov Game formulation aligned with real-world LaMAS optimization, together with a universal algorithmic framework tailored to LaMAS. We review the evolution from traditional RL to Reinforcement Fine-Tuning (RFT), then analyze the multi-agent counterpart. For LaMAS, we identify key differences between classical MARL and MARFT, including asynchronous agent interactions, profile-aware agent design, and heterogeneous architectures. These differences motivate a LaMAS-oriented formulation of RFT. We present a robust and scalable MARFT framework, detail its modular algorithm, and provide an open-source implementation to support adoption and further research. The paper further discusses application perspectives and open challenges, including dynamic environment modeling, sample inefficiency, and the lack of cohesive frameworks. By connecting theoretical foundations with practical methodology, this work aims to serve as a roadmap for advancing MARFT toward resilient, adaptive, and human-aligned agentic systems. Implementation: https://github.com/jwliao-ai/MARFT.

Forward citations

Cited by 16 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Learning from Self-Debate: Preparing Reasoning Models for Multi-Agent Debate
cs.CL 2026-01 unverdicted novelty 7.0

SDRL trains LLMs via self-generated multi-path debates and joint optimization of standalone plus debate-conditioned responses to boost both single-model reasoning and multi-agent debate performance.

Where Do CoT Training Gains Land in LLM based Agents?
cs.AI 2026-06 unverdicted novelty 6.0

CoT training in LLM agents improves prompt-action quality more than the advantage of generated reasoning, and selectively masking action supervision improves out-of-domain generalization.

Traj-Evolve: A Self-Evolving Multi-Agent System for Patient Trajectory Modeling in Lung Cancer Early Detection
cs.AI 2026-06 unverdicted novelty 6.0

Traj-Evolve combines non-parametric experience retrieval and multi-agent RL with a leave-one-out unification strategy to outperform baselines on lung cancer prediction from up to five years of multimodal EHRs, includi...

AIPO: Learning to Reason from Active Interaction
cs.CL 2026-05 unverdicted novelty 6.0

AIPO adds active multi-agent consultation (Verify, Knowledge, Reasoning agents) plus custom importance sampling to RLVR training so LLMs expand their reasoning boundary and then operate without the agents.

AIPO: Learning to Reason from Active Interaction
cs.CL 2026-05 unverdicted novelty 6.0

AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, th...

Tree-based Credit Assignment for Multi-Agent Memory System
cs.MA 2026-05 unverdicted novelty 6.0

TreeMem assigns credit to agents in multi-agent memory systems by expanding outputs into a tree and using Monte Carlo averaging of final rewards to optimize each agent's policy.

Joint Optimization of Multi-agent Memory System
cs.MA 2026-03 unverdicted novelty 6.0

CoMAM jointly optimizes agents in multi-agent LLM memory systems via end-to-end RL and adaptive credit assignment to improve collaboration and performance.

Learning Decentralized LLM Collaboration with Multi-Agent Actor Critic
cs.AI 2026-01 unverdicted novelty 6.0

Multi-agent actor-critic methods with a centralized critic improve decentralized LLM collaboration over Monte Carlo baselines in long-horizon and sparse-reward settings.

Memory in the Age of AI Agents
cs.CL 2025-12 unverdicted novelty 6.0

The paper maps agent memory research via three forms (token-level, parametric, latent), three functions (factual, experiential, working), and dynamics of formation/evolution/retrieval, plus benchmarks and future directions.

Modularized Reinforcement Learning on LLMs: From MDP Creation to Exploration and Learning
cs.LG 2026-06 unverdicted novelty 5.0

Survey mapping RL techniques onto LLM training and highlighting gaps in value-based, off-policy, and bootstrapping methods.

Reinforced Collaboration in Multi-Agent Flow Networks
cs.LG 2026-05 unverdicted novelty 5.0

MANGO optimizes multi-agent LLM workflows via flow networks, RL, and textual gradients, delivering up to 12.8% higher performance and 47.4% better efficiency while generalizing to new domains.

Learning Decentralized LLM Collaboration with Multi-Agent Actor Critic
cs.AI 2026-01 conditional novelty 5.0

Centralized-critic actor-critic training (CoLLM-CC) improves sample efficiency and stability over Monte-Carlo multi-agent RL for training decentralized LLM collaboration.

A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems
cs.AI 2025-08 unverdicted novelty 5.0

A comprehensive review of self-evolving AI agents that improve themselves over time, organized via a framework of inputs, agent system, environment, and optimizers, with domain-specific and safety discussions.

Position: Agentic AI System Is a Foreseeable Pathway to AGI
cs.AI 2026-05 unverdicted novelty 4.0

Agentic AI systems with DAG topologies are claimed to deliver exponentially superior generalization and sample efficiency compared to monolithic scaling for achieving AGI.

Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces
cs.CL 2026-05 unverdicted novelty 4.0

This survey organizes RL for LLM multi-agent systems into reward families, credit units, and five orchestration sub-decisions, notes the absence of explicit stopping-decision training in its paper pool, and releases a...

Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems
cs.AI 2025-03 unverdicted novelty 2.0

This survey frames foundation agents using brain-inspired modular architectures and reviews challenges in evolution, collaboration, and safety.