DLM: Unified Decision Language Models for Offline Multi-Agent Sequential Decision Making

Bin Cheng; Bin He; Zhuohui Zhang

arxiv: 2604.23557 · v1 · submitted 2026-04-26 · 💻 cs.MA · cs.AI

DLM: Unified Decision Language Models for Offline Multi-Agent Sequential Decision Making

Zhuohui Zhang , Bin Cheng , Bin He This is my paper

Pith reviewed 2026-05-08 05:07 UTC · model grok-4.3

classification 💻 cs.MA cs.AI

keywords Decision Language ModelOffline multi-agent reinforcement learningDialogue-style sequencesZero-shot generalizationSupervised fine-tuningGroup relative policy optimizationCentralized training decentralized execution

0 comments

The pith

A single Decision Language Model unifies offline multi-agent policies by treating decisions as dialogue sequences and generalizes zero-shot across tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that multi-agent decision policies can be learned from offline data using one flexible language model instead of task-specific architectures. It does this by recasting agent interactions as dialogue-style sequences that capture context across agents during centralized training, then producing decentralized actions at execution time. The two-stage process first fine-tunes the model on these sequences to imitate trajectories, then applies a lightweight policy optimization step for robustness. If successful, this removes the need for custom observation and action formats in each new multi-agent environment. A sympathetic reader would care because it promises to make reusable policies easier to build and deploy from existing datasets.

Core claim

The Decision Language Model (DLM) formulates multi-agent sequential decision making as a dialogue-style sequence prediction problem under the centralized training with decentralized execution paradigm. DLM is trained in two stages: supervised fine-tuning on dialogue-style datasets derived from offline trajectories to generate executable actions with inter-agent context, followed by group relative policy optimization that uses lightweight reward functions to improve robustness against out-of-distribution actions. Experiments demonstrate that this unified model outperforms strong offline MARL baselines and LLM-based conversational methods while exhibiting strong zero-shot generalization to new

What carries the argument

Reformulation of multi-agent trajectories into dialogue-style sequences that enable a single LLM to perform centralized training with full inter-agent context while still supporting decentralized execution.

If this is right

One trained DLM can serve as a drop-in policy for multiple distinct multi-agent tasks without per-task redesign.
Zero-shot transfer to unseen scenarios reduces the data and engineering cost of deploying agents in new environments.
The dialogue format allows heterogeneous observation and action spaces to be handled uniformly within the same model.
The second-stage policy optimization step improves reliability when the model encounters actions outside the original offline data distribution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The dialogue framing could make it easier to incorporate human feedback or instructions directly into the decision process.
Scaling to larger numbers of agents might become limited by the length of the resulting dialogue sequences rather than by combinatorial explosion in state space.
If the approach works across domains, it suggests that language-model pretraining already encodes useful priors for coordination that traditional MARL methods must learn from scratch.

Load-bearing premise

Converting raw multi-agent trajectories into dialogue sequences and applying standard LLM training stages fully preserves the coordination dynamics without critical loss or the need for environment-specific adjustments.

What would settle it

A controlled test on a benchmark where agents must coordinate on timing or hidden state information that cannot be expressed in the chosen dialogue format, showing the DLM falling below the performance of a standard offline MARL baseline.

Figures

Figures reproduced from arXiv: 2604.23557 by Bin Cheng, Bin He, Zhuohui Zhang.

**Figure 1.** Figure 1: Training pipeline of DLM. (a) Offline data collected from online MARL algorithms is transformed into dialogue-style sequences and split into two subsets. (b) The pre-trained model is fine-tuned on the first subset via SFT to align with the decision domain, resulting in DLM-SFT. (c) DLM-SFT generates policies on the second subset, and OOD-prone samples are filtered by comparing outputs with the dataset. (d)… view at source ↗

**Figure 2.** Figure 2: Mapping SMAC environment, observations, and actions into a dialogue-style prompt, with highlighted text showing key correspondences. the next state st+1 according to the environment dynamics P : S × A → ∆(S). The system receives a global reward rt = R(st, at), and γ ∈ (0, 1) denotes the discount factor governing future returns. In the offline setting, we assume access to a fixed dataset D = {τ (k)}M k=1 co… view at source ↗

**Figure 3.** Figure 3: Training and inference frameworks for DLM. (a) Centralized training using dialogue-style trajectories with inter-agent information. (b) Decentralized inference where each agent independently generates actions based on its local trajectory. Algorithm 1 SFT Training Procedure for DLM 1: Input: Offline dataset DSFT, pre-trained parameters θinit, max length L 2: Initialize: θSFT ← θinit, tokenizer, prompts 3: … view at source ↗

**Figure 4.** Figure 4: Performance comparison with value-based offline MARL methods on representative SMAC tasks: (a)-(b) easy, (c)-(d) hard, and (e)-(f) super hard. Only the final test performance of DLM is reported. OOD Rate (%) = 𝑁!!" 𝑇($) ×100 view at source ↗

**Figure 5.** Figure 5: Comparison of OOD Rates on all SMAC tasks. 4.1. Overall Performance on SMAC Benchmark We evaluate DLM across 15 SMAC tasks. The evaluation focuses on mean test win rates as a measure of overall decision quality under decentralized partial observability. All baselines are trained on offline datasets collected following the procedure described in Sec. 3.1. For fair evaluation, all experiments are conducted … view at source ↗

**Figure 6.** Figure 6: t-SNE projection of observation distributions from the offline dataset across all SMAC tasks. 12 view at source ↗

**Figure 7.** Figure 7: An example trajectory from the ChatSMAC dataset formatted in chat format. Dialogue-Style Conversion For training DLM, we apply an additional transformation to the original offline dataset, converting it into a dialogue-style format suitable for multi-turn sequence modeling. In this process, we retain only the obs and actions fields at each timestep and organize them into observation–action pairs. Each pair… view at source ↗

**Figure 8.** Figure 8: Screenshots of SMAC tasks at different difficulty levels: (a) 2s3z (easy), (b) 2c vs 64zg (hard), and (c) corridor (super hard) view at source ↗

**Figure 9.** Figure 9: Performance comparison with baselines on remaining SMAC tasks: (a)-(c) easy, (d)-(f) hard, and (g)-(i) super hard. Only the final test performance of DLM is reported. 18 view at source ↗

**Figure 10.** Figure 10: Training curves of DLM. (a) DLM-SFT: cross-entropy loss and token accuracy over tokens. (b) DLM-GRPO: reward (top), exact match rate (middle), and penalty (bottom). For implementation, we build all value-based baselines on top of the EPyMARL framework (Papoudakis et al., 2020), which is designed for flexible MARL experimentation. For imitation-based baselines, we adapt publicly available implementations r… view at source ↗

**Figure 11.** Figure 11: Computational cost comparison. (a) Average training time per difficulty level for each baseline. (b) Total standardized GPU hours across all tasks view at source ↗

read the original abstract

Building scalable and reusable multi-agent decision policies from offline datasets remains a challenge in offline multi-agent reinforcement learning (MARL), as existing methods often rely on fixed observation formats and action spaces that limit generalization. In contrast, large language models (LLMs) offer a flexible modeling interface that can naturally accommodate heterogeneous observations and actions. Motivated by this, we propose the Decision Language Model (DLM), which formulates multi-agent decision making as a dialogue-style sequence prediction problem under the centralized training with decentralized execution paradigm. DLM is trained in two stages: a supervised fine-tuning phase, which leverages dialogue-style datasets for centralized training with inter-agent context and generates executable actions from offline trajectories, followed by a group relative policy optimization phase to enhance robustness to out-of-distribution actions through lightweight reward functions. Experiments on multiple benchmarks show that a unified DLM outperforms strong offline MARL baselines and LLM-based conversational decision-making methods, while demonstrating strong zero-shot generalization to unseen scenarios across tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DLM reframes offline MARL as LLM dialogue prediction with two-stage training, but the claim that this preserves coordination without loss needs stronger checks.

read the letter

The main takeaway is that the paper casts multi-agent offline trajectories as dialogue sequences so a single LLM can do centralized training under CTDE, then adds a GRPO stage with lightweight rewards for robustness. This formulation is new in the specific way it turns heterogeneous observations and actions into language for unified handling across tasks, rather than building separate models or fixed interfaces. The experiments reportedly show the resulting DLM beating standard offline MARL baselines and other LLM decision methods, plus decent zero-shot transfer to unseen scenarios, which is the part that could matter for people who want one policy that works on varied robotics or game domains without heavy redesign. The conversion to dialogue format is described clearly enough in the abstract to see the intent. The soft spot is exactly the one the stress-test flags: turning joint trajectories into next-token prediction on inter-agent context may drop timing, credit assignment, or synchronized coordination signals that standard MARL value functions keep explicitly. The lightweight rewards in the second stage are unlikely to fully reconstruct the original multi-agent structure, so any reported gains could partly reflect benchmark quirks rather than true unification. If the full paper has ablations on information loss during the dialogue conversion or shows that the language model still recovers the necessary joint statistics, that would address it; otherwise the generalization story stays provisional. This is worth a serious referee because the method is concrete, the claims are testable on public benchmarks, and the unification angle is a reasonable direction even if the current evidence is not yet airtight. A reader already working at the LLM-MARL intersection would get the most from it.

Referee Report

2 major / 1 minor

Summary. The paper claims to introduce the Decision Language Model (DLM) for offline multi-agent sequential decision making. It reformulates multi-agent decisions as dialogue-style sequence prediction under the centralized training with decentralized execution (CTDE) paradigm. DLM is trained in two stages: supervised fine-tuning (SFT) on dialogue-style datasets derived from offline trajectories to enable centralized training with inter-agent context, followed by group relative policy optimization (GRPO) using lightweight reward functions to improve robustness to out-of-distribution actions. Experiments on multiple benchmarks are reported to show that a unified DLM outperforms strong offline MARL baselines and LLM-based conversational decision-making methods, while demonstrating strong zero-shot generalization to unseen scenarios across tasks.

Significance. If the empirical claims hold and the dialogue reformulation preserves coordination signals, this could advance offline MARL by offering a flexible LLM-based interface for heterogeneous observations and actions, reducing reliance on fixed formats. The two-stage SFT+GRPO approach and lightweight rewards for OOD robustness are strengths that could enable better generalization. However, the significance is tempered by the need to confirm that next-token prediction on language-formatted trajectories captures joint dynamics without loss, as this underpins both outperformance and zero-shot results.

major comments (2)

[Method and Experiments] The central claim that reformulating trajectories as dialogue-style sequences under CTDE fully encodes joint dynamics, partial observability, and coordination (without domain-specific machinery) is load-bearing for the reported superiority and generalization. The SFT phase relies on inter-agent context in language form for centralized training, yet standard next-token prediction may capture only surface correlations rather than credit assignment or synchronized actions; the GRPO stage with lightweight rewards may not reconstruct the original multi-agent value structure. This requires explicit support such as ablations or coordination metrics in the method and experiments sections.
[Abstract and Experiments] The abstract asserts experimental superiority over offline MARL baselines and LLM methods plus zero-shot generalization, but the support for these claims remains unverified without details on the specific benchmarks, baseline implementations, metrics, statistical significance, and how trajectories were converted to dialogue sequences. This directly affects assessment of whether the unification succeeds or if information loss undermines the results.

minor comments (1)

Clarify notation for 'inter-agent context' and 'lightweight reward functions' to ensure they are defined consistently across sections.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to provide stronger empirical support and clearer experimental details, as these points help improve the presentation of our contributions.

read point-by-point responses

Referee: [Method and Experiments] The central claim that reformulating trajectories as dialogue-style sequences under CTDE fully encodes joint dynamics, partial observability, and coordination (without domain-specific machinery) is load-bearing for the reported superiority and generalization. The SFT phase relies on inter-agent context in language form for centralized training, yet standard next-token prediction may capture only surface correlations rather than credit assignment or synchronized actions; the GRPO stage with lightweight rewards may not reconstruct the original multi-agent value structure. This requires explicit support such as ablations or coordination metrics in the method and experiments sections.

Authors: We agree that explicit ablations and coordination metrics would strengthen the support for our central claim. The current manuscript shows that the unified DLM outperforms baselines and generalizes zero-shot, which provides indirect evidence that the dialogue reformulation preserves key multi-agent signals under CTDE. However, to directly address the concern about next-token prediction capturing only surface correlations versus credit assignment and synchronization, we will add ablations in the revised version: (1) SFT with versus without inter-agent context in the dialogue sequences, and (2) coordination metrics such as joint action alignment rates and proxy credit-assignment scores derived from trajectory returns. For GRPO, we will expand the discussion and include analysis of how the group-relative rewards with lightweight functions maintain robustness without fully reconstructing the original value function. These will be placed in the method and experiments sections. revision: yes
Referee: [Abstract and Experiments] The abstract asserts experimental superiority over offline MARL baselines and LLM methods plus zero-shot generalization, but the support for these claims remains unverified without details on the specific benchmarks, baseline implementations, metrics, statistical significance, and how trajectories were converted to dialogue sequences. This directly affects assessment of whether the unification succeeds or if information loss undermines the results.

Authors: We acknowledge that the abstract is concise and that additional explicit details would aid verification. The full manuscript already specifies the benchmarks (SMAC, MPE, and others), baseline implementations (offline adaptations of QMIX, MAPPO, and LLM conversational baselines), metrics (win rate, episode return), statistical significance (means and standard deviations over multiple seeds), and the trajectory-to-dialogue conversion process (via structured prompt templates encoding observations, actions, and inter-agent messages under CTDE). To improve accessibility and directly address the comment, we will revise the abstract to briefly reference the benchmark suite and add a consolidated summary table in the experiments section that tabulates these elements alongside the results. This will make the support for superiority and zero-shot claims more transparent without altering the core findings. revision: yes

Circularity Check

0 steps flagged

No significant circularity in DLM derivation or training pipeline

full rationale

The paper introduces DLM by reformulating offline MARL trajectories as dialogue-style sequences under the standard CTDE paradigm, then applies conventional two-stage LLM training (SFT on generated sequences followed by GRPO with lightweight rewards). No equations or claims reduce by construction to fitted parameters or prior self-citations; the central results are empirical outperformance and zero-shot generalization measured on external benchmarks. The derivation chain relies on independent data processing and optimization steps whose outputs are not presupposed by the inputs.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The central claim depends on the effectiveness of LLMs in this new role and the empirical results from the proposed training pipeline.

free parameters (1)

lightweight reward functions
These are used in the group relative policy optimization phase to enhance robustness, but their specific design and tuning are not specified in the abstract.

axioms (1)

domain assumption Large language models can be effectively adapted via supervised fine-tuning to predict actions in multi-agent settings when data is formatted as dialogues.
This underpins the first training stage and the overall approach.

invented entities (1)

Decision Language Model (DLM) no independent evidence
purpose: To provide a unified model for offline multi-agent decision making using language modeling techniques.
The DLM is introduced as a new concept in this paper, with performance claims based on the authors' experiments.

pith-pipeline@v0.9.0 · 5466 in / 1384 out tokens · 59812 ms · 2026-05-08T05:07:00.921184+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages

[1]

You are a strategic SMAC AI assistant on the map. Work with your team to complete the task

tasks and do not support comprehensive multi-task evaluation. To address this limitation, we construct our own dataset following the data collection methodology of D4RL (Fu et al., 2020). Specifically, we adopt TGCNet (Zhang et al., 2025b) as the behavior policy for data collection, due to its ability to achieve near-perfect performance across all SMAC ta...

work page 2020
[2]

as the primary benchmarks for evaluating DLM in a multi-task setting. Following prior work such as MADT (Meng et al., 2023), each task within SMAC is conventionally treated as a distinct task, since maps differ in agent types and numbers, 15 DLM: Unified Decision Language Models for Offline Multi-Agent Sequential Decision Making available abilities, team ...

work page 2023

[1] [1]

You are a strategic SMAC AI assistant on the map. Work with your team to complete the task

tasks and do not support comprehensive multi-task evaluation. To address this limitation, we construct our own dataset following the data collection methodology of D4RL (Fu et al., 2020). Specifically, we adopt TGCNet (Zhang et al., 2025b) as the behavior policy for data collection, due to its ability to achieve near-perfect performance across all SMAC ta...

work page 2020

[2] [2]

as the primary benchmarks for evaluating DLM in a multi-task setting. Following prior work such as MADT (Meng et al., 2023), each task within SMAC is conventionally treated as a distinct task, since maps differ in agent types and numbers, 15 DLM: Unified Decision Language Models for Offline Multi-Agent Sequential Decision Making available abilities, team ...

work page 2023