Multi$^2$: Hierarchical Multi-Agent Decision-Making with LLM-Based Agents in Interactive Environments

Minhae Kwon; Sangeun Park

arxiv: 2606.03698 · v1 · pith:SLHGFWK4new · submitted 2026-06-02 · 💻 cs.LG

Multi²: Hierarchical Multi-Agent Decision-Making with LLM-Based Agents in Interactive Environments

Sangeun Park , Minhae Kwon This is my paper

Pith reviewed 2026-06-28 11:28 UTC · model grok-4.3

classification 💻 cs.LG

keywords hierarchical agentsLLM-based agentsmulti-agent systemssupervised fine-tuningreinforcement learninglong-horizon planningobjective driftinteractive environments

0 comments

The pith

Hierarchical separation of high-level SFT planning and low-level RL execution lets LLM agents maintain stable goals across long interactions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a two-layer agent system to address objective drift in long-horizon tasks. A high-level agent generates sub-goals using supervised fine-tuning on context, while a low-level agent translates those into actions via reinforcement learning. This split allows each part to specialize, leading to better coordination and adaptation in interactive environments. The approach outperforms standard LLM agent baselines on several tasks. It also provides new benchmark datasets for hierarchical decision-making.

Core claim

Multi² decomposes agent behavior into a high-level System 1 agent that generates context-aware sub-goals through supervised fine-tuning and a low-level System 2 agent that executes atomic actions through offline-to-online reinforcement learning. This explicit role separation produces stable long-horizon control and reduces objective drift without requiring additional inter-agent feedback mechanisms.

What carries the argument

The hierarchical multi-agent framework that assigns sub-goal generation to an SFT-trained high-level agent and action execution to an RL-trained low-level agent.

If this is right

Long-horizon tasks become more reliable because the high-level agent can reset or adjust sub-goals periodically.
Training remains efficient since the SFT and RL components can be developed separately.
New hierarchical benchmark datasets enable standardized evaluation of similar multi-level agent designs.
Adaptation to new environments improves because only the low-level policy needs online updates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar hierarchies might reduce drift in other sequential decision domains like robotics or dialogue systems.
Testing coordination under noisy high-level outputs would reveal how robust the separation really is.
The released benchmarks could support comparisons with single-agent methods that incorporate explicit memory.
If the low-level agent is replaced with a different controller, the high-level planning might still transfer.

Load-bearing premise

The high-level and low-level agents can coordinate effectively on the global objective even when trained mostly independently and without built-in error recovery or feedback loops between them.

What would settle it

In a controlled test, if the low-level agent frequently fails to achieve the sub-goals set by the high-level agent, or if overall performance does not exceed non-hierarchical baselines on long-horizon tasks.

Figures

Figures reproduced from arXiv: 2606.03698 by Minhae Kwon, Sangeun Park.

**Figure 1.** Figure 1: Challenges in long-horizon interaction on the ScienceWorld using Llama-3.1 8B backbone. We compare ReAct [14] (prompt-based, non-hierarchical), Glider [15] (hierarchical finetuning baseline), and our method (Multi2 ). (a) Horizon Robustness: performance as a function of horizon length, where longer horizons induce larger degradation for baselines. (b) Token Efficiency: normalized inference-time token e… view at source ↗

**Figure 2.** Figure 2: Overview of Multi2 . (a) Offline Training: System 1 (high-level agent) is trained via SFT on the high-level dataset Dsys1 to generate sub-goals, while System 2 (low-level agent) is initialized via offline RL on the low-level dataset Dsys2 using an actor-critic objective. (b) Online Training: starting from the offline policy, System 2 continues self-improvement through interaction, updating its policy while… view at source ↗

**Figure 3.** Figure 3: Representative failure examples of baseline agents on the ScienceWorld Find-a-plant task with Qwen-2.5 3B. Each column corresponds to a different method (a–f). Shaded boxes indicate action types: gray denotes valid actions, blue denotes invalid actions rejected by the environment, and yellow denotes unproductive loops that repeat actions without measurable progress [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Inference-time token efficiency on ScienceWorld ID split with Llama-3.1 8B. The x-axis denotes token usage, and the y-axis denotes performance. Bubble size indicates token efficiency, computed as performance/tokens and normalized such that ReAct is 1.0. Solid lines: fine-tuned-based; dashed lines: prompt-based. tasks. Here, ADaPT outperforms GRPO on most tasks because its hierarchical prompting provides a… view at source ↗

**Figure 6.** Figure 6 [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Ablation results on the ScienceWorld ID split using Llama-3.1 8B (a) and Qwen-2.5 3B (b). (a) Offline Loss Design: Vanilla-IQL (without policy-anchored advantage term) vs. the proposed offline objective. (b) Online Loss Design: Vanilla-AWAC (without KL regularization) vs. the proposed online objective. ing is unstable for high-level planning. Only SFT achieves competitive median performance but is less ro… view at source ↗

**Figure 8.** Figure 8: Effect of backbone model scale on Multi2 performance in ScienceWorld with Qwen-2.5 backbones at three scales (1.5B, 3B, and 7B). (a) In-Distribution Split and (b) Out-ofDistribution Split. Error bars denote one standard deviation over runs. results support the importance of separate adapters for effective role specialization of System 1 and System 2. Loss Function Designs [PITH_FULL_IMAGE:figures/full_… view at source ↗

**Figure 10.** Figure 10: Training dataset example for System 1 and System 2 on ScienceWorld benchmark. 16 [PITH_FULL_IMAGE:figures/full_fig_p016_10.png] view at source ↗

**Figure 11.** Figure 11: Training dataset example for System 1 and System 2 on ALFWorld benchmark. 17 [PITH_FULL_IMAGE:figures/full_fig_p017_11.png] view at source ↗

**Figure 12.** Figure 12: Training dataset example for System 1 and System 2 on TextCraft benchmark. Figures 10, 11, and 12 show training dataset examples for each benchmark (ScienceWorld, ALFWorld, and TextCraft). We formulate all environments as partially observable Markov decision processes and use separate prompt templates for System 1 and System 2 to match their input–output requirements. Each System 1 instance in Dsys1 pairs… view at source ↗

**Figure 13.** Figure 13: Input-output examples of Multi2 on ScienceWorld. (a) System 1 takes the task description and the current state summary (group_action and observation) and generates a sub-goal. (b) System 2 conditions on the sub-goal and the current observation to output an atomic action and a completion flag, repeating until the sub-goal is marked done. System 1 (planner). In the shown episode (task: boil water) in [PITH… view at source ↗

**Figure 14.** Figure 14: Input-output examples of Multi2 on ALFWorld. (a) System 1 takes the task description and the current state summary (group_action and observation) and generates a sub-goal. (b) System 2 conditions on the sub-goal and the current observation to output an atomic action and a completion flag, repeating until the sub-goal is marked done. System 1 (planner). As shown in [PITH_FULL_IMAGE:figures/full_fig_p024_14.png] view at source ↗

**Figure 15.** Figure 15: Input-output examples of Multi2 on TextCraft. (a) System 1 takes the task description and the current state summary (group_action and observation) and generates a sub-goal. (b) System 2 conditions on the sub-goal and the current observation to output an atomic action and a completion flag, repeating until the sub-goal is marked done. System 1 (planner). As shown in [PITH_FULL_IMAGE:figures/full_fig_p025_… view at source ↗

**Figure 16.** Figure 16: Task difficulty based on average interaction length in the ScienceWorld benchmark. For each task, we compute the average number of environment steps taken by our agent across evaluation episodes, and group tasks into Easy/Medium/Hard by quantiles of this length (short/medium/long horizon) [PITH_FULL_IMAGE:figures/full_fig_p028_16.png] view at source ↗

**Figure 17.** Figure 17: Token efficiency at inference time on ScienceWorld benchmark OOD split with the Llama-3.1 8B backbone. The x-axis denotes token usage, and the y-axis denotes performance. Bubble size is proportional to token efficiency, computed as performance/tokens, and the number inside each bubble denotes normalized token efficiency with ReAct set to 1.0. Solid outlines denote fine-tuning-based methods, while dashed o… view at source ↗

**Figure 18.** Figure 18: Model scaling trends on ALFWorld benchmark across Qwen-2.5 model scales. (a) In-Distribution Split and (b) Out-ofDistribution Split [PITH_FULL_IMAGE:figures/full_fig_p033_18.png] view at source ↗

**Figure 19.** Figure 19: Relative improvement rate from online adaptation on ScienceWorld, stratified by task difficulty for (a) In-Distribution Split and (b) Out-of-Distribution Split, measured relative to the offline-only baseline [PITH_FULL_IMAGE:figures/full_fig_p034_19.png] view at source ↗

**Figure 20.** Figure 20: Training curve of the offline-to-online RL stage on ScienceWorld. In [PITH_FULL_IMAGE:figures/full_fig_p035_20.png] view at source ↗

**Figure 21.** Figure 21: Test case result of ReAct, evaluated on the ScienceWorld Find-a-plant task. 36 [PITH_FULL_IMAGE:figures/full_fig_p036_21.png] view at source ↗

**Figure 22.** Figure 22: Test case result of Reflexion, evaluated on the ScienceWorld Find-a-plant task. 37 [PITH_FULL_IMAGE:figures/full_fig_p037_22.png] view at source ↗

**Figure 23.** Figure 23: Test case result of ADaPT, evaluated on the ScienceWorld Find-a-plant task. 38 [PITH_FULL_IMAGE:figures/full_fig_p038_23.png] view at source ↗

**Figure 24.** Figure 24: Test case result of GRPO, evaluated on the ScienceWorld Find-a-plant task. 39 [PITH_FULL_IMAGE:figures/full_fig_p039_24.png] view at source ↗

**Figure 25.** Figure 25: Test case result of Glider, evaluated on the ScienceWorld Find-a-plant task. 40 [PITH_FULL_IMAGE:figures/full_fig_p040_25.png] view at source ↗

read the original abstract

A central goal of large language model (LLM) research is to build agentic systems that can plan, act, and adapt through sustained interaction with dynamic environments. While recent LLM-based agents exhibit impressive contextual reasoning, their long-horizon decision-making remains fragile, often suffering from objective drift, where goals and plans drift over extended interactions. We introduce Multi$^2$, a hierarchical multi-agent decision-making framework that explicitly decomposes agent behavior into complementary roles. A high-level agent (System 1) focuses on context-aware sub-goal generation using supervised fine-tuning (SFT), while a low-level agent (System 2) executes atomic actions through offline-to-online reinforcement learning (RL) in interactive environments. This separation enables stable long-horizon control, mitigates objective drift, and allows efficient adaptation. Across diverse interactive environments, Multi$^2$ consistently outperforms strong agentic baselines, demonstrating improved robustness and coordination in multi-turn interaction. Beyond performance, we introduce and release three hierarchical benchmark datasets, filling a long-standing gap in training and evaluating hierarchical decision-making for LLM-based agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Multi² puts forward an SFT high-level sub-goal agent paired with an RL low-level action agent plus three new benchmarks, but the abstract supplies no data to back its performance claims.

read the letter

The main thing to know is that this paper describes a hierarchical split in LLM agents where supervised fine-tuning handles context-aware sub-goal generation and offline-to-online RL handles atomic actions, and it releases three dedicated hierarchical benchmark datasets.

This pairing and the benchmark release are the concrete new pieces. Earlier hierarchical agent work exists, but the explicit SFT-for-System-1 and RL-for-System-2 regimen inside a multi-agent LLM setup, plus the datasets aimed at long-horizon interactive settings, goes beyond simple extensions.

The paper does a reasonable job naming objective drift as a practical failure mode in sustained interactions and offering role separation as a structural response. The benchmarks address a real shortage of evaluation resources for this style of decomposition.

The soft spots sit in the empirical claims. The abstract states that Multi² consistently outperforms baselines and mitigates drift across environments, yet it gives no metrics, no baseline descriptions, no environment details, and no statistical evidence. That leaves the central results unsupported from the text available. The stress-test concern about missing coordination also lands: the description treats the two agents as trainable largely independently, but supplies no mention of error detection, replanning triggers, or feedback from low level to high level. Without those, the drift-mitigation guarantee rests on an untested assumption.

This paper is for researchers working on LLM agents in interactive or multi-turn settings who want concrete patterns or new testbeds. The benchmarks could be useful on their own.

It shows straightforward engagement with the problem even if the evidence is thin so far. I would send it to peer review so referees can examine the actual experiments and any coordination mechanisms that may be in the full manuscript.

Referee Report

2 major / 0 minor

Summary. The paper introduces Multi², a hierarchical multi-agent framework for LLM-based agents in interactive environments. A high-level System 1 agent uses supervised fine-tuning (SFT) for context-aware sub-goal generation, while a low-level System 2 agent uses offline-to-online RL for atomic action execution. The separation is claimed to enable stable long-horizon control, mitigate objective drift, and support efficient adaptation. The work reports consistent outperformance over strong agentic baselines across diverse environments and releases three new hierarchical benchmark datasets for training and evaluating such agents.

Significance. If the empirical results hold with proper controls and statistical support, the hierarchical decomposition could offer a practical route to more robust long-horizon LLM agents by separating planning from execution. The release of three benchmark datasets is a concrete contribution that addresses a documented gap in hierarchical decision-making evaluation.

major comments (2)

[Abstract (framework paragraph)] Abstract, paragraph describing the framework: the central claim that the SFT/RL separation 'enables stable long-horizon control, mitigates objective drift' is load-bearing, yet the provided description states the agents 'can be trained largely independently' without specifying inter-agent signaling, replanning triggers, error-detection mechanisms, or closed-loop feedback. If low-level execution deviates due to stochasticity or sub-goal ambiguity, no pathway for high-level correction is described; this directly affects the drift-mitigation guarantee.
[Abstract] Abstract: the assertion of 'consistent outperformance' and 'improved robustness' across environments is presented without any metrics, baseline descriptions, statistical tests, or experimental details in the available text. The soundness of the empirical claim therefore cannot be assessed from the supplied material.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate whether revisions will be made.

read point-by-point responses

Referee: [Abstract (framework paragraph)] Abstract, paragraph describing the framework: the central claim that the SFT/RL separation 'enables stable long-horizon control, mitigates objective drift' is load-bearing, yet the provided description states the agents 'can be trained largely independently' without specifying inter-agent signaling, replanning triggers, error-detection mechanisms, or closed-loop feedback. If low-level execution deviates due to stochasticity or sub-goal ambiguity, no pathway for high-level correction is described; this directly affects the drift-mitigation guarantee.

Authors: Section 3 of the manuscript details the interaction protocol, including a shared context buffer for sub-goal communication, execution-status feedback that triggers high-level replanning, and reward-based error detection for closed-loop correction. The abstract's brevity omitted these elements. We will revise the abstract to briefly reference the closed-loop feedback mechanism, thereby strengthening support for the drift-mitigation claim. revision: yes
Referee: [Abstract] Abstract: the assertion of 'consistent outperformance' and 'improved robustness' across environments is presented without any metrics, baseline descriptions, statistical tests, or experimental details in the available text. The soundness of the empirical claim therefore cannot be assessed from the supplied material.

Authors: Abstracts conventionally summarize results without full metrics or statistics; these are reported in Section 4, which includes performance tables, baseline descriptions, and statistical significance tests across environments. The claims are substantiated in the body of the paper. No revision to the abstract is required. revision: no

Circularity Check

0 steps flagged

No circularity; empirical framework without derivation chain

full rationale

The paper introduces an empirical hierarchical multi-agent framework separating high-level SFT for sub-goal generation from low-level RL for action execution. No equations, derivations, or mathematical reductions are present that would equate claimed outcomes (e.g., drift mitigation) to fitted parameters or self-defined quantities within the paper. Performance claims rest on experimental comparisons across environments rather than any self-referential derivation. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the abstract or described structure. The design is presented as an architectural choice validated empirically, with no reduction of predictions to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no mathematical formulation, so no free parameters, axioms, or invented entities can be identified. The framework description relies on standard SFT and RL concepts without additional postulates.

pith-pipeline@v0.9.1-grok · 5718 in / 1246 out tokens · 30809 ms · 2026-06-28T11:28:05.694908+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

84 extracted references · 6 canonical work pages · 3 internal anchors

[1]

En- hancing decision-making of large language models via actor-critic

Heng Dong, Kefei Duan, and Chongjie Zhang. En- hancing decision-making of large language models via actor-critic. InInternational Conference on Machine Learning (ICML), 2025

2025
[2]

CollabLLM: From passive responders to active collaborators

Shirley Wu, Michel Galley, Baolin Peng, Hao Cheng, Gavin Li, Yao Dou, Weixin Cai, James Zou, Jure Leskovec, and Jianfeng Gao. CollabLLM: From passive responders to active collaborators. InInter- national Conference on Machine Learning (ICML), 2025

2025
[3]

DAMA: Data- and model-aware align- ment of multi-modal LLMs

Jinda Lu, Junkang Wu, Jinghan Li, Xiaojun Jia, Shuo Wang, YiFan Zhang, Junfeng Fang, Xiang Wang, and Xiangnan He. DAMA: Data- and model-aware align- ment of multi-modal LLMs. InInternational Confer- ence on Machine Learning (ICML), 2025

2025
[4]

Inverse rational control with par- tially observable continuous nonlinear dynamics

Minhae Kwon, Saurabh Daptardar, Paul R Schrater, and Xaq Pitkow. Inverse rational control with par- tially observable continuous nonlinear dynamics. In Advances in Neural Information Processing Systems (NeurIPS), 2020

2020
[5]

QuBE: Question-based belief enhancement for agentic LLM reasoning

Minsoo Kim, Jongyoon Kim, Jihyuk Kim, and Seung Hwang. QuBE: Question-based belief enhancement for agentic LLM reasoning. InEmpirical Methods in Natural Language Processing (EMNLP), 2024

2024
[6]

Agentic reasoning: A streamlined frame- work for enhancing LLM reasoning with agentic tools

Junde Wu, Jiayuan Zhu, Yuyuan Liu, Min Xu, and Yueming Jin. Agentic reasoning: A streamlined frame- work for enhancing LLM reasoning with agentic tools. InAssociation for Computational Linguistics (ACL), 2025

2025
[7]

T1: Advancing language model reasoning through reinforcement learning and inference scaling

Zhenyu Hou, Xin Lv, Rui Lu, Jiajie Zhang, Yujiang Li, Zijun Yao, Juanzi Li, Jie Tang, and Yuxiao Dong. T1: Advancing language model reasoning through reinforcement learning and inference scaling. InIn- ternational Conference on Machine Learning (ICML), 2025

2025
[8]

Episodic future think- ing mechanism for multi-agent reinforcement learning

Dongsu Lee and Minhae Kwon. Episodic future think- ing mechanism for multi-agent reinforcement learning. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

2024
[9]

ReCAP: Recursive context-aware reasoning and planning for large language model agents

Zhenyu Zhang, Tianyi Chen, Weiran Xu, Alex Pent- land, and Jiaxin Pei. ReCAP: Recursive context-aware reasoning and planning for large language model agents. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

2025
[10]

Evaluating llm-based agents for multi-turn conversations: A survey, 2025

Shengyue Guan, Jindong Wang, Jiang Bian, Bin Zhu, Jian Lou, and Haoyi Xiong. Evaluating LLM-based agents for multi-turn conversations: A survey.arXiv preprint arXiv:2503.22458, 2025

work page arXiv 2025
[11]

Path drift in large reasoning models: How first-person commitments override safety

Yuyi Huang, Runzhe Zhan, Lidia Chao, Ailin Tao, and Derek Wong. Path drift in large reasoning models: How first-person commitments override safety. In Empirical Methods in Natural Language Processing (EMNLP), 2025

2025
[12]

Drift no more? Context equilibria in multi-turn LLM interac- tions

Vardhan Dongre, Ryan Rossi, Viet Lai, Seunghyun Yoon, Dilek Hakkani-Tür, and Trung Bui. Drift no more? Context equilibria in multi-turn LLM interac- tions. InAAAI Personalization in the Era of Large Foundation Models Workshop, 2025

2025
[13]

Do as I can, not as I say: Grounding language in robotic affordances

Brian Ichter, Anthony Brohan, Yevgen Chebotar, Chelsea Finn, Karol Hausman, Alexander Herzog, Daniel Ho, Julian Ibarz, Alex Irpan, Eric Jang, Ryan Julian, Dmitry Kalashnikov, Sergey Levine, Yao Lu, Carolina Parada, Kanishka Rao, Pierre Sermanet, Alexander Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Mengyuan Yan, Noah Brown, Michael Ahn, Omar Co...

2023
[14]

ReAct: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representa- tions (ICLR), 2023

2023
[15]

Divide and conquer: Grounding LLMs as efficient decision-making agents via offline hierarchical reinforcement learning

Zican Hu, Wei Liu, Xiaoye Qu, Xiangyu Yue, Chunlin Chen, Zhi Wang, and Yu Cheng. Divide and conquer: Grounding LLMs as efficient decision-making agents via offline hierarchical reinforcement learning. InIn- ternational Conference on Machine Learning (ICML), 2025. 10 Multi2: Hierarchical Multi-Agent Decision-Making with LLM-Based Agents in Interactive Environments

2025
[16]

ADaPT: As-needed decomposition and planning with language models

Archiki Prasad, Alexander Koller, Mareike Hartmann, Peter Clark, Ashish Sabharwal, Mohit Bansal, and Tushar Khot. ADaPT: As-needed decomposition and planning with language models. InFindings of the Association for Computational Linguistics (NAACL), 2024

2024
[17]

Plan-and-Act: Improving planning of agents for long-horizon tasks

Lutfi Erdogan, Hiroki Furuta, Sehoon Kim, Nicholas Lee, Suhong Moon, Gopala Anumanchipalli, Kurt Keutzer, and Amir Gholami. Plan-and-Act: Improving planning of agents for long-horizon tasks. InInter- national Conference on Machine Learning (ICML), 2025

2025
[18]

The illusion of diminishing returns: Measuring long horizon execution in LLMs

Akshit Sinha, Arvindh Arun, Shashwat Goel, Steffen Staab, and Jonas Geiping. The illusion of diminishing returns: Measuring long horizon execution in LLMs. InInternational Conference on Learning Representa- tions (ICLR), 2026

2026
[19]

Agent-oriented planning in multi-agent systems

Ao Li, Yuexiang Xie, Songze Li, Fugee Tsung, Bolin Ding, and Yaliang Li. Agent-oriented planning in multi-agent systems. InInternational Conference on Learning Representations (ICLR), 2025

2025
[20]

Multi-agent collaboration via evolv- ing orchestration

Yufan Dang, Chen Qian, Xueheng Luo, Jingru Fan, Zi- hao Xie, Ruijie Shi, Weize Chen, Cheng Yang, Xiaoyin Che, Ye Tian, Xuantang Xiong, Lei Han, Zhiyuan Liu, and Maosong Sun. Multi-agent collaboration via evolv- ing orchestration. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

2025
[21]

Agentic AI: The age of reason- ing—A review.Journal of Automation and Intelli- gence, 2025

Ume Nisa, Muhammad Shirazi, Mohamed Saip, and Muhammad Pozi. Agentic AI: The age of reason- ing—A review.Journal of Automation and Intelli- gence, 2025

2025
[22]

ScienceWorld: Is your agent smarter than a 5th grader? InEmpirical Methods in Natural Language Processing (EMNLP), 2022

Ruoyao Wang, Peter Jansen, Marc Côté, and Prithviraj Ammanabrolu. ScienceWorld: Is your agent smarter than a 5th grader? InEmpirical Methods in Natural Language Processing (EMNLP), 2022

2022
[23]

OASIS: Open-world adaptive self-supervised and imbalanced- aware system

Miru Kim, Mugon Joe, and Minhae Kwon. OASIS: Open-world adaptive self-supervised and imbalanced- aware system. InACM International Conference on Information and Knowledge Management (CIKM), 2025

2025
[24]

Improving network attack classification on imbalanced real-world intrusion incident datasets

Miru Kim, Mugon Joe, and Minhae Kwon. Improving network attack classification on imbalanced real-world intrusion incident datasets. InInternational Confer- ence on Mobile Systems, Applications and Services (MobiSys), 2025

2025
[25]

Con- trastive learning based network attack classifier for imbalanced data.Journal of Communications and Networks, 28(1):86–97, Feb

Mugon Joe, Miru Kim, and Minhae Kwon. Con- trastive learning based network attack classifier for imbalanced data.Journal of Communications and Networks, 28(1):86–97, Feb. 2026

2026
[26]

Per- sonalized split federated learning with early exit: Pre- training and online learning against label shifts.IEEE Internet of Things Journal, 12(22):47069–47082, Nov

Miru Kim, Heewon Park, and Minhae Kwon. Per- sonalized split federated learning with early exit: Pre- training and online learning against label shifts.IEEE Internet of Things Journal, 12(22):47069–47082, Nov. 2025

2025
[27]

Fed-ADE: Adaptive learning rate for federated post-adaptation under distribution shift

Heewon Park, Mugon Joe, Miru Kim, Kyungjin Im, and Minhae Kwon. Fed-ADE: Adaptive learning rate for federated post-adaptation under distribution shift. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

2026
[28]

Evolving intelligent network attack classifier under label distri- bution shift.IEEE Transactions on Network Science and Engineering, 13(1):7448–7464, Mar

Miru Kim, Mugon Joe, and Minhae Kwon. Evolving intelligent network attack classifier under label distri- bution shift.IEEE Transactions on Network Science and Engineering, 13(1):7448–7464, Mar. 2026

2026
[29]

Personal- ized federated sensing for heterogeneous environment

Heewon Park, Miru Kim, and Minhae Kwon. Personal- ized federated sensing for heterogeneous environment. IEEE Sensors Letters, 9(4):1–4, 2025

2025
[30]

Editable scene simulation for autonomous driving via collaborative LLM-agents

Yuxi Wei, Zi Wang, Yifan Lu, Chenxin Xu, Changx- ing Liu, Hao Zhao, Siheng Chen, and Yanfeng Wang. Editable scene simulation for autonomous driving via collaborative LLM-agents. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

2024
[31]

DriVLMe: Enhancing LLM- based autonomous driving agents with embodied and social experiences

Yidong Huang, Jacob Sansom, Ziqiao Ma, Felix Gervits, and Joyce Chai. DriVLMe: Enhancing LLM- based autonomous driving agents with embodied and social experiences. InIEEE/RSJ International Confer- ence on Intelligent Robots and Systems (IROS), 2024

2024
[32]

SToRM: Supervised token reduction for multi-modal LLMs toward efficient end- to-end autonomous driving

Seo Hyun Kim, Jin Bok Park, Do Yeon Koo, Hogun Park, and Il Yong Chun. SToRM: Supervised token reduction for multi-modal LLMs toward efficient end- to-end autonomous driving. InIEEE International Conference on Robotics and Automation (ICRA), 2026

2026
[33]

ASAP: Unsupervised post-training with label distribution shift adaptive learning rate

Heewon Park, Mugon Joe, Miru Kim, and Minhae Kwon. ASAP: Unsupervised post-training with label distribution shift adaptive learning rate. InACM Inter- national Conference on Information and Knowledge Management (CIKM), 2025

2025
[34]

Reflexion: Lan- guage agents with verbal reinforcement learning

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Lan- guage agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), 2023

2023
[35]

Make your LLM fully utilize the context

Shengnan An, Zexiong Ma, Zeqi Lin, Nanning Zheng, Jian Lou, and Weizhu Chen. Make your LLM fully utilize the context. InAdvances in Neural Information Processing Systems (NeurIPS), 2024. 11 Multi2: Hierarchical Multi-Agent Decision-Making with LLM-Based Agents in Interactive Environments

2024
[36]

Toward self- improvement of LLMs via imagination, searching, and criticizing

Ye Tian, Baolin Peng, Linfeng Song, Lifeng Jin, Dian Yu, Haitao Mi, and Dong Yu. Toward self- improvement of LLMs via imagination, searching, and criticizing. InAdvances in Neural Information Pro- cessing Systems (NeurIPS), 2024

2024
[37]

The lighthouse of language: Enhancing LLM agents via critique-guided improvement

Ruihan Yang, Fanghua Ye, Jian Li, Siyu Yuan, Yikai Zhang, Zhaopeng Tu, Xiaolong Li, and Deqing Yang. The lighthouse of language: Enhancing LLM agents via critique-guided improvement. InAdvances in Neu- ral Information Processing Systems (NeurIPS), 2025

2025
[38]

The alignment problem from a deep learning perspec- tive

Richard Ngo, Lawrence Chan, and Sören Mindermann. The alignment problem from a deep learning perspec- tive. InInternational Conference on Learning Repre- sentations (ICLR), 2024

2024
[39]

ArCHer: Training language model agents via hierarchical multi-turn RL

Yifei Zhou, Andrea Zanette, Jiayi Pan, Sergey Levine, and Aviral Kumar. ArCHer: Training language model agents via hierarchical multi-turn RL. InInternational Conference on Machine Learning (ICML), 2024

2024
[40]

Robust hierar- chical anomaly detection using feature impact in iot networks.ICT Express, 11(2):358–363, Apr

Joohong Rheey and Hyunggon Park. Robust hierar- chical anomaly detection using feature impact in iot networks.ICT Express, 11(2):358–363, Apr. 2025

2025
[41]

Option discovery us- ing LLM-guided semantic hierarchical reinforcement learning.arXiv preprint arXiv:2503.19007, 2025

Chak Shek and Pratap Tokekar. Option discovery us- ing LLM-guided semantic hierarchical reinforcement learning.arXiv preprint arXiv:2503.19007, 2025

work page arXiv 2025
[42]

Leveraging imitation learning and LLMs for efficient hierarchical reinforcement learning

Runhan Yang, Jieao Shi, Mengqi Su, and Don- gruo Zhou. Leveraging imitation learning and LLMs for efficient hierarchical reinforcement learning. https://openreview.net/forum?id=6y00rooi7i, 2025

2025
[43]

Getting more juice out of the SFT data: Reward learning from human demonstration improves SFT for LLM alignment

Jiaxiang Li, Siliang Zeng, Hoi Wai, Chenliang Li, Alfredo Garcia, and Mingyi Hong. Getting more juice out of the SFT data: Reward learning from human demonstration improves SFT for LLM alignment. In Advances in Neural Information Processing Systems (NeurIPS), 2024

2024
[44]

Large lan- guage models as generalizable policies for embodied tasks

Andrew Szot, Max Schwarzer, Harsh Agrawal, Bog- dan Mazoure, Rin Metcalf, Walter Talbott, Natalie Mackraz, R Hjelm, and Alexander Toshev. Large lan- guage models as generalizable policies for embodied tasks. InInternational Conference on Learning Repre- sentations (ICLR), 2024

2024
[45]

Unlocking LLMs’ self-improvement capacity with autonomous learning for domain adaptation

Ke Ji, Junying Chen, Anningzhe Gao, Wenya Xie, Xiang Wan, and Benyou Wang. Unlocking LLMs’ self-improvement capacity with autonomous learning for domain adaptation. InFindings of the Association for Computational Linguistics (ACL), 2025

2025
[46]

Data mix- ing optimization for supervised fine-tuning of large language models

Yuan Li, Zhengzhong Liu, and Eric Xing. Data mix- ing optimization for supervised fine-tuning of large language models. InInternational Conference on Ma- chine Learning (ICML), 2025

2025
[47]

Wei Lu, Rachel Luu, and Markus Buehler. Fine-tuning large language models for domain adaptation: Explo- ration of training strategies, scaling, model merging and synergistic capabilities.npj Computational Mate- rials, 11(1):84, 2025

2025
[48]

Coevolving with the other you: Fine-tuning LLM with sequential coopera- tive multi-agent reinforcement learning

Hao Ma, Tianyi Hu, Zhiqiang Pu, Boyin Liu, Xiaolin Ai, Yanyan Liang, and Min Chen. Coevolving with the other you: Fine-tuning LLM with sequential coopera- tive multi-agent reinforcement learning. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

2024
[49]

Instant inverse mod- eling of stochastic driving behavior with deep rein- forcement learning.IEEE Transactions on Consumer Electronics, 71(1):2152–2162, Feb

Dongsu Lee and Minhae Kwon. Instant inverse mod- eling of stochastic driving behavior with deep rein- forcement learning.IEEE Transactions on Consumer Electronics, 71(1):2152–2162, Feb. 2025

2025
[50]

Control- ling large language model with latent action

Chengxing Jia, Ziniu Li, Pengyuan Wang, Yi Li, Zhenyu Hou, Yuxiao Dong, and Yang Yu. Control- ling large language model with latent action. InIn- ternational Conference on Machine Learning (ICML), 2025

2025
[51]

Stability analysis in mixed-autonomous traffic with deep reinforcement learning.IEEE Transactions on Vehicular Technology, 72(3):2848–2862, Mar

Dongsu Lee and Minhae Kwon. Stability analysis in mixed-autonomous traffic with deep reinforcement learning.IEEE Transactions on Vehicular Technology, 72(3):2848–2862, Mar. 2023

2023
[52]

QLASS: Boost- ing language agent inference via Q-guided stepwise search

Zongyu Lin, Yao Tang, Xingcheng Yao, Da Yin, Ziniu Hu, Yizhou Sun, and Kai Chang. QLASS: Boost- ing language agent inference via Q-guided stepwise search. InInternational Conference on Machine Learn- ing (ICML), 2025

2025
[53]

Temporal distance- aware transition augmentation for offline model-based reinforcement learning

Dongsu Lee and Minhae Kwon. Temporal distance- aware transition augmentation for offline model-based reinforcement learning. InInternational Conference on Machine Learning (ICML), 2025

2025
[54]

Online reinforcement learning in stochastic games

Chen Wei, Yi Hong, and Chi Lu. Online reinforcement learning in stochastic games. InAdvances in Neural Information Processing Systems (NeurIPS), 2017

2017
[55]

Efficient online reinforcement learning with offline data

Philip Ball, Laura Smith, Ilya Kostrikov, and Sergey Levine. Efficient online reinforcement learning with offline data. InInternational Conference on Machine Learning (ICML), 2023

2023
[56]

Leveraging offline data in online reinforcement learning

Andrew Wagenmaker and Aldo Pacchiano. Leveraging offline data in online reinforcement learning. InIn- ternational Conference on Machine Learning (ICML), 2023

2023
[57]

Continuous control with deep reinforcement learning

Timothy Lillicrap, Jonathan Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. InInternational Conference on Learning Representations (ICLR), 2016. 12 Multi2: Hierarchical Multi-Agent Decision-Making with LLM-Based Agents in Interactive Environments

2016
[58]

Foresighted decisions for inter-vehicle interactions: An offline reinforcement learning approach

Dongsu Lee and Minhae Kwon. Foresighted decisions for inter-vehicle interactions: An offline reinforcement learning approach. InIEEE International Conference on Intelligent Transportation Systems (ITSC), 2023

2023
[59]

Ad- dressing function approximation error in actor-critic methods

Scott Fujimoto, Herke Hoof, and David Meger. Ad- dressing function approximation error in actor-critic methods. InInternational Conference on Machine Learning (ICML), 2018

2018
[60]

Selec- tive imitation for efficient online reinforcement learn- ing with pre-collected data.ICT Express, 10(6):1308– 1314, Dec

Chanin Eom, Dongsu Lee, and Minhae Kwon. Selec- tive imitation for efficient online reinforcement learn- ing with pre-collected data.ICT Express, 10(6):1308– 1314, Dec. 2024

2024
[61]

Of- fline reinforcement learning with implicit Q-learning

Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Of- fline reinforcement learning with implicit Q-learning. InInternational Conference on Learning Representa- tions (ICLR), 2022

2022
[62]

Price of the au- tonomous strategy with reinforcement learning in mixed-autonomy traffic networks.IEEE Transactions on Intelligent Transportation Systems, 27(2):2741– 2752, Feb

Chanin Eom and Minhae Kwon. Price of the au- tonomous strategy with reinforcement learning in mixed-autonomy traffic networks.IEEE Transactions on Intelligent Transportation Systems, 27(2):2741– 2752, Feb. 2026

2026
[63]

The impact of dataset on offline reinforcement learning performance in uav-based emergency network recov- ery tasks.IEEE Communications Letters, 28(5):1058– 1061, May

Jeyeon Eo, Dongsu Lee, and Minhae Kwon. The impact of dataset on offline reinforcement learning performance in uav-based emergency network recov- ery tasks.IEEE Communications Letters, 28(5):1058– 1061, May. 2024

2024
[64]

Curriculum reinforcement learning for cohesive team in mobile ad hoc networks.IEEE Communications Letters, 26(8):1809–1813, Aug

Nayoung Kim, Minhae Kwon, and Hyunggon Park. Curriculum reinforcement learning for cohesive team in mobile ad hoc networks.IEEE Communications Letters, 26(8):1809–1813, Aug. 2022

2022
[65]

AD4RL: Autonomous driving benchmarks for offline reinforcement learning with value-based dataset

Dongsu Lee, Chanin Eom, and Minhae Kwon. AD4RL: Autonomous driving benchmarks for offline reinforcement learning with value-based dataset. In IEEE International Conference on Robotics and Au- tomation (ICRA), 2024

2024
[66]

Episodic future thinking with offline reinforcement learning for au- tonomous driving.IEEE Internet of Things Journal, 12(11):17012–17023, Jun

Dongsu Lee and Minhae Kwon. Episodic future thinking with offline reinforcement learning for au- tonomous driving.IEEE Internet of Things Journal, 12(11):17012–17023, Jun. 2025

2025
[67]

A unified principle of pessimism for offline reinforce- ment learning under model mismatch

Yue Wang, Zhongchang Sun, and Shaofeng Zou. A unified principle of pessimism for offline reinforce- ment learning under model mismatch. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

2024
[68]

Is value learning really the main bottleneck in offline RL? InAdvances in Neural Information Processing Systems (NeurIPS), 2024

Seohong Park, Kevin Frans, Sergey Levine, and Aviral Kumar. Is value learning really the main bottleneck in offline RL? InAdvances in Neural Information Processing Systems (NeurIPS), 2024

2024
[69]

Beyond online sampling: Bridging offline- to-online alignment via dynamic data transformation for LLMs

Zhang Zhang, Guhao Feng, Jian Guan, Di He, and Wei Wu. Beyond online sampling: Bridging offline- to-online alignment via dynamic data transformation for LLMs. InEmpirical Methods in Natural Language Processing (EMNLP), 2025

2025
[70]

DigiRL: Training in-the-wild device-control agents with au- tonomous reinforcement learning

Hao Bai, Yifei Zhou, Mert Cemri, Jiayi Pan, Alane Suhr, Sergey Levine, and Aviral Kumar. DigiRL: Training in-the-wild device-control agents with au- tonomous reinforcement learning. InAdvances in Neu- ral Information Processing Systems (NeurIPS), 2024

2024
[71]

Unpacking DPO and PPO: Disentangling best practices for learning from preference feedback

Hamish Ivison, Yizhong Wang, Jiacheng Liu, Zeqiu Wu, Valentina Pyatkin, Nathan Lambert, Noah Smith, Yejin Choi, and Hannaneh Hajishirzi. Unpacking DPO and PPO: Disentangling best practices for learning from preference feedback. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

2024
[72]

Bridging offline and online reinforcement learning for llms.arXiv preprint arXiv:2506.21495, 2025

Jack Lanchantin, Angelica Chen, Janice Lan, Xian Li, Swarnadeep Saha, Tianlu Wang, Jing Xu, Ping Yu, Weizhe Yuan, Jason Weston, Sainbayar Sukhbaatar, and Ilia Kulikov. Bridging offline and online re- inforcement learning for LLMs.arXiv preprint arXiv:2506.21495, 2025

work page arXiv 2025
[73]

Scenario-free au- tonomous driving with multi-task offline-to-online re- inforcement learning.IEEE Transactions on Intelli- gent Transportation Systems, 26(9):13317–13330, Sep

Dongsu Lee and Minhae Kwon. Scenario-free au- tonomous driving with multi-task offline-to-online re- inforcement learning.IEEE Transactions on Intelli- gent Transportation Systems, 26(9):13317–13330, Sep. 2025

2025
[74]

Test-time fine-tuning of image compression models for multi- task adaptability

Unki Park, Seongmoon Jeong, Youngchan Jang, Gyeong-Moon Park, and Jong Hwan Ko. Test-time fine-tuning of image compression models for multi- task adaptability. InIEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), 2025

2025
[75]

KL-regularized reinforce- ment learning is designed to mode collapse

Anthony Chen, Jatin Prakash, Jeff Guo, Rob Fergus, and Rajesh Ranganath. KL-regularized reinforce- ment learning is designed to mode collapse. InIn- ternational Conference on Learning Representations (ICLR), 2026

2026
[76]

KL-regularised Q-learning: A token- level action-value perspective on online RLHF

Jason Brown, Lennie Wells, Edward Young, and Ser- gio Bacallado. KL-regularised Q-learning: A token- level action-value perspective on online RLHF. In ICML Workshop on Models of Human Feedback for AI Alignment, 2025

2025
[77]

The choice of diver- gence: A neglected key to mitigating diversity collapse in reinforcement learning with verifiable reward

Long Li, Jiaran Hao, Jason Liu, Zhijian Zhou, Yanting Miao, Wei Pang, Xiaoyu Tan, Wei Chu, Zhe Wang, Shirui Pan, Chao Qu, and Yuan Qi. The choice of diver- gence: A neglected key to mitigating diversity collapse in reinforcement learning with verifiable reward. In International Conference on Learning Representations (ICLR), 2026. 13 Multi2: Hierarchical M...

2026
[78]

ALF- World: Aligning text and embodied environments for interactive learning

Mohit Shridhar, Xingdi Yuan, Marc Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. ALF- World: Aligning text and embodied environments for interactive learning. InInternational Conference on Learning Representations (ICLR), 2021

2021
[79]

Qwen2.5 Technical Report

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Ke- qin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tianyi...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[80]

Mistral 7B

Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Chaplot, Diego Casas, Flo- rian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Lavaud, Marie Lachaux, Pierre Stock, Teven Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William Sayed. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

Showing first 80 references.

[1] [1]

En- hancing decision-making of large language models via actor-critic

Heng Dong, Kefei Duan, and Chongjie Zhang. En- hancing decision-making of large language models via actor-critic. InInternational Conference on Machine Learning (ICML), 2025

2025

[2] [2]

CollabLLM: From passive responders to active collaborators

Shirley Wu, Michel Galley, Baolin Peng, Hao Cheng, Gavin Li, Yao Dou, Weixin Cai, James Zou, Jure Leskovec, and Jianfeng Gao. CollabLLM: From passive responders to active collaborators. InInter- national Conference on Machine Learning (ICML), 2025

2025

[3] [3]

DAMA: Data- and model-aware align- ment of multi-modal LLMs

Jinda Lu, Junkang Wu, Jinghan Li, Xiaojun Jia, Shuo Wang, YiFan Zhang, Junfeng Fang, Xiang Wang, and Xiangnan He. DAMA: Data- and model-aware align- ment of multi-modal LLMs. InInternational Confer- ence on Machine Learning (ICML), 2025

2025

[4] [4]

Inverse rational control with par- tially observable continuous nonlinear dynamics

Minhae Kwon, Saurabh Daptardar, Paul R Schrater, and Xaq Pitkow. Inverse rational control with par- tially observable continuous nonlinear dynamics. In Advances in Neural Information Processing Systems (NeurIPS), 2020

2020

[5] [5]

QuBE: Question-based belief enhancement for agentic LLM reasoning

Minsoo Kim, Jongyoon Kim, Jihyuk Kim, and Seung Hwang. QuBE: Question-based belief enhancement for agentic LLM reasoning. InEmpirical Methods in Natural Language Processing (EMNLP), 2024

2024

[6] [6]

Agentic reasoning: A streamlined frame- work for enhancing LLM reasoning with agentic tools

Junde Wu, Jiayuan Zhu, Yuyuan Liu, Min Xu, and Yueming Jin. Agentic reasoning: A streamlined frame- work for enhancing LLM reasoning with agentic tools. InAssociation for Computational Linguistics (ACL), 2025

2025

[7] [7]

T1: Advancing language model reasoning through reinforcement learning and inference scaling

Zhenyu Hou, Xin Lv, Rui Lu, Jiajie Zhang, Yujiang Li, Zijun Yao, Juanzi Li, Jie Tang, and Yuxiao Dong. T1: Advancing language model reasoning through reinforcement learning and inference scaling. InIn- ternational Conference on Machine Learning (ICML), 2025

2025

[8] [8]

Episodic future think- ing mechanism for multi-agent reinforcement learning

Dongsu Lee and Minhae Kwon. Episodic future think- ing mechanism for multi-agent reinforcement learning. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

2024

[9] [9]

ReCAP: Recursive context-aware reasoning and planning for large language model agents

Zhenyu Zhang, Tianyi Chen, Weiran Xu, Alex Pent- land, and Jiaxin Pei. ReCAP: Recursive context-aware reasoning and planning for large language model agents. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

2025

[10] [10]

Evaluating llm-based agents for multi-turn conversations: A survey, 2025

Shengyue Guan, Jindong Wang, Jiang Bian, Bin Zhu, Jian Lou, and Haoyi Xiong. Evaluating LLM-based agents for multi-turn conversations: A survey.arXiv preprint arXiv:2503.22458, 2025

work page arXiv 2025

[11] [11]

Path drift in large reasoning models: How first-person commitments override safety

Yuyi Huang, Runzhe Zhan, Lidia Chao, Ailin Tao, and Derek Wong. Path drift in large reasoning models: How first-person commitments override safety. In Empirical Methods in Natural Language Processing (EMNLP), 2025

2025

[12] [12]

Drift no more? Context equilibria in multi-turn LLM interac- tions

Vardhan Dongre, Ryan Rossi, Viet Lai, Seunghyun Yoon, Dilek Hakkani-Tür, and Trung Bui. Drift no more? Context equilibria in multi-turn LLM interac- tions. InAAAI Personalization in the Era of Large Foundation Models Workshop, 2025

2025

[13] [13]

Do as I can, not as I say: Grounding language in robotic affordances

Brian Ichter, Anthony Brohan, Yevgen Chebotar, Chelsea Finn, Karol Hausman, Alexander Herzog, Daniel Ho, Julian Ibarz, Alex Irpan, Eric Jang, Ryan Julian, Dmitry Kalashnikov, Sergey Levine, Yao Lu, Carolina Parada, Kanishka Rao, Pierre Sermanet, Alexander Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Mengyuan Yan, Noah Brown, Michael Ahn, Omar Co...

2023

[14] [14]

ReAct: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representa- tions (ICLR), 2023

2023

[15] [15]

Divide and conquer: Grounding LLMs as efficient decision-making agents via offline hierarchical reinforcement learning

Zican Hu, Wei Liu, Xiaoye Qu, Xiangyu Yue, Chunlin Chen, Zhi Wang, and Yu Cheng. Divide and conquer: Grounding LLMs as efficient decision-making agents via offline hierarchical reinforcement learning. InIn- ternational Conference on Machine Learning (ICML), 2025. 10 Multi2: Hierarchical Multi-Agent Decision-Making with LLM-Based Agents in Interactive Environments

2025

[16] [16]

ADaPT: As-needed decomposition and planning with language models

Archiki Prasad, Alexander Koller, Mareike Hartmann, Peter Clark, Ashish Sabharwal, Mohit Bansal, and Tushar Khot. ADaPT: As-needed decomposition and planning with language models. InFindings of the Association for Computational Linguistics (NAACL), 2024

2024

[17] [17]

Plan-and-Act: Improving planning of agents for long-horizon tasks

Lutfi Erdogan, Hiroki Furuta, Sehoon Kim, Nicholas Lee, Suhong Moon, Gopala Anumanchipalli, Kurt Keutzer, and Amir Gholami. Plan-and-Act: Improving planning of agents for long-horizon tasks. InInter- national Conference on Machine Learning (ICML), 2025

2025

[18] [18]

The illusion of diminishing returns: Measuring long horizon execution in LLMs

Akshit Sinha, Arvindh Arun, Shashwat Goel, Steffen Staab, and Jonas Geiping. The illusion of diminishing returns: Measuring long horizon execution in LLMs. InInternational Conference on Learning Representa- tions (ICLR), 2026

2026

[19] [19]

Agent-oriented planning in multi-agent systems

Ao Li, Yuexiang Xie, Songze Li, Fugee Tsung, Bolin Ding, and Yaliang Li. Agent-oriented planning in multi-agent systems. InInternational Conference on Learning Representations (ICLR), 2025

2025

[20] [20]

Multi-agent collaboration via evolv- ing orchestration

Yufan Dang, Chen Qian, Xueheng Luo, Jingru Fan, Zi- hao Xie, Ruijie Shi, Weize Chen, Cheng Yang, Xiaoyin Che, Ye Tian, Xuantang Xiong, Lei Han, Zhiyuan Liu, and Maosong Sun. Multi-agent collaboration via evolv- ing orchestration. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

2025

[21] [21]

Agentic AI: The age of reason- ing—A review.Journal of Automation and Intelli- gence, 2025

Ume Nisa, Muhammad Shirazi, Mohamed Saip, and Muhammad Pozi. Agentic AI: The age of reason- ing—A review.Journal of Automation and Intelli- gence, 2025

2025

[22] [22]

ScienceWorld: Is your agent smarter than a 5th grader? InEmpirical Methods in Natural Language Processing (EMNLP), 2022

Ruoyao Wang, Peter Jansen, Marc Côté, and Prithviraj Ammanabrolu. ScienceWorld: Is your agent smarter than a 5th grader? InEmpirical Methods in Natural Language Processing (EMNLP), 2022

2022

[23] [23]

OASIS: Open-world adaptive self-supervised and imbalanced- aware system

Miru Kim, Mugon Joe, and Minhae Kwon. OASIS: Open-world adaptive self-supervised and imbalanced- aware system. InACM International Conference on Information and Knowledge Management (CIKM), 2025

2025

[24] [24]

Improving network attack classification on imbalanced real-world intrusion incident datasets

Miru Kim, Mugon Joe, and Minhae Kwon. Improving network attack classification on imbalanced real-world intrusion incident datasets. InInternational Confer- ence on Mobile Systems, Applications and Services (MobiSys), 2025

2025

[25] [25]

Con- trastive learning based network attack classifier for imbalanced data.Journal of Communications and Networks, 28(1):86–97, Feb

Mugon Joe, Miru Kim, and Minhae Kwon. Con- trastive learning based network attack classifier for imbalanced data.Journal of Communications and Networks, 28(1):86–97, Feb. 2026

2026

[26] [26]

Per- sonalized split federated learning with early exit: Pre- training and online learning against label shifts.IEEE Internet of Things Journal, 12(22):47069–47082, Nov

Miru Kim, Heewon Park, and Minhae Kwon. Per- sonalized split federated learning with early exit: Pre- training and online learning against label shifts.IEEE Internet of Things Journal, 12(22):47069–47082, Nov. 2025

2025

[27] [27]

Fed-ADE: Adaptive learning rate for federated post-adaptation under distribution shift

Heewon Park, Mugon Joe, Miru Kim, Kyungjin Im, and Minhae Kwon. Fed-ADE: Adaptive learning rate for federated post-adaptation under distribution shift. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

2026

[28] [28]

Evolving intelligent network attack classifier under label distri- bution shift.IEEE Transactions on Network Science and Engineering, 13(1):7448–7464, Mar

Miru Kim, Mugon Joe, and Minhae Kwon. Evolving intelligent network attack classifier under label distri- bution shift.IEEE Transactions on Network Science and Engineering, 13(1):7448–7464, Mar. 2026

2026

[29] [29]

Personal- ized federated sensing for heterogeneous environment

Heewon Park, Miru Kim, and Minhae Kwon. Personal- ized federated sensing for heterogeneous environment. IEEE Sensors Letters, 9(4):1–4, 2025

2025

[30] [30]

Editable scene simulation for autonomous driving via collaborative LLM-agents

Yuxi Wei, Zi Wang, Yifan Lu, Chenxin Xu, Changx- ing Liu, Hao Zhao, Siheng Chen, and Yanfeng Wang. Editable scene simulation for autonomous driving via collaborative LLM-agents. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

2024

[31] [31]

DriVLMe: Enhancing LLM- based autonomous driving agents with embodied and social experiences

Yidong Huang, Jacob Sansom, Ziqiao Ma, Felix Gervits, and Joyce Chai. DriVLMe: Enhancing LLM- based autonomous driving agents with embodied and social experiences. InIEEE/RSJ International Confer- ence on Intelligent Robots and Systems (IROS), 2024

2024

[32] [32]

SToRM: Supervised token reduction for multi-modal LLMs toward efficient end- to-end autonomous driving

Seo Hyun Kim, Jin Bok Park, Do Yeon Koo, Hogun Park, and Il Yong Chun. SToRM: Supervised token reduction for multi-modal LLMs toward efficient end- to-end autonomous driving. InIEEE International Conference on Robotics and Automation (ICRA), 2026

2026

[33] [33]

ASAP: Unsupervised post-training with label distribution shift adaptive learning rate

Heewon Park, Mugon Joe, Miru Kim, and Minhae Kwon. ASAP: Unsupervised post-training with label distribution shift adaptive learning rate. InACM Inter- national Conference on Information and Knowledge Management (CIKM), 2025

2025

[34] [34]

Reflexion: Lan- guage agents with verbal reinforcement learning

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Lan- guage agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), 2023

2023

[35] [35]

Make your LLM fully utilize the context

Shengnan An, Zexiong Ma, Zeqi Lin, Nanning Zheng, Jian Lou, and Weizhu Chen. Make your LLM fully utilize the context. InAdvances in Neural Information Processing Systems (NeurIPS), 2024. 11 Multi2: Hierarchical Multi-Agent Decision-Making with LLM-Based Agents in Interactive Environments

2024

[36] [36]

Toward self- improvement of LLMs via imagination, searching, and criticizing

Ye Tian, Baolin Peng, Linfeng Song, Lifeng Jin, Dian Yu, Haitao Mi, and Dong Yu. Toward self- improvement of LLMs via imagination, searching, and criticizing. InAdvances in Neural Information Pro- cessing Systems (NeurIPS), 2024

2024

[37] [37]

The lighthouse of language: Enhancing LLM agents via critique-guided improvement

Ruihan Yang, Fanghua Ye, Jian Li, Siyu Yuan, Yikai Zhang, Zhaopeng Tu, Xiaolong Li, and Deqing Yang. The lighthouse of language: Enhancing LLM agents via critique-guided improvement. InAdvances in Neu- ral Information Processing Systems (NeurIPS), 2025

2025

[38] [38]

The alignment problem from a deep learning perspec- tive

Richard Ngo, Lawrence Chan, and Sören Mindermann. The alignment problem from a deep learning perspec- tive. InInternational Conference on Learning Repre- sentations (ICLR), 2024

2024

[39] [39]

ArCHer: Training language model agents via hierarchical multi-turn RL

Yifei Zhou, Andrea Zanette, Jiayi Pan, Sergey Levine, and Aviral Kumar. ArCHer: Training language model agents via hierarchical multi-turn RL. InInternational Conference on Machine Learning (ICML), 2024

2024

[40] [40]

Robust hierar- chical anomaly detection using feature impact in iot networks.ICT Express, 11(2):358–363, Apr

Joohong Rheey and Hyunggon Park. Robust hierar- chical anomaly detection using feature impact in iot networks.ICT Express, 11(2):358–363, Apr. 2025

2025

[41] [41]

Option discovery us- ing LLM-guided semantic hierarchical reinforcement learning.arXiv preprint arXiv:2503.19007, 2025

Chak Shek and Pratap Tokekar. Option discovery us- ing LLM-guided semantic hierarchical reinforcement learning.arXiv preprint arXiv:2503.19007, 2025

work page arXiv 2025

[42] [42]

Leveraging imitation learning and LLMs for efficient hierarchical reinforcement learning

Runhan Yang, Jieao Shi, Mengqi Su, and Don- gruo Zhou. Leveraging imitation learning and LLMs for efficient hierarchical reinforcement learning. https://openreview.net/forum?id=6y00rooi7i, 2025

2025

[43] [43]

Getting more juice out of the SFT data: Reward learning from human demonstration improves SFT for LLM alignment

Jiaxiang Li, Siliang Zeng, Hoi Wai, Chenliang Li, Alfredo Garcia, and Mingyi Hong. Getting more juice out of the SFT data: Reward learning from human demonstration improves SFT for LLM alignment. In Advances in Neural Information Processing Systems (NeurIPS), 2024

2024

[44] [44]

Large lan- guage models as generalizable policies for embodied tasks

Andrew Szot, Max Schwarzer, Harsh Agrawal, Bog- dan Mazoure, Rin Metcalf, Walter Talbott, Natalie Mackraz, R Hjelm, and Alexander Toshev. Large lan- guage models as generalizable policies for embodied tasks. InInternational Conference on Learning Repre- sentations (ICLR), 2024

2024

[45] [45]

Unlocking LLMs’ self-improvement capacity with autonomous learning for domain adaptation

Ke Ji, Junying Chen, Anningzhe Gao, Wenya Xie, Xiang Wan, and Benyou Wang. Unlocking LLMs’ self-improvement capacity with autonomous learning for domain adaptation. InFindings of the Association for Computational Linguistics (ACL), 2025

2025

[46] [46]

Data mix- ing optimization for supervised fine-tuning of large language models

Yuan Li, Zhengzhong Liu, and Eric Xing. Data mix- ing optimization for supervised fine-tuning of large language models. InInternational Conference on Ma- chine Learning (ICML), 2025

2025

[47] [47]

Wei Lu, Rachel Luu, and Markus Buehler. Fine-tuning large language models for domain adaptation: Explo- ration of training strategies, scaling, model merging and synergistic capabilities.npj Computational Mate- rials, 11(1):84, 2025

2025

[48] [48]

Coevolving with the other you: Fine-tuning LLM with sequential coopera- tive multi-agent reinforcement learning

Hao Ma, Tianyi Hu, Zhiqiang Pu, Boyin Liu, Xiaolin Ai, Yanyan Liang, and Min Chen. Coevolving with the other you: Fine-tuning LLM with sequential coopera- tive multi-agent reinforcement learning. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

2024

[49] [49]

Instant inverse mod- eling of stochastic driving behavior with deep rein- forcement learning.IEEE Transactions on Consumer Electronics, 71(1):2152–2162, Feb

Dongsu Lee and Minhae Kwon. Instant inverse mod- eling of stochastic driving behavior with deep rein- forcement learning.IEEE Transactions on Consumer Electronics, 71(1):2152–2162, Feb. 2025

2025

[50] [50]

Control- ling large language model with latent action

Chengxing Jia, Ziniu Li, Pengyuan Wang, Yi Li, Zhenyu Hou, Yuxiao Dong, and Yang Yu. Control- ling large language model with latent action. InIn- ternational Conference on Machine Learning (ICML), 2025

2025

[51] [51]

Stability analysis in mixed-autonomous traffic with deep reinforcement learning.IEEE Transactions on Vehicular Technology, 72(3):2848–2862, Mar

Dongsu Lee and Minhae Kwon. Stability analysis in mixed-autonomous traffic with deep reinforcement learning.IEEE Transactions on Vehicular Technology, 72(3):2848–2862, Mar. 2023

2023

[52] [52]

QLASS: Boost- ing language agent inference via Q-guided stepwise search

Zongyu Lin, Yao Tang, Xingcheng Yao, Da Yin, Ziniu Hu, Yizhou Sun, and Kai Chang. QLASS: Boost- ing language agent inference via Q-guided stepwise search. InInternational Conference on Machine Learn- ing (ICML), 2025

2025

[53] [53]

Temporal distance- aware transition augmentation for offline model-based reinforcement learning

Dongsu Lee and Minhae Kwon. Temporal distance- aware transition augmentation for offline model-based reinforcement learning. InInternational Conference on Machine Learning (ICML), 2025

2025

[54] [54]

Online reinforcement learning in stochastic games

Chen Wei, Yi Hong, and Chi Lu. Online reinforcement learning in stochastic games. InAdvances in Neural Information Processing Systems (NeurIPS), 2017

2017

[55] [55]

Efficient online reinforcement learning with offline data

Philip Ball, Laura Smith, Ilya Kostrikov, and Sergey Levine. Efficient online reinforcement learning with offline data. InInternational Conference on Machine Learning (ICML), 2023

2023

[56] [56]

Leveraging offline data in online reinforcement learning

Andrew Wagenmaker and Aldo Pacchiano. Leveraging offline data in online reinforcement learning. InIn- ternational Conference on Machine Learning (ICML), 2023

2023

[57] [57]

Continuous control with deep reinforcement learning

Timothy Lillicrap, Jonathan Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. InInternational Conference on Learning Representations (ICLR), 2016. 12 Multi2: Hierarchical Multi-Agent Decision-Making with LLM-Based Agents in Interactive Environments

2016

[58] [58]

Foresighted decisions for inter-vehicle interactions: An offline reinforcement learning approach

Dongsu Lee and Minhae Kwon. Foresighted decisions for inter-vehicle interactions: An offline reinforcement learning approach. InIEEE International Conference on Intelligent Transportation Systems (ITSC), 2023

2023

[59] [59]

Ad- dressing function approximation error in actor-critic methods

Scott Fujimoto, Herke Hoof, and David Meger. Ad- dressing function approximation error in actor-critic methods. InInternational Conference on Machine Learning (ICML), 2018

2018

[60] [60]

Selec- tive imitation for efficient online reinforcement learn- ing with pre-collected data.ICT Express, 10(6):1308– 1314, Dec

Chanin Eom, Dongsu Lee, and Minhae Kwon. Selec- tive imitation for efficient online reinforcement learn- ing with pre-collected data.ICT Express, 10(6):1308– 1314, Dec. 2024

2024

[61] [61]

Of- fline reinforcement learning with implicit Q-learning

Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Of- fline reinforcement learning with implicit Q-learning. InInternational Conference on Learning Representa- tions (ICLR), 2022

2022

[62] [62]

Price of the au- tonomous strategy with reinforcement learning in mixed-autonomy traffic networks.IEEE Transactions on Intelligent Transportation Systems, 27(2):2741– 2752, Feb

Chanin Eom and Minhae Kwon. Price of the au- tonomous strategy with reinforcement learning in mixed-autonomy traffic networks.IEEE Transactions on Intelligent Transportation Systems, 27(2):2741– 2752, Feb. 2026

2026

[63] [63]

The impact of dataset on offline reinforcement learning performance in uav-based emergency network recov- ery tasks.IEEE Communications Letters, 28(5):1058– 1061, May

Jeyeon Eo, Dongsu Lee, and Minhae Kwon. The impact of dataset on offline reinforcement learning performance in uav-based emergency network recov- ery tasks.IEEE Communications Letters, 28(5):1058– 1061, May. 2024

2024

[64] [64]

Curriculum reinforcement learning for cohesive team in mobile ad hoc networks.IEEE Communications Letters, 26(8):1809–1813, Aug

Nayoung Kim, Minhae Kwon, and Hyunggon Park. Curriculum reinforcement learning for cohesive team in mobile ad hoc networks.IEEE Communications Letters, 26(8):1809–1813, Aug. 2022

2022

[65] [65]

AD4RL: Autonomous driving benchmarks for offline reinforcement learning with value-based dataset

Dongsu Lee, Chanin Eom, and Minhae Kwon. AD4RL: Autonomous driving benchmarks for offline reinforcement learning with value-based dataset. In IEEE International Conference on Robotics and Au- tomation (ICRA), 2024

2024

[66] [66]

Episodic future thinking with offline reinforcement learning for au- tonomous driving.IEEE Internet of Things Journal, 12(11):17012–17023, Jun

Dongsu Lee and Minhae Kwon. Episodic future thinking with offline reinforcement learning for au- tonomous driving.IEEE Internet of Things Journal, 12(11):17012–17023, Jun. 2025

2025

[67] [67]

A unified principle of pessimism for offline reinforce- ment learning under model mismatch

Yue Wang, Zhongchang Sun, and Shaofeng Zou. A unified principle of pessimism for offline reinforce- ment learning under model mismatch. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

2024

[68] [68]

Is value learning really the main bottleneck in offline RL? InAdvances in Neural Information Processing Systems (NeurIPS), 2024

Seohong Park, Kevin Frans, Sergey Levine, and Aviral Kumar. Is value learning really the main bottleneck in offline RL? InAdvances in Neural Information Processing Systems (NeurIPS), 2024

2024

[69] [69]

Beyond online sampling: Bridging offline- to-online alignment via dynamic data transformation for LLMs

Zhang Zhang, Guhao Feng, Jian Guan, Di He, and Wei Wu. Beyond online sampling: Bridging offline- to-online alignment via dynamic data transformation for LLMs. InEmpirical Methods in Natural Language Processing (EMNLP), 2025

2025

[70] [70]

DigiRL: Training in-the-wild device-control agents with au- tonomous reinforcement learning

Hao Bai, Yifei Zhou, Mert Cemri, Jiayi Pan, Alane Suhr, Sergey Levine, and Aviral Kumar. DigiRL: Training in-the-wild device-control agents with au- tonomous reinforcement learning. InAdvances in Neu- ral Information Processing Systems (NeurIPS), 2024

2024

[71] [71]

Unpacking DPO and PPO: Disentangling best practices for learning from preference feedback

Hamish Ivison, Yizhong Wang, Jiacheng Liu, Zeqiu Wu, Valentina Pyatkin, Nathan Lambert, Noah Smith, Yejin Choi, and Hannaneh Hajishirzi. Unpacking DPO and PPO: Disentangling best practices for learning from preference feedback. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

2024

[72] [72]

Bridging offline and online reinforcement learning for llms.arXiv preprint arXiv:2506.21495, 2025

Jack Lanchantin, Angelica Chen, Janice Lan, Xian Li, Swarnadeep Saha, Tianlu Wang, Jing Xu, Ping Yu, Weizhe Yuan, Jason Weston, Sainbayar Sukhbaatar, and Ilia Kulikov. Bridging offline and online re- inforcement learning for LLMs.arXiv preprint arXiv:2506.21495, 2025

work page arXiv 2025

[73] [73]

Scenario-free au- tonomous driving with multi-task offline-to-online re- inforcement learning.IEEE Transactions on Intelli- gent Transportation Systems, 26(9):13317–13330, Sep

Dongsu Lee and Minhae Kwon. Scenario-free au- tonomous driving with multi-task offline-to-online re- inforcement learning.IEEE Transactions on Intelli- gent Transportation Systems, 26(9):13317–13330, Sep. 2025

2025

[74] [74]

Test-time fine-tuning of image compression models for multi- task adaptability

Unki Park, Seongmoon Jeong, Youngchan Jang, Gyeong-Moon Park, and Jong Hwan Ko. Test-time fine-tuning of image compression models for multi- task adaptability. InIEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), 2025

2025

[75] [75]

KL-regularized reinforce- ment learning is designed to mode collapse

Anthony Chen, Jatin Prakash, Jeff Guo, Rob Fergus, and Rajesh Ranganath. KL-regularized reinforce- ment learning is designed to mode collapse. InIn- ternational Conference on Learning Representations (ICLR), 2026

2026

[76] [76]

KL-regularised Q-learning: A token- level action-value perspective on online RLHF

Jason Brown, Lennie Wells, Edward Young, and Ser- gio Bacallado. KL-regularised Q-learning: A token- level action-value perspective on online RLHF. In ICML Workshop on Models of Human Feedback for AI Alignment, 2025

2025

[77] [77]

The choice of diver- gence: A neglected key to mitigating diversity collapse in reinforcement learning with verifiable reward

Long Li, Jiaran Hao, Jason Liu, Zhijian Zhou, Yanting Miao, Wei Pang, Xiaoyu Tan, Wei Chu, Zhe Wang, Shirui Pan, Chao Qu, and Yuan Qi. The choice of diver- gence: A neglected key to mitigating diversity collapse in reinforcement learning with verifiable reward. In International Conference on Learning Representations (ICLR), 2026. 13 Multi2: Hierarchical M...

2026

[78] [78]

ALF- World: Aligning text and embodied environments for interactive learning

Mohit Shridhar, Xingdi Yuan, Marc Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. ALF- World: Aligning text and embodied environments for interactive learning. InInternational Conference on Learning Representations (ICLR), 2021

2021

[79] [79]

Qwen2.5 Technical Report

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Ke- qin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tianyi...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[80] [80]

Mistral 7B

Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Chaplot, Diego Casas, Flo- rian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Lavaud, Marie Lachaux, Pierre Stock, Teven Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William Sayed. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023