arxiv: 2510.09474 · v2 · submitted 2025-10-10 · 💻 cs.CL · cs.AI

Multimodal Policy Internalization for Conversational Agents

Pith reviewed 2026-05-18 07:51 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords multimodal policy internalization · conversational agents · policy following · reinforcement learning · continual pretraining · tool usage · decision making

The pith

Internalizing multimodal policies into model parameters allows stronger adherence without including them during inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Modern conversational agents rely on complex policies for metadata, response styles, and tool-usage rules that grow lengthy and hard to follow as tasks become multimodal. The paper establishes that these reasoning-intensive policies can be internalized directly into model parameters through a three-stage process, yielding better end-to-end accuracy, generalization, and resistance to forgetting. A sympathetic reader would care because this removes the constant computational burden of long prompts and makes faithful policy following more reliable even when instructions are new or conflicting.

Core claim

TriMPI achieves notable gains in end-to-end accuracy, generalization, and robustness to forgetting by internalizing reasoning-intensive multimodal policies into model parameters, enabling stronger policy-following without including the policy during inference.

What carries the argument

TriMPI, a three-stage training framework that injects policy knowledge via continual pretraining, performs supervised finetuning, and applies PolicyRollout, a GRPO-style reinforcement learning extension that augments rollouts with policy-aware responses for grounded exploration.
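
A minimal sketch of the advantage step in PolicyRollout may help make "augments rollouts with policy-aware responses" concrete. It assumes the usual GRPO convention of standardizing rewards within a rollout group; the function names, the binary rewards, and the group sizes are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each reward within its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def policy_rollout_advantages(standard_rewards, policy_aware_rewards):
    """Score standard and policy-aware rollouts together as a single group.

    Assumption: policy-aware responses are treated as if they had been
    sampled from the policy-free input, so they share one baseline.
    """
    combined = list(standard_rewards) + list(policy_aware_rewards)
    advantages = group_relative_advantages(combined)
    n = len(standard_rewards)
    return advantages[:n], advantages[n:]


# Hypothetical binary rewards (1.0 = correct final answer).
standard = [0.0, 0.0, 0.0, 0.0]  # sampled without the policy in context
policy_aware = [1.0, 1.0]        # sampled with the policy in context

std_adv, aware_adv = policy_rollout_advantages(standard, policy_aware)
print(std_adv, aware_adv)
```

In this toy case the standard rollouts alone would all receive zero advantage, since every reward in the group is identical. Adding the policy-aware rollouts to the group gives the failed samples a negative advantage and the grounded ones a positive advantage, which is one plausible reading of "grounded exploration".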

If this is right

  • Notable gains in end-to-end accuracy on synthetic and real-world decision-making and tool-using tasks.
  • Improved generalization to novel instructions.
  • Increased robustness to forgetting of policies after training.
  • Lower computational costs during inference by excluding long policies from prompts.
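
The inference-cost point can be checked with simple arithmetic. The paper's discussion around Figure 1 puts the policy prompt at 20× to 250× the input-token cost of the user query; the sketch below turns that ratio into the fraction of input tokens saved once the policy is dropped from the prompt. The query length of 100 tokens is a hypothetical value, and image tokens are ignored for simplicity.

```python
def input_tokens(query_tokens, policy_ratio, include_policy):
    """Total input tokens for one request.

    policy_ratio is the policy length expressed as a multiple of the query length.
    """
    policy_tokens = query_tokens * policy_ratio if include_policy else 0
    return query_tokens + policy_tokens


def prompt_savings(query_tokens, policy_ratio):
    """Fraction of input tokens removed by leaving the policy out of the prompt."""
    with_policy = input_tokens(query_tokens, policy_ratio, include_policy=True)
    without_policy = input_tokens(query_tokens, policy_ratio, include_policy=False)
    return 1 - without_policy / with_policy


for ratio in (20, 250):
    print(ratio, round(prompt_savings(100, ratio), 4))
```

Under these assumptions the saving depends only on the ratio, not on the query length: a ratio of r removes r / (r + 1) of the input tokens.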

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could support faster adaptation to policy changes through targeted updates to the internalized knowledge rather than prompt rewrites.
  • Internalization might help models resolve conflicts between rules more reliably when policies reside in weights instead of context.
  • This shift from prompt-based to parameter-based control could extend to other instruction-heavy domains such as planning or multi-agent coordination.

Load-bearing premise

That complex multimodal policies containing visual behaviors and tool-usage rules can be effectively compressed and internalized into model parameters via the three-stage pipeline without losing the ability to handle novel or conflicting instructions at test time.

What would settle it

A test showing that the internalized model fails on novel or conflicting multimodal instructions where a standard prompt-based policy model succeeds would challenge the claim that internalization preserves full policy reasoning.

Figures

Figures reproduced from arXiv: 2510.09474 by Amin Fazel, Chenlei Guo, Heng Ji, Jiateng Liu, Ritesh Sarkhel, Ruhi Sarikaya, Xiang Li, Xing Fan, Zhenhailong Wang.

Figure 1: Motivation of the proposed Multimodal Policy Internalization task. The goal is to enhance the policy-following abilities of a large multimodal model without requiring the policy to be provided in-context during inference, thereby improving both performance and efficiency.
Figure 2: ClevrPolicy dataset. Left: Illustration of policy generation, where a decision tree is first generated and converted into natural language instructions (see Appendix C.1 for details on the decision node ontology, and Figures 14, 15 for full policy examples). Right: Example input-output pair corresponding to the policy. The policy is available only during training and not during inference.
Figure 3: GTAPolicy dataset. Left: illustration of the policy, consisting of two major parts, tool description and tool calling rules (see …
Figure 4: Overview of different training algorithms for multimodal policy internalization. The solid purple outlines indicate the parts where the next-token prediction loss is computed. On the right, we illustrate the proposed three-stage training strategy, TriMPI, which enables direct policy knowledge injection through the VM-CPT stage and policy-grounded reinforcement learning through PolicyRollout. The PolicyRoll…
Figure 5: Illustration of the PolicyRollout algorithm (applied to GRPO as an example). During the rollout phase, we additionally construct a set of input instances with the policy included in-context. These policy-aware responses are added to the rollout space as if they were generated from the original inputs without the policy in-context. The advantage and policy gradient are then computed on the combined rollouts…
Figure 6: Efficiency metrics before and after MPI.
Figure 7: Qualitative example comparing different MPI algorithms. On the left, we show the inputs and the ground-truth reasoning trajectory annotated with the original policy sections. On the right, the CoT SFT model makes an error in correctly recalling the policy condition, and the CoT SFT + GRPO model makes an incorrect decision at the fifth condition, both leading to an incorrect final outcome. In contrast, the …
Figure 8: Illustration of the Policy Referral evaluation setup. We take the responses from the internalized model and ask a strong LLM (Claude-4 (Anthropic, 2025)) to score the consistency between any policy referral in the response and the original policy. Policy referral is designed to evaluate the quality of the embedded policy knowledge beyond end-task performance.
Figure 9: Prompt to Claude-4 for the Policy Referral evaluation.
Figure 10: Illustration of the Policy Override evaluation setup. For ClevrPolicy, we provide a randomly sampled new policy in-context during inference as updated policy content while keeping its original unique name. For GTAPolicy, we modify the tool-calling rules, resulting in a different model version. Policy Override evaluates the model’s capability to generalize to updated or modified policies beyond overfitting…
Figure 11: CoT data example (ClevrPolicy-T N=6).
Figure 12: CoT data example (ClevrPolicy-M N=6).
Figure 13: CoT data example (GTAPolicy).
Figure 14: Full policy example (ClevrPolicy-T N=6).
Figure 15: Full policy example (ClevrPolicy-M N=6).
Figure 16: Full policy example (GTAPolicy).

Original abstract

Modern conversational agents like ChatGPT and Alexa+ rely on predefined policies specifying metadata, response styles, and tool-usage rules. As these LLM-based systems expand to support diverse business and user queries, such policies, often implemented as in-context prompts, are becoming increasingly complex and lengthy, making faithful adherence difficult and imposing large fixed computational costs. With the rise of multimodal agents, policies that govern visual and multimodal behaviors are critical but remain understudied. Prior prompt-compression work mainly shortens task templates and demonstrations, while existing policy-alignment studies focus only on text-based safety rules. We introduce Multimodal Policy Internalization (MPI), a new task that internalizes reasoning-intensive multimodal policies into model parameters, enabling stronger policy-following without including the policy during inference. MPI poses unique data and algorithmic challenges. We build two datasets spanning synthetic and real-world decision-making and tool-using tasks and propose TriMPI, a three-stage training framework. TriMPI first injects policy knowledge via continual pretraining, then performs supervised finetuning, and finally applies PolicyRollout, a GRPO-style reinforcement learning extension that augments rollouts with policy-aware responses for grounded exploration. TriMPI achieves notable gains in end-to-end accuracy, generalization, and robustness to forgetting. As the first work on multimodal policy internalization, we provide datasets, training recipes, and comprehensive evaluations to foster future research. Project page: https://mikewangwzhl.github.io/TriMPI.
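
For the synthetic dataset, the Figure 2 caption describes policies that start as a generated decision tree and are then converted into natural-language instructions. The sketch below illustrates that idea under stated assumptions. The numbering scheme, in which the yes branch of Condition 1 is Condition 1.1 and the no branch is Condition 1.2, follows the paper's chain-of-thought examples. The attribute vocabulary, the balanced-tree sampler, and the exact sentence wording are hypothetical, and the "focus only on matching objects" semantics of the real policies are omitted.

```python
import random

# Hypothetical attribute vocabulary; the real decision-node ontology is in the paper's appendix.
ATTRIBUTES = ["gray", "purple", "brown", "cylinder", "large", "small"]


def build_tree(depth, rng, counter):
    """Return a balanced tree: internal nodes test an attribute, leaves hold a case id."""
    if depth == 0:
        case = counter[0]
        counter[0] += 1
        return {"case": case}
    return {
        "attribute": rng.choice(ATTRIBUTES),
        "yes": build_tree(depth - 1, rng, counter),
        "no": build_tree(depth - 1, rng, counter),
    }


def render(node, label="1"):
    """Convert an internal node and its subtree into numbered instruction lines."""
    lines = [f"# Condition {label}: check if there is any {node['attribute']} object"]
    pending = []
    for answer, suffix in (("yes", "1"), ("no", "2")):
        child = node[answer]
        child_label = f"{label}.{suffix}"
        if "case" in child:
            lines.append(f'- If {answer}: respond with "Case {child["case"]}"')
        else:
            lines.append(f"- If {answer}: continue with Condition {child_label}")
            pending.append((child, child_label))
    for child, child_label in pending:
        lines.extend(render(child, child_label))
    return lines


# depth must be at least 1 so that the root is a condition rather than a leaf.
policy_lines = render(build_tree(2, random.Random(0), [0]))
print("\n".join(policy_lines))
```

A tree of depth d yields 2^d - 1 conditions and 2^d cases, which shows how quickly such a policy grows and why keeping it in the prompt becomes costly.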

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Multimodal Policy Internalization (MPI) as a new task for embedding reasoning-intensive multimodal policies (metadata, response styles, visual behaviors, and tool-usage rules) directly into LLM parameters. This enables faithful policy following at inference time without including the policy as an in-context prompt. The authors construct two datasets covering synthetic and real-world decision-making/tool-use scenarios and propose TriMPI, a three-stage pipeline: continual pretraining for knowledge injection, supervised fine-tuning, and PolicyRollout (a GRPO-style RL method that augments rollouts with policy-aware responses). They report gains in end-to-end accuracy, generalization, and robustness to forgetting, positioning the work as the first study on multimodal policy internalization.

Significance. If the empirical claims hold under rigorous verification, the paper would make a meaningful contribution by addressing the practical problem of lengthy, complex policies in multimodal conversational agents, which currently incur high inference costs and poor adherence. Being the first work on this task, the release of datasets, training recipes, and evaluations provides a useful foundation for future research. The PolicyRollout extension of GRPO is a reasonable algorithmic choice for encouraging grounded exploration.

major comments (2)
  1. [Abstract and §4 (Experiments)] The headline claim of generalization to novel or conflicting multimodal policies rests on the test instances containing genuinely out-of-distribution policy rules (new visual behaviors or tool-usage conflicts). The manuscript does not describe the test-set construction procedure or provide evidence that held-out policies require abstraction rather than memorization of patterns from the same policy families; without this, measured gains in accuracy and forgetting robustness do not establish the required compression and internalization.
  2. [§3.3 (PolicyRollout)] The description of how policy-aware responses are generated for the GRPO rollouts is insufficiently detailed to assess whether they provide grounded exploration beyond standard RLHF-style methods. This detail is load-bearing for reproducing the reported robustness improvements.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by including at least one quantitative result (e.g., accuracy delta or forgetting metric) to convey the magnitude of the claimed gains.
  2. [Figures and Tables] Table captions and axis labels in the experimental figures should explicitly state the evaluation metric and whether results are averaged over multiple seeds.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our introduction of the Multimodal Policy Internalization (MPI) task and the TriMPI training pipeline. We address each major comment below with clarifications and revisions to improve the manuscript's rigor and reproducibility.

  1. Referee: [Abstract and §4 (Experiments)] The headline claim of generalization to novel or conflicting multimodal policies rests on the test instances containing genuinely out-of-distribution policy rules (new visual behaviors or tool-usage conflicts). The manuscript does not describe the test-set construction procedure or provide evidence that held-out policies require abstraction rather than memorization of patterns from the same policy families; without this, measured gains in accuracy and forgetting robustness do not establish the required compression and internalization.

    Authors: We agree that explicit details on test-set construction are essential to substantiate the generalization claims. The test sets were constructed by partitioning policies such that training uses one set of rules while test instances introduce entirely novel multimodal policies, including new visual behaviors (e.g., previously unseen gesture or scene-interaction rules) and conflicting tool-usage constraints not present in any training policy family. In the revised manuscript, we will expand §4 with a dedicated subsection describing the construction procedure, including the policy partitioning strategy, examples of held-out rules, and quantitative measures of distributional shift (e.g., rule overlap statistics). We will also add analysis showing that performance gains persist on these OOD policies and are not explained by memorization of similar patterns, thereby supporting the internalization interpretation. revision: yes

  2. Referee: [§3.3 (PolicyRollout)] The description of how policy-aware responses are generated for the GRPO rollouts is insufficiently detailed to assess whether they provide grounded exploration beyond standard RLHF-style methods. This detail is load-bearing for reproducing the reported robustness improvements.

    Authors: We acknowledge that the current description of PolicyRollout in §3.3 is too concise for full reproducibility. In the revised manuscript, we will substantially expand this section to detail the generation of policy-aware responses: specifically, the prompting templates that condition response sampling on both the user query and the full policy (including multimodal elements), the rollout augmentation procedure that mixes policy-grounded and standard responses, and the precise integration with GRPO to promote exploration that respects policy constraints. We will include pseudocode and additional implementation notes to distinguish this from standard RLHF rollouts and to enable reproduction of the robustness gains. revision: yes
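
The "rule overlap statistics" promised in the first response could take many forms. One simple candidate, shown here purely as an illustration, is the Jaccard overlap between the rule set of a held-out policy and each training policy. Representing a rule as a string and all example rules below are hypothetical; neither the referee nor the simulated rebuttal specifies a metric.

```python
def jaccard(rules_a, rules_b):
    """Jaccard overlap between two collections of rules (0.0 = disjoint, 1.0 = identical)."""
    a, b = set(rules_a), set(rules_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def max_train_overlap(test_policy, train_policies):
    """Highest overlap between one held-out policy and any training policy."""
    return max(jaccard(test_policy, policy) for policy in train_policies)


# Hypothetical rule sets, each rule written as "condition -> outcome".
train_policies = [
    {"any gray object -> 1.1", "any small object -> Case 0"},
    {"any cyan object -> 1.1", "any rubber object -> Case 3"},
]
held_out = {"any gray object -> 1.1", "any metal object -> Case 7"}

print(max_train_overlap(held_out, train_policies))
```

A low maximum overlap across the test set would support the out-of-distribution reading of the generalization results, while a high value would suggest that memorization of shared rules cannot be ruled out.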

Circularity Check

0 steps flagged

No circularity: empirical training recipe with held-out evaluation

The paper introduces MPI as a task and TriMPI as a three-stage empirical pipeline (continual pretraining for knowledge injection, SFT, then PolicyRollout GRPO). Central results are accuracy, generalization, and forgetting metrics measured on held-out synthetic and real-world decision-making/tool-use tasks. No equations, first-principles derivations, or predictions appear that reduce by construction to fitted parameters, self-definitions, or self-citation chains. The work is self-contained as an experimental method with external benchmarks (held-out test instances), satisfying the criteria for a non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on standard assumptions of continual pretraining and RL fine-tuning being sufficient to embed policy knowledge, plus the existence of representative synthetic and real-world multimodal decision datasets; no new mathematical axioms or invented physical entities are introduced.
