Multimodal Policy Internalization for Conversational Agents
Pith reviewed 2026-05-18 07:51 UTC · model grok-4.3
The pith
Internalizing multimodal policies into model parameters allows stronger adherence without including them during inference.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TriMPI achieves notable gains in end-to-end accuracy, generalization, and robustness to forgetting by internalizing reasoning-intensive multimodal policies into model parameters, enabling stronger policy-following without including the policy during inference.
What carries the argument
TriMPI, a three-stage training framework that injects policy knowledge via continual pretraining, performs supervised finetuning, and applies PolicyRollout, a GRPO-style reinforcement learning extension that augments rollouts with policy-aware responses for grounded exploration.
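A minimal sketch of the group-relative advantage that GRPO-style training rests on, under stated assumptions. The function names, the 0/1 rewards, and the choice to score policy-aware and policy-free rollouts in one shared group are hypothetical illustrations. This page does not specify how PolicyRollout mixes the two kinds of rollout, so this is not the authors' implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against the mean and std of its own rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def augmented_group_advantages(plain_rewards, policy_aware_rewards, eps=1e-6):
    """Hypothetical augmentation: place both rollout types in a single group.

    plain_rewards: rewards of rollouts sampled WITHOUT the policy in context.
    policy_aware_rewards: rewards of rollouts sampled WITH the policy in context.
    Returns the advantages split back into the two original lists.
    """
    group = list(plain_rewards) + list(policy_aware_rewards)
    advantages = group_relative_advantages(group, eps)
    n_plain = len(plain_rewards)
    return advantages[:n_plain], advantages[n_plain:]


# Toy rewards (assumed): every policy-free rollout fails, both policy-aware ones succeed.
plain = [0.0, 0.0, 0.0, 0.0]
aware = [1.0, 1.0]
plain_adv, aware_adv = augmented_group_advantages(plain, aware)
```

In this toy group the four policy-free rollouts all earn zero reward, so scored alone they would have zero advantage and carry no learning signal. Adding the two rewarded policy-aware rollouts gives the group a nonzero spread, which is one plausible reading of "grounded exploration".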
If this is right
- Notable gains in end-to-end accuracy on synthetic and real-world decision-making and tool-using tasks.
- Improved generalization to novel instructions.
- Increased robustness to forgetting of policies after training.
- Lower computational costs during inference by excluding long policies from prompts.
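The inference-cost point can be illustrated with simple prompt-length arithmetic. This is a rough sketch only: the token counts are made-up numbers, not figures from the paper, and image tokens are assumed to be folded into `query_tokens`.

```python
def prefill_tokens(policy_tokens: int, query_tokens: int, internalized: bool) -> int:
    """Prompt tokens processed per request; the policy is dropped when internalized."""
    return query_tokens if internalized else policy_tokens + query_tokens


def fraction_saved(policy_tokens: int, query_tokens: int) -> float:
    """Share of per-request prompt tokens removed by leaving the policy out."""
    baseline = prefill_tokens(policy_tokens, query_tokens, internalized=False)
    reduced = prefill_tokens(policy_tokens, query_tokens, internalized=True)
    return (baseline - reduced) / baseline


# Hypothetical sizes: a 4000-token policy and a 200-token query.
saved = fraction_saved(policy_tokens=4000, query_tokens=200)
```

The longer the policy is relative to the query, the closer the saved fraction gets to 1, which is why the fixed cost matters most for lengthy policies.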
Where Pith is reading between the lines
- The approach could support faster adaptation to policy changes through targeted updates to the internalized knowledge rather than prompt rewrites.
- Internalization might help models resolve conflicts between rules more reliably when policies reside in weights instead of context.
- This shift from prompt-based to parameter-based control could extend to other instruction-heavy domains such as planning or multi-agent coordination.
Load-bearing premise
Complex multimodal policies containing visual behaviors and tool-usage rules can be effectively compressed and internalized into model parameters via the three-stage pipeline without losing the ability to handle novel or conflicting instructions at test time.
What would settle it
A test showing that the internalized model fails on novel or conflicting multimodal instructions where a standard prompt-based policy model succeeds would challenge the claim that internalization preserves full policy reasoning.
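One way such a test could be scored is sketched below, assuming a hypothetical record format with `split`, `prediction`, and `label` fields and toy data. Nothing here reproduces the paper's evaluation code or results.

```python
from collections import defaultdict


def accuracy_by_split(records):
    """Exact-match accuracy for each value of the 'split' field."""
    totals, correct = defaultdict(int), defaultdict(int)
    for rec in records:
        totals[rec["split"]] += 1
        correct[rec["split"]] += int(rec["prediction"] == rec["label"])
    return {split: correct[split] / totals[split] for split in totals}


def accuracy_gap(internalized_records, in_context_records):
    """Internalized accuracy minus in-context accuracy, per shared split."""
    internal = accuracy_by_split(internalized_records)
    context = accuracy_by_split(in_context_records)
    return {s: internal[s] - context[s] for s in internal if s in context}


# Toy outputs (assumed): the same three test items answered by both systems.
internalized = [
    {"split": "novel", "prediction": "Case 3", "label": "Case 3"},
    {"split": "novel", "prediction": "Case 1", "label": "Case 2"},
    {"split": "conflicting", "prediction": "Case 5", "label": "Case 5"},
]
in_context = [
    {"split": "novel", "prediction": "Case 3", "label": "Case 3"},
    {"split": "novel", "prediction": "Case 2", "label": "Case 2"},
    {"split": "conflicting", "prediction": "Case 5", "label": "Case 5"},
]
gaps = accuracy_gap(internalized, in_context)
```

A consistently negative gap on the novel or conflicting split would be the kind of evidence described above; a gap near zero or positive would support the internalization claim.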
Original abstract
Modern conversational agents like ChatGPT and Alexa+ rely on predefined policies specifying metadata, response styles, and tool-usage rules. As these LLM-based systems expand to support diverse business and user queries, such policies, often implemented as in-context prompts, are becoming increasingly complex and lengthy, making faithful adherence difficult and imposing large fixed computational costs. With the rise of multimodal agents, policies that govern visual and multimodal behaviors are critical but remain understudied. Prior prompt-compression work mainly shortens task templates and demonstrations, while existing policy-alignment studies focus only on text-based safety rules. We introduce Multimodal Policy Internalization (MPI), a new task that internalizes reasoning-intensive multimodal policies into model parameters, enabling stronger policy-following without including the policy during inference. MPI poses unique data and algorithmic challenges. We build two datasets spanning synthetic and real-world decision-making and tool-using tasks and propose TriMPI, a three-stage training framework. TriMPI first injects policy knowledge via continual pretraining, then performs supervised finetuning, and finally applies PolicyRollout, a GRPO-style reinforcement learning extension that augments rollouts with policy-aware responses for grounded exploration. TriMPI achieves notable gains in end-to-end accuracy, generalization, and robustness to forgetting. As the first work on multimodal policy internalization, we provide datasets, training recipes, and comprehensive evaluations to foster future research. Project page: https://mikewangwzhl.github.io/TriMPI.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Multimodal Policy Internalization (MPI) as a new task for embedding reasoning-intensive multimodal policies (metadata, response styles, visual behaviors, and tool-usage rules) directly into LLM parameters. This enables faithful policy following at inference time without including the policy as an in-context prompt. The authors construct two datasets covering synthetic and real-world decision-making/tool-use scenarios and propose TriMPI, a three-stage pipeline: continual pretraining for knowledge injection, supervised fine-tuning, and PolicyRollout (a GRPO-style RL method that augments rollouts with policy-aware responses). They report gains in end-to-end accuracy, generalization, and robustness to forgetting, positioning the work as the first study on multimodal policy internalization.
Significance. If the empirical claims hold under rigorous verification, the paper would make a meaningful contribution by addressing the practical problem of lengthy, complex policies in multimodal conversational agents, which currently incur high inference costs and poor adherence. Being the first work on this task, the release of datasets, training recipes, and evaluations provides a useful foundation for future research. The PolicyRollout extension of GRPO is a reasonable algorithmic choice for encouraging grounded exploration.
major comments (2)
- [Abstract and §4] Abstract and §4 (Experiments): The headline claim of generalization to novel or conflicting multimodal policies rests on the test instances containing genuinely out-of-distribution policy rules (new visual behaviors or tool-usage conflicts). The manuscript does not describe the test-set construction procedure or provide evidence that held-out policies require abstraction rather than memorization of patterns from the same policy families; without this, measured gains in accuracy and forgetting robustness do not establish the required compression and internalization.
- [§3.3] §3.3 (PolicyRollout): The description of how policy-aware responses are generated for the GRPO rollouts is insufficiently detailed to assess whether they provide grounded exploration beyond standard RLHF-style methods. This detail is load-bearing for reproducing the reported robustness improvements.
minor comments (2)
- [Abstract] The abstract would be strengthened by including at least one quantitative result (e.g., accuracy delta or forgetting metric) to convey the magnitude of the claimed gains.
- [Figures and Tables] Table captions and axis labels in the experimental figures should explicitly state the evaluation metric and whether results are averaged over multiple seeds.
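The seed-averaging request in the last minor comment amounts to reporting a mean and a spread per metric. A small sketch, with an assumed helper name and invented scores:

```python
from statistics import mean, stdev


def summarize_over_seeds(scores):
    """Return (mean, sample standard deviation) of one metric across seeds."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    return mean(scores), stdev(scores)


# Hypothetical accuracies from three training seeds.
avg, spread = summarize_over_seeds([0.70, 0.72, 0.74])
```

Stating both numbers in a caption, together with the number of seeds, lets a reader judge whether a reported gain exceeds run-to-run variation.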
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our introduction of the Multimodal Policy Internalization (MPI) task and the TriMPI training pipeline. We address each major comment below with clarifications and revisions to improve the manuscript's rigor and reproducibility.
- Referee: [Abstract and §4] Abstract and §4 (Experiments): The headline claim of generalization to novel or conflicting multimodal policies rests on the test instances containing genuinely out-of-distribution policy rules (new visual behaviors or tool-usage conflicts). The manuscript does not describe the test-set construction procedure or provide evidence that held-out policies require abstraction rather than memorization of patterns from the same policy families; without this, measured gains in accuracy and forgetting robustness do not establish the required compression and internalization.
  Authors: We agree that explicit details on test-set construction are essential to substantiate the generalization claims. The test sets were constructed by partitioning policies such that training uses one set of rules while test instances introduce entirely novel multimodal policies, including new visual behaviors (e.g., previously unseen gesture or scene-interaction rules) and conflicting tool-usage constraints not present in any training policy family. In the revised manuscript, we will expand §4 with a dedicated subsection describing the construction procedure, including the policy partitioning strategy, examples of held-out rules, and quantitative measures of distributional shift (e.g., rule overlap statistics). We will also add analysis showing that performance gains persist on these OOD policies and are not explained by memorization of similar patterns, thereby supporting the internalization interpretation.
  Revision: yes
- Referee: [§3.3] §3.3 (PolicyRollout): The description of how policy-aware responses are generated for the GRPO rollouts is insufficiently detailed to assess whether they provide grounded exploration beyond standard RLHF-style methods. This detail is load-bearing for reproducing the reported robustness improvements.
  Authors: We acknowledge that the current description of PolicyRollout in §3.3 is too concise for full reproducibility. In the revised manuscript, we will substantially expand this section to detail the generation of policy-aware responses: specifically, the prompting templates that condition response sampling on both the user query and the full policy (including multimodal elements), the rollout augmentation procedure that mixes policy-grounded and standard responses, and the precise integration with GRPO to promote exploration that respects policy constraints. We will include pseudocode and additional implementation notes to distinguish this from standard RLHF rollouts and to enable reproduction of the robustness gains.
  Revision: yes
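The "rule overlap statistics" mentioned in the first response could take many forms. One simple candidate, sketched here under the assumption that each policy can be reduced to a set of rule strings, is the highest Jaccard overlap between a held-out policy and any training policy. The rule encoding and helper names are hypothetical, not taken from the paper.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two rule sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def max_overlap_with_train(test_policy: set, train_policies: list) -> float:
    """Highest overlap between one held-out policy and any training policy."""
    return max((jaccard(test_policy, p) for p in train_policies), default=0.0)


# Hypothetical rule encodings for two training policies and one held-out policy.
train = [
    {"color:gray", "shape:cylinder"},
    {"color:purple", "size:large"},
]
held_out = {"color:cyan", "material:rubber", "size:large"}
overlap = max_overlap_with_train(held_out, train)
```

Low maximum overlap across the test set would support the claim that held-out policies are out of distribution; high overlap would suggest that pattern memorization could explain the gains.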
Circularity Check
No circularity: empirical training recipe with held-out evaluation
The paper introduces MPI as a task and TriMPI as a three-stage empirical pipeline (continual pretraining for knowledge injection, SFT, then PolicyRollout GRPO). Central results are accuracy, generalization, and forgetting metrics measured on held-out synthetic and real-world decision-making/tool-use tasks. No equations, first-principles derivations, or predictions appear that reduce by construction to fitted parameters, self-definitions, or self-citation chains. The work is self-contained as an experimental method with external benchmarks (held-out test instances), satisfying the criteria for a non-circular empirical contribution.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: TriMPI consists of three stages: (1) Visually-Masked Continual Pretraining (VM-CPT); (2) Supervised Finetuning with Chain-of-thought (CoT SFT); (3) Reinforcement learning (RL) with PolicyRollout.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.