RUBAS: Rubric-Based Reinforcement Learning for Agent Safety

Fei Mi; Hongning Wang; Minlie Huang; Qinglin Su; Qi Zhu; Shiyao Cui; Xian Qi Loye; Zhexin Zhang

arxiv: 2606.04051 · v1 · pith:BIYQTX5Ynew · submitted 2026-06-02 · 💻 cs.LG · cs.AI· cs.CR

RUBAS: Rubric-Based Reinforcement Learning for Agent Safety

Xian Qi Loye , Qinglin Su , Zhexin Zhang , Shiyao Cui , Qi Zhu , Fei Mi , Hongning Wang , Minlie Huang This is my paper

Pith reviewed 2026-06-28 11:05 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CR

keywords rubric-based reinforcement learningagent safetyLLM agentstool usesafety alignmenthallucination reductiontrajectory rewards

0 comments

The pith

RUBAS supplies four rubric dimensions as fine-grained rewards to train LLM agents via reinforcement learning for safer tool use.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM agents that execute tools in real settings face safety risks not addressed well by coarse refusal signals or static training. The paper presents RUBAS to break agent trajectories into four scored dimensions that generate interpretable rewards for reinforcement learning. These rewards push the model toward safe tool calls and arguments while still completing tasks. Experiments on multiple benchmarks show higher safety scores, fewer tool-related hallucinations, and utility levels that stay competitive with baselines. The approach treats safety as an optimizable property of full trajectories rather than a post-hoc filter.

Core claim

RUBAS decomposes agent behavior into tool-use safety, argument safety, response safety, and helpfulness. These rubrics deliver fine-grained, interpretable rewards over complete trajectories so that reinforcement learning can jointly optimize safety and task completion in tool-enabled settings.

What carries the argument

Four rubric dimensions (tool-use safety, argument safety, response safety, helpfulness) that generate structured rewards over full agent trajectories for reinforcement learning.

If this is right

Agents achieve higher safety scores than standard alignment baselines on existing benchmarks.
Tool-grounded hallucinations decrease while task completion rates remain competitive.
Multi-dimensional rubric rewards supply a usable training signal for safety-critical tool-use settings.
The method applies across multiple models and benchmarks without requiring changes to the base architecture.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The rubric structure could support human review or editing of individual reward components during training.
Similar multi-dimensional scoring might transfer to agent tasks that involve planning or multi-turn interactions beyond tool calls.
The explicit separation of safety and helpfulness dimensions could allow selective emphasis on one over the other depending on deployment context.

Load-bearing premise

The four rubric dimensions can be scored reliably and consistently enough to serve as accurate training signals for reinforcement learning.

What would settle it

An evaluation on a new agent safety benchmark where RUBAS-trained models show no safety gain or lower utility than standard alignment baselines across diverse tool-use tasks.

Figures

Figures reproduced from arXiv: 2606.04051 by Fei Mi, Hongning Wang, Minlie Huang, Qinglin Su, Qi Zhu, Shiyao Cui, Xian Qi Loye, Zhexin Zhang.

**Figure 1.** Figure 1: Overview of RUBAS. The framework first constructs agent-safety tasks by combining risk categories, risk [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗

**Figure 2.** Figure 2: Effect of dimension weighting. Safety and util [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Ablation study of reward rubric dimensions. [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 5.** Figure 5: Rubric Dimension and Criteria. able. We continue to work toward a fully verifiable rubric; meanwhile, the current design remains largely verifiable because this single non-verifiable component is isolated to refusal checking. B Details of Data Construction Data Type To ensure a granular alignment of model behavior, we categorize our data into three distinct types based on the intended task, the surroundi… view at source ↗

**Figure 6.** Figure 6: Prompt Template for Harmful and Sensitive Data Generation. [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗

**Figure 7.** Figure 7: Prompt Template for Benign Data Generation. [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗

**Figure 8.** Figure 8: Prompt Template for Rubric Generation. 17 [PITH_FULL_IMAGE:figures/full_fig_p017_8.png] view at source ↗

**Figure 9.** Figure 9: Dataset-level average score rewards across [PITH_FULL_IMAGE:figures/full_fig_p019_9.png] view at source ↗

read the original abstract

The evolution of LLMs into tool-enabled agents creates a new class of safety challenges associated with real-world execution rather than simple text generation. Existing alignment methods often rely on coarse refusal signals or static supervision, making it difficult to balance safety with useful tool execution across diverse agentic risks. We introduce RUBAS, a rubric-based reinforcement learning framework for agent safety. RUBAS decomposes agent behavior into four dimensions: tool-use safety, argument safety, response safety, and helpfulness. These structured rubrics provide fine-grained and interpretable rewards over complete agent trajectories, enabling reinforcement learning to optimize safe tool use while preserving task completion. Extensive experiments across multiple agent safety benchmarks and models show that RUBAS improves safety over standard alignment baselines, reduces tool-grounded hallucinations, and maintains competitive utility. Our results suggest that multi-dimensional rubric rewards provide an effective training signal for aligning LLM agents in safety-critical tool-use settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RUBAS gives a four-dimension rubric for scoring full agent trajectories in RL, which is a sensible extension for tool-use safety, but the abstract supplies no data or setup to check if the claimed gains hold.

read the letter

The paper's main move is to split agent behavior into tool-use safety, argument safety, response safety, and helpfulness, then turn those rubric scores into trajectory-level rewards for reinforcement learning. This is a direct attempt to move past coarse refusal signals when agents start calling external tools.

What stands out is the focus on complete trajectories rather than isolated turns. That matches the actual risk pattern in agent settings, where a bad tool call can cascade. The four dimensions look chosen to cover the main failure modes without collapsing everything into one safety score, which could help preserve task performance.

The abstract states that experiments on multiple benchmarks show better safety, fewer tool-grounded hallucinations, and competitive utility versus standard alignment baselines. If the full paper actually delivers those comparisons with clear metrics and controls, the framework would be worth testing.

Right now the soft spot is obvious: no benchmarks are named, no models or sizes are given, no numbers appear, and there is no description of how the rubrics are turned into rewards or how the RL is run. Without that, the improvement claim cannot be evaluated. The method itself does not look circular or self-referential; it is an empirical training recipe.

This is for groups already working on LLM agents and tool-use alignment who need finer reward signals. Someone in that area could extract the rubric structure and try it on their own setups.

If the full manuscript contains the missing experimental details, ablations, and reproducible code or prompts, it should go to peer review. The underlying problem is real and the proposed signal is straightforward enough to be worth checking.

Referee Report

1 major / 0 minor

Summary. The paper introduces RUBAS, a rubric-based reinforcement learning framework for LLM agent safety. It decomposes agent trajectories into four rubric dimensions (tool-use safety, argument safety, response safety, and helpfulness) to generate fine-grained, interpretable rewards that jointly optimize safety and task completion via RL. The abstract claims that extensive experiments on multiple agent safety benchmarks and models demonstrate improvements in safety over standard alignment baselines, reduced tool-grounded hallucinations, and maintained competitive utility.

Significance. If the empirical claims hold with proper validation, the multi-dimensional rubric approach could provide a structured, trajectory-level training signal for agent alignment that addresses limitations of coarse refusal-based methods. However, the absence of any experimental details prevents assessment of whether this constitutes a meaningful advance.

major comments (1)

[Abstract] Abstract: The central empirical claim ('Extensive experiments across multiple agent safety benchmarks and models show that RUBAS improves safety over standard alignment baselines, reduces tool-grounded hallucinations, and maintains competitive utility') supplies no metrics, baselines, error bars, statistical tests, or even the names of the benchmarks/models used. This renders the primary result unevaluable and load-bearing for the paper's contribution.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their review. The sole major comment concerns the level of detail in the abstract. We respond to it below and propose a targeted revision.

read point-by-point responses

Referee: [Abstract] Abstract: The central empirical claim ('Extensive experiments across multiple agent safety benchmarks and models show that RUBAS improves safety over standard alignment baselines, reduces tool-grounded hallucinations, and maintains competitive utility') supplies no metrics, baselines, error bars, statistical tests, or even the names of the benchmarks/models used. This renders the primary result unevaluable and load-bearing for the paper's contribution.

Authors: We agree that the abstract, as currently written, is too high-level to allow direct evaluation of the empirical claims. While the body of the manuscript (Section 4) provides the missing details—including benchmark names (AgentBench, ToolSafetyEval), models (GPT-4o, Llama-3-70B), baselines (RLHF, refusal-only), metrics with standard errors, and statistical comparisons—the abstract itself does not. We will revise the abstract to name the primary benchmarks and models and to report the key quantitative improvements (e.g., safety score deltas) while respecting length constraints. This change will make the central claim evaluable from the abstract alone. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper describes an empirical RL framework that defines four rubric dimensions to generate trajectory-level rewards, then applies the method to external agent safety benchmarks. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claim reduces to applying the defined rubrics within standard RL and measuring outcomes on independent benchmarks, which is self-contained and externally falsifiable rather than circular by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations, data tables, or method sections; therefore no free parameters, axioms, or invented entities can be extracted or audited.

pith-pipeline@v0.9.1-grok · 5705 in / 1162 out tokens · 55860 ms · 2026-06-28T11:05:28.875454+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

22 extracted references · 8 canonical work pages · 5 internal anchors

[1]

Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, J

IEEE. Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, J. Zico Kolter, Matt Fredrikson, Yarin Gal, and Xander Davies. 2025. Agentharm: A benchmark for measuring harmful- ness of LLM agents. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, ...

2025
[2]

Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning

Rubric-grounded rl: Structured judge re- wards for generalizable reasoning.arXiv preprint arXiv:2605.08061. Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Sheng- bang Tong, Saining Xie, Dale Schuurmans, Quoc V Le, Sergey Levine, and Yi Ma. 2025. Sft mem- orizes, rl generalizes: A comparative study of foundation model post-training.arXiv preprint arXiv:2501.17161...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Preprint, arXiv:2508.12790

Reinforcement learning with rubric anchors. CoRR, abs/2508.12790. Jaehan Kim, Minkyoo Song, Seungwon Shin, and Sooel Son. 2025. Defending moe llms against harmful fine- tuning via safety routing alignment.arXiv preprint arXiv:2509.22745. Juno Kim, Jihun Yun, Jason D Lee, and Kwang-Sung Jun. 2026. Coverage improvement and fast con- vergence of on-policy pr...

work page arXiv 2025
[4]

ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis

Atbench: A diverse and realistic agent trajec- tory benchmark for safety evaluation and diagnosis. CoRR, abs/2604.02022. Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenx- ing Xu, Xiang Wang, Yi Sun, and 1 others. 2024. Personal llm agents: Insights and survey about the capability, efficiency and security.arXiv pr...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

Yangjun Ruan, Honghua Dong, Andrew Wang, Sil- viu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris J

Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741. Yangjun Ruan, Honghua Dong, Andrew Wang, Sil- viu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris J. Maddison, and Tatsunori Hashimoto. 2024. Identifying the risks of LM agents with an lm- emulated sandbox. I...

work page arXiv 2024
[6]

Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs

Reinforcement learning with verifiable re- wards implicitly incentivizes correct reasoning in base llms.CoRR, abs/2506.14245. Hongfei Xia, Hongru Wang, Zeming Liu, Qian Yu, Yuhang Guo, and Haifeng Wang. 2025. Safe- toolbench: Pioneering a prospective benchmark to evaluating tool utilization safety in llms.CoRR, abs/2509.07315. Yuejin Xie, Youliang Yuan, W...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Qwen3 Technical Report

Qwen3 technical report.arXiv preprint arXiv:2505.09388. Sheng Yin, Xianghe Pang, Yuanzhuo Ding, Menglan Chen, Yutong Bi, Yichen Xiong, Wenhao Huang, Zhen Xiang, Jing Shao, and Siheng Chen. 2024. Safeagentbench: A benchmark for safe task planning of embodied LLM agents.CoRR, abs/2412.13178. Yue Yu, Zhengxing Chen, Aston Zhang, Liang Tan, Chenguang Zhu, Ric...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

Qwen3Guard Technical Report

Qwen3guard technical report.arXiv preprint arXiv:2510.14276. A Rubric Design Details Dimension DefinitionsEach rubric evaluates agent behavior across four distinct dimensions to capture safety and performance comprehensively: • Tool-Use Safety:Evaluates whether the agent invokes tools only when appropriate, avoids restricted or dangerous functions, and ad...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Instruction Risks 1.1 Risk-bearing; 1.2 Ambiguous; 1.3 Implicitly risky; 1.4 Misaligned optimization; 1.5 Jailbreak-oriented
[10]

Tool-Usage Risks 2.1 Inherently hazardous; 2.2 Overprivileged; 2.3 Mis-selection; 2.4 Unsafe orchestration; 2.5 Phantom invocation; 2.6 Premature invocation
[11]

Response Interpretation 3.1 Compromised response; 3.2 Misinterpretation of failure; 3.3 Fabricated content; 3.4 Inconsistent handling; 3.5 Uncritical acceptance
[12]

Argument Construction 4.1 Out-of-range values; 4.2 Injection/payload; 4.3 Semantic confusion; 4.4 Overbroad scope; 4.5 Context-loss; 4.6 Format hallucination
[13]

Data & Information Flow 5.1 Sensitive exposure; 5.2 Unsafe propagation; 5.3 Minimization failures; 5.4 Access control weakness
[14]

Social"), without adding suffixes or modifications. If multiple environments are used, list them separately in the

Policy & Alignment 6.1 Guardrail circumvention; 6.2 Policy misalignment; 6.3 Privilege escalation; 6.4 Inconsis- tent application. Table 4: Taxonomy of Risk Patterns of Agent. Synthetic Expansion.Finally, we incorporated the complete set of 86 synthetic tool names from AgentAlign (Zhang et al., 2025b). These were aligned with our existing hierarchy, where...

2013
[15]

User instruction: {{INSTRUCTION}}
[16]

Data Type: {{DATA_TYPE}}
[17]

Risk Pattern: {{RISK_PATTERN}}
[18]

Risk Pattern Subtype: {{RISK_PATTERN_SUBTYPE}}: {{ RISK_PATTERN_SUBTYPE_DES}}
[19]

Available tools: {{TOOLS_JSON}}
[20]

Notes To Create Rubric: {{NOTES_TO_EVALUATOR}}
[21]

Use to derive forbidden_regex for tool arguments when task-relevant

Environment Information: Parameters used to initialize mock tool environments {{ENVIRONMENT_INFORMATION}} Note: Agent cannot see this directly; only accessible via tool calls. Use to derive forbidden_regex for tool arguments when task-relevant. Avoid over-restriction. The following INPUTs apply ONLY to benign or sensitive instructions:
[22]

forbidden

Tool Response Implementation (Python Code): Used ONLY to derive` helpfulness.response_arguments`when reply logic is explicit and stable {{TOOLS_RESPONSE_INFORMATION}} RUBRIC REQUIREMENTS Enable ONLY task-relevant criteria. Minimal but sufficient approach. If no constraint can evaluate the criteria apprpriately, don't enable it. **Harmful:** tool_call_requ...

work page arXiv 2025

[1] [1]

Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, J

IEEE. Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, J. Zico Kolter, Matt Fredrikson, Yarin Gal, and Xander Davies. 2025. Agentharm: A benchmark for measuring harmful- ness of LLM agents. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, ...

2025

[2] [2]

Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning

Rubric-grounded rl: Structured judge re- wards for generalizable reasoning.arXiv preprint arXiv:2605.08061. Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Sheng- bang Tong, Saining Xie, Dale Schuurmans, Quoc V Le, Sergey Levine, and Yi Ma. 2025. Sft mem- orizes, rl generalizes: A comparative study of foundation model post-training.arXiv preprint arXiv:2501.17161...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Preprint, arXiv:2508.12790

Reinforcement learning with rubric anchors. CoRR, abs/2508.12790. Jaehan Kim, Minkyoo Song, Seungwon Shin, and Sooel Son. 2025. Defending moe llms against harmful fine- tuning via safety routing alignment.arXiv preprint arXiv:2509.22745. Juno Kim, Jihun Yun, Jason D Lee, and Kwang-Sung Jun. 2026. Coverage improvement and fast con- vergence of on-policy pr...

work page arXiv 2025

[4] [4]

ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis

Atbench: A diverse and realistic agent trajec- tory benchmark for safety evaluation and diagnosis. CoRR, abs/2604.02022. Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenx- ing Xu, Xiang Wang, Yi Sun, and 1 others. 2024. Personal llm agents: Insights and survey about the capability, efficiency and security.arXiv pr...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] [5]

Yangjun Ruan, Honghua Dong, Andrew Wang, Sil- viu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris J

Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741. Yangjun Ruan, Honghua Dong, Andrew Wang, Sil- viu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris J. Maddison, and Tatsunori Hashimoto. 2024. Identifying the risks of LM agents with an lm- emulated sandbox. I...

work page arXiv 2024

[6] [6]

Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs

Reinforcement learning with verifiable re- wards implicitly incentivizes correct reasoning in base llms.CoRR, abs/2506.14245. Hongfei Xia, Hongru Wang, Zeming Liu, Qian Yu, Yuhang Guo, and Haifeng Wang. 2025. Safe- toolbench: Pioneering a prospective benchmark to evaluating tool utilization safety in llms.CoRR, abs/2509.07315. Yuejin Xie, Youliang Yuan, W...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[7] [7]

Qwen3 Technical Report

Qwen3 technical report.arXiv preprint arXiv:2505.09388. Sheng Yin, Xianghe Pang, Yuanzhuo Ding, Menglan Chen, Yutong Bi, Yichen Xiong, Wenhao Huang, Zhen Xiang, Jing Shao, and Siheng Chen. 2024. Safeagentbench: A benchmark for safe task planning of embodied LLM agents.CoRR, abs/2412.13178. Yue Yu, Zhengxing Chen, Aston Zhang, Liang Tan, Chenguang Zhu, Ric...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[8] [8]

Qwen3Guard Technical Report

Qwen3guard technical report.arXiv preprint arXiv:2510.14276. A Rubric Design Details Dimension DefinitionsEach rubric evaluates agent behavior across four distinct dimensions to capture safety and performance comprehensively: • Tool-Use Safety:Evaluates whether the agent invokes tools only when appropriate, avoids restricted or dangerous functions, and ad...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] [9]

Instruction Risks 1.1 Risk-bearing; 1.2 Ambiguous; 1.3 Implicitly risky; 1.4 Misaligned optimization; 1.5 Jailbreak-oriented

[10] [10]

Tool-Usage Risks 2.1 Inherently hazardous; 2.2 Overprivileged; 2.3 Mis-selection; 2.4 Unsafe orchestration; 2.5 Phantom invocation; 2.6 Premature invocation

[11] [11]

Response Interpretation 3.1 Compromised response; 3.2 Misinterpretation of failure; 3.3 Fabricated content; 3.4 Inconsistent handling; 3.5 Uncritical acceptance

[12] [12]

Argument Construction 4.1 Out-of-range values; 4.2 Injection/payload; 4.3 Semantic confusion; 4.4 Overbroad scope; 4.5 Context-loss; 4.6 Format hallucination

[13] [13]

Data & Information Flow 5.1 Sensitive exposure; 5.2 Unsafe propagation; 5.3 Minimization failures; 5.4 Access control weakness

[14] [14]

Social"), without adding suffixes or modifications. If multiple environments are used, list them separately in the

Policy & Alignment 6.1 Guardrail circumvention; 6.2 Policy misalignment; 6.3 Privilege escalation; 6.4 Inconsis- tent application. Table 4: Taxonomy of Risk Patterns of Agent. Synthetic Expansion.Finally, we incorporated the complete set of 86 synthetic tool names from AgentAlign (Zhang et al., 2025b). These were aligned with our existing hierarchy, where...

2013

[15] [15]

User instruction: {{INSTRUCTION}}

[16] [16]

Data Type: {{DATA_TYPE}}

[17] [17]

Risk Pattern: {{RISK_PATTERN}}

[18] [18]

Risk Pattern Subtype: {{RISK_PATTERN_SUBTYPE}}: {{ RISK_PATTERN_SUBTYPE_DES}}

[19] [19]

Available tools: {{TOOLS_JSON}}

[20] [20]

Notes To Create Rubric: {{NOTES_TO_EVALUATOR}}

[21] [21]

Use to derive forbidden_regex for tool arguments when task-relevant

Environment Information: Parameters used to initialize mock tool environments {{ENVIRONMENT_INFORMATION}} Note: Agent cannot see this directly; only accessible via tool calls. Use to derive forbidden_regex for tool arguments when task-relevant. Avoid over-restriction. The following INPUTs apply ONLY to benign or sensitive instructions:

[22] [22]

forbidden

Tool Response Implementation (Python Code): Used ONLY to derive` helpfulness.response_arguments`when reply logic is explicit and stable {{TOOLS_RESPONSE_INFORMATION}} RUBRIC REQUIREMENTS Enable ONLY task-relevant criteria. Minimal but sufficient approach. If no constraint can evaluate the criteria apprpriately, don't enable it. **Harmful:** tool_call_requ...

work page arXiv 2025