COMPASS: Cognitive MCTS-Guided Process Alignment for Safe Search Agents

Chaochao Chen; Haozhe He; Jiahe Xu; Jiaming Qian; Pengyang Zhou; Wenkai Shen; Xiaolin Zheng; Zhihao Huang

arxiv: 2605.30838 · v1 · pith:EIWVCH46new · submitted 2026-05-29 · 💻 cs.AI

Wenkai Shen, Pengyang Zhou, Jiahe Xu, Jiaming Qian, Haozhe He, Zhihao Huang, Chaochao Chen, Xiaolin Zheng

Pith reviewed 2026-06-28 22:23 UTC · model grok-4.3

keywords: COMPASS · cognitive tree exploration · process alignment · safe search agents · LLM safety · multi-step reasoning · safety-utility trade-off

The pith

COMPASS aligns LLM search agents for safety by synthesizing attack trajectories via cognitive trees and supervising risky steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces COMPASS to address safety degradation in LLM-powered search agents, where a harmful intent can be decomposed into a sequence of innocent-looking queries. It combines cognitive tree exploration, which generates diverse violation paths, with introspective step-wise alignment, which supervises individual actions in the workflow. Existing methods miss these sparse multi-step signals, so COMPASS aims for better coverage while preserving general utility. A reader would care because it targets a practical gap in deploying tool-using agents without heavy data costs.

Core claim

COMPASS integrates cognitive tree exploration (CTE) to efficiently synthesize stealthy attack trajectories and introspective step-wise alignment (ISA) to isolate risky intermediate actions for fine-grained process supervision, achieving a favorable safety-utility trade-off while requiring substantially less training data.

What carries the argument

Cognitive tree exploration (CTE) paired with introspective step-wise alignment (ISA) inside the COMPASS framework, which generates attack examples through tree search and applies process-level supervision to agent steps.
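
As a rough illustration of the tree-search side of CTE, the sketch below runs a generic MCTS loop with standard UCT selection over partial trajectories. It is not the paper's algorithm: this page does not say how the cognitive Q-value is computed, so `evaluate` is a caller-supplied stand-in, and `Node`, `search`, `expand`, and `best_trajectory` are hypothetical names chosen for this example.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence


@dataclass(eq=False)
class Node:
    """A partial trajectory: the steps taken so far (hypothetical structure)."""
    steps: List[str]
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    total_value: float = 0.0

    def q(self) -> float:
        # Mean of the values backed up through this node.
        return self.total_value / self.visits if self.visits else 0.0


def uct(child: Node, parent_visits: int, c: float = 1.4) -> float:
    """Standard UCT: mean value plus an exploration bonus."""
    if child.visits == 0:
        return math.inf  # always try unvisited children first
    return child.q() + c * math.sqrt(math.log(parent_visits) / child.visits)


def search(
    expand: Callable[[Sequence[str]], List[str]],
    evaluate: Callable[[Sequence[str]], float],
    max_depth: int = 3,
    iterations: int = 100,
) -> Node:
    root = Node(steps=[])
    for _ in range(iterations):
        # Selection: follow the highest-UCT child down to a leaf.
        node = root
        while node.children:
            parent = node
            node = max(parent.children, key=lambda ch: uct(ch, parent.visits))
        # Expansion: add candidate next steps unless the depth limit is hit.
        if len(node.steps) < max_depth:
            node.children = [
                Node(steps=node.steps + [s], parent=node) for s in expand(node.steps)
            ]
            if node.children:
                node = node.children[0]
        # Evaluation: stand-in for whatever value estimate the method uses.
        value = evaluate(node.steps)
        # Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.total_value += value
            node = node.parent
    return root


def best_trajectory(root: Node) -> List[str]:
    """Read out a trajectory by following the most-visited child at each level."""
    node = root
    while node.children:
        node = max(node.children, key=lambda ch: ch.visits)
    return node.steps


if __name__ == "__main__":
    # Toy problem with neutral tokens: reward the fraction of steps equal to "b".
    tree = search(
        expand=lambda steps: ["a", "b"],
        evaluate=lambda steps: steps.count("b") / len(steps) if steps else 0.0,
    )
    print(best_trajectory(tree), tree.visits)
```

In a real pipeline `expand` and `evaluate` would presumably be model calls; keeping them as plain callables makes the search loop testable on its own.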

If this is right

Search agents aligned with COMPASS reduce retrieval-induced safety degradation across multi-step interactions.
The approach delivers a better safety-utility balance than prior alignment techniques.
Effective alignment occurs with substantially less training data than existing methods.
Process supervision isolates and corrects risky intermediate actions rather than only final outcomes.
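
One way to picture step-level supervision, as opposed to outcome-only supervision, is to locate the first risky action in a trajectory and pair it with a safe replacement. The sketch below does that under loud assumptions: the page does not state the training objective ISA uses, so the preference-pair format, the `is_unsafe` judge, and the `propose_safe` rewriter are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class PreferencePair:
    """Step-level training example (hypothetical format)."""
    context: List[str]  # actions taken before the risky step
    rejected: str       # the risky action the agent attempted
    chosen: str         # a safe alternative for the same step


def first_risky_index(
    actions: Sequence[str], is_unsafe: Callable[[str], bool]
) -> Optional[int]:
    """Return the index of the first action the judge flags, or None."""
    for i, action in enumerate(actions):
        if is_unsafe(action):
            return i
    return None


def build_pair(
    actions: Sequence[str],
    is_unsafe: Callable[[str], bool],
    propose_safe: Callable[[List[str], str], str],
) -> Optional[PreferencePair]:
    """Isolate the first risky step and pair it with a safe replacement."""
    idx = first_risky_index(actions, is_unsafe)
    if idx is None:
        return None  # nothing to supervise in this trajectory
    context = list(actions[:idx])
    rejected = actions[idx]
    return PreferencePair(context, rejected, propose_safe(context, rejected))


if __name__ == "__main__":
    # Neutral placeholder actions; the "UNSAFE:" prefix stands in for a judge verdict.
    trajectory = ["q1", "q2", "UNSAFE:q3", "q4"]
    pair = build_pair(
        trajectory,
        is_unsafe=lambda a: a.startswith("UNSAFE:"),
        propose_safe=lambda ctx, bad: "SAFE_ALTERNATIVE",
    )
    print(pair)
```

Supervising only the isolated step leaves the benign prefix untouched, which is one plausible reason a process-level signal could cost less utility than penalizing whole trajectories.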

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If CTE trajectories prove representative, the same synthesis technique could transfer to safety alignment for other multi-step agents such as planners or tool chains.
The emphasis on process-level signals suggests data efficiency gains may generalize beyond search to other decomposed-intent settings.
Deployment would benefit from ongoing checks that synthesized trajectories continue to track evolving real-world violation distributions.

Load-bearing premise

Cognitive tree exploration can efficiently create attack trajectories that match the distribution of real safety violations arising in multi-step agent interactions.

What would settle it

An experiment that compares CTE-generated trajectories against a large set of observed safety violations from real users interacting with search agents and finds poor coverage or mismatch in violation patterns.
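
A minimal sketch of how such a comparison might be scored, assuming each trajectory has already been assigned a violation-pattern label by some annotation process (an assumption; the page describes no such labeling). It reports coverage of observed patterns by the synthetic set and the total variation distance between the two label distributions. The function names are illustrative.

```python
from collections import Counter
from typing import Dict, Iterable


def distribution(labels: Iterable[str]) -> Dict[str, float]:
    """Empirical distribution over pattern labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}


def coverage(synthetic: Iterable[str], observed: Iterable[str]) -> float:
    """Fraction of observed violations whose pattern also occurs in the synthetic set."""
    seen = set(synthetic)
    observed = list(observed)
    if not observed:
        return 0.0
    return sum(1 for label in observed if label in seen) / len(observed)


def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Total variation distance: 0 for identical distributions, 1 for disjoint ones."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


if __name__ == "__main__":
    # Placeholder pattern labels, not real data.
    synthetic = ["A", "A", "B", "C"]
    observed = ["A", "B", "B", "D"]
    print(coverage(synthetic, observed))
    print(total_variation(distribution(synthetic), distribution(observed)))
```

Low coverage or a large distance on real logs would be the kind of mismatch this section describes; the thresholds that count as "poor" would have to be set by the experimenter.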

Figures

Figures reproduced from arXiv: 2605.30838 by Chaochao Chen, Haozhe He, Jiahe Xu, Jiaming Qian, Pengyang Zhou, Wenkai Shen, Xiaolin Zheng, Zhihao Huang.

**Figure 2.** Framework of COMPASS. (a) Cognitive Tree Exploration (CTE) leverages the cognitive Q-value to…

**Figure 3.** Ablation study of different alignment fine…

**Figure 4.** Performance under varying maximum steps.

**Figure 5.** Harmful Rate of search agents on RRB and…

**Figure 6.** Comparison of execution times across differ…

**Figure 7.** Analysis of evaluator bias in Q-value estima…

**Figure 8.** Harmful Rate (HarmR) of search agents on StrongReject. Performance is evaluated under three distinct scenarios: (i) No Search (zero-retrieval), (ii) Safety Search (all queries safe), and (iii) Harm Search (at least one query judged unsafe).

| Method | RRB H↓ | RRB Hp↑ | StrongR. H↓ | StrongR. Hp↑ | WildTeam H↓ | WildTeam Hp↑ | Avg H↓ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Base | 24.8 | 2.14 | 25.2 | 2.01 | 89.0 | 2.50 | 46.4 |
| Naive RAG | 40.8 | 2.19 | 46.7 | 1.98 | 92.6 | 2.27 | 60.0 |
| Base Agent | 48.9 | 2.15 | 49.2 | 1… | | | |

**Figure 9.** An example of the introspection process.

**Figure 10.** Comparison between base agent and COMPASS workflow on one red-teaming question example.

**Figure 11.** An example of terminal reward judge in Cognitive Tree Exploration (CTE) phase.

**Figure 12.** An example of Q-value estimation case in Cognitive Tree Exploration (CTE) phase.

**Figure 13.** The prompt template used for response safety judge and helpfulness score.

**Figure 14.** The prompt template used for each action transition.

**Figure 15.** The prompt template used for Introspective State Analysis (ISA).

**Figure 16.** The prompt template used for Q-value estimation in the CTE module.

**Figure 17.** The prompt template used for attack success reward.

**Figure 18.** The prompt template used for query safety evaluation.
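
The three scenarios named in the Figure 8 caption can be read as a simple bucketing rule over per-query safety verdicts. The sketch below applies that rule and computes a harmful rate per bucket. It assumes Harmful Rate means the percentage of responses judged harmful, which this page does not define, and `Episode` and its fields are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Episode:
    """One agent run (hypothetical record layout)."""
    query_unsafe: List[bool]  # one judge verdict per search query issued
    response_harmful: bool    # judge verdict on the final response


def scenario(ep: Episode) -> str:
    """Bucket an episode using the three scenarios from the caption."""
    if not ep.query_unsafe:
        return "no_search"      # zero retrieval
    if any(ep.query_unsafe):
        return "harm_search"    # at least one query judged unsafe
    return "safety_search"      # all queries judged safe


def harmful_rate_by_scenario(episodes: Iterable[Episode]) -> Dict[str, float]:
    """Percentage of harmful responses within each scenario (assumed definition)."""
    totals: Dict[str, int] = {}
    harmful: Dict[str, int] = {}
    for ep in episodes:
        key = scenario(ep)
        totals[key] = totals.get(key, 0) + 1
        harmful[key] = harmful.get(key, 0) + int(ep.response_harmful)
    return {key: 100.0 * harmful[key] / totals[key] for key in totals}


if __name__ == "__main__":
    # Placeholder verdicts, not results from the paper.
    runs = [
        Episode([], False),
        Episode([], True),
        Episode([False, False], False),
        Episode([False, True], True),
    ]
    print(harmful_rate_by_scenario(runs))
```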

Original abstract

LLM-powered search agents enable multi-step reasoning and tool use. However, these capabilities introduce retrieval-induced safety degradation, as harmful intents may decompose into seemingly innocuous sub-queries that lead to unsafe outcomes. Existing alignment methods struggle to capture sparse safety signals and fail to supervise diverse violations across multi-step interactions. We propose COMPASS, a Cognitive MCTS-Guided Process Alignment framework designed to achieve robust safety alignment throughout the agent workflow while preserving general utility. COMPASS integrates cognitive tree exploration (CTE) to efficiently synthesize stealthy attack trajectories, and introspective step-wise alignment (ISA) to isolate risky intermediate actions for fine-grained process supervision. Empirical results show that COMPASS achieves a favorable safety-utility trade-off while requiring substantially less training data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

COMPASS sketches a plausible MCTS-plus-process-alignment approach to search-agent safety, but the abstract supplies no numbers, so the safety-utility and data-efficiency claims cannot be assessed from it.

Hi,

The one thing to know is that this paper proposes using cognitive tree exploration to generate synthetic attack trajectories and then introspective step-wise alignment to supervise risky intermediate steps, claiming better safety without much utility loss and with less data. That framing targets a real gap in multi-step tool agents, where harmful goals can hide in innocuous sub-queries.

What is actually new is the explicit combination of MCTS-style search to synthesize stealthy trajectories with fine-grained process supervision rather than outcome-only signals. The abstract does a clean job naming the retrieval-induced degradation problem and why standard alignment methods fall short on sparse, distributed violations.

The soft spots are straightforward and central. No baselines, no quantitative trade-off numbers, no details on how much less data is used, and no checks on whether the generated trajectories match real violation distributions. The stress-test concern holds: if the synthetic trajectories miss the actual manifold, the reported gains won't transfer. The paper also gives no evidence on out-of-distribution behavior or human validation of the trajectories.

This is for people working on practical safety for tool-using agents. A reader already thinking about process supervision or tree search in alignment could pick up the high-level architecture, but anyone needing reproducible results will find it thin.

I would send it to peer review if the full manuscript contains the missing experiments and ablations; the problem is worth referee time even if the current version needs substantial strengthening on the empirical side.

Best,

Referee Report

2 major / 0 minor

Summary. The manuscript proposes COMPASS, a Cognitive MCTS-Guided Process Alignment framework for LLM search agents. It integrates cognitive tree exploration (CTE) to synthesize stealthy attack trajectories and introspective step-wise alignment (ISA) for fine-grained supervision of risky intermediate actions, with the central claim that this yields a favorable safety-utility trade-off while requiring substantially less training data than prior methods.

Significance. If the empirical results hold and CTE trajectories prove representative of real multi-step violation distributions, the framework could advance data-efficient process-level safety alignment for retrieval-augmented agents that decompose harmful intents into innocuous sub-queries.

major comments (2)

[Abstract] The central claim of a favorable safety-utility trade-off and substantially lower data requirements cannot be assessed because the abstract supplies no quantitative results, baselines, evaluation protocols, or metrics.
[Abstract] The claim that CTE efficiently synthesizes stealthy attack trajectories that capture real safety-violation distributions is load-bearing for the reported gains, yet no supporting evidence (overlap statistics, human validation of trajectory realism, or OOD tests) is mentioned.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these comments on the abstract. We agree that the abstract can be strengthened to better support the central claims and will revise it in the next version of the manuscript.

Referee: [Abstract] The central claim of a favorable safety-utility trade-off and substantially lower data requirements cannot be assessed because the abstract supplies no quantitative results, baselines, evaluation protocols, or metrics.

Authors: We agree that the abstract would benefit from including quantitative results to allow assessment of the claims. In the revised manuscript we will add specific metrics from our experiments (e.g., safety and utility scores, baseline comparisons, and training data volumes) along with a brief description of the evaluation protocols.

Revision: yes

Referee: [Abstract] The claim that CTE efficiently synthesizes stealthy attack trajectories that capture real safety-violation distributions is load-bearing for the reported gains, yet no supporting evidence (overlap statistics, human validation of trajectory realism, or OOD tests) is mentioned.

Authors: The main text presents empirical results supporting the utility of CTE trajectories. We acknowledge that the abstract does not reference this evidence. We will revise the abstract to briefly indicate that the trajectory synthesis claims are validated through the reported experiments.

Revision: yes

Circularity Check

0 steps flagged

No circularity: framework description contains no self-referential derivations or fitted predictions

The paper presents COMPASS as an empirical framework combining CTE for trajectory synthesis and ISA for process supervision, with claims resting on reported experimental outcomes rather than any closed mathematical derivation. No equations, parameter fits, or predictions are described that reduce to their own inputs by construction. The abstract and available text supply no self-definitional loops, uniqueness theorems imported from self-citations, or ansatzes smuggled via prior work. The central safety-utility claim is positioned as an empirical result, not a deductive necessity, leaving the derivation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be extracted or audited.

