arxiv: 2509.10546 · v2 · submitted 2025-09-07 · 💻 cs.CL · cs.AI· cs.LG

Learning to Conceal Risk: Controllable Multi-turn Red Teaming for LLMs in the Financial Domain

Gang Cheng , Haibo Jin , Wenbin Zhang , Haohan Wang , Jun Zhuang This is my paper

Pith reviewed 2026-05-18 17:45 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LG

keywords red-teaminglarge language modelsfinancial domainmulti-turn attacksrisk concealmentattack success rateFinRisk-BenchLLM safety

0 comments

The pith

CoRT framework uses controllable multi-turn concealment to reach 95% attack success on financial LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a red-teaming method for large language models in finance that creates conversations appearing safe on the surface but that lead the models to give answers violating financial regulations. It does this through a process where an attacker component refines prompts over multiple turns and a controller component predicts a score for how well the risk is hidden, using that score to adjust the next prompt. This fills a gap left by red-teaming techniques that only test for obvious harmful content. A reader would care because financial applications of LLMs carry real regulatory consequences if the models produce bad advice, and current tests may not catch these hidden risks.

Core claim

The authors introduce the CoRT framework for controllable black-box multi-turn risk-concealed red-teaming. It includes a Risk Concealment Attacker that generates prompts through iterative refinement and a Risk Concealment Controller that predicts a turn-level Risk Concealment Score to steer the attacker's follow-up style. Evaluated on the FinRisk-Bench benchmark of 522 instructions in six financial risk categories, the method achieves 93.19% average attack success rate with the attacker alone and 95.00% when combined with the controller, across nine LLMs.

What carries the argument

The Risk Concealment Score, a predicted value at each turn that controls the style of the next prompt generated by the Risk Concealment Attacker to balance concealment of risk with the ability to elicit unsafe responses.

If this is right

Financial LLMs remain susceptible to attacks that build up over multiple conversation turns rather than single prompts.
The controller component measurably boosts the attack success rate beyond the attacker alone.
The FinRisk-Bench provides a new standardized set for testing red-teaming in six specific financial risk areas.
Black-box methods can effectively target regulatory-violating behaviors without needing model internals.
Progressive concealment in prompts can evade initial safety checks in deployed LLMs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar multi-turn concealment methods could be developed for testing LLMs in other regulated fields like medicine or law.
LLM safety training might benefit from including examples of conversations that slowly reveal risky intent.
Financial regulators may consider requiring multi-turn red-teaming as part of AI system approvals.
Extending the approach could help create more context-aware safety filters that track risk across dialogue turns.

Load-bearing premise

The framework assumes that the predicted Risk Concealment Score can reliably steer the attacker to produce prompts that progressively conceal surface risk while still eliciting regulatory-violating behaviors from the target LLM without the model recognizing the adversarial pattern.

What would settle it

An experiment that applies the same prompts with high predicted concealment scores to the target LLMs and measures whether the models refuse the queries or produce safe responses at a much higher rate than reported.

Figures

Figures reproduced from arXiv: 2509.10546 by Gang Cheng, Haibo Jin, Haohan Wang, Jun Zhuang, Wenbin Zhang.

**Figure 2.** Figure 2: Overview of our framework, which consists of two phases: In phase 1, we construct an initial prompt using a structured [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Breakdown of attack success by categories: Finan [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Risk level distribution and ASR across RCA refine [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Distribution of data source in FIN-Bench. The [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗

**Figure 6.** Figure 6: Proportion of data sources for each financial behav [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗

read the original abstract

Large Language Models (LLMs) are increasingly deployed in finance, where unsafe behavior can lead to serious regulatory risks. However, most red-teaming research focuses on overtly harmful content and overlooks attacks that appear legitimate on the surface yet induce regulatory-violating responses. We address this gap by introducing a controllable black-box multi-turn risk-concealed red-teaming framework (CoRT) that progressively conceals surface-level risk while exploiting regulatory-violating behaviors. CoRT contains two key components: (i) a Risk Concealment Attacker (RCA) that generates multi-turn prompts via iterative refinement, and (ii) a Risk Concealment Controller (RCC) that predicts a turn-level Risk Concealment Score (RCS) to steer RCA's follow-up style. We also built a domain-specific benchmark, FinRisk-Bench, with 522 instructions spanning six financial risk categories. Experiments on nine widely used LLMs show that CoRT (RCA) achieves 93.19% average attack success rate (ASR), and CoRT (RCA+RCC) further improves the average ASR to 95.00%. Our code and FinRisk-Bench are available at https://github.com/gcheng128/CoRT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CoRT adds a controllable multi-turn angle to red-teaming for hidden financial regulatory risks, with a released benchmark and high reported ASRs, but the success judgments need clearer, reproducible criteria.

read the letter

This paper gives a practical way to test LLMs for multi-turn attacks that hide regulatory risks in finance. The CoRT setup uses iterative refinement to build prompts that start safe and gradually steer toward violations, plus a controller that scores concealment at each turn to guide the next prompt style. They also built FinRisk-Bench with 522 cases across six risk categories and tested nine models, reaching 93% average attack success with the basic attacker and 95% when the controller is added. Releasing the code and dataset is a real help for anyone who wants to run or extend these tests themselves.

Referee Report

2 major / 2 minor

Summary. The paper introduces CoRT, a black-box controllable multi-turn red-teaming framework for LLMs in finance. It consists of a Risk Concealment Attacker (RCA) that iteratively refines prompts to conceal surface risk while eliciting regulatory violations, and a Risk Concealment Controller (RCC) that predicts a turn-level Risk Concealment Score (RCS) to steer prompt style. The authors construct FinRisk-Bench containing 522 instructions across six financial risk categories and evaluate on nine LLMs, reporting average attack success rates (ASR) of 93.19% for RCA alone and 95.00% when combined with RCC.

Significance. If the ASR measurements prove robust, this work would meaningfully advance red-teaming methodology for high-stakes domains by shifting focus from overt harm to subtle, multi-turn attacks that appear legitimate. The open release of code and the domain-specific benchmark would support reproducibility and further research on financial LLM safety.

major comments (2)

[Experiments] Experiments section: The headline ASR figures (93.19% for RCA, 95.00% for RCA+RCC) rest on an undefined success criterion. No explicit rubric, judging protocol (human, LLM-as-judge, or hybrid), or inter-annotator agreement statistics are reported for the 522 FinRisk-Bench cases, making it impossible to assess whether the numbers reflect genuine elicitation of regulatory violations or lenient labeling.
[Method] Method (§3.2, RCC component): The claim that the predicted Risk Concealment Score reliably steers the attacker toward progressively concealed yet effective prompts assumes the target LLM will not detect the adversarial pattern. No ablation, qualitative examples across turns, or analysis of detection rates is provided to support this load-bearing assumption.

minor comments (2)

[Abstract] The abstract and experimental setup omit the names of the nine evaluated LLMs and any direct baseline comparisons (e.g., single-turn or non-controllable red-teaming methods), which would help contextualize the reported gains.
[Experiments] Tables reporting per-model and per-category ASR should include standard errors or statistical significance tests for the observed improvements from adding RCC.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment below and outline the revisions we will incorporate to strengthen the manuscript.

read point-by-point responses

Referee: [Experiments] Experiments section: The headline ASR figures (93.19% for RCA, 95.00% for RCA+RCC) rest on an undefined success criterion. No explicit rubric, judging protocol (human, LLM-as-judge, or hybrid), or inter-annotator agreement statistics are reported for the 522 FinRisk-Bench cases, making it impossible to assess whether the numbers reflect genuine elicitation of regulatory violations or lenient labeling.

Authors: We agree that the success criterion for ASR requires explicit documentation. In the original evaluation, a response was deemed successful if it produced content violating the regulatory constraints tied to the FinRisk-Bench instruction's risk category, assessed via a hybrid protocol of automated keyword matching for regulatory indicators followed by expert manual review on a sampled subset. In the revised manuscript we will insert a dedicated subsection detailing the full rubric, the hybrid judging protocol (LLM-assisted scoring with human adjudication on 20% of cases), and inter-annotator agreement statistics (Cohen's kappa) computed during annotation. revision: yes
Referee: [Method] Method (§3.2, RCC component): The claim that the predicted Risk Concealment Score reliably steers the attacker toward progressively concealed yet effective prompts assumes the target LLM will not detect the adversarial pattern. No ablation, qualitative examples across turns, or analysis of detection rates is provided to support this load-bearing assumption.

Authors: We acknowledge that direct evidence for the RCC's steering behavior without eliciting detection was limited. While the black-box multi-turn formulation is intended to avoid overt adversarial signatures, we will add (i) an ablation isolating RCC's contribution to concealment progression, (ii) qualitative turn-by-turn prompt examples with corresponding RCS values, and (iii) a detection-rate analysis measuring refusal or pattern-recognition signals from the nine target models. These additions will be placed in §3.2 and the Experiments section of the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results on external LLMs and new benchmark

full rationale

The paper introduces an empirical red-teaming framework CoRT consisting of RCA for iterative prompt generation and RCC for steering via predicted RCS, evaluated directly on nine external LLMs using the newly constructed FinRisk-Bench with 522 cases. All performance claims (93.19% and 95.00% ASR) are measured outcomes from model interactions rather than derived from equations or self-referential fits. No mathematical derivations, parameter fittings renamed as predictions, or load-bearing self-citations appear in the abstract or described structure. The chain is self-contained against external benchmarks and does not reduce any result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only view limits visibility into hyperparameters or modeling choices; no explicit free parameters, invented entities, or non-standard axioms are stated. Relies on the domain assumption that LLMs exhibit exploitable regulatory-violating behaviors under concealed multi-turn prompting.

axioms (1)

domain assumption LLMs can be induced to produce regulatory-violating responses through progressively concealed multi-turn prompts
Foundational premise for the red-teaming goal and success metric.

pith-pipeline@v0.9.0 · 5767 in / 1322 out tokens · 59602 ms · 2026-05-18T17:45:00.313308+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

CoRT contains two key components: (i) a Risk Concealment Attacker (RCA) that generates multi-turn prompts via iterative refinement, and (ii) a Risk Concealment Controller (RCC) that predicts a turn-level Risk Concealment Score (RCS) to steer RCA's follow-up style.
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Experiments on nine widely used LLMs show that CoRT (RCA) achieves 93.19% average attack success rate (ASR)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 9 internal anchors

[1]

Alibaba, D. A. 2024. Qwen-72B: Alibaba’s Large Language Model

work page 2024
[2]

Alibaba, Q. T. 2025. Qwen3 Technical Report. arXiv:2505.09388

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Anthropic. 2025 a . Claude 3.7 Sonnet System Card

work page 2025
[4]

Anthropic. 2025 b . Claude 4 Sonnet System Card. Accessed: 2025-08-01

work page 2025
[5]

Bai, Y.; Kadavath, S.; et al. 2022. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073

work page internal anchor Pith review Pith/arXiv arXiv 2022
[6]

Bhardwaj, R.; and Poria, S. 2023. Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment. arXiv:2308.09662

work page arXiv 2023
[7]

Cao, Z.; Li, J.; Ye, J.; Yan, J.; Zhang, W.; and Yu, Y. 2024. Chain of Attack: a Semantic-Driven Contextual Multi-Turn Attacker for LLM. arXiv:2405.19752

work page arXiv 2024
[8]

J.; Tramèr, F.; Hassani, H.; and Wong, E

Chao, P.; Debenedetti, E.; Robey, A.; Andriushchenko, M.; Croce, F.; Sehwag, V.; Dobriban, E.; Flammarion, N.; Pappas, G. J.; Tramèr, F.; Hassani, H.; and Wong, E. 2024. JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models. In NeurIPS Datasets and Benchmarks Track

work page 2024
[9]

Jailbreaking Black Box Large Language Models in Twenty Queries

Chao, P.; Robey, A.; Dobriban, E.; Hassani, H.; Pappas, G. J.; and Wong, E. 2023. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419

work page internal anchor Pith review Pith/arXiv arXiv 2023
[10]

J.; and Bing, L

Deng, Y.; Zhang, W.; Pan, S. J.; and Bing, L. 2024. Multilingual Jailbreak Challenges in Large Language Models. In The Twelfth International Conference on Learning Representations

work page 2024
[11]

Ding, P.; Kuang, J.; Ma, D.; Cao, X.; Xian, Y.; Chen, J.; and Huang, S. 2024. A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2136--2153

work page 2024
[12]

Google, G. T. 2024. Gemini 2.5 Flash Technical Overview

work page 2024
[13]

GPT-4o System Card

Hurst, A.; Lerer, A.; Goucher, A. P.; Perelman, A.; Ramesh, A.; Clark, A.; Ostrow, A.; Welihinda, A.; Hayes, A.; Radford, A.; et al. 2024. Gpt-4o system card. arXiv preprint arXiv:2410.21276

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

Jaech, A.; Kalai, A.; Lerer, A.; Richardson, A.; El-Kishky, A.; Low, A.; Helyar, A.; Madry, A.; Beutel, A.; Carney, A.; et al. 2024. Openai o1 system card. arXiv preprint arXiv:2412.16720

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

Ji, J.; Liu, M.; Dai, J.; Pan, X.; Zhang, C.; Bian, C.; Zhang, C.; Sun, R.; Wang, Y.; and Yang, Y. 2023. BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset. arXiv preprint arXiv:2307.04657

work page arXiv 2023
[16]

Jiang, F.; Xu, Z.; Niu, L.; Xiang, Z.; Ramasubramanian, B.; Li, B.; and Poovendran, R. 2024. Artprompt: Ascii art-based jailbreak attacks against aligned llms. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 15157--15173

work page 2024
[17]

Jin, H.; Hu, L.; Li, X.; Zhang, P.; Chen, C.; Zhuang, J.; and Wang, H. 2024. Jailbreakzoo: Survey, landscapes, and horizons in jailbreaking large language and vision-language models. arXiv preprint arXiv:2407.01599

work page arXiv 2024
[18]

G.; and Mandic, D

Konstantinidis, T.; Iacovides, G.; Xu, M.; Constantinides, T. G.; and Mandic, D. 2024. FinLlama: Financial Sentiment Classification for Algorithmic Trading Applications. arXiv preprint arXiv:2403.11557

work page arXiv 2024
[19]

C.; and Song, M

Lee, J.; Stevens, N.; Han, S. C.; and Song, M. 2023. A Survey of Large Language Models in Finance (FinLLMs). arXiv preprint arXiv:2312.15590

work page arXiv 2023
[20]

Li, X.; Zhou, Z.; Zhu, J.; Yao, J.; Liu, T.; and Han, B. 2024. DeepInception: Hypnotize Large Language Model to Be Jailbreaker. In Neurips Safe Generative AI Workshop

work page 2024
[21]

Liu, Y.; He, X.; Xiong, M.; Fu, J.; Deng, S.; and Hooi, B. 2024. Flipattack: Jailbreak llms via flipping. arXiv preprint arXiv:2410.02832

work page arXiv 2024
[22]

Lv, H.; Wang, X.; Zhang, Y.; Huang, C.; Dou, S.; Ye, J.; Gui, T.; Zhang, Q.; and Huang, X. 2024. CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models. arXiv preprint arXiv:2402.16717

work page arXiv 2024
[23]

Martin, R. 2007. An Empire of Indifference: American War and the Financial Logic of Risk Management. Duke University Press

work page 2007
[24]

A.; and Knapp, M

McCornack, S. A.; and Knapp, M. L. 1992. Interpersonal deception theory. Communication Theory

work page 1992
[25]

Mehrotra, A.; Zampetakis, M.; Kassianik, P.; Nelson, B.; Anderson, H.; Singer, Y.; and Karbasi, A. 2024. Tree of attacks: Jailbreaking black-box llms automatically. Advances in Neural Information Processing Systems, 37: 61065--61105

work page 2024
[26]

Meta, A. 2025. LLaMA 3 Technical Report

work page 2025
[27]

M.; Poor, H

Nie, Y.; Kong, Y.; Dong, X.; Mulvey, J. M.; Poor, H. V.; Wen, Q.; and Zohren, S. 2024. A Survey of Large Language Models for Financial Applications: Progress, Prospects and Challenges. arXiv preprint arXiv:2406.11903

work page arXiv 2024
[28]

OpenAI. 2023 a . GPT-4 Technical Report. Technical report, OpenAI

work page 2023
[29]

OpenAI. 2023 b . OpenAI Moderation System Card. https://openai.com/systems/moderation. Accessed: August 2, 2025

work page 2023
[30]

OpenAI. 2025. GPT-4.1 Model Card. Technical report, OpenAI

work page 2025
[31]

Ouyang, L. e. a. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems

work page 2022
[32]

Radharapu, B.; Robinson, K.; Aroyo, L.; and Lahoti, P. 2023. AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications. arXiv preprint arXiv:2311.08592

work page arXiv 2023
[33]

Russinovich, M.; et al. 2024. Great, now write an article about that: The Crescendo multi-turn LLM jailbreak attack . arXiv preprint arXiv:2404.01833

work page internal anchor Pith review Pith/arXiv arXiv 2024
[34]

Wang, Y.; Li, H.; Han, X.; Nakov, P.; and Baldwin, T. 2024 a . Do-Not-Answer: Evaluating Safeguards in LLM s. In Graham, Y.; and Purver, M., eds., Findings of the Association for Computational Linguistics: EACL 2024, 896--911. St. Julian ' s, Malta: Association for Computational Linguistics

work page 2024
[35]

Wang, Y.; et al. 2024 b . Foot-In-The-Door: A Multi-turn Jailbreak for LLMs . arXiv preprint arXiv:2502.19820

work page arXiv 2024
[36]

Wu, S.; Irsoy, O.; Lu, S.; Dabravolski, V.; Dredze, M.; Gehrmann, S.; Kambadur, P.; Rosenberg, D.; and Mann, G. 2023. BloombergGPT: A Large Language Model for Finance. arXiv preprint arXiv:2303.17564

work page internal anchor Pith review Pith/arXiv arXiv 2023
[37]

Xiao, Y.; Sun, E.; Luo, D.; and Wang, W. 2024. TradingAgents: Multi-agents LLM financial trading framework. arXiv preprint arXiv:2412.20138

work page arXiv 2024
[38]

Xie, Q.; Han, W.; Zhang, X.; Lai, Y.; Peng, M.; Lopez-Lira, A.; and Huang, J. 2024. PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance. arXiv preprint arXiv:2402.00838

work page internal anchor Pith review Pith/arXiv arXiv 2024
[39]

Y.; and Ren, X

Xu, H.; Liu, Y.; Zhang, Y.; Ma, Y.; Li, X.; Lin, B. Y.; and Ren, X. 2024. A Defense Against Jailbreaking Large Language Models via Step-wise Detection. In Proceedings of the International Conference on Learning Representations (ICLR)

work page 2024
[40]

Yadav, A.; Jin, H.; Luo, M.; Zhuang, J.; and Wang, H. 2025. InfoFlood: Jailbreaking Large Language Models with Information Overload. arXiv preprint arXiv:2506.12274

work page arXiv 2025
[41]

B.; Liu, X.-Y.; and Wang, C

Yang, H. B.; Liu, X.-Y.; and Wang, C. D. 2023. FinGPT: Open-Source Financial Large Language Models. arXiv preprint arXiv:2306.10658

work page arXiv 2023
[42]

Yuan, Y.; Jiao, W.; Wang, W.; Huang, J.-t.; He, P.; Shi, S.; and Tu, Z. 2024. GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher. In ICLR

work page 2024
[43]

G.; and Wang, H

Zhuang, J.; Jin, H.; Zhang, Y.; Kang, Z.; Zhang, W.; Dagher, G. G.; and Wang, H. 2025. Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation. arXiv preprint arXiv:2505.18556

work page arXiv 2025
[44]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Zou, A.; Wang, Z.; Carlini, N.; Nasr, M.; Kolter, J. Z.; and Fredrikson, M. 2023. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043

work page internal anchor Pith review Pith/arXiv arXiv 2023