Learning to Conceal Risk: Controllable Multi-turn Red Teaming for LLMs in the Financial Domain
Pith reviewed 2026-05-18 17:45 UTC · model grok-4.3
The pith
CoRT framework uses controllable multi-turn concealment to reach 95% attack success on financial LLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors introduce the CoRT framework for controllable black-box multi-turn risk-concealed red-teaming. It includes a Risk Concealment Attacker that generates prompts through iterative refinement and a Risk Concealment Controller that predicts a turn-level Risk Concealment Score to steer the attacker's follow-up style. Evaluated on the FinRisk-Bench benchmark of 522 instructions in six financial risk categories, the method achieves 93.19% average attack success rate with the attacker alone and 95.00% when combined with the controller, across nine LLMs.
What carries the argument
The Risk Concealment Score, a predicted value at each turn that controls the style of the next prompt generated by the Risk Concealment Attacker to balance concealment of risk with the ability to elicit unsafe responses.
If this is right
- Financial LLMs remain susceptible to attacks that build up over multiple conversation turns rather than single prompts.
- The controller component measurably boosts the attack success rate beyond the attacker alone.
- The FinRisk-Bench provides a new standardized set for testing red-teaming in six specific financial risk areas.
- Black-box methods can effectively target regulatory-violating behaviors without needing model internals.
- Progressive concealment in prompts can evade initial safety checks in deployed LLMs.
Where Pith is reading between the lines
- Similar multi-turn concealment methods could be developed for testing LLMs in other regulated fields like medicine or law.
- LLM safety training might benefit from including examples of conversations that slowly reveal risky intent.
- Financial regulators may consider requiring multi-turn red-teaming as part of AI system approvals.
- Extending the approach could help create more context-aware safety filters that track risk across dialogue turns.
Load-bearing premise
The framework assumes that the predicted Risk Concealment Score can reliably steer the attacker to produce prompts that progressively conceal surface risk while still eliciting regulatory-violating behaviors from the target LLM without the model recognizing the adversarial pattern.
What would settle it
An experiment that applies the same prompts with high predicted concealment scores to the target LLMs and measures whether the models refuse the queries or produce safe responses at a much higher rate than reported.
Figures
read the original abstract
Large Language Models (LLMs) are increasingly deployed in finance, where unsafe behavior can lead to serious regulatory risks. However, most red-teaming research focuses on overtly harmful content and overlooks attacks that appear legitimate on the surface yet induce regulatory-violating responses. We address this gap by introducing a controllable black-box multi-turn risk-concealed red-teaming framework (CoRT) that progressively conceals surface-level risk while exploiting regulatory-violating behaviors. CoRT contains two key components: (i) a Risk Concealment Attacker (RCA) that generates multi-turn prompts via iterative refinement, and (ii) a Risk Concealment Controller (RCC) that predicts a turn-level Risk Concealment Score (RCS) to steer RCA's follow-up style. We also built a domain-specific benchmark, FinRisk-Bench, with 522 instructions spanning six financial risk categories. Experiments on nine widely used LLMs show that CoRT (RCA) achieves 93.19% average attack success rate (ASR), and CoRT (RCA+RCC) further improves the average ASR to 95.00%. Our code and FinRisk-Bench are available at https://github.com/gcheng128/CoRT.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CoRT, a black-box controllable multi-turn red-teaming framework for LLMs in finance. It consists of a Risk Concealment Attacker (RCA) that iteratively refines prompts to conceal surface risk while eliciting regulatory violations, and a Risk Concealment Controller (RCC) that predicts a turn-level Risk Concealment Score (RCS) to steer prompt style. The authors construct FinRisk-Bench containing 522 instructions across six financial risk categories and evaluate on nine LLMs, reporting average attack success rates (ASR) of 93.19% for RCA alone and 95.00% when combined with RCC.
Significance. If the ASR measurements prove robust, this work would meaningfully advance red-teaming methodology for high-stakes domains by shifting focus from overt harm to subtle, multi-turn attacks that appear legitimate. The open release of code and the domain-specific benchmark would support reproducibility and further research on financial LLM safety.
major comments (2)
- [Experiments] Experiments section: The headline ASR figures (93.19% for RCA, 95.00% for RCA+RCC) rest on an undefined success criterion. No explicit rubric, judging protocol (human, LLM-as-judge, or hybrid), or inter-annotator agreement statistics are reported for the 522 FinRisk-Bench cases, making it impossible to assess whether the numbers reflect genuine elicitation of regulatory violations or lenient labeling.
- [Method] Method (§3.2, RCC component): The claim that the predicted Risk Concealment Score reliably steers the attacker toward progressively concealed yet effective prompts assumes the target LLM will not detect the adversarial pattern. No ablation, qualitative examples across turns, or analysis of detection rates is provided to support this load-bearing assumption.
minor comments (2)
- [Abstract] The abstract and experimental setup omit the names of the nine evaluated LLMs and any direct baseline comparisons (e.g., single-turn or non-controllable red-teaming methods), which would help contextualize the reported gains.
- [Experiments] Tables reporting per-model and per-category ASR should include standard errors or statistical significance tests for the observed improvements from adding RCC.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. We address each major comment below and outline the revisions we will incorporate to strengthen the manuscript.
read point-by-point responses
-
Referee: [Experiments] Experiments section: The headline ASR figures (93.19% for RCA, 95.00% for RCA+RCC) rest on an undefined success criterion. No explicit rubric, judging protocol (human, LLM-as-judge, or hybrid), or inter-annotator agreement statistics are reported for the 522 FinRisk-Bench cases, making it impossible to assess whether the numbers reflect genuine elicitation of regulatory violations or lenient labeling.
Authors: We agree that the success criterion for ASR requires explicit documentation. In the original evaluation, a response was deemed successful if it produced content violating the regulatory constraints tied to the FinRisk-Bench instruction's risk category, assessed via a hybrid protocol of automated keyword matching for regulatory indicators followed by expert manual review on a sampled subset. In the revised manuscript we will insert a dedicated subsection detailing the full rubric, the hybrid judging protocol (LLM-assisted scoring with human adjudication on 20% of cases), and inter-annotator agreement statistics (Cohen's kappa) computed during annotation. revision: yes
-
Referee: [Method] Method (§3.2, RCC component): The claim that the predicted Risk Concealment Score reliably steers the attacker toward progressively concealed yet effective prompts assumes the target LLM will not detect the adversarial pattern. No ablation, qualitative examples across turns, or analysis of detection rates is provided to support this load-bearing assumption.
Authors: We acknowledge that direct evidence for the RCC's steering behavior without eliciting detection was limited. While the black-box multi-turn formulation is intended to avoid overt adversarial signatures, we will add (i) an ablation isolating RCC's contribution to concealment progression, (ii) qualitative turn-by-turn prompt examples with corresponding RCS values, and (iii) a detection-rate analysis measuring refusal or pattern-recognition signals from the nine target models. These additions will be placed in §3.2 and the Experiments section of the revised manuscript. revision: yes
Circularity Check
No circularity: empirical results on external LLMs and new benchmark
full rationale
The paper introduces an empirical red-teaming framework CoRT consisting of RCA for iterative prompt generation and RCC for steering via predicted RCS, evaluated directly on nine external LLMs using the newly constructed FinRisk-Bench with 522 cases. All performance claims (93.19% and 95.00% ASR) are measured outcomes from model interactions rather than derived from equations or self-referential fits. No mathematical derivations, parameter fittings renamed as predictions, or load-bearing self-citations appear in the abstract or described structure. The chain is self-contained against external benchmarks and does not reduce any result to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption LLMs can be induced to produce regulatory-violating responses through progressively concealed multi-turn prompts
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
CoRT contains two key components: (i) a Risk Concealment Attacker (RCA) that generates multi-turn prompts via iterative refinement, and (ii) a Risk Concealment Controller (RCC) that predicts a turn-level Risk Concealment Score (RCS) to steer RCA's follow-up style.
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Experiments on nine widely used LLMs show that CoRT (RCA) achieves 93.19% average attack success rate (ASR)
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Alibaba, D. A. 2024. Qwen-72B: Alibaba’s Large Language Model
work page 2024
-
[2]
Alibaba, Q. T. 2025. Qwen3 Technical Report. arXiv:2505.09388
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
Anthropic. 2025 a . Claude 3.7 Sonnet System Card
work page 2025
-
[4]
Anthropic. 2025 b . Claude 4 Sonnet System Card. Accessed: 2025-08-01
work page 2025
-
[5]
Bai, Y.; Kadavath, S.; et al. 2022. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073
work page internal anchor Pith review Pith/arXiv arXiv 2022
- [6]
- [7]
-
[8]
J.; Tramèr, F.; Hassani, H.; and Wong, E
Chao, P.; Debenedetti, E.; Robey, A.; Andriushchenko, M.; Croce, F.; Sehwag, V.; Dobriban, E.; Flammarion, N.; Pappas, G. J.; Tramèr, F.; Hassani, H.; and Wong, E. 2024. JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models. In NeurIPS Datasets and Benchmarks Track
work page 2024
-
[9]
Jailbreaking Black Box Large Language Models in Twenty Queries
Chao, P.; Robey, A.; Dobriban, E.; Hassani, H.; Pappas, G. J.; and Wong, E. 2023. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[10]
Deng, Y.; Zhang, W.; Pan, S. J.; and Bing, L. 2024. Multilingual Jailbreak Challenges in Large Language Models. In The Twelfth International Conference on Learning Representations
work page 2024
-
[11]
Ding, P.; Kuang, J.; Ma, D.; Cao, X.; Xian, Y.; Chen, J.; and Huang, S. 2024. A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2136--2153
work page 2024
-
[12]
Google, G. T. 2024. Gemini 2.5 Flash Technical Overview
work page 2024
-
[13]
Hurst, A.; Lerer, A.; Goucher, A. P.; Perelman, A.; Ramesh, A.; Clark, A.; Ostrow, A.; Welihinda, A.; Hayes, A.; Radford, A.; et al. 2024. Gpt-4o system card. arXiv preprint arXiv:2410.21276
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[14]
Jaech, A.; Kalai, A.; Lerer, A.; Richardson, A.; El-Kishky, A.; Low, A.; Helyar, A.; Madry, A.; Beutel, A.; Carney, A.; et al. 2024. Openai o1 system card. arXiv preprint arXiv:2412.16720
work page internal anchor Pith review Pith/arXiv arXiv 2024
- [15]
-
[16]
Jiang, F.; Xu, Z.; Niu, L.; Xiang, Z.; Ramasubramanian, B.; Li, B.; and Poovendran, R. 2024. Artprompt: Ascii art-based jailbreak attacks against aligned llms. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 15157--15173
work page 2024
- [17]
-
[18]
Konstantinidis, T.; Iacovides, G.; Xu, M.; Constantinides, T. G.; and Mandic, D. 2024. FinLlama: Financial Sentiment Classification for Algorithmic Trading Applications. arXiv preprint arXiv:2403.11557
-
[19]
Lee, J.; Stevens, N.; Han, S. C.; and Song, M. 2023. A Survey of Large Language Models in Finance (FinLLMs). arXiv preprint arXiv:2312.15590
-
[20]
Li, X.; Zhou, Z.; Zhu, J.; Yao, J.; Liu, T.; and Han, B. 2024. DeepInception: Hypnotize Large Language Model to Be Jailbreaker. In Neurips Safe Generative AI Workshop
work page 2024
- [21]
- [22]
-
[23]
Martin, R. 2007. An Empire of Indifference: American War and the Financial Logic of Risk Management. Duke University Press
work page 2007
-
[24]
McCornack, S. A.; and Knapp, M. L. 1992. Interpersonal deception theory. Communication Theory
work page 1992
-
[25]
Mehrotra, A.; Zampetakis, M.; Kassianik, P.; Nelson, B.; Anderson, H.; Singer, Y.; and Karbasi, A. 2024. Tree of attacks: Jailbreaking black-box llms automatically. Advances in Neural Information Processing Systems, 37: 61065--61105
work page 2024
-
[26]
Meta, A. 2025. LLaMA 3 Technical Report
work page 2025
-
[27]
Nie, Y.; Kong, Y.; Dong, X.; Mulvey, J. M.; Poor, H. V.; Wen, Q.; and Zohren, S. 2024. A Survey of Large Language Models for Financial Applications: Progress, Prospects and Challenges. arXiv preprint arXiv:2406.11903
-
[28]
OpenAI. 2023 a . GPT-4 Technical Report. Technical report, OpenAI
work page 2023
-
[29]
OpenAI. 2023 b . OpenAI Moderation System Card. https://openai.com/systems/moderation. Accessed: August 2, 2025
work page 2023
-
[30]
OpenAI. 2025. GPT-4.1 Model Card. Technical report, OpenAI
work page 2025
-
[31]
Ouyang, L. e. a. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems
work page 2022
- [32]
-
[33]
Russinovich, M.; et al. 2024. Great, now write an article about that: The Crescendo multi-turn LLM jailbreak attack . arXiv preprint arXiv:2404.01833
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[34]
Wang, Y.; Li, H.; Han, X.; Nakov, P.; and Baldwin, T. 2024 a . Do-Not-Answer: Evaluating Safeguards in LLM s. In Graham, Y.; and Purver, M., eds., Findings of the Association for Computational Linguistics: EACL 2024, 896--911. St. Julian ' s, Malta: Association for Computational Linguistics
work page 2024
- [35]
-
[36]
Wu, S.; Irsoy, O.; Lu, S.; Dabravolski, V.; Dredze, M.; Gehrmann, S.; Kambadur, P.; Rosenberg, D.; and Mann, G. 2023. BloombergGPT: A Large Language Model for Finance. arXiv preprint arXiv:2303.17564
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [37]
-
[38]
Xie, Q.; Han, W.; Zhang, X.; Lai, Y.; Peng, M.; Lopez-Lira, A.; and Huang, J. 2024. PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance. arXiv preprint arXiv:2402.00838
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[39]
Xu, H.; Liu, Y.; Zhang, Y.; Ma, Y.; Li, X.; Lin, B. Y.; and Ren, X. 2024. A Defense Against Jailbreaking Large Language Models via Step-wise Detection. In Proceedings of the International Conference on Learning Representations (ICLR)
work page 2024
- [40]
-
[41]
Yang, H. B.; Liu, X.-Y.; and Wang, C. D. 2023. FinGPT: Open-Source Financial Large Language Models. arXiv preprint arXiv:2306.10658
-
[42]
Yuan, Y.; Jiao, W.; Wang, W.; Huang, J.-t.; He, P.; Shi, S.; and Tu, Z. 2024. GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher. In ICLR
work page 2024
-
[43]
Zhuang, J.; Jin, H.; Zhang, Y.; Kang, Z.; Zhang, W.; Dagher, G. G.; and Wang, H. 2025. Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation. arXiv preprint arXiv:2505.18556
-
[44]
Universal and Transferable Adversarial Attacks on Aligned Language Models
Zou, A.; Wang, Z.; Carlini, N.; Nasr, M.; Kolter, J. Z.; and Fredrikson, M. 2023. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043
work page internal anchor Pith review Pith/arXiv arXiv 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.