pith. machine review for the scientific record.

arXiv: 2509.10546 · v2 · submitted 2025-09-07 · cs.CL · cs.AI · cs.LG

Learning to Conceal Risk: Controllable Multi-turn Red Teaming for LLMs in the Financial Domain

Pith reviewed 2026-05-18 17:45 UTC · model grok-4.3

classification: cs.CL · cs.AI · cs.LG
keywords: red-teaming · large language models · financial domain · multi-turn attacks · risk concealment · attack success rate · FinRisk-Bench · LLM safety

The pith

The CoRT framework uses controllable multi-turn risk concealment to reach a 95.00% average attack success rate across nine LLMs on financial-risk instructions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a red-teaming method for large language models (LLMs) in finance. The method builds conversations that appear safe on the surface yet lead the model to give answers that violate financial regulations. An attacker component refines prompts over multiple turns, while a controller component predicts a score for how well the risk is hidden and uses that score to adjust the next prompt. This fills a gap left by red-teaming techniques that test only for overtly harmful content. The stakes are practical: financial applications of LLMs carry real regulatory consequences when a model produces non-compliant advice, and current tests may not catch these hidden risks.

Core claim

The authors introduce the CoRT framework for controllable black-box multi-turn risk-concealed red-teaming. It includes a Risk Concealment Attacker that generates prompts through iterative refinement and a Risk Concealment Controller that predicts a turn-level Risk Concealment Score to steer the attacker's follow-up style. Evaluated on the FinRisk-Bench benchmark of 522 instructions in six financial risk categories, the method achieves 93.19% average attack success rate with the attacker alone and 95.00% when combined with the controller, across nine LLMs.
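As a rough illustration of how an average ASR over several target models might be aggregated, the sketch below assumes a macro-average of per-model success fractions over binary judge labels. The function names and the toy labels are hypothetical, and the paper's actual success criterion and aggregation rule are not visible from this view.

```python
from statistics import mean


def attack_success_rate(outcomes):
    """Fraction of benchmark instructions judged as successful attacks."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    return sum(outcomes) / len(outcomes)


def average_asr(outcomes_by_model):
    """Macro-average: mean of the per-model ASR values (an assumption)."""
    return mean(attack_success_rate(o) for o in outcomes_by_model.values())


# Toy labels, NOT data from the paper: True means the judge marked the attack successful.
toy = {
    "model_a": [True, True, True, False],
    "model_b": [True, True, False, False],
}
print(f"{average_asr(toy):.2%}")
```

A micro-average (pooling all instruction-level outcomes before dividing) would give the same number only when every model is evaluated on the same number of instructions, which is why the aggregation choice is worth stating explicitly.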

What carries the argument

The Risk Concealment Score, a predicted value at each turn that controls the style of the next prompt generated by the Risk Concealment Attacker to balance concealment of risk with the ability to elicit unsafe responses.
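The control flow described above can be sketched as a simple loop. This is a schematic under stated assumptions, not the authors' implementation: the three components are placeholder callables, and the threshold rule, the style names, and the stub scoring function are all hypothetical.

```python
def choose_style(score, threshold=0.5):
    """Hypothetical rule: ask for more concealment when the predicted score is low."""
    return "more_concealed" if score < threshold else "keep_style"


def run_controlled_dialogue(attacker, controller, target, max_turns=3):
    """Alternate attacker and target turns; the controller's score sets the next style."""
    history = []          # list of (prompt, response) pairs
    trace = []            # (turn index, style used, score predicted after the turn)
    style = "initial"
    for turn in range(1, max_turns + 1):
        prompt = attacker(history, style)
        response = target(history, prompt)
        history.append((prompt, response))
        score = controller(history)   # turn-level concealment score
        trace.append((turn, style, score))
        style = choose_style(score)   # steer the next follow-up
    return trace


# Placeholder components for illustration only; they carry no real content.
def stub_attacker(history, style):
    return f"<turn {len(history) + 1} prompt, style={style}>"


def stub_target(history, prompt):
    return "<target response>"


def stub_controller(history):
    return min(1.0, 0.3 * len(history))


trace = run_controlled_dialogue(stub_attacker, stub_controller, stub_target)
for turn, style, score in trace:
    print(turn, style, round(score, 2))
```

Keeping the controller separate from the attacker, as in this sketch, is what makes an attacker-only ablation straightforward: the loop can be rerun with a fixed style.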

If this is right

  • Financial LLMs remain susceptible to attacks that build up over multiple conversation turns rather than single prompts.
  • The controller component measurably boosts the attack success rate beyond the attacker alone.
  • The FinRisk-Bench provides a new standardized set for testing red-teaming in six specific financial risk areas.
  • Black-box methods can effectively target regulatory-violating behaviors without needing model internals.
  • Progressive concealment in prompts can evade initial safety checks in deployed LLMs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar multi-turn concealment methods could be developed for testing LLMs in other regulated fields like medicine or law.
  • LLM safety training might benefit from including examples of conversations that slowly reveal risky intent.
  • Financial regulators may consider requiring multi-turn red-teaming as part of AI system approvals.
  • Extending the approach could help create more context-aware safety filters that track risk across dialogue turns.

Load-bearing premise

The framework assumes that the predicted Risk Concealment Score can reliably steer the attacker to produce prompts that progressively conceal surface risk while still eliciting regulatory-violating behaviors from the target LLM without the model recognizing the adversarial pattern.

What would settle it

An experiment that applies the same prompts with high predicted concealment scores to the target LLMs and measures whether the models refuse the queries or produce safe responses at a much higher rate than reported.

Figures

Figures reproduced from arXiv: 2509.10546 by Gang Cheng, Haibo Jin, Haohan Wang, Jun Zhuang, Wenbin Zhang.

Figure 1: Comparison between explicit and implicit risks in …
Figure 2: Overview of our framework, which consists of two phases: In phase 1, we construct an initial prompt using a structured …
Figure 3: Breakdown of attack success by categories: Finan…
Figure 4: Risk level distribution and ASR across RCA refine…
Figure 5: Distribution of data source in FIN-Bench. The …
Figure 6: Proportion of data sources for each financial behav…
Original abstract

Large Language Models (LLMs) are increasingly deployed in finance, where unsafe behavior can lead to serious regulatory risks. However, most red-teaming research focuses on overtly harmful content and overlooks attacks that appear legitimate on the surface yet induce regulatory-violating responses. We address this gap by introducing a controllable black-box multi-turn risk-concealed red-teaming framework (CoRT) that progressively conceals surface-level risk while exploiting regulatory-violating behaviors. CoRT contains two key components: (i) a Risk Concealment Attacker (RCA) that generates multi-turn prompts via iterative refinement, and (ii) a Risk Concealment Controller (RCC) that predicts a turn-level Risk Concealment Score (RCS) to steer RCA's follow-up style. We also built a domain-specific benchmark, FinRisk-Bench, with 522 instructions spanning six financial risk categories. Experiments on nine widely used LLMs show that CoRT (RCA) achieves 93.19% average attack success rate (ASR), and CoRT (RCA+RCC) further improves the average ASR to 95.00%. Our code and FinRisk-Bench are available at https://github.com/gcheng128/CoRT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CoRT, a black-box controllable multi-turn red-teaming framework for LLMs in finance. It consists of a Risk Concealment Attacker (RCA) that iteratively refines prompts to conceal surface risk while eliciting regulatory violations, and a Risk Concealment Controller (RCC) that predicts a turn-level Risk Concealment Score (RCS) to steer prompt style. The authors construct FinRisk-Bench containing 522 instructions across six financial risk categories and evaluate on nine LLMs, reporting average attack success rates (ASR) of 93.19% for RCA alone and 95.00% when combined with RCC.

Significance. If the ASR measurements prove robust, this work would meaningfully advance red-teaming methodology for high-stakes domains by shifting focus from overt harm to subtle, multi-turn attacks that appear legitimate. The open release of code and the domain-specific benchmark would support reproducibility and further research on financial LLM safety.

major comments (2)
  1. [Experiments] Experiments section: The headline ASR figures (93.19% for RCA, 95.00% for RCA+RCC) rest on an undefined success criterion. No explicit rubric, judging protocol (human, LLM-as-judge, or hybrid), or inter-annotator agreement statistics are reported for the 522 FinRisk-Bench cases, making it impossible to assess whether the numbers reflect genuine elicitation of regulatory violations or lenient labeling.
  2. [Method] Method (§3.2, RCC component): The claim that the predicted Risk Concealment Score reliably steers the attacker toward progressively concealed yet effective prompts assumes the target LLM will not detect the adversarial pattern. No ablation, qualitative examples across turns, or analysis of detection rates is provided to support this load-bearing assumption.
minor comments (2)
  1. [Abstract] The abstract and experimental setup omit the names of the nine evaluated LLMs and any direct baseline comparisons (e.g., single-turn or non-controllable red-teaming methods), which would help contextualize the reported gains.
  2. [Experiments] Tables reporting per-model and per-category ASR should include standard errors or statistical significance tests for the observed improvements from adding RCC.
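The uncertainty estimates requested in the second minor comment could take the following form. This is a minimal sketch under two assumptions: each model is evaluated on all 522 instructions, and the RCA and RCA+RCC conditions are run on the same instructions, which makes a paired test (McNemar's exact test) a reasonable choice. The discordant counts in the usage lines are invented for illustration and are not results from the paper.

```python
import math


def binomial_se(p, n):
    """Standard error of a success proportion p estimated from n trials."""
    return math.sqrt(p * (1.0 - p) / n)


def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p-value.

    b: cases where only condition A succeeded; c: cases where only condition B succeeded.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Standard error for one model at the reported RCA average, assuming n = 522.
se = binomial_se(0.9319, 522)
print(f"SE = {se:.4f}")

# Hypothetical discordant counts for a single model (illustrative only).
print(f"p = {mcnemar_exact_p(1, 5):.5f}")
```

The paired test only uses the instructions on which the two conditions disagree, so a per-instruction outcome table is needed rather than the two aggregate rates.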

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment below and outline the revisions we will incorporate to strengthen the manuscript.

  1. Referee: [Experiments] Experiments section: The headline ASR figures (93.19% for RCA, 95.00% for RCA+RCC) rest on an undefined success criterion. No explicit rubric, judging protocol (human, LLM-as-judge, or hybrid), or inter-annotator agreement statistics are reported for the 522 FinRisk-Bench cases, making it impossible to assess whether the numbers reflect genuine elicitation of regulatory violations or lenient labeling.

    Authors: We agree that the success criterion for ASR requires explicit documentation. In the original evaluation, a response was deemed successful if it produced content violating the regulatory constraints tied to the FinRisk-Bench instruction's risk category, assessed via a hybrid protocol of automated keyword matching for regulatory indicators followed by expert manual review on a sampled subset. In the revised manuscript we will insert a dedicated subsection detailing the full rubric, the hybrid judging protocol (LLM-assisted scoring with human adjudication on 20% of cases), and inter-annotator agreement statistics (Cohen's kappa) computed during annotation.
    Revision: yes

  2. Referee: [Method] Method (§3.2, RCC component): The claim that the predicted Risk Concealment Score reliably steers the attacker toward progressively concealed yet effective prompts assumes the target LLM will not detect the adversarial pattern. No ablation, qualitative examples across turns, or analysis of detection rates is provided to support this load-bearing assumption.

    Authors: We acknowledge that direct evidence for the RCC's steering behavior without eliciting detection was limited. While the black-box multi-turn formulation is intended to avoid overt adversarial signatures, we will add (i) an ablation isolating RCC's contribution to concealment progression, (ii) qualitative turn-by-turn prompt examples with corresponding RCS values, and (iii) a detection-rate analysis measuring refusal or pattern-recognition signals from the nine target models. These additions will be placed in §3.2 and the Experiments section of the revised manuscript.
    Revision: yes
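For reference, the inter-annotator agreement statistic named in the first response, Cohen's kappa, can be computed as in the sketch below. The two label lists are toy values for a pair of hypothetical annotators (1 = attack judged successful, 0 = not) and are not drawn from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy annotations, for illustration only.
annotator_1 = [1, 1, 0, 1, 0, 1, 1, 0]
annotator_2 = [1, 1, 0, 0, 0, 1, 1, 1]
print(round(cohens_kappa(annotator_1, annotator_2), 3))
```

With an ASR above 90%, nearly all labels fall in one class, so chance agreement is high and kappa can be low even when raw agreement looks strong; reporting both figures would make the agreement claim easier to interpret.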

Circularity Check

0 steps flagged

No circularity: empirical results on external LLMs and new benchmark


The paper introduces an empirical red-teaming framework CoRT consisting of RCA for iterative prompt generation and RCC for steering via predicted RCS, evaluated directly on nine external LLMs using the newly constructed FinRisk-Bench with 522 cases. All performance claims (93.19% and 95.00% ASR) are measured outcomes from model interactions rather than derived from equations or self-referential fits. No mathematical derivations, parameter fittings renamed as predictions, or load-bearing self-citations appear in the abstract or described structure. The chain is self-contained against external benchmarks and does not reduce any result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only view limits visibility into hyperparameters or modeling choices; no explicit free parameters, invented entities, or non-standard axioms are stated. Relies on the domain assumption that LLMs exhibit exploitable regulatory-violating behaviors under concealed multi-turn prompting.

axioms (1)
  • domain assumption: LLMs can be induced to produce regulatory-violating responses through progressively concealed multi-turn prompts.
    Foundational premise for the red-teaming goal and success metric.



