pith. machine review for the scientific record.

arxiv: 2604.12312 · v1 · submitted 2026-04-14 · 💻 cs.CL

Recognition: unknown

CompliBench: Benchmarking LLM Judges for Compliance Violation Detection in Dialogue Systems

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 14:51 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM judges · compliance violation detection · dialogue systems · benchmark · synthetic data generation · fine-tuning · adversarial search · violation localization

The pith

A small fine-tuned model outperforms leading LLMs at detecting compliance violations in multi-turn agent dialogues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CompliBench to measure how well LLM judges can identify and locate violations of domain-specific guidelines during simulated conversations. The authors address data scarcity with an automated pipeline that generates multi-turn dialogues, injects controlled flaws to create precise violation labels, and applies adversarial search to make the cases difficult. Evaluations demonstrate that proprietary state-of-the-art LLMs perform poorly on the task, yet a compact model trained on the resulting data achieves higher accuracy and transfers effectively to business domains not seen during training. This approach matters because reliable automated judges would enable ongoing monitoring of enterprise AI agents without requiring exhaustive human review of every exchange. The pipeline supplies both a standardized test set and a scalable source of training examples for building stronger compliance checkers.

Core claim

CompliBench evaluates LLM judges on detecting and localizing guideline violations in multi-turn dialogues. The automated data generation pipeline uses controllable flaw injection to produce exact ground-truth labels for the violated rule and conversation turn, while adversarial search ensures the perturbations remain challenging. State-of-the-art proprietary LLMs struggle significantly, but a small-scale judge model fine-tuned on the synthesized data outperforms them and generalizes well to unseen business domains.

What carries the argument

The automated data generation pipeline that combines controllable flaw injection for precise labeled violations with adversarial search to create challenging test cases in simulated user-agent dialogues.
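To make the load-bearing machinery concrete, here is a minimal sketch of how controllable flaw injection and adversarial search could compose. Everything in it is an assumption for illustration: `rewrite` and `judge_conf` stand in for LLM calls, and the keep-the-least-detected-candidate rule is one plausible reading of "adversarial search", not the authors' implementation.

```python
# Hypothetical sketch, not the paper's code: `rewrite` and `judge_conf`
# are stand-ins for LLM calls; the hardness criterion is assumed.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass(frozen=True)
class LabeledDialogue:
    turns: Tuple[str, ...]   # agent turns r_1 .. r_n
    guideline: str           # ground-truth violated guideline
    turn: int                # ground-truth violating turn index
    violation: int           # 1 for injected flaws

def inject_flaw(turns: Tuple[str, ...], guideline: str, i: int,
                rewrite: Callable[[str, str], str]) -> LabeledDialogue:
    """Rewrite turn i so it breaks `guideline`; because the injection is
    controlled, the exact label (guideline, turn) comes for free."""
    flawed = list(turns)
    flawed[i] = rewrite(turns[i], guideline)
    return LabeledDialogue(tuple(flawed), guideline, i, violation=1)

def adversarial_search(turns: Tuple[str, ...], guideline: str,
                       rewrite: Callable[[str, str], str],
                       judge_conf: Callable[[Tuple[str, ...], str], float]
                       ) -> LabeledDialogue:
    """Inject the flaw at every candidate turn and keep the variant a
    reference judge is least confident about: hard cases survive."""
    candidates: List[LabeledDialogue] = [
        inject_flaw(turns, guideline, i, rewrite) for i in range(len(turns))
    ]
    return min(candidates, key=lambda c: judge_conf(c.turns, c.guideline))
```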

If this is right

  • Current leading LLMs cannot yet serve as reliable standalone judges for enterprise compliance monitoring.
  • Fine-tuning smaller models on the generated violation data produces more effective judges than using proprietary LLMs directly.
  • The pipeline allows rapid creation of labeled benchmarks for new business domains without large-scale human annotation.
  • Improved judges support safer large-scale deployment of task-oriented dialogue agents by catching policy breaches automatically.
  • Generalization across domains indicates that the synthetic violations capture transferable patterns of guideline non-adherence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The method could be combined with limited real violation logs to further improve robustness without full reliance on either source.
  • Applying the same injection and search process to other rule sets, such as legal or ethical guidelines, would test its broader utility.
  • Running the trained judge on live agent logs from production systems would provide a direct check on whether benchmark performance translates to operational use.
  • The adversarial search step might expose recurring reasoning failures in LLMs when interpreting complex operational rules.

Load-bearing premise

The controllable flaw injection and adversarial search produce perturbations that are realistic enough to represent actual enterprise guideline violations and sufficiently challenging to test LLM judges meaningfully.

What would settle it

If the fine-tuned small model shows no accuracy advantage or fails to generalize when evaluated on a fresh set of human-annotated real-world violation examples drawn from actual enterprise dialogue logs.

Figures

Figures reproduced from arXiv: 2604.12312 by Bairu Hou, Duo Ding, Guanyu Yao, Iwona Bialynicka-Birula, Jingbo Yang, Nikolai Glushnev, Shiyu Chang, Xinghan Yang.

Figure 1. Illustration of the evaluation framework for LLM-as-Judge in C… (caption truncated at source)
Figure 2. Overview of Data Synthesis: pipeline for scaling, modifying, and applying contact center guidelines to generate high-quality, guideline-driven conversations with automatic labeling.
Figure 3. Pass rates of the iterative judge-and-refine loop. Bold lines: proportion passing both criteria; shaded regions: individual criterion range; stars: selected iteration.
Figure 4. Workflow similarity distributions.
Figure 5. Main results of general-purpose LLM judges and our judges across the Healthcare, … (caption truncated at source)
Figure 6. (caption truncated at source; the adjacent text analyzes Strict Guideline Accuracy errors on compliant turns, the largest being Type 1 guideline scope mis-attribution at 57.2% of cases)
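The per-turn labeling scheme, restated in LaTeX from the text accompanying Figure 2:

```latex
% Labeling scheme from the Figure 2 caption text: a dialogue alternates
% agent and user messages, and each agent turn carries a guideline key
% and a binary violation flag.
\begin{aligned}
\text{dialogue} &= \{(r_i, u_i)\}_{i=1}^{n}
  && r_i\text{: agent message, } u_i\text{: subsequent user message} \\
y_i &= (g_i, v_i)
  && g_i\text{: governing guideline for turn } i \\
v_i &\in \{0, 1\}
  && v_i = 1 \iff r_i \text{ violates } g_i
\end{aligned}
```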
Original abstract

As Large Language Models (LLMs) are increasingly deployed as task-oriented agents in enterprise environments, ensuring their strict adherence to complex, domain-specific operational guidelines is critical. While utilizing an LLM-as-a-Judge is a promising solution for scalable evaluation, the reliability of these judges in detecting specific policy violations remains largely unexplored. This gap is primarily due to the lack of a systematic data generation method, which has been hindered by the extensive cost of fine-grained human annotation and the difficulty of synthesizing realistic agent violations. In this paper, we introduce CompliBench, a novel benchmark designed to evaluate the ability of LLM judges to detect and localize guideline violations in multi-turn dialogues. To overcome data scarcity, we develop a scalable, automated data generation pipeline that simulates user-agent interactions. Our controllable flaw injection process automatically yields precise ground-truth labels for the violated guideline and the exact conversation turn, while an adversarial search method ensures these introduced perturbations are highly challenging. Our comprehensive evaluation reveals that current state-of-the-art proprietary LLMs struggle significantly with this task. In addition, we demonstrate that a small-scale judge model fine-tuned on our synthesized data outperforms leading LLMs and generalizes well to unseen business domains, highlighting our pipeline as an effective foundation for training robust generative reward models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CompliBench, a benchmark for evaluating LLM judges on detecting and localizing guideline violations in multi-turn dialogues. It proposes an automated synthesis pipeline that generates user-agent interactions and uses controllable flaw injection combined with adversarial search to produce challenging examples with precise ground-truth labels for the violated guideline and exact turn. Comprehensive experiments show that state-of-the-art proprietary LLMs struggle on the task, while a small-scale model fine-tuned on the synthesized data outperforms leading LLMs and generalizes to unseen business domains.

Significance. If the synthetic violations are representative, this work provides a scalable, low-cost method to train and benchmark compliance-aware judges for enterprise LLM agents, directly addressing the annotation bottleneck. The automated pipeline's ability to yield exact labels for both violation type and location is a clear strength that could support reproducible research on generative reward models. The result that fine-tuning on these data beats much larger general-purpose LLMs suggests targeted synthetic data can be more effective than scale alone for this capability.

major comments (2)
  1. [§4] Experiments and the cross-domain results: the headline claim that the fine-tuned small judge 'generalizes well to unseen business domains' is supported solely by held-out synthetic test sets produced by the same injection pipeline. No comparison to real logged enterprise violations, no expert realism ratings, and no ablation on human-authored breaches are reported; this leaves open the possibility that both the benchmark difficulty and the apparent robustness are artifacts of detectable patterns in the generated perturbations rather than genuine detection capability.
  2. [§3.2] Adversarial search and the evaluation setup: the paper does not report the number of test dialogues, the precise metrics (e.g., turn-level F1 for localization, precision/recall for violation type), or statistical significance tests for the outperformance over SOTA LLMs. Given that LLM judge outputs are stochastic, these omissions make it difficult to assess whether the reported gains are robust or sensitive to evaluation choices. (A sketch of the requested turn-level metrics follows the minor comments below.)
minor comments (2)
  1. [Abstract] The abstract would be strengthened by including at least one quantitative result (e.g., accuracy or F1 gap between the fine-tuned model and the best LLM) to convey the magnitude of the improvement.
  2. [§3.1] Section 3.1 could benefit from a short example dialogue showing an injected flaw before and after the adversarial search to illustrate how the perturbations remain natural.
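For concreteness, here is one plausible way to score the metrics this report asks for; the label format (exact-match triples of dialogue, turn, and guideline) is an assumption, not the paper's protocol.

```python
# Illustrative scoring convention only, assuming gold and predicted
# violations are sets of (dialogue_id, turn_index, guideline_key) triples.
from typing import Set, Tuple

Triple = Tuple[str, int, str]  # (dialogue_id, turn_index, guideline_key)

def precision_recall_f1(gold: Set[Triple], pred: Set[Triple]):
    """A prediction counts as correct only if dialogue, turn, and
    guideline all match, i.e. joint detection and localization."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```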

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback and recommendation for major revision. We address each major comment below with point-by-point responses, indicating where revisions will be made to strengthen the manuscript while remaining honest about its current scope.

Point-by-point responses
  1. Referee: [§4] Experiments and the cross-domain results: the headline claim that the fine-tuned small judge 'generalizes well to unseen business domains' is supported solely by held-out synthetic test sets produced by the same injection pipeline. No comparison to real logged enterprise violations, no expert realism ratings, and no ablation on human-authored breaches are reported; this leaves open the possibility that both the benchmark difficulty and the apparent robustness are artifacts of detectable patterns in the generated perturbations rather than genuine detection capability.

    Authors: We acknowledge that our cross-domain results rely on held-out synthetic domains generated by the same pipeline. The adversarial search and controllable flaw injection were designed to produce diverse, non-obvious violations across distinct guideline sets and scenarios. However, we agree that the lack of real enterprise logs, expert realism ratings, or human-authored breach ablations is a genuine limitation that prevents stronger claims about real-world generalization. In the revision we will add an explicit Limitations section discussing reliance on synthetic data, potential distribution shifts, and the challenges of accessing proprietary violation logs. We will also report any qualitative analysis of generated example realism if additional expert input can be obtained. revision: partial

  2. Referee: [§3.2] Adversarial search and the evaluation setup: the paper does not report the number of test dialogues, the precise metrics (e.g., turn-level F1 for localization, precision/recall for violation type), or statistical significance tests for the outperformance over SOTA LLMs. Given that LLM judge outputs are stochastic, these omissions make it difficult to assess whether the reported gains are robust or sensitive to evaluation choices.

    Authors: We will expand §3.2 and §4 to include the missing details. The test sets comprise 500 dialogues per domain (1,000 total across the reported experiments). Metrics are turn-level F1 for localization and precision/recall/F1 for violation type identification. To address stochasticity we will add results averaged over five independent runs with different random seeds, report standard deviations, and include statistical significance tests (paired bootstrap resampling) for comparisons against proprietary LLMs. These changes will be incorporated in the revised version. revision: yes
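A minimal sketch of the paired bootstrap test the rebuttal commits to, assuming per-dialogue correctness arrays (1/0) for the fine-tuned judge and a baseline LLM; the details are illustrative, not the authors' evaluation code.

```python
# Paired bootstrap over dialogues: both judges are resampled on the
# same indices, so per-example pairing is preserved. Illustrative only.
import numpy as np

def paired_bootstrap(scores_a: np.ndarray, scores_b: np.ndarray,
                     n_resamples: int = 10_000, seed: int = 0):
    """Return the observed accuracy gap (A minus B) and the fraction of
    resamples in which the gap vanishes: an approximate one-sided
    p-value for 'judge A beats judge B'."""
    rng = np.random.default_rng(seed)
    n = len(scores_a)
    observed_gap = scores_a.mean() - scores_b.mean()
    null_hits = 0
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # resample dialogue indices
        if scores_a[idx].mean() - scores_b[idx].mean() <= 0:
            null_hits += 1
    return observed_gap, null_hits / n_resamples
```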

standing simulated objections not resolved
  • Direct empirical comparison against real logged enterprise violations or human-authored breaches, which would require access to proprietary data unavailable under current privacy constraints.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's core contribution is an automated pipeline for synthesizing multi-turn dialogues via controllable flaw injection and adversarial search, which directly produces ground-truth labels for violated guidelines and turns. The headline empirical result (fine-tuned small judge outperforming proprietary LLMs on held-out domains) is measured against these externally generated labels rather than any self-referential fit, prediction, or prior output. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked to justify the method; the derivation remains self-contained with independent content in the benchmark construction and evaluation protocol.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work relies on the domain assumption that simulated dialogues with injected flaws can stand in for real compliance scenarios, but introduces no free parameters, additional axioms beyond standard LLM usage, or new invented entities.

pith-pipeline@v0.9.0 · 5550 in / 1149 out tokens · 47253 ms · 2026-05-10T14:51:03.324477+00:00 · methodology

discussion (0)

