BARRED: Synthetic Training of Custom Policy Guardrails via Asymmetric Debate
Pith reviewed 2026-05-07 16:30 UTC · model grok-4.3
The pith
Small language models finetuned on synthetic data from domain decomposition and multi-agent debate outperform proprietary LLMs and dedicated guardrail models on custom policies.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BARRED decomposes the domain space into dimensions to ensure comprehensive coverage and employs multi-agent debate to verify label correctness, yielding a high-fidelity training corpus. Small language models finetuned on this synthetic data consistently outperform state-of-the-art proprietary LLMs, including reasoning models, and dedicated guardrail models across diverse custom policies.
What carries the argument
The BARRED framework: dimension decomposition of the policy domain, followed by multi-agent debate to produce and verify synthetic labeled examples for fine-tuning.
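The two-stage pipeline the argument rests on (decompose the domain into dimensions, generate candidates per dimension, keep only candidates that survive debate verification) can be sketched as follows. This is an illustrative reconstruction, not the paper's actual API: every function name is hypothetical, and the LLM call is stubbed out so the control flow runs standalone.

```python
import itertools

def llm(prompt: str) -> str:
    """Stand-in for a real model call; returns canned responses so the
    pipeline's control flow can be exercised without an API key."""
    if "debate" in prompt:
        return "AGREE"  # debaters converge on the proposed label
    return "example for: " + prompt

def decompose_dimensions(task_description: str) -> list[str]:
    """Stage 1: split the policy domain into coverage dimensions.
    In the paper this step is LLM-driven; here it is hard-coded."""
    return ["topic", "severity", "phrasing"]

def generate_candidates(dimensions, n_per_dim=2):
    """Stage 2: synthesize labeled candidates along each dimension."""
    for dim, i in itertools.product(dimensions, range(n_per_dim)):
        text = llm(f"generate {dim} variant {i}")
        yield {"text": text, "label": "violation", "dimension": dim}

def debate_verify(example, rounds=2):
    """Stage 3: multi-agent debate -- keep the example only if the
    debaters end up agreeing with the proposed label."""
    verdicts = [llm(f"debate round {r}: {example['text']}")
                for r in range(rounds)]
    return all(v == "AGREE" for v in verdicts)

def build_corpus(task_description):
    dims = decompose_dimensions(task_description)
    return [ex for ex in generate_candidates(dims) if debate_verify(ex)]

corpus = build_corpus("flag off-policy financial advice")
print(len(corpus))  # 3 dimensions x 2 variants, all pass the stub debate
```

With the stub, all six candidates survive; in a real run, the debate filter is exactly where label fidelity is won or lost, which is why the load-bearing premise below matters.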
Load-bearing premise
Multi-agent debate reliably verifies label correctness and dimension decomposition produces sufficiently diverse and faithful synthetic examples without introducing systematic biases.
What would settle it
A direct comparison of BARRED-generated labels against human expert annotations on the same set of boundary-case examples, or a test showing that a model fine-tuned on BARRED data fails to outperform baselines on a held-out custom policy with complex boundaries.
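The first settling experiment reduces to an agreement study: score debate-generated labels against human expert labels on the same boundary cases, summarized with Cohen's kappa. A minimal stdlib-only sketch, using invented placeholder labels rather than any data from the paper:

```python
from collections import Counter

# Placeholder annotations on six hypothetical boundary cases.
human  = ["violation", "safe", "violation", "safe", "violation", "safe"]
debate = ["violation", "safe", "safe",      "safe", "violation", "safe"]

def cohens_kappa(a, b):
    """Chance-corrected agreement between two label sequences."""
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under independent marginal label distributions.
    ca, cb = Counter(a), Counter(b)
    p_expected = sum(ca[k] * cb[k] for k in set(a) | set(b)) / n**2
    return (p_observed - p_expected) / (1 - p_expected)

print(round(cohens_kappa(human, debate), 3))  # 0.667 on this toy data
```

A kappa well below human inter-annotator agreement on the same cases would indicate the systematic debate errors the report worries about.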
Original abstract
Deploying guardrails for custom policies remains challenging, as generic safety models fail to capture task-specific requirements, while prompting LLMs suffers from inconsistent boundary-case performance and high inference costs. Training custom classifiers achieves both accuracy and efficiency, yet demands substantial labeled data that is costly to obtain. We present BARRED (Boundary Alignment Refinement through REflection and Debate), a framework for generating faithful and diverse synthetic training data using only a task description and a small set of unlabeled examples. Our approach decomposes the domain space into dimensions to ensure comprehensive coverage, and employs multi-agent debate to verify label correctness, yielding a high-fidelity training corpus. Experiments across diverse custom policies demonstrate that small language models finetuned on our synthetic data consistently outperform state-of-the-art proprietary LLMs (including reasoning models) and dedicated guardrail models. Ablation studies confirm that both dimension decomposition and debate-based verification are critical for ensuring the diversity and label fidelity required for effective fine-tuning. The BARRED framework eliminates the reliance on extensive human annotation, offering a scalable solution for accurate custom guardrails.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents BARRED, a framework for generating synthetic training data for custom policy guardrails. It decomposes the domain into dimensions for coverage and employs asymmetric multi-agent debate to verify label correctness, starting from only a task description and unlabeled examples. The central claim is that small language models fine-tuned on this synthetic corpus consistently outperform state-of-the-art proprietary LLMs (including reasoning models) and dedicated guardrail models across diverse custom policies, with ablations confirming that both dimension decomposition and debate-based verification are critical for diversity and label fidelity.
Significance. If the synthetic labels prove verifiably correct and diverse without systematic bias, the result would be significant for AI safety and deployment: it provides a scalable alternative to human annotation for task-specific guardrails, enabling accurate and efficient custom classifiers where generic models or prompting fall short.
major comments (2)
- [Experiments] Experiments section: the headline outperformance claim rests on labels produced by multi-agent debate, yet the manuscript reports no held-out human-annotated subset, no inter-annotator agreement with debate outputs, and no direct comparison against a gold-standard labeling process. This is load-bearing; systematic errors in debate on boundary cases could cause the fine-tuned model to inherit the same error distribution rather than demonstrate genuine superiority.
- [Ablation studies] Ablation studies: while removing dimension decomposition or debate reduces downstream accuracy, the ablations do not include an external validation step (e.g., human review of label correctness on a held-out set). Without this, it is unclear whether the components improve absolute fidelity or merely align the synthetic distribution with the test set's biases.
minor comments (2)
- [Methods] Clarify the precise definition and implementation of 'asymmetric' debate in the methods section, as the abstract uses the term without elaboration.
- [Experiments] Ensure all baselines (proprietary LLMs, reasoning models, dedicated guardrails) are evaluated under identical prompting and inference conditions to support fair comparison.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. The concerns about validating the correctness of the synthetic labels are substantive, and we will strengthen the manuscript with additional human validation experiments in the revision.
Point-by-point responses
Referee: [Experiments] Experiments section: the headline outperformance claim rests on labels produced by multi-agent debate, yet the manuscript reports no held-out human-annotated subset, no inter-annotator agreement with debate outputs, and no direct comparison against a gold-standard labeling process. This is load-bearing; systematic errors in debate on boundary cases could cause the fine-tuned model to inherit the same error distribution rather than demonstrate genuine superiority.
Authors: We agree that direct human validation of the debate-generated labels is important for substantiating the outperformance claims. The evaluation test sets are human-annotated for each custom policy, but we did not include a separate held-out human review of the synthetic training labels. In the revised manuscript, we will add a human annotation study on a random sample of the generated data (stratified across policies and dimensions). We will report inter-annotator agreement and agreement rates between human labels and the asymmetric debate outputs, providing a direct assessment of label fidelity and any systematic errors on boundary cases. revision: yes
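The stratified audit the authors propose (a fixed number of generated examples per policy-dimension stratum, sent for human review) can be sketched as below. The dataset, policy names, and stratum sizes are invented for illustration; only the sampling logic is the point.

```python
import random
from collections import defaultdict

random.seed(0)  # reproducible audit draw

# Invented synthetic corpus: 10 examples per (policy, dimension) stratum.
synthetic = [
    {"policy": p, "dimension": d, "text": f"{p}/{d} example {i}"}
    for p in ("finance", "medical")
    for d in ("topic", "severity")
    for i in range(10)
]

def stratified_sample(examples, per_stratum=3):
    """Draw the same number of examples from every (policy, dimension)
    stratum, so rare strata are not swamped in the human review."""
    strata = defaultdict(list)
    for ex in examples:
        strata[(ex["policy"], ex["dimension"])].append(ex)
    return [ex for group in strata.values()
            for ex in random.sample(group, per_stratum)]

audit = stratified_sample(synthetic)
print(len(audit))  # 4 strata x 3 examples each
```

Reporting agreement separately per stratum, rather than pooled, is what would surface boundary-case-specific debate errors.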
Referee: [Ablation studies] Ablation studies: while removing dimension decomposition or debate reduces downstream accuracy, the ablations do not include an external validation step (e.g., human review of label correctness on a held-out set). Without this, it is unclear whether the components improve absolute fidelity or merely align the synthetic distribution with the test set's biases.
Authors: The referee is correct that the existing ablations demonstrate impact on downstream accuracy but lack an external check on label quality. To distinguish between genuine fidelity gains and potential alignment with test-set biases, we will extend the ablation section with human review of labels produced by each variant (full BARRED, without decomposition, without debate) on a held-out sample. We will report human agreement metrics for each ablation condition and discuss the results in the revised paper. revision: yes
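The extended ablation audit amounts to computing a human-agreement rate per pipeline variant. A sketch with entirely invented placeholder labels (no numbers here come from the paper); the variant names mirror the conditions the authors list:

```python
# Invented human labels on eight hypothetical held-out examples
# ("v" = violation, "s" = safe), and invented labels from each variant.
human = ["v", "s", "v", "s", "v", "s", "v", "s"]
variant_labels = {
    "full BARRED":           ["v", "s", "v", "s", "v", "s", "v", "v"],
    "without decomposition": ["v", "s", "s", "s", "v", "v", "v", "s"],
    "without debate":        ["v", "v", "s", "s", "s", "v", "v", "s"],
}

def agreement(a, b):
    """Fraction of examples on which two label sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

rates = {name: agreement(human, labels)
         for name, labels in variant_labels.items()}
for name, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rate:.2f}")
```

If the full pipeline's agreement with humans exceeds each ablated variant's, the components improve absolute fidelity; if all variants agree with humans equally but differ on the test set, the test-set-bias explanation gains weight.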
Circularity Check
No circularity; empirical claims rest on external benchmarks and ablations
Full rationale
The paper describes a synthetic data generation pipeline (dimension decomposition + asymmetric debate) and reports downstream fine-tuning results on held-out test sets. No equations, fitted parameters, or self-citations are used to derive the headline performance numbers; the outperformance is measured against independent proprietary models and guardrail baselines. Ablation studies quantify the contribution of each component but do not redefine or presuppose the final accuracy metric. The method stands or falls on external evaluation and does not reduce any claimed result to its own inputs by construction.