BARRED: Synthetic Training of Custom Policy Guardrails via Asymmetric Debate
Pith reviewed 2026-05-07 16:30 UTC · model grok-4.3
The pith
Small language models finetuned on synthetic data from domain decomposition and multi-agent debate outperform proprietary LLMs and dedicated guardrail models on custom policies.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BARRED decomposes the domain space into dimensions to ensure comprehensive coverage and employs multi-agent debate to verify label correctness, yielding a high-fidelity training corpus. Small language models finetuned on this synthetic data consistently outperform state-of-the-art proprietary LLMs, including reasoning models, and dedicated guardrail models across diverse custom policies.
What carries the argument
The BARRED framework: dimension decomposition of the policy domain, followed by multi-agent debate to produce and verify synthetic labeled examples for fine-tuning.
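The two-stage pipeline the argument rests on (decompose the domain into dimensions, generate candidates per dimension, keep only candidates that survive debate verification) can be sketched as follows. This is an illustrative reconstruction, not the paper's actual API: every function name is hypothetical, and the LLM call is stubbed out so the control flow runs standalone.

```python
import itertools

def llm(prompt: str) -> str:
    """Stand-in for a real model call; returns canned responses so the
    pipeline's control flow can be exercised without an API key."""
    if "debate" in prompt:
        return "AGREE"  # debaters converge on the proposed label
    return "example for: " + prompt

def decompose_dimensions(task_description: str) -> list[str]:
    """Stage 1: split the policy domain into coverage dimensions.
    In the paper this step is LLM-driven; here it is hard-coded."""
    return ["topic", "severity", "phrasing"]

def generate_candidates(dimensions, n_per_dim=2):
    """Stage 2: synthesize labeled candidates along each dimension."""
    for dim, i in itertools.product(dimensions, range(n_per_dim)):
        text = llm(f"generate {dim} variant {i}")
        yield {"text": text, "label": "violation", "dimension": dim}

def debate_verify(example, rounds=2):
    """Stage 3: multi-agent debate -- keep the example only if the
    debaters end up agreeing with the proposed label."""
    verdicts = [llm(f"debate round {r}: {example['text']}")
                for r in range(rounds)]
    return all(v == "AGREE" for v in verdicts)

def build_corpus(task_description):
    dims = decompose_dimensions(task_description)
    return [ex for ex in generate_candidates(dims) if debate_verify(ex)]

corpus = build_corpus("flag off-policy financial advice")
print(len(corpus))  # 3 dimensions x 2 variants, all pass the stub debate
```

With the stub, all six candidates survive; in a real run, the debate filter is exactly where label fidelity is won or lost, which is why the load-bearing premise below matters.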
Load-bearing premise
Multi-agent debate reliably verifies label correctness and dimension decomposition produces sufficiently diverse and faithful synthetic examples without introducing systematic biases.
What would settle it
A direct comparison of BARRED-generated labels against human expert annotations on the same set of boundary-case examples, or a test showing that a model fine-tuned on BARRED data fails to outperform baselines on a held-out custom policy with complex boundaries.
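The first settling experiment reduces to an agreement study: score debate-generated labels against human expert labels on the same boundary cases, summarized with Cohen's kappa. A minimal stdlib-only sketch, using invented placeholder labels rather than any data from the paper:

```python
from collections import Counter

# Placeholder annotations on six hypothetical boundary cases.
human  = ["violation", "safe", "violation", "safe", "violation", "safe"]
debate = ["violation", "safe", "safe",      "safe", "violation", "safe"]

def cohens_kappa(a, b):
    """Chance-corrected agreement between two label sequences."""
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under independent marginal label distributions.
    ca, cb = Counter(a), Counter(b)
    p_expected = sum(ca[k] * cb[k] for k in set(a) | set(b)) / n**2
    return (p_observed - p_expected) / (1 - p_expected)

print(round(cohens_kappa(human, debate), 3))  # 0.667 on this toy data
```

A kappa well below human inter-annotator agreement on the same cases would indicate the systematic debate errors the report worries about.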
Original abstract
Deploying guardrails for custom policies remains challenging, as generic safety models fail to capture task-specific requirements, while prompting LLMs suffers from inconsistent boundary-case performance and high inference costs. Training custom classifiers achieves both accuracy and efficiency, yet demands substantial labeled data that is costly to obtain. We present BARRED (Boundary Alignment Refinement through REflection and Debate), a framework for generating faithful and diverse synthetic training data using only a task description and a small set of unlabeled examples. Our approach decomposes the domain space into dimensions to ensure comprehensive coverage, and employs multi-agent debate to verify label correctness, yielding a high-fidelity training corpus. Experiments across diverse custom policies demonstrate that small language models finetuned on our synthetic data consistently outperform state-of-the-art proprietary LLMs (including reasoning models) and dedicated guardrail models. Ablation studies confirm that both dimension decomposition and debate-based verification are critical for ensuring the diversity and label fidelity required for effective fine-tuning. The BARRED framework eliminates the reliance on extensive human annotation, offering a scalable solution for accurate custom guardrails.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents BARRED, a framework for generating synthetic training data for custom policy guardrails. It decomposes the domain into dimensions for coverage and employs asymmetric multi-agent debate to verify label correctness, starting from only a task description and unlabeled examples. The central claim is that small language models fine-tuned on this synthetic corpus consistently outperform state-of-the-art proprietary LLMs (including reasoning models) and dedicated guardrail models across diverse custom policies, with ablations confirming that both dimension decomposition and debate-based verification are critical for diversity and label fidelity.
Significance. If the synthetic labels prove verifiably correct and diverse without systematic bias, the result would be significant for AI safety and deployment: it provides a scalable alternative to human annotation for task-specific guardrails, enabling accurate and efficient custom classifiers where generic models or prompting fall short.
major comments (2)
- [Experiments] Experiments section: the headline outperformance claim rests on labels produced by multi-agent debate, yet the manuscript reports no held-out human-annotated subset, no inter-annotator agreement with debate outputs, and no direct comparison against a gold-standard labeling process. This is load-bearing; systematic errors in debate on boundary cases could cause the fine-tuned model to inherit the same error distribution rather than demonstrate genuine superiority.
- [Ablation studies] Ablation studies: while removing dimension decomposition or debate reduces downstream accuracy, the ablations do not include an external validation step (e.g., human review of label correctness on a held-out set). Without this, it is unclear whether the components improve absolute fidelity or merely align the synthetic distribution with the test set's biases.
minor comments (2)
- [Methods] Clarify the precise definition and implementation of 'asymmetric' debate in the methods section, as the abstract uses the term without elaboration.
- [Experiments] Ensure all baselines (proprietary LLMs, reasoning models, dedicated guardrails) are evaluated under identical prompting and inference conditions to support fair comparison.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. The concerns about validating the correctness of the synthetic labels are substantive, and we will strengthen the manuscript with additional human validation experiments in the revision.
Point-by-point responses
Referee: [Experiments] Experiments section: the headline outperformance claim rests on labels produced by multi-agent debate, yet the manuscript reports no held-out human-annotated subset, no inter-annotator agreement with debate outputs, and no direct comparison against a gold-standard labeling process. This is load-bearing; systematic errors in debate on boundary cases could cause the fine-tuned model to inherit the same error distribution rather than demonstrate genuine superiority.
Authors: We agree that direct human validation of the debate-generated labels is important for substantiating the outperformance claims. The evaluation test sets are human-annotated for each custom policy, but we did not include a separate held-out human review of the synthetic training labels. In the revised manuscript, we will add a human annotation study on a random sample of the generated data (stratified across policies and dimensions). We will report inter-annotator agreement and agreement rates between human labels and the asymmetric debate outputs, providing a direct assessment of label fidelity and any systematic errors on boundary cases. revision: yes
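The stratified audit the authors propose (a fixed number of generated examples per policy-dimension stratum, sent for human review) can be sketched as below. The dataset, policy names, and stratum sizes are invented for illustration; only the sampling logic is the point.

```python
import random
from collections import defaultdict

random.seed(0)  # reproducible audit draw

# Invented synthetic corpus: 10 examples per (policy, dimension) stratum.
synthetic = [
    {"policy": p, "dimension": d, "text": f"{p}/{d} example {i}"}
    for p in ("finance", "medical")
    for d in ("topic", "severity")
    for i in range(10)
]

def stratified_sample(examples, per_stratum=3):
    """Draw the same number of examples from every (policy, dimension)
    stratum, so rare strata are not swamped in the human review."""
    strata = defaultdict(list)
    for ex in examples:
        strata[(ex["policy"], ex["dimension"])].append(ex)
    return [ex for group in strata.values()
            for ex in random.sample(group, per_stratum)]

audit = stratified_sample(synthetic)
print(len(audit))  # 4 strata x 3 examples each
```

Reporting agreement separately per stratum, rather than pooled, is what would surface boundary-case-specific debate errors.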
Referee: [Ablation studies] Ablation studies: while removing dimension decomposition or debate reduces downstream accuracy, the ablations do not include an external validation step (e.g., human review of label correctness on a held-out set). Without this, it is unclear whether the components improve absolute fidelity or merely align the synthetic distribution with the test set's biases.
Authors: The referee is correct that the existing ablations demonstrate impact on downstream accuracy but lack an external check on label quality. To distinguish between genuine fidelity gains and potential alignment with test-set biases, we will extend the ablation section with human review of labels produced by each variant (full BARRED, without decomposition, without debate) on a held-out sample. We will report human agreement metrics for each ablation condition and discuss the results in the revised paper. revision: yes
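The extended ablation audit amounts to computing a human-agreement rate per pipeline variant. A sketch with entirely invented placeholder labels (no numbers here come from the paper); the variant names mirror the conditions the authors list:

```python
# Invented human labels on eight hypothetical held-out examples
# ("v" = violation, "s" = safe), and invented labels from each variant.
human = ["v", "s", "v", "s", "v", "s", "v", "s"]
variant_labels = {
    "full BARRED":           ["v", "s", "v", "s", "v", "s", "v", "v"],
    "without decomposition": ["v", "s", "s", "s", "v", "v", "v", "s"],
    "without debate":        ["v", "v", "s", "s", "s", "v", "v", "s"],
}

def agreement(a, b):
    """Fraction of examples on which two label sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

rates = {name: agreement(human, labels)
         for name, labels in variant_labels.items()}
for name, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rate:.2f}")
```

If the full pipeline's agreement with humans exceeds each ablated variant's, the components improve absolute fidelity; if all variants agree with humans equally but differ on the test set, the test-set-bias explanation gains weight.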
Circularity Check
No circularity; empirical claims rest on external benchmarks and ablations
Full rationale
The paper describes a synthetic data generation pipeline (dimension decomposition + asymmetric debate) and reports downstream fine-tuning results on held-out test sets. No equations, fitted parameters, or self-citations are used to derive the headline performance numbers; the outperformance is measured against independent proprietary models and guardrail baselines. Ablation studies quantify the contribution of each component but do not redefine or presuppose the final accuracy metric. The method stands or falls on external evaluation and does not reduce any claimed result to its own inputs by construction.