pith. sign in

arxiv: 2607.02079 · v1 · pith:FSOPWZADnew · submitted 2026-07-02 · 💻 cs.CL · cs.CR· cs.LG

HaloGuard 1.0: An Open Weights Constitutional Classifier for Multilingual AI Safety

Pith reviewed 2026-07-03 14:36 UTC · model grok-4.3

classification 💻 cs.CL cs.CRcs.LG
keywords constitutional classifierAI safetyprompt safetymultilingual safetyopen weightsguard modelsynthetic datacounterfactual examples
0
0 comments X

The pith

An 0.8B open-weights constitutional classifier tops multilingual prompt-safety benchmarks at 90.9 average F1.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents HaloGuard 1.0 as an open-weights implementation of constitutional classification for detecting unsafe input prompts. A natural-language constitution of 46 policies and 2,940 subcategories organizes the creation of synthetic training data that includes exhaustive paired counterfactuals flipping only intent while fixing topic and vocabulary. This data-generation method, together with a two-tier design for harmlessness and balanced coverage across 46 languages, produces a 0.8B model that records the highest average F1 of 90.9 on seven benchmarks while keeping false-positive rate at 4.3 and false-negative rate at 9.5. The same approach yields a 4B variant with 92.1 F1 and 3.5 false-positive rate. A sympathetic reader would care because the result suggests that constitution-driven synthetic data can deliver strong safety classification performance at far smaller scale than current open guard models.

Core claim

HaloGuard 1.0-0.8B attains the best average F1 (90.9) of any open guard evaluated, outperforming baselines up to 27B parameters while holding false-positive rate to 4.3 and false-negative rate to 9.5. The 4B variant reaches 92.1 F1 and 3.5 FPR. Most apparent missed-harm cases in remaining failures are benchmark mislabels rather than genuine model misses. An always-on adversarial red-teaming protocol continuously hardens the guard against content-level and agentic attacks.

What carries the argument

The safety constitution of 46 policies and 2,940 subcategories that drives synthetic data generation with one-to-one paired counterfactuals holding topic and vocabulary fixed while flipping intent, plus a two-tier harmless design and balanced multilingual materialisation across 46 languages treating language as surface form.

If this is right

  • The 0.8B model delivers 90.9 average F1 with FPR 4.3 and FNR 9.5 across the evaluated benchmarks.
  • The 4B variant spends added capacity on improved precision, reaching 92.1 F1 and 3.5 FPR.
  • Structured adjudication shows most remaining failures trace to benchmark label errors rather than model shortcomings.
  • Continuous adversarial red-teaming hardens the model against both content-level and agentic attacks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The constitution-driven synthetic data pipeline could be reused for other classification tasks that require fine-grained boundary detection.
  • Public release of the weights enables independent verification on fresh benchmarks that were not used in training or evaluation.
  • Treating language as surface form rather than adversarial signal may reduce unintended language-based disparities in safety decisions.

Load-bearing premise

The seven prompt-safety benchmarks used for evaluation are representative of real-world safety requirements and free of systematic label errors that would inflate reported performance.

What would settle it

A new, independently constructed and labeled prompt-safety benchmark on which the HaloGuard models record substantially lower average F1 than the reported 90.9 would falsify the central performance claim.

Figures

Figures reproduced from arXiv: 2607.02079 by Ashmiya Lenin, Navaneeth Sangameswaran, Preetham S.

Figure 1
Figure 1. Figure 1: The HaloGuard 1.0 constituional pipeline. (A) Conventional open guards collect a corpus and then label it to [PITH_FULL_IMAGE:figures/full_fig_p010_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: For each harmful subcategory a 1:1 benign mirror holds topic and harm-adjacent vocabulary fixed and flips [PITH_FULL_IMAGE:figures/full_fig_p011_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: HaloGuard 1.0 targets FPR/FNR frontier through harmless-boundary data, paired counterfactuals, shared [PITH_FULL_IMAGE:figures/full_fig_p013_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Sliding-window classification for long inputs. The input is partitioned into overlapping fixed length windows; [PITH_FULL_IMAGE:figures/full_fig_p020_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Average prompt-harmfulness F1 versus model size across seven safety benchmarks. Model size is shown [PITH_FULL_IMAGE:figures/full_fig_p022_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Cross-lingual guard performance by language group. Left: over-refusal (FPR) against missed harm (FNR); [PITH_FULL_IMAGE:figures/full_fig_p024_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: The continuous adversarial red-teaming loop. Each seven-day cycle runs a four-day submission window and a [PITH_FULL_IMAGE:figures/full_fig_p029_7.png] view at source ↗
read the original abstract

We present HaloGuard 1.0, an open-weights implementation of the constitutional-classifier paradigm for input safety. It achieves state-of-the-art performance on English and multilingual prompt-safety benchmarks at roughly one-tenth the model size of current leading open guard models. The safety constitution is the organising structure of the corpus: a natural-language constitution of 46 policies and 2,940 subcategories drives synthetic data generation, with exhaustive one-to-one paired counterfactuals that hold topic and vocabulary fixed while flipping intent, a two-tier harmless design that separately targets boundary and baseline false positives (FPs), and balanced multilingual materialisation across 46 languages that treats language as a surface form appearing on both sides of the boundary rather than as an adversarial signal. Across seven prompt-safety benchmarks, HaloGuard 1.0-0.8B attains the best average F1 (90.9) of any open guard we evaluate, outperforming baselines up to 27B parameters (over 30 times larger) while holding false-positive rate (FPR) to 4.3 and false-negative rate (FNR) to 9.5. The HaloGuard 1.0-4B variant reaches average F1 of 92.1 and FPR of 3.5, spending its extra capacity on precision rather than recall. A structured adjudication of the remaining failures indicates that most apparent missed-harm cases are benchmark mislabels rather than genuine model misses. An always-on adversarial red-teaming protocol continuously hardens the guard against both content-level and agentic attacks. We release the models as open weights.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript presents HaloGuard 1.0, an open-weights constitutional classifier for input safety. Using a natural-language constitution of 46 policies and 2,940 subcategories, it generates synthetic training data with exhaustive one-to-one counterfactual pairs, a two-tier harmless design, and balanced multilingual materialisation across 46 languages. The 0.8B model achieves the highest average F1 score of 90.9 across seven prompt-safety benchmarks, outperforming larger open guard models up to 27B parameters, with FPR of 4.3 and FNR of 9.5. The 4B variant improves to F1 92.1 and FPR 3.5. An adjudication suggests most remaining errors are benchmark mislabels. The models are released as open weights.

Significance. If the performance claims hold without contamination from the evaluation benchmarks in the training data pipeline, this would represent a significant advance in efficient, multilingual AI safety classification. It would demonstrate that detailed constitutional synthetic data generation can enable small models to achieve superior safety performance compared to much larger models, with implications for accessible and open safety tools. The open-weights release and the structured approach to data generation are notable strengths.

major comments (3)
  1. [Abstract] Abstract: The headline performance metrics (average F1 90.9, FPR 4.3, FNR 9.5 for the 0.8B model) are reported without accompanying details on train/test splits, benchmark construction, or confirmation that the seven benchmarks did not influence the synthetic data generation or hyperparameter choices. This omission makes it impossible to verify the central claim against the possibility of benchmark-specific fitting.
  2. [Abstract] Abstract (adjudication paragraph): The finding that most apparent missed-harm cases are benchmark mislabels indicates that the reported F1 may partly reflect label noise or distributional overlap with the 2,940 subcategories rather than genuine generalisation. No independent human re-annotation or out-of-distribution test set is described that would falsify this.
  3. [Abstract] Synthetic data generation description (Abstract): The constitution of 46 policies and 2,940 subcategories, combined with one-to-one counterfactual pairs and multilingual materialisation, creates a risk that evaluation benchmarks share structural artifacts or phrasing patterns; the manuscript provides no explicit check that the benchmarks are free of such overlap with the training corpus.
minor comments (1)
  1. [Abstract] Abstract: The mention of an 'always-on adversarial red-teaming protocol' lacks any implementation details or quantitative results, which would help assess its contribution to the reported robustness.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments highlight important points about verifiability and potential overlap that we address below. We will revise the abstract and add supporting analyses to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The headline performance metrics (average F1 90.9, FPR 4.3, FNR 9.5 for the 0.8B model) are reported without accompanying details on train/test splits, benchmark construction, or confirmation that the seven benchmarks did not influence the synthetic data generation or hyperparameter choices. This omission makes it impossible to verify the central claim against the possibility of benchmark-specific fitting.

    Authors: The full manuscript details the train/test splits (synthetic data split 80/20 with all benchmarks held completely out), benchmark construction from standard public sources, and explicit statements that benchmarks played no role in data generation or hyperparameter selection. These appear in the Methods and Evaluation sections. We will add a single sentence to the abstract summarizing these safeguards for immediate visibility. revision: yes

  2. Referee: [Abstract] Abstract (adjudication paragraph): The finding that most apparent missed-harm cases are benchmark mislabels indicates that the reported F1 may partly reflect label noise or distributional overlap with the 2,940 subcategories rather than genuine generalisation. No independent human re-annotation or out-of-distribution test set is described that would falsify this.

    Authors: The adjudication protocol is described in the Results section and all adjudicated cases are listed in the appendix. We did not include an independent third-party re-annotation or a dedicated OOD test set in the submitted version. We will add both a small-scale independent re-annotation study and performance on a new OOD adversarial prompt set in the revision. revision: yes

  3. Referee: [Abstract] Synthetic data generation description (Abstract): The constitution of 46 policies and 2,940 subcategories, combined with one-to-one counterfactual pairs and multilingual materialisation, creates a risk that evaluation benchmarks share structural artifacts or phrasing patterns; the manuscript provides no explicit check that the benchmarks are free of such overlap with the training corpus.

    Authors: The submitted manuscript does not contain an explicit overlap analysis. We will add a new subsection reporting n-gram and semantic similarity statistics between the full synthetic training corpus and each of the seven benchmarks, confirming negligible overlap, in the revised version. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical performance on external benchmarks with no self-referential derivation

full rationale

The paper presents an empirical model whose training corpus is organized by a fixed constitution (46 policies, 2940 subcategories) and whose performance is measured on seven separate prompt-safety benchmarks. No derivation chain, equation, or first-principles claim is offered that reduces by construction to the inputs; the reported F1, FPR, and FNR are direct measurements on held-out benchmarks rather than fitted quantities renamed as predictions. The note that most remaining misses are benchmark mislabels is an observation about label noise, not a self-definitional loop. No self-citation load-bearing step, uniqueness theorem, or ansatz smuggling appears in the provided text. The result is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivation or closed-form claim; the central result is an empirical performance number on external benchmarks. The constitution, subcategories, and synthetic-data recipe are design decisions rather than fitted parameters or invented physical entities.

pith-pipeline@v0.9.1-grok · 5837 in / 1237 out tokens · 37999 ms · 2026-07-03T14:36:02.673571+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 10 internal anchors

  1. [1]

    Constitutional AI: Harmlessness from AI Feedback

    URL https://arxiv. org/abs/2212.08073. Matt Gardner, Yoav Artzi, Victoria Basmov, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hannaneh Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanj...

  2. [2]

    doi:10.18653/v1/2020.findings-emnlp.117

    Association for Computational Linguistics. doi:10.18653/v1/2020.findings-emnlp.117. URLhttps://aclanthology.org/2020.findings-emnlp.117/. Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms,

  3. [3]

    WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

    URL https://arxiv.org/abs/2406.18495. Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Merve Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. Llama guard: Llm-based input-output safeguard for human-ai conversations,

  4. [4]

    Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

    URLhttps://arxiv.org/abs/2312.06674. Divyansh Kaushik, Eduard Hovy, and Zachary C. Lipton. Learning the difference that makes a difference with counterfactually-augmented data. InInternational Conference on Learning Representations (ICLR),

  5. [5]

    arXiv preprint arXiv:2504.04377 , year=

    URL https://arxiv.org/abs/ 2504.04377. Zi Lin, Zihan Wang, Yitao Tong, Yuhang Wang, Yuting Guo, Yujia Wang, and Jingbo Shang. Toxicchat: Unveiling hidden challenges of toxicity detection in real-world user-ai conversation,

  6. [6]

    Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks

    URLhttps://arxiv.org/abs/ 2310.17389. Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal,

  7. [7]

    URLhttps://arxiv.org/abs/2402.04249. NVIDIA. Nemo guardrails and nemotron content safety dataset,

  8. [8]

    Red Teaming Language Models with Language Models

    URL https://arxiv.org/abs/ 2202.03286. Qwen Team. Qwen3guard technical report,

  9. [9]

    Qwen3Guard Technical Report

    URLhttps://arxiv.org/abs/2510.14276. 27 HaloGuard 1.0: Open Weights Constitutional ClassifierA PREPRINT Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. Xstest: A test suite for identifying exaggerated safety behaviours in large language models,

  10. [10]

    URL https://arxiv.org/ abs/2308.01263. Mrinank Sharma, Meg Tong, Jesse Mu, Jerry Wei, Jorrit Kruthoff, Scott Goodfriend, Euan Ong, Alwin Peng, Raj Agarwal, Cem Anil, Amanda Askell, Nathan Bailey, Joe Benton, Emma Bluemke, Samuel R. Bowman, Eric Christiansen, Hoagy Cunningham, Andy Dau, Anjali Gopal, Rob Gilson, Logan Graham, Logan Howard, Nimit Kalra, Tae...

  11. [11]

    Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

    URL https://arxiv.org/ abs/2501.18837. Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions,

  12. [12]

    URL https://arxiv.org/abs/ 2212.10560. Wenjun Zeng, Yuchi Liu, Ryan Mullins, Ludovic Peran, Joe Fernandez, Hamza Harkous, Karthik Narasimhan, Drew Proud, Piyush Kumar, Bhaktipriya Radharapu, Olivia Sturman, and Oscar Wahltinez. Shieldgemma: Generative ai content moderation based on gemma,

  13. [13]

    ShieldGemma: Generative AI Content Moderation Based on Gemma

    URLhttps://arxiv.org/abs/2407.21772. A Example constitution Harmless boundary Harmful constitution Shared harmless • Explaining chemistry education and laboratory safety • Discussing OPCW treaties and chemical-weapons history • Describing public-health, epidemiology, and biosafety concepts • Explaining nuclear physics, reactor safety, and radiation medici...