Granite Guardian

2024 · arXiv 2412.07724

12 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

BELLS-O: Evaluating the Operational Trade-offs of LLM Supervision Systems

cs.CR · 2026-06-12 · unverdicted · novelty 7.0

BELLS-O is the first vendor-neutral operational benchmark comparing specialized guardrails and repurposed frontier LLMs on accuracy, false-positive rates, speed, and monetary cost across 11 harm categories and 13 jailbreak techniques.

RAGognizer: Hallucination-Aware Fine-Tuning via Detection Head Integration

cs.CL · 2026-04-17 · unverdicted · novelty 7.0

RAGognizer adds a detection head to LLMs for joint training on generation and token-level hallucination detection, yielding SOTA detection and fewer hallucinations in RAG while preserving output quality.

When Safety Geometry Collapses: Fine-Tuning Vulnerabilities in Agentic Guard Models

cs.LG · 2026-04-08 · unverdicted · novelty 6.0

Benign fine-tuning collapses safety geometry in guard models like Granite Guardian, dropping refusal to 0%, but Fisher-Weighted Safety Subspace Regularization restores it to 75% while improving robustness.

Disentangled Safety Adapters Enable Efficient Guardrails and Flexible Inference-Time Alignment

cs.LG · 2025-05-30 · unverdicted · novelty 6.0

Disentangled Safety Adapters decouple safety computations from task-optimized LLMs via lightweight adapters, yielding up to 53% better AUC on safety tasks and dynamic inference-time alignment with reduced performance trade-offs.

Cognitive Firewall: A Proactive, Zero-Trust, Multi-Gate Framework for LLM Safety

cs.CR · 2026-07-01 · unverdicted · novelty 5.0

Cognitive Firewall applies four gates (intent, zero-trust context, consistency, output risk) via an oversight model to cut jailbreak success to 2% or below on most tested sets while keeping over-refusal at 8%.

Structured Security Auditing and Robustness Enhancement for Untrusted Agent Skills

cs.CR · 2026-04-28 · unverdicted · novelty 5.0

SkillGuard-Robust formulates pre-load auditing of untrusted Agent Skills as a three-way classification task and achieves 97.30% exact match and 98.33% malicious-risk recall on held-out benchmarks.

Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs

cs.LG · 2026-04-08 · unverdicted · novelty 5.0

Guardian-as-an-Advisor prepends risk labels and explanations from a guardian model to queries, improving LLM safety compliance and reducing over-refusal while adding minimal compute overhead.

Bielik Guard: Efficient Polish Language Safety Classifiers for LLM Content Moderation

cs.CL · 2026-02-08 · unverdicted · novelty 5.0

Bielik Guard delivers compact Polish safety classifiers with F1 scores near 0.79 and superior real-prompt precision over baselines.

Distilling Safe LLM Systems via Soft Prompts for On Device Settings

cs.LG · 2026-06-08 · unverdicted · novelty 4.0

Soft prompt distillation with total variation and KL divergence transfers safety behaviors from guard models to on-device LLMs and outperforms LoRA adapters, steering vectors, and direct optimization in safety-usefulness trade-offs with minimal inference cost.

Safety Measurements for Fine-tuned LLMs Should be Grounded in Capability

cs.CL · 2026-06-02 · unverdicted · novelty 4.0

Fine-tuned LLMs produce incoherent safety responses and yield benchmark-dependent conclusions unless evaluations are grounded in explicit capability targets.

Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content

cs.LG · 2026-05-28 · unverdicted · novelty 4.0

Opir introduces efficient multi-task encoder models trained on a 996-category safety taxonomy that match or exceed larger baselines on most safety benchmarks while using under 100M parameters for edge variants.

TWGuard: A Case Study of LLM Safety Guardrails for Localized Linguistic Contexts

cs.CR · 2026-04-17 · unverdicted · novelty 4.0

TWGuard achieves +0.289 F1 improvement and 94.9% false-positive reduction for LLM safety guardrails in the Taiwan linguistic context compared to foundation models and baselines.

citing papers explorer

Showing 12 of 12 citing papers.

BELLS-O: Evaluating the Operational Trade-offs of LLM Supervision Systems · cs.CR · 2026-06-12 · unverdicted · none · ref 2
RAGognizer: Hallucination-Aware Fine-Tuning via Detection Head Integration · cs.CL · 2026-04-17 · unverdicted · none · ref 23
When Safety Geometry Collapses: Fine-Tuning Vulnerabilities in Agentic Guard Models · cs.LG · 2026-04-08 · unverdicted · none · ref 6
Disentangled Safety Adapters Enable Efficient Guardrails and Flexible Inference-Time Alignment · cs.LG · 2025-05-30 · unverdicted · none · ref 32
Cognitive Firewall: A Proactive, Zero-Trust, Multi-Gate Framework for LLM Safety · cs.CR · 2026-07-01 · unverdicted · none · ref 4
Structured Security Auditing and Robustness Enhancement for Untrusted Agent Skills · cs.CR · 2026-04-28 · unverdicted · none · ref 16
Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs · cs.LG · 2026-04-08 · unverdicted · none · ref 57
Bielik Guard: Efficient Polish Language Safety Classifiers for LLM Content Moderation · cs.CL · 2026-02-08 · unverdicted · none · ref 16
Distilling Safe LLM Systems via Soft Prompts for On Device Settings · cs.LG · 2026-06-08 · unverdicted · none · ref 45
Safety Measurements for Fine-tuned LLMs Should be Grounded in Capability · cs.CL · 2026-06-02 · unverdicted · none · ref 47
Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content · cs.LG · 2026-05-28 · unverdicted · none · ref 29
TWGuard: A Case Study of LLM Safety Guardrails for Localized Linguistic Contexts · cs.CR · 2026-04-17 · unverdicted · none · ref 12
