Safe-Child-LLM: A Developmental Benchmark for Evaluating LLM Safety in Child-LLM Interactions

Abhejay Murali; Amit Dhurandhar; David Atkinson; Junfeng Jiao; Kevin Chen; Saleh Afroogh

Leading LLMs show critical safety shortfalls when tested against child and adolescent users.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · grok-4.3

2026-05-19 09:28 UTC pith:MSSSPMZQ

Load-bearing objection. The paper releases a new age-split benchmark for LLM safety with children but builds it on adult red-teaming prompts without clear developmental grounding.

arxiv 2506.13510 v4 pith:MSSSPMZQ submitted 2025-06-16 cs.CY

Junfeng Jiao, Saleh Afroogh, Kevin Chen, Abhejay Murali, David Atkinson, Amit Dhurandhar

classification cs.CY

keywords LLM safety, child-AI interaction, adversarial benchmark, developmental stages, ethical refusal, red-teaming, generative AI risks

verification ladder T0 review T1 audit T2 compute T3 formal T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Safe-Child-LLM, a benchmark designed to measure how safely large language models handle interactions with children aged 7-12 and adolescents aged 13-17. It supplies a collection of 200 adversarial prompts drawn from existing red-teaming sets and scored by humans on a 0-5 ethical refusal scale. Evaluations of models such as ChatGPT, Claude, Gemini, LLaMA, DeepSeek, Grok, Vicuna, and Mistral reveal repeated failures to refuse harmful or age-inappropriate content in these younger-user scenarios. The authors release the full dataset and evaluation code to support further work on protecting minors. The effort rests on the view that adult-centered safety tests miss the distinct risks children and teens face with generative AI.

Core claim

We introduce Safe-Child-LLM, a benchmark and dataset that evaluates LLM safety across two developmental stages using 200 adversarial prompts with human-annotated jailbreak and 0-5 ethical refusal labels, and we show that leading models exhibit critical safety deficiencies in child-facing scenarios.

What carries the argument

The Safe-Child-LLM multi-part dataset of 200 adversarial prompts, sourced from red-teaming corpora and labeled for jailbreak success plus ethical refusal on a 0-5 scale, applied separately to child and adolescent age groups.

Load-bearing premise

The 200 adversarial prompts and the 0-5 ethical refusal scale together capture the safety risks that actually matter for real children and adolescents.

What would settle it

A controlled study in which real children or adolescents interact with the same models and the models reliably refuse the kinds of harmful requests that the benchmark prompts are meant to represent.

If this is right

Developers must add age-specific refusal mechanisms beyond those used for adult users.
Adult-only safety evaluations leave measurable gaps when models are deployed with minors.
Public release of child-focused adversarial datasets can accelerate community improvements in ethical AI.
Continuous benchmark updates will be needed as new models and prompt techniques emerge.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same prompt set could be adapted to measure safety differences between open-source and closed-source models over time.
Regulators might use similar age-graded tests when setting standards for AI tools in schools or family apps.
Real-world logging of child-AI conversations could provide a stronger validation signal than static prompt sets alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Referee Report

2 major / 2 minor

Summary. The paper presents Safe-Child-LLM, a benchmark and dataset for evaluating LLM safety in interactions with children (ages 7-12) and adolescents (13-17). It consists of 200 adversarial prompts curated from existing red-teaming corpora such as SG-Bench and HarmBench, human-annotated for jailbreak success and scored on a 0-5 ethical refusal scale. The authors evaluate eight leading LLMs (ChatGPT, Claude, Gemini, LLaMA, DeepSeek, Grok, Vicuna, Mistral) and report critical safety deficiencies in child-facing scenarios, while releasing the benchmark and code publicly.

Significance. If the prompts and annotation scale validly capture age-specific risks rather than generic adult jailbreak patterns, the work would provide a useful starting point for community benchmarks in child-AI safety. The public release of datasets and code is a positive contribution to reproducibility in this area.

major comments (2)

[Abstract / Dataset construction] Abstract and dataset description: The 200 prompts are described as 'curated from red-teaming corpora (e.g., SG-Bench, HarmBench)' with no details on adaptation or filtering for developmental stages. Because these source corpora target adult users, it is unclear whether the resulting prompts distinguish risks such as grooming, emotional manipulation, or age-inappropriate self-disclosure that are central to child safety. Without pilot validation against child-development literature or naturalistic logs, the headline claim of 'critical safety deficiencies in child-facing scenarios' rests on an untested assumption that adult-derived adversarial prompts measure the relevant risks.
[Abstract / Results] Evaluation section: The abstract states that deficiencies were found but supplies no quantitative results (e.g., mean refusal scores per model and age group, inter-annotator agreement, or baseline comparisons). This prevents verification of the data-to-claim link and makes it impossible to assess whether the observed deficiencies are statistically or practically significant.
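
The per-model, per-age-group summary this comment asks for could be tabulated along the following lines. This is a minimal sketch, not the authors' evaluation code: the record layout (`model`, `age_group`, `refusal`), the function name `mean_refusal_by_group`, and the sample scores are hypothetical placeholders. Only the 0-5 range of the refusal scale comes from the paper.

```python
from collections import defaultdict
from statistics import mean


def mean_refusal_by_group(records):
    """Return {(model, age_group): mean refusal score on the 0-5 scale}.

    `records` is an iterable of dicts with the hypothetical keys
    "model", "age_group", and "refusal".
    """
    buckets = defaultdict(list)
    for rec in records:
        score = rec["refusal"]
        # Reject anything outside the 0-5 scale described in the paper.
        if not 0 <= score <= 5:
            raise ValueError(f"refusal score out of range: {score}")
        buckets[(rec["model"], rec["age_group"])].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}


# Made-up illustrative records; these are NOT results from the paper.
sample = [
    {"model": "model_a", "age_group": "child", "refusal": 4},
    {"model": "model_a", "age_group": "child", "refusal": 2},
    {"model": "model_a", "age_group": "adolescent", "refusal": 5},
    {"model": "model_b", "age_group": "child", "refusal": 1},
]
summary = mean_refusal_by_group(sample)
```

Reporting a table of this shape alongside the sample size per cell would let a reader judge whether a gap between the child and adolescent groups is more than noise.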

minor comments (2)

[Dataset description] Clarify the exact prompt-selection criteria and any modifications made to the source corpora to target the two developmental stages.
[Annotation procedure] Specify the number of annotators, their qualifications, and the inter-annotator agreement metric for the 0-5 ethical refusal scale.
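
For the agreement statistic requested above, one common choice for two annotators is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is an illustration under stated assumptions: the paper's abstract does not say which metric or how many annotators were used, the function name `cohens_kappa` and the two rating lists are hypothetical, and the unweighted form shown here treats the 0-5 labels as unordered categories.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two annotators rating the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(ratings_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up 0-5 refusal scores for eight responses; not data from the paper.
annotator_1 = [0, 1, 2, 3, 4, 5, 5, 0]
annotator_2 = [0, 1, 2, 3, 4, 5, 4, 1]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Because the refusal scale is ordinal, a weighted kappa or Krippendorff's alpha may be a better fit, since those penalize a 0-versus-5 disagreement more than a 4-versus-5 one.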

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify key aspects of our benchmark's construction and reporting. We address each major comment below and indicate revisions to the manuscript.

Referee: [Abstract / Dataset construction] Abstract and dataset description: The 200 prompts are described as 'curated from red-teaming corpora (e.g., SG-Bench, HarmBench)' with no details on adaptation or filtering for developmental stages. Because these source corpora target adult users, it is unclear whether the resulting prompts distinguish risks such as grooming, emotional manipulation, or age-inappropriate self-disclosure that are central to child safety. Without pilot validation against child-development literature or naturalistic logs, the headline claim of 'critical safety deficiencies in child-facing scenarios' rests on an untested assumption that adult-derived adversarial prompts measure the relevant risks.

Authors: We appreciate the referee's emphasis on ensuring the prompts capture age-specific risks. The 200 prompts were selected from the source corpora by prioritizing adversarial scenarios involving requests for inappropriate content, manipulation, or self-disclosure that could apply to minors, followed by human annotation that incorporated developmental considerations in both jailbreak success labels and the 0-5 ethical refusal scale. That said, the current manuscript provides limited explicit description of the selection and filtering criteria used to adapt prompts for the 7-12 and 13-17 age groups. We will revise the dataset construction section to add these details, including examples of how prompts were reviewed for relevance to child and adolescent vulnerabilities. We also acknowledge the absence of formal pilot validation against child-development literature or naturalistic interaction logs; this benchmark is positioned as an initial community resource, and we will add an explicit limitations discussion noting this gap and the value of such validation in future extensions. revision: partial
Referee: [Abstract / Results] Evaluation section: The abstract states that deficiencies were found but supplies no quantitative results (e.g., mean refusal scores per model and age group, inter-annotator agreement, or baseline comparisons). This prevents verification of the data-to-claim link and makes it impossible to assess whether the observed deficiencies are statistically or practically significant.

Authors: We agree that the abstract would be strengthened by including key quantitative indicators to support the claim of deficiencies. The full manuscript's evaluation section already reports mean refusal scores broken down by model and developmental stage, inter-annotator agreement statistics, and comparisons across the eight evaluated LLMs. To address the referee's point directly, we will revise the abstract to incorporate concise quantitative highlights (e.g., average refusal scores for child vs. adolescent prompts) while respecting length limits, thereby improving the link between data and claims. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark with independent evaluation

The paper introduces Safe-Child-LLM as a benchmark consisting of 200 adversarial prompts curated from external red-teaming corpora (SG-Bench, HarmBench) together with a standard 0-5 human-annotated refusal scale. It then reports direct empirical evaluations of multiple LLMs against this fixed benchmark. No equations, fitted parameters, predictions, or derivations appear in the abstract or described methodology; the central claims rest on straightforward measurement rather than any self-referential construction, self-citation chain, or renaming of prior results. The work is therefore self-contained as an empirical release.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the representativeness of the 200 prompts and the reliability of the human refusal annotations; no free parameters or invented entities are introduced in the abstract.

axioms (1)

Domain assumption. Human annotations on jailbreak success and a 0-5 ethical refusal scale provide a valid proxy for LLM safety with minors.
Used to label the dataset and interpret model outputs.

pith-pipeline@v0.9.0 · 5765 in / 1201 out tokens · 34683 ms · 2026-05-19T09:28:07.901038+00:00 · methodology

Original abstract

As Large Language Models (LLMs) increasingly power applications used by children and adolescents, ensuring safe and age-appropriate interactions has become an urgent ethical imperative. Despite progress in AI safety, current evaluations predominantly focus on adults, neglecting the unique vulnerabilities of minors engaging with generative AI. We introduce Safe-Child-LLM, a comprehensive benchmark and dataset for systematically assessing LLM safety across two developmental stages: children (7-12) and adolescents (13-17). Our framework includes a novel multi-part dataset of 200 adversarial prompts, curated from red-teaming corpora (e.g., SG-Bench, HarmBench), with human-annotated labels for jailbreak success and a standardized 0-5 ethical refusal scale. Evaluating leading LLMs -- including ChatGPT, Claude, Gemini, LLaMA, DeepSeek, Grok, Vicuna, and Mistral -- we uncover critical safety deficiencies in child-facing scenarios. This work highlights the need for community-driven benchmarks to protect young users in LLM interactions. To promote transparency and collaborative advancement in ethical AI development, we are publicly releasing both our benchmark datasets and evaluation codebase at https://github.com/The-Responsible-AI-Initiative/Safe_Child_LLM_Benchmark.git

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Our framework includes a novel multi-part dataset of 200 adversarial prompts, curated from red-teaming corpora (e.g., SG-Bench, HarmBench), with human-annotated labels for jailbreak success and a standardized 0-5 ethical refusal scale."

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Evaluating leading LLMs ... we uncover critical safety deficiencies in child-facing scenarios."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Evaluating Cognitive Age Alignment in Interactive AI Agents
cs.AI 2026-05 unverdicted novelty 7.0
The paper presents ChildAgentEval as the first psychometrically grounded benchmark comparing MLLM-based agents' reasoning performance to age-specific human cognitive stages.

CAREBench: A Child-Safety Risk Benchmark for Language Models
cs.LG 2026-06 unverdicted novelty 6.0
CAREBench is a new benchmark with 500 prompts in 12 risk categories that measures how often frontier LLMs fail to refuse or redirect child-safety risks, reporting failure rates between 2% and 58%.

Child Safety in Generative AI: An Expert-Guided and Incident-Grounded Evaluation Framework
cs.HC 2026-07 unverdicted novelty 5.0
Proposes an expert-guided and incident-grounded framework for child safety evaluation in generative AI and applies it in education to find that Llama Guard models struggle with unsafe prompts.

CR4T: Rewrite-Based Guardrails for Adolescent LLM Safety
cs.CL 2026-05 unverdicted novelty 5.0
CR4T is a model-agnostic framework using lightweight risk detection and domain-conditioned rewriting to convert unsafe or refusal-style LLM responses into developmentally appropriate guidance for adolescents.

LLM Harms: A Taxonomy and Discussion
cs.CY 2025-12 unverdicted novelty 3.0
This paper proposes a taxonomy of LLM harms in five categories and suggests mitigation strategies plus a dynamic auditing system for responsible development.