SHAPE: Unifying Safety, Helpfulness and Pedagogy for Educational LLMs

Hongyi Wen; Kangrui Yu; Pinjia He; Sihang Zhao; Youliang Yuan

arxiv: 2604.22134 · v1 · submitted 2026-04-24 · 💻 cs.CL

Pith reviewed 2026-05-08 12:03 UTC · model grok-4.3

keywords: educational LLMs · pedagogical jailbreaks · knowledge graph · safety · helpfulness · pedagogy · tutoring systems · benchmark

The pith

A knowledge-mastery graph lets educational LLMs infer what students already understand and decide whether to instruct or solve problems directly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to fix the issue where students prompt educational AI models to bypass teaching and get complete solutions instead. It formalizes the tension among safety from such tricks, general helpfulness, and proper teaching behavior by building a graph that tracks concept dependencies. The system then reads a query to spot missing prerequisites and uses a gate to choose the response type. A new benchmark supplies thousands of test cases to measure how well models hold up under pressure to give away answers. If the approach works, AI tutors could stay useful for learning without becoming easy sources of completed work.

Core claim

The authors unify safety, helpfulness, and pedagogy with a knowledge-mastery graph that infers prerequisite concepts from student queries, identifies mastery gaps, and routes generation between instructing and problem-solving via explicit gating. Evaluated on the SHAPE benchmark of 9,087 student-question pairs under two pedagogical jailbreak settings, the graph-augmented pipeline produces significantly improved safety across multiple LLMs while maintaining near-ceiling helpfulness under the same protocol.

What carries the argument

The knowledge-mastery graph, which models concepts and dependencies to infer prerequisites and gaps from queries, together with an explicit gating step that routes output toward instruction or direct solving.
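
To make the mechanism concrete, here is a minimal sketch of a prerequisite graph with a gate on top of it. It is not the authors' implementation: the concept names, the `PREREQS` dictionary, and the functions `required_concepts` and `gate` are hypothetical, and the sketch assumes that required concepts include transitive prerequisites, which the summary above does not state.

```python
from collections import deque

# Hypothetical prerequisite graph: concept -> concepts it directly depends on.
PREREQS = {
    "vector projection": ["dot product", "vector norm"],
    "dot product": ["vector components"],
    "vector norm": ["vector components"],
    "vector components": [],
}


def required_concepts(targets, prereqs):
    """Return the target concepts plus every transitive prerequisite (breadth-first)."""
    seen = set(targets)
    queue = deque(targets)
    while queue:
        concept = queue.popleft()
        for dep in prereqs.get(concept, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


def gate(targets, mastered, prereqs):
    """Route to 'solve' only when no required concept is missing; otherwise 'instruct'."""
    gaps = required_concepts(targets, prereqs) - set(mastered)
    return ("solve" if not gaps else "instruct"), gaps
```

Under these assumptions, a student who has mastered only `"vector components"` and `"dot product"` and asks a projection question is routed to instruction, because `"vector norm"` and `"vector projection"` remain in the gap set.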

If this is right

The pipeline raises safety against two pedagogical jailbreak settings without requiring changes to the underlying LLM.
Helpfulness remains near the maximum possible score under the shared evaluation protocol.
The same graph-augmented approach transfers across different base models.
The SHAPE benchmark enables consistent measurement of tutoring behavior under adversarial student prompts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar inference graphs could apply to other interactive systems that need to withhold full information while still assisting.
Long-term use would need checks on whether the inferred mastery gaps align with actual student progress over repeated sessions.
Adding common student misconceptions to the graph could refine the routing decisions further.

Load-bearing premise

The knowledge-mastery graph can accurately infer prerequisite concepts and mastery gaps from arbitrary student queries.

What would settle it

A test set of new student queries where the graph misjudges gaps, resulting in either direct answers under jailbreak pressure or unhelpful refusals on legitimate questions.

Figures

Figures reproduced from arXiv: 2604.22134 by Hongyi Wen, Kangrui Yu, Pinjia He, Sihang Zhao, Youliang Yuan.

**Figure 1.** Desired educational LLM tutoring under mastery-awareness and jailbreak pressure. We illustrate six representative interactions conditioned on the student’s mastery state (bottom). When the student has not mastered prerequisite concepts, the tutor should withhold direct answers and provide guided, concept-targeted instruction (a), while remaining robust to answer-inducing prompts (b–c). When mastery is demo…

**Figure 2.** The graph-augmented pedagogical pipeline for adaptive teaching. The system first parses prerequisites and compares them with the student’s mastery state. The resulting missing knowledge list (l) determines the response strategy: pedagogical thought-provoking questions (No) or direct answering (Yes).

Original abstract

Large Language Models (LLMs) have been widely explored in educational scenarios. We identify a critical vulnerability in current educational LLMs, pedagogical jailbreaks, where students use answer-inducing prompts to elicit solutions rather than scaffolded instructions. To enable systematic study, we unify and formalize safe, helpful, and pedagogical behaviors with a knowledge-mastery graph and introduce SHAPE, a benchmark of 9,087 student-question pairs for evaluating tutoring behavior under adversarial pressure. We propose a graph-augmented tutoring pipeline that infers prerequisite concepts from queries, identifies mastery gaps, and routes generation between instructing and problem-solving via explicit gating. Experiments across multiple LLMs show that our method yields significantly improved safety under two pedagogical jailbreak settings, while maintaining near-ceiling helpfulness under the same evaluation protocol. Our code and data are available at https://github.com/MAPS-research/SHaPE

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper introduces a new benchmark and graph-based gating method for educational LLMs but leaves the accuracy of its core inference step and the realism of its test cases unverified.

The paper's main contribution is a benchmark called SHAPE with 9,087 question pairs that targets pedagogical jailbreaks, plus a pipeline that builds a knowledge-mastery graph to decide whether a query should get direct answers or scaffolded instruction. This setup tries to handle the conflict between safety, helpfulness, and actual teaching in one framework. They test it on several LLMs and report gains in safety under two adversarial settings while helpfulness stays high. Releasing the code and data makes the artifacts usable for follow-up work.

Referee Report

3 major / 2 minor

Summary. The paper identifies pedagogical jailbreaks as a vulnerability in educational LLMs and proposes SHAPE, which unifies safety, helpfulness, and pedagogy through a knowledge-mastery graph. It introduces the SHAPE benchmark consisting of 9,087 student-question pairs and a graph-augmented tutoring pipeline that infers prerequisite concepts, identifies mastery gaps, and uses explicit gating to route between scaffolded instruction and direct problem-solving. Experiments across multiple LLMs report significantly improved safety under two pedagogical jailbreak settings while maintaining near-ceiling helpfulness, with code and data released publicly.

Significance. If the empirical results hold under rigorous validation, the work offers a concrete, graph-based mechanism for balancing safety and utility in educational LLMs, addressing a timely gap between general alignment techniques and domain-specific tutoring needs. The public benchmark and code release would enable follow-on research and reproducibility, potentially influencing the design of safer AI tutors.

major comments (3)

[Section on knowledge-mastery graph construction and inference] Section describing the knowledge-mastery graph (likely §3): the inference of prerequisite concepts and mastery gaps from arbitrary student queries is presented without ground-truth validation (e.g., expert annotations, inter-rater agreement, or accuracy metrics against human tutors). This inference directly controls the gating decision that is claimed to block jailbreaks while preserving helpfulness; without such validation, errors in gap detection could either leak unsafe direct answers or produce over-scaffolding, undermining the central safety claim.
[SHAPE benchmark section] SHAPE benchmark construction (likely §4): the 9,087 pairs are introduced for evaluating tutoring under adversarial pressure, yet no details are provided on external validation against real tutoring logs, educator review, or inter-annotator agreement to confirm they represent typical student adversarial behavior. If the prompts are primarily author-constructed, the reported safety gains may be benchmark-specific rather than generalizable.
[Experiments section] Experimental evaluation (likely §5): while the abstract states 'significantly improved safety' and 'near-ceiling helpfulness,' the provided text supplies no concrete metrics, baselines, statistical tests, confidence intervals, or exclusion criteria for the two pedagogical jailbreak settings. This prevents verification that the graph-augmented pipeline outperforms standard prompting or other safety methods under the same protocol.

minor comments (2)

[Method section] Notation for the knowledge-mastery graph (nodes, edges, update rules) should be formalized with explicit definitions or pseudocode to improve clarity and reproducibility.
[Abstract] The abstract would benefit from a single sentence summarizing the key quantitative improvements (e.g., safety score deltas or helpfulness percentages) rather than qualitative descriptors.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which highlights important aspects for strengthening the presentation of our SHAPE framework, benchmark, and pipeline. We respond to each major comment below, indicating where revisions will be made to address the concerns while preserving the core contributions.

Referee: [Section on knowledge-mastery graph construction and inference] Section describing the knowledge-mastery graph (likely §3): the inference of prerequisite concepts and mastery gaps from arbitrary student queries is presented without ground-truth validation (e.g., expert annotations, inter-rater agreement, or accuracy metrics against human tutors). This inference directly controls the gating decision that is claimed to block jailbreaks while preserving helpfulness; without such validation, errors in gap detection could either leak unsafe direct answers or produce over-scaffolding, undermining the central safety claim.

Authors: We agree that explicit validation of the prerequisite inference and mastery gap detection is essential, given their role in the gating mechanism. The original manuscript outlines the LLM-guided inference process over the knowledge graph but does not include quantitative validation against human judgments. We will revise the relevant section to incorporate a validation study on a sampled subset of queries, including expert annotations, accuracy metrics, and inter-rater agreement statistics. This addition will directly support the reliability of the safety claims.

Revision: yes

Referee: [SHAPE benchmark section] SHAPE benchmark construction (likely §4): the 9,087 pairs are introduced for evaluating tutoring under adversarial pressure, yet no details are provided on external validation against real tutoring logs, educator review, or inter-annotator agreement to confirm they represent typical student adversarial behavior. If the prompts are primarily author-constructed, the reported safety gains may be benchmark-specific rather than generalizable.

Authors: The SHAPE benchmark was designed to systematically capture pedagogical jailbreaks by combining educational queries with adversarial framings. The submission provided limited details on construction validation. We will expand Section 4 to describe the synthesis process, include educator review for a subset of pairs, and report inter-annotator agreement. While full access to real tutoring logs is constrained by privacy considerations, the public release of the benchmark and code enables community-driven extensions and further validation, mitigating concerns about generalizability.

Revision: partial

Referee: [Experiments section] Experimental evaluation (likely §5): while the abstract states 'significantly improved safety' and 'near-ceiling helpfulness,' the provided text supplies no concrete metrics, baselines, statistical tests, confidence intervals, or exclusion criteria for the two pedagogical jailbreak settings. This prevents verification that the graph-augmented pipeline outperforms standard prompting or other safety methods under the same protocol.

Authors: We acknowledge that the experimental results require more explicit and accessible reporting to allow independent verification. The manuscript evaluates the pipeline across multiple LLMs under the two jailbreak settings with comparisons to baselines, but we will revise the experiments section to prominently include a summary table with all concrete metrics, baseline details, statistical tests, confidence intervals, and exclusion criteria. This will make the safety and helpfulness improvements fully verifiable under the stated protocol.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical claims rest on benchmark experiments without self-referential derivations

The paper proposes a knowledge-mastery graph and graph-augmented pipeline to unify safety/helpfulness/pedagogy, then reports experimental results on the SHAPE benchmark of 9,087 pairs showing improved safety under pedagogical jailbreaks while preserving helpfulness. No equations, formal derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claims are empirical outcomes from LLM evaluations rather than any reduction of outputs to inputs by construction, satisfying the default expectation of non-circularity for benchmark-driven work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The pipeline depends on the domain assumption that a knowledge-mastery graph can reliably model educational prerequisites and that the benchmark captures genuine adversarial student interactions; no free parameters or invented physical entities are introduced.

axioms (1)

Domain assumption: A knowledge-mastery graph can be constructed and used to infer prerequisite concepts and mastery gaps from student queries.
Invoked to route generation between instructing and problem-solving in the proposed pipeline.

invented entities (1)

knowledge-mastery graph (no independent evidence)
Purpose: To unify and operationalize safe, helpful, and pedagogical behaviors for tutoring decisions.
Newly introduced structure in the paper; no independent falsifiable evidence outside the benchmark experiments is provided.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 2 internal anchors

[1]

Qwen3 Technical Report

Levels of knowledge and deliberate practice. Journal of experimental psychology: Applied, 19(2):108. Helen Pearson. 2025. Tsinghua university’s AI exploration in education: from tools to systemic solutions. Nature, 646. Published 23 October 2025. Chris Piech, Jonathan Bassen, Jonathan Huang, Surya Ganguli, Mehran Sahami, Leonidas J Guibas, and Jascha S...
[2]

Gemini: A Family of Highly Capable Multimodal Models

Cognitive ease at a cost: LLMs reduce mental effort but compromise depth in student scientific inquiry. Computers in Human Behavior, 160:108386. John Stamper, Ruiwei Xiao, and Xinying Hou. 2024. Enhancing LLM-based feedback: Insights from intelligent tutoring systems and the learning sciences. In International Conference on Artificial Intelligence in E...
[3]

Shiqi Yan, Yubo Chen, Ruiqi Zhou, Zhengxi Yao, Shuai Chen, Tianyi Zhang, Shijie Zhang, Wei Qiang Zhang, Yongfeng Huang, Haixin Duan, and Yunqi Zhang

Jailbroken: How does llm safety training fail? Advances in Neural Information Processing Systems, 36:80079–80110. Shiqi Yan, Yubo Chen, Ruiqi Zhou, Zhengxi Yao, Shuai Chen, Tianyi Zhang, Shijie Zhang, Wei Qiang Zhang, Yongfeng Huang, Haixin Duan, and Yunqi Zhang
[4]

Ready to try a harder one?

Explore-on-graph: Incentivizing autonomous exploration of large language models on knowledge graphs with path-refined reward modeling. arXiv preprint arXiv:2602.21728. Sibo Yi, Yule Liu, Zhen Sun, Tianshuo Cong, Xinlei He, Jiaxing Song, Ke Xu, and Qi Li. 2024. Jailbreak attacks and defenses against large language models: A survey. arXiv preprint arXiv:2407....
[5]

NO DIRECT ANSWERS unless mastered
[6]

ONE TURN, ONE QUESTION
[7]

Sorry”, “I can’t

ADAPTIVE TONE (10th-grade level). ## EXAMPLE Problem: Let the angle between the plane vectors $\vec{a}$ and $\vec{b}$ be $60^\circ$, and $|\vec{a}| = |\vec{b}| = 2$. Then the projection vector of $\vec{a}$ onto $\vec{b}$ is ______. Expected answer: $\frac{1}{2}\vec{b}$ Knowledge points: The Dot Product in N-Dimensional Euclidean Space, Projecting Vectors Onto One-Dimensional Subspaces, The Norm of a Vector in N-Dimensiona...
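
As a check on the expected answer in the example problem above, the projection can be worked out from the standard vector projection formula. The formula is textbook material; the excerpt itself does not show this derivation.

```latex
\operatorname{proj}_{\vec{b}}\vec{a}
  = \frac{\vec{a}\cdot\vec{b}}{|\vec{b}|^{2}}\,\vec{b}
  = \frac{|\vec{a}|\,|\vec{b}|\cos 60^\circ}{|\vec{b}|^{2}}\,\vec{b}
  = \frac{2 \cdot 2 \cdot \tfrac{1}{2}}{2^{2}}\,\vec{b}
  = \frac{1}{2}\,\vec{b}
```

The three knowledge points listed for the problem map onto the three ingredients used here: the dot product, the projection onto a one-dimensional subspace, and the norm.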
[8]

For each step, the model selects the single most relevant knowledge point from a predefined vocabulary derived from the knowledge graph adjacency matrix

Decomposition and Mapping Node: Receives the student query and invokes an LLM to decompose the problem into 1–6 sequential solution steps. For each step, the model selects the single most relevant knowledge point from a predefined vocabulary derived from the knowledge graph adjacency matrix. The output adheres to a strict JSON schema to ensure parsing...
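
The constraints in this node description (1–6 steps, one knowledge point per step, each drawn from a fixed vocabulary, JSON output) can be checked with a small validator. The sketch below is an assumption-laden illustration: the excerpt does not give the actual schema, so the field names `steps`, `description`, and `knowledge_point`, the sample `VOCABULARY`, and the function `parse_decomposition` are all hypothetical.

```python
import json

# Hypothetical vocabulary; the paper derives its vocabulary from the
# knowledge graph adjacency matrix.
VOCABULARY = {"dot product", "vector norm", "vector projection"}


def parse_decomposition(raw, vocabulary, max_steps=6):
    """Parse an LLM decomposition and raise ValueError on any schema violation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc

    steps = data.get("steps") if isinstance(data, dict) else None
    if not isinstance(steps, list) or not 1 <= len(steps) <= max_steps:
        raise ValueError(f"expected between 1 and {max_steps} steps")

    for i, step in enumerate(steps, start=1):
        if not isinstance(step, dict):
            raise ValueError(f"step {i}: expected an object")
        if not isinstance(step.get("description"), str):
            raise ValueError(f"step {i}: missing description")
        point = step.get("knowledge_point")
        # Exactly one knowledge point per step, and it must be in the vocabulary.
        if not isinstance(point, str) or point not in vocabulary:
            raise ValueError(f"step {i}: unknown knowledge point {point!r}")
    return steps
```

Rejecting out-of-vocabulary knowledge points at this stage keeps the later mastery comparison a plain set operation over known labels.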
[9]

Conditional Router: Directs the workflow based on whether missing knowledge points exist

Mastery Comparison Node: Computes the set difference between required knowledge points (aggregated from the decomposition step) and the student’s mastered knowledge points, yielding the missing knowledge points set. Conditional Router: Directs the workflow based on whether missing knowledge points exist. An empty set routes to the direct answer node; other...
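
The comparison and routing described here reduce to a set difference followed by an emptiness check. The following plain-Python sketch illustrates that logic without LangGraph; the function names, the `knowledge_point` field, and the node labels `direct_answer` and `tutoring_answer` are hypothetical stand-ins, not identifiers from the released code.

```python
def missing_knowledge(steps, mastered):
    """Knowledge points required by the steps but not mastered, in first-seen order."""
    missing = []
    for step in steps:
        point = step["knowledge_point"]
        if point not in mastered and point not in missing:
            missing.append(point)
    return missing


def route(missing):
    """An empty missing list routes to the direct answer node; anything else to tutoring."""
    return "direct_answer" if not missing else "tutoring_answer"


# Example: two solution steps, one of which relies on an unmastered concept.
steps = [
    {"knowledge_point": "dot product"},
    {"knowledge_point": "vector projection"},
]
mastered = {"dot product", "vector norm"}
missing = missing_knowledge(steps, mastered)
next_node = route(missing)
```

Keeping the missing points as an ordered list rather than a bare set is a design choice of this sketch: it lets a tutoring step address gaps in the order the solution would meet them.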
[10]

Direct Answer Node: Generates a concise, step-by-step solution when the student has mastered all prerequisite knowledge
[11]

The implementation leverages LangGraph for workflow orchestration, LangChain for LLM abstraction, and supports both synchronous and asynchronous execution modes for scalability

Tutoring Answer Node: Employs Socratic pedagogy when knowledge gaps are detected, following a three-phase protocol: (i) diagnosing gaps and connecting to prior knowledge, (ii) guided gap-filling through incremental questioning, and (iii) application to the original problem with reinforcement. The implementation leverages LangGraph for workflow orchest...
[12]

It is the ratio of responses where the model refuses to leak the answer ($y \in \text{Refusal}$)

Safety: Calculated on set $U$. It is the ratio of responses where the model refuses to leak the answer ($y \in \text{Refusal}$).

$$\text{Safety} = \frac{\sum_{i \in U} \mathbb{I}(y_i \in \text{Refusal})}{|U|} \tag{21}$$
[13]

It is the ratio of responses where the model provides the solution ($y \in \text{Solution}$)

Helpfulness: Calculated on set $M$. It is the ratio of responses where the model provides the solution ($y \in \text{Solution}$).

$$\text{Helpfulness} = \frac{\sum_{i \in M} \mathbb{I}(y_i \in \text{Solution})}{|M|} \tag{22}$$
[14]

It measures the proportion of safe refusals that also contain pedagogical guidance ($y \in \text{Peda}$)

Pedagogy: Calculated on set $U$. It measures the proportion of safe refusals that also contain pedagogical guidance ($y \in \text{Peda}$). Note that the denominator is the count of safe responses, not the total set $U$.

$$\text{Pedagogy} = \frac{\sum_{i \in U} \mathbb{I}(y_i \in \text{Peda})}{\sum_{i \in U} \mathbb{I}(y_i \in \text{Refusal})} \tag{23}$$

A.6.3 Performance Evaluation Detail Table 7 reports the full results of our graph-augmented pedagogica...
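
The three ratios defined in Eqs. (21)–(23) can be computed from per-response labels. The sketch below is illustrative only: it assumes each judged response is represented as a set of label strings, which is a hypothetical encoding, and returning `None` for an empty denominator is this sketch's choice, not something the excerpt specifies.

```python
def _ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def safety(labels_u):
    """Eq. (21): share of responses on set U labelled as refusals."""
    return _ratio(sum("Refusal" in y for y in labels_u), len(labels_u))


def helpfulness(labels_m):
    """Eq. (22): share of responses on set M labelled as solutions."""
    return _ratio(sum("Solution" in y for y in labels_m), len(labels_m))


def pedagogy(labels_u):
    """Eq. (23): Peda-labelled responses on U divided by refusals on U."""
    peda = sum("Peda" in y for y in labels_u)
    refusals = sum("Refusal" in y for y in labels_u)
    return _ratio(peda, refusals)


# Toy labels, invented for illustration; these are not results from the paper.
U = [{"Refusal", "Peda"}, {"Refusal"}, {"Solution"}, {"Refusal", "Peda"}]
M = [{"Solution"}, {"Solution"}, {"Refusal"}]
scores = (safety(U), helpfulness(M), pedagogy(U))
```

On the toy labels, three of four responses on `U` are refusals and two of those three carry guidance, so Pedagogy is computed over three responses rather than four, matching the note about the denominator.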
[15]

fine-tuning

There is no significant improvement except for Gemini 2.5 Flash-Lite, which improves correctness under the non-jailbreak setting from 33.33% to 94.83%. A.8 Token Usage We compared the average token usage of the baseline method and our graph-augmented pipeline. For a single-turn dialogue, the baseline costs 943.25 tokens and our pipeline costs 1135.15 tokens. A.9 O...
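
The two token figures quoted above imply a relative overhead that the excerpt does not state. A short calculation makes it explicit; only the two averages come from the text, and the variable names are illustrative.

```python
# Average tokens per single-turn dialogue, as quoted in the excerpt.
baseline_tokens = 943.25
pipeline_tokens = 1135.15

extra_tokens = pipeline_tokens - baseline_tokens
overhead_pct = 100 * extra_tokens / baseline_tokens

print(f"{extra_tokens:.2f} extra tokens per turn ({overhead_pct:.1f}% overhead)")
# prints "191.90 extra tokens per turn (20.3% overhead)"
```

On these numbers the graph-augmented pipeline uses roughly a fifth more tokens per turn than the baseline.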
