MENTOR: A Metacognition-Driven Self-Evolution Framework for Uncovering and Mitigating Implicit Domain Risks in LLMs

Chaochao Lu; Guangze Ye; Guoqing Wang; Jie Zhou; Jingqi Huang; Kaicheng Shen; Liang He; Liang Shan; Qingshan Liu; Wen Wu

arxiv: 2511.07107 · v3 · pith:S6W53TJInew · submitted 2025-11-10 · 💻 cs.AI · cs.CL


Liang Shan, Kaicheng Shen, Wen Wu, Zhenyu Ying, Chaochao Lu, Yan Teng, Jingqi Huang, Qingshan Liu, Guangze Ye, Guoqing Wang, Jie Zhou, Liang He



keywords: mentor, llms, safety, across, dataset, framework, implicit, metacognition-driven


Ensuring the safety of Large Language Models (LLMs) is critical for real-world deployment. However, current safety measures often fail to address implicit, domain-specific risks. To investigate this gap, we introduce a dataset of 3,000 annotated queries spanning education, finance, and management. Evaluations across 14 leading LLMs reveal a concerning vulnerability: an average jailbreak success rate of 57.8%. In response, we propose MENTOR, a metacognition-driven self-evolution framework. MENTOR performs metacognitive self-assessment, using strategies such as perspective-taking and consequential reasoning to uncover latent model misalignments. The resulting reflections are distilled into dynamic rule-based knowledge graphs, from which retrieved rules are converted into activation-level steering signals to guide internal representations during inference. Experiments demonstrate that MENTOR substantially reduces attack success rates across all tested domains and outperforms existing safety alignment methods. The code and dataset for MENTOR are available at: https://anonymous.4open.science/r/MENTOR-Evo.
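The abstract mentions "activation-level steering signals" without giving a formula. As a rough illustration only, the sketch below shows the generic additive form of activation steering, in which a hidden state is shifted along a unit-normalized direction by a scale factor. The function name `steer`, the parameter `alpha`, and the normalization step are assumptions made for this example; nothing here is confirmed to match how MENTOR derives or applies its signals.

```python
from typing import List


def steer(hidden: List[float], direction: List[float], alpha: float) -> List[float]:
    """Shift a hidden state along a steering direction.

    Generic additive steering: h' = h + alpha * d / ||d||.
    Hypothetical helper for illustration, not the paper's implementation.
    """
    if len(hidden) != len(direction):
        raise ValueError("hidden and direction must have the same length")

    # Euclidean norm of the steering direction.
    norm = sum(d * d for d in direction) ** 0.5
    if norm == 0.0:
        # A zero direction carries no steering information; return a copy.
        return list(hidden)

    # Add the scaled unit direction to each component of the hidden state.
    return [h + alpha * d / norm for h, d in zip(hidden, direction)]


if __name__ == "__main__":
    # Toy 2-dimensional example: push the state along the second axis.
    print(steer([1.0, 0.0], [0.0, 2.0], alpha=0.5))
```

In a real model the vectors would be layer activations rather than short Python lists, and the direction would come from whatever the retrieved rules encode; both are outside the scope of this sketch.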
