ProbeLLM: Automating Principled Diagnosis of LLM Failures

Kehan Guo; Pin-Yu Chen; Stefan Feuerriegel; Xiangliang Zhang; Xiangqi Wang; Yuchen Ma; Yue Huang; Yuexing Hao; Yu Jiang; Yujun Zhou

arxiv: 2602.12966 · v2 · pith:IKJF3F3Tnew · submitted 2026-02-13 · 💻 cs.CL · cs.SE

Yue Huang, Zhengzhe Jiang, Yuchen Ma, Yu Jiang, Xiangqi Wang, Yujun Zhou, Yuexing Hao, Kehan Guo, Pin-Yu Chen, Stefan Feuerriegel, Xiangliang Zhang

classification 💻 cs.CL cs.SE

keywords failure · probellm · probing · automated · discovery · failures · principled · benchmarks

Understanding how and why large language models (LLMs) fail is becoming a central challenge as models rapidly evolve and static evaluations fall behind. While automated probing has been enabled by dynamic test generation, existing approaches often discover isolated failure cases, lack principled control over exploration, and provide limited insight into the underlying structure of model weaknesses. We propose ProbeLLM, a benchmark-agnostic automated probing framework that elevates weakness discovery from individual failures to structured failure modes. ProbeLLM formulates probing as a hierarchical Monte Carlo Tree Search, explicitly allocating limited probing budgets between global exploration of new failure regions and local refinement of recurring error patterns. By restricting probing to verifiable test cases and leveraging tool-augmented generation and verification, ProbeLLM grounds failure discovery in reliable evidence. Discovered failures are further consolidated into interpretable failure modes via failure-aware embeddings and boundary-aware induction. Across diverse benchmarks and LLMs, ProbeLLM reveals substantially broader, cleaner, and more fine-grained failure landscapes than static benchmarks and prior automated methods, supporting a shift from case-centric evaluation toward principled weakness discovery.
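
The abstract does not say which selection rule the tree search uses to split the probing budget. As a minimal sketch, assuming a standard UCB1 rule (common in Monte Carlo Tree Search) and a flat set of hypothetical failure regions instead of the paper's hierarchy, the trade-off between exploring new regions and refining recurring errors might look like the following. The names `ucb1`, `allocate_budget`, the region labels, and the simulated failure rates are all illustrative assumptions, not part of ProbeLLM.

```python
import math
import random


def ucb1(mean_failure_rate, visits, total_probes, c=1.4):
    """UCB1 score: observed failure rate plus an exploration bonus.

    Assumption: the reward for a probe is 1 if the model fails, else 0.
    """
    if visits == 0:
        return math.inf  # unvisited regions are always tried first
    return mean_failure_rate + c * math.sqrt(math.log(total_probes) / visits)


def allocate_budget(simulated_failure_rates, budget, c=1.4, seed=0):
    """Spend `budget` probes across regions, one probe per step.

    `simulated_failure_rates` stands in for querying a real model:
    each probe in a region fails with the given probability.
    """
    rng = random.Random(seed)
    visits = {region: 0 for region in simulated_failure_rates}
    failures = {region: 0 for region in simulated_failure_rates}

    for step in range(1, budget + 1):
        # Pick the region with the highest UCB1 score.
        region = max(
            visits,
            key=lambda r: ucb1(
                failures[r] / visits[r] if visits[r] else 0.0,
                visits[r],
                step,
                c,
            ),
        )
        visits[region] += 1
        if rng.random() < simulated_failure_rates[region]:
            failures[region] += 1

    return visits, failures


if __name__ == "__main__":
    rates = {"arithmetic": 0.6, "dates": 0.2, "trivia": 0.05}
    visits, failures = allocate_budget(rates, budget=200)
    print(visits)
    print(failures)
```

In this sketch, regions where failures recur keep a high observed failure rate and attract more probes (local refinement), while the bonus term keeps rarely probed regions from being abandoned (global exploration). The constant `c` controls that balance.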

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Chain of Risk: Safety Failures in Large Reasoning Models and Mitigation via Adaptive Multi-Principle Steering
cs.AI · 2026-05 · unverdicted · novelty 6.0

Reasoning traces in large reasoning models expose safety failures missed by final-answer checks, and adaptive multi-principle steering reduces unsafe content in both traces and answers while preserving task performance.

Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs
cs.LG · 2026-04 · unverdicted · novelty 5.0

Guardian-as-an-Advisor prepends risk labels and explanations from a guardian model to queries, improving LLM safety compliance and reducing over-refusal while adding minimal compute overhead.