Model-adaptive tool necessity shows 26-54% mismatch with actual tool calls across LLMs, driven by nearly orthogonal hidden-state signals for cognition versus action.
Do Large Language Models Know What They Don't Know?
4 Pith papers cite this work. Polarity classification is still indexing.
years: 2026 (4)
verdicts: UNVERDICTED (4)
representative citing papers
Frontier LLMs struggle to discriminate data uncertainty from model uncertainty even when accurate, but a new benchmark and lightweight RL strategy improve attribution without sacrificing answer accuracy.
BLINKG is a benchmark for evaluating LLMs on mapping input data schemas to ontology concepts for knowledge graph construction, with experiments showing promising but limited performance in complex real-world scenarios.
LaaB improves LLM hallucination detection by mapping self-judgment labels back into neural feature space and using mutual learning under logical consistency constraints between responses and meta-judgments.
citing papers explorer
- Beyond "I Don't Know": Evaluating LLM Self-Awareness in Discriminating Data and Model Uncertainty
  Frontier LLMs struggle to discriminate data uncertainty from model uncertainty even when accurate, but a new benchmark and lightweight RL strategy improve attribution without sacrificing answer accuracy.
- Logical Consistency as a Bridge: Improving LLM Hallucination Detection via Label Constraint Modeling between Responses and Self-Judgments
  LaaB improves LLM hallucination detection by mapping self-judgment labels back into neural feature space and using mutual learning under logical consistency constraints between responses and meta-judgments.