SADU benchmark shows top VLMs reach only 70% accuracy on software architecture diagram tasks, revealing gaps in visual reasoning for engineering artifacts.
Mitigating overthinking in large reasoning models via manifold steering
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 4roles
background 2polarities
background 2representative citing papers
BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.
CLEAR reveals that LLMs' accuracy on medical questions drops and their 'humility deficit' grows as the number of plausible answers increases and abstention options shift from assertive to uncertain phrasing.
After length-correcting hidden-state trajectories during chain-of-thought, reasoning models show systematically different geometry on harder problems than baselines, strongest in competitive programming.
citing papers explorer
-
Benchmarking and Evaluating VLMs for Software Architecture Diagram Understanding
SADU benchmark shows top VLMs reach only 70% accuracy on software architecture diagram tasks, revealing gaps in visual reasoning for engineering artifacts.
-
Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning
BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.
-
CLEAR: Revealing How Noise and Ambiguity Degrade Reliability in LLMs for Medicine
CLEAR reveals that LLMs' accuracy on medical questions drops and their 'humility deficit' grows as the number of plausible answers increases and abstention options shift from assertive to uncertain phrasing.
-
Reasoning Models Don't Just Think Longer, They Move Differently
After length-correcting hidden-state trajectories during chain-of-thought, reasoning models show systematically different geometry on harder problems than baselines, strongest in competitive programming.