Numerical sensitivity and robustness: Exploring the flaws of mathematical reasoning in large language models, 2025
2 Pith papers cite this work.
verdicts
UNVERDICTED 2
representative citing papers
citing papers explorer
- Robust Reasoning Benchmark
The Robust Reasoning Benchmark shows frontier LLMs are mostly resilient to textual perturbations on AIME problems while open-weight models suffer up to 54% accuracy drops and exhibit accuracy decay on later problems due to attention dilution during chain-of-thought.
- Look Twice before You Leap: A Rational Framework for Localized Adversarial Anonymization
RLAA is a localized adversarial anonymization framework that adds an arbitrator to filter ghost leaks and enforce rational early stopping, yielding superior privacy-utility trade-offs on benchmarks compared to greedy baselines.