LinAlg-Bench shows LLMs switch from execution errors to computational abandonment and structured fabrication at 4x4 matrix scale, indicating a working memory limit rather than knowledge gaps.
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
3 Pith papers cite this work. Polarity classification is still indexing.
Citing papers
- LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning
  LinAlg-Bench shows LLMs switch from execution errors to computational abandonment and structured fabrication at 4x4 matrix scale, indicating a working memory limit rather than knowledge gaps.
- Design and Report Benchmarks for Knowledge Work
  Proposes a three-step benchmark design method (define work activity, specify tested setting, score work product) derived from work studies and O*NET, demonstrated via three case analyses.
- Learning to Control Summaries with Score Ranking
  A score-ranking loss enables controllable summarization by aligning outputs to evaluation scores, matching SOTA performance with dimension-specific control on LLaMA, Qwen, and Mistral.