{"paper":{"title":"ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning","license":"http://creativecommons.org/licenses/by/4.0/","headline":"","cross_cats":["cs.CL","cs.LG"],"primary_cat":"cs.AI","authors_text":"Akhil Arora, Dhairya Kuchhal, Har Ashish Arora, Lars Klein, Nearchos Potamitis, Vansh Ramani","submitted_at":"2025-12-08T18:26:58Z","abstract_excerpt":"Benchmark scores for LLM reasoning systems are reported as single numbers, yet the same model, strategy, and task can produce meaningfully different answers and costs across repeated executions, even under greedy decoding (T = 0). This variance is not a statistical nuisance: the highest-performing strategy wins only 77% of head-to-head runs against its nearest competitor, meaning a single observed score can silently misrank systems. We introduce ReasonBench, a benchmark suite recording 30 independent trials across 10 reasoning strategies, 12 models, and 6 tasks, treating quality and cost as di"},"claims":{"count":0,"items":[],"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"source":{"id":"2512.07795","kind":"arxiv","version":2},"verdict":{"id":null,"model_set":{},"created_at":null,"strongest_claim":"","one_line_summary":"","pipeline_version":null,"weakest_assumption":"","pith_extraction_headline":""},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2512.07795/integrity.json","findings":[],"available":true,"detectors_run":[],"snapshot_sha256":"c28c3603d3b5d939e8dc4c7e95fa8dfce3d595e45f758748cecf8e644a296938"},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}