
arxiv: 2511.05722 · v3 · pith:6PMSN6WRnew · submitted 2025-11-07 · 💻 cs.CL · cs.AI

OckBench: Measuring the Efficiency of LLM Reasoning

classification 💻 cs.CL cs.AI
keywords efficiency · token · reasoning · accuracy · models · ockbench · ability · across

Large language models (LLMs) such as GPT-5 and Gemini 3 have pushed the frontier of automated reasoning and code generation. Yet current benchmarks emphasize accuracy and output quality, neglecting a critical dimension: efficiency of token usage. Token efficiency varies widely in practice. Models solving the same problem with similar accuracy can differ by up to 5.0× in token length, revealing a large gap in reasoning ability between models. Such variance exposes significant redundancy, highlighting the need for a standardized benchmark that quantifies the gap in token efficiency. We therefore introduce OckBench, the first benchmark that jointly measures accuracy and token efficiency across reasoning and coding tasks. Our evaluation reveals that token efficiency remains largely unoptimized across current models, significantly inflating serving costs and latency. These findings provide a concrete roadmap for the community to optimize token efficiency, a latent reasoning ability. Ultimately, we argue for an evaluation paradigm shift: tokens must not be multiplied beyond necessity. Our benchmarks are available at https://ockbench.github.io/.
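
To make the idea of jointly reporting accuracy and token usage concrete, here is a minimal Python sketch. It is not OckBench's actual metric or code, which the abstract does not specify. The `Run` record, the `summarize` and `token_ratio` helpers, and all numbers are hypothetical and chosen only to illustrate two models with equal accuracy and a 5× gap in output length.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    """One model attempt at one problem (hypothetical record layout)."""
    correct: bool
    output_tokens: int


def summarize(runs):
    """Return (accuracy, mean output tokens per problem) for one model."""
    accuracy = mean(1.0 if r.correct else 0.0 for r in runs)
    avg_tokens = mean(r.output_tokens for r in runs)
    return accuracy, avg_tokens


def token_ratio(runs_a, runs_b):
    """How many times more output tokens model A uses than model B, on average."""
    _, tokens_a = summarize(runs_a)
    _, tokens_b = summarize(runs_b)
    return tokens_a / tokens_b


# Made-up results for two models on the same four problems (assumption).
verbose = [Run(True, 5000), Run(True, 4000), Run(False, 6000), Run(True, 5000)]
concise = [Run(True, 1000), Run(True, 800), Run(False, 1200), Run(True, 1000)]

print(summarize(verbose))             # accuracy and mean tokens, verbose model
print(summarize(concise))             # accuracy and mean tokens, concise model
print(token_ratio(verbose, concise))  # token-length ratio at equal accuracy
```

Reporting the two numbers side by side, rather than folding them into one score, keeps it visible when two models tie on accuracy but differ sharply in cost.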



Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Brevity is the Soul of Inference Efficiency: Inducing Concision in VLMs via Data Curation

    cs.LG 2026-06 unverdicted novelty 6.0

    Curating pretraining data for concision in VLMs produces models with up to 35x lower cost-of-pass at matched accuracy by reducing output token count.

  2. Brevity is the Soul of Inference Efficiency: Inducing Concision in VLMs via Data Curation

    cs.LG 2026-06 unverdicted novelty 6.0

    Curating concise data for VLMs induces brevity, delivering 35x lower Cost-of-Pass at near-identical accuracy and higher matched-length accuracy than uncurated baselines.

  3. Cross-Lingual Token Arbitrage: Optimizing Code Agent Context Windows via Local LLM Preprocessing

    cs.AI 2026-06 unverdicted novelty 6.0

    A local Llama 3.2 3B model preprocesses multilingual coding prompts via translation and structural rewriting, cutting prompt tokens 34-47% and total tokens up to 18.8% while preserving accuracy on OMH-Polyglot benchmark.

  4. OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents

    cs.CL 2026-05 unverdicted novelty 6.0

    OpenSkillEval automatically builds realistic tasks from evolving artifacts to audit skill effectiveness in LLM agents, finding that skill use depends on model and framework and that many popular skills do not outperfo...

  5. OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents

    cs.CL 2026-05 unverdicted novelty 6.0

    OpenSkillEval dynamically builds task instances across five application domains to evaluate 30 open skills with over 600 tests, finding that skill use depends heavily on model and framework and that many popular skill...

  6. JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency

    cs.CL 2026-04 unverdicted novelty 5.0

    JoyAI-LLM Flash delivers a 48B MoE LLM with 2.7B active parameters per token via FiberPO RL and dense multi-token prediction, released with checkpoints on Hugging Face.