arxiv: 2605.29676 · v2 · pith:ZNV5C4VYnew · submitted 2026-05-28 · 💻 cs.AI · cs.CL

Notation Matters: A Benchmark Study of Token-Optimized Formats in Agentic AI Systems

Pith reviewed 2026-06-29 07:10 UTC · model grok-4.3

classification: cs.AI, cs.CL
keywords: token optimization, agentic AI, JSON alternatives, TOON, TRON, tool calling, LLM efficiency, structured output

The pith

TRON reduces tokens by up to 27% in agentic tool use while staying within 14 percentage points of JSON accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests two compact notations, TOON and TRON, as replacements for JSON when language models read tool schemas, receive results, and emit structured calls inside full agent loops. JSON carries structural overhead that inflates token counts; the new formats aim to shrink that cost without breaking the agent's ability to understand instructions or produce correct outputs. Across four benchmarks and five open-weight models the authors separate input compression from output compression and find TRON delivers the largest savings at modest accuracy cost while TOON introduces extra failures that compound over multiple turns and block parallel calls. A reader would care because every token saved lowers cost and latency in deployed agents that repeatedly exchange structured data. The evaluation design isolates whether gains come from better comprehension, better generation, or both.

Core claim

TRON reduces tokens by up to 27% with accuracy within 14 percentage points of the JSON baseline across the four agentic benchmarks. TOON reaches up to 18% token reduction at a similar accuracy cost yet additionally produces cascading multi-turn parsing failures and collapses parallel tool-call output for most models tested.
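
The two quantities in this claim use different units: token reduction is a relative percentage of the JSON token count, while the accuracy cost is an absolute difference in percentage points. The sketch below makes that distinction concrete. The function names and the input numbers are illustrative assumptions, not values taken from the paper or its code.

```python
def token_reduction_pct(baseline_tokens: int, compact_tokens: int) -> float:
    """Relative token saving versus the JSON baseline, in percent."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 100.0 * (baseline_tokens - compact_tokens) / baseline_tokens


def accuracy_gap_pp(baseline_acc: float, compact_acc: float) -> float:
    """Accuracy lost relative to JSON, in percentage points.

    Both inputs are fractions in [0, 1].
    """
    return 100.0 * (baseline_acc - compact_acc)


# Hypothetical numbers, chosen only to illustrate the units.
reduction = token_reduction_pct(baseline_tokens=1000, compact_tokens=730)
gap = accuracy_gap_pp(baseline_acc=0.80, compact_acc=0.66)
print(f"token reduction: {reduction:.1f}%, accuracy gap: {gap:.1f} pp")
```

A 14 pp gap means very different things at a JSON accuracy of 80% and at 30%, which is why the baseline accuracy matters when reading the headline figures.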

What carries the argument

Decoupled measurement of input compression versus output compression on agentic benchmarks, applied to TOON and TRON as compact object notations.
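
One way to picture the decoupling is as a grid of input format by output format in which only one compact notation is varied at a time. The sketch below enumerates such a grid. The `Condition` class, the labels, and the inclusion of a "both" condition are assumptions made for illustration; the page does not state which cells of the grid the authors actually ran.

```python
from dataclasses import dataclass
from itertools import product

FORMATS = ("json", "toon", "tron")


@dataclass(frozen=True)
class Condition:
    input_format: str   # serialization of tool schemas and tool results
    output_format: str  # serialization the model uses for its tool calls

    @property
    def label(self) -> str:
        if self.input_format == "json" and self.output_format == "json":
            return "baseline"
        if self.output_format == "json":
            return "input-only"   # isolates comprehension
        if self.input_format == "json":
            return "output-only"  # isolates generation
        return "both"


def decoupled_conditions() -> list[Condition]:
    """Keep pairs that compare at most one compact format against JSON."""
    conditions = []
    for inp, out in product(FORMATS, repeat=2):
        compact = {inp, out} - {"json"}
        if len(compact) <= 1:  # never mix TOON with TRON in one run
            conditions.append(Condition(inp, out))
    return conditions


for c in decoupled_conditions():
    print(f"{c.label:12s} input={c.input_format:5s} output={c.output_format}")
```

Comparing the "input-only" and "output-only" rows against the baseline is what lets an accuracy drop be attributed to reading the notation or to writing it.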

If this is right

  • Agent frameworks could adopt TRON for tool schemas and results to lower token budgets while preserving most task success.
  • TOON would require additional safeguards against multi-turn parsing drift before it can replace JSON in chained agent workflows.
  • Parallel tool-call generation must be re-tested when switching notations because TOON breaks it for most models.
  • Token savings appear in both comprehension and generation stages, so the formats can be applied independently to input and output sides.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the pattern holds on proprietary models, production agents might default to TRON for cost-sensitive tool loops.
  • The multi-turn failure mode in TOON suggests that future benchmarks should track cumulative error over conversation length rather than single turns.
  • Notation choice could interact with model scale; larger models might absorb the accuracy cost more easily than the five tested here.
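
To see why cumulative error over conversation length matters, consider a simplified model in which each turn fails to parse with a fixed probability, independently of the other turns. Both the independence assumption and the 2% per-turn rate below are illustrative and do not come from the paper; real cascades, where one bad turn corrupts later ones, could be worse than this model suggests.

```python
def episode_failure_probability(per_turn_failure: float, turns: int) -> float:
    """Probability that at least one of `turns` independent turns fails."""
    if not 0.0 <= per_turn_failure <= 1.0:
        raise ValueError("per_turn_failure must be in [0, 1]")
    if turns < 0:
        raise ValueError("turns must be non-negative")
    return 1.0 - (1.0 - per_turn_failure) ** turns


# Hypothetical 2% per-turn parsing failure rate.
for turns in (1, 5, 10, 20):
    p = episode_failure_probability(0.02, turns)
    print(f"{turns:2d} turns -> {100 * p:.1f}% of episodes hit a failure")
```

Even under this optimistic model a small single-turn error rate grows noticeably with horizon, so a benchmark that scores only single turns can understate the cost of a fragile notation.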

Load-bearing premise

The four chosen benchmarks and five open-weight models are representative enough that the measured token and accuracy differences will hold in other real-world agent loops.

What would settle it

Re-running the identical benchmarks on closed-source models or on longer-horizon tasks and observing token savings fall below 10% or accuracy gaps exceed 20 points would falsify the reported reductions.
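
The criterion above can be written as a simple predicate over the two measured quantities. The function name and the example inputs are hypothetical; only the 10% and 20-point thresholds come from the sentence above.

```python
def contradicts_reported_results(
    token_saving_pct: float,
    accuracy_gap_pp: float,
    min_saving_pct: float = 10.0,
    max_gap_pp: float = 20.0,
) -> bool:
    """True if a replication falls outside the thresholds stated above."""
    return token_saving_pct < min_saving_pct or accuracy_gap_pp > max_gap_pp


# Hypothetical replication outcomes.
print(contradicts_reported_results(27.0, 14.0))  # within both thresholds
print(contradicts_reported_results(8.0, 14.0))   # savings too small
print(contradicts_reported_results(27.0, 25.0))  # accuracy gap too large
```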

Figures

Figures reproduced from arXiv: 2605.29676 by Bernhard Geiger, Lorenz Kutschka.

Figure 1: Format substitution in the tool-augmented pipeline. (a) The same tool schema encoded in JSON, TOON, …
Figure 2: Mean accuracy per model and format, aver…
Figure 3: Per-model token–accuracy tradeoff against the JSON baseline, averaged across benchmarks.
Figure 4: Example encodings of the same data in JSON, TOON, and TRON.
Figure 5: Per-benchmark token–accuracy facets under input-only compression for all five model configurations.
Figure 6: Per-model absolute token–accuracy view, summed across the four benchmarks. Each marker is one …
Original abstract

Large language models in Agentic AI systems consume tool schemas and execution results and emit tool invocations as structured data. The default language for that exchange, JSON, was designed for application-to-application interchange rather than token efficiency, so its structural elements impose substantial token overhead. Recent work proposes token-optimized alternatives such as TOON (Token-Oriented Object Notation) and TRON (Token Reduced Object Notation) as more compact replacements, but these formats have been evaluated only on isolated comprehension or generation tasks. Whether their token reductions hold inside end-to-end agentic loops therefore remains an open question. We evaluate TOON and TRON on four agentic benchmarks (BFCL, MCPToolBenchPP, MCP-Universe, StableToolBench) and five open-weight LLMs, decoupling input compression from output compression to measure comprehension and generation independently. TRON reduces tokens by up to 27% with accuracy within 14pp of the JSON baseline. TOON achieves up to 18% reduction at a similar 9pp accuracy cost, but additionally cascades on multi-turn parsing failures and collapses parallel tool-call output for most models. The code is available at: https://github.com/lkutschka/notation-matters

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper evaluates whether token-optimized notations TOON and TRON preserve their reported efficiency gains when used inside full agentic loops rather than isolated tasks. On four benchmarks (BFCL, MCPToolBenchPP, MCP-Universe, StableToolBench) and five open-weight LLMs it decouples input and output compression and reports that TRON cuts tokens by up to 27% while keeping accuracy within 14 pp of the JSON baseline; TOON cuts up to 18% at a 9 pp accuracy cost but additionally triggers multi-turn parsing cascades and collapses parallel tool-call output for most models. Reproducible code is released.

Significance. If the measured deltas generalize, the study supplies the first end-to-end evidence that compact notations can be substituted for JSON in agentic pipelines without catastrophic accuracy loss, together with an open implementation that supports direct replication and extension.

major comments (2)
  1. [Experimental Setup] Experimental Setup (abstract and benchmark description): the claim that the observed token reductions 'hold inside end-to-end agentic loops' rests on the unexamined assumption that the four selected benchmarks adequately sample tool-schema complexity, multi-turn dynamics, parallel calls, and tokenizer behavior; the abstract itself flags TOON's sensitivity to interaction style, yet no diversity analysis or coverage argument is supplied.
  2. [Results] Results and Methods: accuracy deltas are stated as 'within 14 pp' and 'within 9 pp' without reported variance, confidence intervals, or statistical tests across the five models or repeated runs, so it is impossible to judge whether the observed differences are distinguishable from noise.
minor comments (1)
  1. The GitHub link is given but the manuscript does not state the exact commit or tag used for the reported numbers.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the experimental setup and statistical presentation. We address each major comment below and outline the revisions we will incorporate.

Point-by-point responses
  1. Referee: [Experimental Setup] Experimental Setup (abstract and benchmark description): the claim that the observed token reductions 'hold inside end-to-end agentic loops' rests on the unexamined assumption that the four selected benchmarks adequately sample tool-schema complexity, multi-turn dynamics, parallel calls, and tokenizer behavior; the abstract itself flags TOON's sensitivity to interaction style, yet no diversity analysis or coverage argument is supplied.

    Authors: We agree that an explicit coverage argument strengthens the justification. The four benchmarks were selected as established suites in the agentic tool-use literature. In the revision we will add a dedicated paragraph to Section 3 (Experimental Setup) that maps each benchmark to the dimensions of schema complexity, multi-turn dynamics, parallel calls, and tokenizer behavior, thereby supplying the requested diversity analysis. revision: partial

  2. Referee: [Results] Results and Methods: accuracy deltas are stated as 'within 14 pp' and 'within 9 pp' without reported variance, confidence intervals, or statistical tests across the five models or repeated runs, so it is impossible to judge whether the observed differences are distinguishable from noise.

    Authors: We accept that variance reporting improves interpretability. The revised manuscript will include mean accuracy and standard deviation across the five models for each notation-benchmark pair and will note the consistency of the deltas. Because the original runs were not repeated with multiple random seeds, we cannot supply confidence intervals derived from repeated trials; we will state this limitation explicitly. revision: partial
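
The variance reporting promised in the second response amounts to computing, for each notation-benchmark pair, the mean and sample standard deviation of the per-model accuracy deltas. A minimal sketch follows; the accuracy values are hypothetical placeholders for five models and are not results from the paper.

```python
from statistics import mean, stdev

# Hypothetical per-model accuracies (fractions) for one benchmark.
json_acc = [0.72, 0.65, 0.80, 0.58, 0.70]
tron_acc = [0.63, 0.55, 0.74, 0.47, 0.61]


def delta_summary(baseline: list[float], variant: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation of per-model accuracy loss, in pp."""
    if len(baseline) != len(variant) or len(baseline) < 2:
        raise ValueError("need two equally long lists with at least two models")
    deltas_pp = [100.0 * (b - v) for b, v in zip(baseline, variant)]
    return mean(deltas_pp), stdev(deltas_pp)


avg, sd = delta_summary(json_acc, tron_acc)
print(f"mean accuracy loss: {avg:.1f} pp (sd {sd:.1f} pp across models)")
```

This spread describes variation between models only. As the authors note, it says nothing about run-to-run noise, which would require repeated runs with different seeds.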

Circularity Check

0 steps flagged

Empirical benchmark measurements with no derivation chain or self-referential reductions

Full rationale

The paper consists entirely of direct empirical measurements of token counts and task accuracy for JSON, TOON, and TRON on four public benchmarks across five LLMs. No equations, fitted parameters, uniqueness theorems, or ansatzes are present; the reported deltas (e.g., up to 27% token reduction) are computed from raw tokenizer outputs and benchmark scores rather than being derived from or equivalent to any inputs by construction. Self-citations to prior format proposals are incidental background and do not bear the load of the central claims, which rest on reproducible benchmark runs.
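
To give a feel for the kind of measurement involved, the sketch below estimates how much of a small JSON tool call consists of structural punctuation rather than content. It uses a crude regex splitter as a stand-in for a tokenizer, which is an assumption made to keep the example dependency-free; real model tokenizers merge and split characters differently, so the share shown here is not comparable to the paper's numbers. The `get_weather` tool call is hypothetical.

```python
import json
import re

# Crude proxy tokenizer: one token per word or per punctuation character.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")
_STRUCTURAL = set('{}[]":,')


def proxy_tokens(text: str) -> list[str]:
    return _TOKEN_RE.findall(text)


def structural_share(text: str) -> float:
    """Fraction of proxy tokens that are JSON punctuation, not content."""
    tokens = proxy_tokens(text)
    if not tokens:
        return 0.0
    structural = sum(1 for t in tokens if t in _STRUCTURAL)
    return structural / len(tokens)


# Hypothetical tool call, serialized as ordinary JSON.
tool_call = {"name": "get_weather", "arguments": {"city": "Graz", "unit": "celsius"}}
encoded = json.dumps(tool_call)

print(encoded)
print(f"{len(proxy_tokens(encoded))} proxy tokens, "
      f"{100 * structural_share(encoded):.0f}% structural")
```

Swapping `proxy_tokens` for a call into the tokenizer of the model under test would turn this into an actual token count for that model.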

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The study relies on standard benchmark evaluation practices and the existence of the four named agentic benchmarks; no free parameters, axioms beyond ordinary statistics, or new invented entities are introduced.


