Notation Matters: A Benchmark Study of Token-Optimized Formats in Agentic AI Systems

Bernhard Geiger; Lorenz Kutschka

arxiv: 2605.29676 · v2 · pith:ZNV5C4VYnew · submitted 2026-05-28 · 💻 cs.AI · cs.CL

Pith reviewed 2026-06-29 07:10 UTC · model grok-4.3

keywords: token optimization, agentic AI, JSON alternatives, TOON, TRON, tool calling, LLM efficiency, structured output

The pith

TRON reduces tokens by up to 27% in agentic tool use while staying within 14 percentage points of JSON accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests two compact notations, TOON and TRON, as replacements for JSON when language models read tool schemas, receive results, and emit structured calls inside full agent loops. JSON carries structural overhead that inflates token counts; the new formats aim to shrink that cost without breaking the agent's ability to understand instructions or produce correct outputs. Across four benchmarks and five open-weight models the authors separate input compression from output compression and find TRON delivers the largest savings at modest accuracy cost while TOON introduces extra failures that compound over multiple turns and block parallel calls. A reader would care because every token saved lowers cost and latency in deployed agents that repeatedly exchange structured data. The evaluation design isolates whether gains come from better comprehension, better generation, or both.

Core claim

TRON reduces tokens by up to 27% with accuracy within 14 percentage points of the JSON baseline across the four agentic benchmarks. TOON reaches up to 18% token reduction at a 9-percentage-point accuracy cost, yet it additionally produces cascading multi-turn parsing failures and collapses parallel tool-call output for most models tested.
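The two quantities in this claim can be written down as simple arithmetic. The sketch below is one plausible reading, not the paper's stated definition: it assumes token reduction is measured relative to the JSON token count and that accuracy is compared as a difference in percentage points. The function names and the sample numbers in the usage lines are hypothetical and chosen only to illustrate the calculation.

```python
def token_reduction_pct(baseline_tokens: int, format_tokens: int) -> float:
    """Relative token saving of a compact format versus the JSON baseline, in percent."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 100.0 * (baseline_tokens - format_tokens) / baseline_tokens


def accuracy_gap_pp(baseline_acc: float, format_acc: float) -> float:
    """Accuracy lost relative to JSON, in percentage points.

    Both accuracies are fractions in [0, 1]; a positive result means the
    compact format scored lower than JSON.
    """
    return 100.0 * (baseline_acc - format_acc)


# Illustrative values only (not taken from the paper's tables).
saving = token_reduction_pct(baseline_tokens=1000, format_tokens=730)
gap = accuracy_gap_pp(baseline_acc=0.80, format_acc=0.66)
print(f"saving={saving:.1f}%  gap={gap:.1f}pp")
```

Note that a percentage-point gap is an absolute difference, so a drop from 80% to 66% is 14 pp even though it is a larger relative decline.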

What carries the argument

Decoupled measurement of input compression versus output compression on agentic benchmarks, applied to TOON and TRON as compact object notations.
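A minimal sketch of how such a decoupled design could be laid out as an experiment grid, assuming each condition changes the notation on only one side of the model (what it reads or what it writes) while the other side stays JSON. The condition names and the `decoupled_conditions` helper are hypothetical; the summary does not say whether the authors also run a condition with both sides compressed.

```python
FORMATS = ("json", "toon", "tron")


def decoupled_conditions(baseline="json", formats=FORMATS):
    """List experiment conditions that vary the input or the output notation, never both."""
    conditions = [{"name": "baseline", "input": baseline, "output": baseline}]
    for fmt in formats:
        if fmt == baseline:
            continue
        # Input-only: schemas and tool results are compact, the model still emits JSON.
        conditions.append({"name": f"{fmt}-input-only", "input": fmt, "output": baseline})
        # Output-only: the model reads JSON but must emit calls in the compact format.
        conditions.append({"name": f"{fmt}-output-only", "input": baseline, "output": fmt})
    return conditions


for condition in decoupled_conditions():
    print(condition)
```

Comparing each one-sided condition against the baseline attributes a change in accuracy to comprehension (input side) or generation (output side) rather than to a mix of the two.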

If this is right

Agent frameworks could adopt TRON for tool schemas and results to lower token budgets while preserving most task success.
TOON would require additional safeguards against multi-turn parsing drift before it can replace JSON in chained agent workflows.
Parallel tool-call generation must be re-tested when switching notations because TOON breaks it for most models.
Token savings appear in both comprehension and generation stages, so the formats can be applied independently to input and output sides.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the pattern holds on proprietary models, production agents might default to TRON for cost-sensitive tool loops.
The multi-turn failure mode in TOON suggests that future benchmarks should track cumulative error over conversation length rather than single turns.
Notation choice could interact with model scale; larger models might absorb the accuracy cost more easily than the five tested here.
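The suggestion about cumulative error can be made concrete with a survival-style curve: for each turn index, the fraction of episodes that have not yet hit a parsing failure. This is a sketch under assumed inputs (one list of per-turn booleans per episode, `True` meaning the turn parsed correctly); the function name and data layout are hypothetical and do not come from the paper's code.

```python
def survival_by_turn(episodes):
    """Fraction of episodes with no parse failure up to and including each turn.

    episodes: list of lists of bool, one inner list per episode.
    Episodes shorter than a given turn are excluded from that turn's denominator.
    """
    max_turns = max((len(e) for e in episodes), default=0)
    curve = []
    for turn in range(1, max_turns + 1):
        eligible = [e for e in episodes if len(e) >= turn]
        alive = sum(all(e[:turn]) for e in eligible)
        curve.append(alive / len(eligible))
    return curve


# Made-up episodes for illustration: the second one fails at turn 2.
episodes = [
    [True, True, True],
    [True, False, True],
    [True, True],
]
print(survival_by_turn(episodes))
```

A curve that decays faster for one notation than for JSON would indicate compounding failures that a single-turn accuracy number hides.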

Load-bearing premise

The four chosen benchmarks and five open-weight models are representative enough that the measured token and accuracy differences will hold in other real-world agent loops.

What would settle it

Re-running the identical benchmarks on closed-source models or on longer-horizon tasks and observing token savings fall below 10% or accuracy gaps exceed 20 points would falsify the reported reductions.

Figures

Figures reproduced from arXiv: 2605.29676 by Bernhard Geiger, Lorenz Kutschka.

**Figure 2.** Mean accuracy per model and format, aver...

**Figure 3.** Per-model token–accuracy tradeoff against the JSON baseline, averaged across benchmarks.

**Figure 4.** Example encodings of the same data in JSON, TOON, and TRON.

**Figure 5.** Per-benchmark token–accuracy facets under input-only compression for all five model configurations.

**Figure 6.** Per-model absolute token–accuracy view, summed across the four benchmarks. Each marker is one...

Original abstract

Large language models in Agentic AI systems consume tool schemas and execution results and emit tool invocations as structured data. The default language for that exchange, JSON, was designed for application-to-application interchange rather than token efficiency, so its structural elements impose substantial token overhead. Recent work proposes token-optimized alternatives such as TOON (Token-Oriented Object Notation) and TRON (Token Reduced Object Notation) as more compact replacements, but these formats have been evaluated only on isolated comprehension or generation tasks. Whether their token reductions hold inside end-to-end agentic loops therefore remains an open question. We evaluate TOON and TRON on four agentic benchmarks (BFCL, MCPToolBenchPP, MCP-Universe, StableToolBench) and five open-weight LLMs, decoupling input compression from output compression to measure comprehension and generation independently. TRON reduces tokens by up to 27% with accuracy within 14pp of the JSON baseline. TOON achieves up to 18% reduction at a similar 9pp accuracy cost, but additionally cascades on multi-turn parsing failures and collapses parallel tool-call output for most models. The code is available at: https://github.com/lkutschka/notation-matters
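As a rough illustration of the structural-overhead point in the abstract, the sketch below estimates what share of a compact JSON string is syntax (braces, quotes, colons, commas) rather than keys and values. It counts characters, not tokens, so it is only a proxy for what a tokenizer would see. `leaf_chars` and `structural_share` are hypothetical helper names, and the sample payload is made up.

```python
import json


def leaf_chars(value):
    """Characters in keys and scalar values, ignoring all JSON syntax."""
    if isinstance(value, dict):
        return sum(len(str(k)) + leaf_chars(v) for k, v in value.items())
    if isinstance(value, list):
        return sum(leaf_chars(v) for v in value)
    return len(str(value))


def structural_share(value):
    """Share of the compact JSON encoding that is syntax rather than content."""
    encoded = json.dumps(value, separators=(",", ":"))
    return 1 - leaf_chars(value) / len(encoded)


# Hypothetical tool call, for illustration only.
call = {"name": "get_weather", "arguments": {"city": "Graz", "units": "metric"}}
print(f"{structural_share(call):.0%} of the characters are JSON syntax")
```

Token counts depend on the model's tokenizer, so a real measurement would encode each serialization with the tokenizer of the model under test.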

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TRON delivers up to 27% token cuts in end-to-end agentic loops with accuracy within 14pp of JSON, while TOON saves less and breaks on multi-turn and parallel calls.

The main takeaway is that TRON reduces tokens by up to 27% inside full agentic loops with accuracy drops no larger than 14 points from the JSON baseline, and TOON reaches 18% savings at a 9-point cost but then fails on multi-turn parsing and parallel tool outputs for most models.

The paper moves earlier isolated-task tests of these formats into four complete benchmarks (BFCL, MCPToolBenchPP, MCP-Universe, StableToolBench) across five open-weight LLMs. It separates input compression from output compression so comprehension and generation can be measured on their own. Releasing the code makes the numbers checkable.

The reported deltas come from direct benchmark runs with no fitted parameters or circular citations. That part is clean. The softer part is coverage: the four setups and open models may not capture the full range of tool schemas, long contexts, or closed-model tokenizers that appear in production. The abstract already flags TOON's sensitivity to interaction style, which shows the results are setup-dependent.

Teams tuning agent cost and latency will find the concrete numbers useful. Readers who need to know whether a format change survives real loops get direct evidence here.

Send it to peer review. The extension to end-to-end measurement is new enough and the code is public, so referees can test the generalization claim themselves.

Referee Report

2 major / 1 minor

Summary. The paper evaluates whether token-optimized notations TOON and TRON preserve their reported efficiency gains when used inside full agentic loops rather than isolated tasks. On four benchmarks (BFCL, MCPToolBenchPP, MCP-Universe, StableToolBench) and five open-weight LLMs it decouples input and output compression and reports that TRON cuts tokens by up to 27 % while keeping accuracy within 14 pp of the JSON baseline; TOON cuts up to 18 % at a 9 pp accuracy cost but additionally triggers multi-turn parsing cascades and collapses parallel tool-call output for most models. Reproducible code is released.

Significance. If the measured deltas generalize, the study supplies the first end-to-end evidence that compact notations can be substituted for JSON in agentic pipelines without catastrophic accuracy loss, together with an open implementation that supports direct replication and extension.

major comments (2)

[Experimental Setup] Experimental Setup (abstract and benchmark description): the claim that the observed token reductions 'hold inside end-to-end agentic loops' rests on the unexamined assumption that the four selected benchmarks adequately sample tool-schema complexity, multi-turn dynamics, parallel calls, and tokenizer behavior; the abstract itself flags TOON's sensitivity to interaction style, yet no diversity analysis or coverage argument is supplied.
[Results] Results and Methods: accuracy deltas are stated as 'within 14 pp' and 'within 9 pp' without reported variance, confidence intervals, or statistical tests across the five models or repeated runs, so it is impossible to judge whether the observed differences are distinguishable from noise.

minor comments (1)

The GitHub link is given but the manuscript does not state the exact commit or tag used for the reported numbers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the experimental setup and statistical presentation. We address each major comment below and outline the revisions we will incorporate.

Referee: [Experimental Setup] Experimental Setup (abstract and benchmark description): the claim that the observed token reductions 'hold inside end-to-end agentic loops' rests on the unexamined assumption that the four selected benchmarks adequately sample tool-schema complexity, multi-turn dynamics, parallel calls, and tokenizer behavior; the abstract itself flags TOON's sensitivity to interaction style, yet no diversity analysis or coverage argument is supplied.

Authors: We agree that an explicit coverage argument strengthens the justification. The four benchmarks were selected as established suites in the agentic tool-use literature. In the revision we will add a dedicated paragraph to Section 3 (Experimental Setup) that maps each benchmark to the dimensions of schema complexity, multi-turn dynamics, parallel calls, and tokenizer behavior, thereby supplying the requested diversity analysis.

revision: partial

Referee: [Results] Results and Methods: accuracy deltas are stated as 'within 14 pp' and 'within 9 pp' without reported variance, confidence intervals, or statistical tests across the five models or repeated runs, so it is impossible to judge whether the observed differences are distinguishable from noise.

Authors: We accept that variance reporting improves interpretability. The revised manuscript will include mean accuracy and standard deviation across the five models for each notation-benchmark pair and will note the consistency of the deltas. Because the original runs were not repeated with multiple random seeds, we cannot supply confidence intervals derived from repeated trials; we will state this limitation explicitly.

revision: partial
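The across-model summary promised in this response could be computed along the following lines. This is a sketch only: the function name is hypothetical, the sample standard deviation (n - 1 denominator) is an assumed choice, and the input values are placeholders rather than numbers from the paper.

```python
from statistics import mean, stdev


def summarize_deltas(deltas_pp):
    """Mean and sample standard deviation of per-model accuracy deltas (in pp)."""
    if len(deltas_pp) < 2:
        raise ValueError("need at least two models to estimate spread")
    return {"mean": mean(deltas_pp), "std": stdev(deltas_pp), "n": len(deltas_pp)}


# Placeholder deltas for five models on one notation-benchmark pair.
print(summarize_deltas([1.0, 2.0, 3.0, 4.0, 5.0]))
```

A spread across five models describes between-model variation only; it says nothing about run-to-run noise, which is the limitation the authors acknowledge.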

Circularity Check

0 steps flagged

Empirical benchmark measurements with no derivation chain or self-referential reductions

The paper consists entirely of direct empirical measurements of token counts and task accuracy for JSON, TOON, and TRON on four public benchmarks across five LLMs. No equations, fitted parameters, uniqueness theorems, or ansatzes are present; the reported deltas (e.g., up to 27% token reduction) are computed from raw tokenizer outputs and benchmark scores rather than being derived from or equivalent to any inputs by construction. Self-citations to prior format proposals are incidental background and do not bear the load of the central claims, which rest on reproducible benchmark runs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The study relies on standard benchmark evaluation practices and the existence of the four named agentic benchmarks; no free parameters, axioms beyond ordinary statistics, or new invented entities are introduced.

pith-pipeline@v0.9.1-grok · 5743 in / 1016 out tokens · 26212 ms · 2026-06-29T07:10:52.887295+00:00 · methodology
