CharBench: Evaluating the Role of Tokenization in Character-Level Tasks

Omri Uzan; Yuval Pinter

arxiv: 2508.02591 · v3 · submitted 2025-08-04 · 💻 cs.CL

Pith reviewed 2026-05-19 00:42 UTC · model grok-4.3

keywords: CharBench · tokenization · character-level tasks · positional understanding · language models · benchmark · subword units

The pith

Longer tokens obscure character positions for language models on intra-word tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CharBench, a benchmark two orders of magnitude larger than existing ones, to test language models on character counting and locating tasks. Models achieve low average accuracy of 43.6 percent overall, with some tasks at 32.3 percent. Analysis shows tokenization properties weakly affect counting accuracy, where word length and character count matter more. For positional tasks inside words, accuracy drops as the length of the token holding the queried character grows. This indicates subword tokenization hides position details that models need for these simple reasoning steps.

Core claim

CharBench evaluates leading models on character-level tasks and finds that tokenization properties are weakly correlated with correctness on counting tasks, while the length of the queried word and the actual character count play a more significant part. In contrast, for tasks requiring intra-word positional understanding, performance is negatively correlated with the length of the token containing the queried character, suggesting that longer tokens obscure character position information for LLMs.
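
The kind of correlation described above can be made concrete with a small sketch. Applying Pearson's formula to a binary outcome (correct or not) and a numeric feature (token length) gives the point-biserial correlation. The records below are made-up toy values, and this page does not specify the paper's actual statistical procedure, so this is one plausible reading rather than the authors' method.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(vx * vy)


# Hypothetical toy records, not CharBench data:
# (length of the token holding the queried character, 1 if the model answered correctly)
records = [(1, 1), (2, 1), (3, 1), (4, 0), (5, 1), (6, 0), (7, 0), (8, 0)]
token_lengths = [length for length, _ in records]
correct = [flag for _, flag in records]

# With a 0/1 outcome, Pearson's r is the point-biserial correlation.
r = pearson(token_lengths, correct)
print(f"point-biserial r = {r:.3f}")
```

On this toy data the coefficient is negative, which is the direction reported for positional tasks. A bivariate coefficient like this says nothing about confounders such as word length.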

What carries the argument

The length of the token containing the queried character, used as the variable to measure correlation with accuracy on positional character tasks.
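
As a minimal sketch of how this variable could be computed, the function below walks a word's segmentation and returns the token that holds a given character index, together with the character's offset inside that token. The function name and the segmentation are hypothetical; a real tokenizer may split the word differently, and the paper's own feature extraction may differ.

```python
def containing_token(tokens, char_index):
    """Return (token, offset inside token) for the character at char_index.

    `tokens` is assumed to be a segmentation whose pieces concatenate to the word.
    """
    if char_index < 0:
        raise IndexError("char_index must be non-negative")
    start = 0
    for tok in tokens:
        end = start + len(tok)
        if char_index < end:
            return tok, char_index - start
        start = end
    raise IndexError("char_index is past the end of the word")


# Made-up segmentation for illustration only.
tokens = ["straw", "berry"]
word = "".join(tokens)

idx = word.index("b")  # position of the queried character in the word
tok, offset = containing_token(tokens, idx)
print(f"char {word[idx]!r} at index {idx} sits in token {tok!r} "
      f"(length {len(tok)}, offset {offset})")
```

The token length `len(tok)` is the quantity whose correlation with accuracy carries the argument. The offset is returned as well because token-internal position is one of the confounders the referee report raises.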

If this is right

Tokenization effects are weaker for counting than for locating characters inside words.
Word length and true character count influence counting performance more than segmentation details.
Models show an average accuracy of 43.6 percent on CharBench tasks overall.
Future work can use the benchmark to test improvements in handling character positions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Tokenization design choices may need to prioritize shorter segments for tasks that depend on exact intra-word locations.
The same token-length effect could appear in other fine-grained language operations that require tracking individual symbols.
Benchmark results could guide targeted fine-tuning or architectural changes to reduce the performance drop on longer tokens.

Load-bearing premise

The character-level tasks chosen for CharBench represent the reasoning challenges LLMs encounter in practical settings, and the observed correlations reflect tokenization effects rather than other factors like model scale or training data.

What would settle it

Testing the same positional tasks on a model that uses character-level tokenization and finding either no negative correlation with token length or substantially higher accuracy would challenge the claim that longer tokens obscure position information.

Original abstract

Tasks that require character-level reasoning, such as counting or locating characters within words, remain challenging for contemporary language models. A common conjecture is that language models' reliance on subword units, rather than characters, contributes to their struggles with character-level tasks, yet recent studies offer conflicting conclusions about the role of tokenization, leaving its impact unclear. To address this gap, we introduce CharBench, a comprehensive benchmark of character-level tasks that is two orders of magnitude larger than existing alternatives. We evaluate a diverse range of leading open-weight and proprietary models on CharBench and find that it presents a significant challenge to modern LLMs, with an average accuracy of 43.6% and 32.3% on some tasks. We present an in-depth analysis of how intrinsic properties of words and their segmentations into tokens correspond to model performance. For counting tasks, we find that tokenization properties are weakly correlated with correctness, while the length of the queried word and the actual character count play a more significant part. In contrast, for tasks requiring intra-word positional understanding, performance is negatively correlated with the length of the token containing the queried character, suggesting that longer tokens obscure character position information for LLMs. We encourage future work to build on the benchmark and evaluation methodology introduced here as tools for improving model performance on such tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

CharBench scales up testing for character-level LLM weaknesses and splits counting from positional effects, but the key negative correlation with token length lacks clear controls for confounders.

The main takeaway is that this paper gives us a much larger benchmark for character-level tasks than we had before, and it shows that counting behaves differently from locating a character inside a word. The scale is the clearest advance: two orders of magnitude larger than existing alternatives. They also run it across a range of open-weight and proprietary models and report concrete numbers, with 43.6 percent average accuracy overall and 32.3 percent on some tasks. That data is worth having. The analysis is split by task type: counting accuracy tracks word length and true character count, while positional accuracy drops as the token containing the target character gets longer. That split is new enough to notice and useful for people thinking about tokenization limits.

Referee Report

1 major / 2 minor

Summary. The paper introduces CharBench, a benchmark two orders of magnitude larger than prior alternatives, for evaluating LLMs on character-level tasks such as counting and locating characters within words. It reports that leading models achieve low average accuracies (43.6% overall, 32.3% on some tasks) and presents an analysis showing that counting-task performance correlates more strongly with word length and true character count than with tokenization properties, whereas positional tasks exhibit a negative correlation between accuracy and the length of the token containing the queried character, interpreted as evidence that longer tokens obscure intra-word positional information.

Significance. If the reported correlations prove robust after appropriate controls, CharBench and the accompanying analysis would offer a valuable large-scale resource and empirical grounding for understanding tokenization's contribution to character-level failures in LLMs, with potential to inform future modeling choices.

major comments (1)

[In-depth analysis of intrinsic properties and tokenizations] In the in-depth analysis of word properties and tokenizations (as summarized in the abstract), the negative correlation claimed for intra-word positional tasks between accuracy and length of the token containing the queried character lacks any indication of controls for confounders such as word length, absolute character index, or token-internal offset. This stands in contrast to the counting-task analysis, where word length and character count are explicitly noted as more significant; without analogous isolation (e.g., via regression with multiple covariates), the correlation may reflect unrelated positional or length effects rather than token length per se, directly affecting the central interpretive claim.

minor comments (2)

[Abstract and evaluation methodology] The abstract and evaluation sections state concrete accuracy figures and correlation directions but omit details on statistical significance tests, dataset construction controls, and error analysis; adding these would strengthen the evidential support without altering the core claims.
[Results presentation] Tables or figures presenting per-task or per-model results should include explicit column or axis labels for the exact character-level subtasks and tokenization metrics used, to aid reader interpretation.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and insightful comments, which help clarify the strength of our interpretive claims. We respond to the major comment below.

Referee: In the in-depth analysis of word properties and tokenizations (as summarized in the abstract), the negative correlation claimed for intra-word positional tasks between accuracy and length of the token containing the queried character lacks any indication of controls for confounders such as word length, absolute character index, or token-internal offset. This stands in contrast to the counting-task analysis, where word length and character count are explicitly noted as more significant; without analogous isolation (e.g., via regression with multiple covariates), the correlation may reflect unrelated positional or length effects rather than token length per se, directly affecting the central interpretive claim.

Authors: We agree that the absence of explicit controls for potential confounders limits the strength of the causal interpretation we attach to the observed negative correlation in positional tasks. Our manuscript reports a bivariate correlation between accuracy and the length of the token containing the queried character, while noting that word length and true character count are more predictive for counting tasks; however, we did not perform a multivariate regression isolating token length from word length, absolute character index, or token-internal offset. In the revised manuscript we will add a multiple linear regression (or equivalent) that includes these covariates simultaneously, allowing us to test whether the token-length coefficient remains significant after adjustment. We will also report the associated partial correlations and variance inflation factors to document the degree of confounding.

revision: yes
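
The adjustment proposed in this response can be sketched in a few lines. The example below fits an ordinary least-squares model of correctness on token length, word length, and character index, then computes a variance inflation factor for each predictor as 1 / (1 - R²), where R² comes from regressing that predictor on the others. All data values are hypothetical toy numbers, the helper names are illustrative, and a linear model on a 0/1 outcome is only one of the "equivalent" choices the response allows; a logistic model would be another. Significance testing and partial correlations are omitted.

```python
def solve(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors are perfectly collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols(columns, y):
    """Least-squares fit of y on an intercept plus the given predictor columns.

    Returns (coefficients, r_squared); coefficients[0] is the intercept.
    """
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return beta, 1.0 - ss_res / ss_tot


def vif(columns, j):
    """Variance inflation factor of predictor j: 1 / (1 - R^2 of j on the others)."""
    others = [c for i, c in enumerate(columns) if i != j]
    _, r2_j = ols(others, columns[j])
    return 1.0 / (1.0 - r2_j)


# Hypothetical toy data, one entry per benchmark item (not CharBench data).
token_len = [1, 2, 3, 4, 5, 6, 2, 3, 4, 5]
word_len = [4, 5, 7, 6, 9, 10, 8, 5, 9, 7]
char_index = [0, 1, 2, 5, 3, 8, 6, 4, 7, 2]
correct = [1, 1, 1, 0, 1, 0, 1, 1, 0, 0]

columns = [token_len, word_len, char_index]
beta, r2 = ols(columns, correct)
vifs = [vif(columns, j) for j in range(len(columns))]

for name, coef, v in zip(["token_len", "word_len", "char_index"], beta[1:], vifs):
    print(f"{name:>10}: coefficient = {coef:+.3f}, VIF = {v:.2f}")
print(f"R^2 = {r2:.3f}")
```

If the token-length coefficient stays clearly negative with the other covariates in the model, and the variance inflation factors are modest, the token-length interpretation is harder to attribute to word length or position alone.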

Circularity Check

0 steps flagged

No circularity: empirical benchmark and correlation analysis

The paper introduces CharBench, evaluates external LLMs on character-level tasks, and reports observed correlations between performance and token/word properties. No derivations, equations, fitted parameters presented as predictions, or self-citation chains are present in the abstract or described methodology. The central findings rest on direct model evaluations against an independently constructed benchmark, with no reduction of results to inputs by construction or self-referential definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark and evaluation paper. No mathematical derivations, fitted parameters, background axioms, or new postulated entities are introduced.

pith-pipeline@v0.9.0 · 5764 in / 1049 out tokens · 60181 ms · 2026-05-19T00:42:51.644125+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt · unclear

Paper passage: "performance is negatively correlated with the length of the token containing the queried character, suggesting that longer tokens obscure character position information for LLMs"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "For counting tasks, we find that tokenization properties are weakly correlated with correctness, while the length of the queried word and the actual character count play a more significant part"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.