Teaching Metric Distance to Discrete Autoregressive Language Models

Dongjun Min; Jaewoo Park; Jiwan Chung; Saejin Kim; Yongrae Jo; Youngjae Yu

arxiv: 2503.02379 · v5 · submitted 2025-03-04 · 💻 cs.LG · cs.CV

Pith reviewed 2026-05-23 00:57 UTC · model grok-4.3

keywords DIST2Loss · autoregressive models · distance-aware loss · token distances · policy optimization · discrete supervision · reinforcement learning alternative

The pith

DIST2Loss replaces one-hot targets with reward-weighted distributions derived from predefined token distances for autoregressive models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Autoregressive models like large language models are trained to predict the next token using one-hot targets that treat all incorrect tokens as equally wrong. This paper introduces DIST2Loss to incorporate meaningful distances between tokens by turning those distances into weighted target distributions. The loss is presented as the closed-form solution to an entropy-regularized policy optimization problem where per-token rewards are known ahead of time. Experiments apply the method to visual grounding, robotic action learning, reward modeling, and vector-quantized image generation, reporting gains in data efficiency and final task metrics. A reader would care if the approach lets models respect token geometry without the sampling and instability that come with standard reinforcement learning.

Core claim

DIST2Loss replaces one-hot targets with reward-weighted distributions derived from predefined token distances. It can be interpreted as the closed-form solution to entropy-regularized policy optimization with known per-token rewards, retaining the core mechanism of reinforcement learning while avoiding sampling, rollouts, and instability. Experiments show that DIST2Loss improves data efficiency and downstream performance across diverse domains including tighter bounding boxes in visual grounding, faster robotic manipulation, better reward modeling for LLM alignment, and stronger vector-quantized image generation.
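The contrast between one-hot and distance-aware targets can be illustrated with a small sketch. This is not the authors' implementation: the toy vocabulary of integer tokens, the absolute-difference distance, the temperature `tau=1.0`, and the helper names `distance_target` and `soft_target_loss` are all assumptions chosen for illustration. The target follows the form p(v) ∝ exp(−d(v)/τ), and the loss is a KL divergence between that target and the model's distribution.

```python
import numpy as np

def distance_target(distances, tau=1.0):
    """Turn per-token distances to the ground truth into a target distribution.

    Computes p(v) proportional to exp(-d(v) / tau), a softmax over negative distances.
    """
    logits = -np.asarray(distances, dtype=float) / tau
    logits -= logits.max()              # stabilise the exponentials
    weights = np.exp(logits)
    return weights / weights.sum()

def log_softmax(logits):
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())

def soft_target_loss(model_logits, target):
    """KL(target || model) for a single position."""
    log_q = log_softmax(model_logits)
    mask = target > 0                   # skip zero-probability entries of the target
    return float(np.sum(target[mask] * (np.log(target[mask]) - log_q[mask])))

# Toy setup (assumed): tokens 0..9 stand for integers, distance is |v - truth|.
vocab = np.arange(10)
truth = 4
target = distance_target(np.abs(vocab - truth), tau=1.0)
one_hot = np.eye(10)[truth]

# Two hypothetical model outputs that both miss the ground-truth token.
near_miss = np.where(vocab == 5, 5.0, 0.0)   # most mass on token 5
far_miss = np.where(vocab == 9, 5.0, 0.0)    # most mass on token 9

ce_near = soft_target_loss(near_miss, one_hot)
ce_far = soft_target_loss(far_miss, one_hot)
dist_near = soft_target_loss(near_miss, target)
dist_far = soft_target_loss(far_miss, target)

print(f"one-hot target:  near={ce_near:.3f}  far={ce_far:.3f}")
print(f"distance target: near={dist_near:.3f}  far={dist_far:.3f}")
```

Under the one-hot target the two misses receive the same loss, while the distance-derived target penalizes the far miss more than the near one. No sampling or rollout is involved; the target is computed once from the distances.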

What carries the argument

DIST2Loss, the distance-aware objective that builds reward-weighted target distributions from predefined token distances.

If this is right

Training becomes more data-efficient on tasks where token closeness matters.
Visual grounding produces tighter bounding boxes.
Robotic manipulation learns actions faster.
Reward modeling for language model alignment improves.
Vector-quantized image generation becomes stronger.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same construction could be applied to any discrete sequence task that already has a natural distance measure between symbols.
If per-token rewards are available, many reinforcement learning setups in language modeling might be replaced by this direct supervised objective.
The method lowers the barrier to using metric supervision in new domains without requiring policy gradient machinery.

Load-bearing premise

Meaningful task-appropriate distances between tokens can be defined in advance and the resulting weighted distributions supply the right supervision signal.

What would settle it

A side-by-side run on a numerical prediction task where DIST2Loss produces higher error or slower convergence than standard cross-entropy loss.

Original abstract

Large language models (LLMs) operate as autoregressive predictors over discrete token vocabularies, a formulation that has enabled their adaptation far beyond natural language to vision, robotics, and multimodal reasoning. However, training against one-hot targets disregards metric relationships between tokens and limits effectiveness on tasks where distance is meaningful, such as numerical values, spatial coordinates, or quantized embeddings. We introduce DIST2Loss, a distance-aware objective for discrete autoregressive models that replaces one-hot targets with reward-weighted distributions derived from predefined token distances. DIST2Loss can be interpreted as the closed-form solution to entropy-regularized policy optimization with known per-token rewards, retaining the core mechanism of reinforcement learning while avoiding sampling, rollouts, and instability. Our experiments show that DIST2Loss improves data efficiency and downstream performance across diverse domains. It yields tighter bounding boxes in visual grounding, accelerates robotic manipulation by improving action learning, enhances reward modeling for LLM alignment, and strengthens vector-quantized image generation. These results demonstrate that distance-aware supervision offers a simple and general alternative to one-hot supervision for discrete autoregressive models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DIST2Loss is a direct application of max-ent RL that swaps one-hot targets for distance-derived distributions and reports gains across four domains.

The key takeaway is that this paper gives a simple recipe for distance-aware supervision in autoregressive models: turn token distances into per-token rewards and use the resulting softmax as the training target. That target is exactly the closed-form solution for entropy-regularized policy optimization when rewards are known ahead of time, so the RL framing follows immediately from the standard derivation without any new math tricks.

The authors test it on visual grounding, robotic action prediction, reward modeling, and vector-quantized image generation, and they report better data efficiency and downstream numbers in each case. That multi-domain check is the strongest part of the work; it shows the idea is not tied to one narrow setting. The construction itself is lightweight and plugs into existing training loops without sampling or rollouts.

The main soft spot is the reliance on having task-appropriate distances defined in advance. If those distances are noisy or miss the structure that actually matters for the downstream objective, the weighted targets could just add noise rather than signal. The paper would be stronger with more ablations on how sensitive results are to the choice of distance function and on whether the gains survive when the distances are replaced by random but normalized weights. The RL equivalence is clean but not surprising once the reward definition is accepted, so the real contribution sits in the empirical side rather than the interpretation.

This is the kind of paper that matters to people building models for structured discrete outputs like coordinates, quantized latents, or action spaces. A reader who already works on loss-function tweaks or multimodal training would find the experiments worth looking at. It is solid enough and grounded enough to deserve real referee time rather than a desk reject.
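The "random but normalized weights" control suggested in the letter could be built in several ways. One possible sketch, not taken from the paper, keeps the ground-truth token's distance fixed and permutes the distances of all other tokens, so the control target has the same set of weights as the real target but no longer respects token geometry. The helper names `softmax_neg_distance` and `shuffled_distance_target`, the absolute-difference distance, and the fixed seed are illustrative assumptions.

```python
import numpy as np

def softmax_neg_distance(distances, tau=1.0):
    """p(v) proportional to exp(-d(v) / tau)."""
    logits = -np.asarray(distances, dtype=float) / tau
    logits -= logits.max()
    weights = np.exp(logits)
    return weights / weights.sum()

def shuffled_distance_target(distances, truth, tau=1.0, seed=0):
    """Control target: same weights as the real target, assigned to random tokens.

    The ground-truth token keeps its own distance; all other distances are permuted.
    """
    rng = np.random.default_rng(seed)
    d = np.asarray(distances, dtype=float).copy()
    others = np.flatnonzero(np.arange(d.size) != truth)
    d[others] = rng.permutation(d[others])
    return softmax_neg_distance(d, tau)

# Toy setup (assumed): integer tokens 0..9, distance |v - truth|.
vocab = np.arange(10)
truth = 4
distances = np.abs(vocab - truth)

real_target = softmax_neg_distance(distances)
control_target = shuffled_distance_target(distances, truth)
```

Training once against `real_target` and once against `control_target` would separate the effect of the metric structure from the effect of merely softening the one-hot target.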

Referee Report

2 major / 0 minor

Summary. The manuscript introduces DIST2Loss, a distance-aware objective for training discrete autoregressive models. It replaces one-hot targets with reward-weighted distributions derived from predefined token distances. The central claim is that DIST2Loss is the closed-form solution to entropy-regularized policy optimization when per-token rewards are known a priori, thereby retaining RL mechanisms without sampling or rollouts. Experiments report gains in data efficiency and downstream performance on visual grounding, robotic manipulation, reward modeling, and vector-quantized image generation.

Significance. If the equivalence is rigorously derived and the gains are robust to distance choice, the result offers a stable alternative to RL for incorporating metric structure into autoregressive training. The closed-form interpretation is a strength when rewards are known, avoiding instability. Credit is due for framing the method as retaining core RL ideas while being sampling-free. Significance is tempered by dependence on task-appropriate predefined distances.

major comments (2)

[Abstract / §3] Abstract and presumed §3 (derivation): the claim that DIST2Loss is the closed-form solution to entropy-regularized policy optimization is load-bearing, yet the steps from the standard max-ent RL objective (optimal policy = softmax(reward / temperature)) to the specific reward-weighted target form are not shown. This leaves open whether the equivalence is independently derived or tautological once rewards are defined from distances.
[Experiments] Experiments section: improvements are reported across domains, but without explicit controls for the choice of distance metric, baseline comparisons, or statistical tests, it is unclear whether gains are attributable to distance awareness rather than other implementation choices.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will incorporate revisions to strengthen the manuscript.

Referee: [Abstract / §3] Abstract and presumed §3 (derivation): the claim that DIST2Loss is the closed-form solution to entropy-regularized policy optimization is load-bearing, yet the steps from the standard max-ent RL objective (optimal policy = softmax(reward / temperature)) to the specific reward-weighted target form are not shown. This leaves open whether the equivalence is independently derived or tautological once rewards are defined from distances.

Authors: We agree the derivation steps should be shown explicitly. The equivalence follows directly from the max-ent RL objective by defining per-token rewards as a function of negative distance to the ground-truth token (r_i = -d(t, i)), yielding the optimal policy as the normalized exp(r_i / τ) distribution, which is exactly the reward-weighted target in DIST2Loss. This holds for arbitrary predefined rewards and is not tautological. We will expand §3 with the full step-by-step derivation from the standard objective to the closed-form DIST2Loss. revision: yes
Referee: [Experiments] Experiments section: improvements are reported across domains, but without explicit controls for the choice of distance metric, baseline comparisons, or statistical tests, it is unclear whether gains are attributable to distance awareness rather than other implementation choices.

Authors: We acknowledge the need for stronger controls to isolate the effect. In revision we will add: (i) ablations across multiple distance metrics (e.g., Euclidean, cosine, and task-specific variants), (ii) additional baselines including standard cross-entropy, label smoothing, and alternative RL-inspired objectives, and (iii) statistical significance testing with error bars over multiple random seeds. These will clarify that performance gains stem from distance awareness. revision: yes
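The derivation requested in the first comment is short enough to sketch here. The following is the standard Lagrangian argument for a single position, written with the symbols used in the rebuttal (r_i, d(t, i), τ); the multiplier λ and the policy vector π over the vocabulary are notation introduced for this sketch, and it is not claimed to match the paper's own presentation.

```latex
\begin{aligned}
\max_{\pi \,:\, \sum_i \pi_i = 1} \; J(\pi)
  &= \sum_i \pi_i r_i \;-\; \tau \sum_i \pi_i \log \pi_i,
  \qquad r_i = -d(t, i), \\[4pt]
\frac{\partial}{\partial \pi_i}
  \Big[ J(\pi) + \lambda \big( 1 - \textstyle\sum_j \pi_j \big) \Big]
  &= r_i - \tau (\log \pi_i + 1) - \lambda = 0
  \;\;\Longrightarrow\;\; \pi_i \propto \exp(r_i / \tau), \\[4pt]
\pi^*_i
  &= \frac{\exp\!\big(-d(t, i)/\tau\big)}{\sum_j \exp\!\big(-d(t, j)/\tau\big)}, \\[4pt]
J(\pi)
  &= \tau \log \sum_j \exp(r_j / \tau) \;-\; \tau \, \mathrm{KL}\big(\pi \,\|\, \pi^*\big).
\end{aligned}
```

The last line follows by substituting \log \pi^*_i = r_i/\tau - \log \sum_j \exp(r_j/\tau) into the KL term. Since the KL divergence is non-negative and zero only at \pi = \pi^*, the softmax over negative distances is the unique maximizer.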

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper defines DIST2Loss explicitly as replacing one-hot targets with reward-weighted distributions derived from predefined token distances, then notes that this construction matches the closed-form solution of entropy-regularized policy optimization under known per-token rewards. This equivalence follows directly from the standard max-ent RL result (optimal policy = softmax(reward / temperature)) and is presented as an interpretation rather than a self-derived theorem. No equations or steps in the provided material reduce the central claim to a fitted parameter, self-citation, or definitional tautology; the reward definition is an input assumption, not an output of the loss itself. The derivation remains self-contained against external RL benchmarks and does not rely on load-bearing self-citations or ansatzes smuggled from prior author work.
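The standard max-ent result cited above can also be checked numerically. The sketch below is an illustration under assumed values (eight tokens, absolute-difference distance, `tau = 0.5`, 1000 random candidate policies drawn from a Dirichlet distribution); it compares the entropy-regularized objective at softmax(reward / temperature) against the candidates and against the known optimum value τ · log Σ exp(r/τ).

```python
import numpy as np

def objective(policy, rewards, tau):
    """Expected reward plus tau times the Shannon entropy of the policy."""
    policy = np.asarray(policy, dtype=float)
    nz = policy > 0                      # treat 0 * log(0) as 0
    entropy = -np.sum(policy[nz] * np.log(policy[nz]))
    return float(policy @ rewards + tau * entropy)

def softmax(x):
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())
    return e / e.sum()

tau = 0.5
truth = 3
rewards = -np.abs(np.arange(8) - truth).astype(float)   # r_i = -d(t, i)

# Closed-form candidate: softmax(reward / temperature).
closed_form = softmax(rewards / tau)
best_value = objective(closed_form, rewards, tau)

# Analytic optimum value: tau * log sum_j exp(r_j / tau).
log_partition = tau * np.log(np.sum(np.exp(rewards / tau)))

# Random policies on the simplex should never beat the closed form.
rng = np.random.default_rng(0)
candidates = rng.dirichlet(np.ones(8), size=1000)
candidate_values = np.array([objective(p, rewards, tau) for p in candidates])

print("closed form beats all candidates:", bool(np.all(candidate_values <= best_value)))
```

A random search of this kind is only a sanity check, not a proof; the proof is the Lagrangian argument behind the softmax form.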

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review performed on abstract only; the ledger therefore records only the assumptions explicitly required by the abstract description.

axioms (1)

Domain assumption: Predefined, task-appropriate distances between discrete tokens exist and can be used to construct reward-weighted target distributions.
The method is defined in terms of these distances; the abstract states that DIST2Loss is derived from them.

pith-pipeline@v0.9.0 · 5732 in / 1221 out tokens · 42528 ms · 2026-05-23T00:57:50.246414+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.FunctionalEquation · washburn_uniqueness_aczel · matches
MATCHES: this paper passage directly uses, restates, or depends on the cited Recognition theorem or module.
Passage: DIST2Loss can be interpreted as the closed-form solution to entropy-regularized policy optimization … π*(a) ∝ exp(R(a)/τ). … trains the model to minimize its KL divergence.

Foundation.GeneralizedDAlembert · dAlembert_cosh_solution_aczel · echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Passage: formulating the target distribution p_d using a discretized exponential family distribution: p_d(v | x, t) = exp(−d(v, x, t)/τ) / Σ_{v′} exp(−d(v′, x, t)/τ)

Foundation.BranchSelection · branch_selection · refines
Relation between the paper passage and the cited Recognition theorem.
Passage: The construction of DIST2Loss can be directly linked to entropy-regularized policy optimization … retains the core mechanism of reinforcement learning while avoiding sampling, rollouts, and instability.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

v1: Learning to Point Visual Tokens for Multimodal Grounded Reasoning
cs.CL · 2025-05 · unverdicted · novelty 6.0

v1 adds a point-and-copy mechanism for dynamic visual token referencing in multimodal reasoning, trained on a new 300K dataset with grounding annotations, and outperforms baselines on multimodal math tasks.