Utility-Aware Data Pricing: Token-Level Quality and Empirical Training Gain for LLMs

2026 · cs.LG · arXiv 2604.22893

1 Pith paper cites this work. Polarity classification is still indexing.


abstract

Traditional data valuation methods based on "row-count × quality coefficient" paradigms fail to capture the nuanced, nonlinear contributions that data makes to Large Language Model (LLM) capabilities. This paper presents a dynamic data valuation framework that transitions from static accounting to utility-based pricing. Our approach operates on three layers: (1) token-level information density metrics using Shannon entropy and Data Quality Scores; (2) empirical training gain measurement through influence functions, proxy model strategies, and Data Shapley values; and (3) cryptographic verifiability through hash-based commitments, Merkle trees, and a tamper-evident training ledger. We provide comprehensive experimental validation on three real domains (instruction following, mathematical reasoning, and code summarization), demonstrating that proxy-based empirical gain achieves near-perfect ranking alignment with realized utility, substantially outperforming row-count and token-count baselines. This framework enables a fair Data-as-a-Service economy where high-reasoning data is priced according to its actual contribution to model intelligence, while providing the transparency and auditability necessary for trustworthy data markets.
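The abstract names Shannon entropy (layer 1) and Merkle trees (layer 3) without giving details on this page. The following is a minimal illustrative sketch of those two building blocks, not the paper's implementation. The function names `token_entropy` and `merkle_root` are hypothetical, the entropy is computed over the empirical token frequencies of a single sample, and the choice of SHA-256 with duplication of the last node on odd-sized levels is an assumption.

```python
import hashlib
import math
from collections import Counter


def token_entropy(tokens):
    """Shannon entropy, in bits per token, of the empirical token distribution.

    Assumption: `tokens` is any list of hashable token ids or strings.
    """
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def merkle_root(records):
    """Hex-encoded Merkle root over a list of byte strings (SHA-256).

    Assumption: on a level with an odd number of nodes, the last node is
    duplicated before pairing; other conventions exist.
    """
    if not records:
        return hashlib.sha256(b"").hexdigest()
    # Leaf level: hash each record on its own.
    level = [hashlib.sha256(r).digest() for r in records]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])
        # Parent level: hash the concatenation of each adjacent pair.
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()
```

For example, `token_entropy(["a", "b", "a", "b"])` evaluates to `1.0` bit per token, and changing any single record passed to `merkle_root` changes the returned root, which is the property a tamper-evident ledger would rely on.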

representative citing papers

Computational Challenges in Token Economics: Bridging Economic Theory and AI System Design

cs.AI · 2026-05-17 · unverdicted · novelty 4.0

The paper defines Computational Token Economics and introduces the Token Economics Trilemma as a framework for trade-offs in granularity, real-time performance, and optimality, while outlining a research agenda for three challenge areas.

citing papers explorer

Showing 1 of 1 citing paper.

Computational Challenges in Token Economics: Bridging Economic Theory and AI System Design
cs.AI · 2026-05-17 · unverdicted · none · ref 9 · internal anchor
