L., Leskovec, J., and Jurafsky, D
4 Pith papers cite this work.
Fields: cs.CL (4)
Years: 2026 (4)
Verdicts: UNVERDICTED (4)
Representative citing papers
- How Surprising Is Historical Italian to Language Models? Tokenization Tax, Comprehension Tax, and a Simple Mitigation
  17th-century Italian imposes a 2.4x surprisal tax on LLMs versus modern Italian with comparable tokenization costs to Russian, yet embeddings stay robust above 0.85 similarity and a temporal prompt reduces surprisal by 60%.
- HistoRAG: Embedding Historical Methodology in Retrieval-Augmented Generation Through Critical Technical Practice
  HistoRAG embeds historiographical principles into RAG via temporal windowing, decoupled retrieval, and contestable LLM relevance judgments, evaluated on 102k Der Spiegel articles from 1950-1979.
- Between Century and Poet: Graph-Based Lexical Semantic Change in Persian Poetry
  Graph-based neighborhood analysis of Persian poetry embeddings shows semantic change occurs through rewiring of local connections, with distinct patterns for time-sensitive, poet-sensitive, and stable words.
- Language, Place, and Social Media: Geographic Dialect Alignment in New Zealand
  New Zealand Reddit users link language to place and form contiguous speech communities with complex geographic alignment; Word2Vec embeddings reveal semantic variations and shifts in NZ English on a 4.26 billion word corpus.