L., Leskovec, J., and Jurafsky, D
5 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.CL (5)
years: 2026 (5)
verdicts: UNVERDICTED (5)
representative citing papers
HistoRAG embeds historiographical principles into RAG via temporal windowing, decoupled retrieval, and contestable LLM relevance judgments, evaluated on 102k Der Spiegel articles from 1950-1979.
Graph-based neighborhood analysis of Persian poetry embeddings shows semantic change occurs through rewiring of local connections, with distinct patterns for time-sensitive, poet-sensitive, and stable words.
New Zealand Reddit users link language to place and form contiguous speech communities with complex geographic alignment; Word2Vec embeddings reveal semantic variations and shifts in NZ English on a 4.26 billion word corpus.
Language models constitute rather than passively record cultural realities through agential cuts in their design as measurement apparatus.
citing papers explorer
- How Surprising Is Historical Italian to Language Models? Tokenization Tax, Comprehension Tax, and a Simple Mitigation
  17th-century Italian imposes a 2.4x surprisal tax on LLMs versus modern Italian with comparable tokenization costs to Russian, yet embeddings stay robust above 0.85 similarity and a temporal prompt reduces surprisal by 60%.
- HistoRAG: Embedding Historical Methodology in Retrieval-Augmented Generation Through Critical Technical Practice
  HistoRAG embeds historiographical principles into RAG via temporal windowing, decoupled retrieval, and contestable LLM relevance judgments, evaluated on 102k Der Spiegel articles from 1950-1979.
- Between Century and Poet: Graph-Based Lexical Semantic Change in Persian Poetry
  Graph-based neighborhood analysis of Persian poetry embeddings shows semantic change occurs through rewiring of local connections, with distinct patterns for time-sensitive, poet-sensitive, and stable words.
- Language, Place, and Social Media: Geographic Dialect Alignment in New Zealand
  New Zealand Reddit users link language to place and form contiguous speech communities with complex geographic alignment; Word2Vec embeddings reveal semantic variations and shifts in NZ English on a 4.26 billion word corpus.
- Language Models as Measurement Apparatus for Culture
  Language models constitute rather than passively record cultural realities through agential cuts in their design as measurement apparatus.