Arrows of time for large language models

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

We study the probabilistic modeling performed by Autoregressive Large Language Models (LLMs) from the angle of time directionality, addressing a question first raised by Shannon (1951). For large enough models, we empirically find a time asymmetry in their ability to learn natural language: a difference in the average log-perplexity when trying to predict the next token versus when trying to predict the previous one. This difference is both subtle and very consistent across various settings (language, model size, training time, ...). Theoretically, this is surprising: from an information-theoretic point of view, there should be no such difference. We provide a theoretical framework to explain how such an asymmetry can appear from sparsity and computational complexity considerations, and outline a number of perspectives opened by our results.
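
The information-theoretic claim in the abstract follows from the chain rule: any joint distribution over a sequence factorizes exactly in either direction, so a predictor with access to the true conditionals pays the same expected log-loss forward and backward, namely the entropy of the sequence. The following is a minimal sketch of that argument on a toy distribution, not the paper's experimental setup; the names `marginal` and `avg_log_loss` and the random 4-token binary distribution are assumptions made purely for illustration.

```python
import itertools
import math
import random


def marginal(joint, positions, values):
    """Probability that a sequence takes `values` at `positions`."""
    return sum(
        p for seq, p in joint.items()
        if all(seq[i] == v for i, v in zip(positions, values))
    )


def avg_log_loss(joint, order):
    """Expected total log-loss (in nats) of the exact conditionals
    when tokens are predicted one at a time in the given `order`."""
    total = 0.0
    for seq, p in joint.items():
        for k, pos in enumerate(order):
            seen = order[:k]                      # positions already revealed
            ctx = [seq[i] for i in seen]          # their values in this sequence
            num = marginal(joint, seen + [pos], ctx + [seq[pos]])
            den = marginal(joint, seen, ctx)
            total -= p * math.log(num / den)      # -log p(token | context)
    return total


# Hypothetical toy data: an arbitrary joint distribution over binary sequences.
random.seed(0)
n = 4
seqs = list(itertools.product([0, 1], repeat=n))
weights = [random.random() + 0.1 for _ in seqs]   # offset keeps every probability positive
z = sum(weights)
joint = {s: w / z for s, w in zip(seqs, weights)}

forward = avg_log_loss(joint, list(range(n)))               # predict the next token
backward = avg_log_loss(joint, list(range(n - 1, -1, -1)))  # predict the previous token
entropy = -sum(p * math.log(p) for p in joint.values())

print(forward, backward, entropy)  # the three values agree up to floating-point error
```

Because the equality holds for exact conditionals, any forward/backward gap measured in practice must come from what a trained model can and cannot learn, which is where the abstract's sparsity and computational-complexity considerations enter.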

representative citing papers

Distilling Specialized Orders for Visual Generation

cs.CV · 2025-04-23 · unverdicted · novelty 7.0

citing papers explorer

  • Distilling Specialized Orders for Visual Generation cs.CV · 2025-04-23 · unverdicted · none · ref 8 · internal anchor

    OAR distills specialized generation orders from any-order AR models via self-distillation, improving FID from 2.39 to 2.17 on ImageNet 256x256 while preserving multi-task flexibility.

  • Temporal Preference Concepts and their Functions in a Large Language Model cs.LG · 2026-05-11 · unverdicted · none · ref 84 · internal anchor

    Causal localization via attribution and patching identifies a temporal preference subgraph in mid-to-upper layers of Qwen3-4B-Instruct-2507, with time-horizon geometry in the residual stream and initial evidence for steering-vector control.