Arrows of Time for Large Language Models

Clément Hongler; Jérémie Wenger; Vassilis Papadopoulos

arXiv: 2401.17505 · v4 · pith:MLCP46YTnew · submitted 2024-01-30 · 💻 cs.LG · cs.AI · cs.CL

classification 💻 cs.LG cs.AI cs.CL

keywords time language difference large models asymmetry predict trying

We study the probabilistic modeling performed by Autoregressive Large Language Models (LLMs) through the angle of time directionality, addressing a question first raised in (Shannon, 1951). For large enough models, we empirically find a time asymmetry in their ability to learn natural language: a difference in the average log-perplexity when trying to predict the next token versus when trying to predict the previous one. This difference is at the same time subtle and very consistent across various modalities (language, model size, training time, ...). Theoretically, this is surprising: from an information-theoretic point of view, there should be no such difference. We provide a theoretical framework to explain how such an asymmetry can appear from sparsity and computational complexity considerations, and outline a number of perspectives opened by our results.
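
The information-theoretic point in the abstract can be checked on a toy example. The sketch below is not from the paper: it builds an arbitrary joint distribution over short sequences and scores a sequence with exact next-token conditionals and with exact previous-token conditionals. The helper names (`random_joint`, `marginal`, `forward_nll`, `backward_nll`) are hypothetical and chosen only for illustration. Both factorizations telescope to -log p(x_1, ..., x_n), so a gap between the two directions can only arise once the conditionals are approximated by a learned model.

```python
import itertools
import math
import random


def random_joint(vocab, length, seed=0):
    """Arbitrary joint distribution over all sequences of a given length (toy assumption)."""
    rng = random.Random(seed)
    seqs = list(itertools.product(vocab, repeat=length))
    weights = [rng.random() for _ in seqs]
    total = sum(weights)
    return {s: w / total for s, w in zip(seqs, weights)}


def marginal(joint, positions):
    """Marginal probability table over the given positions of the sequence."""
    positions = list(positions)
    table = {}
    for seq, p in joint.items():
        key = tuple(seq[i] for i in positions)
        table[key] = table.get(key, 0.0) + p
    return table


def forward_nll(joint, seq):
    """Sum over t of -log p(x_t | x_1..x_{t-1}), using exact conditionals."""
    nll = 0.0
    for t in range(len(seq)):
        num = marginal(joint, range(t + 1))[tuple(seq[: t + 1])]
        den = marginal(joint, range(t))[tuple(seq[:t])]
        nll -= math.log(num / den)
    return nll


def backward_nll(joint, seq):
    """Sum over t of -log p(x_t | x_{t+1}..x_n), using exact conditionals."""
    n = len(seq)
    nll = 0.0
    for t in range(n - 1, -1, -1):
        num = marginal(joint, range(t, n))[tuple(seq[t:])]
        den = marginal(joint, range(t + 1, n))[tuple(seq[t + 1:])]
        nll -= math.log(num / den)
    return nll


if __name__ == "__main__":
    joint = random_joint("ab", 3, seed=1)
    for seq in joint:
        gap = forward_nll(joint, seq) - backward_nll(joint, seq)
        print("".join(seq), f"forward - backward = {gap:.2e}")
```

With exact conditionals the printed gaps are zero up to floating-point error. The asymmetry reported in the abstract concerns trained models, which this toy computation does not attempt to reproduce.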

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Distilling Specialized Orders for Visual Generation
cs.CV 2025-04 unverdicted novelty 7.0

OAR distills specialized generation orders from any-order AR models via self-distillation, improving FID from 2.39 to 2.17 on ImageNet 256x256 while preserving multi-task flexibility.

Temporal Preference Concepts and their Functions in a Large Language Model
cs.LG 2026-05 unverdicted novelty 5.0

Causal localization via attribution and patching identifies a temporal preference subgraph in mid-to-upper layers of Qwen3-4B-Instruct-2507, with time-horizon geometry in the residual stream and initial evidence for s...