Measuring Weak-to-Strong Legibility of Reasoning Models

Dani Roytburg; Daphne Ippolito; Shreya Sridhar

arxiv: 2603.20508 · v2 · pith:OW6ZSSV4new · submitted 2026-03-20 · cs.MA · cs.AI · cs.CL

classification cs.MA · cs.AI · cs.CL

keywords legibility · models · monitors · reasoning · traces · weak-to-strong · weaker · accessible

Reasoning language models (RLMs) and the intermediate chains of thought they emit play an increasingly central role in multi-agent setups such as inter-model monitoring or distillation into smaller models. When agents at different capability tiers must cooperate, strong models need to produce traces digestible by weaker ones. We refer to this goal as "weak-to-strong legibility". Trustworthiness of large models depends in part on this legibility property. For safety oversight in particular, adoption of weak monitors may become a standard for reliability scaffolds on a healthy budget. Legibility requires that the shape of these decision-making traces takes some form accessible to weaker monitors. Existing efficiency-based metrics for legibility fail to capture "thoroughness", instead focusing on conciseness.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CLORE: Content-Level Optimization for Reasoning Efficiency
cs.AI · 2026-05 · unverdicted · novelty 6.0

CLORE augments correct on-policy rollouts by deleting repetitive and irrelevant segments, then optimizes with an auxiliary DPO objective to improve the accuracy-efficiency trade-off on math benchmarks.