Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries

Angeliki Lazaridou; Bahare Fatemi; Ed Chi; Ethan Dyer; Harsh Mehta; Jean-Baptiste Lespiau; Jeffrey Hui; Kate Olszewska; Kelvin Xu; Kiran Vodrahalli

arxiv: 2409.12640 · v2 · pith:PMHBABAVnew · submitted 2024-09-19 · 💻 cs.CL · cs.LG

Kiran Vodrahalli, Santiago Ontanon, Nilesh Tripuraneni, Kelvin Xu, Sanil Jain, Rakesh Shivanna, Jeffrey Hui, Nishanth Dikkala, Mehran Kazemi, Bahare Fatemi, Rohan Anil, Ethan Dyer, Siamak Shakeri, Roopali Vij, Harsh Mehta, Vinay Ramasesh, Quoc Le, Ed Chi, Yifeng Lu, Orhan Firat, Angeliki Lazaridou, Jean-Baptiste Lespiau, Nithya Attaluri, Kate Olszewska

keywords: evaluations, model, structure, context, latent, long-context, information, evaluation

We introduce Michelangelo: a minimal, synthetic, and unleaked long-context reasoning evaluation for large language models that is also easy to score automatically. This evaluation is derived via a novel, unifying framework for evaluations over arbitrarily long contexts, which measure the model's ability to do more than retrieve a single piece of information from its context. The central idea of the Latent Structure Queries (LSQ) framework is to construct tasks that require a model to "chisel away" the irrelevant information in the context, revealing a latent structure in the context. To verify a model's understanding of this latent structure, we query the model for details of the structure. Using LSQ, we produce three diagnostic long-context evaluations across code and natural-language domains intended to provide a stronger signal of long-context language model capabilities. We perform evaluations on several state-of-the-art models and demonstrate both that a) the proposed evaluations are high-signal and that b) there is significant room for improvement in synthesizing long-context information.

Forward citations

Cited by 16 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

You Don't Need to Run Every Eval
cs.LG 2026-06 conditional novelty 6.0

The benchmark score matrix of 84 models on 133 tasks is approximately rank-2; BenchPress recovers held-out scores to within 4.6 points and identifies 5-benchmark subsets that predict the full scorecard to within 3.93-...
Dense Contexts Are Hard Contexts: Lexical Density Limits Effective Context in LLMs
cs.CL 2026-06 unverdicted novelty 6.0

Lexical density acts as an independent limiter on effective LLM context windows, with performance collapsing from near-perfect to below 60% as information density rises in controlled ~12k-token benchmarks.
GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment
cs.CL 2026-05 unverdicted novelty 6.0

GoLongRL releases a 23K-sample open long-context RL dataset spanning 9 tasks and introduces TMN-Reweight to improve multitask optimization, achieving performance comparable to much larger models under GRPO.
Priming: Hybrid State Space Models From Pre-trained Transformers
cs.LG 2026-05 unverdicted novelty 6.0

Priming transfers knowledge from pre-trained Transformers to hybrid SSM-attention models, recovering performance with minimal additional tokens and showing Gated KalmaNet outperforming Mamba-2 on long-context reasonin...
SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference
cs.DC 2026-05 conditional novelty 6.0

SPECTRE delivers up to 2.28x speedup on large-model LLM inference by turning idle tail-model services into remote speculative drafters using hybrid parallel decoding and priority scheduling.
SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference
cs.DC 2026-05 unverdicted novelty 6.0

SPECTRE achieves up to 2.28x speedup for large-model LLM serving by running speculative draft generation and target verification in parallel using idle tail-model services.
GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling
cs.CL 2026-04 unverdicted novelty 6.0

GSQ uses Gumbel-Softmax to optimize scalar quantization grids for LLMs, closing most of the accuracy gap to vector methods like QTIP at 2-3 bits per parameter while using symmetric scalar grids compatible with existin...
GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling
cs.CL 2026-04 unverdicted novelty 6.0

GSQ applies a Gumbel-Softmax relaxation to learn discrete grid assignments in scalar quantization, closing most of the accuracy gap to vector methods like QTIP on Llama-3.1 models at 2-3 bits while using only symmetri...
The Verbose Context Problem in Medical Records
cs.CL 2026-06 unverdicted novelty 5.0

Presents PopMedQA benchmark and shows domain-independent LLM methods fail on token-inefficient longitudinal medical records, leaving room for domain-specific approaches.
Randomized YaRN Improves Length Generalization for Long-Context Reasoning
cs.CL 2026-06 unverdicted novelty 5.0

Randomized YaRN improves LLM reasoning performance on 16K-128K contexts when trained only on <8K data by randomizing YaRN positional encodings during short-context training.
Sakana Fugu Technical Report
cs.LG 2026-06 unverdicted novelty 5.0

Sakana Fugu trains LLM orchestrators using fine-tuning, evolutionary algorithms, and RL to build query-adaptive multi-agent scaffolds, claiming SOTA results on benchmarks including SWE-Bench Pro and GPQA-Diamond.
MiMo-V2-Flash Technical Report
cs.CL 2026-01 unverdicted novelty 5.0

MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurpos...
Ling and Ring 2.6 Technical Report: Efficient and Instant Agentic Intelligence at Trillion-Parameter Scale
cs.CL 2026-06 unverdicted novelty 4.0

Technical report announcing Ling-2.6 and Ring-2.6 models with hybrid linear attention, evolutionary CoT, and KPop RL for efficient agentic intelligence at scale.
JT-SAFE-V2: Safety-by-Design Foundation Model with World-Context Data
cs.AI 2026-05 unverdicted novelty 4.0

JT-Safe-V2 is a safety-by-design LLM that reports SOTA scores on both capability and safety benchmarks while Safe-MoMA cuts inference cost over 30 percent.
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
cs.CL 2025-07 unverdicted novelty 4.0

Gemini 2.5 Pro and Flash models are presented as achieving frontier performance in reasoning, coding, and long-context multimodal tasks while spanning a cost-capability Pareto curve.
Gemma 3 Technical Report
cs.CL 2025-03 accept novelty 4.0

Gemma 3 introduces multimodal open models with architectural changes for efficient long context, trained via distillation and a new post-training recipe that makes the 4B version competitive with prior 27B models and ...