Data engineering for scaling language models to 128k context. arXiv preprint arXiv:2402.10171

Fu, Y · 2024 · arXiv 2402.10171

Canonical reference. 100% of citing Pith papers cite this work as background.

16 Pith papers citing it

Background 100% of classified citations

citation-role summary

background 5

citation-polarity summary

background 5

representative citing papers

RULER: What's the Real Context Size of Your Long-Context Language Models?

cs.CL · 2024-04-09 · accept · novelty 8.0

RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.

RoPE Distinguishes Neither Positions Nor Tokens in Long Contexts, Provably

cs.CL · 2026-05-15 · conditional · novelty 7.0

Proves that RoPE attention loses locality bias and token distinction in long contexts, approaching random behavior independent of content.

MLVU: Benchmarking Multi-task Long Video Understanding

cs.CV · 2024-06-06 · conditional · novelty 7.0

MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.

Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

cs.CL · 2024-04-10 · conditional · novelty 7.0

Infini-attention combines compressive memory with masked local attention and long-term linear attention inside each Transformer block to support infinite context length with bounded resources.

Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context

cs.CV · 2026-05-13 · unverdicted · novelty 6.0

Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further training.

Where Does Long-Context Supervision Actually Go? Effective-Context Exposure Balancing

cs.CL · 2026-05-11 · conditional · novelty 6.0

EXACT re-allocates training supervision by inverse frequency of long effective-context targets, improving NoLiMa and RULER scores by 5-18 points on Qwen and LLaMA models without degrading standard QA or reasoning.

Learning to Route Queries to Heads for Attention-based Re-ranking with Large Language Models

cs.IR · 2026-04-27 · conditional · novelty 6.0

RouteHead trains a lightweight router to dynamically select optimal LLM attention heads per query for improved attention-based document re-ranking.

Stacked from One: Multi-Scale Self-Injection for Context Window Extension

cs.CL · 2026-03-05 · unverdicted · novelty 6.0

SharedLLM stacks two copies of a short-context LLM so the lower one compresses context into query-aware multi-grained tokens that are injected only at the lowest layers of the upper one, enabling generalization from 8K training to 128K+ inputs.

HeteroCache: A Dynamic Retrieval Approach to Heterogeneous KV Cache Compression for Long-Context LLM Inference

cs.CL · 2026-01-20 · unverdicted · novelty 6.0

HeteroCache dynamically allocates KV cache space to attention heads based on their temporal stability and uses hierarchical asynchronous retrieval to achieve state-of-the-art long-context performance with up to 3x faster decoding at 224K context length.

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

cs.LG · 2025-04-28 · unverdicted · novelty 6.0

TurboQuant achieves near-optimal vector quantization distortion for both MSE and inner products via random rotation and per-coordinate scalar quantization, with a formal proof that it matches lower bounds within a factor of approximately 2.7.

Shuffle the Context: RoPE-Perturbed Self-Distillation for Long-Context Adaptation

cs.CL · 2026-04-15 · unverdicted · novelty 5.0

RoPE-Perturbed Self-Distillation improves positional robustness during long-context fine-tuning of LLMs by training models to produce consistent outputs across RoPE-perturbed views of the input.

Self-Describing Structured Data with Dual-Layer Guidance: A Lightweight Alternative to RAG for Precision Retrieval in Large-Scale LLM Knowledge Navigation

cs.CL · 2026-03-28 · unverdicted · novelty 5.0

SDSR places human metadata at file primacy and combines it with prompt routing rules to reach 100% primary category accuracy on a 119-category benchmark, far above the 65% no-guidance baseline.

Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

cs.CL · 2024-12-18 · unverdicted · novelty 5.0

ModernBERT is a new bidirectional encoder model achieving SOTA performance on diverse classification and retrieval benchmarks while offering superior speed and memory efficiency for long-context inference.

Unlock the Potential of Large Language Models for Predictive Tabular Tasks in Data Science with Table-Specific Pretraining

cs.LG · 2024-03-29 · unverdicted · novelty 5.0

Table-specific pretraining of Llama-2 yields significant gains on zero-shot, few-shot, and in-context tabular prediction tasks over prior benchmarks.

Tokalator: A Context Engineering Toolkit for Artificial Intelligence Coding Assistants

cs.SE · 2026-04-09 · unverdicted · novelty 4.0

Tokalator is a toolkit with VS Code extension, calculators, and community resources to monitor and optimize token usage in AI coding environments.

Yi: Open Foundation Models by 01.AI

cs.CL · 2024-03-07 · unverdicted · novelty 4.0

Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.

citing papers explorer

Showing 5 of 5 citing papers after filters.

MLVU: Benchmarking Multi-task Long Video Understanding

cs.CV · 2024-06-06 · conditional · none · ref 14

MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.

Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context

cs.CV · 2026-05-13 · unverdicted · none · ref 35

Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further training.

Where Does Long-Context Supervision Actually Go? Effective-Context Exposure Balancing

cs.CL · 2026-05-11 · conditional · none · ref 12

EXACT re-allocates training supervision by inverse frequency of long effective-context targets, improving NoLiMa and RULER scores by 5-18 points on Qwen and LLaMA models without degrading standard QA or reasoning.

Learning to Route Queries to Heads for Attention-based Re-ranking with Large Language Models

cs.IR · 2026-04-27 · conditional · none · ref 7

RouteHead trains a lightweight router to dynamically select optimal LLM attention heads per query for improved attention-based document re-ranking.

Yi: Open Foundation Models by 01.AI

cs.CL · 2024-03-07 · unverdicted · none · ref 23

Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.
