RULER: What’s the real context size of your long-context language models? InFirst Conference on Language Modeling

Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Boris Ginsburg · 2024

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

browse 4 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention

cs.CL · 2026-05-18 · unverdicted · novelty 6.0

DashAttention introduces differentiable adaptive sparse hierarchical attention via α-entmax block selection, achieving full-attention accuracy at 75% sparsity with improved Pareto performance over NSA and InfLLMv2.

APWA: A Distributed Architecture for Parallelizable Agentic Workflows

cs.AI · 2026-05-14 · unverdicted · novelty 6.0

APWA is a distributed multi-agent architecture that decomposes parallelizable agentic workflows into non-interfering subproblems for scalable execution on heterogeneous resources.

Self-Consolidating Language Models: Continual Knowledge Incorporation from Context

cs.CL · 2026-05-08 · unverdicted · novelty 6.0 · 2 refs

SCoL trains LLMs via meta-reinforcement learning to generate layer-specific update instructions that improve knowledge acquisition and retention from context streams over standard baselines.

MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training

cs.CL · 2025-10-21 · conditional · novelty 6.0

MTraining scales LLM training to 512K-token contexts on 32 A100 GPUs by integrating dynamic sparse training patterns with balanced and hierarchical sparse ring attention, achieving up to 6x throughput gains without accuracy loss on long-context benchmarks.

citing papers explorer

Showing 4 of 4 citing papers.

DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention cs.CL · 2026-05-18 · unverdicted · none · ref 20
DashAttention introduces differentiable adaptive sparse hierarchical attention via α-entmax block selection, achieving full-attention accuracy at 75% sparsity with improved Pareto performance over NSA and InfLLMv2.
APWA: A Distributed Architecture for Parallelizable Agentic Workflows cs.AI · 2026-05-14 · unverdicted · none · ref 22
APWA is a distributed multi-agent architecture that decomposes parallelizable agentic workflows into non-interfering subproblems for scalable execution on heterogeneous resources.
Self-Consolidating Language Models: Continual Knowledge Incorporation from Context cs.CL · 2026-05-08 · unverdicted · none · ref 2 · 2 links
SCoL trains LLMs via meta-reinforcement learning to generate layer-specific update instructions that improve knowledge acquisition and retention from context streams over standard baselines.
MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training cs.CL · 2025-10-21 · conditional · none · ref 22
MTraining scales LLM training to 512K-token contexts on 32 A100 GPUs by integrating dynamic sparse training patterns with balanced and hierarchical sparse ring attention, achieving up to 6x throughput gains without accuracy loss on long-context benchmarks.

RULER: What’s the real context size of your long-context language models? InFirst Conference on Language Modeling

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer