GRAPE unifies RoPE and ALiBi as special cases of group actions on positions, providing a principled design space for positional encodings via SO(d) rotations and GL unipotent transformations.
Title resolution pending
20 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
representative citing papers
Infini-attention combines compressive memory with masked local attention and long-term linear attention inside each Transformer block to support infinite context length with bounded resources.
PolicyLong shifts long-context data synthesis to an on-policy loop that re-screens contexts using the evolving model's entropy landscape, producing a self-curriculum that outperforms static offline baselines with larger gains at longer lengths.
A realistic scene synthesis strategy and document-aware training recipe enable a 1B-parameter MLLM to achieve superior accuracy and robustness in end-to-end parsing of real-world captured documents.
Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.
TAPA adds a learnable phase function to attention to preserve long-range token interactions, enabling direct continual pretraining, length extrapolation, lower perplexity, and stronger retrieval than RoPE-style methods.
MemAgent uses multi-conversation RL to train a memory agent that reads text in segments and overwrites memory, extrapolating from 8K training to 3.5M token QA with under 5% loss and 95%+ on 512K RULER.
MoBA routes attention over blocks via MoE-style gating to enable dynamic, bias-light long-context attention that matches full attention performance at lower cost.
Qwen2.5-1M models reach 1M token context with improved long-context performance, no short-context loss, and 3-7x prefill speedup via open inference optimizations.
CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.
DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
An 8B Mamba-2-Hybrid with 43% Mamba-2, 7% attention, and 50% MLP layers exceeds an 8B Transformer by 2.65 points on average across 12 tasks and matches it on 23 long-context tasks while enabling up to 8x faster inference.
MiniCPM 1.2B and 2.4B models reach parity with 7B-13B LLMs via model wind-tunnel scaling and a WSD scheduler that yields a higher optimal data-to-model ratio than Chinchilla scaling.
HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, plus up to 1.85x prefill speedup and 1.37x/1.77x speedups with magnitude pruning and
Pith review generated a malformed one-line summary.
Table-specific pretraining of Llama-2 yields significant gains on zero-shot, few-shot, and in-context tabular prediction tasks over prior benchmarks.
InternLM2 is a new open-source LLM that outperforms prior versions on 30 benchmarks and long-context tasks through scaled pre-training to 32k tokens and a conditional online RLHF alignment strategy.
Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.
Qwen2.5 LLMs scale pre-training data to 18 trillion tokens and apply multistage reinforcement learning, achieving competitive performance on benchmarks with models up to 5 times larger.
GLM-4 models rival or exceed GPT-4 on MMLU, GSM8K, MATH, BBH, GPQA, HumanEval, IFEval, long-context tasks, and Chinese alignment while adding autonomous tool use for web, code, and image generation.
citing papers explorer
-
Group Representational Position Encoding
GRAPE unifies RoPE and ALiBi as special cases of group actions on positions, providing a principled design space for positional encodings via SO(d) rotations and GL unipotent transformations.
-
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
Infini-attention combines compressive memory with masked local attention and long-term linear attention inside each Transformer block to support infinite context length with bounded resources.
-
PolicyLong: Towards On-Policy Context Extension
PolicyLong shifts long-context data synthesis to an on-policy loop that re-screens contexts using the evolving model's entropy landscape, producing a self-curriculum that outperforms static offline baselines with larger gains at longer lengths.
-
Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training
A realistic scene synthesis strategy and document-aware training recipe enable a 1B-parameter MLLM to achieve superior accuracy and robustness in end-to-end parsing of real-world captured documents.
-
Kimi Linear: An Expressive, Efficient Attention Architecture
Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.
-
Positional Encoding via Token-Aware Phase Attention
TAPA adds a learnable phase function to attention to preserve long-range token interactions, enabling direct continual pretraining, length extrapolation, lower perplexity, and stronger retrieval than RoPE-style methods.
-
MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent
MemAgent uses multi-conversation RL to train a memory agent that reads text in segments and overwrites memory, extrapolating from 8K training to 3.5M token QA with under 5% loss and 95%+ on 512K RULER.
-
MoBA: Mixture of Block Attention for Long-Context LLMs
MoBA routes attention over blocks via MoE-style gating to enable dynamic, bias-light long-context attention that matches full attention performance at lower cost.
-
Qwen2.5-1M Technical Report
Qwen2.5-1M models reach 1M token context with improved long-context performance, no short-context loss, and 3-7x prefill speedup via open inference optimizations.
-
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.
-
DataComp-LM: In search of the next generation of training sets for language models
DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
-
An Empirical Study of Mamba-based Language Models
An 8B Mamba-2-Hybrid with 43% Mamba-2, 7% attention, and 50% MLP layers exceeds an 8B Transformer by 2.65 points on average across 12 tasks and matches it on 23 long-context tasks while enabling up to 8x faster inference.
-
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
MiniCPM 1.2B and 2.4B models reach parity with 7B-13B LLMs via model wind-tunnel scaling and a WSD scheduler that yields a higher optimal data-to-model ratio than Chinchilla scaling.
-
HieraSparse: Hierarchical Semi-Structured Sparse KV Attention
HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, plus up to 1.85x prefill speedup and 1.37x/1.77x speedups with magnitude pruning and
-
Qwen3 Technical Report
Pith review generated a malformed one-line summary.
-
Unlock the Potential of Large Language Models for Predictive Tabular Tasks in Data Science with Table-Specific Pretraining
Table-specific pretraining of Llama-2 yields significant gains on zero-shot, few-shot, and in-context tabular prediction tasks over prior benchmarks.
-
InternLM2 Technical Report
InternLM2 is a new open-source LLM that outperforms prior versions on 30 benchmarks and long-context tasks through scaled pre-training to 32k tokens and a conditional online RLHF alignment strategy.
-
Yi: Open Foundation Models by 01.AI
Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.
-
Qwen2.5 Technical Report
Qwen2.5 LLMs scale pre-training data to 18 trillion tokens and apply multistage reinforcement learning, achieving competitive performance on benchmarks with models up to 5 times larger.
-
ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools
GLM-4 models rival or exceed GPT-4 on MMLU, GSM8K, MATH, BBH, GPQA, HumanEval, IFEval, long-context tasks, and Chinese alignment while adding autonomous tool use for web, code, and image generation.