pith. sign in

super hub Mixed citations

RoFormer: Enhanced Transformer with Rotary Position Embedding

Mixed citation behavior. Most common role is background (46%).

138 Pith papers citing it
Background 46% of classified citations
abstract

Position encoding recently has shown effective in the transformer architecture. It enables valuable supervision for dependency modeling between elements at different positions of the sequence. In this paper, we first investigate various methods to integrate positional information into the learning process of transformer-based language models. Then, we propose a novel method named Rotary Position Embedding(RoPE) to effectively leverage the positional information. Specifically, the proposed RoPE encodes the absolute position with a rotation matrix and meanwhile incorporates the explicit relative position dependency in self-attention formulation. Notably, RoPE enables valuable properties, including the flexibility of sequence length, decaying inter-token dependency with increasing relative distances, and the capability of equipping the linear self-attention with relative position encoding. Finally, we evaluate the enhanced transformer with rotary position embedding, also called RoFormer, on various long text classification benchmark datasets. Our experiments show that it consistently overcomes its alternatives. Furthermore, we provide a theoretical analysis to explain some experimental results. RoFormer is already integrated into Huggingface: \url{https://huggingface.co/docs/transformers/model_doc/roformer}.

hub tools

citation-role summary

background 18 method 8 baseline 1 dataset 1

citation-polarity summary

claims ledger

  • abstract Position encoding recently has shown effective in the transformer architecture. It enables valuable supervision for dependency modeling between elements at different positions of the sequence. In this paper, we first investigate various methods to integrate positional information into the learning process of transformer-based language models. Then, we propose a novel method named Rotary Position Embedding(RoPE) to effectively leverage the positional information. Specifically, the proposed RoPE encodes the absolute position with a rotation matrix and meanwhile incorporates the explicit relative

authors

co-cited works

representative citing papers

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

cs.LG · 2023-12-01 · unverdicted · novelty 8.0

Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.

Recognizing Co-Speech Gestures in-the-Wild

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

Introduces the first large-scale GRW dataset for semantic co-speech gesture classification, word recognition, and temporal localization in unconstrained videos, along with benchmarks for the three tasks.

Attention Is Not All You Need for Diffraction

cond-mat.mtrl-sci · 2026-04-26 · unverdicted · novelty 7.0

Physics-informed transformer with sin^2(theta) encoding, physics-aware positional encoding, multi-task decoder, and three-stage curriculum classifies powder diffraction into 99 extinction groups, with structured errors on symmetry subgroup hierarchy.

Video Analysis and Generation via a Semantic Progress Function

cs.CV · 2026-04-24 · unverdicted · novelty 7.0

A Semantic Progress Function is defined as a 1D curve of cumulative semantic shifts from frame embeddings, supporting a linearization procedure that retimes video sequences for constant-rate semantic evolution.

Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings

q-bio.QM · 2026-04-09 · unverdicted · novelty 7.0

Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and showing strong masked language modeling results with or without positional embeddings.

citing papers explorer

Showing 50 of 138 citing papers.

  • Sparse Layers are Critical to Scaling Looped Language Models cs.LG · 2026-05-09 · unverdicted · none · ref 11 · internal anchor

    Looped MoE models scale better than standard transformers because different experts activate on each loop pass, recovering expressivity without extra parameters, and support superior early exits.

  • Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs cs.LG · 2026-05-07 · unverdicted · none · ref 2 · 2 links · internal anchor

    Memory Inception is a training-free method that injects latent KV banks at chosen layers to steer LLMs, achieving superior control-drift balance and up to 118x storage reduction on personality and structured-reasoning tasks.

  • Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization stat.ML · 2026-05-07 · unverdicted · none · ref 49 · internal anchor

    Spectral analysis of activations and gradients provides new diagnostics that link batch size to representation geometry, early covariance tails to token efficiency, and spectral shifts to learning dynamics in decoder-only LLMs, backed by a mechanistic model.

  • Feature Starvation as Geometric Instability in Sparse Autoencoders cs.LG · 2026-05-06 · unverdicted · none · ref 36 · internal anchor

    Adaptive elastic net SAEs (AEN-SAEs) mitigate feature starvation in SAEs by combining ℓ2 structural stability with adaptive ℓ1 reweighting, producing a Lipschitz-continuous sparse coding map that recovers global feature support under mild assumptions.

  • Spiking Sequence Machines and Transformers cs.NE · 2026-05-01 · unverdicted · none · ref 6 · internal anchor

    Spiking SDM and transformers implement identical functional operations for sequences via cosine similarity retrieval, unified by a phase-latency isomorphism between spike timing and sinusoidal positional encoding.

  • Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation cs.CV · 2026-04-28 · unverdicted · none · ref 39 · internal anchor

    Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.

  • Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models cs.CV · 2026-04-22 · unverdicted · none · ref 50 · internal anchor

    Foveated Reasoner integrates foveation as stateful actions inside the autoregressive decoding loop of vision-language models, trained via cold-start supervision then reinforcement learning to achieve higher accuracy at low token budgets.

  • HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering cs.AI · 2026-04-22 · unverdicted · none · ref 143 · internal anchor

    HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.

  • LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model cs.CV · 2026-04-22 · unverdicted · none · ref 32 · internal anchor

    LLaDA2.0-Uni unifies multimodal understanding and generation inside one discrete diffusion large language model with a semantic tokenizer, MoE backbone, and diffusion decoder.

  • Towards Real-Time ECG and EMG Modeling on $\mu$NPUs cs.LG · 2026-04-20 · unverdicted · none · ref 49 · internal anchor

    PhysioLite delivers Transformer-comparable ECG/EMG performance using learnable wavelet filters and hardware-aware design at ~370KB quantized size on μNPUs.

  • Shape: A Self-Supervised 3D Geometry Foundation Model for Industrial CAD Analysis cs.CV · 2026-04-19 · unverdicted · none · ref 7 · internal anchor

    A 10.9M-parameter self-supervised model pretrained on 61k CAD meshes achieves R²=0.729 reconstruction and 98.1% top-1 retrieval on held-out data via masked normalized geometry reconstruction and multi-resolution contrastive learning.

  • Parcae: Scaling Laws For Stable Looped Language Models cs.LG · 2026-04-14 · unverdicted · none · ref 75 · internal anchor

    Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth baselines under fixed parameter budgets.

  • LoMa: Local Feature Matching Revisited cs.CV · 2026-04-06 · unverdicted · none · ref 49 · internal anchor

    Scaling data, model size, and compute for local feature matching produces large performance gains on challenging benchmarks and a new manually annotated HardMatch dataset.

  • Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models cs.CV · 2026-02-24 · unverdicted · none · ref 39 · internal anchor

    MMHNet enables video-to-audio models trained on short clips to generalize and generate audio for videos over 5 minutes long.

  • Foundation Models for Discovery and Exploration in Chemical Space physics.chem-ph · 2025-10-20 · unverdicted · none · ref 152 · internal anchor

    MIST models up to 10x larger than prior work, fine-tuned on over 400 structure-property tasks, match or exceed SOTA on benchmarks and demonstrate zero-shot olfactory perception mapping consistent with hyperbolic geometry.

  • Short window attention enables long-term memorization cs.LG · 2025-09-29 · unverdicted · none · ref 32 · internal anchor

    Short sliding windows in hybrid attention-xLSTM models boost long-context performance by encouraging long-term memory use, and stochastic window sizing improves both short and long tasks.

  • Positional Encoding via Token-Aware Phase Attention cs.CL · 2025-09-16 · unverdicted · none · ref 14 · internal anchor

    TAPA adds a learnable phase function to attention to preserve long-range token interactions, enabling direct continual pretraining, length extrapolation, lower perplexity, and stronger retrieval than RoPE-style methods.

  • F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions cs.RO · 2025-09-08 · unverdicted · none · ref 25 · internal anchor

    F1 integrates next-scale visual foresight prediction into a Mixture-of-Transformer VLA architecture to reformulate action generation as foresight-guided inverse dynamics, achieving higher success rates on 136 tasks.

  • SpikingBrain: Spiking Brain-inspired Large Models cs.LG · 2025-09-05 · unverdicted · none · ref 31 · internal anchor

    SpikingBrain-7B and SpikingBrain-76B achieve Transformer-comparable performance after continual pre-training on 150B tokens, with over 100x TTFT speedup on 4M-token sequences and 69.15% sparsity from event-driven spiking.

  • GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning cs.CV · 2025-07-01 · unverdicted · none · ref 44 · internal anchor

    GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.

  • Learning Zero-Shot Subject-Driven Video Generation Using 1% Compute cs.CV · 2025-04-23 · unverdicted · none · ref 43 · internal anchor

    A zero-shot subject-driven video generation framework that decomposes the task into identity injection from 200K subject-image pairs and motion preservation from 4K arbitrary videos, trained in 288 A100 GPU hours on CogVideoX-5B to match prior performance at 1% compute.

  • LongVILA: Scaling Long-Context Visual Language Models for Long Videos cs.CV · 2024-08-19 · unverdicted · none · ref 23 · internal anchor

    LongVILA scales visual-language models from 8 to 2048 video frames with 99.8% needle-in-a-haystack accuracy using long-context extension, supervised fine-tuning, and multi-modal sequence parallelism on up to 256 GPUs.

  • SAM 2: Segment Anything in Images and Videos cs.CV · 2024-08-01 · conditional · none · ref 24 · internal anchor

    SAM 2 delivers more accurate video segmentation with 3x fewer user interactions and 6x faster image segmentation than the original SAM by training a streaming-memory transformer on the largest video segmentation dataset collected to date.

  • FlashNorm: Fast Normalization for Transformers cs.LG · 2024-07-12 · accept · none · ref 15 · internal anchor

    FlashNorm is an exact algebraic reformulation of RMSNorm plus linear projection that folds weights and defers normalization to allow parallel execution, plus scale-invariance simplifications that remove redundant norms in certain architectures.

  • DataComp-LM: In search of the next generation of training sets for language models cs.LG · 2024-06-17 · unverdicted · none · ref 174 · internal anchor

    DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.

  • Chameleon: Mixed-Modal Early-Fusion Foundation Models cs.CL · 2024-05-16 · unverdicted · none · ref 31 · internal anchor

    Chameleon is an early-fusion token model that handles mixed image-text sequences for understanding and generation, achieving competitive or superior performance to larger models like Llama-2, Mixtral, and Gemini-Pro on captioning, VQA, text, and image tasks.

  • StarCoder 2 and The Stack v2: The Next Generation cs.SE · 2024-02-29 · accept · none · ref 283 · internal anchor

    StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.

  • 3D Diffuser Actor: Policy Diffusion with 3D Scene Representations cs.RO · 2024-02-16 · conditional · none · ref 21 · internal anchor

    3D Diffuser Actor unifies diffusion policies with 3D scene features to set new state-of-the-art results on RLBench and CALVIN robot benchmarks.

  • Gated Linear Attention Transformers with Hardware-Efficient Training cs.LG · 2023-12-11 · unverdicted · none · ref 89 · internal anchor

    Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.

  • The Falcon Series of Open Language Models cs.CL · 2023-11-28 · conditional · none · ref 35 · internal anchor

    Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.

  • Llemma: An Open Language Model For Mathematics cs.CL · 2023-10-16 · unverdicted · none · ref 181 · internal anchor

    Continued pretraining of Code Llama on Proof-Pile-2 yields Llemma, an open math-specialized LLM that beats known open base models on MATH and supports tool use plus formal proving out of the box.

  • Efficient Streaming Language Models with Attention Sinks cs.CL · 2023-09-29 · accept · none · ref 46 · internal anchor

    StreamingLLM lets finite-window LLMs generalize to infinite-length sequences by retaining initial-token KV states as attention sinks, enabling stable streaming inference up to 4M tokens.

  • YaRN: Efficient Context Window Extension of Large Language Models cs.CL · 2023-08-31 · unverdicted · none · ref 12 · internal anchor

    YaRN extends the context window of RoPE-based LLMs like LLaMA more efficiently than prior methods, using 10x fewer tokens and 2.5x fewer steps while surpassing state-of-the-art performance and enabling extrapolation beyond fine-tuning lengths.

  • MVDream: Multi-view Diffusion for 3D Generation cs.CV · 2023-08-31 · conditional · none · ref 177 · internal anchor

    MVDream is a multi-view diffusion model that functions as a generalizable 3D prior, enabling more consistent text-to-3D generation and few-shot 3D concept learning from 2D examples.

  • Retentive Network: A Successor to Transformer for Large Language Models cs.CL · 2023-07-17 · unverdicted · none · ref 19 · internal anchor

    RetNet is a new sequence modeling architecture that delivers parallel training, constant-time inference, and competitive language modeling performance as a potential replacement for Transformers.

  • Textbooks Are All You Need cs.CL · 2023-06-20 · unverdicted · none · ref 26 · internal anchor

    A 1.3B-parameter code model trained on 7B tokens of curated textbook and synthetic data achieves 50.6% on HumanEval, indicating data quality can enable strong performance at small scale.

  • BloombergGPT: A Large Language Model for Finance cs.LG · 2023-03-30 · conditional · none · ref 109 · internal anchor

    BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.

  • GPT-NeoX-20B: An Open-Source Autoregressive Language Model cs.CL · 2022-04-14 · accept · none · ref 91 · internal anchor

    GPT-NeoX-20B is a publicly released 20B parameter autoregressive language model trained on the Pile that shows strong gains in five-shot reasoning over similarly sized prior models.

  • PaLM: Scaling Language Modeling with Pathways cs.CL · 2022-04-05 · accept · none · ref 149 · internal anchor

    PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.

  • BioVid: Autoregressive Video Generation with Biological Behavior Semantic Comprehension cs.CV · 2026-06-07 · unverdicted · none · ref 23 · internal anchor

    BioVid is a data-driven autoregressive model using 2D-encode/3D-decode tokenization and causal Transformer with EOS termination that reproduces real action duration distributions (W1 distance 1.24 frames) on NTU RGB+D drinking clips, outperforming fixed-length baselines.

  • UNISON: A Unified Sound Generation and Editing Framework via Deep LLM Fusion eess.AS · 2026-05-29 · unverdicted · none · ref 40 · internal anchor

    UNISON introduces a unified latent diffusion framework with layer-wise LLM fusion and channel-mask task encoding for multiple speech and sound generation and editing tasks.

  • Anytime Training with Schedule-Free Spectral Optimization cs.LG · 2026-05-21 · unverdicted · none · ref 77 · internal anchor

    SF-NorMuon is a new schedule-free spectral optimizer that closes the gap with tuned AdamW on 125M-772M parameter models across 1-8x Chinchilla horizons while providing stationarity guarantees.

  • Bernini: Latent Semantic Planning for Video Diffusion cs.CV · 2026-05-21 · unverdicted · none · ref 64 · internal anchor

    Bernini is a framework that uses an MLLM planner to output semantic representations for a DiT renderer to generate or edit videos, reporting SOTA benchmark performance.

  • Atoms of Thought: Universal EEG Representation Learning with Microstates cs.LG · 2026-05-19 · unverdicted · none · ref 68 · internal anchor

    Microstate tokenizer from clustered EEG signals provides universal representations that outperform traditional time- and frequency-domain features across sleep staging, emotion recognition, and motor imagery tasks.

  • A Measure-Theoretic Analysis of Reasoning: Structural Generalization and Approximation Limits cs.LG · 2026-05-19 · unverdicted · none · ref 26 · internal anchor

    Applies optimal transport to bound OOD generalization error in Transformers via Lipschitz continuity and TC^0 circuit depth lower bounds for Dyck-k backtracking, supported by evaluations on 54 configurations.

  • MeMo: Memory as a Model cs.CL · 2026-05-14 · unverdicted · none · ref 69 · internal anchor

    MeMo encodes new knowledge into a separate memory model that integrates with frozen LLMs, showing strong performance on QA benchmarks while avoiding catastrophic forgetting and working without access to model weights.

  • FactorizedHMR: A Hybrid Framework for Video Human Mesh Recovery cs.CV · 2026-05-14 · unverdicted · none · ref 55 · internal anchor

    FactorizedHMR recovers 3D human meshes from video by deterministically anchoring the torso-root then probabilistically completing distal articulations via flow-matching with geometry-aware supervision and a synthetic data pipeline.

  • Mela: Test-Time Memory Consolidation based on Transformation Hypothesis cs.CL · 2026-05-11 · unverdicted · none · ref 18 · internal anchor

    Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.

  • StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k cs.LG · 2026-05-04 · accept · none · ref 25 · internal anchor

    Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.

  • Nautile-370M: Spectral Memory Meets Attention in a Small Reasoning Model cs.LG · 2026-04-27 · unverdicted · none · ref 14 · internal anchor

    Nautile-370M is a hybrid small language model using SeqCond Attention layers alternating with transformers, with a claimed proof that the spectral operator matches full self-attention expressiveness in the continuous limit.