super hub Mixed citations

RoFormer: Enhanced Transformer with Rotary Position Embedding

Mixed citation behavior. Most common role is background (46%).

138 Pith papers citing it
Background 46% of classified citations
abstract

Position encoding recently has shown effective in the transformer architecture. It enables valuable supervision for dependency modeling between elements at different positions of the sequence. In this paper, we first investigate various methods to integrate positional information into the learning process of transformer-based language models. Then, we propose a novel method named Rotary Position Embedding (RoPE) to effectively leverage the positional information. Specifically, the proposed RoPE encodes the absolute position with a rotation matrix and meanwhile incorporates the explicit relative position dependency in self-attention formulation. Notably, RoPE enables valuable properties, including the flexibility of sequence length, decaying inter-token dependency with increasing relative distances, and the capability of equipping the linear self-attention with relative position encoding. Finally, we evaluate the enhanced transformer with rotary position embedding, also called RoFormer, on various long text classification benchmark datasets. Our experiments show that it consistently overcomes its alternatives. Furthermore, we provide a theoretical analysis to explain some experimental results. RoFormer is already integrated into Huggingface: https://huggingface.co/docs/transformers/model_doc/roformer.
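
The abstract's central claim, that rotating by absolute position yields attention scores that depend only on relative position, can be checked with a small sketch. This is an illustration and not the RoFormer implementation: the helper names `rope_rotate` and `dot`, the toy vectors, and the frequency schedule (base 10000 over adjacent dimension pairs) are assumptions made for the example.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate each adjacent pair (vec[2i], vec[2i+1]) by position * theta_i.

    Assumed frequency schedule: theta_i = base ** (-2i / dim).
    """
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(dim // 2):
        theta = base ** (-2.0 * i / dim)
        angle = position * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        # Standard 2D rotation of the pair (x, y) by `angle`.
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain inner product, standing in for an unnormalized attention score."""
    return sum(x * y for x, y in zip(a, b))


# Toy query and key vectors (hypothetical values).
q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.9]

# Same relative offset (3) at two very different absolute positions.
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_b = dot(rope_rotate(q, 105), rope_rotate(k, 102))

print(abs(score_a - score_b) < 1e-9)  # True: the score depends only on the offset
```

Because each pair is rotated rather than shifted, the vector norm is unchanged, and the two rotations combine inside the inner product into a single rotation by the position difference.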

citation-role summary

background 18 · method 8 · baseline 1 · dataset 1

representative citing papers

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

cs.LG · 2023-12-01 · unverdicted · novelty 8.0

Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.

Recognizing Co-Speech Gestures in-the-Wild

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

Introduces the first large-scale GRW dataset for semantic co-speech gesture classification, word recognition, and temporal localization in unconstrained videos, along with benchmarks for the three tasks.

Attention Is Not All You Need for Diffraction

cond-mat.mtrl-sci · 2026-04-26 · unverdicted · novelty 7.0

Physics-informed transformer with sin^2(theta) encoding, physics-aware positional encoding, multi-task decoder, and three-stage curriculum classifies powder diffraction into 99 extinction groups, with structured errors on symmetry subgroup hierarchy.

Video Analysis and Generation via a Semantic Progress Function

cs.CV · 2026-04-24 · unverdicted · novelty 7.0

A Semantic Progress Function is defined as a 1D curve of cumulative semantic shifts from frame embeddings, supporting a linearization procedure that retimes video sequences for constant-rate semantic evolution.

Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings

q-bio.QM · 2026-04-09 · unverdicted · novelty 7.0

Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and showing strong masked language modeling results with or without positional embeddings.

citing papers explorer

Showing 14 of 14 citing papers after filters.

  • RULER: What's the Real Context Size of Your Long-Context Language Models? cs.CL · 2024-04-09 · accept · none · ref 30 · internal anchor

    RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.

  • Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling cs.CL · 2023-04-03 · accept · none · ref 137 · internal anchor

    Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.

  • FlashNorm: Fast Normalization for Transformers cs.LG · 2024-07-12 · accept · none · ref 15 · internal anchor

    FlashNorm is an exact algebraic reformulation of RMSNorm plus linear projection that folds weights and defers normalization to allow parallel execution, plus scale-invariance simplifications that remove redundant norms in certain architectures.

  • StarCoder 2 and The Stack v2: The Next Generation cs.SE · 2024-02-29 · accept · none · ref 283 · internal anchor

    StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.

  • Efficient Streaming Language Models with Attention Sinks cs.CL · 2023-09-29 · accept · none · ref 46 · internal anchor

    StreamingLLM lets finite-window LLMs generalize to infinite-length sequences by retaining initial-token KV states as attention sinks, enabling stable streaming inference up to 4M tokens.

  • GPT-NeoX-20B: An Open-Source Autoregressive Language Model cs.CL · 2022-04-14 · accept · none · ref 91 · internal anchor

    GPT-NeoX-20B is a publicly released 20B parameter autoregressive language model trained on the Pile that shows strong gains in five-shot reasoning over similarly sized prior models.

  • PaLM: Scaling Language Modeling with Pathways cs.CL · 2022-04-05 · accept · none · ref 149 · internal anchor

    PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.

  • StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k cs.LG · 2026-05-04 · accept · none · ref 25 · internal anchor

    Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.

  • Woosh: A Sound Effects Foundation Model cs.SD · 2026-04-02 · accept · none · ref 32 · internal anchor

    Woosh is a new publicly released foundation model optimized for high-quality sound effect generation from text or video, showing competitive or better results than open alternatives like Stable Audio Open.

  • Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey cs.LG · 2024-03-21 · accept · none · ref 10 · internal anchor

    A comprehensive survey of PEFT algorithms for large models, covering their performance, overhead, applications, and real-world system implementations.

  • Gemma: Open Models Based on Gemini Research and Technology cs.CL · 2024-03-13 · accept · none · ref 97 · internal anchor

    Gemma introduces open 2B and 7B LLMs derived from Gemini technology that beat comparable open models on 11 of 18 text tasks and come with safety assessments.

  • TinyLlama: An Open-Source Small Language Model cs.CL · 2024-01-04 · accept · none · ref 33 · internal anchor

    TinyLlama is a 1.1B-parameter open-source language model pretrained on 1 trillion tokens that outperforms other open-source models of similar size on downstream tasks.

  • Large Language Models: A Survey cs.CL · 2024-02-09 · accept · none · ref 127 · internal anchor

    The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.

  • A Survey of Large Language Models cs.CL · 2023-03-31 · accept · none · ref 287 · internal anchor

    This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.