
Super hub · Mixed citations

RoFormer: Enhanced Transformer with Rotary Position Embedding

Mixed citation behavior. Most common role is background (46%).

182 Pith papers citing it
Background 46% of classified citations
abstract

Position encoding has recently been shown to be effective in the transformer architecture. It enables valuable supervision for dependency modeling between elements at different positions of the sequence. In this paper, we first investigate various methods to integrate positional information into the learning process of transformer-based language models. Then, we propose a novel method named Rotary Position Embedding (RoPE) to effectively leverage the positional information. Specifically, the proposed RoPE encodes the absolute position with a rotation matrix and at the same time incorporates an explicit relative position dependency in the self-attention formulation. Notably, RoPE enables valuable properties, including flexibility of sequence length, decaying inter-token dependency with increasing relative distance, and the capability of equipping linear self-attention with relative position encoding. Finally, we evaluate the enhanced transformer with rotary position embedding, also called RoFormer, on various long-text classification benchmark datasets. Our experiments show that it consistently outperforms its alternatives. Furthermore, we provide a theoretical analysis to explain some experimental results. RoFormer is already integrated into Huggingface: https://huggingface.co/docs/transformers/model_doc/roformer.
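The central claim of the abstract, that rotating queries and keys by their absolute positions makes the attention score depend only on the relative offset, can be checked with a small sketch. This is an illustration under stated assumptions, not the RoFormer implementation: the helper names `rope` and `dot` are hypothetical, and the frequency schedule `base ** (-2i/d)` with `base = 10000` is the commonly used convention, which the abstract itself does not specify.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate each consecutive pair (x[2i], x[2i+1]) by the angle
    pos * base**(-2i/d). Hypothetical helper; base=10000 is an assumption."""
    d = len(x)
    if d % 2 != 0:
        raise ValueError("dimension must be even")
    out = []
    for i in range(d // 2):
        theta = pos * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        # Standard 2-D rotation of the pair (a, b) by theta.
        out.extend([a * c - b * s, a * s + b * c])
    return out

def dot(u, v):
    """Plain inner product, standing in for one unnormalized attention score."""
    return sum(a * b for a, b in zip(u, v))

# Arbitrary illustrative query and key vectors (not taken from the paper).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# The same relative offset (3) at two different absolute positions.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 105), rope(k, 102))

print(abs(score_a - score_b) < 1e-9)
```

Because each pair is only rotated, the vector norm is unchanged, and the two scores agree up to floating-point error even though the absolute positions differ by 100.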


citation-role summary

background 18 · method 8 · baseline 1 · dataset 1




representative citing papers

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

cs.LG · 2023-12-01 · unverdicted · novelty 8.0

Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.

End-to-End Text Line Detection and Ordering

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

Orli is an autoregressive image-to-sequence model that jointly detects text lines and determines their reading order on historical documents via chord-frame baselines, trained on 196k pages across ten scripts.

Recognizing Co-Speech Gestures in-the-Wild

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

Introduces the first large-scale GRW dataset for semantic co-speech gesture classification, word recognition, and temporal localization in unconstrained videos, along with benchmarks for the three tasks.

citing papers explorer

Showing 12 of 12 citing papers after filters.

  • RULER: What's the Real Context Size of Your Long-Context Language Models? cs.CL · 2024-04-09 · accept · none · ref 30 · internal anchor

    RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.

  • DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads cs.CL · 2024-10-14 · conditional · none · ref 42 · internal anchor

    DuoAttention identifies retrieval heads requiring full KV cache and streaming heads using constant-length cache to reduce memory and latency in long-context LLM inference.

  • Massive Activations in Large Language Models cs.CL · 2024-02-27 · unverdicted · none · ref 149 · internal anchor

    Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.

  • LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens cs.CL · 2024-02-21 · unverdicted · none · ref 12 · internal anchor

    LongRoPE extends LLM context windows to 2048k tokens via search for non-uniform positional interpolation, progressive fine-tuning from 256k, and short-context readjustment.

  • Chameleon: Mixed-Modal Early-Fusion Foundation Models cs.CL · 2024-05-16 · unverdicted · none · ref 31 · internal anchor

    Chameleon is an early-fusion token model that handles mixed image-text sequences for understanding and generation, achieving competitive or superior performance to larger models like Llama-2, Mixtral, and Gemini-Pro on captioning, VQA, text, and image tasks.

  • Gemma: Open Models Based on Gemini Research and Technology cs.CL · 2024-03-13 · accept · none · ref 97 · internal anchor

    Gemma introduces open 2B and 7B LLMs derived from Gemini technology that beat comparable open models on 11 of 18 text tasks and come with safety assessments.

  • Yi: Open Foundation Models by 01.AI cs.CL · 2024-03-07 · unverdicted · none · ref 75 · internal anchor

    Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.

  • TinyLlama: An Open-Source Small Language Model cs.CL · 2024-01-04 · accept · none · ref 33 · internal anchor

    TinyLlama is a 1.1B-parameter open-source language model pretrained on 1 trillion tokens that outperforms other open-source models of similar size on downstream tasks.

  • Gemma 2: Improving Open Language Models at a Practical Size cs.CL · 2024-07-31 · conditional · none · ref 108 · internal anchor

    Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.

  • ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools cs.CL · 2024-06-18 · unverdicted · none · ref 38 · internal anchor

    GLM-4 models rival or exceed GPT-4 on MMLU, GSM8K, MATH, BBH, GPQA, HumanEval, IFEval, long-context tasks, and Chinese alignment while adding autonomous tool use for web, code, and image generation.

  • Large Language Models: A Survey cs.CL · 2024-02-09 · accept · none · ref 127 · internal anchor

    The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.

  • Lessons from the Trenches on Reproducible Evaluation of Language Models cs.CL · 2024-05-23 · unreviewed · ref 42 · internal anchor