{"work":{"id":"4e5eee26-cd04-4c7a-988f-3e6d1a1f0eb9","openalex_id":null,"doi":null,"arxiv_id":"2104.09864","raw_key":null,"title":"RoFormer: Enhanced Transformer with Rotary Position Embedding","authors":null,"authors_text":"Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, Yunfeng Liu","year":2021,"venue":"cs.CL","abstract":"Position encoding recently has shown effective in the transformer architecture. It enables valuable supervision for dependency modeling between elements at different positions of the sequence. In this paper, we first investigate various methods to integrate positional information into the learning process of transformer-based language models. Then, we propose a novel method named Rotary Position Embedding(RoPE) to effectively leverage the positional information. Specifically, the proposed RoPE encodes the absolute position with a rotation matrix and meanwhile incorporates the explicit relative position dependency in self-attention formulation. Notably, RoPE enables valuable properties, including the flexibility of sequence length, decaying inter-token dependency with increasing relative distances, and the capability of equipping the linear self-attention with relative position encoding. Finally, we evaluate the enhanced transformer with rotary position embedding, also called RoFormer, on various long text classification benchmark datasets. Our experiments show that it consistently overcomes its alternatives. Furthermore, we provide a theoretical analysis to explain some experimental results. 
RoFormer is already integrated into Huggingface: \\url{https://huggingface.co/docs/transformers/model_doc/roformer}.","external_url":"https://arxiv.org/abs/2104.09864","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-06-29T07:43:13.682272+00:00","pith_arxiv_id":"2104.09864","created_at":"2026-05-08T20:09:09.940464+00:00","updated_at":"2026-06-29T07:43:13.682272+00:00","title_quality_ok":true,"display_title":"RoFormer: Enhanced Transformer with Rotary Position Embedding","render_title":"RoFormer: Enhanced Transformer with Rotary Position Embedding"},"hub":{"state":{"work_id":"4e5eee26-cd04-4c7a-988f-3e6d1a1f0eb9","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":138,"external_cited_by_count":null,"distinct_field_count":19,"first_pith_cited_at":"2022-04-05T16:11:45+00:00","last_pith_cited_at":"2026-06-23T03:33:25+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-29T10:48:36.792424+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":18},{"context_role":"method","n":8},{"context_role":"baseline","n":1},{"context_role":"dataset","n":1}],"polarity_counts":[{"context_polarity":"background","n":13},{"context_polarity":"use_method","n":8},{"context_polarity":"unclear","n":4},{"context_polarity":"baseline","n":1},{"context_polarity":"support","n":1},{"context_polarity":"use_dataset","n":1}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"RoFormer: Enhanced Transformer with Rotary Position Embedding","claims":[{"claim_text":"Position encoding recently has shown effective in the transformer architecture. It enables valuable supervision for dependency modeling between elements at different positions of the sequence. 
In this paper, we first investigate various methods to integrate positional information into the learning process of transformer-based language models. Then, we propose a novel method named Rotary Position Embedding(RoPE) to effectively leverage the positional information. Specifically, the proposed RoPE encodes the absolute position with a rotation matrix and meanwhile incorporates the explicit relative","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"dimensional tensor x ∈ R^{b×l×d} as the input tokens. The input tokens are first multiplied with three weight matrices W_Q, W_K, and W_V, producing the output referred to as query (Q), key (K) and value (V). Given the MSA module's inability to recognize positional data and the inherent auto-regressive nature of LLMs, the query and key will undergo a process using Rotary Positional Embedding [10] (RoPE, denoted as R(·) in Eq 1) to encode the position information. Subsequently, the key and value will ","claim_type":"method","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"paradigms such as Retrieval-Augmented Generation (RAG) [8], In-Context Learning [9], and memory-augmented agentic systems [10] require models to process vast amounts of retrieved documents, extensive demonstrations, or long-term interaction histories. These demands are pushing context window requirements to hundreds of thousands or even millions of tokens [11]-[13], imposing significant pressure on both computation and memory. Computation. This massive increase in sequence length exposes the qu","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"directions. Architecture. The GLM family of LLMs is built on Transformer [43]. In GLM-130B [53], we explored various options to stabilize its pre-training by taking into account the hardware constraints we faced at the time. 
Specifically, GLM-130B leveraged DeepNorm [44] as the layer normalization strategy and used Rotary Positional Encoding (RoPE) [38] as well as the Gated Linear Unit [36] with GeLU [15] activation function in FFNs. Throughout our exploration, we have investigated different s","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"features and 3D coordinates into a voxelized cognitive map. CDIF then alternates between intra-map reasoning and map reading to inject spatial knowledge back into semantic tokens. Both stages are guided by coordinate embeddings and 3D RoPE to preserve metric 3D relationships. to the voxel feature v_j^(l), then apply 3D Continuous Rotary Positional Embedding (3D RoPE) [50] to make the queries and keys coordinate-aware. This enables attention to model relative spatial relationships in metric 3D sp","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"former [22], which is placed between residual blocks. However, existing work has found that the training of Transformers with post-LN tends to be instable due to the large gradients near the output layer [286]. Thus, post-LN is rarely employed in existing LLMs except combined with other strategies (e.g., combining post-LN with pre-LN in GLM-130B [93]). • Pre-LN. Different from post-LN, pre-LN [287] is applied before each sub-layer, and an additional LN is placed before the final prediction. Co","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"can capture information about the relative position differences between input elements. A clipping distance, represented as k 2 ≤ k ≤ n − 4, specifies the maximum limit on relative locations. This allows the model to make reasonable predictions for sequence lengths that are not part of the training data. 3) Rotary Position Embeddings: Rotary Positional Embedding (RoPE) [127] tackles problems with existing approaches. 
Learned absolute positional encodings can lack generalizability and me","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks RoFormer: Enhanced Transformer with Rotary Position Embedding because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (12 contexts).","role_counts":[{"n":12,"context_role":"background"},{"n":7,"context_role":"method"},{"n":1,"context_role":"dataset"}]},"error":null,"updated_at":"2026-05-18T12:20:51.683687+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"9fe5a313-73b2-4ef4-9c62-89c0e36bd922","orcid":null,"display_name":"Jianlin Su"},{"id":"8af5aa5f-5370-4102-a6a6-7228d5c760e3","orcid":null,"display_name":"Yu Lu"},{"id":"95957877-4198-43db-8cb6-33964ea10e9b","orcid":null,"display_name":"Shengfeng Pan"},{"id":"e9997095-6ce3-46b5-a80b-54f6145d625e","orcid":null,"display_name":"Ahmed Murtadha"},{"id":"de154aec-7467-43c1-a60f-b45ec99fe8bf","orcid":null,"display_name":"Bo Wen"},{"id":"9080370f-7fc6-4249-ac58-b129ca901188","orcid":null,"display_name":"Yunfeng Liu"}]},"error":null,"updated_at":"2026-05-18T12:20:51.674352+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T10:08:42.510920+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"GLU Variants Improve Transformer","work_id":"17d0763c-1016-41ab-a478-478e890765eb","shared_citers":20},{"title":"Attention Is All You Need","work_id":"baafb5a2-5272-43bc-932f-09fa9ffe5316","shared_citers":13},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":12},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":12},{"title":"Think you have Solved 
Question Answering? Try ARC, the AI2 Reasoning Challenge","work_id":"28ea1282-d657-4c61-a83c-f1249be6d6b1","shared_citers":12},{"title":"Mamba: Linear-Time Sequence Modeling with Selective State Spaces","work_id":"4ee75248-1199-492c-a52f-6661e0f4adff","shared_citers":11},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":11},{"title":"Fast Transformer Decoding: One Write-Head is All You Need","work_id":"160ea164-b1d4-4adb-8ccb-a4655d8a0bb4","shared_citers":10},{"title":"PaLM: Scaling Language Modeling with Pathways","work_id":"a94f3ef7-2c49-4445-93fe-6ec16aafd966","shared_citers":10},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":10},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":9},{"title":"Program Synthesis with Large Language Models","work_id":"fd241a05-03b9-4de2-9588-9d77ce176125","shared_citers":9},{"title":"Training Compute-Optimal Large Language Models","work_id":"b2faf28d-86b7-429c-bc42-469458efc246","shared_citers":9},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":8},{"title":"The Pile: An 800GB Dataset of Diverse Text for Language Modeling","work_id":"9b10667a-da61-4358-aceb-10578234d45d","shared_citers":8},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":7},{"title":"doi: 10.18653/v1/D18-2012","work_id":"81a6320b-c2e1-4d74-a03e-9e1ff6bbed8d","shared_citers":7},{"title":"Extending Context Window of Large Language Models via Positional Interpolation","work_id":"c8b6df85-e7da-4bd8-90a4-d309cc2a0f60","shared_citers":7},{"title":"Gaussian Error Linear Units (GELUs)","work_id":"0466fd22-03a1-4a61-af0a-a900e77bb023","shared_citers":7},{"title":"Layer 
Normalization","work_id":"20a2d720-0046-4c7c-bcd6-327ec8143f69","shared_citers":7},{"title":"Longformer: The Long-Document Transformer","work_id":"abea7a44-6668-4de7-aab6-f53a6e5aa088","shared_citers":7},{"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","shared_citers":6},{"title":"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale","work_id":"e96730e3-129b-4db6-b981-15ab7932e297","shared_citers":6},{"title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding","work_id":"ed240a10-5b19-406c-baa5-30803f465785","shared_citers":6}],"time_series":[{"n":1,"year":2022},{"n":8,"year":2023},{"n":12,"year":2024},{"n":2,"year":2025},{"n":39,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T10:18:41.189098+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T10:08:44.811717+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"RoFormer: Enhanced Transformer with Rotary Position Embedding","claims":[{"claim_text":"Position encoding recently has shown effective in the transformer architecture. It enables valuable supervision for dependency modeling between elements at different positions of the sequence. In this paper, we first investigate various methods to integrate positional information into the learning process of transformer-based language models. Then, we propose a novel method named Rotary Position Embedding(RoPE) to effectively leverage the positional information. 
Specifically, the proposed RoPE encodes the absolute position with a rotation matrix and meanwhile incorporates the explicit relative","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"dimensional tensor x ∈ R^{b×l×d} as the input tokens. The input tokens are first multiplied with three weight matrices W_Q, W_K, and W_V, producing the output referred to as query (Q), key (K) and value (V). Given the MSA module's inability to recognize positional data and the inherent auto-regressive nature of LLMs, the query and key will undergo a process using Rotary Positional Embedding [10] (RoPE, denoted as R(·) in Eq 1) to encode the position information. Subsequently, the key and value will ","claim_type":"method","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"paradigms such as Retrieval-Augmented Generation (RAG) [8], In-Context Learning [9], and memory-augmented agentic systems [10] require models to process vast amounts of retrieved documents, extensive demonstrations, or long-term interaction histories. These demands are pushing context window requirements to hundreds of thousands or even millions of tokens [11]-[13], imposing significant pressure on both computation and memory. Computation. This massive increase in sequence length exposes the qu","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"directions. Architecture. The GLM family of LLMs is built on Transformer [43]. In GLM-130B [53], we explored various options to stabilize its pre-training by taking into account the hardware constraints we faced at the time. Specifically, GLM-130B leveraged DeepNorm [44] as the layer normalization strategy and used Rotary Positional Encoding (RoPE) [38] as well as the Gated Linear Unit [36] with GeLU [15] activation function in FFNs. 
Throughout our exploration, we have investigated different s","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"features and 3D coordinates into a voxelized cognitive map. CDIF then alternates between intra-map reasoning and map reading to inject spatial knowledge back into semantic tokens. Both stages are guided by coordinate embeddings and 3D RoPE to preserve metric 3D relationships. to the voxel feature v_j^(l), then apply 3D Continuous Rotary Positional Embedding (3D RoPE) [50] to make the queries and keys coordinate-aware. This enables attention to model relative spatial relationships in metric 3D sp","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"former [22], which is placed between residual blocks. However, existing work has found that the training of Transformers with post-LN tends to be instable due to the large gradients near the output layer [286]. Thus, post-LN is rarely employed in existing LLMs except combined with other strategies (e.g., combining post-LN with pre-LN in GLM-130B [93]). • Pre-LN. Different from post-LN, pre-LN [287] is applied before each sub-layer, and an additional LN is placed before the final prediction. Co","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"can capture information about the relative position differences between input elements. A clipping distance, represented as k 2 ≤ k ≤ n − 4, specifies the maximum limit on relative locations. This allows the model to make reasonable predictions for sequence lengths that are not part of the training data. 3) Rotary Position Embeddings: Rotary Positional Embedding (RoPE) [127] tackles problems with existing approaches. 
Learned absolute positional encodings can lack generalizability and me","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks RoFormer: Enhanced Transformer with Rotary Position Embedding because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (12 contexts).","role_counts":[{"n":12,"context_role":"background"},{"n":7,"context_role":"method"},{"n":1,"context_role":"dataset"}]},"error":null,"updated_at":"2026-05-18T12:20:51.680167+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"RoFormer: Enhanced Transformer with Rotary Position Embedding","claims":[{"claim_text":"Position encoding recently has shown effective in the transformer architecture. It enables valuable supervision for dependency modeling between elements at different positions of the sequence. In this paper, we first investigate various methods to integrate positional information into the learning process of transformer-based language models. Then, we propose a novel method named Rotary Position Embedding(RoPE) to effectively leverage the positional information. Specifically, the proposed RoPE encodes the absolute position with a rotation matrix and meanwhile incorporates the explicit relative","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks RoFormer: Enhanced Transformer with Rotary Position Embedding because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T10:18:47.226401+00:00"}},"summary":{"title":"RoFormer: Enhanced Transformer with Rotary Position Embedding","claims":[{"claim_text":"Position encoding recently has shown effective in the transformer architecture. It enables valuable supervision for dependency modeling between elements at different positions of the sequence. 
In this paper, we first investigate various methods to integrate positional information into the learning process of transformer-based language models. Then, we propose a novel method named Rotary Position Embedding(RoPE) to effectively leverage the positional information. Specifically, the proposed RoPE encodes the absolute position with a rotation matrix and meanwhile incorporates the explicit relative","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks RoFormer: Enhanced Transformer with Rotary Position Embedding because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"GLU Variants Improve Transformer","work_id":"17d0763c-1016-41ab-a478-478e890765eb","shared_citers":20},{"title":"Attention Is All You Need","work_id":"baafb5a2-5272-43bc-932f-09fa9ffe5316","shared_citers":13},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":12},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":12},{"title":"Think you have Solved Question Answering? 
Try ARC, the AI2 Reasoning Challenge","work_id":"28ea1282-d657-4c61-a83c-f1249be6d6b1","shared_citers":12},{"title":"Mamba: Linear-Time Sequence Modeling with Selective State Spaces","work_id":"4ee75248-1199-492c-a52f-6661e0f4adff","shared_citers":11},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":11},{"title":"Fast Transformer Decoding: One Write-Head is All You Need","work_id":"160ea164-b1d4-4adb-8ccb-a4655d8a0bb4","shared_citers":10},{"title":"PaLM: Scaling Language Modeling with Pathways","work_id":"a94f3ef7-2c49-4445-93fe-6ec16aafd966","shared_citers":10},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":10},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":9},{"title":"Program Synthesis with Large Language Models","work_id":"fd241a05-03b9-4de2-9588-9d77ce176125","shared_citers":9},{"title":"Training Compute-Optimal Large Language Models","work_id":"b2faf28d-86b7-429c-bc42-469458efc246","shared_citers":9},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":8},{"title":"The Pile: An 800GB Dataset of Diverse Text for Language Modeling","work_id":"9b10667a-da61-4358-aceb-10578234d45d","shared_citers":8},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":7},{"title":"doi: 10.18653/v1/D18-2012","work_id":"81a6320b-c2e1-4d74-a03e-9e1ff6bbed8d","shared_citers":7},{"title":"Extending Context Window of Large Language Models via Positional Interpolation","work_id":"c8b6df85-e7da-4bd8-90a4-d309cc2a0f60","shared_citers":7},{"title":"Gaussian Error Linear Units (GELUs)","work_id":"0466fd22-03a1-4a61-af0a-a900e77bb023","shared_citers":7},{"title":"Layer 
Normalization","work_id":"20a2d720-0046-4c7c-bcd6-327ec8143f69","shared_citers":7},{"title":"Longformer: The Long-Document Transformer","work_id":"abea7a44-6668-4de7-aab6-f53a6e5aa088","shared_citers":7},{"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","shared_citers":6},{"title":"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale","work_id":"e96730e3-129b-4db6-b981-15ab7932e297","shared_citers":6},{"title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding","work_id":"ed240a10-5b19-406c-baa5-30803f465785","shared_citers":6}],"time_series":[{"n":1,"year":2022},{"n":8,"year":2023},{"n":12,"year":2024},{"n":2,"year":2025},{"n":39,"year":2026}],"dependency_candidates":[]},"authors":[{"id":"e9997095-6ce3-46b5-a80b-54f6145d625e","orcid":null,"display_name":"Ahmed Murtadha","source":"manual","import_confidence":0.72},{"id":"de154aec-7467-43c1-a60f-b45ec99fe8bf","orcid":null,"display_name":"Bo Wen","source":"manual","import_confidence":0.72},{"id":"9fe5a313-73b2-4ef4-9c62-89c0e36bd922","orcid":null,"display_name":"Jianlin Su","source":"manual","import_confidence":0.72},{"id":"95957877-4198-43db-8cb6-33964ea10e9b","orcid":null,"display_name":"Shengfeng Pan","source":"manual","import_confidence":0.72},{"id":"8af5aa5f-5370-4102-a6a6-7228d5c760e3","orcid":null,"display_name":"Yu Lu","source":"manual","import_confidence":0.72},{"id":"9080370f-7fc6-4249-ac58-b129ca901188","orcid":null,"display_name":"Yunfeng Liu","source":"manual","import_confidence":0.72}]}}