super hub Mixed citations

RoFormer: Enhanced Transformer with Rotary Position Embedding

Ahmed Murtadha, Bo Wen, Jianlin Su, Shengfeng Pan, Yu Lu, Yunfeng Liu · 2021 · cs.CL · arXiv 2104.09864

Mixed citation behavior. Most common role is background (46%).

139 Pith papers citing it

Background 46% of classified citations

open full Pith review browse 139 citing papers more from Ahmed Murtadha arXiv PDF

abstract

Position encoding recently has shown effective in the transformer architecture. It enables valuable supervision for dependency modeling between elements at different positions of the sequence. In this paper, we first investigate various methods to integrate positional information into the learning process of transformer-based language models. Then, we propose a novel method named Rotary Position Embedding(RoPE) to effectively leverage the positional information. Specifically, the proposed RoPE encodes the absolute position with a rotation matrix and meanwhile incorporates the explicit relative position dependency in self-attention formulation. Notably, RoPE enables valuable properties, including the flexibility of sequence length, decaying inter-token dependency with increasing relative distances, and the capability of equipping the linear self-attention with relative position encoding. Finally, we evaluate the enhanced transformer with rotary position embedding, also called RoFormer, on various long text classification benchmark datasets. Our experiments show that it consistently overcomes its alternatives. Furthermore, we provide a theoretical analysis to explain some experimental results. RoFormer is already integrated into Huggingface: \url{https://huggingface.co/docs/transformers/model_doc/roformer}.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 18 method 8 baseline 1 dataset 1

citation-polarity summary

background 13 use method 8 unclear 4 baseline 1 support 1 use dataset 1

claims ledger

abstract Position encoding recently has shown effective in the transformer architecture. It enables valuable supervision for dependency modeling between elements at different positions of the sequence. In this paper, we first investigate various methods to integrate positional information into the learning process of transformer-based language models. Then, we propose a novel method named Rotary Position Embedding(RoPE) to effectively leverage the positional information. Specifically, the proposed RoPE encodes the absolute position with a rotation matrix and meanwhile incorporates the explicit relative

authors

Ahmed Murtadha Bo Wen Jianlin Su Shengfeng Pan Yu Lu Yunfeng Liu

co-cited works

representative citing papers

A Unified Framework for Critical Scaling of Inverse Temperature in Self-Attention

stat.ML · 2026-05-12 · unverdicted · novelty 8.0

The upper-tail accumulation scale derived from the gap-counting function N_n sets the critical inverse temperature for softmax attention concentration, unifying prior conflicting laws as special cases of different N_n.

RULER: What's the Real Context Size of Your Long-Context Language Models?

cs.CL · 2024-04-09 · accept · novelty 8.0

RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

cs.LG · 2023-12-01 · unverdicted · novelty 8.0

Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.

Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution

stat.ML · 2023-10-25 · unverdicted · novelty 8.0

Score entropy loss enables discrete diffusion models (SEDD) that cut perplexity 25-75% versus prior diffusion methods and outperform GPT-2 on language modeling while supporting infilling and compute-quality tradeoffs.

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

cs.CL · 2023-04-03 · accept · novelty 8.0

Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.

Recognizing Co-Speech Gestures in-the-Wild

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

Introduces the first large-scale GRW dataset for semantic co-speech gesture classification, word recognition, and temporal localization in unconstrained videos, along with benchmarks for the three tasks.

Cognitive Fatigue in Autoregressive Transformers: Formalization and Measurement

cs.CL · 2026-05-29 · unverdicted · novelty 7.0

Autoregressive transformers exhibit measurable cognitive fatigue during extended generation, quantified by the Fatigue Index that predicts degradation (AUROC 0.95) and repetition (rho 0.94).

Tensor Cache: Eviction-conditioned Associative Memory for Transformers

cs.LG · 2026-05-21 · unverdicted · novelty 7.0

Tensor Cache augments sliding-window attention with an eviction-fed outer-product associative memory and a training correction to improve long-context performance under bounded memory.

POST: Prior-Observation Adversarial Learning of Spatio-Temporal Associations for Multivariate Time Series Anomaly Detection

cs.AI · 2026-05-18 · unverdicted · novelty 7.0

POST uses prior-observation adversarial learning on adjacency matrices to reduce spatial over-generalization in graph-based multivariate time series anomaly detection and achieves new SOTA results on detection and channel-wise localization.

CRePE: Curved Ray Expectation Positional Encoding for Unified-Camera-Controlled Video Generation

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

CRePE supplies depth-aware positional distributions along curved rays for stable unified-camera control in frozen video DiT models.

Runtime Monitoring of Perception-Based Autonomous Systems via Embedding Temporal Logic

cs.LG · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Embedding Temporal Logic (ETL) performs runtime monitoring directly in learned embedding spaces using distance-based predicates composed with temporal operators, supported by conformal calibration for reliable predicate evaluation.

From Syntax to Semantics: Unveiling the Emergence of Chirality in SMILES Translation Models

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Chirality emerges in SMILES translation models through an abrupt encoder-centered reorganization of representations after a long plateau, identified via checkpoint analysis and ablation.

Cosine-Gated Adam-Decay: Drop-In Staleness-Aware Outer Optimization for Decoupled DiLoCo

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

CGAD is a staleness-aware Adam variant for DiLoCo that gates gradients with cosine and exponential decay, proves a convergence bound independent of maximum delay, and demonstrates stable pretraining of 25M to 7B parameter Llama-style models across controlled delays.

Jordan-RoPE: Non-Semisimple Relative Positional Encoding via Complex Jordan Blocks

cs.LG · 2026-05-05 · unverdicted · novelty 7.0 · 2 refs

Jordan-RoPE realizes a distance-modulated phase basis via non-semisimple Jordan blocks, generating features such as d e^{iωd} for relative positional encoding.

Graph Transformers and Stabilized Reinforcement Learning for Large-Scale Dynamic Routing Modulation and Spectrum Allocation in Elastic Optical Networks

cs.NI · 2026-05-03 · unverdicted · novelty 7.0 · 2 refs

A graph transformer with RL stabilizations is the first to exceed benchmarks for dynamic RMSA, supporting up to 13% more traffic load on networks up to 143 nodes.

Homogeneous Stellar Parameters from Heterogeneous Spectra with Deep Learning

astro-ph.GA · 2026-04-28 · unverdicted · novelty 7.0

A single end-to-end Transformer model unifies stellar labels from heterogeneous spectroscopic surveys into a self-consistent scale without post-hoc recalibration.

Attention Is Not All You Need for Diffraction

cond-mat.mtrl-sci · 2026-04-26 · unverdicted · novelty 7.0

Physics-informed transformer with sin^2(theta) encoding, physics-aware positional encoding, multi-task decoder, and three-stage curriculum classifies powder diffraction into 99 extinction groups, with structured errors on symmetry subgroup hierarchy.

Video Analysis and Generation via a Semantic Progress Function

cs.CV · 2026-04-24 · unverdicted · novelty 7.0

A Semantic Progress Function is defined as a 1D curve of cumulative semantic shifts from frame embeddings, supporting a linearization procedure that retimes video sequences for constant-rate semantic evolution.

WildSplatter: Feed-forward 3D Gaussian Splatting with Appearance Control from Unconstrained Images

cs.CV · 2026-04-23 · unverdicted · novelty 7.0

WildSplatter jointly learns 3D Gaussians and appearance embeddings from unconstrained photo collections to enable fast feed-forward reconstruction and flexible lighting control in 3D Gaussian Splatting.

Masked-Token Prediction for Anomaly Detection at the Large Hadron Collider

hep-ph · 2026-04-22 · unverdicted · novelty 7.0

The work demonstrates masked-token prediction with transformers for model-independent anomaly detection in LHC data, achieving strong results on top-rich BSM signatures like four-top production using VQ-VAE tokenization.

When Flat Minima Fail: Characterizing INT4 Quantization Collapse After FP32 Convergence

cs.LG · 2026-04-16 · conditional · novelty 7.0

FP32-converged language models enter a post-convergence phase where INT4 quantization error explodes while FP32 perplexity remains stable, with onset tied to fine convergence rather than learning rate decay.

Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings

q-bio.QM · 2026-04-09 · unverdicted · novelty 7.0

Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and showing strong masked language modeling results with or without positional embeddings.

Detection Without Correction: A Robust Asymmetry in Activation-Based Hallucination Probing

cs.CL · 2026-03-20 · conditional · novelty 7.0

Activation probes detect hallucinations pre-generation in large LLMs but cannot correct them via steering, with output confidence outperforming on accuracy.

AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models

cs.RO · 2026-03-10 · unverdicted · novelty 7.0

AR-VLA introduces a standalone autoregressive action expert with long-lived memory that generates context-aware continuous actions for VLAs, replacing chunk-based heads with smoother trajectories and maintained task success.

citing papers explorer

Showing 39 of 139 citing papers.

When Does Removing LayerNorm Help? Activation Bounding as a Regime-Dependent Implicit Regularizer cs.LG · 2026-04-25 · unverdicted · none · ref 26 · internal anchor
DyT improves validation loss 27% at 64M params/1M tokens but worsens it 19% at 118M tokens, with saturation levels predicting the sign of the effect.
Working Memory Constraints Scaffold Learning in Transformers under Data Scarcity cs.CL · 2026-04-22 · unverdicted · none · ref 41 · internal anchor
Fixed-width and decay-based attention mechanisms inspired by working memory improve Transformer grammatical accuracy and human alignment under limited training data.
Sessa: Selective State Space Attention cs.LG · 2026-04-20 · unverdicted · none · ref 33 · internal anchor
Sessa integrates attention within recurrent paths to achieve power-law memory tails and flexible non-decaying selective retrieval, outperforming baselines on long-context tasks.
HieraSparse: Hierarchical Semi-Structured Sparse KV Attention cs.DC · 2026-04-18 · unverdicted · none · ref 13 · internal anchor
HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, plus up to 1.85x prefill speedup and 1.37x/1.77x speedups with magnitude pruning and
Woosh: A Sound Effects Foundation Model cs.SD · 2026-04-02 · accept · none · ref 32 · internal anchor
Woosh is a new publicly released foundation model optimized for high-quality sound effect generation from text or video, showing competitive or better results than open alternatives like Stable Audio Open.
CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models cs.CV · 2026-01-29 · unverdicted · none · ref 54 · internal anchor
CG-MLLM is a multimodal LLM using a Mixture-of-Transformer architecture with separate TokenAR and BlockAR components integrated with a pre-trained vision-language backbone and 3D VAE to enable 3D captioning and high-fidelity generation.
Mathematical Foundations of Polyphonic Music Generation via Structural Inductive Bias cs.LG · 2026-01-07 · unverdicted · none · ref 64 · internal anchor
Smart Embedding reduces parameters by 48.3 percent in polyphonic music models with information-theoretic loss bounds under 0.153 bits and tighter generalization via Rademacher complexity.
NVIDIA Nemotron 3: Efficient and Open Intelligence cs.CL · 2025-12-24 · unverdicted · none · ref 126 · internal anchor
NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.
Video Parallel Scaling: Aggregating Diverse Frame Subsets for VideoLLMs cs.CV · 2025-09-09 · unverdicted · none · ref 16 · internal anchor
Video Parallel Scaling improves VideoLLM performance by aggregating outputs from parallel inferences on complementary disjoint frame subsets, effectively contracting the Chinchilla scaling law via uncorrelated visual evidence.
ShadowNPU: System and Algorithm Co-design for NPU-Centric On-Device LLM Inference cs.PF · 2025-08-22 · unverdicted · none · ref 54 · internal anchor
ShadowNPU presents shadowAttn, a co-designed sparse attention system that uses NPU pilot compute and techniques like graph bucketing and per-head sparsity to minimize CPU/GPU fallback during on-device LLM inference while maintaining accuracy.
Matrix-game 2.0: An open-source real-time and streaming interactive world model cs.CV · 2025-08-18 · unverdicted · none · ref 39 · internal anchor
Matrix-Game 2.0 introduces a scalable data pipeline, action-injection module, and few-step distillation to enable real-time streaming video generation at 25 FPS from game-engine interactions, with open-sourced weights and code.
Lightweight Domain Adaptation of a Large Language Model for Legal Assistance in the Indian Context cs.CL · 2025-05-28 · unverdicted · none · ref 17 · internal anchor
An 8B Llama model with RAG and prompt engineering scores 60.08% on the All-India Bar Examination, slightly above GPT-3.5 Turbo while claiming 22 times greater parameter efficiency via a new PEI metric.
Revisiting Sentiment Analysis for Software Engineering in the Era of Large Language Models cs.SE · 2023-10-17 · unverdicted · none · ref 57 · internal anchor
bLLMs achieve state-of-the-art results on limited and imbalanced SE sentiment datasets even in zero-shot settings, but fine-tuned sLLMs outperform when ample balanced training data is available.
Continuous diffusion for categorical data cs.CL · 2022-11-28 · unverdicted · none · ref 86 · internal anchor
The paper proposes CDCD, a continuous-time and continuous-space diffusion framework for categorical data, and reports results on language modeling tasks.
BARFI-Q: Quantum-Enhanced Block Attention Residual Fusion Framework for Multivariate Time-Series Forecasting in Atom Interferometry quant-ph · 2026-05-06 · unverdicted · none · ref 36 · internal anchor
BARFI-Q integrates patch-based embedding, dual-branch temporal modeling, hierarchical fusion, adaptive block-attention residuals, and quantum feature mapping to forecast atom interferometry time-series, outperforming baselines while representing targets in circular sine-cosine space.
Fall Risk and Gait Analysis in Community-Dwelling Older Adults using World-Spaced 3D Human Mesh Recovery cs.CV · 2026-04-13 · unverdicted · none · ref 56 · internal anchor
Video-based 3D mesh recovery extracts gait parameters that correlate with sensor measurements and are associated with higher fall risk in older adults.
Improving Local Feature Matching by Entropy-inspired Scale Adaptability and Flow-endowed Local Consistency cs.CV · 2026-04-08 · unverdicted · none · ref 53 · internal anchor
A semi-dense image matching pipeline adds scale adaptability via score-matrix hints at the coarse stage and local flow consistency via gradient loss at the fine stage.
Ministral 3 cs.CL · 2026-01-13 · unverdicted · none · ref 25 · internal anchor
Ministral 3 releases 3B/8B/14B parameter-efficient language models with base, instruction, and reasoning variants derived via iterative pruning and distillation, including image understanding capabilities.
Multi-Model Synthetic Training for Mission-Critical Small Language Models cs.CL · 2025-09-16 · unverdicted · none · ref 14 · internal anchor
Fine-tunes Qwen2.5-7B on 21,543 synthetic maritime Q&A pairs generated from 3.2B AIS records by GPT-4o and o3-mini, reaching 75% accuracy at 261x lower inference cost than larger models.
Phi-4-reasoning Technical Report cs.AI · 2025-04-30 · unverdicted · none · ref 53 · internal anchor
A 14B reasoning model trained via supervised fine-tuning on selected prompts and o3-mini traces, plus outcome RL, outperforms larger open models like DeepSeek-R1-Distill-Llama-70B on math, coding, planning and related benchmarks.
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model cs.CV · 2025-02-14 · unverdicted · none · ref 21 · internal anchor
Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.
Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey cs.LG · 2024-03-21 · accept · none · ref 10 · internal anchor
A comprehensive survey of PEFT algorithms for large models, covering their performance, overhead, applications, and real-world system implementations.
Gemma: Open Models Based on Gemini Research and Technology cs.CL · 2024-03-13 · accept · none · ref 97 · internal anchor
Gemma introduces open 2B and 7B LLMs derived from Gemini technology that beat comparable open models on 11 of 18 text tasks and come with safety assessments.
Yi: Open Foundation Models by 01.AI cs.CL · 2024-03-07 · unverdicted · none · ref 75 · internal anchor
Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.
TinyLlama: An Open-Source Small Language Model cs.CL · 2024-01-04 · accept · none · ref 33 · internal anchor
TinyLlama is a 1.1B-parameter open-source language model pretrained on 1 trillion tokens that outperforms other open-source models of similar size on downstream tasks.
Baichuan 2: Open Large-scale Language Models cs.CL · 2023-09-19 · unverdicted · none · ref 63 · internal anchor
Baichuan 2 presents 7B and 13B LLMs trained on 2.6T tokens that match or exceed similar open models on MMLU, CMMLU, GSM8K, HumanEval and excel in medicine and law.
CosmicFish-HRM: Adaptive Reasoning via Hierarchical Recurrent Mechanisms in Compact Language Models cs.LG · 2026-05-27 · unverdicted · none · ref 24 · internal anchor
Presents CosmicFish-HRM, a compact LM using hierarchical recurrent reasoning to adapt computation depth per input.
LLMOrbit: A Circular Taxonomy of Large Language Models -From Scaling Walls to Agentic AI Systems cs.LG · 2026-01-20 · unverdicted · none · ref 147 · internal anchor
A survey taxonomy of LLMs identifies three scaling crises and six efficiency paradigms while tracing the shift from generation to tool-using agents.
Gemma 2: Improving Open Language Models at a Practical Size cs.CL · 2024-07-31 · conditional · none · ref 108 · internal anchor
Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.
ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools cs.CL · 2024-06-18 · unverdicted · none · ref 38 · internal anchor
GLM-4 models rival or exceed GPT-4 on MMLU, GSM8K, MATH, BBH, GPQA, HumanEval, IFEval, long-context tasks, and Chinese alignment while adding autonomous tool use for web, code, and image generation.
Large Language Models: A Survey cs.CL · 2024-02-09 · accept · none · ref 127 · internal anchor
The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
A Survey of Large Language Models cs.CL · 2023-03-31 · accept · none · ref 287 · internal anchor
This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.
EXAONE 4.5 Technical Report cs.CL · 2026-04-09 · unverdicted · none · ref 43 · internal anchor
EXAONE 4.5 is a new open-weight multimodal model that matches general benchmarks and outperforms similar-scale models on document understanding and Korean contextual reasoning.
Training LLMs on HPC Systems: Best Practices from the OpenGPT-X Project cs.DC · 2025-04-14 · unverdicted · none · ref 29 · internal anchor
Engineering report detailing HPC infrastructure, software choices, and performance measurements for training a 7B LLM using 3D parallelism on JUWELS Booster.
A Comprehensive Overview of Large Language Models cs.CL · 2023-07-12 · unverdicted · none · ref 66 · internal anchor
A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.
MT-OSC: Path for LLMs that Get Lost in Multi-Turn Conversation cs.CL · 2026-04-09 · unreviewed · ref 8 · internal anchor
Selective Rotary Position Embedding cs.CL · 2025-11-21 · unreviewed · ref 58 · internal anchor
Robust Filter Attention: Self-Attention as Precision-Weighted State Estimation cs.LG · 2025-09-04 · unreviewed · ref 86 · internal anchor
Lessons from the Trenches on Reproducible Evaluation of Language Models cs.CL · 2024-05-23 · unreviewed · ref 42 · internal anchor

RoFormer: Enhanced Transformer with Rotary Position Embedding

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer