FSN achieves lower validation loss (1.5953) than a RoPE-SwiGLU transformer (1.611) on character-level tasks at 1M parameters by implementing next-token prediction as synchronization frustrated by data transitions.
super hub Mixed citations
RoFormer: Enhanced Transformer with Rotary Position Embedding
Mixed citation behavior. Most common role is background (46%).
abstract
Position encoding recently has shown effective in the transformer architecture. It enables valuable supervision for dependency modeling between elements at different positions of the sequence. In this paper, we first investigate various methods to integrate positional information into the learning process of transformer-based language models. Then, we propose a novel method named Rotary Position Embedding(RoPE) to effectively leverage the positional information. Specifically, the proposed RoPE encodes the absolute position with a rotation matrix and meanwhile incorporates the explicit relative position dependency in self-attention formulation. Notably, RoPE enables valuable properties, including the flexibility of sequence length, decaying inter-token dependency with increasing relative distances, and the capability of equipping the linear self-attention with relative position encoding. Finally, we evaluate the enhanced transformer with rotary position embedding, also called RoFormer, on various long text classification benchmark datasets. Our experiments show that it consistently overcomes its alternatives. Furthermore, we provide a theoretical analysis to explain some experimental results. RoFormer is already integrated into Huggingface: \url{https://huggingface.co/docs/transformers/model_doc/roformer}.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract Position encoding recently has shown effective in the transformer architecture. It enables valuable supervision for dependency modeling between elements at different positions of the sequence. In this paper, we first investigate various methods to integrate positional information into the learning process of transformer-based language models. Then, we propose a novel method named Rotary Position Embedding(RoPE) to effectively leverage the positional information. Specifically, the proposed RoPE encodes the absolute position with a rotation matrix and meanwhile incorporates the explicit relative
authors
co-cited works
representative citing papers
The upper-tail accumulation scale derived from the gap-counting function N_n sets the critical inverse temperature for softmax attention concentration, unifying prior conflicting laws as special cases of different N_n.
RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.
Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.
Score entropy loss enables discrete diffusion models (SEDD) that cut perplexity 25-75% versus prior diffusion methods and outperform GPT-2 on language modeling while supporting infilling and compute-quality tradeoffs.
Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.
HCMS partitions multi-head attention into chunks and pipelines them across dual CUDA streams to overlap communication and computation, delivering 10-17.5% speedup over Ulysses for 31K-56K token sequences.
Set diffusion factorizes likelihood over arbitrary token sets and uses a set-causal diffusion architecture to support KV caching and any-order decoding, yielding improved speed-quality tradeoffs versus prior diffusion LMs.
RayPE extends video DiT attention with Plucker coordinates and a gated reciprocal-product term to improve 3D consistency and camera controllability.
QCFuse achieves full-prefill quality in RAG with 1.7x average prefill speedup over full prefill and 1.5x over ProphetKV via compressed query-aware cache fusion.
MimeLens uses position-agnostic BERT encoders pretrained on random-offset binary windows to output one of 125 libmagic MIME labels, beating Magika on full files and enabling accurate classification on mid-file fragments.
Orli is an autoregressive image-to-sequence model that jointly detects text lines and determines their reading order on historical documents via chord-frame baselines, trained on 196k pages across ten scripts.
MoPE replaces fixed sinusoidal or rotary positional encodings with per-dimension learned Morlet wavelets that recover prior methods as limits and add a Gaussian locality kernel, yielding a 0.119 gain on TinyShakespeare when paired with energy-gated attention.
Introduces the first large-scale GRW dataset for semantic co-speech gesture classification, word recognition, and temporal localization in unconstrained videos, along with benchmarks for the three tasks.
Autoregressive transformers exhibit measurable cognitive fatigue during extended generation, quantified by the Fatigue Index that predicts degradation (AUROC 0.95) and repetition (rho 0.94).
LaneRoPE adds an inter-sequence attention mask and extended RoPE to enable collaborative parallel sequence generation in LLMs, yielding accuracy gains on math reasoning under length limits.
Transformers trained from different random seeds exhibit residual-stream polymorphism that is exactly a uniform random rotation, which a Procrustes alignment removes to transfer SAEs and steering vectors.
Tensor Cache augments sliding-window attention with an eviction-fed outer-product associative memory and a training correction to improve long-context performance under bounded memory.
POST uses prior-observation adversarial learning on adjacency matrices to reduce spatial over-generalization in graph-based multivariate time series anomaly detection and achieves new SOTA results on detection and channel-wise localization.
MemLens benchmark shows long-context LVLMs lose accuracy with length while memory agents lose visual fidelity, with multi-session reasoning below 30% for most systems and neither approach solving the task alone.
CRePE supplies depth-aware positional distributions along curved rays for stable unified-camera control in frozen video DiT models.
Embedding Temporal Logic (ETL) performs runtime monitoring directly in learned embedding spaces using distance-based predicates composed with temporal operators, supported by conformal calibration for reliable predicate evaluation.
Chirality emerges in SMILES translation models through an abrupt encoder-centered reorganization of representations after a long plateau, identified via checkpoint analysis and ablation.
CGAD is a staleness-aware Adam variant for DiLoCo that gates gradients with cosine and exponential decay, proves a convergence bound independent of maximum delay, and demonstrates stable pretraining of 25M to 7B parameter Llama-style models across controlled delays.
citing papers explorer
-
Attention as Frustrated Synchronization
FSN achieves lower validation loss (1.5953) than a RoPE-SwiGLU transformer (1.611) on character-level tasks at 1M parameters by implementing next-token prediction as synchronization frustrated by data transitions.
-
A Unified Framework for Critical Scaling of Inverse Temperature in Self-Attention
The upper-tail accumulation scale derived from the gap-counting function N_n sets the critical inverse temperature for softmax attention concentration, unifying prior conflicting laws as special cases of different N_n.
-
HCMS: Head-Chunked Multi-Stream Pipeline for Communication-Computation Overlap in Long-Sequence Parallel Attention
HCMS partitions multi-head attention into chunks and pipelines them across dual CUDA streams to overlap communication and computation, delivering 10-17.5% speedup over Ulysses for 31K-56K token sequences.
-
Set Diffusion: Interpolating Token Orderings Between Autoregression and Diffusion for Fast and Flexible Decoding
Set diffusion factorizes likelihood over arbitrary token sets and uses a set-causal diffusion architecture to support KV caching and any-order decoding, yielding improved speed-quality tradeoffs versus prior diffusion LMs.
-
RayPE: Ray-Space Positional Encoding for 3D-Aware Video Generation
RayPE extends video DiT attention with Plucker coordinates and a gated reciprocal-product term to improve 3D consistency and camera controllability.
-
QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving
QCFuse achieves full-prefill quality in RAG with 1.7x average prefill speedup over full prefill and 1.5x over ProphetKV via compressed query-aware cache fusion.
-
MimeLens: Position-Agnostic Content-Type Detection for Binary Fragments
MimeLens uses position-agnostic BERT encoders pretrained on random-offset binary windows to output one of 125 libmagic MIME labels, beating Magika on full files and enabling accurate classification on mid-file fragments.
-
End-to-End Text Line Detection and Ordering
Orli is an autoregressive image-to-sequence model that jointly detects text lines and determines their reading order on historical documents via chord-frame baselines, trained on 196k pages across ten scripts.
-
Beyond Sinusoids: A Morlet Wavelet Framework for Transformer Positional Encoding
MoPE replaces fixed sinusoidal or rotary positional encodings with per-dimension learned Morlet wavelets that recover prior methods as limits and add a Gaussian locality kernel, yielding a 0.119 gain on TinyShakespeare when paired with energy-gated attention.
-
Recognizing Co-Speech Gestures in-the-Wild
Introduces the first large-scale GRW dataset for semantic co-speech gesture classification, word recognition, and temporal localization in unconstrained videos, along with benchmarks for the three tasks.
-
Cognitive Fatigue in Autoregressive Transformers: Formalization and Measurement
Autoregressive transformers exhibit measurable cognitive fatigue during extended generation, quantified by the Fatigue Index that predicts degradation (AUROC 0.95) and repetition (rho 0.94).
-
LaneRoPE: Positional Encoding for Collaborative Parallel Reasoning and Generation
LaneRoPE adds an inter-sequence attention mask and extended RoPE to enable collaborative parallel sequence generation in LLMs, yielding accuracy gains on math reasoning under length limits.
-
Polymorphism Is Rotation: Operational Mechanistic Interpretability from a Two-Layer Transformer to Pythia-70m
Transformers trained from different random seeds exhibit residual-stream polymorphism that is exactly a uniform random rotation, which a Procrustes alignment removes to transfer SAEs and steering vectors.
-
Tensor Cache: Eviction-conditioned Associative Memory for Transformers
Tensor Cache augments sliding-window attention with an eviction-fed outer-product associative memory and a training correction to improve long-context performance under bounded memory.
-
POST: Prior-Observation Adversarial Learning of Spatio-Temporal Associations for Multivariate Time Series Anomaly Detection
POST uses prior-observation adversarial learning on adjacency matrices to reduce spatial over-generalization in graph-based multivariate time series anomaly detection and achieves new SOTA results on detection and channel-wise localization.
-
MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models
MemLens benchmark shows long-context LVLMs lose accuracy with length while memory agents lose visual fidelity, with multi-session reasoning below 30% for most systems and neither approach solving the task alone.
-
CRePE: Curved Ray Expectation Positional Encoding for Unified-Camera-Controlled Video Generation
CRePE supplies depth-aware positional distributions along curved rays for stable unified-camera control in frozen video DiT models.
-
Runtime Monitoring of Perception-Based Autonomous Systems via Embedding Temporal Logic
Embedding Temporal Logic (ETL) performs runtime monitoring directly in learned embedding spaces using distance-based predicates composed with temporal operators, supported by conformal calibration for reliable predicate evaluation.
-
From Syntax to Semantics: Unveiling the Emergence of Chirality in SMILES Translation Models
Chirality emerges in SMILES translation models through an abrupt encoder-centered reorganization of representations after a long plateau, identified via checkpoint analysis and ablation.
-
Cosine-Gated Adam-Decay: Drop-In Staleness-Aware Outer Optimization for Decoupled DiLoCo
CGAD is a staleness-aware Adam variant for DiLoCo that gates gradients with cosine and exponential decay, proves a convergence bound independent of maximum delay, and demonstrates stable pretraining of 25M to 7B parameter Llama-style models across controlled delays.
-
Jordan-RoPE: Non-Semisimple Relative Positional Encoding via Complex Jordan Blocks
Jordan-RoPE realizes a distance-modulated phase basis via non-semisimple Jordan blocks, generating features such as d e^{iωd} for relative positional encoding.
-
Graph Transformers and Stabilized Reinforcement Learning for Large-Scale Dynamic Routing Modulation and Spectrum Allocation in Elastic Optical Networks
A graph transformer with RL stabilizations is the first to exceed benchmarks for dynamic RMSA, supporting up to 13% more traffic load on networks up to 143 nodes.
-
Homogeneous Stellar Parameters from Heterogeneous Spectra with Deep Learning
A single end-to-end Transformer model unifies stellar labels from heterogeneous spectroscopic surveys into a self-consistent scale without post-hoc recalibration.
-
Attention Is Not All You Need for Diffraction
Physics-informed transformer with sin^2(theta) encoding, physics-aware positional encoding, multi-task decoder, and three-stage curriculum classifies powder diffraction into 99 extinction groups, with structured errors on symmetry subgroup hierarchy.
-
Video Analysis and Generation via a Semantic Progress Function
A Semantic Progress Function is defined as a 1D curve of cumulative semantic shifts from frame embeddings, supporting a linearization procedure that retimes video sequences for constant-rate semantic evolution.
-
WildSplatter: Feed-forward 3D Gaussian Splatting with Appearance Control from Unconstrained Images
WildSplatter jointly learns 3D Gaussians and appearance embeddings from unconstrained photo collections to enable fast feed-forward reconstruction and flexible lighting control in 3D Gaussian Splatting.
-
Masked-Token Prediction for Anomaly Detection at the Large Hadron Collider
The work demonstrates masked-token prediction with transformers for model-independent anomaly detection in LHC data, achieving strong results on top-rich BSM signatures like four-top production using VQ-VAE tokenization.
-
When Flat Minima Fail: Characterizing INT4 Quantization Collapse After FP32 Convergence
FP32-converged language models enter a post-convergence phase where INT4 quantization error explodes while FP32 perplexity remains stable, with onset tied to fine convergence rather than learning rate decay.
-
Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings
Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and showing strong masked language modeling results with or without positional embeddings.
-
Detection Without Correction: A Robust Asymmetry in Activation-Based Hallucination Probing
Activation probes detect hallucinations pre-generation in large LLMs but cannot correct them via steering, with output confidence outperforming on accuracy.
-
AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models
AR-VLA introduces a standalone autoregressive action expert with long-lived memory that generates context-aware continuous actions for VLAs, replacing chunk-based heads with smoother trajectories and maintained task success.
-
Visual Para-Thinker: Divide-and-Conquer Reasoning for Visual Comprehension
Visual Para-Thinker is the first parallel reasoning framework for MLLMs that uses visual partitioning strategies, Pa-Attention, and LPRoPE to extend test-time scaling benefits to visual comprehension tasks.
-
PixGS: Pixel-Space Diffusion for Direct 3D Gaussian Splat Generation
A single-stage pixel-space diffusion model for direct 3D Gaussian Splat generation that bypasses latent compression and adds geometric supervisions to outperform prior multi-stage methods.
-
Domain-Informed Multi-View Self-Distillation for Astronomical Light-Curve Representation Learning with JEPA
A JEPA-based model with domain-informed multi-view self-distillation learns light-curve representations that outperform hand-crafted features on 15 of 16 StarEmbed metrics and adapts competitively to other irregular time-series datasets.
-
PORTER: Language-Grounded Event Representations for Portable Structured EHR Foundation Models
PORTER is a language-grounded EHR foundation model that uses text descriptions for events and a numeric pathway, matching fixed-vocabulary performance on 74 tasks while recovering 97.1% AUROC on unseen vocabularies and outperforming on MIMIC.
-
Morpheus: A Morphology-Aware Neural Tokenizer and Word Embedder for Turkish
Morpheus is a morphology-aware neural tokenizer and embedder for Turkish that achieves lossless reversible segmentation, higher morphological alignment, lower bits-per-character, and competitive root-centric embeddings compared with subword baselines.
-
MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction
Introduces a new task of goal-conditioned 3D point motion forecasting along with a 1.16M-video dataset, a 111-category benchmark, and a model that outperforms baselines while transferring to robotics and video generation.
-
Modality Forcing for Scalable Spatial Generation
Modality Forcing lets a single DiT produce image and depth outputs in any order after training on sparse real-world depth, with larger image-pretrained models yielding better depth accuracy and a 57% AbsRel reduction versus prior joint generative baselines.
-
LoRA-Muon: Spectral Steepest Descent on the Low-Rank Manifold
LoRA-Muon applies Muon's spectral steepest descent to low-rank factors with split weight decay, acting as a transferable proxy for full-rank Muon and Shampoo optimizers.
-
IDEAL: In-DEpth ALignment Makes A Discrete Representation AutoEncoder
IDEAL improves discrete representation autoencoders by jointly aligning quantized tokens with shallow and deep VFM features, reporting 0.61 rFID on ImageNet and 1.89 gFID for autoregressive image generation.
-
Tyan-WP: A Wind Power Foundation Model for Ultra-Short-Term Probabilistic Forecasting
Tyan-WP is a pretrained wind power foundation model that outperforms site-specific TSMs and generic LTSMs in zero-shot ultra-short-term probabilistic forecasting on U.S. and U.K. sites via static embeddings and PAMF module.
-
Inside the LLM Word Factory
Activation patching localizes English detokenization in Llama2-7B to a two-stage attention-then-MLP process at layer 1 that generalizes to 12 models from 8 families, with depth varying by positional encoding, plus an early-layer probe achieving 0.94-0.97 AUROC.
-
EgoPriMo: Egocentric Motion Generation for Interactive Humanoid Control
EgoPriMo learns a unified egocentric motion prior with a Triple-stream DiT model that supports reconstruction, generation, and forecasting of SMPL motions from egocentric views and text, outperforming prior methods and transferable to humanoid controllers.
-
Still: Amortized KV Cache Compaction in a Single Forward Pass
Still is an amortized per-layer Perceiver that synthesizes compact KV caches in one forward pass, outperforming selection and per-context baselines on RULER, HELMET, and LongBench at 8-200x compression.
-
Adaptive Tokenisation Via Temporal Redundancy Masking And Latent Inpainting
A parameter-free approach drops redundant video tokens via temporal L1 differences in frozen latent space and reconstructs them with LIT, yielding 31x speedup over ElasticTok-CV on TokenBench and DAVIS.
-
What's Under the Skin? Estimating Swine Body Condition
PigFormer estimates swine subcutaneous backfat thickness, loin muscle depth, and total tissue thickness from RGB-D depth frames, achieving 2.43 mm backfat MAE and 3.87 mm overall MAE on a 319-instance multi-site dataset while outperforming ResNet-18 and ViT-small baselines.
-
PJ-RoPE: A Fourier-Jet-Affine Position Space for Relative Attention
PJ-RoPE organizes relative-position mechanisms as a learnable Fourier-Jet-Affine space derived from lag-shift dynamics, extending RoPE and ALiBi with explicit jets and sector selection.
-
AlignAtt4LLM: Fast AlignAtt for Decoder-Only LLMs at IWSLT 2026 Simultaneous Speech Translation Task
AlignAtt4LLM adapts AlignAtt to decoder-only LLMs via prompt layout, head selection, and attention replay, outperforming IWSLT 2026 baselines for En-De and En-It at ~2s and <4s latency.
-
Dynamic Short Convolutions Improve Transformers
Dynamic short convolutions applied to key/query/value projections and linear layers in Transformers yield consistent performance gains and 1.33-1.60x compute advantages over standard models on language modeling from 150M to 2B parameters.
-
Do Transformers Need Three Projections? Systematic Study of QKV Variants
Q-K=V projection sharing in transformers matches standard QKV performance with 50% KV cache reduction and combines with GQA/MQA for up to 96.9% reduction across vision and language tasks.