FSN achieves lower validation loss (1.5953) than a RoPE-SwiGLU transformer (1.611) on character-level tasks at 1M parameters by implementing next-token prediction as synchronization frustrated by data transitions.
super hub Mixed citations
RoFormer: Enhanced Transformer with Rotary Position Embedding
Mixed citation behavior. Most common role is background (46%).
abstract
Position encoding recently has shown effective in the transformer architecture. It enables valuable supervision for dependency modeling between elements at different positions of the sequence. In this paper, we first investigate various methods to integrate positional information into the learning process of transformer-based language models. Then, we propose a novel method named Rotary Position Embedding(RoPE) to effectively leverage the positional information. Specifically, the proposed RoPE encodes the absolute position with a rotation matrix and meanwhile incorporates the explicit relative position dependency in self-attention formulation. Notably, RoPE enables valuable properties, including the flexibility of sequence length, decaying inter-token dependency with increasing relative distances, and the capability of equipping the linear self-attention with relative position encoding. Finally, we evaluate the enhanced transformer with rotary position embedding, also called RoFormer, on various long text classification benchmark datasets. Our experiments show that it consistently overcomes its alternatives. Furthermore, we provide a theoretical analysis to explain some experimental results. RoFormer is already integrated into Huggingface: \url{https://huggingface.co/docs/transformers/model_doc/roformer}.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract Position encoding recently has shown effective in the transformer architecture. It enables valuable supervision for dependency modeling between elements at different positions of the sequence. In this paper, we first investigate various methods to integrate positional information into the learning process of transformer-based language models. Then, we propose a novel method named Rotary Position Embedding(RoPE) to effectively leverage the positional information. Specifically, the proposed RoPE encodes the absolute position with a rotation matrix and meanwhile incorporates the explicit relative
authors
co-cited works
representative citing papers
The upper-tail accumulation scale derived from the gap-counting function N_n sets the critical inverse temperature for softmax attention concentration, unifying prior conflicting laws as special cases of different N_n.
RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.
Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.
Score entropy loss enables discrete diffusion models (SEDD) that cut perplexity 25-75% versus prior diffusion methods and outperform GPT-2 on language modeling while supporting infilling and compute-quality tradeoffs.
Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.
HCMS partitions multi-head attention into chunks and pipelines them across dual CUDA streams to overlap communication and computation, delivering 10-17.5% speedup over Ulysses for 31K-56K token sequences.
Set diffusion factorizes likelihood over arbitrary token sets and uses a set-causal diffusion architecture to support KV caching and any-order decoding, yielding improved speed-quality tradeoffs versus prior diffusion LMs.
RayPE extends video DiT attention with Plucker coordinates and a gated reciprocal-product term to improve 3D consistency and camera controllability.
QCFuse achieves full-prefill quality in RAG with 1.7x average prefill speedup over full prefill and 1.5x over ProphetKV via compressed query-aware cache fusion.
MimeLens uses position-agnostic BERT encoders pretrained on random-offset binary windows to output one of 125 libmagic MIME labels, beating Magika on full files and enabling accurate classification on mid-file fragments.
Orli is an autoregressive image-to-sequence model that jointly detects text lines and determines their reading order on historical documents via chord-frame baselines, trained on 196k pages across ten scripts.
MoPE replaces fixed sinusoidal or rotary positional encodings with per-dimension learned Morlet wavelets that recover prior methods as limits and add a Gaussian locality kernel, yielding a 0.119 gain on TinyShakespeare when paired with energy-gated attention.
Introduces the first large-scale GRW dataset for semantic co-speech gesture classification, word recognition, and temporal localization in unconstrained videos, along with benchmarks for the three tasks.
Autoregressive transformers exhibit measurable cognitive fatigue during extended generation, quantified by the Fatigue Index that predicts degradation (AUROC 0.95) and repetition (rho 0.94).
LaneRoPE adds an inter-sequence attention mask and extended RoPE to enable collaborative parallel sequence generation in LLMs, yielding accuracy gains on math reasoning under length limits.
Transformers trained from different random seeds exhibit residual-stream polymorphism that is exactly a uniform random rotation, which a Procrustes alignment removes to transfer SAEs and steering vectors.
Tensor Cache augments sliding-window attention with an eviction-fed outer-product associative memory and a training correction to improve long-context performance under bounded memory.
POST uses prior-observation adversarial learning on adjacency matrices to reduce spatial over-generalization in graph-based multivariate time series anomaly detection and achieves new SOTA results on detection and channel-wise localization.
MemLens benchmark shows long-context LVLMs lose accuracy with length while memory agents lose visual fidelity, with multi-session reasoning below 30% for most systems and neither approach solving the task alone.
CRePE supplies depth-aware positional distributions along curved rays for stable unified-camera control in frozen video DiT models.
Embedding Temporal Logic (ETL) performs runtime monitoring directly in learned embedding spaces using distance-based predicates composed with temporal operators, supported by conformal calibration for reliable predicate evaluation.
Chirality emerges in SMILES translation models through an abrupt encoder-centered reorganization of representations after a long plateau, identified via checkpoint analysis and ablation.
CGAD is a staleness-aware Adam variant for DiLoCo that gates gradients with cosine and exponential decay, proves a convergence bound independent of maximum delay, and demonstrates stable pretraining of 25M to 7B parameter Llama-style models across controlled delays.
citing papers explorer
-
SAM 3: Segment Anything with Concepts
SAM 3 introduces promptable concept segmentation that doubles accuracy of prior systems on images and videos while improving standard SAM segmentation performance.
-
GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.
-
Learning Zero-Shot Subject-Driven Video Generation Using 1% Compute
A zero-shot subject-driven video generation framework that decomposes the task into identity injection from 200K subject-image pairs and motion preservation from 4K arbitrary videos, trained in 288 A100 GPU hours on CogVideoX-5B to match prior performance at 1% compute.
-
Video Parallel Scaling: Aggregating Diverse Frame Subsets for VideoLLMs
Video Parallel Scaling improves VideoLLM performance by aggregating outputs from parallel inferences on complementary disjoint frame subsets, effectively contracting the Chinchilla scaling law via uncorrelated visual evidence.
-
Matrix-game 2.0: An open-source real-time and streaming interactive world model
Matrix-Game 2.0 introduces a scalable data pipeline, action-injection module, and few-step distillation to enable real-time streaming video generation at 25 FPS from game-engine interactions, with open-sourced weights and code.
-
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.