hub Mixed citations

Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, Yunfeng Liu · 2024

Mixed citation behavior. Most common role is background (55%).

58 Pith papers citing it

Background 55% of classified citations

browse 58 citing papers

hub tools

JSON dossier citing papers JSON

citation-role summary

background 11 method 8 baseline 1

citation-polarity summary

background 11 use method 8 baseline 1

representative citing papers

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

StableHand: Quality-Aware Flow Matching for World-Space Dual-Hand Motion Estimation from Egocentric Video

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

StableHand introduces a quality-aware flow matching framework conditioned on predicted four-channel per-frame hand observation quality to estimate dual-hand world-space motion from egocentric video, achieving SOTA results with 20-25% W-MPJPE reduction on HOT3D and ARCTIC benchmarks.

Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation

cs.CV · 2026-05-15 · unverdicted · novelty 7.0

Echo-Forcing decouples stable anchors, compressed history, and recent dynamics in video diffusion KV caches using hierarchical memory, scene recall frames, and difference-aware decay to support interactive long video generation under bounded cache.

Pretraining Strategies and Scaling for ECG Foundation Models: A Systematic Study

eess.SP · 2026-05-12 · unverdicted · novelty 7.0

Contrastive predictive coding pretraining combined with structured state space models yields the strongest ECG foundation models, with continued gains from scaling data to 11 million samples.

ExtraVAR: Stage-Aware RoPE Remapping for Resolution Extrapolation in Visual Autoregressive Models

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

ExtraVAR enables resolution extrapolation in visual autoregressive models by stage-aware RoPE remapping and entropy-driven attention scaling, suppressing repetition and detail loss.

Every Feedforward Neural Network Definable in an o-Minimal Structure Has Finite Sample Complexity

stat.ML · 2026-05-08 · unverdicted · novelty 7.0

Every fixed finite feedforward neural network definable in an o-minimal structure has finite sample complexity in the agnostic PAC setting.

Consistent Geometric Deep Learning via Hilbert Bundles and Cellular Sheaves

cs.LG · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

HilbNets define convolutions via Hilbert bundle connection Laplacians, prove that sampled Hilbert cellular sheaf Laplacians converge to the continuous operator, and show that discretized networks are consistent and transferable across samplings.

Task Vector Geometry Underlies Dual Modes of Task Inference in Transformers

cs.LG · 2026-05-05 · unverdicted · novelty 7.0

In a controlled synthetic setting, transformers implement in-distribution task inference via convex combinations of task vectors and out-of-distribution inference via nearly orthogonal extrapolative representations.

AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics

cs.CV · 2026-05-05 · unverdicted · novelty 7.0 · 3 refs

AniMatrix generates anime videos by structuring artistic production rules into a controllable taxonomy and training the model to prioritize those rules over physical realism, achieving top scores from professional animators on prompt understanding and artistic motion.

Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation

cs.LG · 2026-04-11 · unverdicted · novelty 7.0

The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.

Screening Is Enough

cs.LG · 2026-04-01 · unverdicted · novelty 7.0

Multiscreen replaces softmax attention with screening to provide absolute query-key relevance, resulting in models with 30% fewer parameters that maintain stable performance at long contexts.

NEST: Nested Event Stream Transformer for Sequences of Multisets

cs.LG · 2026-01-31 · unverdicted · novelty 7.0

NEST is a nested transformer for sequences of multisets that uses masked set modeling to learn improved set-level representations from hierarchical event streams like EHRs.

In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer

cs.CV · 2025-04-29 · unverdicted · novelty 7.0

ICEdit achieves state-of-the-art instructional image editing in Diffusion Transformers via in-context generation, requiring only 0.1% of prior training data and 1% trainable parameters.

SEGA: Spectral-Energy Guided Attention for Resolution Extrapolation in Diffusion Transformers

cs.CV · 2026-05-21 · unverdicted · novelty 6.0

SEGA adaptively scales RoPE attention components using spectral-energy guidance from the latent to improve structural coherence and fine details in high-resolution DiT synthesis.

RiT: Vanilla Diffusion Transformers Suffice in Representation Space

cs.CV · 2026-05-21 · conditional · novelty 6.0

A vanilla Diffusion Transformer trained via x-prediction on frozen DINOv2 features reaches FID 1.14 on ImageNet 256x256 with fewer parameters and faster sampling than prior DiT variants.

TrajTok: Adaptive Spatial Tokenization for Trajectory Representation Learning

cs.LG · 2026-05-19 · unverdicted · novelty 6.0

TrajTok learns multi-resolution hexagonal spatial tokens from GPS data and pretrains a factorized transformer with ST-RoPE and masked modeling to yield frozen encoders that outperform task-specific methods on similarity, classification, and travel-time tasks in the Porto dataset.

Generative Recursive Reasoning

cs.AI · 2026-05-19 · unverdicted · novelty 6.0 · 2 refs

GRAM is a latent-variable generative model that performs recursive reasoning via stochastic trajectories, trained with amortized variational inference to support multi-hypothesis reasoning and unconditional generation.

PIXLRelight: Controllable Relighting via Intrinsic Conditioning

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

A transformer-based neural renderer that transfers arbitrary PBR lighting to single images via shared intrinsic conditioning extracted from both multi-illumination photos and path-traced coarse 3D renders.

Towards Continuous Sign Language Conversation from Isolated Signs

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

Constructs continuous sign conversation data from isolated signs using retrieval and diffusion models to train a direct sign-to-sign conversational AI.

Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context

cs.CV · 2026-05-13 · unverdicted · novelty 6.0

Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further training.

RigidFormer: Learning Rigid Dynamics using Transformers

cs.CV · 2026-05-09 · unverdicted · novelty 6.0

RigidFormer learns mesh-free rigid dynamics from point clouds using object-centric anchors, Anchor-Vertex Pooling, Anchor-based RoPE, and differentiable Kabsch alignment to enforce rigidity.

WavesFM: Hierarchical Representation Learning for Longitudinal Wearable Sensor Waveforms

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

WavesFM uses hierarchical SSL to pretrain a segment encoder on short waveforms followed by a temporal encoder on multi-day sequences, outperforming prior methods on 58 tasks after training on over 12 million hours of data from hundreds of thousands of people.

What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion

cs.CV · 2026-05-08 · unverdicted · novelty 6.0

Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.

Mean Mode Screaming: Mean--Variance Split Residuals for 1000-Layer Diffusion Transformers

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Mean-Variance Split residuals separate centered variation from mean updates to prevent collapse and enable stable training of 1000-layer Diffusion Transformers.

citing papers explorer

Showing 50 of 58 citing papers.

Large Language Diffusion Models cs.CL · 2025-02-14 · unverdicted · none · ref 108
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
StableHand: Quality-Aware Flow Matching for World-Space Dual-Hand Motion Estimation from Egocentric Video cs.CV · 2026-05-18 · unverdicted · none · ref 41
StableHand introduces a quality-aware flow matching framework conditioned on predicted four-channel per-frame hand observation quality to estimate dual-hand world-space motion from egocentric video, achieving SOTA results with 20-25% W-MPJPE reduction on HOT3D and ARCTIC benchmarks.
Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation cs.CV · 2026-05-15 · unverdicted · none · ref 38
Echo-Forcing decouples stable anchors, compressed history, and recent dynamics in video diffusion KV caches using hierarchical memory, scene recall frames, and difference-aware decay to support interactive long video generation under bounded cache.
Pretraining Strategies and Scaling for ECG Foundation Models: A Systematic Study eess.SP · 2026-05-12 · unverdicted · none · ref 31
Contrastive predictive coding pretraining combined with structured state space models yields the strongest ECG foundation models, with continued gains from scaling data to 11 million samples.
ExtraVAR: Stage-Aware RoPE Remapping for Resolution Extrapolation in Visual Autoregressive Models cs.CV · 2026-05-11 · unverdicted · none · ref 30
ExtraVAR enables resolution extrapolation in visual autoregressive models by stage-aware RoPE remapping and entropy-driven attention scaling, suppressing repetition and detail loss.
Every Feedforward Neural Network Definable in an o-Minimal Structure Has Finite Sample Complexity stat.ML · 2026-05-08 · unverdicted · none · ref 78
Every fixed finite feedforward neural network definable in an o-minimal structure has finite sample complexity in the agnostic PAC setting.
Consistent Geometric Deep Learning via Hilbert Bundles and Cellular Sheaves cs.LG · 2026-05-07 · unverdicted · none · ref 96 · 2 links
HilbNets define convolutions via Hilbert bundle connection Laplacians, prove that sampled Hilbert cellular sheaf Laplacians converge to the continuous operator, and show that discretized networks are consistent and transferable across samplings.
Task Vector Geometry Underlies Dual Modes of Task Inference in Transformers cs.LG · 2026-05-05 · unverdicted · none · ref 39
In a controlled synthetic setting, transformers implement in-distribution task inference via convex combinations of task vectors and out-of-distribution inference via nearly orthogonal extrapolative representations.
AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics cs.CV · 2026-05-05 · unverdicted · none · ref 30 · 3 links
AniMatrix generates anime videos by structuring artistic production rules into a controllable taxonomy and training the model to prioritize those rules over physical realism, achieving top scores from professional animators on prompt understanding and artistic motion.
Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation cs.LG · 2026-04-11 · unverdicted · none · ref 179
The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.
Screening Is Enough cs.LG · 2026-04-01 · unverdicted · none · ref 18
Multiscreen replaces softmax attention with screening to provide absolute query-key relevance, resulting in models with 30% fewer parameters that maintain stable performance at long contexts.
NEST: Nested Event Stream Transformer for Sequences of Multisets cs.LG · 2026-01-31 · unverdicted · none · ref 33
NEST is a nested transformer for sequences of multisets that uses masked set modeling to learn improved set-level representations from hierarchical event streams like EHRs.
In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer cs.CV · 2025-04-29 · unverdicted · none · ref 55
ICEdit achieves state-of-the-art instructional image editing in Diffusion Transformers via in-context generation, requiring only 0.1% of prior training data and 1% trainable parameters.
SEGA: Spectral-Energy Guided Attention for Resolution Extrapolation in Diffusion Transformers cs.CV · 2026-05-21 · unverdicted · none · ref 21
SEGA adaptively scales RoPE attention components using spectral-energy guidance from the latent to improve structural coherence and fine details in high-resolution DiT synthesis.
RiT: Vanilla Diffusion Transformers Suffice in Representation Space cs.CV · 2026-05-21 · conditional · none · ref 32
A vanilla Diffusion Transformer trained via x-prediction on frozen DINOv2 features reaches FID 1.14 on ImageNet 256x256 with fewer parameters and faster sampling than prior DiT variants.
TrajTok: Adaptive Spatial Tokenization for Trajectory Representation Learning cs.LG · 2026-05-19 · unverdicted · none · ref 21
TrajTok learns multi-resolution hexagonal spatial tokens from GPS data and pretrains a factorized transformer with ST-RoPE and masked modeling to yield frozen encoders that outperform task-specific methods on similarity, classification, and travel-time tasks in the Porto dataset.
Generative Recursive Reasoning cs.AI · 2026-05-19 · unverdicted · none · ref 46 · 2 links
GRAM is a latent-variable generative model that performs recursive reasoning via stochastic trajectories, trained with amortized variational inference to support multi-hypothesis reasoning and unconditional generation.
PIXLRelight: Controllable Relighting via Intrinsic Conditioning cs.CV · 2026-05-18 · unverdicted · none · ref 39
A transformer-based neural renderer that transfers arbitrary PBR lighting to single images via shared intrinsic conditioning extracted from both multi-illumination photos and path-traced coarse 3D renders.
Towards Continuous Sign Language Conversation from Isolated Signs cs.CV · 2026-05-14 · unverdicted · none · ref 75
Constructs continuous sign conversation data from isolated signs using retrieval and diffusion models to train a direct sign-to-sign conversational AI.
Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context cs.CV · 2026-05-13 · unverdicted · none · ref 74
Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further training.
RigidFormer: Learning Rigid Dynamics using Transformers cs.CV · 2026-05-09 · unverdicted · none · ref 34
RigidFormer learns mesh-free rigid dynamics from point clouds using object-centric anchors, Anchor-Vertex Pooling, Anchor-based RoPE, and differentiable Kabsch alignment to enforce rigidity.
WavesFM: Hierarchical Representation Learning for Longitudinal Wearable Sensor Waveforms cs.LG · 2026-05-09 · unverdicted · none · ref 43
WavesFM uses hierarchical SSL to pretrain a segment encoder on short waveforms followed by a temporal encoder on multi-day sequences, outperforming prior methods on 58 tasks after training on over 12 million hours of data from hundreds of thousands of people.
What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion cs.CV · 2026-05-08 · unverdicted · none · ref 77
Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.
Mean Mode Screaming: Mean--Variance Split Residuals for 1000-Layer Diffusion Transformers cs.LG · 2026-05-07 · unverdicted · none · ref 16
Mean-Variance Split residuals separate centered variation from mean updates to prevent collapse and enable stable training of 1000-layer Diffusion Transformers.
Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising cs.RO · 2026-04-29 · unverdicted · none · ref 67 · 2 links
X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.
OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation cs.CV · 2026-04-13 · unverdicted · none · ref 52
OmniShow unifies text, image, audio, and pose conditions into an end-to-end model for high-quality human-object interaction video generation and introduces the HOIVG-Bench benchmark, claiming state-of-the-art results.
LPM 1.0: Video-based Character Performance Model cs.CV · 2026-04-09 · unverdicted · none · ref 60
LPM 1.0 generates infinite-length, identity-stable, real-time audio-visual conversational performances for single characters using a distilled causal diffusion transformer and a new benchmark.
Attention Sinks Induce Gradient Sinks: Massive Activations as Gradient Regulators in Transformers cs.LG · 2026-03-18 · unverdicted · none · ref 35
Attention sinks induce gradient sinks under causal masking, with massive activations serving as adaptive RMSNorm regulators that attenuate localized gradient pressure in Transformer training.
MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens cs.CL · 2026-03-06 · unverdicted · none · ref 36
MSA is an end-to-end trainable memory model using sparse attention and document-wise RoPE that scales to 100M tokens with linear complexity and less than 9% degradation.
Timer-S1: A Billion-Scale Time Series Foundation Model with Serial Scaling cs.AI · 2026-03-05 · unverdicted · none · ref 50
Timer-S1 is a released 8.3B-parameter MoE time series model that achieves state-of-the-art MASE and CRPS scores on GIFT-Eval using serial scaling and Serial-Token Prediction.
HyMem: Hybrid Memory Architecture with Dynamic Retrieval Scheduling cs.AI · 2026-02-15 · unverdicted · none · ref 24
HyMem introduces dual-granular memory storage with a lightweight summary module for fast responses and selective activation of a deep LLM module for complex queries, outperforming full-context baselines by 92.6% lower computational cost on LOCOMO and LongMemEval benchmarks.
DeMa: Dual-Path Delay-Aware Mamba for Efficient Multivariate Time Series Analysis cs.LG · 2026-01-09 · unverdicted · none · ref 45
DeMa is a dual-path delay-aware Mamba architecture that decomposes MTS into intra-series temporal and inter-series variate paths to achieve SOTA performance with linear complexity on forecasting, imputation, anomaly detection, and classification.
RefTon: Reference person shot assist virtual Try-on cs.CV · 2025-11-02 · unverdicted · none · ref 32
RefTon is a flux-based virtual try-on method that uses unpaired reference images of the target garment on different people to guide texture and detail preservation in a streamlined person-to-person pipeline without body parsing or masks.
Emu3.5: Native Multimodal Models are World Learners cs.CV · 2025-10-30 · unverdicted · none · ref 85
Emu3.5 is a native multimodal world model pre-trained on over 10 trillion vision-language tokens with next-token prediction, post-trained via reinforcement learning, and accelerated by Discrete Diffusion Adaptation for efficient interleaved generation and world exploration.
MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training cs.CL · 2025-10-21 · conditional · none · ref 29
MTraining scales LLM training to 512K-token contexts on 32 A100 GPUs by integrating dynamic sparse training patterns with balanced and hierarchical sparse ring attention, achieving up to 6x throughput gains without accuracy loss on long-context benchmarks.
MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing cs.CV · 2025-09-26 · unverdicted · none · ref 39
MinerU2.5 uses a two-stage decoupled vision-language architecture to achieve state-of-the-art document parsing accuracy with lower computational overhead than existing general and domain-specific models.
MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent cs.CL · 2025-07-03 · unverdicted · none · ref 11
MemAgent uses multi-conversation RL to train a memory agent that reads text in segments and overwrites memory, extrapolating from 8K training to 3.5M token QA with under 5% loss and 95%+ on 512K RULER.
Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models cs.CV · 2025-05-22 · unverdicted · none · ref 19
Circle-RoPE achieves cross-modal positional disentanglement in VLMs by mapping 2D image tokens to a cone-like annulus orthogonal to the text axis, with PTD=0 eliminating RoPE geometric bias while preserving intra-image structure via alternating geometry encoding.
Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation cs.CV · 2025-05-08 · unverdicted · none · ref 70
Mogao presents a causal unified model with deep fusion, dual encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interleaved outputs including zero-shot editing.
Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model cs.CV · 2025-03-10 · unverdicted · none · ref 29
Seedream 2.0 is a native Chinese-English bilingual diffusion model that integrates a self-developed LLM text encoder, Glyph-Aligned ByT5, and Scaled ROPE to reach claimed state-of-the-art results in prompt following, aesthetics, text rendering, and human preference alignment via RLHF.
Reasoning through Verifiable Forecast Actions: Consistency-Grounded RL for Financial LLMs cs.LG · 2026-05-21 · unverdicted · none · ref 52
StockR1 unifies LLM-based financial reasoning and time-series forecasting by emitting verifiable forecast actions that condition a decoder, optimized via consistency-grounded RL to improve accuracy on QA and prediction tasks.
Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models cs.CV · 2026-05-20 · unverdicted · none · ref 21
Lens is a 3.8B-parameter text-to-image model that reaches competitive or superior performance to >6B-parameter systems using 19.3% of the training compute of Z-Image through a densely captioned 800M dataset, multi-resolution batching, semantic VAE, strong language encoder, RL fine-tuning, and 4-step
Prune, Update and Trim: Robust Structured Pruning for Large Language Models cs.LG · 2026-05-18 · unverdicted · none · ref 35
Putri is a structured pruning technique for LLMs that compensates for pruning errors via weight updates and sequential processing while pruning at the attention-head level to reach state-of-the-art results at extreme sparsity.
EHR-RAGp: Retrieval-Augmented Prototype-Guided Foundation Model for Electronic Health Records cs.IR · 2026-05-12 · unverdicted · none · ref 63
EHR-RAGp is a retrieval-augmented EHR foundation model that employs prototype-guided retrieval to dynamically integrate relevant historical patient context, outperforming prior models on clinical prediction tasks.
MemReread: Enhancing Agentic Long-Context Reasoning via Memory-Guided Rereading cs.CL · 2026-05-11 · unverdicted · none · ref 31
MemReread improves agent long-context reasoning by triggering rereading on insufficient final memory to recover discarded indirect facts, outperforming baselines at linear complexity.
Towards Effective Theory of LLMs: A Representation Learning Approach cs.LG · 2026-05-10 · unverdicted · none · ref 55
RET learns temporally consistent macrovariables from LLM activations via self-supervised learning to support interpretability, early behavioral prediction, and causal intervention.
JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency cs.CL · 2026-04-03 · unverdicted · none · ref 11
JoyAI-LLM Flash delivers a 48B MoE LLM with 2.7B active parameters per token via FiberPO RL and dense multi-token prediction, released with checkpoints on Hugging Face.
Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models cs.CL · 2026-01-20 · unverdicted · none · ref 284
The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.
Emerging Properties in Unified Multimodal Pretraining cs.CV · 2025-05-20 · unverdicted · none · ref 67
BAGEL is a unified decoder-only model that develops emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.
A Simple Plug-in for Improving Eviction-Based KV Cache Compression cs.LG · 2026-05-22 · unverdicted · none · ref 30
VECTOR augments eviction-based KV cache compression with three-way token routing that combines importance scoring and offline regression-based reconstructability estimation to improve quality at high compression ratios.

Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer