hub Mixed citations

Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, Yunfeng Liu · 2024

Mixed citation behavior. Most common role is background (55%).

58 Pith papers citing it

Background 55% of classified citations

browse 58 citing papers

hub tools

JSON dossier citing papers JSON

citation-role summary

background 11 method 8 baseline 1

citation-polarity summary

background 11 use method 8 baseline 1

representative citing papers

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

StableHand: Quality-Aware Flow Matching for World-Space Dual-Hand Motion Estimation from Egocentric Video

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

StableHand introduces a quality-aware flow matching framework conditioned on predicted four-channel per-frame hand observation quality to estimate dual-hand world-space motion from egocentric video, achieving SOTA results with 20-25% W-MPJPE reduction on HOT3D and ARCTIC benchmarks.

Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation

cs.CV · 2026-05-15 · unverdicted · novelty 7.0

Echo-Forcing decouples stable anchors, compressed history, and recent dynamics in video diffusion KV caches using hierarchical memory, scene recall frames, and difference-aware decay to support interactive long video generation under bounded cache.

Pretraining Strategies and Scaling for ECG Foundation Models: A Systematic Study

eess.SP · 2026-05-12 · unverdicted · novelty 7.0

Contrastive predictive coding pretraining combined with structured state space models yields the strongest ECG foundation models, with continued gains from scaling data to 11 million samples.

ExtraVAR: Stage-Aware RoPE Remapping for Resolution Extrapolation in Visual Autoregressive Models

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

ExtraVAR enables resolution extrapolation in visual autoregressive models by stage-aware RoPE remapping and entropy-driven attention scaling, suppressing repetition and detail loss.

Every Feedforward Neural Network Definable in an o-Minimal Structure Has Finite Sample Complexity

stat.ML · 2026-05-08 · unverdicted · novelty 7.0

Every fixed finite feedforward neural network definable in an o-minimal structure has finite sample complexity in the agnostic PAC setting.

Consistent Geometric Deep Learning via Hilbert Bundles and Cellular Sheaves

cs.LG · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

HilbNets define convolutions via Hilbert bundle connection Laplacians, prove that sampled Hilbert cellular sheaf Laplacians converge to the continuous operator, and show that discretized networks are consistent and transferable across samplings.

Task Vector Geometry Underlies Dual Modes of Task Inference in Transformers

cs.LG · 2026-05-05 · unverdicted · novelty 7.0

In a controlled synthetic setting, transformers implement in-distribution task inference via convex combinations of task vectors and out-of-distribution inference via nearly orthogonal extrapolative representations.

AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics

cs.CV · 2026-05-05 · unverdicted · novelty 7.0 · 3 refs

AniMatrix generates anime videos by structuring artistic production rules into a controllable taxonomy and training the model to prioritize those rules over physical realism, achieving top scores from professional animators on prompt understanding and artistic motion.

Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation

cs.LG · 2026-04-11 · unverdicted · novelty 7.0

The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.

Screening Is Enough

cs.LG · 2026-04-01 · unverdicted · novelty 7.0

Multiscreen replaces softmax attention with screening to provide absolute query-key relevance, resulting in models with 30% fewer parameters that maintain stable performance at long contexts.

NEST: Nested Event Stream Transformer for Sequences of Multisets

cs.LG · 2026-01-31 · unverdicted · novelty 7.0

NEST is a nested transformer for sequences of multisets that uses masked set modeling to learn improved set-level representations from hierarchical event streams like EHRs.

In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer

cs.CV · 2025-04-29 · unverdicted · novelty 7.0

ICEdit achieves state-of-the-art instructional image editing in Diffusion Transformers via in-context generation, requiring only 0.1% of prior training data and 1% trainable parameters.

SEGA: Spectral-Energy Guided Attention for Resolution Extrapolation in Diffusion Transformers

cs.CV · 2026-05-21 · unverdicted · novelty 6.0

SEGA adaptively scales RoPE attention components using spectral-energy guidance from the latent to improve structural coherence and fine details in high-resolution DiT synthesis.

RiT: Vanilla Diffusion Transformers Suffice in Representation Space

cs.CV · 2026-05-21 · conditional · novelty 6.0

A vanilla Diffusion Transformer trained via x-prediction on frozen DINOv2 features reaches FID 1.14 on ImageNet 256x256 with fewer parameters and faster sampling than prior DiT variants.

TrajTok: Adaptive Spatial Tokenization for Trajectory Representation Learning

cs.LG · 2026-05-19 · unverdicted · novelty 6.0

TrajTok learns multi-resolution hexagonal spatial tokens from GPS data and pretrains a factorized transformer with ST-RoPE and masked modeling to yield frozen encoders that outperform task-specific methods on similarity, classification, and travel-time tasks in the Porto dataset.

Generative Recursive Reasoning

cs.AI · 2026-05-19 · unverdicted · novelty 6.0 · 2 refs

GRAM is a latent-variable generative model that performs recursive reasoning via stochastic trajectories, trained with amortized variational inference to support multi-hypothesis reasoning and unconditional generation.

PIXLRelight: Controllable Relighting via Intrinsic Conditioning

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

A transformer-based neural renderer that transfers arbitrary PBR lighting to single images via shared intrinsic conditioning extracted from both multi-illumination photos and path-traced coarse 3D renders.

Towards Continuous Sign Language Conversation from Isolated Signs

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

Constructs continuous sign conversation data from isolated signs using retrieval and diffusion models to train a direct sign-to-sign conversational AI.

Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context

cs.CV · 2026-05-13 · unverdicted · novelty 6.0

Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further training.

RigidFormer: Learning Rigid Dynamics using Transformers

cs.CV · 2026-05-09 · unverdicted · novelty 6.0

RigidFormer learns mesh-free rigid dynamics from point clouds using object-centric anchors, Anchor-Vertex Pooling, Anchor-based RoPE, and differentiable Kabsch alignment to enforce rigidity.

WavesFM: Hierarchical Representation Learning for Longitudinal Wearable Sensor Waveforms

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

WavesFM uses hierarchical SSL to pretrain a segment encoder on short waveforms followed by a temporal encoder on multi-day sequences, outperforming prior methods on 58 tasks after training on over 12 million hours of data from hundreds of thousands of people.

What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion

cs.CV · 2026-05-08 · unverdicted · novelty 6.0

Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.

Mean Mode Screaming: Mean--Variance Split Residuals for 1000-Layer Diffusion Transformers

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Mean-Variance Split residuals separate centered variation from mean updates to prevent collapse and enable stable training of 1000-layer Diffusion Transformers.

citing papers explorer

Showing 29 of 29 citing papers after filters.

StableHand: Quality-Aware Flow Matching for World-Space Dual-Hand Motion Estimation from Egocentric Video cs.CV · 2026-05-18 · unverdicted · none · ref 41
StableHand introduces a quality-aware flow matching framework conditioned on predicted four-channel per-frame hand observation quality to estimate dual-hand world-space motion from egocentric video, achieving SOTA results with 20-25% W-MPJPE reduction on HOT3D and ARCTIC benchmarks.
Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation cs.CV · 2026-05-15 · unverdicted · none · ref 38
Echo-Forcing decouples stable anchors, compressed history, and recent dynamics in video diffusion KV caches using hierarchical memory, scene recall frames, and difference-aware decay to support interactive long video generation under bounded cache.
ExtraVAR: Stage-Aware RoPE Remapping for Resolution Extrapolation in Visual Autoregressive Models cs.CV · 2026-05-11 · unverdicted · none · ref 30
ExtraVAR enables resolution extrapolation in visual autoregressive models by stage-aware RoPE remapping and entropy-driven attention scaling, suppressing repetition and detail loss.
AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics cs.CV · 2026-05-05 · unverdicted · none · ref 30 · 3 links
AniMatrix generates anime videos by structuring artistic production rules into a controllable taxonomy and training the model to prioritize those rules over physical realism, achieving top scores from professional animators on prompt understanding and artistic motion.
In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer cs.CV · 2025-04-29 · unverdicted · none · ref 55
ICEdit achieves state-of-the-art instructional image editing in Diffusion Transformers via in-context generation, requiring only 0.1% of prior training data and 1% trainable parameters.
SEGA: Spectral-Energy Guided Attention for Resolution Extrapolation in Diffusion Transformers cs.CV · 2026-05-21 · unverdicted · none · ref 21
SEGA adaptively scales RoPE attention components using spectral-energy guidance from the latent to improve structural coherence and fine details in high-resolution DiT synthesis.
RiT: Vanilla Diffusion Transformers Suffice in Representation Space cs.CV · 2026-05-21 · conditional · none · ref 32
A vanilla Diffusion Transformer trained via x-prediction on frozen DINOv2 features reaches FID 1.14 on ImageNet 256x256 with fewer parameters and faster sampling than prior DiT variants.
PIXLRelight: Controllable Relighting via Intrinsic Conditioning cs.CV · 2026-05-18 · unverdicted · none · ref 39
A transformer-based neural renderer that transfers arbitrary PBR lighting to single images via shared intrinsic conditioning extracted from both multi-illumination photos and path-traced coarse 3D renders.
Towards Continuous Sign Language Conversation from Isolated Signs cs.CV · 2026-05-14 · unverdicted · none · ref 75
Constructs continuous sign conversation data from isolated signs using retrieval and diffusion models to train a direct sign-to-sign conversational AI.
Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context cs.CV · 2026-05-13 · unverdicted · none · ref 74
Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further training.
RigidFormer: Learning Rigid Dynamics using Transformers cs.CV · 2026-05-09 · unverdicted · none · ref 34
RigidFormer learns mesh-free rigid dynamics from point clouds using object-centric anchors, Anchor-Vertex Pooling, Anchor-based RoPE, and differentiable Kabsch alignment to enforce rigidity.
What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion cs.CV · 2026-05-08 · unverdicted · none · ref 77
Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.
OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation cs.CV · 2026-04-13 · unverdicted · none · ref 52
OmniShow unifies text, image, audio, and pose conditions into an end-to-end model for high-quality human-object interaction video generation and introduces the HOIVG-Bench benchmark, claiming state-of-the-art results.
LPM 1.0: Video-based Character Performance Model cs.CV · 2026-04-09 · unverdicted · none · ref 60
LPM 1.0 generates infinite-length, identity-stable, real-time audio-visual conversational performances for single characters using a distilled causal diffusion transformer and a new benchmark.
RefTon: Reference person shot assist virtual Try-on cs.CV · 2025-11-02 · unverdicted · none · ref 32
RefTon is a flux-based virtual try-on method that uses unpaired reference images of the target garment on different people to guide texture and detail preservation in a streamlined person-to-person pipeline without body parsing or masks.
Emu3.5: Native Multimodal Models are World Learners cs.CV · 2025-10-30 · unverdicted · none · ref 85
Emu3.5 is a native multimodal world model pre-trained on over 10 trillion vision-language tokens with next-token prediction, post-trained via reinforcement learning, and accelerated by Discrete Diffusion Adaptation for efficient interleaved generation and world exploration.
MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing cs.CV · 2025-09-26 · unverdicted · none · ref 39
MinerU2.5 uses a two-stage decoupled vision-language architecture to achieve state-of-the-art document parsing accuracy with lower computational overhead than existing general and domain-specific models.
Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models cs.CV · 2025-05-22 · unverdicted · none · ref 19
Circle-RoPE achieves cross-modal positional disentanglement in VLMs by mapping 2D image tokens to a cone-like annulus orthogonal to the text axis, with PTD=0 eliminating RoPE geometric bias while preserving intra-image structure via alternating geometry encoding.
Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation cs.CV · 2025-05-08 · unverdicted · none · ref 70
Mogao presents a causal unified model with deep fusion, dual encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interleaved outputs including zero-shot editing.
Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model cs.CV · 2025-03-10 · unverdicted · none · ref 29
Seedream 2.0 is a native Chinese-English bilingual diffusion model that integrates a self-developed LLM text encoder, Glyph-Aligned ByT5, and Scaled ROPE to reach claimed state-of-the-art results in prompt following, aesthetics, text rendering, and human preference alignment via RLHF.
Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models cs.CV · 2026-05-20 · unverdicted · none · ref 21
Lens is a 3.8B-parameter text-to-image model that reaches competitive or superior performance to >6B-parameter systems using 19.3% of the training compute of Z-Image through a densely captioned 800M dataset, multi-resolution batching, semantic VAE, strong language encoder, RL fine-tuning, and 4-step
Emerging Properties in Unified Multimodal Pretraining cs.CV · 2025-05-20 · unverdicted · none · ref 67
BAGEL is a unified decoder-only model that develops emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.
HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds cs.CV · 2026-04-15 · unverdicted · none · ref 59
HY-World 2.0 generates and reconstructs high-fidelity navigable 3D Gaussian Splatting worlds from text, images, or videos via upgraded panorama, planning, expansion, and composition modules, with released code claiming open-source SOTA performance.
Seed1.5-VL Technical Report cs.CV · 2025-05-11 · unverdicted · none · ref 127
Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
Seedream 3.0 Technical Report cs.CV · 2025-04-15 · unverdicted · none · ref 22
Seedream 3.0 improves bilingual image generation through doubled defect-aware data, mixed-resolution training, cross-modality RoPE, representation alignment, aesthetic SFT, VLM reward modeling, and importance-aware timestep sampling for 4-8x faster inference at up to 2K resolution.
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding cs.CV · 2025-01-22 · unverdicted · none · ref 51
VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
HyperDiT: Hyper-Connected Transformers for High-Fidelity Pixel-Space Diffusion cs.CV · 2026-05-15 · unreviewed · ref 24
HunyuanImage 3.0 Technical Report cs.CV · 2025-09-28 · unreviewed · ref 39
VRAG: Learning World Models for Interactive Video Generation cs.CV · 2025-05-28 · unreviewed · ref 58

Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer