mega hub Mixed citations

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter · 2017 · cs.LG · arXiv 1711.05101

Mixed citation behavior. Most common role is method (58%).

1169 Pith papers citing it

Method 58% of classified citations

open full Pith review browse 1169 citing papers more from Ilya Loshchilov and Frank Hutter arXiv PDF

abstract

L$_2$ regularization and weight decay regularization are equivalent for standard stochastic gradient descent (when rescaled by the learning rate), but as we demonstrate this is \emph{not} the case for adaptive gradient algorithms, such as Adam. While common implementations of these algorithms employ L$_2$ regularization (often calling it "weight decay" in what may be misleading due to the inequivalence we expose), we propose a simple modification to recover the original formulation of weight decay regularization by \emph{decoupling} the weight decay from the optimization steps taken w.r.t. the loss function. We provide empirical evidence that our proposed modification (i) decouples the optimal choice of weight decay factor from the setting of the learning rate for both standard SGD and Adam and (ii) substantially improves Adam's generalization performance, allowing it to compete with SGD with momentum on image classification datasets (on which it was previously typically outperformed by the latter). Our proposed decoupled weight decay has already been adopted by many researchers, and the community has implemented it in TensorFlow and PyTorch; the complete source code for our experiments is available at https://github.com/loshchil/AdamW-and-SGDW

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

method 103 background 64 dataset 5 baseline 3 other 3

citation-polarity summary

use method 103 background 57 unclear 9 use dataset 5 baseline 3 support 1

claims ledger

abstract L$_2$ regularization and weight decay regularization are equivalent for standard stochastic gradient descent (when rescaled by the learning rate), but as we demonstrate this is \emph{not} the case for adaptive gradient algorithms, such as Adam. While common implementations of these algorithms employ L$_2$ regularization (often calling it "weight decay" in what may be misleading due to the inequivalence we expose), we propose a simple modification to recover the original formulation of weight decay regularization by \emph{decoupling} the weight decay from the optimization steps taken w.r.t. the

authors

Ilya Loshchilov and Frank Hutter

mega hub controls

export citing contexts JSON export graph JSON export full bundle JSON open full Pith review annotated reader queued

Recognition alignment

counterfactual ablation

If this work disappeared, these are the nearest dependency candidates in Pith, weighted toward method, dataset, baseline, and extension contexts where available. This is a structural signal, not a retraction verdict.

co-cited works

representative citing papers

DataComp-VLM: Improved Open Datasets for Vision-Language Models

cs.CV · 2026-06-26 · conditional · novelty 8.0 · 2 refs

DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).

Lip Forcing: Few-Step Autoregressive Diffusion for Real-time Lip Synchronization

cs.CV · 2026-06-09 · conditional · novelty 8.0

Lip Forcing distills a 14B bidirectional video diffusion teacher into autoregressive students that achieve real-time lip synchronization at 31 FPS using two denoising steps without CFG.

Every9D-21M: Large-Scale Real-World 9D Canonicalization of Everyday Objects

cs.CV · 2026-05-27 · conditional · novelty 8.0

Every9D-21M supplies 21.8M real-world 9D pose annotations for 700 everyday categories by propagating manual canonical poses through cross-instance alignment in object-centric videos and verifying them multiview.

AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation

cs.CV · 2026-05-13 · unverdicted · novelty 8.0

AnyFlow enables any-step video diffusion by distilling flow-map transitions over arbitrary time intervals with on-policy backward simulation.

Rigel3D: Rig-aware Latents for Animation-Ready 3D Asset Generation

cs.GR · 2026-05-13 · unverdicted · novelty 8.0

Rigel3D jointly generates rigged 3D meshes with geometry, skeleton topology, joint positions, and skinning weights using coupled surface and skeleton latent representations for image-conditioned animation-ready asset synthesis.

TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking

cs.CV · 2026-05-12 · unverdicted · novelty 8.0

TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.

Dissecting Jet-Tagger Through Mechanistic Interpretability

hep-ph · 2026-05-11 · accept · novelty 8.0

A Particle Transformer jet tagger contains a sparse six-head circuit whose source-relay-readout structure recovers most performance and whose residual stream preferentially encodes 2-prong energy correlators.

LLM Translation of Compiler Intermediate Representation

cs.PL · 2026-05-07 · unverdicted · novelty 8.0

IRIS-14B is the first LLM trained explicitly for GIMPLE-to-LLVM IR translation and outperforms much larger models by up to 44 percentage points on real-world C code.

Bringing Order to Asynchronous SGD: Towards Optimality under Data-Dependent Delays with Momentum

cs.LG · 2026-05-03 · unverdicted · novelty 8.0 · 2 refs

Momentum-based async SGD achieves optimal convergence rates for data-dependent delays without biasing updates toward simpler samples.

CADFS: A Big CAD Program Dataset and Framework for Computer-Aided Design with Large Language Models

cs.CV · 2026-05-03 · unverdicted · novelty 8.0

CADFS supplies a large real-world CAD dataset and FeatureScript representation that, after VLM fine-tuning, produces more accurate and feature-rich designs than prior generative CAD systems.

Stability and Generalization in Looped Transformers

cs.LG · 2026-04-16 · unverdicted · novelty 8.0

Looped transformers with recall and outer normalization produce reachable, input-dependent fixed points with stable gradients, enabling generalization, while those without recall cannot; a new internal recall variant performs competitively or better.

CLAD: Efficient Log Anomaly Detection Directly on Compressed Representations

cs.LG · 2026-04-14 · unverdicted · novelty 8.0

CLAD is the first deep learning framework for log anomaly detection that operates directly on compressed byte streams using a dilated convolutional encoder, hybrid Transformer-mLSTM, and two-stage training, achieving 0.9909 average F1-score across five datasets.

Rotation Equivariant Mamba for Vision Tasks

cs.CV · 2026-03-10 · unverdicted · novelty 8.0

EQ-VMamba adds rotation-equivariant cross-scan and group Mamba blocks to enforce end-to-end rotation equivariance, yielding better rotation robustness, competitive accuracy, and roughly 50% fewer parameters than non-equivariant baselines across classification, segmentation, and super-resolution.

A document is worth a structured record: Principled inductive bias design for document recognition

cs.CV · 2025-07-11 · unverdicted · novelty 8.0

Introduces a method to design structure-specific relational inductive biases for a base transformer architecture, enabling end-to-end transcription of documents with intrinsic structures, demonstrated on sheet music, shape drawings, and mechanical engineering drawings.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

ORPO: Monolithic Preference Optimization without Reference Model

cs.CL · 2024-03-12 · conditional · novelty 8.0

ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

VMamba: Visual State Space Model

cs.CV · 2024-01-18 · conditional · novelty 8.0

VMamba introduces a state-space vision backbone using 2D selective scanning across four routes to achieve linear complexity and strong performance on image tasks.

Progress measures for grokking via mechanistic interpretability

cs.LG · 2023-01-12 · accept · novelty 8.0

Grokking arises from gradual amplification of a Fourier-based circuit in the weights followed by removal of memorizing components.

Discovering Latent Knowledge in Language Models Without Supervision

cs.CL · 2022-12-07 · conditional · novelty 8.0

An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average across models and datasets.

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

cs.LG · 2022-09-07 · unverdicted · novelty 8.0

Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation step.

Decision Transformer: Reinforcement Learning via Sequence Modeling

cs.LG · 2021-06-02 · accept · novelty 8.0

Decision Transformer casts RL as autoregressive sequence modeling conditioned on desired returns, past states and actions, matching or exceeding offline RL baselines on Atari, Gym and Key-to-Door tasks.

RoFormer: Enhanced Transformer with Rotary Position Embedding

cs.CL · 2021-04-20 · accept · novelty 8.0

RoFormer introduces rotary position embeddings that encode absolute positions via rotation matrices and relative dependencies in attention, outperforming prior position methods on long text classification tasks.

Language Models are Few-Shot Learners

cs.CL · 2020-05-28 · accept · novelty 8.0

GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

Seek to Segment: Active Perception for Panoramic Referring Segmentation

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

Introduces APRS task and PanoSeeker agent using VLM plus EgoSphere memory for active 360° search and segmentation, outperforming baselines on a new benchmark.

citing papers explorer

Showing 50 of 64 citing papers after filters.

TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking cs.CV · 2026-05-12 · unverdicted · none · ref 50 · internal anchor
TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.
AuraMask: An Extensible Pipeline for Developing Aesthetic Anti-Facial Recognition Image Filters cs.CV · 2026-05-13 · conditional · none · ref 62 · internal anchor
AuraMask produces 40 aesthetic anti-facial recognition filters that match or exceed prior adversarial effectiveness and achieve significantly higher user acceptance in a 630-person study.
PointForward: Feedforward Driving Reconstruction through Point-Aligned Representations cs.CV · 2026-05-12 · unverdicted · none · ref 21 · internal anchor
PointForward uses sparse world-space 3D queries and scene graphs to deliver consistent single-pass reconstruction of dynamic driving scenes via point-aligned representations.
Unified Modeling of Lane and Lane Topology for Driving Scene Reasoning cs.CV · 2026-05-09 · unverdicted · none · ref 62 · internal anchor
UniTopo unifies lane detection and topology reasoning into a single perception model, outperforming prior methods on OpenLane-V2 benchmarks with TOP_ll scores of 30.1% and 31.8%.
From Articulated Kinematics to Routed Visual Control for Action-Conditioned Surgical Video Generation cs.CV · 2026-05-09 · unverdicted · none · ref 55 · internal anchor
A kinematic-to-visual lifting paradigm combined with hierarchically routed control generates action-conditioned surgical videos with better faithfulness, fidelity, and efficiency.
SAM 3D Animal: Promptable Animal 3D Reconstruction from Images in the Wild cs.CV · 2026-05-08 · unverdicted · none · ref 26 · internal anchor
SAM 3D Animal is the first promptable framework for multi-animal 3D reconstruction from single images, built on SMAL+ and trained on the new Herd3D dataset, achieving SOTA results on Animal3D, APTv2, and Animal Kingdom benchmarks.
SignMAE: Segmentation-Driven Self-Supervised Learning for Sign Language Recognition cs.CV · 2026-05-03 · unverdicted · none · ref 21 · internal anchor
SignMAE uses segmentation-driven masking in a mask-and-reconstruct self-supervised task to learn fine-grained sign representations, achieving state-of-the-art accuracy on WLASL, NMFs-CSL, and Slovo with fewer frames and modalities.
LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories cs.CV · 2026-04-16 · unverdicted · none · ref 31 · internal anchor
LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.
TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens cs.CV · 2026-04-16 · unverdicted · none · ref 28 · internal anchor
TokenGS uses learnable Gaussian tokens in an encoder-decoder architecture to regress 3D means directly, achieving SOTA feed-forward reconstruction on static and dynamic scenes with better robustness.
Geometrically Consistent Multi-View Scene Generation from Freehand Sketches cs.CV · 2026-04-15 · unverdicted · none · ref 33 · internal anchor
A framework generates consistent multi-view scenes from one freehand sketch via a ~9k-sample dataset, Parallel Camera-Aware Attention Adapters, and Sparse Correspondence Supervision Loss, outperforming baselines in realism and consistency.
Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding cs.CV · 2026-04-09 · unverdicted · none · ref 38 · internal anchor
Bridge-STG decouples spatio-temporal alignment via semantic bridging and query-guided localization modules to achieve state-of-the-art m_vIoU of 34.3 on VidSTG among MLLM methods.
MoRight: Motion Control Done Right cs.CV · 2026-04-08 · unverdicted · none · ref 52 · internal anchor
MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply causality in video generation.
Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training cs.CV · 2026-04-08 · unverdicted · none · ref 35 · internal anchor
Mem3R achieves better long-sequence 3D reconstruction by decoupling tracking and mapping with a hybrid memory of TTT-updated MLP and explicit tokens, reducing model size and trajectory errors.
Your Pre-trained Diffusion Model Secretly Knows Restoration cs.CV · 2026-04-06 · unverdicted · none · ref 33 · internal anchor
Pre-trained diffusion models inherently support image restoration that can be unlocked by optimizing prompt embeddings at the text encoder output using a diffusion bridge formulation, achieving competitive results on models like WAN and FLUX without fine-tuning.
InCaRPose: In-Cabin Relative Camera Pose Estimation Model and Dataset cs.CV · 2026-04-04 · unverdicted · none · ref 39 · internal anchor
InCaRPose is a Transformer-based model trained on synthetic data that predicts absolute metric-scale relative poses between distorted in-cabin camera views and generalizes to real images while releasing a new test dataset.
Deformation-based In-Context Learning for Point Cloud Understanding cs.CV · 2026-04-03 · unverdicted · none · ref 27 · internal anchor
DeformPIC deforms query point clouds under prompt guidance for in-context learning, outperforming prior methods with lower Chamfer Distance on reconstruction, denoising, and registration tasks.
Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining cs.CV · 2026-04-02 · unverdicted · none · ref 40 · internal anchor
Pretraining on 1M wild videos followed by post-training on curated data yields high-fidelity feedforward 3D avatars that generalize across identities, clothing, and lighting with emergent relightability and loose-garment support.
UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes cs.CV · 2025-11-28 · conditional · none · ref 35 · internal anchor
UniGeoSeg releases the first million-scale dataset for instruction-driven remote sensing segmentation and a unified model that achieves state-of-the-art results with strong zero-shot generalization.
Structured 3D Latents for Scalable and Versatile 3D Generation cs.CV · 2024-12-02 · unverdicted · none · ref 53 · internal anchor
SLAT provides a unified 3D latent representation enabling versatile high-quality generation across multiple output formats from text or image inputs.
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment cs.CV · 2024-03-08 · unverdicted · none · ref 36 · internal anchor
ELLA introduces a timestep-aware semantic connector to link LLMs with diffusion models for improved dense prompt following, validated on a new 1K-prompt benchmark.
Scalable Diffusion Models with Transformers cs.CV · 2022-12-19 · unverdicted · none · ref 33 · internal anchor
DiTs achieve SOTA FID of 2.27 on ImageNet 256x256 by scaling transformer-based latent diffusion models, with performance improving consistently as Gflops increase.
Bridging Brain and Semantics: A Hierarchical Framework for Semantically Enhanced fMRI-to-Video Reconstruction cs.CV · 2026-05-14 · unverdicted · none · ref 64 · internal anchor
CineNeuron improves fMRI-to-video reconstruction by combining bottom-up semantic enrichment with top-down Mixture-of-Memories integration and outperforms prior methods on benchmarks.
The Missing GAP: From Solving Square Jigsaw Puzzles to Handling Real World Archaeological Fragments cs.CV · 2026-05-12 · unverdicted · none · ref 30 · internal anchor
Introduces GAP datasets of synthetic heavily eroded irregular puzzle pieces modeled on archaeological fragments and PuzzleFlow, a ViT and flow-matching framework that outperforms prior jigsaw solvers on these datasets.
DuetFair: Coupling Inter- and Intra-Subgroup Robustness for Fair Medical Image Segmentation cs.CV · 2026-05-11 · unverdicted · none · ref 26 · internal anchor
DuetFair couples inter-subgroup adaptation with intra-subgroup robustness via FairDRO (dMoE plus subgroup-conditioned DRO) to boost worst-case and equity-scaled performance on medical segmentation benchmarks.
Noise-Started One-Step Real-World Super-Resolution via LR-Conditioned SplitMeanFlow and GAN Refinement cs.CV · 2026-05-10 · unverdicted · none · ref 23 · internal anchor
SMFSR achieves state-of-the-art perceptual quality among one-step diffusion-based real-world super-resolution methods by preserving noise-started generation via LR-conditioned SplitMeanFlow and GAN refinement.
Curvature-Aware Captioning:Leveraging Geodesic Attention for 3D Scene Understanding cs.CV · 2026-05-09 · unverdicted · none · ref 32 · internal anchor
A new framework combines self-attention on the Oblique manifold with bidirectional geodesic cross-attention on the Lorentz hyperboloid to improve both localization accuracy and descriptive coherence in 3D dense captioning.
L2A: Learning to Accumulate Pose History for Accurate 3D Human Pose Estimation cs.CV · 2026-05-09 · unverdicted · none · ref 11 · 2 links · internal anchor
A new method accumulates historical pose features across layers in a Transformer network to reach state-of-the-art 3D human pose estimation accuracy.
Gate-and-Merge: Zero-shot Compositional Personalization of Vision Language Models cs.CV · 2026-05-09 · unverdicted · none · ref 8 · internal anchor
Gate-and-Merge enables zero-shot compositional personalization of VLMs by independently learning concept-specific LoRA adapters and merging them in weight space with cue-based gating to suppress interference.
DVD: Discrete Voxel Diffusion for 3D Generation and Editing cs.CV · 2026-05-08 · unverdicted · none · ref 43 · 2 links · internal anchor
DVD applies discrete diffusion directly to voxel occupancy for 3D generation, uncertainty estimation via entropy, and single-round editing via block perturbation fine-tuning.
ICDAR 2026 Competition on Writer Identification and Pen Classification from Hand-Drawn Circles cs.CV · 2026-05-08 · accept · none · ref 12 · 2 links · internal anchor
CircleID introduces a controlled dataset of 46,155 circles from 66 writers and 8 pens, with competition results showing top accuracies of 64.8% for open-set writer identification and 92.7% for pen classification.
Lightweight Unpaired Smartphone ISP Transfer with Semantic Pseudo-Pairing cs.CV · 2026-05-08 · conditional · none · ref 23 · internal anchor
Semantic pseudo-pairing via DINOv2 embeddings and fused Gromov-Wasserstein optimal transport enables training a 7K-parameter CNN for unpaired smartphone ISP, achieving 22.569 PSNR on the NTIRE 2026 challenge test set.
DIMoE-Adapters: Dynamic Expert Evolution for Continual Learning in Vision-Language Models cs.CV · 2026-05-08 · unverdicted · none · ref 40 · internal anchor
DIMoE-Adapters uses self-calibrated expert evolution and prototype-guided selection to dynamically grow and allocate experts, outperforming prior continual learning methods on vision-language models.
Eulerian Motion Guidance: Robust Image Animation via Bidirectional Geometric Consistency cs.CV · 2026-05-07 · unverdicted · none · ref 18 · 4 links · internal anchor
Introduces Eulerian motion guidance with bidirectional geometric consistency to improve training speed and temporal quality in diffusion-based image animation.
Height-Guided Projection Reparameterization for Camera-LiDAR Occupancy cs.CV · 2026-05-06 · unverdicted · none · ref 29 · 2 links · internal anchor
HiPR improves 3D occupancy prediction by reparameterizing image-to-voxel projections using LiDAR-derived height priors to adapt sampling ranges to scene sparsity and height variations.
Lyra 2.0: Explorable Generative 3D Worlds cs.CV · 2026-04-14 · unverdicted · none · ref 68 · internal anchor
Lyra 2.0 produces persistent 3D-consistent video sequences for large explorable worlds by using per-frame geometry for information routing and self-augmented training to correct temporal drift.
OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation cs.CV · 2026-04-13 · unverdicted · none · ref 44 · internal anchor
OmniShow unifies text, image, audio, and pose conditions into an end-to-end model for high-quality human-object interaction video generation and introduces the HOIVG-Bench benchmark, claiming state-of-the-art results.
Rein3D: Reinforced 3D Indoor Scene Generation with Panoramic Video Diffusion Models cs.CV · 2026-04-12 · unverdicted · none · ref 24 · internal anchor
Rein3D generates photorealistic, globally consistent 3D indoor scenes by using a restore-and-refine process where radial panoramic videos are restored via diffusion models and then used to update a 3D Gaussian field.
Latent-Compressed Variational Autoencoder for Video Diffusion Models cs.CV · 2026-04-12 · unverdicted · none · ref 30 · internal anchor
A frequency-based latent compression method for video VAEs yields higher reconstruction quality than channel-reduction baselines at fixed compression ratios.
SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing cs.CV · 2026-04-06 · unverdicted · none · ref 33 · internal anchor
SpatialEdit provides a benchmark, large synthetic dataset, and baseline model for precise object and camera spatial manipulations in images, with the model beating priors on spatial editing.
R\'enyi Attention Entropy for Patch Pruning cs.CV · 2026-04-04 · unverdicted · none · ref 16 · internal anchor
Rényi entropy of attention maps serves as a tunable criterion for pruning redundant patches in vision transformers, reducing compute with preserved accuracy on image recognition.
An Explainable Vision-Language Model Framework with Adaptive PID-Tversky Loss for Lumbar Spinal Stenosis Diagnosis cs.CV · 2026-04-02 · unverdicted · none · ref 24 · internal anchor
A VLM framework with spatial patch cross-attention and adaptive PID-Tversky loss reports 90.69% classification accuracy, 0.9512 Dice score, and 92.80 CIDEr for LSS diagnosis plus automated report generation.
Street-Legal Physical-World Adversarial Rim for License Plates cs.CV · 2026-04-02 · conditional · none · ref 23 · internal anchor
SPAR is a street-legal physical rim that cuts modern ALPR accuracy by 60% and reaches 18% targeted impersonation while costing under $100 and requiring no plate modification.
PET-DINO: Unifying Visual Cues into Grounding DINO with Prompt-Enriched Training cs.CV · 2026-04-01 · unverdicted · none · ref 22 · internal anchor
PET-DINO unifies visual and text prompts in Grounding DINO via an alignment-friendly generation module and prompt-enriched training strategies to improve zero-shot open-set object detection.
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling cs.CV · 2024-12-06 · unverdicted · none · ref 161 · internal anchor
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets cs.CV · 2023-11-25 · conditional · none · ref 59 · internal anchor
Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results while enabling motion LoRA and multi-view 3D applications.
IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models cs.CV · 2023-08-13 · unverdicted · none · ref 46 · internal anchor
IP-Adapter adds effective image prompting to text-to-image diffusion models using a lightweight decoupled cross-attention adapter that works alongside text prompts and other controls.
Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis cs.CV · 2023-06-15 · conditional · none · ref 14 · internal anchor
HPD v2 is the largest human preference dataset for text-to-image images with 798k choices, and HPS v2 is the resulting CLIP-based scorer that better predicts human judgments and responds to model improvements.
DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection cs.CV · 2022-03-07 · conditional · none · ref 24 · internal anchor
DINO reaches 51.3 AP on COCO val2017 with a ResNet-50 backbone after 24 epochs, a +2.7 AP gain over the prior best DETR variant.
Frequency Adapter with SAM for Generalized Medical Image Segmentation cs.CV · 2026-05-11 · unverdicted · none · ref 17 · internal anchor
FSAM integrates a frequency adapter into SAM with LoRA to extract domain-invariant high-frequency features and outperforms prior domain generalization methods on fundus and prostate datasets.
Video Generation with Predictive Latents cs.CV · 2026-05-04 · unverdicted · none · ref 28 · internal anchor
PV-VAE improves video latent spaces for generation by unifying reconstruction with future-frame prediction, reporting 52% faster convergence and 34.42 FVD gain over Wan2.2 VAE on UCF101.