hub

Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer · 2025 · arXiv 2501.18427

14 Pith papers cite this work. Polarity classification is still indexing.

14 Pith papers citing it

read on arXiv browse 14 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 2 baseline 2

citation-polarity summary

background 2 baseline 2

representative citing papers

Flow-GRPO: Training Flow Matching Models via Online RL

cs.CV · 2025-05-08 · unverdicted · novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

The Silent Brush: Evaluating Artistic Style Leakage in AI Art Generation

cs.LG · 2026-05-17 · unverdicted · novelty 7.0

Art Arena evaluates how artistic styles from training data leak into AI-generated images without explicit prompts, revealing asymmetric blending due to differences in representational strength and interaction dynamics across models like Stable Diffusion.

In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer

cs.CV · 2025-04-29 · unverdicted · novelty 7.0

ICEdit achieves state-of-the-art instructional image editing in Diffusion Transformers via in-context generation, requiring only 0.1% of prior training data and 1% trainable parameters.

What Makes Synthetic Data Effective in Image Segmentation

cs.CV · 2026-05-19 · unverdicted · novelty 6.0

Dense scene composition and instance fidelity in synthetic diffusion images drive better segmentation performance; SENSE framework exploits this to improve models on Cityscapes, COCO, and ADE20K.

When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduced to preserve diversity and improve quality.

EPIC: Efficient Predicate-Guided Inference-Time Control for Compositional Text-to-Image Generation

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

EPIC introduces predicate-guided inference-time search that lifts compositional T2I prompt accuracy from 34% to 71% on GenEval2 with 31-81% lower execution costs.

Nucleus-Image: Sparse MoE for Image Generation

cs.CV · 2026-04-14 · unverdicted · novelty 6.0

A 17B-parameter sparse MoE diffusion transformer activates 2B parameters per pass and reaches competitive quality on image generation benchmarks without post-training.

TC-AE: Unlocking Token Capacity for Deep Compression Autoencoders

cs.CV · 2026-04-08 · unverdicted · novelty 6.0

TC-AE improves reconstruction and generative performance in deep compression by decomposing token-to-latent compression into two stages and using joint self-supervised training.

RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling

cs.CV · 2025-10-23 · unverdicted · novelty 6.0

RAPO++ is a three-stage prompt optimization framework combining retrieval-augmented refinement, closed-loop test-time scaling, and LLM fine-tuning to enhance text-to-video generation quality.

Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency

cs.CV · 2025-10-09 · conditional · novelty 6.0

The work introduces rCM, a score-regularized continuous-time consistency model that matches DMD2 quality on large models up to 14B parameters while improving diversity and enabling 1-4 step sampling.

YOLOv12: Attention-Centric Real-Time Object Detectors

cs.CV · 2025-02-18 · unverdicted · novelty 6.0

YOLOv12 is a new attention-based real-time object detector that reports higher accuracy than YOLOv10, YOLOv11, and RT-DETR variants at comparable or better speed and efficiency.

WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens

cs.CV · 2026-05-18 · unverdicted · novelty 5.0

WinTok is a hybrid visual tokenizer that supplements pixel tokens with learnable semantic tokens distilled asymmetrically from foundation models to improve reconstruction, understanding, and generation.

Qwen-Image Technical Report

cs.CV · 2025-08-04 · unverdicted · novelty 5.0

Qwen-Image is a foundation model that reaches state-of-the-art results in image generation and editing by combining a large-scale text-focused data pipeline with curriculum learning and dual semantic-reconstructive encoding for editing consistency.

UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models

cs.CV · 2026-04-20

citing papers explorer

Showing 14 of 14 citing papers.

Flow-GRPO: Training Flow Matching Models via Online RL cs.CV · 2025-05-08 · unverdicted · none · ref 70
Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
The Silent Brush: Evaluating Artistic Style Leakage in AI Art Generation cs.LG · 2026-05-17 · unverdicted · none · ref 7
Art Arena evaluates how artistic styles from training data leak into AI-generated images without explicit prompts, revealing asymmetric blending due to differences in representational strength and interaction dynamics across models like Stable Diffusion.
In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer cs.CV · 2025-04-29 · unverdicted · none · ref 60
ICEdit achieves state-of-the-art instructional image editing in Diffusion Transformers via in-context generation, requiring only 0.1% of prior training data and 1% trainable parameters.
What Makes Synthetic Data Effective in Image Segmentation cs.CV · 2026-05-19 · unverdicted · none · ref 15
Dense scene composition and instance fidelity in synthetic diffusion images drive better segmentation performance; SENSE framework exploits this to improve models on Cityscapes, COCO, and ADE20K.
When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy cs.CV · 2026-05-12 · unverdicted · none · ref 70
Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduced to preserve diversity and improve quality.
EPIC: Efficient Predicate-Guided Inference-Time Control for Compositional Text-to-Image Generation cs.CV · 2026-05-12 · unverdicted · none · ref 14
EPIC introduces predicate-guided inference-time search that lifts compositional T2I prompt accuracy from 34% to 71% on GenEval2 with 31-81% lower execution costs.
Nucleus-Image: Sparse MoE for Image Generation cs.CV · 2026-04-14 · unverdicted · none · ref 61
A 17B-parameter sparse MoE diffusion transformer activates 2B parameters per pass and reaches competitive quality on image generation benchmarks without post-training.
TC-AE: Unlocking Token Capacity for Deep Compression Autoencoders cs.CV · 2026-04-08 · unverdicted · none · ref 11
TC-AE improves reconstruction and generative performance in deep compression by decomposing token-to-latent compression into two stages and using joint self-supervised training.
RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling cs.CV · 2025-10-23 · unverdicted · none · ref 20
RAPO++ is a three-stage prompt optimization framework combining retrieval-augmented refinement, closed-loop test-time scaling, and LLM fine-tuning to enhance text-to-video generation quality.
Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency cs.CV · 2025-10-09 · conditional · none · ref 26
The work introduces rCM, a score-regularized continuous-time consistency model that matches DMD2 quality on large models up to 14B parameters while improving diversity and enabling 1-4 step sampling.
YOLOv12: Attention-Centric Real-Time Object Detectors cs.CV · 2025-02-18 · unverdicted · none · ref 63
YOLOv12 is a new attention-based real-time object detector that reports higher accuracy than YOLOv10, YOLOv11, and RT-DETR variants at comparable or better speed and efficiency.
WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens cs.CV · 2026-05-18 · unverdicted · none · ref 94
WinTok is a hybrid visual tokenizer that supplements pixel tokens with learnable semantic tokens distilled asymmetrically from foundation models to improve reconstruction, understanding, and generation.
Qwen-Image Technical Report cs.CV · 2025-08-04 · unverdicted · none · ref 32
Qwen-Image is a foundation model that reaches state-of-the-art results in image generation and editing by combining a large-scale text-focused data pipeline with curriculum learning and dual semantic-reconstructive encoding for editing consistency.
UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models cs.CV · 2026-04-20 · unreviewed · ref 20

Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer