CountLoop: Training-Free High-Instance Image Generation via Iterative Agent Guidance

2025 · cs.CV · arXiv 2508.16644

2 Pith papers cite this work. Polarity classification is still indexing.



abstract

Diffusion models excel at photorealistic synthesis but struggle with precise object counts, especially in high-density settings. We introduce COUNTLOOP, a training-free framework that achieves precise instance control through iterative, structured feedback. Our method alternates between synthesis and evaluation: a VLM-based planner generates structured scene layouts, while a VLM-based critic provides explicit feedback on object counts, spatial arrangements, and visual quality to refine the layout iteratively. Instance-driven attention masking and cumulative attention composition further prevent semantic leakage, ensuring clear object separation even in densely occluded scenes. Evaluations on COCO-Count, T2I-CompBench, and two newly introduced high instance benchmarks show that COUNTLOOP reduces counting error by up to 57% and achieves the highest or comparable spatial quality scores across all benchmarks, while maintaining photorealism.
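The alternation between planning and critique described in the abstract can be sketched as a simple control loop. This is a minimal illustration, not the authors' implementation: the names `refine_layout`, `plan`, `critique`, `Feedback`, and the `Box` layout format are all hypothetical, the diffusion synthesis step is omitted, and the critic here inspects the layout directly rather than a generated image. The toy planner and critic stand in for the VLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical layout format: one (x, y, width, height) box per object instance.
Box = Tuple[float, float, float, float]


@dataclass
class Feedback:
    count_error: int  # target count minus the number of boxes in the layout
    note: str         # free-text hint passed back to the planner


def refine_layout(
    target: int,
    plan: Callable[[int, List[Box], str], List[Box]],
    critique: Callable[[int, List[Box]], Feedback],
    max_rounds: int = 5,
) -> Tuple[List[Box], int]:
    """Alternate planning and critique until the count matches or rounds run out."""
    layout: List[Box] = []
    note = ""
    for round_idx in range(1, max_rounds + 1):
        layout = plan(target, layout, note)   # planner proposes or revises a layout
        feedback = critique(target, layout)   # critic scores the proposal
        if feedback.count_error == 0:
            return layout, round_idx
        note = feedback.note                  # feed the critique into the next round
    return layout, max_rounds


# Toy stand-ins for the VLM planner and critic (assumptions, for illustration only).
def toy_plan(target: int, previous: List[Box], note: str) -> List[Box]:
    # Adds at most three boxes per round to mimic gradual refinement.
    missing = max(target - len(previous), 0)
    start = len(previous)
    new_boxes = [(float(i), 0.0, 1.0, 1.0) for i in range(start, start + min(3, missing))]
    return previous + new_boxes


def toy_critique(target: int, layout: List[Box]) -> Feedback:
    err = target - len(layout)
    return Feedback(err, f"add {err} more" if err > 0 else "ok")


layout, rounds = refine_layout(7, toy_plan, toy_critique)
print(len(layout), rounds)
```

The `max_rounds` cap is a design choice of this sketch: it bounds the number of planner and critic calls when the count never converges.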

representative citing papers

PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning

cs.AI · 2026-05-19 · unverdicted · novelty 7.0

PRISM benchmark of over 10k pairs shows LLMs have a 41% average drop from code execution success to spatial correctness in programmatic video generation.

ABACUS: Adapting Unified Foundation Model for Bridging Image Count Understanding and Generation

cs.CV · 2026-06-22 · unverdicted · novelty 4.0

ABACUS adapts a 3B unified foundation model using density-aware zooming, boundary-aware GRPO, and cycle-consistent self-critique to achieve SOTA on seven counting and generation benchmarks without task-specific training.

citing papers explorer


PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning
cs.AI · 2026-05-19 · unverdicted · none · ref 31 · internal anchor

ABACUS: Adapting Unified Foundation Model for Bridging Image Count Understanding and Generation
cs.CV · 2026-06-22 · unverdicted · none · ref 54 · internal anchor
