Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm

Haoxuan Che; Jean-Michel Morel; Kangning Cui; Meng Chu; Raymond H. Chan; Rui Liu; Suiyun Zhang; Xiaodong Cun; Yaofang Liu; Zhaoqing Li

arxiv: 2605.12271 · v2 · pith:SO5DGOG4new · submitted 2026-05-12 · 💻 cs.CV

Pith reviewed 2026-05-13 05:56 UTC · model grok-4.3

classification 💻 cs.CV

keywords visual-to-visual generation · V2V-Zero · training-free conditioning · vision-language models · image generation · conditional generation

The pith

Visual specification pages replace text prompts in frozen generators without any retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes visual-to-visual generation, in which users specify outputs through visual documents such as sketches, references, and annotated scenes rather than converting intent into text. It introduces V2V-Zero, a training-free method that extracts final-layer hidden states from these visual pages with an existing VLM and substitutes them for text conditioning. On GenEval the approach reaches 0.85 with a frozen Qwen-Image backbone, nearly matching the model's optimized text-to-image results. On the new Simple-V2V Bench, V2V-Zero scores 32.7/100 across seven tasks, with attribute binding succeeding more reliably than structural control or content generation; a HunyuanVideo-1.5 video extension reaches 20.2/100.

Core claim

V2V-Zero is a training-free framework that conditions existing VLM-based generators by replacing text-only inputs with final-layer hidden states extracted from visual specification pages, exploiting the fact that the frozen VLM already projects both modalities into the generator's conditioning space.
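A minimal sketch of the substitution, not the authors' implementation: the generator consumes a sequence of conditioning vectors, and the one hard requirement for the swap is that page-derived states have the width the generator expects. The names `check_conditioning` and `build_conditioning`, the toy width of 4, and the list-of-lists representation are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]
States = List[Vector]  # one hidden-state vector per conditioning token


def check_conditioning(states: States, expected_dim: int) -> States:
    """Validate that every vector has the width the generator expects."""
    if not states:
        raise ValueError("conditioning sequence is empty")
    for i, vec in enumerate(states):
        if len(vec) != expected_dim:
            raise ValueError(
                f"token {i} has dim {len(vec)}, expected {expected_dim}"
            )
    return states


def build_conditioning(
    text_states: States,
    page_states: States,
    expected_dim: int,
    use_visual_page: bool,
) -> States:
    """Pick which final-layer states are handed to the frozen generator.

    Nothing downstream changes: the generator sees a sequence of vectors
    either way. Sequence length may differ; the width must not.
    """
    chosen = page_states if use_visual_page else text_states
    return check_conditioning(chosen, expected_dim)


# Hypothetical toy inputs: 3 text-token states vs. 7 visual-page states.
text_states = [[0.0] * 4 for _ in range(3)]
page_states = [[1.0] * 4 for _ in range(7)]
cond = build_conditioning(text_states, page_states, 4, use_visual_page=True)
print(len(cond), "conditioning tokens of width", len(cond[0]))
```

A width check only shows that the swap is mechanically possible. Whether the page-derived states are semantically usable by the generator is the empirical question the paper's benchmarks address.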

What carries the argument

V2V-Zero framework that substitutes text conditioning with final-layer hidden states from visual pages in VLM-conditioned generators.

If this is right

Existing commercial and open-weight generators can accept visual conditioning through the same interface without architectural modification.
Attribute binding succeeds reliably while structural alignment and novel content synthesis remain weak points even in closed models.
The same conditioning swap extends directly to video generators and yields measurable though lower performance.
Conditioning-token attention concentrates 95 percent on the visual-page states, indicating the default reasoning path is visually routed.
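The attention-share figure in the last point can be read as a simple ratio. The sketch below assumes the attention weights from latent image queries to conditioning tokens have already been extracted, and that the indices of the visual-page tokens are known; the function name and the toy numbers are hypothetical, and hooking a real DiT is out of scope here.

```python
from typing import Iterable, List


def visual_attention_share(
    attn_rows: List[List[float]], visual_idx: Iterable[int]
) -> float:
    """Fraction of conditioning-token attention mass on visual-page tokens.

    attn_rows[q][j] is the attention weight from latent query q to
    conditioning token j. Mass is summed over all queries, then the
    visual-token part is divided by the total over conditioning tokens.
    """
    visual = set(visual_idx)
    total = 0.0
    on_visual = 0.0
    for row in attn_rows:
        for j, weight in enumerate(row):
            total += weight
            if j in visual:
                on_visual += weight
    if total == 0.0:
        raise ValueError("attention rows carry no mass")
    return on_visual / total


# Toy example: tokens 0 and 1 are visual-page states, token 2 is a
# generated reasoning-token state.
rows = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1]]
print(round(visual_attention_share(rows, [0, 1]), 3))
```

Dividing by the mass on conditioning tokens only, rather than by all attention mass, is an assumption about how the reported share is normalized.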

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Users could shift from prompt engineering to creating and editing visual reference documents as the primary creative interface.
The observed hierarchy of task difficulty points to specific places where future models would need stronger visual-semantic integration.
If the mapping property holds across more VLMs, visual-to-visual may become the default conditioning mode rather than a special case.

Load-bearing premise

The frozen VLM already maps both text and visual pages into the generator's conditioning space so that the visual hidden states can stand in for text without any fine-tuning.

What would settle it

An experiment that swaps the visual-page hidden states for random vectors while keeping every other model component fixed and measures whether GenEval and Simple-V2V Bench scores collapse toward zero.
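One way such a control could be constructed is sketched below. Matching each random vector's L2 norm to the state it replaces is a design choice assumed here, so that any score collapse reflects lost content rather than a change in scale; the text above only specifies random vectors. The function name is hypothetical.

```python
import math
import random
from typing import List


def random_control_states(
    states: List[List[float]], seed: int = 0
) -> List[List[float]]:
    """Replace each hidden state with Gaussian noise of the same L2 norm."""
    rng = random.Random(seed)
    control = []
    for vec in states:
        target_norm = math.sqrt(sum(x * x for x in vec))
        noise = [rng.gauss(0.0, 1.0) for _ in vec]
        noise_norm = math.sqrt(sum(x * x for x in noise))
        # Rescale the noise so its length matches the original state.
        scale = target_norm / noise_norm if noise_norm > 0.0 else 0.0
        control.append([x * scale for x in noise])
    return control


# Toy visual-page states; real ones would come from the frozen VLM.
page_states = [[3.0, 4.0], [1.0, 2.0]]
control = random_control_states(page_states, seed=0)
print(len(control), len(control[0]))
```

The control keeps sequence length, width, and per-token norm fixed, so the generator receives input of the same shape and scale with the page content removed.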

Figures

Figures reproduced from arXiv: 2605.12271 by Haoxuan Che, Jean-Michel Morel, Kangning Cui, Meng Chu, Raymond H. Chan, Rui Liu, Suiyun Zhang, Xiaodong Cun, Yaofang Liu, Zhaoqing Li.

**Figure 1.** Qualitative Simple-V2V Bench comparison. Rows are visual-conditioning tasks and columns compare the same visual page across V2V-Zero and SOTA baselines, previewing strong attribute/reference binding and harder counting, pose, sketch, and style-transfer cases.

**Figure 2.** V2V-Zero replaces user text prompts with visual prompt pages. A frozen VLM can accept plain visual text, inline color blocks, inline image blocks, or stylized rendered text tokens as encoder inputs. The main V2V-Zero path keeps pretrained weights and learned modules unchanged: the VLM reads the visual page, exposes visual hidden states, and the frozen DiT generator cross-attends to those states through its…

**Figure 3.** HunyuanVideo-1.5 representative examples on Simple-V2V Bench. Each row shows the visual input page and four uniformly sampled frames from one generated video. The examples illustrate inline-color and object-counting cases; the aggregate score of 20.2/100 in…

**Figure 4.** Real DiT attention routing in the V2V-Zero reasoning path. We hook Qwen-Image DiT joint attention during a real inline-color V2V-Bench generation and measure attention from latent image queries to VLM conditioning hidden states. The FULL-FINAL reasoning path contains both visual-prefix states from the visual page and generated reasoning-token states, but the DiT assigns 95.0% of conditioning-token attentio…

**Figure 5.** Token-level cross-modal alignment on rendered text pages. Rendered text-page image-token states retrieve their matching phrase-token states with R@1=68%, R@3=84%, and MRR=0.773, showing local visual-text alignment in the injected VLM hidden states.

**Figure 6.** Simple-V2V Bench visual-page atlas. Representative input pages from the seven task families show the visual evidence that models must read from the page, including rendered text, inline swatches, visual references, counting displays, style references, pose skeletons, and sketches.

**Figure 7.** Simple-V2V Bench category scores. Category-level scores expose which visual specification types are handled reliably and which remain difficult across models. The strongest systems remain much weaker on pose and sketch control than on inline color, visual reference, and object counting.

**Figure 8.** Quality–alignment bottleneck analysis. Most systems maintain substantially higher visual quality than alignment to the input page; final scores are therefore primarily limited by visual-instruction following rather than by raw image fidelity.

**Figure 9.** Distribution of final sample scores. The sample-level score distribution distinguishes consistently moderate behavior from mixtures of high-scoring successes and low-scoring alignment failures, complementing the mean category scores in…

**Figure 10.** More qualitative examples from Simple-V2V Bench.

**Figure 11.** Representative structural-control failures. Each row shows the input visual specification page, a V2V-Zero output, and a GPT Image 2 output for the same case. Pose pages require preserving joint topology and human count; sketch pages require preserving object layout, relative scale, and contour structure. V2V-Zero frequently turns these inputs into wireframe-like or collage-like images, while stronger com…
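The retrieval metrics quoted in the Figure 5 caption (R@1, R@3, MRR) follow a standard recipe, sketched here under the assumption that query `i` (an image-token state) has exactly one correct match, key `i` (a phrase-token state), and that a similarity matrix is already available. The function name and toy matrix are hypothetical, and the numbers are not from the paper.

```python
from typing import Dict, List, Sequence, Tuple


def retrieval_metrics(
    sim: List[List[float]], ks: Sequence[int] = (1, 3)
) -> Tuple[Dict[int, float], float]:
    """Recall@k and mean reciprocal rank when query i should retrieve key i."""
    n = len(sim)
    ranks = []
    for i, row in enumerate(sim):
        # Rank of the true match: 1 plus the number of keys scoring higher.
        higher = sum(1 for j, s in enumerate(row) if j != i and s > row[i])
        ranks.append(1 + higher)
    recall = {k: sum(1 for r in ranks if r <= k) / n for k in ks}
    mrr = sum(1.0 / r for r in ranks) / n
    return recall, mrr


# Toy similarity matrix: true matches land at ranks 1, 2 and 3.
sim = [
    [0.9, 0.1, 0.0],
    [0.8, 0.5, 0.1],
    [0.3, 0.4, 0.2],
]
recall, mrr = retrieval_metrics(sim)
print(recall, round(mrr, 3))
```

Ties are resolved in favor of the true match in this sketch; a stricter evaluation could count tied keys against it.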

Original abstract

Humans often specify and create through visual artifacts: typography sheets, sketches, reference images, and annotated scenes. Yet modern visual generators still ask users to serialize this intent into text, a bottleneck that compresses signals like spatial structure, exact appearance, and glyph shape. We propose visual-to-visual (V2V) generation, in which the user conditions a generative model with a visual specification page rather than a text prompt. The page is not an edit target, but a visual document that specifies the desired output. We introduce V2V-Zero, a training-free framework that exposes this interface in existing vision-language model (VLM) conditioned generators by replacing text-only conditioning with final-layer hidden states extracted from visual pages, exploiting the fact that the frozen VLM already maps both text and images into the generator's conditioning space. On GenEval, V2V-Zero reaches 0.85 with a frozen Qwen-Image backbone, closely matching its optimized text-to-image performance without fine-tuning. To evaluate the broader V2V space, we introduce Simple-V2V Bench, spanning seven visual-conditioning tasks and seven models, including GPT Image 2, Nano Banana 2, Seedream 5.0 Lite, open-weight baselines, and a video extension. V2V-Zero scores 32.7/100, outperforming evaluated open-weight image baselines and revealing a clear capability hierarchy: attribute binding is strong, content generation is unreliable, and structural control remains hard even for commercial systems. A HunyuanVideo-1.5 extension scores 20.2/100, showing the interface transfers beyond images. Mechanistic analysis shows the default reasoning path is primarily visually routed, with 95.0% of conditioning-token attention mass on visual-page hidden states.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

V2V-Zero shows a training-free swap of text tokens for VLM final-layer states from visual pages can match text-to-image scores on GenEval, but the conditioning spaces may not be equivalent.

The main thing to know is that this paper gives a clean way to do visual conditioning on existing generators by pulling final hidden states from a VLM that has read a visual spec page instead of text. It works without any fine-tuning and gets 0.85 on GenEval with a frozen Qwen-Image backbone, close to the text baseline. They also release Simple-V2V Bench and run it across several models plus a video extension, which is useful for seeing where current systems still struggle with structure and content generation. The attention breakdown (95% on visual states) adds some mechanistic support for why the swap holds up on the tested cases. What stands out is how little they change the architecture while still getting the interface to work.

The soft spot is the core assumption that those visual hidden states sit in the same conditioning manifold as text tokens. The paper shows the dimensions line up and the model attends to them, but nothing rules out a systematic shift between states derived from image patches and states derived from text tokens, especially on out-of-distribution visual pages with sketches or annotations. The 32.7/100 bench score is better than the open baselines they compare against, yet it still leaves a lot of room, which suggests the method surfaces the problem more than it solves it.

This deserves a serious referee and should interest anyone working on multimodal generators or non-text interfaces. The framing is new enough and the results are grounded enough to justify review time, even if more targeted ablations on the state equivalence would tighten the central claim.

Referee Report

2 major / 1 minor

Summary. The paper proposes visual-to-visual (V2V) generation as an alternative to text prompting, where a visual specification page (sketches, glyphs, annotations) conditions a generative model. It introduces the training-free V2V-Zero framework, which extracts final-layer hidden states from a frozen VLM processing the visual page and substitutes them for text conditioning tokens, exploiting the claim that the VLM already maps both modalities into the generator's conditioning space. On GenEval, V2V-Zero achieves 0.85 with a frozen Qwen-Image backbone, closely matching its text-to-image performance. On a new Simple-V2V Bench spanning seven tasks and seven models, V2V-Zero scores 32.7/100 and a HunyuanVideo-1.5 extension scores 20.2/100. Mechanistic analysis reports 95.0% of conditioning-token attention mass on visual-page hidden states.

Significance. If the core substitution holds, the work offers a practical route to richer conditioning interfaces that preserve spatial and structural signals lost in text serialization, with the training-free property and video transfer as clear strengths. The competitive GenEval number and attention analysis provide initial support, but the absence of error bars, distribution-shift tests, and detailed ablations limits the strength of the evidence for broad adoption.

major comments (2)

[Abstract] Abstract: The central claim that final-layer VLM hidden states from visual pages occupy the same conditioning manifold as text tokens (allowing direct substitution without fine-tuning or architectural changes) is load-bearing yet unverified; no direct comparison of state distributions, positional encoding effects, or out-of-distribution visual-page tests is reported to rule out systematic shifts that GenEval may tolerate but other tasks would not.
[Abstract] GenEval results: The reported 0.85 score is presented as closely matching optimized text-to-image performance, but lacks error bars, variance across runs, or explicit controls for visual-page composition (e.g., sketch vs. annotated scene), making it impossible to assess whether the match is robust or coincidental.
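A comparison of state distributions, as requested in the first major comment, could start from something as simple as a centroid comparison between text-derived and page-derived states. This is a sketch of one possible first-pass diagnostic, not an analysis from the paper; the function names and toy vectors are hypothetical.

```python
import math
from typing import Dict, List


def centroid(states: List[List[float]]) -> List[float]:
    """Mean hidden-state vector over a set of tokens."""
    n = len(states)
    dim = len(states[0])
    return [sum(vec[d] for vec in states) / n for d in range(dim)]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def centroid_shift(
    text_states: List[List[float]], page_states: List[List[float]]
) -> Dict[str, float]:
    """Cosine similarity and L2 distance between the two centroids."""
    c_text, c_page = centroid(text_states), centroid(page_states)
    l2 = math.sqrt(sum((x - y) ** 2 for x, y in zip(c_text, c_page)))
    return {"cosine": cosine(c_text, c_page), "l2": l2}


# Toy states; real ones would be final-layer VLM outputs on held-out data.
text_states = [[1.0, 0.0], [0.8, 0.2]]
page_states = [[0.7, 0.3], [0.9, 0.1]]
print(centroid_shift(text_states, page_states))
```

A high centroid cosine is weak evidence on its own, since two distributions can share a mean and still differ in covariance or in per-token structure.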

minor comments (1)

[Abstract] Abstract: The description of Simple-V2V Bench mentions seven tasks and seven models but provides no definition of the scoring scale (out of 100) or task breakdown, which would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will incorporate revisions to strengthen the empirical support for the core substitution claim and the reported results.

Referee: [Abstract] Abstract: The central claim that final-layer VLM hidden states from visual pages occupy the same conditioning manifold as text tokens (allowing direct substitution without fine-tuning or architectural changes) is load-bearing yet unverified; no direct comparison of state distributions, positional encoding effects, or out-of-distribution visual-page tests is reported to rule out systematic shifts that GenEval may tolerate but other tasks would not.

Authors: We agree that direct verification of manifold alignment (e.g., via distribution comparisons or positional encoding analysis) is absent from the current manuscript and would strengthen the load-bearing claim. The 95% attention mass and GenEval parity provide indirect support, but we will add a dedicated analysis section in the revision that includes cosine similarity between text and visual hidden states on held-out sets, positional encoding ablation, and out-of-distribution visual-page tests to rule out systematic shifts. Revision: yes.

Referee: [Abstract] GenEval results: The reported 0.85 score is presented as closely matching optimized text-to-image performance, but lacks error bars, variance across runs, or explicit controls for visual-page composition (e.g., sketch vs. annotated scene), making it impossible to assess whether the match is robust or coincidental.

Authors: We concur that error bars, run variance, and explicit controls for visual-page composition are needed to demonstrate robustness. In the revised manuscript we will report standard deviations over multiple random seeds, provide per-composition breakdowns (sketch vs. glyph vs. annotated scene), and detail the exact visual-page generation protocol used for the GenEval evaluation. Revision: yes.
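The seed-variance reporting promised in the second response amounts to a mean and sample standard deviation over repeated runs. A minimal sketch follows; the scores are made-up placeholders, not results from the paper, and `summarize_runs` is a hypothetical helper.

```python
from statistics import mean, stdev
from typing import List, Tuple


def summarize_runs(scores: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of one metric across seeds."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    return mean(scores), stdev(scores)


# Placeholder per-seed scores for illustration only.
placeholder_scores = [0.5, 0.7, 0.6]
avg, sd = summarize_runs(placeholder_scores)
print(f"{avg:.3f} +/- {sd:.3f} over {len(placeholder_scores)} seeds")
```

The same helper could be applied per composition type (sketch, glyph, annotated scene) to produce the breakdown the response describes.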

Circularity Check

0 steps flagged

No significant circularity: V2V-Zero is training-free and exploits pre-existing frozen VLM properties without reducing results to self-defined fits or citations

The paper introduces V2V-Zero as a zero-shot substitution of final-layer VLM hidden states from visual pages for text conditioning tokens, explicitly relying on the pre-trained mapping properties of existing frozen models rather than any derivation, parameter fitting, or self-referential construction. Reported metrics such as 0.85 on GenEval and 32.7/100 on Simple-V2V Bench are obtained through direct empirical evaluation on external benchmarks with no equations or steps that redefine outcomes in terms of the paper's own inputs. No self-citations serve as load-bearing uniqueness theorems, no ansatzes are smuggled, and no predictions reduce by construction to fitted quantities. The approach is self-contained against external model properties and benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on one key domain assumption about modality alignment in frozen VLMs; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: The frozen VLM already maps both text and images into the generator's conditioning space so that final-layer hidden states from visual pages can replace text conditioning without any fine-tuning or architectural changes.
This assumption is invoked to justify the training-free replacement of text conditioning with visual-page hidden states.

pith-pipeline@v0.9.0 · 5671 in / 1354 out tokens · 36667 ms · 2026-05-13T05:56:50.607649+00:00 · methodology
