
arxiv: 2604.06757 · v2 · submitted 2026-04-08 · 💻 cs.CV

FlowInOne: Unifying Multimodal Generation as Image-in, Image-out Flow Matching

Pith reviewed 2026-05-15 06:58 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal generation, flow matching, visual prompts, unified generation, text-to-image, image editing, vision-centric

The pith

All multimodal generation can be unified as image-in, image-out flow matching using visual prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that multimodal generation does not need language to dictate vision. Instead, all inputs are converted to visual prompts and processed by one flow matching model in an image-in, image-out pipeline. This unifies text-to-image generation, editing, and instruction following while removing cross-modal alignment issues. The authors support this with VisPrompt-5M, a dataset of 5 million visual prompt pairs, and report results that surpass open-source models and competitive commercial systems.

Core claim

By converting textual descriptions, spatial layouts, and editing instructions into visual prompts, FlowInOne enables a single flow matching model to perform text-to-image generation, layout-guided editing, and visual instruction following as a unified image-to-image flow without cross-modal alignment or task branches.
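As a rough illustration of what an image-in, image-out flow could look like, the sketch below assumes the common straight-line (rectified-flow style) path between a source latent `x0` (the encoded visual prompt) and a target latent `x1` (the encoded output image). The abstract does not state the exact path or loss, so the interpolation, the function names, and the toy vectors are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Callable, List

Vec = List[float]

def interpolate(x0: Vec, x1: Vec, t: float) -> Vec:
    """Point on the straight path from source x0 to target x1 at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def velocity_target(x0: Vec, x1: Vec) -> Vec:
    """Regression target for the velocity field on a straight path: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def fm_loss(pred_v: Vec, x0: Vec, x1: Vec) -> float:
    """Mean squared error between a predicted velocity and the target velocity."""
    target = velocity_target(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(pred_v, target)) / len(target)

def euler_sample(x0: Vec, v_fn: Callable[[Vec, float], Vec], steps: int = 10) -> Vec:
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        v = v_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy check with a hypothetical oracle field that always returns x1 - x0.
source, target = [0.0, 1.0, 2.0], [1.0, 1.0, 0.0]
oracle = lambda x, t: velocity_target(source, target)
result = euler_sample(source, oracle, steps=4)
```

In a real system `v_fn` would be a trained network and the vectors would be image latents; the point here is only that the source of the flow is another image rather than noise.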

What carries the argument

The visual prompt representation that encodes all task information visually for the flow matching process.

If this is right

  • Text-to-image generation operates purely on visual inputs without language models.
  • Layout-guided and instruction-based editing share the same model as generation.
  • State-of-the-art results are reported on the unified generation tasks, surpassing both open-source models and competitive commercial systems.
  • Physics-aware dynamics and trajectory prediction are supported within the visual framework.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models trained this way could handle new tasks by simply designing appropriate visual prompts without retraining separate modules.
  • Extending the approach to video generation might require only adding temporal visual prompts.
  • Fully visual reasoning agents become plausible, with both input and output staying in image space.

Load-bearing premise

Visual prompts can capture all the information from text, layouts, and instructions without any loss in fidelity or precision.

What would settle it

A controlled test where the same complex editing instruction is given via text to a standard model and via visual prompt to FlowInOne, and the visual prompt version shows clear degradation in following the instruction details.

Figures

Figures reproduced from arXiv: 2604.06757 by Alex Jinpeng Wang, Jiahao Tang, Junchao Yi, Lijuan Wang, Linjie Li, Qisheng Su, Rui Zhao, Weixian Lei, Xiaofeng Zhu, Zhengyuan Yang.

Figure 1
Figure 1: Comparison of generation paradigms. Left: Traditional T2I only uses the text encoder to condition the Latent Diffusion Model (LDM); Middle: Traditional TI2I requires the joint conditioning of both the text and image encoders; Right: We unify the conditions as visual input and form a simple image-in, image-out framework with a single model. In this work, we take a decisive step toward this goal and introduc…
Figure 2
Figure 2: VisPrompt-5M is a comprehensive dataset that comprises eight distinct data types, including class-to…
Figure 3
Figure 3: Overview of the FlowInOne architecture, a general and simple framework using flow matching for continuous evolution in only one modality. FlowInOne employs a Dual-Path Spatially-Adaptive Modulation to adapt computation by modality. For input image rendering with only text, the structural branch is bypassed to strictly follow semantic evolution. Conversely, for image editing, a spatially-adaptive gated netw…
Figure 4
Figure 4: Visual instruction editing comparison across methods. Token gated cross attention. Next, we investigate modulation mechanisms to unify generation and editing, which exhibit distinct structural dependencies. We compare: (1) Wo CA: self-attention only; (2) W Dual-Path CA: dual-path cross-attention without adaptive gating; and (3) Dual-Path SAM: our proposed Dual-Path Spatially-Adaptive Modulation…
Figure 5
Figure 6
Figure 6: Error types for each subset in VP-Bench.
Figure 7
Figure 7: Qualitative results of the robustness analysis against visual instruction perturbations. We evaluate the model’s generation stability under five distinct conditions: the original unmodified input, variations in text style (e.g., size, color, font, and layout), changes in text length, text blurring, and random text corruption. To evaluate the stability and reliability of our model in real-world scenarios, w…
Figure 8
Figure 8: Generation results across different input image resolutions.
Figure 9
Figure 9: Qualitative results of the visual instruction ablation study. We compare generation outcomes across four input configurations: Blank (original image only), Text only, Visual prompt only, and the complete Text & visual prompt.
Figure 10
Figure 10: Data distribution across the eight distinct subsets within VP-Bench. VP-Bench is a comprehensive benchmark that comprises eight distinct data types. [Chart residue from an adjacent bar chart titled "Top-20 Global Key…", axis "Frequency" (0 to 400). Keywords shown: remove, pose, add, blue, white, replace, color, objects, swap, final, wind, according, apply, turn, change, generate, pointed, arrow, image, object; counts range from 55 to 385.]
Figure 12
Figure 12: Additional qualitative results for text-to-image generation and text-in-image editing tasks. In this section, we provide additional qualitative examples to further demonstrate the versatile generation capabilities of our model across various visual instruction categories. Specifically, we group these supplementary results into three main aspects: (1)…
Figure 13
Figure 13: Additional qualitative results for text bounding box (bbox) editing, doodle-guided editing, and visual marker-based editing.
Figure 14
Figure 14: Additional qualitative results demonstrating physical force understanding and trajectory understanding. D More Dataset Details. D.1 Overview of VisPrompt-5M. To support the training of FlowInOne under a purely vision-centric paradigm, we constructed VisPrompt-5M, a meticulously curated large-scale dataset comprising approximately 5 million pairs of visual instructions. Unlike traditional multimodal dataset…
Figure 15
Figure 15: General data construction process. Doodles Editing. Doodles provide an intuitive interface for users to explicitly specify shape priors and spatial layouts. To construct this subset, we collect 5K high-quality web images to serve as unedited base canvases and predefine ten diverse object categories. We then employ a two-stage synthesis pipeline powered by Qwen Image Edit Wu et al. [2025a]. In the first stage…
Figure 16
Figure 16: Detailed instance distribution of VisPrompt-5M. Left: The Force & Trajectory Generation subset, highlighting strictly curated physics-aware categories (e.g., wind, object poking) designed to impart dynamic kinematic priors. Right: The Text-in-Image Editing subset (derived from GPT-Image-Edit), demonstrating a natural long-tailed distribution of semantic operations. Ranging from high-frequency attribute mo…
Figure 17
Figure 17: Fine-grained structure and stylized distributions of VisPrompt-5M. Left: The Structured Editing subset (derived from PixWizard), highlighting dense spatial translations such as Image-to-Sketch and Face Restoration. This subset trains the model to strictly adhere to geometric and structural conditions. Right: The Text-in-Image Editing subset (derived from PicoBanana), detailing 35 highly specialized, long-…
Figure 18
Figure 18: Semantic diversity and region-aware distributions of VisPrompt-5M. Left: The UnicEdit10M Diverse Edits subset, showcasing a profound long-tailed distribution. It spans high-frequency semantic operations like Color Alteration and Subject Addition to rare edge cases such as Object Extraction. Right: The VisMarker Region-Aware Edits subset, featuring robust, high-volume spatial operations like Object Swap an…
Figure 19
Figure 19: Volume of ancillary generation and structural datasets. This logarithmic lollipop chart illustrates the extreme variance in scale and specialization across our supplementary sources. It highlights the massive foundation of over 2.26 million general Text-to-Image pairs, balanced by broad categorical coverage (ImageNet21K) and highly specialized, precise structural tasks such as Text Bbox Edit and Doodles. …
Figure 20
Figure 20: Evaluation process. • Instruction Fidelity: Measures the semantic precision of the generated result (e.g., matching objects, attributes, and actions) in responding to the core generation instruction. • Content Consistency: For generation tasks (Case A), this evaluates canvas cleanliness. For editing tasks (Case B), it strictly checks for the preservation of unedited background regions and the successful r…
Figure 21
Figure 21: Evaluation prompts used in VLM evaluators.
Figure 22
Figure 22: Unified Meta-Instruction for Fair Evaluation.
Original abstract

Multimodal generation has long been dominated by text-driven pipelines where language dictates vision but cannot reason or create within it. We challenge this paradigm by asking whether all modalities, including textual descriptions, spatial layouts, and editing instructions, can be unified into a single visual representation. We present FlowInOne, a framework that reformulates multimodal generation as a purely visual flow, converting all inputs into visual prompts and enabling a clean image-in, image-out pipeline governed by a single flow matching model. This vision-centric formulation naturally eliminates cross-modal alignment bottlenecks, noise scheduling, and task-specific architectural branches, unifying text-to-image generation, layout-guided editing, and visual instruction following under one coherent paradigm. To support this, we introduce VisPrompt-5M, a large-scale dataset of 5 million visual prompt pairs spanning diverse tasks including physics-aware force dynamics and trajectory prediction, alongside VP-Bench, a rigorously curated benchmark assessing instruction faithfulness, spatial precision, visual realism, and content consistency. Extensive experiments demonstrate that FlowInOne achieves state-of-the-art performance across all unified generation tasks, surpassing both open-source models and competitive commercial systems, establishing a new foundation for fully vision-centric generative modeling where perception and creation coexist within a single continuous visual space. Our code and models are released on https://csu-jpg.github.io/FlowInOne.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces FlowInOne, a framework that unifies multimodal generation (text-to-image, layout-guided editing, visual instruction following) by converting all inputs into visual prompts and training a single flow-matching model for image-in, image-out generation. It releases the VisPrompt-5M dataset (5M visual-prompt pairs) and VP-Bench benchmark, claiming SOTA performance that surpasses open-source and commercial systems while eliminating cross-modal alignment and task-specific branches.

Significance. If the central claims hold, the work offers a coherent vision-centric alternative to text-dominated pipelines, with the new large-scale dataset and benchmark providing reusable resources for the community. Public release of code and models is a clear strength that supports reproducibility.

major comments (1)
  1. [§3.2, Figure 3] §3.2 and Figure 3: the conversion pipeline (caption rendering, layout rasterization, instruction overlay) is presented as lossless, yet no quantitative measure of information preservation (mutual information, attribute reconstruction error, or held-out spatial/negation accuracy) is reported. This directly underpins the claim that the single visual space incurs no alignment loss or task degradation.
minor comments (1)
  1. [Abstract] Abstract: the SOTA claim is stated without any numerical metrics, baseline names, or ablation summary; a one-sentence quantitative highlight would improve readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment point by point below and commit to incorporating the suggested improvements in the revised version.

Point-by-point responses
  1. Referee: [§3.2, Figure 3] §3.2 and Figure 3: the conversion pipeline (caption rendering, layout rasterization, instruction overlay) is presented as lossless, yet no quantitative measure of information preservation (mutual information, attribute reconstruction error, or held-out spatial/negation accuracy) is reported. This directly underpins the claim that the single visual space incurs no alignment loss or task degradation.

    Authors: We appreciate the referee's observation that explicit quantitative validation of the conversion pipeline would strengthen our claims. Although the pipeline was engineered to preserve information via deterministic rendering steps (e.g., exact font placement for captions and precise bounding-box rasterization for layouts), we acknowledge that no direct metrics such as reconstruction error or held-out task accuracy were reported. In the revised manuscript we will add a new subsection in §3.2 with quantitative results: (i) attribute reconstruction accuracy on a held-out set of 10k visual prompts, (ii) spatial precision (IoU and coordinate error) for layout elements, and (iii) negation and instruction faithfulness scores measured via automated parsing of generated outputs against ground-truth overlays. These additions will directly quantify any residual information loss. revision: yes
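The spatial-precision measurement promised in the rebuttal (IoU and coordinate error for layout elements) could be computed along the following lines. This is a minimal sketch under assumptions: boxes are axis-aligned `(x_min, y_min, x_max, y_max)` tuples, and "coordinate error" is taken to mean the Euclidean distance between box centres, which the rebuttal does not specify.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

def area(b: Box) -> float:
    """Area of an axis-aligned box; degenerate boxes have zero area."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def center_error(a: Box, b: Box) -> float:
    """Euclidean distance between the two box centres (assumed error definition)."""
    ax, ay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    bx, by = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    return math.hypot(ax - bx, ay - by)

# Hypothetical ground-truth layout box and a box recovered from a rendered prompt.
truth = (0.0, 0.0, 2.0, 2.0)
recovered = (1.0, 1.0, 3.0, 3.0)
score = iou(truth, recovered)          # overlap 1, union 7
offset = center_error(truth, recovered)
```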

Circularity Check

0 steps flagged

No circularity: unification achieved via new dataset and empirical training, not self-referential equations

full rationale

The paper's core claim is that multimodal inputs can be converted to visual prompts for a single flow-matching model, eliminating cross-modal issues. This is supported by the introduction of VisPrompt-5M dataset and VP-Bench benchmark, with released code. No equations, derivations, or self-citations are shown that reduce the central result to fitted inputs or prior author work by construction. The approach is self-contained as an empirical unification rather than a mathematical reduction to its own assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; assessment limited to high-level claims.

pith-pipeline@v0.9.0 · 5576 in / 1056 out tokens · 56100 ms · 2026-05-15T06:58:54.691587+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages

  1. [1]

    missing glyph

    Robust Font Selection and Glyph Validation. To ensure the generative robustness of the text rendering, we implement a dynamic font-picking mechanism. Given an input text sequence, the engine first validates character support by parsing the TrueType font’s cmap tables. To prevent the rendering of corrupted or “missing glyph” boxes (often caused by incomplete...
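    The glyph-validation idea can be sketched as a coverage check against each font's set of supported code points. The code below is an illustration under assumptions: each font is represented by a plain set of code points (in practice such a set could be read from a font's cmap table, for example with fontTools' `TTFont(path).getBestCmap()`), the font names are hypothetical, and the first fully supporting font is chosen, whereas the excerpt only says the selection is "dynamic".

```python
from typing import Dict, Optional, Set

def supports(text: str, codepoints: Set[int]) -> bool:
    """True if every non-whitespace character of text has a glyph in the font."""
    return all(ord(ch) in codepoints for ch in text if not ch.isspace())

def pick_font(text: str, fonts: Dict[str, Set[int]]) -> Optional[str]:
    """Return the first font (in insertion order) covering the whole text, else None."""
    for name, codepoints in fonts.items():
        if supports(text, codepoints):
            return name
    return None  # caller can skip the sample instead of rendering missing-glyph boxes

# Hypothetical font inventory: printable ASCII, plus one CJK glyph in the second font.
ascii_set = set(range(0x20, 0x7F))
fonts = {
    "LatinOnly": ascii_set,
    "LatinPlusCJK": ascii_set | {ord("猫")},
}
latin_choice = pick_font("a cat", fonts)
mixed_choice = pick_font("猫 cat", fonts)
unsupported = pick_font("🙂", fonts)
```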

  2. [2]

    We utilize a custom tokenization algorithm tailored for visual layouts

    Semantic-Aware Tokenization. Handling multi-lingual instructions requires precise line-breaking strategies. We utilize a custom tokenization algorithm tailored for visual layouts. Characters are isolated as individual tokens to allow flexible word wrapping, whereas Latin alphanumeric sequences and symbols are grouped as cohesive whole-word tokens. This str...

  3. [3]

    Given a target bounding box with dimensions W×H, our goal is to find the maximum font size s_max that accommodates the tokenized sequence T without overflow

    Adaptive Bounding-Box Layout Algorithm. To automatically determine the optimal typographic layout within a constrained visual canvas, we model the layout generation as a constrained optimization problem. Given a target bounding box with dimensions W×H, our goal is to find the maximum font size s_max that accommodates the tokenized sequence T without overfl...
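    The search for the largest font size that fits can be sketched as a binary search over integer sizes. This is not the paper's Algorithm 1: the text metric below is a hypothetical fixed-width model (each character is `char_w * size` wide, each line `line_h * size` tall) with greedy wrapping, chosen so that "fits" is monotone in the font size, which is what binary search requires. The size bounds are also assumed values.

```python
from typing import List, Optional

def fits(tokens: List[str], size: int, box_w: float, box_h: float,
         char_w: float = 0.6, line_h: float = 1.2) -> bool:
    """Greedy-wrap tokens at the given font size and check the block fits the box."""
    lines, current = 1, 0.0
    for tok in tokens:
        width = len(tok) * size * char_w
        if width > box_w:
            return False              # a single token is wider than the box
        if current + width > box_w:
            lines += 1                # wrap: start a new line with this token
            current = width
        else:
            current += width
    return lines * size * line_h <= box_h

def max_font_size(tokens: List[str], box_w: float, box_h: float,
                  s_min: int = 8, s_max: int = 128) -> Optional[int]:
    """Largest integer size in [s_min, s_max] that fits, or None if even s_min overflows."""
    if not fits(tokens, s_min, box_w, box_h):
        return None
    lo, hi = s_min, s_max             # invariant: size lo fits
    while lo < hi:
        mid = (lo + hi + 1) // 2      # upper midpoint so the loop always makes progress
        if fits(tokens, mid, box_w, box_h):
            lo = mid
        else:
            hi = mid - 1
    return lo

tokens = ["make", " ", "it", " ", "red"]
best = max_font_size(tokens, box_w=110, box_h=24)
too_small = max_font_size(tokens, box_w=5, box_h=5)
```

    With 121 candidate sizes the loop runs at most seven times, compared with up to 120 checks for a step-down search.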

  4. [4]

    Context-Aware Stylization and Alpha Compositing. To guarantee text legibility regardless of the underlying visual content, we integrate a context-aware color contrast mechanism. Before rendering, the engine calculates the perceptual luminance $L$ of the underlying image region bounded by the text block: $L = 0.299\mu_R + 0.587\mu_G + 0.114\mu_B$ (11), where $\mu_R$, $\mu_G$, $\mu_B$ den...
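    Equation (11) translates directly into code once the region's mean channel values are available. In the sketch below the luminance formula is taken from the excerpt, while the rule for choosing a text colour (black on bright regions, white on dark ones, split at a threshold of 128 on a 0 to 255 scale) is an assumption, since the excerpt is cut off before describing it.

```python
from typing import Sequence, Tuple

RGB = Tuple[int, int, int]

def region_luminance(pixels: Sequence[RGB]) -> float:
    """Luminance L = 0.299*mu_R + 0.587*mu_G + 0.114*mu_B over the region's mean RGB."""
    if not pixels:
        raise ValueError("region must contain at least one pixel")
    n = len(pixels)
    mu_r = sum(p[0] for p in pixels) / n
    mu_g = sum(p[1] for p in pixels) / n
    mu_b = sum(p[2] for p in pixels) / n
    return 0.299 * mu_r + 0.587 * mu_g + 0.114 * mu_b

def text_color(pixels: Sequence[RGB], threshold: float = 128.0) -> RGB:
    """Assumed contrast rule: dark text on bright regions, light text on dark regions."""
    return (0, 0, 0) if region_luminance(pixels) > threshold else (255, 255, 255)

bright_region = [(255, 255, 255), (240, 240, 240)]
dark_region = [(0, 0, 0), (10, 10, 10)]
on_bright = text_color(bright_region)
on_dark = text_color(dark_region)
```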

  5. [5]

    By formulating the layout process as a binary search optimization, we reduce the complexity to O(log(|Smax − Smin|))

    Analysis of the Layout Algorithm. The proposed adaptive layout strategy (Algorithm 1) provides several critical advantages for large-scale data synthesis: • Computational Efficiency: Traditional text rendering engines often rely on a linear step-down approach (iteratively decreasing font size until the text fits), yielding a time complexity of O(Smax − Smin...

  6. [6]

    Let Tsrc denote the original instruction text and Tocr denote the text extracted from the rendered canvas Iv

    OCR-based Legibility Verification. To ensure the synthesized text is completely legible and free from truncation or rendering artifacts (e.g., overlapping bounding boxes or corrupted glyphs), we deploy an Optical Character Recognition (OCR) engine as the first filter. Let Tsrc denote the original instruction text and Tocr denote the text extracted from the...
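    One way to turn the comparison of Tsrc and Tocr into a filter is a string-similarity threshold. The excerpt is truncated before it names the metric, so the sketch below substitutes Python's `difflib.SequenceMatcher` ratio on case- and whitespace-normalized text, with a hypothetical acceptance threshold of 0.9; the OCR engine itself is out of scope and its output is passed in as a string.

```python
from difflib import SequenceMatcher

def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so layout-only differences are ignored."""
    return " ".join(text.lower().split())

def ocr_similarity(t_src: str, t_ocr: str) -> float:
    """Similarity in [0, 1] between the source instruction and the OCR read-back."""
    return SequenceMatcher(None, normalize(t_src), normalize(t_ocr)).ratio()

def passes_ocr_check(t_src: str, t_ocr: str, threshold: float = 0.9) -> bool:
    """Keep a rendered sample only if the OCR text is close enough to the source."""
    return ocr_similarity(t_src, t_ocr) >= threshold

clean = passes_ocr_check("Make the sky blue", "make the  sky blue")
truncated = passes_ocr_check("Make the sky blue", "Make the")
```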

  7. [7]

    Does the main subject in the image perfectly align with the embedded text prompt: [PROMPT]?

    Task-Specific VLM Quality Inspection. Images that pass the OCR check are subsequently evaluated by an advanced Multimodal Large Language Model (MLLM, e.g., Qwen3-VL). To handle the diverse nature of our generative tasks, we design task-specific prompts. The VLM acts as a judge, outputting a boolean decision based on customized criteria: • Fundamental Gener...

  8. [8]

    We extract CLIP image embeddings Eclip(I) for all candidates within a specific sub-task

    Diversity-Oriented Deduplication. To maximize the informational entropy of the dataset and prevent mode collapse during training, we apply a diversity-oriented filtering mechanism. We extract CLIP image embeddings Eclip(I) for all candidates within a specific sub-task. A candidate Ii is retained only if its cosine similarity with all previously accepted im...
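    The retention rule reads like a greedy filter over embeddings. The sketch below assumes that a candidate is kept when its cosine similarity to every already accepted item is below a threshold (the excerpt is cut off before stating the direction or the value), uses a hypothetical threshold of 0.95, and stands in plain float vectors for the CLIP embeddings.

```python
import math
from typing import List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; zero vectors are treated as similarity 0."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def greedy_dedup(embeddings: Sequence[Sequence[float]], threshold: float = 0.95) -> List[int]:
    """Return indices of candidates kept, scanning in order and comparing to kept items only."""
    kept: List[int] = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Toy embeddings: the second is a near-duplicate of the first, the third is orthogonal.
candidates = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]]
kept_indices = greedy_dedup(candidates)
```

    The result depends on scan order, and the comparison cost grows with the number of kept items, so a production pipeline would likely use an approximate nearest-neighbour index instead of this quadratic loop.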

  9. [9]

    Broad Semantic and Stylistic Coverage (Long-Tailed Nature). Our text-in-image editing subsets, derived heavily from UnicEdit, GPT-Image-Edit, and PicoBanana, encompass a massive spectrum of user intents. UnicEdit and GPT-Image-Edit contribute the bulk of the volume, dominated by high-frequency operations such as Color Alteration (∼203K), Attribute Modification...

  10. [10]

    Our structured editing subsets (PixWizard and VisMarker) serve this exact purpose

    Spatial Reasoning and Region-Aware Constraints. While text instructions govern semantic changes, visual and geometric inputs dictate spatial precision. Our structured editing subsets (PixWizard and VisMarker) serve this exact purpose. The VisMarker subset provides highly balanced, region-aware supervision across 8 core categories (e.g., Object Swap, Removal...

  11. [11]

    While smaller in scale compared to semantic edits (comprising specifically curated classes like balls_poke at ∼11K and wind at ∼9K), this subset is of exceptionally high fidelity

    Physics-Aware and Kinematic Dynamics. A uniquely challenging component of VisPrompt-5M is the Force & Trajectory generation subset. While smaller in scale compared to semantic edits (comprising specifically curated classes like balls_poke at ∼11K and wind at ∼9K), this subset is of exceptionally high fidelity. It forces the image-to-image paradigm to step beyond...

  12. [13]

    Spatial Precision The semantic precision of the generated result in responding to the generation instruction. Object that matches instruction description Checkpoints Objects Attributes Actions …… Case A Canvas cleanliness Case B Background preservation Checkpoints Preservation of the background and non- edited areas Removal of instructions/markers …… The ...

  13. [15]

    analysis

    Spatial Precision The semantic precision of the generated result in responding to the generation instruction. Object that matches instruction description Checkpoints Objects Attributes Actions …… Case A Canvas cleanliness Case B Background preservation Checkpoints Preservation of the background and non- edited areas Removal of instructions/markers …… The ...

  14. [16]

    Instruction Fidelity

  15. [17]

    analysis

    Spatial Precision The semantic precision of the generated result in responding to the generation instruction. Object that matches instruction description Checkpoints Objects Attributes Actions …… Case A Canvas cleanliness Case B Background preservation Checkpoints Preservation of the background and non- edited areas Removal of instructions/markers …… The ...

  16. [18]

    Instruction Fidelity • General Objective: The semantic precision of the generated result in responding to the generation instruction. • Checkpoints: Do the core objects, attributes (color, material), and actions described in the instruction accurately appear in the generated image? • Critical Judgment: If the generated content is irrelevant to the text descr...

  17. [19]

    • Checkpoints: Is the generated subject clear? Is the background clean or logical? (i.e., it should not produce messy, hallucinated objects)

    Content Consistency (Non-Edited Areas) • Case A (Text-Only Source Image): • Objective: Canvas cleanliness. • Checkpoints: Is the generated subject clear? Is the background clean or logical? (i.e., it should not produce messy, hallucinated objects). As long as the generated image is not chaotic, this metric can receive a high score. • Case B (Annotated Real-...

  18. [20]

    Visual Realism • General Objective: The naturalness of the image and the effective suppression of artifacts. • Checkpoints: • Are there conspicuous artifacts, blurriness, jagged edges, or anatomical distortions (e.g., twisted limbs)? • For Case B, is the blending between the edited region and the original background natural?

  19. [21]

    spillover

    Spatial Control Precision • Case A (Text-Only Source Image): • Objective: Compositional rationality. • Checkpoints: If there are no explicit visual markers, assign a default score of 5 (provided the object is complete and within the frame). • Case B (Annotated Real-World Image): • Objective: Marker alignment. • Checkpoints: Is the generated content strictly...