{"paper":{"title":"Latent Action Control for Reasoning-Guided Unified Image Generation","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Latent Action Control turns inferred reasoning into hidden continuous actions that guide image generation inside unified models.","cross_cats":["cs.AI"],"primary_cat":"cs.CV","authors_text":"Fuxiang Zhai, Jianyu Lai, Lei Zhu, Shuaibo Li, Sixiang Chen, Tengjun Huang, Yingjin Li","submitted_at":"2026-05-16T12:23:20Z","abstract_excerpt":"Unified multimodal models can encode visual understanding and image generation within a shared backbone, yet understanding does not automatically translate into control: models may infer objects, relations, or knowledge cues but fail to instantiate them in the generated image. We propose Latent Action Control (LAC), which makes reasoning actionable by representing it as hidden continuous actions inside a unified generator. Given a prompt, LAC rolls out a role-structured latent trajectory for planning, internal visual drafting, diagnosis, and refinement, and injects these actions into the hidde"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"LAC consistently improves compositional and knowledge-grounded generation across GenEval, WISE, and T2I-CompBench, with the largest gains on spatial relations, attribute binding, and world-knowledge-sensitive prompts.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The learned latent action trajectories are actually consumed by the generator and causally affect the output image, as suggested by ablations and latent interventions but without explicit causal verification in the provided description.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Latent Action Control learns unobserved action trajectories via variational alignment and GRPO 
to inject reasoning into flow-based image generation, yielding gains on compositional benchmarks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Latent Action Control turns inferred reasoning into hidden continuous actions that guide image generation inside unified models.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"f97ca8f9d1dae76cef807b113c78b1ca0656652394f66651e60ae3968c36f952"},"source":{"id":"2605.16961","kind":"arxiv","version":1},"verdict":{"id":"54fd6a32-edec-44e1-a37a-bbbcfa319269","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-19T20:36:35.233712Z","strongest_claim":"LAC consistently improves compositional and knowledge-grounded generation across GenEval, WISE, and T2I-CompBench, with the largest gains on spatial relations, attribute binding, and world-knowledge-sensitive prompts.","one_line_summary":"Latent Action Control learns unobserved action trajectories via variational alignment and GRPO to inject reasoning into flow-based image generation, yielding gains on compositional benchmarks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The learned latent action trajectories are actually consumed by the generator and causally affect the output image, as suggested by ablations and latent interventions but without explicit causal verification in the provided description.","pith_extraction_headline":"Latent Action Control turns inferred reasoning into hidden continuous actions that guide image generation inside unified 
models."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2605.16961/integrity.json","findings":[],"available":true,"detectors_run":[{"name":"doi_title_agreement","ran_at":"2026-05-19T21:01:19.082742Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_compliance","ran_at":"2026-05-19T20:40:51.273445Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"cited_work_retraction","ran_at":"2026-05-19T19:51:58.144521Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"citation_quote_validity","ran_at":"2026-05-19T19:50:15.274380Z","status":"skipped","version":"0.1.0","findings_count":0},{"name":"claim_evidence","ran_at":"2026-05-19T18:41:56.230549Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"ai_meta_artifact","ran_at":"2026-05-19T18:33:26.315667Z","status":"skipped","version":"1.0.0","findings_count":0}],"snapshot_sha256":"817191e7c0ad7f7997ffc59f45031591ce151ab4057808e91ef54dbd49a124c6"},"references":{"count":49,"sample":[{"doi":"","year":2023,"title":"Improving image generation with better captions. Computer Science","work_id":"fb7509a6-ece6-4ea3-b583-1ec884016dc8","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Training Diffusion Models with Reinforcement Learning","work_id":"67684dda-3930-452a-b91a-36cbb8e2e219","ref_index":2,"cited_arxiv_id":"2305.13301","is_internal_anchor":true},{"doi":"","year":2024,"title":"Flux. https://github.com/black-forest-labs/flux","work_id":"ff476bd7-afa4-451d-b45d-54207aeb1545","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2026,"title":"Show, don’t tell: Morphing latent reasoning into image generation","work_id":"b2d90d37-ba5a-42f2-8d9f-abc39922218a","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"PixArt-Σ: Weak-to-strong training of diffusion transformer for 4k 
text-to-image generation","work_id":"44358f20-2fd5-4abc-a87d-85fdff92e52a","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":49,"snapshot_sha256":"2e56224325db55367a376fd1d0c7ed30cba897cd1cc2000cd9daa17e8f15020d","internal_anchors":20},"formal_canon":{"evidence_count":2,"snapshot_sha256":"a85bc225c8f643b61e4a19879c9566344dc3f20235d3be6df237f56020ac198e"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}