PhyBench: A physical commonsense benchmark for evaluating text-to-image models

Fanqing Meng, Wenqi Shao, Li Ray Luo, Yahong Wang, Yiran Chen, Quanfeng Lu, Yue Yang, Tianshuo Yang, Kaipeng Zhang, Yu Qiao, Ping Luo · 2025 · arXiv 2406.11802

10 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

PhysInOne: Visual Physics Learning and Reasoning in One Suite

cs.CV · 2026-04-10 · unverdicted · novelty 8.0

PhysInOne is a new dataset of 2 million videos across 153,810 dynamic 3D scenes covering 71 physical phenomena, shown to improve AI performance on physics-aware video generation, prediction, property estimation, and motion transfer.

VideoASMR-Bench: Can AI-Generated ASMR Videos Fool VLMs and Humans?

cs.CV · 2025-12-15 · unverdicted · novelty 7.0

VideoASMR-Bench shows state-of-the-art VLMs fail to reliably detect AI-generated ASMR videos from real ones, though humans can still identify the fakes relatively easily.

Reasoning to Edit: Hypothetical Instruction-Based Image Editing with Visual Reasoning

cs.CV · 2025-07-02 · unverdicted · novelty 7.0

Presents Reason50K dataset and ReasonBrain framework for hypothetical instruction-based image editing that requires physical, temporal, causal, and story reasoning.

T2I-FactualBench: Benchmarking the Factuality of Text-to-Image Models with Knowledge-Intensive Concepts

cs.CV · 2024-12-05 · unverdicted · novelty 7.0

T2I-FactualBench is a new three-tier benchmark for factuality of knowledge-intensive concepts in T2I models, using multi-round VQA evaluation to show SOTA models need improvement.

Qwen-Image-Agent: Bridging the Context Gap in Real-World Image Generation

cs.CV · 2026-06-25 · unverdicted · novelty 6.0

Qwen-Image-Agent is a unified agent framework that progressively builds sufficient generation context for T2I models via Context-Aware Planning and Context Grounding, achieving SOTA on IA-Bench, Mindbench, and WISE-Verified.

Qwen-Image-Bench: From Generation to Creation in Text-to-Image Evaluation

cs.CV · 2026-05-27 · unverdicted · novelty 6.0

Qwen-Image-Bench introduces a hierarchical creator-centric benchmark with 1000 prompts, 23 sub-capabilities, and a Q-Judger model that scores images on 56 verifiable facets to distinguish T2I models on fidelity and creativity.

Training-Trajectory-Aware Token Selection

cs.CL · 2026-01-15 · unverdicted · novelty 6.0

Training-Trajectory-Aware Token Selection (T3S) reconstructs the token-level training objective to overcome a performance bottleneck in continual distillation of reasoning capabilities from large to small language models.

Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

cs.CV · 2026-04-30 · unverdicted · novelty 5.0

Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemphasizing perceptual quality.

FireScope: Wildfire Risk Raster Prediction with a Chain-of-Thought Oracle

cs.CV · 2025-11-21 · unverdicted · novelty 5.0 · 2 refs

FireScope trains a VLM on US data to output wildfire risk rasters with reasoning traces and shows improved cross-continental performance on European events compared with prior approaches.

WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

cs.CV · 2025-03-10

citing papers explorer

Showing 10 of 10 citing papers.

PhysInOne: Visual Physics Learning and Reasoning in One Suite cs.CV · 2026-04-10 · unverdicted · none · ref 61
VideoASMR-Bench: Can AI-Generated ASMR Videos Fool VLMs and Humans? cs.CV · 2025-12-15 · unverdicted · none · ref 33
Reasoning to Edit: Hypothetical Instruction-Based Image Editing with Visual Reasoning cs.CV · 2025-07-02 · unverdicted · none · ref 12
T2I-FactualBench: Benchmarking the Factuality of Text-to-Image Models with Knowledge-Intensive Concepts cs.CV · 2024-12-05 · unverdicted · none · ref 34
Qwen-Image-Agent: Bridging the Context Gap in Real-World Image Generation cs.CV · 2026-06-25 · unverdicted · none · ref 9
Qwen-Image-Bench: From Generation to Creation in Text-to-Image Evaluation cs.CV · 2026-05-27 · unverdicted · none · ref 9
Training-Trajectory-Aware Token Selection cs.CL · 2026-01-15 · unverdicted · none · ref 14
Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling cs.CV · 2026-04-30 · unverdicted · none · ref 53
FireScope: Wildfire Risk Raster Prediction with a Chain-of-Thought Oracle cs.CV · 2025-11-21 · unverdicted · none · ref 38 · 2 links
WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation cs.CV · 2025-03-10 · unreviewed · ref 31
