ImagenWorld: Stress-testing image generation models with explainable human evaluation on open-ended real-world tasks
4 Pith papers cite this work. Polarity classification is still indexing.
citation summary (role and polarity)
years: 2026 (4)
verdicts: UNVERDICTED (4)
roles: background (1)
polarities: background (1)
representative citing papers
TASTE supplies designer multi-dimensional rankings of T2I graphic outputs with statistical validation showing moderate agreement and benchmarks where a TASTE-trained MLP outperforms off-the-shelf VLMs.
RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.
A reproducible VLM-judge protocol with position-bias correction is validated as superior to CLIP similarity and geometry-validity proxies for assessing single-image 3D mesh quality.
citing papers explorer
- Can Image Models Imagine Time? ImageTime: A Novel Benchmark for Probing Visual World Modeling Through Spatiotemporal Consistency
  ImageTime is a benchmark that probes image generation models' visual world modeling by requiring coherent four-state sequences in single images, scored via VLM judge.
- TASTE: A Designer-Annotated Multi-Dimensional Preference Dataset for AI-Generated Graphic Design
  TASTE supplies designer multi-dimensional rankings of T2I graphic outputs with statistical validation showing moderate agreement and benchmarks where a TASTE-trained MLP outperforms off-the-shelf VLMs.
- RewardHarness: Self-Evolving Agentic Post-Training
  RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.
- A Cross-Model VLM-Judge Protocol for Single-Image 3D Mesh Quality (and Why Cheap Proxies Fall Short)
  A reproducible VLM-judge protocol with position-bias correction is validated as superior to CLIP similarity and geometry-validity proxies for assessing single-image 3D mesh quality.