Evaluating text-to-visual generation with image-to-text models (preprint)

Lin, Z · 2024 · arXiv 2404.01291

Canonical reference. 24 Pith papers cite this work; 83% of classified citations cite it as background.

citation-role summary

background: 5 · method: 1

citation-polarity summary

background: 5 · use method: 1

representative citing papers

RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies

cs.RO · 2026-04-10 · unverdicted · novelty 8.0 · 2 refs

RoboLab is a new simulation benchmark with 120 tasks across visual, procedural, and relational axes that quantifies generalization gaps and perturbation sensitivity in task-generalist robotic policies.

MemoBench: Benchmarking World Modeling in Dynamically Changing Environments

cs.CV · 2026-06-25 · unverdicted · novelty 7.0 · 4 refs

MemoBench is a new diagnostic benchmark with automated and VQA metrics that evaluates memory consistency in video models under disappear-and-reappear scenarios in dynamic environments.

Aggregating LLM-Based Weak Verifiers for Spatial Layout Generation

cs.GR · 2026-06-03 · unverdicted · novelty 7.0

Aggregating many LLM-synthesized weak verifiers via weak learning from sparse labels yields stronger verifiers that improve F1 by up to 7X over direct LLM judges on 3D room and 2D poster tasks and boost generation quality by 66.2%.

Drifting Preference Optimization for One-Step Generative Models

cs.LG · 2026-06-01 · unverdicted · novelty 7.0

DrPO enables online preference optimization for deterministic one-step generators via non-parametric dipole updates from ranked samples plus base-model drift, without reward backpropagation.

OctoT2I: A Self-Evolving Agentic Text-to-Image Router

cs.AI · 2026-06-01 · unverdicted · novelty 7.0

OctoT2I uses a no-supervision PSEL loop to discover model capability frontiers and route T2I tasks, reaching 0.96 GenEval score with 90.3% speedup over Flow-GRPO.

CoMoGen: COntrollable MOtion Dynamics and Interactions with Mask-Guided Video GENeration

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

CoMoGen generates controllable interactive video from mask sequences and images by encoding masks into MMDiT via MaskAdapter and LoRA on motion layers, claiming SOTA motion fidelity.

HumanScore: Benchmarking Human Motions in Generated Videos

cs.CV · 2026-04-22 · unverdicted · novelty 7.0

HumanScore defines six metrics for kinematic plausibility, temporal stability, and biomechanical consistency to benchmark human motions in videos from thirteen state-of-the-art generation models, revealing gaps between visual appeal and physical fidelity.

Tiled Prompts: Overcoming Prompt Misguidance in Image and Video Super-Resolution

cs.CV · 2026-02-03 · unverdicted · novelty 7.0

Tiled Prompts generates tile-specific text prompts for each latent tile in diffusion super-resolution to reduce errors from global prompts and improve perceptual quality.

Determinism of Randomness: Prompt-Residual Seed Shaping for Diffusion Generation

cs.CV · 2025-11-11 · unverdicted · novelty 7.0

A geometric view of semantic anisotropy in diffusion latents motivates a prompt-residual seed-shaping method that improves prompt alignment and visual quality without training.

Beyond Scalar Rewards by Internalizing Reasoning into Score Distributions

cs.CV · 2026-06-08 · unverdicted · novelty 6.0

Z-Reward trains a 27B reasoning teacher VLM on score distributions via GDSO and distills it via RISD into a 9B student, reaching 89.6% and 88.6% human preference accuracy with 41.3% optimization gain over SFT baseline.

MaSC: A Masked Similarity Metric for Evaluating Concept-Driven Generation

cs.CV · 2026-05-21 · unverdicted · novelty 6.0

MaSC is a masked similarity metric that decomposes concept-driven image generation evaluation into subject-specific preservation and background-based prompt following using SigLIP2 embeddings, outperforming global baselines on human correlation and identity benchmarks.

ReasonEdit: Towards Interpretable Image Editing Evaluation via Reinforcement Learning

cs.CV · 2026-05-08 · unverdicted · novelty 6.0

ReasonEdit uses a new CoT dataset and reinforcement learning to produce interpretable, human-aligned evaluations of text-guided image edits.

Building a Precise Video Language with Human-AI Oversight

cs.CV · 2026-04-22 · unverdicted · novelty 6.0

The CHAI framework pairs AI pre-captions with expert human critiques to produce precise video descriptions, enabling open models to outperform closed ones such as Gemini-3.1-Pro and improving fine-grained control in video generation models.

Generative Simulation for Policy Learning in Physical Human-Robot Interaction

cs.RO · 2026-04-09 · unverdicted · novelty 6.0

A text-to-simulation pipeline using LLMs and VLMs generates synthetic pHRI data to train vision-based imitation learning policies that achieve over 80% success in zero-shot sim-to-real transfer on real assistive tasks.

Multimodal Language Models Cannot Spot Spatial Inconsistencies

cs.CV · 2026-04-01 · unverdicted · novelty 6.0

Multimodal LLMs significantly underperform humans at spotting objects that break 3D consistency in multi-view image pairs.

Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation

cs.CV · 2025-05-08 · unverdicted · novelty 6.0

Mogao presents a causal unified model with deep fusion, dual encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interleaved outputs including zero-shot editing.

Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model

cs.CV · 2025-03-10 · unverdicted · novelty 6.0

Seedream 2.0 is a native Chinese-English bilingual diffusion model that integrates a self-developed LLM text encoder, Glyph-Aligned ByT5, and Scaled ROPE to reach claimed state-of-the-art results in prompt following, aesthetics, text rendering, and human preference alignment via RLHF.

Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation

cs.CV · 2024-10-07 · unverdicted · novelty 6.0

PhyGenBench supplies 160 prompts across 27 physical laws and an automated LLM/VLM evaluation pipeline to measure physical commonsense compliance in current text-to-video models.

VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

cs.CV · 2024-09-06 · unverdicted · novelty 6.0

VILA-U unifies visual understanding and generation inside one autoregressive next-token prediction model, removing separate diffusion components while claiming near state-of-the-art results.

VideoPhy: Evaluating Physical Commonsense for Video Generation

cs.CV · 2024-06-05 · conditional · novelty 6.0

VideoPhy benchmark shows state-of-the-art text-to-video models follow physical commonsense and text prompts in only 39.6% of cases for the best model.

Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference

cs.CV · 2026-04-13 · unverdicted · novelty 5.0

Dual-encoder VLMs gain robust compositional generalization by learning localized alignments from frozen patch and token embeddings instead of using global similarity.

Evaluating AI-Generated Images of Cultural Artifacts with Community-Informed Rubrics

cs.CY · 2026-04-02 · unverdicted · novelty 5.0 · 2 refs

Case studies with blind UK residents and people from Kerala and Tamil Nadu demonstrate that community input at the systematization stage produces culturally grounded definitions of appropriateness for text-to-image model outputs.

FireScope: Wildfire Risk Raster Prediction with a Chain-of-Thought Oracle

cs.CV · 2025-11-21 · unverdicted · novelty 5.0 · 2 refs

FireScope trains a VLM on US data to output wildfire risk rasters with reasoning traces and shows improved cross-continental performance on European events compared with prior approaches.

WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

cs.CV · 2025-03-10 · unreviewed · ref 29
