SpatialEdit provides a benchmark, large synthetic dataset, and baseline model for precise object and camera spatial manipulations in images, with the model beating priors on spatial editing.
Mindomni: Unleashing reasoning generation in vision language models with rgpo
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 3verdicts
UNVERDICTED 3representative citing papers
RCoT-Seg uses GRPO-reinforced keyframe selection from a CoT-start corpus followed by SAM2 mask propagation to improve video object segmentation under implicit temporal instructions over prior MLLM sampling methods.
JoyAI-Image unifies visual understanding and generation via an MLLM-MMDiT architecture with spatial training signals to reach competitive benchmark performance and stronger spatial intelligence.
citing papers explorer
-
SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing
SpatialEdit provides a benchmark, large synthetic dataset, and baseline model for precise object and camera spatial manipulations in images, with the model beating priors on spatial editing.
-
RCoT-Seg: Reinforced Chain-of-Thought for Video Reasoning and Segmentation
RCoT-Seg uses GRPO-reinforced keyframe selection from a CoT-start corpus followed by SAM2 mask propagation to improve video object segmentation under implicit temporal instructions over prior MLLM sampling methods.
-
JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
JoyAI-Image unifies visual understanding and generation via an MLLM-MMDiT architecture with spatial training signals to reach competitive benchmark performance and stronger spatial intelligence.