VSI-Bench: Benchmarking visual spatial intelligence in vision-language models

Lixin Yang, Kailin Chen, Songyou Peng, et al. · 2024 · arXiv 2407.07890

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Unsteady Metrics and Benchmarking Cultures of AI Model Builders

cs.AI · 2026-05-13 · accept · novelty 8.0

AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.

Do Vision-Language Models Understand 3D Scenes or Just Catalogue Objects?

cs.CV · 2026-05-19 · accept · novelty 6.0

VLMs achieve 53-97% on volumetric rearrangement planning but only 6-45% on occlusion and under 7% on reflections in a new 3,034-sample benchmark, with white-box analysis localizing the failure to visual-token merger in Qwen3-VL-8B-Thinking.

Generalizing Verifiable Instruction Following

cs.CL · 2025-07-03 · unverdicted · novelty 6.0

Introduces the IFBench benchmark with 58 new constraints and demonstrates that RLVR training improves the generalization of language models to unseen verifiable output constraints.

Position: Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead

cs.LG · 2025-07-30 · unverdicted · novelty 4.0

Human tests should not be applied to AI to measure traits like intelligence, because of calibration, validity, contamination, and prompt-sensitivity issues; the paper argues for developing AI-specific evaluation frameworks instead.

citing papers explorer

Showing 4 of 4 citing papers.

Unsteady Metrics and Benchmarking Cultures of AI Model Builders
cs.AI · 2026-05-13 · accept · none · ref 20

Do Vision-Language Models Understand 3D Scenes or Just Catalogue Objects?
cs.CV · 2026-05-19 · accept · none · ref 34

Generalizing Verifiable Instruction Following
cs.CL · 2025-07-03 · unverdicted · none · ref 4

Position: Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead
cs.LG · 2025-07-30 · unverdicted · none · ref 21
