VSI-Bench: Benchmarking visual spatial intelligence in vision-language models

Ricardo Dominguez-Olmedo, Florian E Dorner, Moritz Hardt · 2024 · arXiv 2407.07890

5 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Unsteady Metrics and Benchmarking Cultures of AI Model Builders

cs.AI · 2026-05-13 · accept · novelty 8.0

AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.

Validity Threats for Foundation Model Research

cs.LG · 2026-06-03 · accept · novelty 6.0

Maps common low-compute research strategies for foundation models onto statistical, internal, external, and construct validity threats via a causal-inference lens.

Do Vision-Language Models Understand 3D Scenes or Just Catalogue Objects?

cs.CV · 2026-05-19 · unverdicted · novelty 6.0 · 2 refs

VLMs achieve 53-97% on rearrangement planning but only 6-45% on occlusion and under 7% on reflections, with failures localized to visual token compression after the vision encoder.

Generalizing Verifiable Instruction Following

cs.CL · 2025-07-03 · unverdicted · novelty 6.0

Introduces IFBench benchmark with 58 new constraints and demonstrates RLVR training improves generalization of language models to unseen verifiable output constraints.

Position: Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead

cs.LG · 2025-07-30 · unverdicted · novelty 4.0

Human tests should not be applied to AI to measure traits like intelligence due to calibration, validity, contamination, and prompt sensitivity issues; develop AI-specific evaluation frameworks instead.

citing papers explorer

Showing 2 of 2 citing papers after filters.

Generalizing Verifiable Instruction Following

cs.CL · 2025-07-03 · unverdicted · none · ref 4

Introduces IFBench benchmark with 58 new constraints and demonstrates RLVR training improves generalization of language models to unseen verifiable output constraints.

Position: Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead

cs.LG · 2025-07-30 · unverdicted · none · ref 21

Human tests should not be applied to AI to measure traits like intelligence due to calibration, validity, contamination, and prompt sensitivity issues; develop AI-specific evaluation frameworks instead.