AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.
VSI-Bench: Benchmarking visual spatial intelligence in vision-language models
5 Pith papers cite this work.
representative citing papers
Maps common low-compute research strategies for foundation models onto statistical, internal, external, and construct validity threats via a causal-inference lens.
VLMs achieve 53-97% on rearrangement planning but only 6-45% on occlusion and under 7% on reflections, with failures localized to visual token compression after the vision encoder.
Introduces the IFBench benchmark with 58 new constraints and demonstrates that RLVR training improves the generalization of language models to unseen verifiable output constraints.
Human tests should not be applied to AI to measure traits like intelligence due to calibration, validity, contamination, and prompt sensitivity issues; develop AI-specific evaluation frameworks instead.
citing papers explorer
- Generalizing Verifiable Instruction Following
  Introduces the IFBench benchmark with 58 new constraints and demonstrates that RLVR training improves the generalization of language models to unseen verifiable output constraints.
- Position: Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead
  Human tests should not be applied to AI to measure traits like intelligence due to calibration, validity, contamination, and prompt sensitivity issues; develop AI-specific evaluation frameworks instead.