SO-Bench: A Structural Output Evaluation of Multimodal LLMs, 2026

Di Feng, Kaixin Ma, Feng Nan, Haofeng Chen, Bohan Zhai, David Griffiths, Mingfei Gao, Zhe Gan, Eshan Verma, Yinfei Yang, Zhifeng Chen, Afshin Dehghan · 2026 · arXiv 2511.21750

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

VAREX: A Benchmark for Multi-Modal Structured Extraction from Documents

cs.CV · 2026-03-16 · accept · novelty 8.0

The VAREX benchmark shows that, for models under 4B parameters, structured-output compliance is a bigger limitation than extraction ability, and that layout-preserving text input gives the largest accuracy gains over image input.

DACA-GRPO: Denoising-Aware Credit Assignment for Reinforcement Learning in Diffusion Language Models

cs.LG · 2026-05-08 · unverdicted · novelty 5.0

DACA-GRPO adds denoising-aware credit assignment and bias-reduced likelihood estimation to GRPO, delivering consistent gains of up to 36.3 percentage points on math, code, constraint, and schema benchmarks for diffusion LLMs.

citing papers explorer

Showing 2 of 2 citing papers.

VAREX: A Benchmark for Multi-Modal Structured Extraction from Documents · cs.CV · 2026-03-16 · accept · none · ref 2
DACA-GRPO: Denoising-Aware Credit Assignment for Reinforcement Learning in Diffusion Language Models · cs.LG · 2026-05-08 · unverdicted · none · ref 37
