The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models

· 2026 · cs.CL · arXiv 2604.25359

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

open full Pith review browse 2 citing papers arXiv PDF

abstract

Large Language Models are increasingly being deployed to extract structured data from unstructured and semi-structured sources: parsing invoices, medical records, and converting PDF documents to database entries. Yet existing benchmarks for structured output generation either focus on schema compliance alone, or evaluate value correctness within a single source domain. We introduce SOB (The Structured Output Benchmark), a multi-source benchmark spanning three source modalities: native text, images, and audio conversations. All models receive a text-normalized representation of their context regardless of source modality; this deliberate design isolates structured-output capability from raw vision or speech-processing quality, ensuring a fair, source-agnostic comparison. Our benchmark comprises 5,000 text evaluation records derived from multi-hop QA drawn from a 25,091-record full corpus, 209 image records from OCR-processed PDFs across seven document types including multi-column layouts, dense tables, scanned historical documents, small-print text, and mathematical typesetting, and 115 audio records from the AMI corpus. Each record pairs a natural-language question with a JSON schema that the model must follow and a ground-truth answer verified against the source context. We evaluate 21 frontier and open-weight models across three source domains and seven metrics. Our results reveal a consistent pattern: models achieve near-perfect schema compliance, yet the best Value Accuracy, measured by exact leaf-value match, reaches only 83.0% on text, 67.2% on images, and 23.7% on audio, where longer context makes extraction substantially harder. We release the dataset, evaluation pipeline, and all related code.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Reddit2Deezer: A Scalable Dataset for Real-World Grounded Conversational Music Recommendation

cs.IR · 2026-05-09 · unverdicted · novelty 7.0

Reddit2Deezer supplies 190k authentic Reddit dialogues grounded in Deezer music entities for scalable conversational music recommendation research.

Source-Grounded Data Generation for Text-to-JSON Learning

cs.CL · 2026-06-18 · unverdicted · novelty 5.0

STAGE generates source-grounded text-to-JSON training data via spreadsheet validation, raising Qwen3-4B exact match from 31.37% to 74.27% on the 851-example STAGE-Eval benchmark.

citing papers explorer

Showing 1 of 1 citing paper after filters.

Reddit2Deezer: A Scalable Dataset for Real-World Grounded Conversational Music Recommendation cs.IR · 2026-05-09 · unverdicted · none · ref 27 · internal anchor
Reddit2Deezer supplies 190k authentic Reddit dialogues grounded in Deezer music entities for scalable conversational music recommendation research.

The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer