InfoSynth: Information-Guided Benchmark Synthesis for LLMs

Dawn Song; Ishir Garg; Neel Kolhe; Xuandong Zhao

arxiv: 2601.00575 · v2 · pith:VHS654F5new · submitted 2026-01-02 · 💻 cs.CL

Ishir Garg, Neel Kolhe, Xuandong Zhao, Dawn Song

classification 💻 cs.CL

keywords benchmarks · infosynth · benchmark · llms · problems · capabilities · code · coding

abstract

Large language models (LLMs) have demonstrated significant advancements in reasoning and code generation, but efficiently creating new benchmarks to evaluate these capabilities remains a challenge. Traditional benchmark creation relies on manual human effort, which is expensive and time-consuming. Furthermore, existing benchmarks often contaminate LLM training data, necessitating novel and diverse benchmarks to accurately assess their genuine capabilities. This work introduces InfoSynth, a novel framework for automatically generating and evaluating reasoning benchmarks guided by information-theoretic principles. We propose metrics based on KL-divergence and entropy to quantify benchmark novelty and diversity without relying on costly model evaluations. Building on this framework, we develop an end-to-end pipeline that synthesizes robust Python coding problems from seed datasets using genetic algorithms and iterative code feedback. Our method generates accurate test cases and solutions to new problems 97% of the time, and the synthesized benchmarks consistently exhibit higher difficulty compared to prior works. Moreover, our algorithm provides a method for controlling the novelty/diversity and difficulty of generated problems. InfoSynth offers a scalable, self-verifying pipeline for constructing high-quality, challenging coding benchmarks for LLMs. Project Page: https://ishirgarg.github.io/infosynth_web/
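To make the idea of entropy- and KL-based metrics concrete, here is a minimal Python sketch. It is not the InfoSynth implementation: the abstract does not specify how the distributions are estimated, so this sketch assumes each problem has already been mapped to a single discrete topic label (a hypothetical simplification) and uses additively smoothed categorical distributions. Under that assumption, Shannon entropy of a benchmark's topic distribution serves as a diversity score, and the KL-divergence from the seed distribution to the new distribution serves as a novelty score. All function names, labels, and data below are illustrative.

```python
import math
from collections import Counter


def distribution(labels, support, alpha=1.0):
    """Additively smoothed categorical distribution over a fixed support.

    Smoothing (alpha > 0) keeps every probability positive so that the
    KL-divergence below is always finite. This is an assumption of the
    sketch, not something stated in the abstract.
    """
    counts = Counter(labels)
    total = len(labels) + alpha * len(support)
    return {s: (counts[s] + alpha) / total for s in support}


def entropy(p):
    """Shannon entropy in nats; higher means a more spread-out benchmark."""
    return -sum(v * math.log(v) for v in p.values() if v > 0)


def kl_divergence(p, q):
    """KL(p || q) in nats; higher means p differs more from q."""
    return sum(v * math.log(v / q[k]) for k, v in p.items() if v > 0)


# Hypothetical topic labels for a seed benchmark and a synthesized one.
seed_topics = ["strings", "strings", "arrays", "arrays", "math", "strings"]
new_topics = ["graphs", "dp", "strings", "graphs", "math", "dp"]

support = sorted(set(seed_topics) | set(new_topics))
p_seed = distribution(seed_topics, support)
p_new = distribution(new_topics, support)

diversity_seed = entropy(p_seed)
diversity_new = entropy(p_new)
novelty = kl_divergence(p_new, p_seed)

print(f"diversity (seed): {diversity_seed:.3f}")
print(f"diversity (new):  {diversity_new:.3f}")
print(f"novelty KL(new || seed): {novelty:.3f}")
```

Both scores are computed from the problem sets alone, which matches the abstract's point that novelty and diversity can be quantified without running models on the benchmark. A practical system would likely work with richer representations than hand-assigned topic labels.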

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

FLARE: Agentic Coverage-Guided Fuzzing for LLM-Based Multi-Agent Systems
cs.SE · 2026-04 · unverdicted · novelty 7.0

FLARE extracts specifications from multi-agent LLM code and applies coverage-guided fuzzing to achieve 96.9% inter-agent and 91.1% intra-agent coverage while uncovering 56 new failures across 16 applications.