arxiv: 2601.00575 · v2 · pith:VHS654F5new · submitted 2026-01-02 · 💻 cs.CL

InfoSynth: Information-Guided Benchmark Synthesis for LLMs

classification 💻 cs.CL
keywords benchmarks, infosynth, benchmark, llms, problems, capabilities, code, coding

Large language models (LLMs) have demonstrated significant advancements in reasoning and code generation, but efficiently creating new benchmarks to evaluate these capabilities remains a challenge. Traditional benchmark creation relies on manual human effort, which is expensive and time-consuming. Furthermore, existing benchmarks often contaminate LLM training data, necessitating novel and diverse benchmarks to accurately assess their genuine capabilities. This work introduces InfoSynth, a novel framework for automatically generating and evaluating reasoning benchmarks guided by information-theoretic principles. We propose metrics based on KL-divergence and entropy to quantify benchmark novelty and diversity without relying on costly model evaluations. Building on this framework, we develop an end-to-end pipeline that synthesizes robust Python coding problems from seed datasets using genetic algorithms and iterative code feedback. Our method generates accurate test cases and solutions to new problems 97% of the time, and the synthesized benchmarks consistently exhibit higher difficulty compared to prior works. Moreover, our algorithm provides a method for controlling the novelty/diversity and difficulty of generated problems. InfoSynth offers a scalable, self-verifying pipeline for constructing high-quality, challenging coding benchmarks for LLMs. Project Page: https://ishirgarg.github.io/infosynth_web/
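
The abstract names entropy and KL-divergence as the basis for its diversity and novelty metrics but does not give their exact form on this page. As a rough illustration only, the sketch below computes both quantities over discrete topic labels. The topic labels, the additive smoothing, and the function names are all assumptions made for this example; they are not the paper's method, which may operate on a different representation of the problems.

```python
import math
from collections import Counter


def distribution(labels, support, alpha=1.0):
    """Additively smoothed categorical distribution over `support`.

    `alpha` is a hypothetical smoothing constant; it keeps every
    probability positive so the KL divergence below stays finite.
    """
    counts = Counter(labels)
    total = len(labels) + alpha * len(support)
    return {k: (counts[k] + alpha) / total for k in support}


def entropy(p):
    """Shannon entropy in nats; higher means the mass is more spread out."""
    return -sum(v * math.log(v) for v in p.values() if v > 0)


def kl_divergence(p, q):
    """KL(p || q) in nats; zero only when p and q agree everywhere."""
    return sum(v * math.log(v / q[k]) for k, v in p.items() if v > 0)


# Hypothetical topic labels for a seed benchmark and a synthesized one.
seed_topics = ["strings", "strings", "arrays", "math"]
new_topics = ["graphs", "dp", "graphs", "strings"]

support = sorted(set(seed_topics) | set(new_topics))
p_seed = distribution(seed_topics, support)
p_new = distribution(new_topics, support)

diversity = entropy(p_new)             # spread of the new benchmark
novelty = kl_divergence(p_new, p_seed)  # shift away from the seed set
```

In this toy setting a larger `diversity` means the synthesized problems cover topics more evenly, and a larger `novelty` means their topic distribution has moved further from the seed set. Neither number requires running a model on the problems, which is the property the abstract highlights.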

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FLARE: Agentic Coverage-Guided Fuzzing for LLM-Based Multi-Agent Systems

    cs.SE 2026-04 unverdicted novelty 7.0

    FLARE extracts specifications from multi-agent LLM code and applies coverage-guided fuzzing to achieve 96.9% inter-agent and 91.1% intra-agent coverage while uncovering 56 new failures across 16 applications.