Title resolution pending

Yiming Huang, Zhenghao Lin, Xiao Liu, Yeyun Gong, Shuai Lu, Fangyu Lei, Yaobo Liang, Yelong Shen, Chen Lin, Nan Duan, Weizhu Chen · 2023 · arXiv 2312.02143

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

read on arXiv browse 5 citing papers

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

method 1

citation-polarity summary

use method 1

representative citing papers

AutoBaxBuilder: Bootstrapping Code Security Benchmarking

cs.CR · 2025-12-24 · unverdicted · novelty 7.0

AutoBaxBuilder automates creation of code security benchmarks for LLMs via LLM-driven test and exploit generation with reliability checks, yielding AutoBaxBench and cutting human effort by a factor of 12.

EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving

cs.AI · 2025-09-22 · unverdicted · novelty 7.0

EngiBench shows LLMs accuracy drops with task complexity, degrades under perturbations, and stays below human performance on open-ended engineering problems.

League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models

cs.AI · 2025-07-30 · unverdicted · novelty 6.0

League of LLMs organizes LLMs into a self-governed mutual evaluation league using dynamic, transparent, objective, and professional criteria to distinguish model capabilities with 70.7% top-k ranking stability.

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

cs.SE · 2024-03-12 · unverdicted · novelty 6.0

LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

Benchmark Data Contamination of Large Language Models: A Survey

cs.CL · 2024-06-06 · unverdicted · novelty 3.0

A survey reviewing benchmark data contamination in LLMs, its impact on evaluation, and alternative assessment approaches.

citing papers explorer

Showing 5 of 5 citing papers.

AutoBaxBuilder: Bootstrapping Code Security Benchmarking cs.CR · 2025-12-24 · unverdicted · none · ref 1
AutoBaxBuilder automates creation of code security benchmarks for LLMs via LLM-driven test and exploit generation with reliability checks, yielding AutoBaxBench and cutting human effort by a factor of 12.
EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving cs.AI · 2025-09-22 · unverdicted · none · ref 22
EngiBench shows LLMs accuracy drops with task complexity, degrades under perturbations, and stays below human performance on open-ended engineering problems.
League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models cs.AI · 2025-07-30 · unverdicted · none · ref 22
League of LLMs organizes LLMs into a self-governed mutual evaluation league using dynamic, transparent, objective, and professional criteria to distinguish model capabilities with 70.7% top-k ranking stability.
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code cs.SE · 2024-03-12 · unverdicted · none · ref 259
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
Benchmark Data Contamination of Large Language Models: A Survey cs.CL · 2024-06-06 · unverdicted · none · ref 65
A survey reviewing benchmark data contamination in LLMs, its impact on evaluation, and alternative assessment approaches.

Title resolution pending

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer