hub Mixed citations

Guanhua Zhang and Moritz Hardt

Yang, S · 2023 · arXiv 2311.04850

Mixed citation behavior. Most common role is background (67%).

25 Pith papers citing it

Background 67% of classified citations

citation-role summary

background 5 · other 1

citation-polarity summary

background 4 · unclear 2

representative citing papers

TSFMAudit: Data Contamination Auditing in Forecasting Time Series Foundation Models

cs.LG · 2026-05-24 · unverdicted · novelty 8.0

TSFMAudit detects pretraining contamination in time series foundation models via probe adaptation dynamics (faster loss drop, smaller backbone shift), tested on 6 models and 187 datasets against 10 LLM-derived baselines.

How Hard is it to Rig a Benchmark? A Social Choice Analysis of Leaderboard Robustness

cs.LG · 2026-05-22 · unverdicted · novelty 7.0

Benchmark-specific training maps to shift bribery and is NP-hard under Borda and mean win rate; mean win rate has the highest instance-level robustness (median 22 tasks on BBH) among tested aggregation rules.

Provable Joint Decontamination for Benchmarking Multiple Large Language Models

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

JECS aggregates per-model conformal p-values via their maximum and reconstructs a conservative envelope of the max-p null distribution to select benchmarks with global contamination rate control.

Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

cs.AI · 2026-05-12 · conditional · novelty 7.0

BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.

Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.

Agent Island: A Saturation- and Contamination-Resistant Benchmark from Multiagent Games

cs.AI · 2026-05-05 · unverdicted · novelty 7.0

Agent Island is a new multiagent game environment that functions as a dynamic benchmark resistant to saturation and contamination, with Bayesian ranking showing OpenAI GPT-5.5 as the strongest performer among 49 models across 999 games.

When Prompt Under-Specification Improves Code Correctness: An Exploratory Study of Prompt Wording and Structure Effects on LLM-Based Code Generation

cs.SE · 2026-04-27 · unverdicted · novelty 7.0

Structurally rich task descriptions make LLMs robust to prompt under-specification, and under-specification can enhance code correctness by disrupting misleading lexical or structural cues.

DRBENCHER: Can Your Agent Identify the Entity, Retrieve Its Properties and Do the Math?

cs.AI · 2026-04-10 · unverdicted · novelty 7.0

DRBENCHER generates multi-hop questions across biochemistry, finance, geophysics, security, and history that test interleaved browsing and computation, where the strongest models reach only 20% accuracy and human validation finds 76% validity.

LogitTrace: Detecting Benchmark Contamination via Layerwise Logit Trajectories

cs.CL · 2025-09-25 · unverdicted · novelty 7.0

LogitTrace detects benchmark contamination by showing that contaminated inputs produce earlier stabilization in layerwise logit trajectories while clean inputs show more gradual accumulation.

SrDetection: A Self-Referential Framework for Data Leakage Detection in Code Large Language Models

cs.CL · 2026-06-29 · unverdicted · novelty 6.0

SrDetection detects data leakage in Code LLMs via contrast between original benchmark samples and their semantic variants, reporting F1 gains of 21.52 (gray-box) and 14.46 (black-box) over baselines in a controlled testbed.

Towards Evaluation Engineering: An Empirical Study of ML Evaluation Harnesses in the Wild

cs.SE · 2026-05-22 · unverdicted · novelty 6.0

An empirical study of 57 ML evaluation harnesses shows 41.4% of operational issues occur in the specification stage, driven mainly by unimplemented features, documentation gaps, and missing input validation.

TRACER: A Semantic-Aware Framework for Fine-Grained Contamination Detection in Code LLMs

cs.SE · 2026-05-22 · unverdicted · novelty 6.0

TRACER presents a semantic-aware framework and the first benchmark for fine-grained code contamination detection across three levels of overlap, reporting F1 scores of 0.91-0.92 and large gains over prior methods.

Decaf: Improving Neural Decompilation with Automatic Feedback and Search

cs.SE · 2026-05-12 · unverdicted · novelty 6.0

Decaf uses compiler feedback and search to improve neural decompilation, boosting the semantic success rate from 26.0% to 83.9% on the ExeBench Real -O2 split.

Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation

cs.AI · 2026-05-11 · unverdicted · novelty 6.0

Agent benchmarks can report evidence-supported score bounds instead of single misleading success rates by adding a layer that checks required artifacts for outcome verification.

GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations

cs.CL · 2026-05-08 · unverdicted · novelty 6.0 · 2 refs

GSM-SEM is a reusable framework for creating semantically variant augmentations of math benchmarks like GSM8K that alter facts but preserve answers and difficulty, with evaluations showing LLM performance drops of up to 28% on the new variants.

TPS-CalcBench: A Benchmark and Diagnostic Evaluation Framework for LLM Analytical Calculation Competence in Hypersonic Thermal Protection System Engineering

cs.AI · 2026-04-20 · unverdicted · novelty 6.0

TPS-CalcBench is a new benchmark and evaluation framework that tests LLMs on analytical calculations in hypersonic aerodynamics and gas dynamics, using dual-track scoring and interventions to detect physically invalid reasoning.

League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models

cs.AI · 2025-07-30 · unverdicted · novelty 6.0

League of LLMs organizes LLMs into a self-governed mutual evaluation league using dynamic, transparent, objective, and professional criteria to distinguish model capabilities with 70.7% top-k ranking stability.

DataComp-LM: In search of the next generation of training sets for language models

cs.LG · 2024-06-17 · unverdicted · novelty 6.0

DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.

Measuring Reasoning Quality in LLMs: A Multi-Dimensional Behavioral Framework

cs.AI · 2026-05-23 · unverdicted · novelty 5.0 · 3 refs

Proposes a multi-dimensional behavioral framework with six dimensions (Correctness, Consistency, Robustness, Local Logical Coherence, Efficiency, Stability) plus deployment-aware aggregation to diagnose LLM reasoning beyond accuracy-based benchmarks.

The Illusion of Reasoning: Exposing Evasive Data Contamination in LLMs via Zero-CoT Truncation

cs.LG · 2026-05-21 · unverdicted · novelty 5.0

ZCP detects direct and evasive data contamination in LLMs by truncating CoT reasoning and contrasting zero-CoT accuracy on original versus perturbed isomorphic datasets, plus a Contamination Confidence metric.

LLM Benchmark Datasets Should Be Contamination-Resistant

cs.LG · 2026-05-19 · unverdicted · novelty 4.0

Authors call for contamination-resistant LLM benchmarks that exploit Transformer training-inference asymmetry and require new mathematical methods for cross-architecture interoperability.

Measuring AI Reasoning: A Guide for Researchers

cs.AI · 2026-05-04 · unverdicted · novelty 4.0

Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.

Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation

cs.SE · 2026-04-06 · unverdicted · novelty 4.0

Compiled AI generates deterministic code artifacts from LLMs in a one-time compilation step, enabling reliable workflow execution with zero runtime tokens after break-even.

ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

cs.CL · 2024-06-18 · unverdicted · novelty 3.0

GLM-4 models rival or exceed GPT-4 on MMLU, GSM8K, MATH, BBH, GPQA, HumanEval, IFEval, long-context tasks, and Chinese alignment while adding autonomous tool use for web, code, and image generation.

