TSFMAudit detects pretraining contamination in time series foundation models via probe adaptation dynamics (faster loss drop, smaller backbone shift), tested on 6 models and 187 datasets against 10 LLM-derived baselines.
hub Mixed citations
Guanhua Zhang and Moritz Hardt
Mixed citation behavior. Most common role is background (67%).
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Benchmark-specific training maps to shift bribery and is NP-hard under Borda and mean win rate; mean win rate has the highest instance-level robustness (median 22 tasks on BBH) among tested aggregation rules.
JECS aggregates per-model conformal p-values via their maximum and reconstructs a conservative envelope of the max-p null distribution to select benchmarks with global contamination rate control.
BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.
A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.
Agent Island is a new multiagent game environment that functions as a dynamic benchmark resistant to saturation and contamination, with Bayesian ranking showing OpenAI GPT-5.5 as the strongest performer among 49 models across 999 games.
Structurally rich task descriptions make LLMs robust to prompt under-specification, and under-specification can enhance code correctness by disrupting misleading lexical or structural cues.
DRBENCHER generates multi-hop questions across biochemistry, finance, geophysics, security, and history that test interleaved browsing and computation, where the strongest models reach only 20% accuracy and human validation finds 76% validity.
LogitTrace detects benchmark contamination by showing that contaminated inputs produce earlier stabilization in layerwise logit trajectories while clean inputs show more gradual accumulation.
SrDetection detects data leakage in Code LLMs via contrast between original benchmark samples and their semantic variants, reporting F1 gains of 21.52 (gray-box) and 14.46 (black-box) over baselines in a controlled testbed.
An empirical study of 57 ML evaluation harnesses shows 41.4% of operational issues occur in the specification stage, driven mainly by unimplemented features, documentation gaps, and missing input validation.
TRACER presents a semantic-aware framework and the first benchmark for fine-grained code contamination detection across three levels of overlap, reporting F1 scores of 0.91-0.92 and large gains over prior methods.
Decaf uses compiler feedback and search to improve neural decompilation, boosting semantic success rate from 26.0% to 83.9% on ExeBench Real -O2 split.
Agent benchmarks can report evidence-supported score bounds instead of single misleading success rates by adding a layer that checks required artifacts for outcome verification.
TPS-CalcBench is a new benchmark and evaluation framework that tests LLMs on analytical calculations in hypersonic aerodynamics and gas dynamics, using dual-track scoring and interventions to detect physically invalid reasoning.
League of LLMs organizes LLMs into a self-governed mutual evaluation league using dynamic, transparent, objective, and professional criteria to distinguish model capabilities with 70.7% top-k ranking stability.
DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
A multi-dimensional framework with six dimensions (Correctness, Consistency, Robustness, Logical Coherence, Efficiency, Stability) is applied to seven LLMs on 975 items, revealing orthogonality between logical coherence and correctness plus ranking inversions invisible to accuracy metrics.
ZCP detects direct and evasive data contamination in LLMs by truncating CoT reasoning and contrasting zero-CoT accuracy on original versus perturbed isomorphic datasets, plus a Contamination Confidence metric.
Authors call for contamination-resistant LLM benchmarks that exploit Transformer training-inference asymmetry and require new mathematical methods for cross-architecture interoperability.
Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.
Compiled AI generates deterministic code artifacts from LLMs in a one-time compilation step, enabling reliable workflow execution with zero runtime tokens after break-even.
GLM-4 models rival or exceed GPT-4 on MMLU, GSM8K, MATH, BBH, GPQA, HumanEval, IFEval, long-context tasks, and Chinese alignment while adding autonomous tool use for web, code, and image generation.
A survey reviewing benchmark data contamination in LLMs, its impact on evaluation, and alternative assessment approaches.
citing papers explorer
-
Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack
BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.
-
Agent Island: A Saturation- and Contamination-Resistant Benchmark from Multiagent Games
Agent Island is a new multiagent game environment that functions as a dynamic benchmark resistant to saturation and contamination, with Bayesian ranking showing OpenAI GPT-5.5 as the strongest performer among 49 models across 999 games.
-
DRBENCHER: Can Your Agent Identify the Entity, Retrieve Its Properties and Do the Math?
DRBENCHER generates multi-hop questions across biochemistry, finance, geophysics, security, and history that test interleaved browsing and computation, where the strongest models reach only 20% accuracy and human validation finds 76% validity.
-
Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation
Agent benchmarks can report evidence-supported score bounds instead of single misleading success rates by adding a layer that checks required artifacts for outcome verification.
-
TPS-CalcBench: A Benchmark and Diagnostic Evaluation Framework for LLM Analytical Calculation Competence in Hypersonic Thermal Protection System Engineering
TPS-CalcBench is a new benchmark and evaluation framework that tests LLMs on analytical calculations in hypersonic aerodynamics and gas dynamics, using dual-track scoring and interventions to detect physically invalid reasoning.
-
League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models
League of LLMs organizes LLMs into a self-governed mutual evaluation league using dynamic, transparent, objective, and professional criteria to distinguish model capabilities with 70.7% top-k ranking stability.
-
Measuring Reasoning Quality in LLMs: A Multi-Dimensional Behavioral Framework
A multi-dimensional framework with six dimensions (Correctness, Consistency, Robustness, Logical Coherence, Efficiency, Stability) is applied to seven LLMs on 975 items, revealing orthogonality between logical coherence and correctness plus ranking inversions invisible to accuracy metrics.
-
Measuring AI Reasoning: A Guide for Researchers
Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.