{"total":25,"items":[{"citing_arxiv_id":"2606.29815","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SrDetection: A Self-Referential Framework for Data Leakage Detection in Code Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-06-29T05:48:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SrDetection detects data leakage in Code LLMs via contrast between original benchmark samples and their semantic variants, reporting F1 gains of 21.52 (gray-box) and 14.46 (black-box) over baselines in a controlled testbed.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26161","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TSFMAudit: Data Contamination Auditing in Forecasting Time Series Foundation Models","primary_cat":"cs.LG","submitted_at":"2026-05-24T14:59:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"TSFMAudit detects pretraining contamination in time series foundation models via probe adaptation dynamics (faster loss drop, smaller backbone shift), tested on 6 models and 187 datasets against 10 LLM-derived baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.24661","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Measuring Reasoning Quality in LLMs: A Multi-Dimensional Behavioral Framework","primary_cat":"cs.AI","submitted_at":"2026-05-23T17:03:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A multi-dimensional framework with six dimensions (Correctness, Consistency, Robustness, Logical Coherence, Efficiency, Stability) is applied to seven LLMs on 975 items, revealing orthogonality between logical coherence and correctness plus ranking inversions invisible to accuracy metrics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.24213","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards Evaluation Engineering: An Empirical Study of ML Evaluation Harnesses in the Wild","primary_cat":"cs.SE","submitted_at":"2026-05-22T20:54:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"An empirical study of 57 ML evaluation harnesses shows 41.4% of operational issues occur in the specification stage, driven mainly by unimplemented features, documentation gaps, and missing input validation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.24079","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TRACER: A Semantic-Aware Framework for Fine-Grained Contamination Detection in Code LLMs","primary_cat":"cs.SE","submitted_at":"2026-05-22T17:30:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"UNKNOWN","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TRACER presents a semantic-aware framework and the first benchmark for fine-grained code contamination detection across three levels of overlap, reporting F1 scores of 0.91-0.92 and large gains over prior methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23628","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"How Hard is it to Rig a Benchmark? A Social Choice Analysis of Leaderboard Robustness","primary_cat":"cs.LG","submitted_at":"2026-05-22T13:40:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Benchmark-specific training maps to shift bribery and is NP-hard under Borda and mean win rate; mean win rate has the highest instance-level robustness (median 22 tasks on BBH) among tested aggregation rules.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21856","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Illusion of Reasoning: Exposing Evasive Data Contamination in LLMs via Zero-CoT Truncation","primary_cat":"cs.LG","submitted_at":"2026-05-21T01:06:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ZCP detects direct and evasive data contamination in LLMs by truncating CoT reasoning and contrasting zero-CoT accuracy on original versus perturbed isomorphic datasets, plus a Contamination Confidence metric.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21543","ref_index":172,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Provable Joint Decontamination for Benchmarking Multiple Large Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-20T09:16:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"JECS aggregates per-model conformal p-values via their maximum and reconstructs a conservative envelope of the max-p null distribution to select benchmarks with global contamination rate control.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19999","ref_index":90,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LLM Benchmark Datasets Should Be Contamination-Resistant","primary_cat":"cs.LG","submitted_at":"2026-05-19T15:33:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Authors call for contamination-resistant LLM benchmarks that exploit Transformer training-inference asymmetry and require new mathematical methods for cross-architecture interoperability.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12673","ref_index":60,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack","primary_cat":"cs.AI","submitted_at":"2026-05-12T19:22:45+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.","context_count":1,"top_context_role":"other","top_context_polarity":"unclear","context_text":"04850. [58] Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains, 2024. URL https://arxiv.org/abs/ 2406.12045. [59] Boxi Yu, Yuxuan Zhu, Pinjia He, and Daniel Kang. Utboost: Rigorous evaluation of coding agents on swe-bench, 2025. URLhttps://arxiv.org/abs/2506.09289. [60] Boxi Yu, Yang Cao, Yuzhong Zhang, Liting Lin, Junjielong Xu, Zhiqing Zhong, Qinghua Xu, Guancheng Wang, Jialun Cao, Shing-Chi Cheung, Pinjia He, and Lionel Briand. Swe-abs: Adversarial benchmark strengthening exposes inflated success rates on test-based benchmark, 2026. URLhttps://arxiv.org/abs/2603.00520. [61] Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng,"},{"citing_arxiv_id":"2605.11501","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Decaf: Improving Neural Decompilation with Automatic Feedback and Search","primary_cat":"cs.SE","submitted_at":"2026-05-12T04:21:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Decaf uses compiler feedback and search to improve neural decompilation, boosting semantic success rate from 26.0% to 83.9% on ExeBench Real -O2 split.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"compiler configurations and instruction set architectures is an important direction for future work if the reranker is intended to generalize to these domains. Reinforcement Learning and Iterative Refinement . As described in Section 2.2, we sample eight candidates per function during data collection, yielding positive and neg- ative sequences that could be used for offline RL [32], [33], Direct Preference Optimization [34], or online RL algorithms such as PPO [35], [36] or GRPO [37]. In earlier iterations we also explored an iterative refinement model to recursively edit generator outputs, but did not find significant improvements. 5. Related Work 5.1. Traditional Decompilation Decompilation has been studied for over five decades [15], but most modern systems trace their lineage"},{"citing_arxiv_id":"2605.10448","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation","primary_cat":"cs.AI","submitted_at":"2026-05-11T12:20:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Agent benchmarks can report evidence-supported score bounds instead of single misleading success rates by adding a layer that checks required artifacts for outcome verification.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"multimodal agents for open-ended tasks in real computer environments, 2024. URL https: //arxiv.org/abs/2404.07972. [26] Frank F Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, et al. TheAgentCompany: Benchmarking LLM agents on consequential real world tasks.arXiv preprint arXiv:2412.14161, 2024. [27] Shuo Yang, Wei-Lin Chiang, Lianmin Zheng, Joseph E. Gonzalez, and Ion Stoica. Rethinking benchmark and contamination for language models with rephrased samples, 2023. URL https://arxiv.org/abs/2311.04850. [28] Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. WebShop: Towards scalable real-world web interaction with grounded language agents."},{"citing_arxiv_id":"2605.07053","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations","primary_cat":"cs.CL","submitted_at":"2026-05-08T00:02:39+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"51 [0.57] -3.88 [4.10]↓ -2.76 [2.86] -7.97 [8.26]↓ -6.93 [7.03] -15.33 [15.81]↓ GPT4.1 -0.95 [0.51] -3.85 [2.98]↓ -3.33 [3.47] -7.98 [8.31]↓ -4.47 [4.26] -11.86 [11.87]↓ GPT4.1-mini -1.77 [0.84] -2.68 [1.73]↓ -2.42 [2.51] -6.83 [7.06]↓ -4.25 [4.05] -12.19 [12.35]↓ Gemini2.5-f -2.51 [2.12] -5.07 [4.23]↓ -3.11 [3.17] -6.94 [7.16]↓ -6.66 [6.57] -13.38 [13.34]↓ Gemini2.5-f-l -1.95 [1.04] -4.44 [3.64]↓ -0.52 [0.55] -7.27 [7.62]↓ -6.40 [6.27] -14.05 [14.53]↓ Gemini2.5-pro -0.82 [0.36] -3.49 [3.39]↓ -0.86 [0.88] -6.66 [6.86]↓ -4.15 [4.07] -11.53 [11.35]↓ O3 -0.19 [-0.29] -3.36 [2.68]↓ -2.18 [2.23] -8.71 [8.91]↓ -5.06 [4.91] -9.77 [9.45]↓ GPT5(mnml) -1.16 [0.71] -3.37 [2.42]↓ -1.69 [1.73] -8.68 [8.89]↓ -4.84 [4."},{"citing_arxiv_id":"2605.06327","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity","primary_cat":"cs.CL","submitted_at":"2026-05-07T14:23:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04312","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Agent Island: A Saturation- and Contamination-Resistant Benchmark from Multiagent Games","primary_cat":"cs.AI","submitted_at":"2026-05-05T21:24:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Agent Island is a new multiagent game environment that functions as a dynamic benchmark resistant to saturation and contamination, with Bayesian ranking showing OpenAI GPT-5.5 as the strongest performer among 49 models across 999 games.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02442","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Measuring AI Reasoning: A Guide for Researchers","primary_cat":"cs.AI","submitted_at":"2026-05-04T10:42:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24712","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"When Prompt Under-Specification Improves Code Correctness: An Exploratory Study of Prompt Wording and Structure Effects on LLM-Based Code Generation","primary_cat":"cs.SE","submitted_at":"2026-04-27T17:21:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Structurally rich task descriptions make LLMs robust to prompt under-specification, and under-specification can enhance code correctness by disrupting misleading lexical or structural cues.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17966","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TPS-CalcBench: A Benchmark and Diagnostic Evaluation Framework for LLM Analytical Calculation Competence in Hypersonic Thermal Protection System Engineering","primary_cat":"cs.AI","submitted_at":"2026-04-20T08:46:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TPS-CalcBench is a new benchmark and evaluation framework that tests LLMs on analytical calculations in hypersonic aerodynamics and gas dynamics, using dual-track scoring and interventions to detect physically invalid reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09251","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DRBENCHER: Can Your Agent Identify the Entity, Retrieve Its Properties and Do the Math?","primary_cat":"cs.AI","submitted_at":"2026-04-10T12:07:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DRBENCHER generates multi-hop questions across biochemistry, finance, geophysics, security, and history that test interleaved browsing and computation, where the strongest models reach only 20% accuracy and human validation finds 76% validity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05150","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation","primary_cat":"cs.SE","submitted_at":"2026-04-06T20:25:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Compiled AI generates deterministic code artifacts from LLMs in a one-time compilation step, enabling reliable workflow execution with zero runtime tokens after break-even.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.20909","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LogitTrace: Detecting Benchmark Contamination via Layerwise Logit Trajectories","primary_cat":"cs.CL","submitted_at":"2025-09-25T08:55:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LogitTrace detects benchmark contamination by showing that contaminated inputs produce earlier stabilization in layerwise logit trajectories while clean inputs show more gradual accumulation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.22359","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models","primary_cat":"cs.AI","submitted_at":"2025-07-30T03:50:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"League of LLMs organizes LLMs into a self-governed mutual evaluation league using dynamic, transparent, objective, and professional criteria to distinguish model capabilities with 70.7% top-k ranking stability.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.12793","ref_index":50,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools","primary_cat":"cs.CL","submitted_at":"2024-06-18T16:58:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"GLM-4 models rival or exceed GPT-4 on MMLU, GSM8K, MATH, BBH, GPQA, HumanEval, IFEval, long-context tasks, and Chinese alignment while adding autonomous tool use for web, code, and image generation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[48] Y . Xu, X. Liu, X. Liu, Z. Hou, Y . Li, X. Zhang, Z. Wang, A. Zeng, Z. Du, W. Zhao, J. Tang, and Y . Dong. Chatglm-math: Improving math problem-solving in large language models with a self-critique pipeline, 2024. [49] F. Yan, H. Mao, C. C.-J. Ji, T. Zhang, S. G. Patil, I. Stoica, and J. E. Gonzalez. Berkeley function calling leaderboard. 2024. [50] S. Yang, W.-L. Chiang, L. Zheng, J. E. Gonzalez, and I. Stoica. Rethinking benchmark and contamination for language models with rephrased samples. arXiv preprint arXiv:2311.04850, 2023. [51] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y . Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022."},{"citing_arxiv_id":"2406.11794","ref_index":208,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DataComp-LM: In search of the next generation of training sets for language models","primary_cat":"cs.LG","submitted_at":"2024-06-17T17:42:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.04244","ref_index":170,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Benchmark Data Contamination of Large Language Models: A Survey","primary_cat":"cs.CL","submitted_at":"2024-06-06T16:41:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A survey reviewing benchmark data contamination in LLMs, its impact on evaluation, and alternative assessment approaches.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Prompting Large Language Model for Machine Translation: A Case Study. In Proceedings of the 40th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 202), Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (Eds.). PMLR, 41092-41110. https://proceedings.mlr.press/v202/zhang23m.html [170] Lei Zhang, Shuai Wang, and Bing Liu. 2018. Deep learning for sentiment analysis: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8, 4 (2018), e1253. https://doi.org/10.1002/widm.1253 [171] Wenxuan Zhang, Yue Deng, Bing Liu, Sinno Jialin Pan, and Lidong Bing. 2023. Sentiment Analysis in the Era of Large Language Models: A Reality Check."}],"limit":50,"offset":0}