arXiv preprint arXiv:2502.17521 , year =

· 2025 · arXiv 2502.17521

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it

read on arXiv browse 6 citing papers

citation-role summary

background 1 other 1

citation-polarity summary

background 1 unclear 1

representative citing papers

Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

cs.AI · 2026-05-12 · conditional · novelty 7.0

BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.

LiveFact: A Dynamic, Time-Aware Benchmark for LLM-Driven Fake News Detection

cs.CL · 2026-04-06 · unverdicted · novelty 7.0

LiveFact is a new time-aware benchmark that evaluates LLMs on reasoning with dynamic and incomplete information for fake news detection, identifying a significant reasoning gap in model behavior.

CFMS: A Coarse-to-Fine Multimodal Synthesis Framework for Enhanced Tabular Reasoning

cs.AI · 2026-04-13 · unverdicted · novelty 6.0

CFMS is a coarse-to-fine framework that uses MLLMs to create a multi-perspective knowledge tuple as a reasoning map for symbolic table operations, yielding competitive accuracy on WikiTQ and TabFact.

STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator

cs.AI · 2026-04-27 · unverdicted · novelty 5.0

STELLAR-E modifies the TGRT Self-Instruct framework to produce tailored synthetic LLM evaluation datasets that score an average 5.7% higher on LLM-as-a-judge metrics than existing language-specific benchmarks.

LLM Benchmark Datasets Should Be Contamination-Resistant

cs.LG · 2026-05-19 · unverdicted · novelty 4.0

Authors call for contamination-resistant LLM benchmarks that exploit Transformer training-inference asymmetry and require new mathematical methods for cross-architecture interoperability.

A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

cs.AI · 2025-07-28 · accept · novelty 4.0

The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.

citing papers explorer

Showing 6 of 6 citing papers.

Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack cs.AI · 2026-05-12 · conditional · none · ref 10
BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.
LiveFact: A Dynamic, Time-Aware Benchmark for LLM-Driven Fake News Detection cs.CL · 2026-04-06 · unverdicted · none · ref 1
LiveFact is a new time-aware benchmark that evaluates LLMs on reasoning with dynamic and incomplete information for fake news detection, identifying a significant reasoning gap in model behavior.
CFMS: A Coarse-to-Fine Multimodal Synthesis Framework for Enhanced Tabular Reasoning cs.AI · 2026-04-13 · unverdicted · none · ref 9
CFMS is a coarse-to-fine framework that uses MLLMs to create a multi-perspective knowledge tuple as a reasoning map for symbolic table operations, yielding competitive accuracy on WikiTQ and TabFact.
STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator cs.AI · 2026-04-27 · unverdicted · none · ref 3
STELLAR-E modifies the TGRT Self-Instruct framework to produce tailored synthetic LLM evaluation datasets that score an average 5.7% higher on LLM-as-a-judge metrics than existing language-specific benchmarks.
LLM Benchmark Datasets Should Be Contamination-Resistant cs.LG · 2026-05-19 · unverdicted · none · ref 6
Authors call for contamination-resistant LLM benchmarks that exploit Transformer training-inference asymmetry and require new mathematical methods for cross-architecture interoperability.
A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence cs.AI · 2025-07-28 · accept · none · ref 144
The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.

arXiv preprint arXiv:2502.17521 , year =

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer