AutoMat benchmark shows current LLM coding agents achieve at most 54.1% success when reproducing computational materials science claims from papers.
Advances in Neural Information Processing Systems
12 Pith papers cite this work.
citing papers explorer
- Can Coding Agents Reproduce Findings in Computational Materials Science?
  AutoMat benchmark shows current LLM coding agents achieve at most 54.1% success when reproducing computational materials science claims from papers.
- StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning
  StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.
- POSTCONDBENCH: Benchmarking Correctness and Completeness in Formal Postcondition Inference
  POSTCONDBENCH is a new multilingual benchmark that evaluates LLM postcondition generation on real code using defect discrimination to assess completeness beyond surface matching.
- How Far Are We From True Auto-Research?
  ResearchArena shows that agent-generated papers fail top-tier acceptance standards primarily due to fabricated results, underpowered experiments, and plan-execution mismatches that vary sharply by agent.
- Slipstream: Trajectory-Grounded Compaction Validation for Long-Horizon Agents
  Slipstream uses asynchronous compaction with trajectory-grounded judge validation to improve long-horizon agent accuracy by up to 8.8 percentage points and reduce latency by up to 39.7%.
- Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning
  Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.
- Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
  Evo-Memory is a new streaming benchmark and evaluation framework for self-evolving memory in LLM agents, unifying over ten memory modules and introducing the ReMem pipeline for continual improvement on multi-turn and reasoning datasets.
- SafeHarbor: Hierarchical Memory-Augmented Guardrail for LLM Agent Safety
  SafeHarbor introduces a hierarchical memory-augmented guardrail with adversarial rule extraction and entropy-driven self-evolution to balance safety and utility in LLM agents.
- AblateCell: A Reproduce-then-Ablate Agent for Virtual Cell Repositories
  AblateCell reproduces baselines in three single-cell perturbation repositories with 88.9% success and recovers ground-truth critical components with 93.3% accuracy via closed-loop ablation.
- AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning
  AEM adaptively modulates response-level entropy in agentic RL to improve credit assignment and exploration-exploitation balance, yielding gains on ALFWorld, WebShop, and SWE-bench.
- XekRung Technical Report
  XekRung achieves state-of-the-art performance on cybersecurity benchmarks among same-scale models via tailored data synthesis and multi-stage training while retaining strong general capabilities.
- Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents