hub Canonical reference

arXiv preprint arXiv:2410.20285 , year=

Antonis Antoniades, Albert Örwall, Kexun Zhang, Yuxi Xie, Anirudh Goyal, William Wang · 2024 · arXiv 2410.20285

Canonical reference. 80% of citing Pith papers cite this work as background.

12 Pith papers citing it

Background 80% of classified citations

read on arXiv browse 12 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 4 baseline 1

citation-polarity summary

background 4 baseline 1

representative citing papers

Dynamic analysis enhances issue resolution

cs.SE · 2026-03-23 · conditional · novelty 7.0

DAIRA integrates dynamic tracing into LLM agents to achieve 79.4% resolution rate on SWE-bench Verified for code defect repair.

STRIDE: Learnable Stepwise Language Feedback for LLM Reasoning

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

STRIDE co-trains generator and verifier on outcome rewards alone to deliver learnable stepwise language feedback that redirects LLM reasoning trajectories and outperforms scalar-reward baselines.

Step Rejection Fine-Tuning: A Practical Distillation Recipe

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

Step Rejection Fine-Tuning masks loss on erroneous steps identified by a critic LLM in unresolved trajectories, raising SWE-bench Verified resolution rate by 3.7% to 32.2% versus 2.4% for trajectory-level rejection.

GALA: Multimodal Graph Alignment for Bug Localization in Automated Program Repair

cs.SE · 2026-04-09 · unverdicted · novelty 6.0

GALA uses hierarchical graph alignment between UI screenshots and code structures to achieve state-of-the-art bug localization in multimodal automated program repair on SWE-bench.

On the Role of Fault Localization Context for LLM-Based Program Repair

cs.SE · 2026-04-07 · unverdicted · novelty 6.0

More fault localization context does not consistently improve LLM-based program repair; file-level context gives 15-17x gains, optimal around 6-10 files, while line-level context often degrades performance from noise.

Beyond Fixed Tests: Repository-Level Issue Resolution as Coevolution of Code and Behavioral Constraints

cs.SE · 2026-04-06 · unverdicted · novelty 6.0

Agent-CoEvo is a multi-agent LLM framework that coevolves code patches and test patches to resolve repository-level issues, outperforming fixed-test baselines on SWE-bench Lite and SWT-bench Lite.

Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

cs.CL · 2025-11-25 · unverdicted · novelty 6.0

Evo-Memory is a new streaming benchmark and evaluation framework for self-evolving memory in LLM agents, unifying over ten memory modules and introducing the ReMem pipeline for continual improvement on multi-turn and reasoning datasets.

Code as Agent Harness

cs.CL · 2026-05-18 · accept · novelty 5.0

A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed open challenges.

Bias in the Loop: Auditing LLM-as-a-Judge for Software Engineering

cs.SE · 2026-04-18 · unverdicted · novelty 5.0

LLM judges for code tasks show high sensitivity to prompt biases that systematically favor certain options, changing accuracy and model rankings even when code is unchanged.

Agentic Reasoning for Large Language Models

cs.AI · 2026-01-18 · unverdicted · novelty 4.0

The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applications across domains.

AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation

cs.SE · 2026-05-13

HELO-APR: Enhancing Low-Resource Program Repair through Cross-Lingual Knowledge Transfer

cs.SE · 2026-04-18

citing papers explorer

Showing 12 of 12 citing papers.

Dynamic analysis enhances issue resolution cs.SE · 2026-03-23 · conditional · none · ref 2
DAIRA integrates dynamic tracing into LLM agents to achieve 79.4% resolution rate on SWE-bench Verified for code defect repair.
STRIDE: Learnable Stepwise Language Feedback for LLM Reasoning cs.LG · 2026-05-13 · unverdicted · none · ref 35
STRIDE co-trains generator and verifier on outcome rewards alone to deliver learnable stepwise language feedback that redirects LLM reasoning trajectories and outperforms scalar-reward baselines.
Step Rejection Fine-Tuning: A Practical Distillation Recipe cs.LG · 2026-05-11 · unverdicted · none · ref 5
Step Rejection Fine-Tuning masks loss on erroneous steps identified by a critic LLM in unresolved trajectories, raising SWE-bench Verified resolution rate by 3.7% to 32.2% versus 2.4% for trajectory-level rejection.
GALA: Multimodal Graph Alignment for Bug Localization in Automated Program Repair cs.SE · 2026-04-09 · unverdicted · none · ref 2
GALA uses hierarchical graph alignment between UI screenshots and code structures to achieve state-of-the-art bug localization in multimodal automated program repair on SWE-bench.
On the Role of Fault Localization Context for LLM-Based Program Repair cs.SE · 2026-04-07 · unverdicted · none · ref 2
More fault localization context does not consistently improve LLM-based program repair; file-level context gives 15-17x gains, optimal around 6-10 files, while line-level context often degrades performance from noise.
Beyond Fixed Tests: Repository-Level Issue Resolution as Coevolution of Code and Behavioral Constraints cs.SE · 2026-04-06 · unverdicted · none · ref 5
Agent-CoEvo is a multi-agent LLM framework that coevolves code patches and test patches to resolve repository-level issues, outperforming fixed-test baselines on SWE-bench Lite and SWT-bench Lite.
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory cs.CL · 2025-11-25 · unverdicted · none · ref 220
Evo-Memory is a new streaming benchmark and evaluation framework for self-evolving memory in LLM agents, unifying over ten memory modules and introducing the ReMem pipeline for continual improvement on multi-turn and reasoning datasets.
Code as Agent Harness cs.CL · 2026-05-18 · accept · none · ref 166
A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed open challenges.
Bias in the Loop: Auditing LLM-as-a-Judge for Software Engineering cs.SE · 2026-04-18 · unverdicted · none · ref 2
LLM judges for code tasks show high sensitivity to prompt biases that systematically favor certain options, changing accuracy and model rankings even when code is unchanged.
Agentic Reasoning for Large Language Models cs.AI · 2026-01-18 · unverdicted · none · ref 132
The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applications across domains.
AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation cs.SE · 2026-05-13 · unreviewed · ref 1
HELO-APR: Enhancing Low-Resource Program Repair through Cross-Lingual Knowledge Transfer cs.SE · 2026-04-18 · unreviewed · ref 1

arXiv preprint arXiv:2410.20285 , year=

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer