IdleSpec improves LLM agent accuracy by generating and aggregating speculative plans during idle time between tool calls and observations using complementary drafting strategies.
hub
Codeagent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges
17 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
Adding product context retrieval to AI coding agents raises decision compliance from 46% to 95% on a new benchmark of 8 tasks with 41 weighted decision points.
A rubric-based generative reward model improves reinforced fine-tuning of SWE agents by supplying richer behavioral guidance than binary terminal rewards alone.
SkillJect automates creation of poisoned skills that outperform manual prompt injection by embedding payloads in helper scripts and using multi-agent feedback to refine instructions.
InlineCoder reframes repository-level code generation as function-level coding by using a draft anchor to inline the target function into its call graph for upstream usage and downstream dependency context.
FaSTA* combines LLM fast planning with A* search and inductive subroutine mining to create an efficient agent for multi-turn image editing tasks.
Introduces contextualized code pretraining with caller-callee pairs from static analysis to train CallerGen models that outperform baselines on the new CallerEval benchmark.
DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.
RealBench is a new repo-level code generation benchmark that adds UML diagrams to natural language specs, showing LLMs struggle more at full repositories, create modules with errors, and perform best with whole-repo generation on small projects versus module-by-module on complex ones.
MM-WebAgent is a hierarchical multimodal agent that coordinates AIGC tools through planning and iterative self-reflection to generate coherent, visually consistent webpages and outperforms baselines on a new benchmark.
TokenCake introduces agent-aware temporal and spatial schedulers for KV cache management in LLM multi-agent serving, claiming over 47% lower end-to-end latency and up to 16.9% better GPU memory utilization than vLLM on representative benchmarks.
A large-scale study finds that many LLM code translation failures are false negatives due to improper evaluation configurations rather than incorrect translations.
AblateCell reproduces baselines in three single-cell perturbation repositories with 88.9% success and recovers ground-truth critical components with 93.3% accuracy via closed-loop ablation.
TRACE improves project-wise subsequent code editing by interleaving neural-based induction for semantic edits and tool-based deduction for syntactic edits.
A survey classifying RAG foundations for AIGC, summarizing enhancements, cross-modal applications, benchmarks, limitations, and future directions.
A review of 114 studies classifies motivations into nine categories, analyzes common models and benchmarks, synthesizes challenges into six categories with 26 subcategories and solutions, and identifies six future research directions with 18 subcategories.
A systematic review of memory designs, evaluation methods, applications, limitations, and future directions for LLM-based agents.
citing papers explorer
-
IdleSpec: Exploiting Idle Time via Speculative Planning for LLM Agents
IdleSpec improves LLM agent accuracy by generating and aggregating speculative plans during idle time between tool calls and observations using complementary drafting strategies.
-
Context-Augmented Code Generation: How Product Context Improves AI Coding Agent Decision Compliance by 49%
Adding product context retrieval to AI coding agents raises decision compliance from 46% to 95% on a new benchmark of 8 tasks with 41 weighted decision points.
-
Beyond Verifiable Rewards: Rubric-Based GRM for Reinforced Fine-Tuning SWE Agents
A rubric-based generative reward model improves reinforced fine-tuning of SWE agents by supplying richer behavioral guidance than binary terminal rewards alone.
-
SkillJect: Effectively Automating Skill-Based Prompt Injection for Skill-Enabled Agents
SkillJect automates creation of poisoned skills that outperform manual prompt injection by embedding payloads in helper scripts and using multi-agent feedback to refine instructions.
-
In Line with Context: Repository-Level Code Generation via Context Inlining
InlineCoder reframes repository-level code generation as function-level coding by using a draft anchor to inline the target function into its call graph for upstream usage and downstream dependency context.
-
FaSTA$^*$: Fast-Slow Toolpath Agent with Subroutine Mining for Efficient Multi-turn Image Editing
FaSTA* combines LLM fast planning with A* search and inductive subroutine mining to create an efficient agent for multi-turn image editing tasks.
-
Contextualized Code Pretraining for Code Generation
Introduces contextualized code pretraining with caller-callee pairs from static analysis to train CallerGen models that outperform baselines on the new CallerEval benchmark.
-
Revisiting DAgger in the Era of LLM-Agents
DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.
-
RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices
RealBench is a new repo-level code generation benchmark that adds UML diagrams to natural language specs, showing LLMs struggle more at full repositories, create modules with errors, and perform best with whole-repo generation on small projects versus module-by-module on complex ones.
-
MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation
MM-WebAgent is a hierarchical multimodal agent that coordinates AIGC tools through planning and iterative self-reflection to generate coherent, visually consistent webpages and outperforms baselines on a new benchmark.
-
TokenCake: A KV-Cache-centric Serving Framework for LLM-based Multi-Agent Applications
TokenCake introduces agent-aware temporal and spatial schedulers for KV cache management in LLM multi-agent serving, claiming over 47% lower end-to-end latency and up to 16.9% better GPU memory utilization than vLLM on representative benchmarks.
-
Beyond Translation Accuracy: Addressing False Failures in LLM-Based Code Translation
A large-scale study finds that many LLM code translation failures are false negatives due to improper evaluation configurations rather than incorrect translations.
-
AblateCell: A Reproduce-then-Ablate Agent for Virtual Cell Repositories
AblateCell reproduces baselines in three single-cell perturbation repositories with 88.9% success and recovers ground-truth critical components with 93.3% accuracy via closed-loop ablation.
-
Learning Project-wise Subsequent Code Edits via Interleaving Neural-based Induction and Tool-based Deduction
TRACE improves project-wise subsequent code editing by interleaving neural-based induction for semantic edits and tool-based deduction for syntactic edits.
-
Retrieval-Augmented Generation for AI-Generated Content: A Survey
A survey classifying RAG foundations for AIGC, summarizing enhancements, cross-modal applications, benchmarks, limitations, and future directions.
-
LLM-Based Multi-Agent Systems for Code Generation: A Multi-Vocal Literature Review
A review of 114 studies classifies motivations into nine categories, analyzes common models and benchmarks, synthesizes challenges into six categories with 26 subcategories and solutions, and identifies six future research directions with 18 subcategories.
-
A Survey on the Memory Mechanism of Large Language Model based Agents
A systematic review of memory designs, evaluation methods, applications, limitations, and future directions for LLM-based agents.