AgenticFlict is a public dataset of 29K+ textual merge conflicts from AI agent PRs, collected via merge simulation on 107K processed PRs and showing a 27.67% conflict rate with variation across agents.
In2025 IEEE/ACM 33rd International Conference on Program Comprehension (ICPC)
6 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
AI coding agents produce pull requests with substantially more commits and slightly higher description-to-diff similarity than human developers, based on analysis of 29,095 merged PRs.
Frontier LLMs like GPT-5.2 show large accuracy drops on perturbed program-output prediction tasks while open-source reasoning models remain more stable, exposing limits in code semantics understanding.
A survey of 419 practitioners shows strong reliance on reusable GitHub Actions for core CI/CD tasks but limited adoption of reusable workflows, with copy-pasting remaining common due to versioning and trust issues.
PerfOrch is a four-agent multi-LLM system that uses offline profiling to build language-and-category rankings for routing tasks, achieving 97.19% and 95.83% pass@1 on HumanEval-X and EffiBench-X with generalization across benchmarks.
Debugging tools should present execution history in time order to support better hypothesis generation about program behavior.
citing papers explorer
-
AgenticFlict: A Large-Scale Dataset of Merge Conflicts in AI Coding Agent Pull Requests on GitHub
AgenticFlict is a public dataset of 29K+ textual merge conflicts from AI agent PRs, collected via merge simulation on 107K processed PRs and showing a 27.67% conflict rate with variation across agents.
-
How AI Coding Agents Modify Code: A Large-Scale Study of GitHub Pull Requests
AI coding agents produce pull requests with substantially more commits and slightly higher description-to-diff similarity than human developers, based on analysis of 29,095 merged PRs.
-
How Robustly do LLMs Understand Execution Semantics?
Frontier LLMs like GPT-5.2 show large accuracy drops on perturbed program-output prediction tasks while open-source reasoning models remain more stable, exposing limits in code semantics understanding.
-
Automation and Reuse Practices in GitHub Actions Workflows: A Practitioner's Perspective
A survey of 419 practitioners shows strong reliance on reusable GitHub Actions for core CI/CD tasks but limited adoption of reusable workflows, with copy-pasting remaining common due to versioning and trust issues.
-
Multi-LLM Orchestration for High-Quality Code Generation: Exploiting Complementary Model Strengths
PerfOrch is a four-agent multi-LLM system that uses offline profiling to build language-and-category rankings for routing tasks, achieving 97.19% and 95.83% pass@1 on HumanEval-X and EffiBench-X with generalization across benchmarks.
-
Tracers for debugging and program exploration
Debugging tools should present execution history in time order to support better hypothesis generation about program behavior.