In: 2025 IEEE/ACM 33rd International Conference on Program Comprehension (ICPC), pp
9 Pith papers cite this work.
representative citing papers
PerfOrch is a four-agent multi-LLM system that uses offline profiling to build language-and-category rankings for routing tasks, achieving 97.19% and 95.83% pass@1 on HumanEval-X and EffiBench-X, respectively, with generalization across benchmarks.
Debugging tools should present execution history in time order to support better hypothesis generation about program behavior.
Case study applies verifier-guided LLM evolutionary agents to contraction-order optimization in tensor networks and concludes that human validation remains essential.
citing papers explorer
- Repository-Level Solidity Code Generation with Large Language Models: From Prompting to Fine-Tuning
Introduces SolidityBench benchmark and SolidityScore metric for repository-level Solidity code generation, finding supervised fine-tuning outperforms prompting, CoT, ICL, and RAG methods on evaluated LLMs.
- AgenticFlict: A Large-Scale Dataset of Merge Conflicts in AI Coding Agent Pull Requests on GitHub
AgenticFlict is a public dataset of 29K+ textual merge conflicts from AI agent PRs, collected via merge simulation on 107K processed PRs and showing a 27.67% conflict rate with variation across agents.
- How AI Coding Agents Modify Code: A Large-Scale Study of GitHub Pull Requests
AI coding agents produce pull requests with substantially more commits and slightly higher description-to-diff similarity than human developers, based on analysis of 29,095 merged PRs.
- Beyond the Tip of the Iceberg: Understanding SATD in Dockerfiles through the Lens of Co-evolution
Analysis of self-admitted technical debt (SATD) in Dockerfiles shows 27% of admissions and 40% of repayments are coupled to non-Dockerfile artifacts, with coupled events repaid faster overall and external dependencies as a key trigger.
- How Robustly do LLMs Understand Execution Semantics?
Frontier LLMs like GPT-5.2 show large accuracy drops on perturbed program-output prediction tasks while open-source reasoning models remain more stable, exposing limits in code semantics understanding.
- Automation and Reuse Practices in GitHub Actions Workflows: A Practitioner's Perspective
A survey of 419 practitioners shows strong reliance on reusable GitHub Actions for core CI/CD tasks but limited adoption of reusable workflows, with copy-pasting remaining common due to versioning and trust issues.