arXiv:2312.15223 [cs.SE]

Quanjun Zhang, Chunrong Fang, Yang Xie, Yaxin Zhang, Yun Yang, Weisong Sun, Shengcheng Yu, Zhenyu Chen · 2023 · arXiv 2312.15223

21 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 4

citation-polarity summary

background 4

representative citing papers

Do Papers Tell the Whole Story? A Benchmark and Framework for Uncovering Hidden Implementation Gaps in Bioinformatics

cs.LG · 2026-03-23 · unverdicted · novelty 8.0

BioCon is the first benchmark dataset and cross-modal framework for detecting inconsistencies between methodological descriptions in bioinformatics papers and their code implementations.

Query2Diagram: Answering Developer Queries with UML Diagrams

cs.SE · 2026-04-26 · unverdicted · novelty 7.0

Fine-tuning Qwen2.5-Coder-14B on code-query-diagram triples produces UML diagrams with higher F1 scores and lower structural defect rates than base or other LLMs.

Do AI Coding Agents Log Like Humans? An Empirical Study

cs.SE · 2026-04-10 · unverdicted · novelty 7.0

AI agents modify logging less often than humans in 58.4% of repositories but produce higher log density when they change it; explicit logging instructions are rare (4.7%) and ignored 67% of the time, with humans performing 72.5% of post-generation log repairs.

Story Point Estimation Using Large Language Models

cs.SE · 2026-03-06 · unverdicted · novelty 7.0

LLMs predict story points better in zero-shot prompting than supervised deep learning models trained on 80% of project data, with few-shot examples and comparative judgments further improving performance.

RubberDuckBench: A Benchmark for AI Coding Assistants

cs.SE · 2026-01-23 · unverdicted · novelty 7.0

RubberDuckBench shows top AI models score around 68% on real GitHub coding questions, rarely answer completely correctly, and hallucinate in 58% of responses on average.

Compass vs Railway Tracks: Unpacking User Mental Models for Communicating Long-Horizon Work to Humans vs. AI

cs.HC · 2026-01-17 · unverdicted · novelty 7.0

Users treat human delegation for long tasks as a flexible compass but AI delegation as rigid railway tracks due to perceived AI limitations in inference and judgment.

SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios

cs.SE · 2025-12-20 · unverdicted · novelty 7.0

SWE-EVO shows GPT-5.4 with OpenHands reaching only 25% success on complex multi-file evolution tasks versus 72.8% on SWE-Bench Verified, and introduces Fix Rate as a partial-progress metric.

Contextualized Code Pretraining for Code Generation

cs.SE · 2026-05-18 · unverdicted · novelty 6.0

Introduces contextualized code pretraining with caller-callee pairs from static analysis to train CallerGen models that outperform baselines on the new CallerEval benchmark.

Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code

cs.SE · 2026-05-06 · accept · novelty 6.0

A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.

REAgent: Requirement-Driven LLM Agents for Software Issue Resolution

cs.SE · 2026-04-08 · unverdicted · novelty 6.0

REAgent improves LLM patch generation for software issues by 17.4% on average through automated construction, quality checking, and iterative refinement of structured issue-oriented requirements.

Compiling Code LLMs into Lightweight Executables

cs.SE · 2026-03-31 · conditional · novelty 6.0

Ditto quantizes Code LLMs with K-Means codebooks and compiles inference via LLVM-BLAS replacement to deliver up to 10.5x faster, 6.4x smaller, and 10.5x lower-energy execution on commodity hardware while losing only 0.27% pass@1 accuracy.

How Helpful is LLM Assistance in Network Operations? A Case Study at a Large Demonstration Network

cs.NI · 2026-05-19 · unverdicted · novelty 5.0

A case study with 105 network engineers found that an LLM chatbot with RAG, CLI control, and ticket access received positive evaluations in 68.1% of interactions while assisting with building and operating a large demonstration network.

Bias in the Loop: Auditing LLM-as-a-Judge for Software Engineering

cs.SE · 2026-04-18 · unverdicted · novelty 5.0

LLM judges for code tasks show high sensitivity to prompt biases that systematically favor certain options, changing accuracy and model rankings even when code is unchanged.

CodeWiki: Evaluating AI's Ability to Generate Holistic Documentation for Large-Scale Codebases

cs.SE · 2025-10-28 · unverdicted · novelty 5.0

CodeWiki presents a unified framework for repository-level documentation across seven languages using hierarchical decomposition, recursive multi-agent processing, and multi-modal synthesis, outperforming DeepWiki by 4.73% on CodeWikiBench.

Security of LLM-generated Code: A Comparative Analysis

cs.SE · 2026-05-21 · unverdicted · novelty 4.0

Empirical evaluation shows that code generated by all seven tested LLMs contains vulnerabilities, the majority of which are of critical or high severity.

Evaluation of LLM-Based Software Engineering Tools: Practices, Challenges, and Future Directions

cs.SE · 2026-04-27 · unverdicted · novelty 4.0

LLM-based SE tools lack stable ground truth and deterministic outputs, making standard evaluation assumptions invalid and requiring new approaches for reliable assessment.

Towards Enabling An Artificial Self-Construction Software Life-cycle via Autopoietic Architectures

cs.SE · 2026-04-15 · unverdicted · novelty 4.0

Proposes autopoietic architectures for self-constructing software as a fundamental shift in the SDLC, leveraging foundation models for autonomous evolution and maintenance.

REFLEX: Reference-Free Evaluation of Log Summarization via Large Language Model Judgment

cs.CL · 2025-11-06 · unverdicted · novelty 4.0

REFLEX is a reference-free LLM-based evaluation metric for log summarization that assesses quality on relevance, informativeness, and coherence without gold references or human annotations.

Foundational Design Principles and Patterns for Building Robust and Adaptive GenAI-Native Systems

cs.SE · 2025-08-21 · unverdicted · novelty 4.0

Proposes five foundational pillars and architectural patterns for building robust GenAI-native systems by combining AI with software engineering principles.

From Cool Demos to Production-Ready FMware: Core Challenges and a Technology Roadmap

cs.SE · 2024-10-28 · unverdicted · novelty 4.0

A semi-structured thematic synthesis identifies core challenges in FM selection, alignment, prompting, orchestration, testing, deployment, and cross-cutting concerns like observability for production-ready FMware.

Understanding the Human-LLM Dynamic: A Literature Survey of LLM Use in Programming Tasks

cs.SE · 2024-10-01 · unverdicted · novelty 3.0

A survey of user studies on LLM use in programming that identifies interaction behaviors, mixed benefits and weaknesses, and factors influencing human and task performance.

citing papers explorer

Showing 21 of 21 citing papers.

Do Papers Tell the Whole Story? A Benchmark and Framework for Uncovering Hidden Implementation Gaps in Bioinformatics cs.LG · 2026-03-23 · unverdicted · none · ref 13
BioCon is the first benchmark dataset and cross-modal framework for detecting inconsistencies between methodological descriptions in bioinformatics papers and their code implementations.
Query2Diagram: Answering Developer Queries with UML Diagrams cs.SE · 2026-04-26 · unverdicted · none · ref 46
Fine-tuning Qwen2.5-Coder-14B on code-query-diagram triples produces UML diagrams with higher F1 scores and lower structural defect rates than base or other LLMs.
Do AI Coding Agents Log Like Humans? An Empirical Study cs.SE · 2026-04-10 · unverdicted · none · ref 41
AI agents modify logging less often than humans in 58.4% of repositories but produce higher log density when they change it; explicit logging instructions are rare (4.7%) and ignored 67% of the time, with humans performing 72.5% of post-generation log repairs.
Story Point Estimation Using Large Language Models cs.SE · 2026-03-06 · unverdicted · none · ref 19
LLMs predict story points better in zero-shot prompting than supervised deep learning models trained on 80% of project data, with few-shot examples and comparative judgments further improving performance.
RubberDuckBench: A Benchmark for AI Coding Assistants cs.SE · 2026-01-23 · unverdicted · none · ref 40
RubberDuckBench shows top AI models score around 68% on real GitHub coding questions, rarely answer completely correctly, and hallucinate in 58% of responses on average.
Compass vs Railway Tracks: Unpacking User Mental Models for Communicating Long-Horizon Work to Humans vs. AI cs.HC · 2026-01-17 · unverdicted · none · ref 94
Users treat human delegation for long tasks as a flexible compass but AI delegation as rigid railway tracks due to perceived AI limitations in inference and judgment.
SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios cs.SE · 2025-12-20 · unverdicted · none · ref 66
SWE-EVO shows GPT-5.4 with OpenHands reaching only 25% success on complex multi-file evolution tasks versus 72.8% on SWE-Bench Verified, and introduces Fix Rate as a partial-progress metric.
Contextualized Code Pretraining for Code Generation cs.SE · 2026-05-18 · unverdicted · none · ref 52
Introduces contextualized code pretraining with caller-callee pairs from static analysis to train CallerGen models that outperform baselines on the new CallerEval benchmark.
Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code cs.SE · 2026-05-06 · accept · none · ref 149
A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.
REAgent: Requirement-Driven LLM Agents for Software Issue Resolution cs.SE · 2026-04-08 · unverdicted · none · ref 87
REAgent improves LLM patch generation for software issues by 17.4% on average through automated construction, quality checking, and iterative refinement of structured issue-oriented requirements.
Compiling Code LLMs into Lightweight Executables cs.SE · 2026-03-31 · conditional · none · ref 78
Ditto quantizes Code LLMs with K-Means codebooks and compiles inference via LLVM-BLAS replacement to deliver up to 10.5x faster, 6.4x smaller, and 10.5x lower-energy execution on commodity hardware while losing only 0.27% pass@1 accuracy.
How Helpful is LLM Assistance in Network Operations? A Case Study at a Large Demonstration Network cs.NI · 2026-05-19 · unverdicted · none · ref 6
A case study with 105 network engineers found that an LLM chatbot with RAG, CLI control, and ticket access received positive evaluations in 68.1% of interactions while assisting with building and operating a large demonstration network.
Bias in the Loop: Auditing LLM-as-a-Judge for Software Engineering cs.SE · 2026-04-18 · unverdicted · none · ref 33
LLM judges for code tasks show high sensitivity to prompt biases that systematically favor certain options, changing accuracy and model rankings even when code is unchanged.
CodeWiki: Evaluating AI's Ability to Generate Holistic Documentation for Large-Scale Codebases cs.SE · 2025-10-28 · unverdicted · none · ref 55
CodeWiki presents a unified framework for repository-level documentation across seven languages using hierarchical decomposition, recursive multi-agent processing, and multi-modal synthesis, outperforming DeepWiki by 4.73% on CodeWikiBench.
Security of LLM-generated Code: A Comparative Analysis cs.SE · 2026-05-21 · unverdicted · none · ref 87
Empirical evaluation shows that code generated by all seven tested LLMs contains vulnerabilities, the majority of which are of critical or high severity.
Evaluation of LLM-Based Software Engineering Tools: Practices, Challenges, and Future Directions cs.SE · 2026-04-27 · unverdicted · none · ref 44
LLM-based SE tools lack stable ground truth and deterministic outputs, making standard evaluation assumptions invalid and requiring new approaches for reliable assessment.
Towards Enabling An Artificial Self-Construction Software Life-cycle via Autopoietic Architectures cs.SE · 2026-04-15 · unverdicted · none · ref 73
Proposes autopoietic architectures for self-constructing software as a fundamental shift in the SDLC, leveraging foundation models for autonomous evolution and maintenance.
REFLEX: Reference-Free Evaluation of Log Summarization via Large Language Model Judgment cs.CL · 2025-11-06 · unverdicted · none · ref 22
REFLEX is a reference-free LLM-based evaluation metric for log summarization that assesses quality on relevance, informativeness, and coherence without gold references or human annotations.
Foundational Design Principles and Patterns for Building Robust and Adaptive GenAI-Native Systems cs.SE · 2025-08-21 · unverdicted · none · ref 66
Proposes five foundational pillars and architectural patterns for building robust GenAI-native systems by combining AI with software engineering principles.
From Cool Demos to Production-Ready FMware: Core Challenges and a Technology Roadmap cs.SE · 2024-10-28 · unverdicted · none · ref 120
A semi-structured thematic synthesis identifies core challenges in FM selection, alignment, prompting, orchestration, testing, deployment, and cross-cutting concerns like observability for production-ready FMware.
Understanding the Human-LLM Dynamic: A Literature Survey of LLM Use in Programming Tasks cs.SE · 2024-10-01 · unverdicted · none · ref 118
A survey of user studies on LLM use in programming that identifies interaction behaviors, mixed benefits and weaknesses, and factors influencing human and task performance.
