BioCon is the first benchmark dataset and cross-modal framework for detecting inconsistencies between methodological descriptions in bioinformatics papers and their code implementations.
hub
arXiv:2312.15223 [cs.SE]
21 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 4polarities
background 4representative citing papers
Fine-tuning Qwen2.5-Coder-14B on code-query-diagram triples produces UML diagrams with higher F1 scores and lower structural defect rates than base or other LLMs.
AI agents modify logging less often than humans in 58.4% of repositories but produce higher log density when they change it; explicit logging instructions are rare (4.7%) and ignored 67% of the time, with humans performing 72.5% of post-generation log repairs.
LLMs predict story points better in zero-shot prompting than supervised deep learning models trained on 80% of project data, with few-shot examples and comparative judgments further improving performance.
RubberDuckBench shows top AI models score around 68% on real GitHub coding questions, rarely answer completely correctly, and hallucinate in 58% of responses on average.
Users treat human delegation for long tasks as a flexible compass but AI delegation as rigid railway tracks due to perceived AI limitations in inference and judgment.
SWE-EVO shows GPT-5.4 with OpenHands reaching only 25% success on complex multi-file evolution tasks versus 72.8% on SWE-Bench Verified, and introduces Fix Rate as a partial-progress metric.
Introduces contextualized code pretraining with caller-callee pairs from static analysis to train CallerGen models that outperform baselines on the new CallerEval benchmark.
A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.
REAgent improves LLM patch generation for software issues by 17.4% on average through automated construction, quality checking, and iterative refinement of structured issue-oriented requirements.
Ditto quantizes Code LLMs with K-Means codebooks and compiles inference via LLVM-BLAS replacement to deliver up to 10.5x faster, 6.4x smaller, and 10.5x lower-energy execution on commodity hardware while losing only 0.27% pass@1 accuracy.
A case study with 105 network engineers found that an LLM chatbot with RAG, CLI control, and ticket access received positive evaluations in 68.1% of interactions while assisting with building and operating a large demonstration network.
LLM judges for code tasks show high sensitivity to prompt biases that systematically favor certain options, changing accuracy and model rankings even when code is unchanged.
CodeWiki presents a unified framework for repository-level documentation across seven languages using hierarchical decomposition, recursive multi-agent processing, and multi-modal synthesis, outperforming DeepWiki by 4.73% on CodeWikiBench.
Empirical evaluation shows that code generated by all seven tested LLMs contains vulnerabilities, the majority of critical or high severity.
LLM-based SE tools lack stable ground truth and deterministic outputs, making standard evaluation assumptions invalid and requiring new approaches for reliable assessment.
Proposes autopoietic architectures for self-constructing software as a fundamental shift in the SDLC, leveraging foundation models for autonomous evolution and maintenance.
REFLEX is a reference-free LLM-based evaluation metric for log summarization that assesses quality on relevance, informativeness, and coherence without gold references or human annotations.
Proposes five foundational pillars and architectural patterns for building robust GenAI-native systems by combining AI with software engineering principles.
A semi-structured thematic synthesis identifies core challenges in FM selection, alignment, prompting, orchestration, testing, deployment, and cross-cutting concerns like observability for production-ready FMware.
A survey of user studies on LLM use in programming that identifies interaction behaviors, mixed benefits and weaknesses, and factors influencing human and task performance.
citing papers explorer
-
Do Papers Tell the Whole Story? A Benchmark and Framework for Uncovering Hidden Implementation Gaps in Bioinformatics
BioCon is the first benchmark dataset and cross-modal framework for detecting inconsistencies between methodological descriptions in bioinformatics papers and their code implementations.
-
Query2Diagram: Answering Developer Queries with UML Diagrams
Fine-tuning Qwen2.5-Coder-14B on code-query-diagram triples produces UML diagrams with higher F1 scores and lower structural defect rates than base or other LLMs.
-
Do AI Coding Agents Log Like Humans? An Empirical Study
AI agents modify logging less often than humans in 58.4% of repositories but produce higher log density when they change it; explicit logging instructions are rare (4.7%) and ignored 67% of the time, with humans performing 72.5% of post-generation log repairs.
-
Story Point Estimation Using Large Language Models
LLMs predict story points better in zero-shot prompting than supervised deep learning models trained on 80% of project data, with few-shot examples and comparative judgments further improving performance.
-
RubberDuckBench: A Benchmark for AI Coding Assistants
RubberDuckBench shows top AI models score around 68% on real GitHub coding questions, rarely answer completely correctly, and hallucinate in 58% of responses on average.
-
Compass vs Railway Tracks: Unpacking User Mental Models for Communicating Long-Horizon Work to Humans vs. AI
Users treat human delegation for long tasks as a flexible compass but AI delegation as rigid railway tracks due to perceived AI limitations in inference and judgment.
-
SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios
SWE-EVO shows GPT-5.4 with OpenHands reaching only 25% success on complex multi-file evolution tasks versus 72.8% on SWE-Bench Verified, and introduces Fix Rate as a partial-progress metric.
-
Contextualized Code Pretraining for Code Generation
Introduces contextualized code pretraining with caller-callee pairs from static analysis to train CallerGen models that outperform baselines on the new CallerEval benchmark.
-
Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code
A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.
-
REAgent: Requirement-Driven LLM Agents for Software Issue Resolution
REAgent improves LLM patch generation for software issues by 17.4% on average through automated construction, quality checking, and iterative refinement of structured issue-oriented requirements.
-
Compiling Code LLMs into Lightweight Executables
Ditto quantizes Code LLMs with K-Means codebooks and compiles inference via LLVM-BLAS replacement to deliver up to 10.5x faster, 6.4x smaller, and 10.5x lower-energy execution on commodity hardware while losing only 0.27% pass@1 accuracy.
-
How Helpful is LLM Assistance in Network Operations? A Case Study at a Large Demonstration Network
A case study with 105 network engineers found that an LLM chatbot with RAG, CLI control, and ticket access received positive evaluations in 68.1% of interactions while assisting with building and operating a large demonstration network.
-
Bias in the Loop: Auditing LLM-as-a-Judge for Software Engineering
LLM judges for code tasks show high sensitivity to prompt biases that systematically favor certain options, changing accuracy and model rankings even when code is unchanged.
-
CodeWiki: Evaluating AI's Ability to Generate Holistic Documentation for Large-Scale Codebases
CodeWiki presents a unified framework for repository-level documentation across seven languages using hierarchical decomposition, recursive multi-agent processing, and multi-modal synthesis, outperforming DeepWiki by 4.73% on CodeWikiBench.
-
Security of LLM-generated Code: A Comparative Analysis
Empirical evaluation shows that code generated by all seven tested LLMs contains vulnerabilities, the majority of critical or high severity.
-
Evaluation of LLM-Based Software Engineering Tools: Practices, Challenges, and Future Directions
LLM-based SE tools lack stable ground truth and deterministic outputs, making standard evaluation assumptions invalid and requiring new approaches for reliable assessment.
-
Towards Enabling An Artificial Self-Construction Software Life-cycle via Autopoietic Architectures
Proposes autopoietic architectures for self-constructing software as a fundamental shift in the SDLC, leveraging foundation models for autonomous evolution and maintenance.
-
REFLEX: Reference-Free Evaluation of Log Summarization via Large Language Model Judgment
REFLEX is a reference-free LLM-based evaluation metric for log summarization that assesses quality on relevance, informativeness, and coherence without gold references or human annotations.
-
Foundational Design Principles and Patterns for Building Robust and Adaptive GenAI-Native Systems
Proposes five foundational pillars and architectural patterns for building robust GenAI-native systems by combining AI with software engineering principles.
-
From Cool Demos to Production-Ready FMware: Core Challenges and a Technology Roadmap
A semi-structured thematic synthesis identifies core challenges in FM selection, alignment, prompting, orchestration, testing, deployment, and cross-cutting concerns like observability for production-ready FMware.
-
Understanding the Human-LLM Dynamic: A Literature Survey of LLM Use in Programming Tasks
A survey of user studies on LLM use in programming that identifies interaction behaviors, mixed benefits and weaknesses, and factors influencing human and task performance.