arxiv: 2604.23509 · v1 · submitted 2026-04-26 · 💻 cs.SE

Uncovering Business Logic Bugs via Semantics-Driven Unit Test Generation

Pith reviewed 2026-05-08 06:01 UTC · model grok-4.3

classification 💻 cs.SE
keywords business logic bugs · unit test generation · semantics-driven testing · requirement documents · LLM-based testing · enterprise software · Go projects · bug detection

The pith

SeGa builds semantic knowledge bases from requirement documents to guide unit test generation toward business logic bugs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SeGa as a method that turns product requirement documents into a structured semantic knowledge base to direct the creation of unit tests. It groups requirements into functionality entries, retrieves those relevant to a given method, and derives explicit business scenarios with preconditions, actions, outcomes, and constraints. These scenarios then steer large language models to produce tests that target violations of intended business semantics rather than code structure alone. A sympathetic reader would care because business logic bugs are common in enterprise software yet evade most code-centric testing approaches.

Core claim

SeGa constructs a semantic knowledge base from product requirement documents, represented as a set of functionality entries that group related requirements under a common business intent. Given a focal method, SeGa retrieves the relevant functionality entries and derives fine-grained business scenarios with explicit preconditions, triggering actions, expected outcomes, and semantic constraints to guide LLM-based test generation.
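
The shape of this knowledge base can be pictured with a minimal Go sketch. Everything in it is an assumption made for illustration: the type and field names, the sample "archived item" rule, and above all the keyword-overlap ranking, which merely stands in for whatever retrieval mechanism the paper actually uses.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// FunctionalityEntry groups related requirements under one business intent.
// Field names are hypothetical, not SeGa's actual schema.
type FunctionalityEntry struct {
	Intent       string
	Requirements []string
}

// Scenario mirrors the four scenario parts named in the text.
type Scenario struct {
	Preconditions   []string
	Action          string
	ExpectedOutcome string
	Constraints     []string
}

// tokens lowercases s and splits it into a set of alphanumeric words.
func tokens(s string) map[string]bool {
	set := make(map[string]bool)
	words := strings.FieldsFunc(strings.ToLower(s), func(r rune) bool {
		return !(r >= 'a' && r <= 'z') && !(r >= '0' && r <= '9')
	})
	for _, w := range words {
		set[w] = true
	}
	return set
}

// Retrieve ranks entries by the number of distinct words they share with the
// focal method description and returns at most k entries with non-zero overlap.
// Keyword overlap is an assumed placeholder for the paper's retrieval step.
func Retrieve(kb []FunctionalityEntry, focal string, k int) []FunctionalityEntry {
	query := tokens(focal)
	type scored struct {
		entry FunctionalityEntry
		score int
	}
	var hits []scored
	for _, e := range kb {
		score := 0
		for w := range tokens(e.Intent + " " + strings.Join(e.Requirements, " ")) {
			if query[w] {
				score++
			}
		}
		if score > 0 {
			hits = append(hits, scored{e, score})
		}
	}
	sort.SliceStable(hits, func(i, j int) bool { return hits[i].score > hits[j].score })
	if len(hits) > k {
		hits = hits[:k]
	}
	out := make([]FunctionalityEntry, 0, len(hits))
	for _, h := range hits {
		out = append(out, h.entry)
	}
	return out
}

func main() {
	// Hypothetical knowledge base with two functionality entries.
	kb := []FunctionalityEntry{
		{Intent: "item operation management", Requirements: []string{"an archived item cannot be updated"}},
		{Intent: "refund processing", Requirements: []string{"a refund must not exceed the paid amount"}},
	}
	for _, e := range Retrieve(kb, "update item", 1) {
		fmt.Println("retrieved:", e.Intent)
	}
	// A scenario that could be derived from the retrieved entry.
	s := Scenario{
		Preconditions:   []string{"item is archived"},
		Action:          "update item",
		ExpectedOutcome: "update is rejected",
		Constraints:     []string{"item state stays unchanged"},
	}
	fmt.Printf("%+v\n", s)
}
```

Filtering out entries with zero overlap reflects the motivation quoted for Figure 2: passing every functionality as context distracts the generator, so only entries relevant to the focal method should reach the prompt.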

What carries the argument

The semantic knowledge base of functionality entries built from requirement documents, from which SeGa retrieves the business intents relevant to a focal method and translates them into guiding scenarios for test generation.

If this is right

  • SeGa detects 22 to 25 more business logic bugs than each of four state-of-the-art LLM-based techniques on four industrial Go projects containing 60 known bugs.
  • It raises precision of bug detection by 26.9 percent to 34.3 percent over the same baselines.
  • Deployment on six production repositories surfaces 16 previously unknown business logic bugs that developers confirm and fix.
  • The industrial evaluation yields concrete lessons for applying semantics-driven testing in practice.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same requirement-to-scenario pipeline could be adapted to languages other than Go without changing the core retrieval and derivation steps.
  • Integrating the derived scenarios into continuous integration pipelines might surface business logic issues earlier in the development cycle.
  • Requirement documents that follow the functionality-entry format may themselves become easier to maintain once teams observe their direct use in test generation.
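
To make the continuous integration idea concrete, here is a hedged sketch of a scenario-derived check that a CI job could run. `Item`, `UpdateItem`, `ErrArchived`, and the rule that archived items are read-only are all hypothetical stand-ins; the paper does not publish its focal methods or generated tests in this form.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// Item and UpdateItem are hypothetical stand-ins for a focal method.
type Item struct {
	Name     string
	Archived bool
}

// ErrArchived signals a violation of the assumed read-only rule.
var ErrArchived = errors.New("item is archived")

// UpdateItem renames an item; the assumed business rule is that archived
// items must not change.
func UpdateItem(it *Item, name string) error {
	if it.Archived {
		return ErrArchived
	}
	it.Name = name
	return nil
}

// scenarioCase encodes one business scenario as data.
type scenarioCase struct {
	label    string
	start    Item   // precondition
	newName  string // input of the triggering action
	wantErr  error  // expected outcome
	wantName string // semantic constraint on the resulting state
}

// CheckScenarios runs every case against UpdateItem and returns the labels of
// the scenarios whose outcome or constraint was violated.
func CheckScenarios(cases []scenarioCase) []string {
	var failed []string
	for _, c := range cases {
		it := c.start // copy, so each case starts from its own precondition
		err := UpdateItem(&it, c.newName)
		if !errors.Is(err, c.wantErr) || it.Name != c.wantName {
			failed = append(failed, c.label)
		}
	}
	return failed
}

func main() {
	cases := []scenarioCase{
		{label: "archived item rejects update", start: Item{Name: "old", Archived: true},
			newName: "new", wantErr: ErrArchived, wantName: "old"},
		{label: "active item accepts update", start: Item{Name: "old"},
			newName: "new", wantErr: nil, wantName: "new"},
	}
	if failed := CheckScenarios(cases); len(failed) > 0 {
		fmt.Println("violated scenarios:", failed)
		os.Exit(1) // a non-zero exit code fails the CI job
	}
	fmt.Println("all scenarios hold")
}
```

Keeping scenarios as table rows means a changed requirement becomes a changed row rather than a rewritten test body, which is one plausible way the derived scenarios could stay in step with the requirement documents.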

Load-bearing premise

Product requirement documents supply complete and accurate business semantics that can be grouped into functionality entries and turned into effective test scenarios without significant loss or misinterpretation.

What would settle it

SeGa would be falsified if it produced no measurable improvement in bug detection on projects where requirement documents are known to be incomplete, outdated, or ambiguous.

Figures

Figures reproduced from arXiv: 2604.23509 by Chen Yang, Junjie Chen.

Figure 1. Motivating Example.
Figure 2. Overview of SeGa. Excerpt from the surrounding text: "… concerned only with the Item Operation Management functionality. When all functionalities are provided as context, LLM-based generators (e.g., CHATTESTER, SymPrompt, HITS, and RATester) are easily distracted by irrelevant information and generate tests that either miss the bug or trigger spurious failures. This motivates a semantic retrieval mechanism that identifies the specific functional…"
Figure 3. Overlap of bugs detected by different techniques. Excerpt from the surrounding text: "From the figure, SeGa identifies 29 bugs in total, while CHATTESTER, SymPrompt, HITS, and RATester detect 7, 7, 6, and 4 bugs, respectively. Notably, SeGa covers nearly all bugs found by the other techniques (missing only one uniquely detected by SymPrompt) and also discovers the largest number of bugs that no baseline can find. This highlig…"
Figure 4. Simplified and desensitized versions of two previously unknown business logic bugs.
Original abstract

Business logic bugs violate intended business semantics and are particularly prevalent in enterprise software. Yet most existing unit test generation techniques are code-centric, making such bugs difficult to expose. We present SeGa, a semantics-driven unit test generation technique for uncovering business logic bugs. SeGa constructs a semantic knowledge base from product requirement documents, represented as a set of functionality entries that group related requirements under a common business intent. Given a focal method, SeGa retrieves the relevant functionality entries and derives fine-grained business scenarios with explicit preconditions, triggering actions, expected outcomes, and semantic constraints to guide LLM-based test generation. We evaluate SeGa on four industrial Go projects containing 60 real-world business logic bugs. SeGa detects 22-25 more bugs than four state-of-the-art LLM-based techniques and improves precision by 26.9%-34.3%. Deployment across 6 production repositories further uncovers 16 previously unknown business logic bugs that were confirmed and fixed by developers. From our industrial study, we summarize a series of lessons and suggestions for practical use and future research.
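
The "22-25 more bugs" range can be cross-checked against the per-technique counts quoted in the Figure 3 excerpt (SeGa 29; CHATTESTER 7, SymPrompt 7, HITS 6, RATester 4). The short Go sketch below does only that arithmetic; the function name `DeltaRange` is made up for this note.

```go
package main

import "fmt"

// DeltaRange returns the smallest and largest difference between the
// reference count and each baseline count.
func DeltaRange(ref int, baselines []int) (lo, hi int) {
	for i, b := range baselines {
		d := ref - b
		if i == 0 || d < lo {
			lo = d
		}
		if i == 0 || d > hi {
			hi = d
		}
	}
	return lo, hi
}

func main() {
	// Counts as quoted in the Figure 3 excerpt.
	lo, hi := DeltaRange(29, []int{7, 7, 6, 4})
	fmt.Printf("SeGa detects %d-%d more bugs than the baselines\n", lo, hi)
}
```

The differences are 22, 22, 23, and 25, so the range in the abstract is consistent with the figure.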

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents SeGa, a semantics-driven unit test generation technique that builds a knowledge base of functionality entries from product requirement documents, retrieves relevant entries for a focal method, and derives fine-grained business scenarios (preconditions, actions, outcomes, constraints) to guide LLM-based test generation. It claims to detect 22-25 more business logic bugs than four state-of-the-art LLM-based techniques on 60 real bugs from four industrial Go projects, with 26.9%-34.3% precision gains, plus 16 new bugs found and fixed in deployment across six production repositories.

Significance. If the evaluation controls and bug identification details hold, the work offers a practical advance for exposing business logic bugs that code-centric methods miss, by bridging requirements semantics to test generation. The industrial deployment results and summarized lessons provide direct value to practitioners in enterprise software testing.

major comments (3)
  1. [Abstract / Evaluation] Abstract and Evaluation section: The central claim that SeGa detects 22-25 more bugs and improves precision by 26.9%-34.3% over four LLM-based baselines is load-bearing, yet no control experiment supplies the baselines with equivalent semantic information extracted from the same requirement documents. Without this, it remains unclear whether gains stem from SeGa's specific retrieval, grouping into functionality entries, and scenario derivation (preconditions etc.) or from any form of semantics injection.
  2. [Evaluation] Evaluation setup: The paper reports results on 60 real-world business logic bugs but provides insufficient detail on the identification and verification process (how bugs were located, classified as business-logic rather than other defect types, and confirmed independently of SeGa). This information is necessary to assess selection bias and the validity of the performance deltas.
  3. [Methodology] Methodology (requirements processing): The technique assumes product requirement documents yield complete, unambiguous functionality entries that can be reliably translated into scenarios without significant loss. No analysis or mitigation strategy is presented for incomplete, conflicting, or evolving requirements, which is a load-bearing assumption for industrial applicability.
minor comments (2)
  1. [Abstract] Abstract: The distinction between the four evaluation projects and the six deployment repositories could be stated more explicitly to avoid reader confusion.
  2. [Throughout] Notation and terminology: Ensure all acronyms (e.g., LLM) receive an initial expansion on first use in the main body, and that 'functionality entry' is defined consistently before its repeated use.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which has helped us identify areas to strengthen the manuscript. We address each major comment point by point below, indicating the revisions we plan to make.

  1. Referee: [Abstract / Evaluation] Abstract and Evaluation section: The central claim that SeGa detects 22-25 more bugs and improves precision by 26.9%-34.3% over four LLM-based baselines is load-bearing, yet no control experiment supplies the baselines with equivalent semantic information extracted from the same requirement documents. Without this, it remains unclear whether gains stem from SeGa's specific retrieval, grouping into functionality entries, and scenario derivation (preconditions etc.) or from any form of semantics injection.

    Authors: We agree that the absence of a control supplying the baselines with equivalent semantic information leaves some ambiguity about whether the observed gains derive specifically from SeGa's retrieval, grouping, and scenario derivation steps or from semantics injection in general. The baselines are published as code-centric techniques, so our primary comparison evaluates them in their standard form to reflect real-world usage. To address this directly, we will add a new control experiment in the revised Evaluation section in which we augment the baseline prompts with the same functionality entries and scenarios extracted by SeGa, then re-run the comparisons. This will allow readers to better isolate the contribution of our structured processing pipeline. revision: yes

  2. Referee: [Evaluation] Evaluation setup: The paper reports results on 60 real-world business logic bugs but provides insufficient detail on the identification and verification process (how bugs were located, classified as business-logic rather than other defect types, and confirmed independently of SeGa). This information is necessary to assess selection bias and the validity of the performance deltas.

    Authors: We acknowledge that additional transparency is required on the bug collection and verification process to allow proper assessment of selection bias. In the revised manuscript, we will expand the Evaluation section with a dedicated subsection detailing: (1) the sources used to locate the 60 bugs (developer-reported issues, internal testing logs, and code review records), (2) the classification criteria distinguishing business-logic bugs (semantic violations of business rules) from other defect types, and (3) the independent verification steps, including cross-validation by at least two developers per bug and confirmation against the original requirement documents without reference to SeGa outputs. revision: yes

  3. Referee: [Methodology] Methodology (requirements processing): The technique assumes product requirement documents yield complete, unambiguous functionality entries that can be reliably translated into scenarios without significant loss. No analysis or mitigation strategy is presented for incomplete, conflicting, or evolving requirements, which is a load-bearing assumption for industrial applicability.

    Authors: This observation correctly identifies a load-bearing assumption in our methodology. While the industrial deployment demonstrated practical utility, we did not systematically analyze robustness to incomplete, conflicting, or evolving requirements. In the revision, we will add a new subsection under Discussion that (a) acknowledges this limitation, (b) describes observed challenges from the six-repository deployment, and (c) outlines mitigation strategies such as LLM-assisted conflict detection during knowledge-base construction and graceful fallback to code-only test generation when requirements are insufficient. We will also report any relevant statistics from the deployment study. revision: yes
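
The graceful fallback mentioned in response 3 could be as simple as a threshold check on retrieval quality. The sketch below assumes each retrieved entry carries a relevance score between 0 and 1; the `RetrievedEntry` type, the `Mode` values, and the threshold are hypothetical and not taken from the paper.

```go
package main

import "fmt"

// Mode names the test generation strategy to use for a focal method.
type Mode string

const (
	SemanticsGuided Mode = "semantics-guided"
	CodeOnly        Mode = "code-only"
)

// RetrievedEntry pairs a business intent with an assumed relevance score
// in the range [0, 1].
type RetrievedEntry struct {
	Intent string
	Score  float64
}

// ChooseMode selects semantics-guided generation when at least one retrieved
// entry reaches minScore and falls back to code-only generation otherwise,
// including when nothing was retrieved at all.
func ChooseMode(entries []RetrievedEntry, minScore float64) Mode {
	for _, e := range entries {
		if e.Score >= minScore {
			return SemanticsGuided
		}
	}
	return CodeOnly
}

func main() {
	strong := []RetrievedEntry{{Intent: "item operation management", Score: 0.8}}
	weak := []RetrievedEntry{{Intent: "refund processing", Score: 0.2}}
	fmt.Println(ChooseMode(strong, 0.5))
	fmt.Println(ChooseMode(weak, 0.5))
	fmt.Println(ChooseMode(nil, 0.5))
}
```

A fallback of this kind keeps the tool usable when requirement documents are missing or stale, at the cost of losing the semantic guidance for exactly those methods.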

Circularity Check

0 steps flagged

No circularity: empirical claims rest on external bugs and independent baselines

The paper describes a technique that builds a semantic knowledge base from product requirement documents, retrieves functionality entries for a focal method, and derives scenarios to guide LLM test generation. Its central claims are supported by direct evaluation on 60 externally identified real-world bugs across four industrial projects, plus deployment results uncovering 16 new bugs confirmed by developers. These results are compared against four separate state-of-the-art LLM-based techniques without any reduction of outcomes to fitted parameters, self-defined metrics, or load-bearing self-citations. No equations, ansatzes, or uniqueness theorems appear; the derivation chain consists of straightforward engineering steps whose performance is measured against external ground truth rather than constructed from the inputs themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that requirement documents are reliable and sufficient sources of business semantics; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Product requirement documents contain sufficient and accurate semantic information about business intent to derive effective test scenarios.
    Invoked when constructing the semantic knowledge base and deriving preconditions, actions, and constraints from functionality entries.

pith-pipeline@v0.9.0 · 5475 in / 1168 out tokens · 53221 ms · 2026-05-08T06:01:07.120053+00:00 · methodology
