
arxiv: 2605.17561 · v1 · pith:M5LOKMGPnew · submitted 2026-05-17 · 💻 cs.SE · cs.AI · cs.MA

Automated Root-Cause Subclassification and No-Code Fix Generation for Invalid Bug Reports

Pith reviewed 2026-05-19 22:21 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.MA
keywords invalid bug reports · root cause subclassification · no-code fixes · large language models · retrieval augmented generation · agentic systems · software maintenance · bug triage

The pith

Large language models with retrieval and agent techniques can subclassify root causes of invalid bug reports and generate no-code fixes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines a taxonomy that breaks invalid bug reports into root-cause subclasses such as non-reproducibility, feature requests, questions, and working as designed. It then compares three LLM configurations (vanilla LLMs, retrieval-augmented generation, and agentic web search) on a manually labeled collection of real reports, measuring how accurately each assigns the subclasses and how well each drafts no-code resolutions. Retrieval-augmented generation scores highest on subclassification, at 0.66 weighted F1 against 0.65 for vanilla LLMs and 0.64 for agentic web search, while agentic web search scores highest on fix generation, with a 68.9% Judge LLM success rate against 64.9% for vanilla LLMs and 64.4% for RAG. If these automated steps prove reliable, customer support teams could shift from full manual review of every invalid report to a faster assisted workflow. The gold-standard benchmark supplies the ground-truth labels and example fixes used for all measurements.

Core claim

The authors establish a standardized taxonomy for root-cause subclassification of invalid bug reports and demonstrate through controlled experiments that different LLM setups can both detect those subclasses and generate matching no-code fixes, with results compared directly against the original human-labeled data from the reports.

What carries the argument

The standardized taxonomy of invalid bug report root-cause subclasses together with LLM configurations that add retrieval augmentation or agentic web search.

Load-bearing premise

The manually created set of labeled bug reports accurately reflects the distribution and characteristics of invalid reports that occur in real software projects.

What would settle it

Apply the same subclassification and fix-generation pipeline to a fresh collection of bug reports that have been independently labeled by multiple human experts and measure the level of agreement.

Figures

Figures reproduced from arXiv: 2605.17561 by Emre Dinc, Eray Tuzun, Mahmut Furkan Gon, Tevfik Emre Sungur.

Figure 1: Invalid Bug Report Examples with Different Root…

Figure 2: Overview of Evaluation Benchmark Curation Workflow

Figure 3: Overview of IssueSupport Methodology

C. IssueSupport Methodology. We experimented with four distinct methodologies: (1) Vanilla LLM Pipeline, (2) Vanilla LLM Pipeline Without Prior Invalid Subclass Information, (3) RAG Pipeline, and (4) Agentic Web Search Pipeline. The tested methodologies return one invalid subclass and one suggested no-code fix, except for (2), and differ based on the tools and sources th…

Figure 4: Evaluation prompt for the Judge LLM. It employs a contrastive three-part assessment approach, providing the Judge…
Original abstract

Issues faced when using software are reported in the form of bug reports. However, many bug reports are invalid, meaning they do not require code changes, and are resolved with a no-code fix. Manually determining the root cause of the invalid bug reports and providing actionable resolutions by the customer support causes a serious waste of resources. Our goal is to introduce a standardized taxonomy for root-cause oriented invalid bug report subclassification, and perform experiments to test the accuracy of various approaches on invalid subclassification and no-code fix generation. We study how different configurations perform on a gold-standard benchmark we have created. Using a manually curated benchmark for higher quality analysis, we experimented with vanilla LLMs, Retrieval Augmented Generation, and agentic web search to identify invalid subclasses and generate no-code fixes. We evaluated the results against manually labeled ground truth data that includes the invalid subclass and no-code fixes from the original bug reports. We measured subclass detection performance with weighted F1-Score, and assessed no-code fix suggestions using BERTScore and Judge LLM success rates. For subclassification, retrieval augmented generation achieves the highest overall performance with 0.66 weighted F1, slightly outperforming vanilla LLMs at 0.65 and agentic web search at 0.64. At the subclass level, performance peaks at 0.85 F1 for Non-reproducibility and 0.79 for Feature Request and Question, while Wrong Version remains the most challenging with scores between 0.00 and 0.29. For no-code fix generation, agentic web search achieves the highest overall Judge LLM success rate at 68.9%, compared to 64.4% for RAG applications and 64.9% for vanilla LLMs, with subclass-level peaks of 87.4% for Working as Designed and 72.2% for Question.
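The subclassification results above are reported as weighted F1. As a reminder of what that metric aggregates, here is a minimal sketch in plain Python: per-class F1 scores averaged with weights proportional to each class's share of the ground-truth labels. The function name and the sample labels are hypothetical and chosen only for illustration; the paper's own evaluation code is not shown on this page.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores.

    Illustrative helper, not the paper's code. Classes that appear only
    in y_pred have zero support and therefore contribute zero weight.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    support = Counter(y_true)  # number of ground-truth items per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        score += (n / total) * f1  # weight by the class's share of the data
    return score


# Hypothetical labels using subclass names mentioned in the abstract.
truth = ["Question", "Question", "Feature Request", "Wrong Version"]
predicted = ["Question", "Feature Request", "Feature Request", "Question"]
score = weighted_f1(truth, predicted)
```

Because the weights follow class frequency, a rare subclass such as Wrong Version can score near zero without moving the overall figure much, which is worth keeping in mind when reading the 0.64 to 0.66 range.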

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a standardized taxonomy for root-cause subclassification of invalid bug reports and evaluates vanilla LLMs, retrieval-augmented generation (RAG), and agentic web search on a manually curated gold-standard benchmark for both subclassification (via weighted F1) and no-code fix generation (via BERTScore and Judge LLM success rates). It reports RAG achieving the highest overall weighted F1 of 0.66 for subclassification (with peaks at 0.85 for Non-reproducibility) and agentic web search reaching the highest Judge LLM success rate of 68.9% for fix generation (with peaks at 87.4% for Working as Designed).
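For readers unfamiliar with BERTScore, the sketch below illustrates its core idea only: each token embedding in the candidate text is greedily matched to its most similar token in the reference (precision), the reverse is done for recall, and the two are combined into F1. This is a simplified illustration under stated assumptions: the two-dimensional vectors are toy stand-ins, whereas the real metric uses contextual embeddings from a BERT-style model and optionally applies idf weighting and baseline rescaling. The function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_match_scores(candidate_vecs, reference_vecs):
    """Greedy-matching precision, recall, and F1 over token embeddings.

    Illustrative only: real BERTScore obtains the vectors from a
    contextual encoder rather than taking them as plain lists.
    """
    if not candidate_vecs or not reference_vecs:
        raise ValueError("both token lists must be non-empty")

    # Precision: how well each candidate token is covered by the reference.
    precision = sum(
        max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs
    ) / len(candidate_vecs)
    # Recall: how well each reference token is covered by the candidate.
    recall = sum(
        max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs
    ) / len(reference_vecs)

    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Toy embeddings: the candidate has one extra token the reference lacks.
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0]]
p, r, f1 = greedy_match_scores(candidate, reference)
```

The asymmetry between precision and recall is the reason the metric can reward a generated fix that covers the reference resolution even when it adds extra wording.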

Significance. If the results hold, this work has moderate practical significance for software engineering by providing an empirical comparison of LLM configurations to automate triage and resolution of invalid bug reports, potentially reducing manual support effort. The concrete metrics against an independently labeled ground-truth set and the use of an external judge LLM (avoiding internal circularity) are strengths that support reproducibility and falsifiability of the performance claims.

major comments (2)
  1. [Results for no-code fix generation] Evaluation of no-code fix generation (results paragraph reporting 68.9% Judge LLM success rate): the claim that agentic web search outperforms RAG (64.4%) and vanilla LLMs (64.9%) rests on an unvalidated LLM judge proxy; no inter-rater agreement, correlation coefficient with human experts, or calibration study is reported for criteria such as actionability or true 'no-code' qualification, which is load-bearing for the central superiority claim given known divergences between LLM and human judgments on nuanced software-resolution tasks.
  2. [Benchmark creation and evaluation methodology] Benchmark and evaluation setup (abstract and results sections): the gold-standard benchmark size, inter-annotator agreement for the manual labels, prompt templates, and any statistical significance tests for the small performance margins (e.g., 0.66 vs. 0.65 weighted F1) are not reported; without these, the robustness of both headline performance claims cannot be fully assessed.
minor comments (2)
  1. [Methodology] The paper should include the full prompt templates and agentic workflow details in an appendix to support reproducibility of the RAG and web-search configurations.
  2. [Evaluation metrics] Clarify whether BERTScore was computed against the original no-code fixes or a reference set, and report the specific BERTScore values alongside the Judge LLM rates for completeness.
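Major comment 1 asks for evidence that the Judge LLM agrees with human experts. One common way to quantify such agreement is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a minimal illustration, assuming each generated fix receives one verdict from a human and one from the judge; the function name and the sample verdicts are hypothetical and do not come from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    Illustrative helper: rater_a could hold human verdicts and rater_b
    the Judge LLM verdicts for the same generated fixes.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and equally long")

    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n

    # Chance agreement: product of the raters' marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters always use the same single label
    return (observed - expected) / (1.0 - expected)


# Hypothetical verdicts for ten generated fixes.
human = ["success", "success", "failure", "failure", "success",
         "failure", "success", "success", "failure", "failure"]
judge = ["success", "success", "failure", "success", "success",
         "failure", "failure", "success", "failure", "failure"]
kappa = cohens_kappa(human, judge)
```

With more than two human annotators, a multi-rater statistic such as Fleiss' kappa would be the usual substitute.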

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the opportunity to respond to the referee report. We have carefully considered the major comments and outline our responses and planned revisions below.

point-by-point responses
  1. Referee: [Results for no-code fix generation] Evaluation of no-code fix generation (results paragraph reporting 68.9% Judge LLM success rate): the claim that agentic web search outperforms RAG (64.4%) and vanilla LLMs (64.9%) rests on an unvalidated LLM judge proxy; no inter-rater agreement, correlation coefficient with human experts, or calibration study is reported for criteria such as actionability or true 'no-code' qualification, which is load-bearing for the central superiority claim given known divergences between LLM and human judgments on nuanced software-resolution tasks.

    Authors: We thank the referee for highlighting this important aspect of our evaluation. While we also provide BERTScore as a complementary automatic metric, we acknowledge the value of validating the LLM judge. In the revised version of the manuscript, we will include a small-scale human calibration study on a subset of the no-code fix generations to compute agreement with the Judge LLM, along with a discussion of the criteria used for 'actionability' and 'no-code' qualification. This will help substantiate the reported superiority of agentic web search.

    revision: yes

  2. Referee: [Benchmark creation and evaluation methodology] Benchmark and evaluation setup (abstract and results sections): the gold-standard benchmark size, inter-annotator agreement for the manual labels, prompt templates, and any statistical significance tests for the small performance margins (e.g., 0.66 vs. 0.65 weighted F1) are not reported; without these, the robustness of both headline performance claims cannot be fully assessed.

    Authors: We agree that providing these details is essential for assessing the reliability of our results. In the revision, we will explicitly state the size of our gold-standard benchmark, report the inter-annotator agreement achieved during the manual labeling process, include the prompt templates in the appendix or supplementary material, and conduct and report appropriate statistical significance tests (such as McNemar's test) for the differences in weighted F1 scores and success rates. These additions will address the concerns about robustness.

    revision: yes
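The rebuttal names McNemar's test as a candidate significance test. As a minimal sketch of how that test works on paired results, the code below takes per-report correctness flags for two configurations evaluated on the same benchmark and computes an exact two-sided p-value from the discordant pairs. The function name and sample data are hypothetical. Note that McNemar's test compares per-item correctness, so it applies naturally to success rates or accuracy; a difference in weighted F1 would need a different procedure, such as a paired bootstrap.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar test on paired correctness flags.

    Returns (only_a, only_b, p_value), where only_a counts items that
    configuration A got right and B got wrong, and only_b the reverse.
    Illustrative helper, not the authors' analysis code.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both configurations must cover the same items")

    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return only_a, only_b, 1.0  # no discordant pairs, no evidence

    # Under the null hypothesis, each discordant pair is a fair coin flip.
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return only_a, only_b, min(1.0, 2 * tail)


# Hypothetical outcomes for ten reports under two configurations.
config_a = [True] * 6 + [False] + [True] * 3
config_b = [False] * 6 + [True] + [True] * 3
only_a, only_b, p_value = mcnemar_exact(config_a, config_b)
```

Only the discordant pairs carry information here, so two configurations with nearly identical headline scores can still differ significantly, or not at all, depending on how often they disagree on individual reports.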

Circularity Check

0 steps flagged

No circularity: empirical results rest on independent benchmark and external judge LLM

full rationale

The paper reports experimental performance numbers (weighted F1 scores for subclassification and Judge LLM success rates for fix generation) obtained by running vanilla LLMs, RAG, and agentic web search against a manually curated gold-standard benchmark whose labels and no-code fixes are taken from the original bug reports. No equations, fitted parameters, or derivations appear in the provided text. No self-citations are invoked to justify uniqueness theorems, ansatzes, or load-bearing premises. The reported metrics are therefore not reducible by construction to quantities the authors themselves defined or fitted inside the same paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The study is purely empirical and relies on standard machine-learning evaluation practices rather than new mathematical axioms or invented physical entities.

pith-pipeline@v0.9.0 · 5889 in / 1196 out tokens · 34036 ms · 2026-05-19T22:21:35.727535+00:00 · methodology


