Automated Root-Cause Subclassification and No-Code Fix Generation for Invalid Bug Reports

Emre Dinc; Eray Tuzun; Mahmut Furkan Gon; Tevfik Emre Sungur

arXiv: 2605.17561 · v1 · pith:M5LOKMGPnew · submitted 2026-05-17 · 💻 cs.SE · cs.AI · cs.MA

Pith reviewed 2026-05-19 22:21 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.MA

keywords invalid bug reports, root cause subclassification, no-code fixes, large language models, retrieval augmented generation, agentic systems, software maintenance, bug triage

The pith

Large language models with retrieval and agent techniques can subclassify root causes of invalid bug reports and generate no-code fixes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines a taxonomy that breaks invalid bug reports into root-cause subclasses such as non-reproducibility, feature requests, questions, and working as designed. It then compares three LLM configurations on a manually labeled collection of real reports to see how accurately each can assign the subclasses and how well each can draft no-code resolutions. Retrieval augmented generation performs best at subclassification while agentic web search performs best at producing usable fixes. If these automated steps prove reliable, customer support teams could shift from full manual review of every invalid report to a faster assisted workflow. The gold-standard benchmark supplies the ground-truth labels and example fixes used for all measurements.

Core claim

The authors establish a standardized taxonomy for root-cause subclassification of invalid bug reports and demonstrate through controlled experiments that different LLM setups can both detect those subclasses and generate matching no-code fixes, with results compared directly against the original human-labeled data from the reports.

What carries the argument

The standardized taxonomy of invalid bug report root-cause subclasses together with LLM configurations that add retrieval augmentation or agentic web search.
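
As a rough illustration of how such a taxonomy could be represented in code, the Python sketch below lists the seven subclass names that the paper passage quoted on this page gives as the final clusters. The `InvalidSubclass` and `parse_subclass` names and the normalization rule are assumptions made for this example, not the authors' implementation.

```python
from enum import Enum
from typing import Optional


class InvalidSubclass(Enum):
    """Root-cause subclasses of invalid bug reports, as named in the paper."""

    EXTERNAL_SYSTEM_DEPENDENCY = "External System & Dependency Issues"
    FAULTY_CONFIGURATION = "Faulty Configuration"
    FEATURE_REQUEST = "Feature Request"
    NON_REPRODUCIBLE = "Non-reproducible"
    QUESTION = "Question"
    WORKING_AS_DESIGNED = "Working as Designed"
    WRONG_VERSION = "Wrong Version"


def parse_subclass(raw: str) -> Optional[InvalidSubclass]:
    """Map free-form model output to a subclass, or None if nothing matches.

    Assumption: matching is case-insensitive and ignores surrounding
    whitespace, periods, and quotes. The paper's own parsing may differ.
    """
    cleaned = raw.strip().strip(".\"'").casefold()
    for subclass in InvalidSubclass:
        if subclass.value.casefold() == cleaned:
            return subclass
    return None
```

Returning `None` for unmatched output, rather than guessing the nearest label, keeps malformed model responses visible so they can be counted as errors during evaluation.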

Load-bearing premise

The manually created set of labeled bug reports accurately reflects the distribution and characteristics of invalid reports that occur in real software projects.

What would settle it

Apply the same subclassification and fix-generation pipeline to a fresh collection of bug reports that have been independently labeled by multiple human experts and measure the level of agreement.

Figures

Figures reproduced from arXiv: 2605.17561 by Emre Dinc, Eray Tuzun, Mahmut Furkan Gon, Tevfik Emre Sungur.

**Figure 2.** Overview of Evaluation Benchmark Curation Workflow

**Figure 3.** Overview of IssueSupport Methodology. Accompanying text (Section C, IssueSupport Methodology): We experimented with four distinct methodologies: (1) Vanilla LLM Pipeline, (2) Vanilla LLM Pipeline Without Prior Invalid Subclass Information, (3) RAG Pipeline, and (4) Agentic Web Search Pipeline. The tested methodologies return one invalid subclass and one suggested no-code fix, except for (2), and differ based on the tools and sources th…

**Figure 4.** Evaluation prompt for the Judge LLM. It employs a contrastive three-part assessment approach, providing the Judge…

Original abstract

Issues faced when using software are reported in the form of bug reports. However, many bug reports are invalid, meaning they do not require code changes, and are resolved with a no-code fix. Manually determining the root cause of the invalid bug reports and providing actionable resolutions by the customer support causes a serious waste of resources. Our goal is to introduce a standardized taxonomy for root-cause oriented invalid bug report subclassification, and perform experiments to test the accuracy of various approaches on invalid subclassification and no-code fix generation. We study how different configurations perform on a gold-standard benchmark we have created. Using a manually curated benchmark for higher quality analysis, we experimented with vanilla LLMs, Retrieval Augmented Generation, and agentic web search to identify invalid subclasses and generate no-code fixes. We evaluated the results against manually labeled ground truth data that includes the invalid subclass and no-code fixes from the original bug reports. We measured subclass detection performance with weighted F1-Score, and assessed no-code fix suggestions using BERTScore and Judge LLM success rates. For subclassification, retrieval augmented generation achieves the highest overall performance with 0.66 weighted F1, slightly outperforming vanilla LLMs at 0.65 and agentic web search at 0.64. At the subclass level, performance peaks at 0.85 F1 for Non-reproducibility and 0.79 for Feature Request and Question, while Wrong Version remains the most challenging with scores between 0.00 and 0.29. For no-code fix generation, agentic web search achieves the highest overall Judge LLM success rate at 68.9%, compared to 64.4% for RAG applications and 64.9% for vanilla LLMs, with subclass-level peaks of 87.4% for Working as Designed and 72.2% for Question.
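
The abstract reports subclass detection as a weighted F1-Score. As a minimal sketch of that metric, the function below computes the F1 score of each class and averages them with weights proportional to each class's share of the ground-truth labels. The function name and the toy labels are hypothetical and are not taken from the paper's benchmark.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)  # number of ground-truth items per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1  # weight by the class's share of y_true
    return score


# Hypothetical toy labels, for illustration only.
truth = ["Question", "Question", "Feature Request", "Wrong Version"]
predicted = ["Question", "Feature Request", "Feature Request", "Question"]
print(round(weighted_f1(truth, predicted), 4))
```

Because the weights follow class frequency, a rare subclass with a low score, as the abstract reports for Wrong Version, lowers the overall figure only slightly, which is why the per-subclass breakdown matters alongside the headline number.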

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

New taxonomy and benchmark for invalid bug report subclassification, with RAG edging out on F1 and agentic search on judge scores, but the fix evaluation rests on an unvalidated LLM proxy.

The key takeaway from this paper is a new taxonomy for root-cause subclassification of invalid bug reports along with a fresh manually labeled benchmark, plus head-to-head tests of vanilla LLMs, RAG, and agentic search for both classification and no-code fix suggestions. They report RAG achieving the best weighted F1 of 0.66 on subclassification, with agentic search at 68.9% judge success for fixes. The work breaks down performance by subclass, highlighting easier cases like Non-reproducibility and tougher ones like Wrong Version.

Releasing the benchmark stands out as a concrete contribution that others can build on. The experiments apply known methods without major new inventions, but the focus on invalid reports and no-code resolutions fills a gap in bug report handling literature. The use of an external judge LLM and ground truth from original reports keeps things somewhat independent.

A soft spot is the lack of human validation for the judge LLM scores. Without correlation data or inter-rater checks against experts, the 68.9% figure for agentic search rests on an untested proxy, and small differences over other methods could shift with better evaluation. The abstract omits benchmark size and agreement stats, though the full text likely covers more.

This paper suits researchers in software engineering who work on support automation or LLM tools for triage. It offers practical metrics and a dataset for a real workflow issue. The evidence is solid enough on the benchmark creation and basic comparisons to merit peer review, even with room for improvement on evaluation rigor. I recommend putting it through peer review with feedback on validating the automated judge.

Referee Report

2 major / 2 minor

Summary. The paper introduces a standardized taxonomy for root-cause subclassification of invalid bug reports and evaluates vanilla LLMs, retrieval-augmented generation (RAG), and agentic web search on a manually curated gold-standard benchmark for both subclassification (via weighted F1) and no-code fix generation (via BERTScore and Judge LLM success rates). It reports RAG achieving the highest overall weighted F1 of 0.66 for subclassification (with peaks at 0.85 for Non-reproducibility) and agentic web search reaching the highest Judge LLM success rate of 68.9% for fix generation (with peaks at 87.4% for Working as Designed).

Significance. If the results hold, this work has moderate practical significance for software engineering by providing an empirical comparison of LLM configurations to automate triage and resolution of invalid bug reports, potentially reducing manual support effort. The concrete metrics against an independently labeled ground-truth set and the use of an external judge LLM (avoiding internal circularity) are strengths that support reproducibility and falsifiability of the performance claims.

major comments (2)

[Results for no-code fix generation] Evaluation of no-code fix generation (results paragraph reporting 68.9% Judge LLM success rate): the claim that agentic web search outperforms RAG (64.4%) and vanilla LLMs (64.9%) rests on an unvalidated LLM judge proxy; no inter-rater agreement, correlation coefficient with human experts, or calibration study is reported for criteria such as actionability or true 'no-code' qualification, which is load-bearing for the central superiority claim given known divergences between LLM and human judgments on nuanced software-resolution tasks.
[Benchmark creation and evaluation methodology] Benchmark and evaluation setup (abstract and results sections): the gold-standard benchmark size, inter-annotator agreement for the manual labels, prompt templates, and any statistical significance tests for the small performance margins (e.g., 0.66 vs. 0.65 weighted F1) are not reported; without these, the robustness of both headline performance claims cannot be fully assessed.

minor comments (2)

[Methodology] The paper should include the full prompt templates and agentic workflow details in an appendix to support reproducibility of the RAG and web-search configurations.
[Evaluation metrics] Clarify whether BERTScore was computed against the original no-code fixes or a reference set, and report the specific BERTScore values alongside the Judge LLM rates for completeness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the opportunity to respond to the referee report. We have carefully considered the major comments and outline our responses and planned revisions below.

Referee: [Results for no-code fix generation] Evaluation of no-code fix generation (results paragraph reporting 68.9% Judge LLM success rate): the claim that agentic web search outperforms RAG (64.4%) and vanilla LLMs (64.9%) rests on an unvalidated LLM judge proxy; no inter-rater agreement, correlation coefficient with human experts, or calibration study is reported for criteria such as actionability or true 'no-code' qualification, which is load-bearing for the central superiority claim given known divergences between LLM and human judgments on nuanced software-resolution tasks.

Authors: We thank the referee for highlighting this important aspect of our evaluation. While we also provide BERTScore as a complementary automatic metric, we acknowledge the value of validating the LLM judge. In the revised version of the manuscript, we will include a small-scale human calibration study on a subset of the no-code fix generations to compute agreement with the Judge LLM, along with a discussion of the criteria used for 'actionability' and 'no-code' qualification. This will help substantiate the reported superiority of agentic web search. revision: yes
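
The response above does not say which agreement statistic the calibration study would use. One common choice for two raters is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below assumes binary pass/fail verdicts from a human expert and from the Judge LLM on the same fixes; the function name and the choice of statistic are assumptions for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

A kappa near 1 would suggest the Judge LLM tracks the human verdicts closely, while a value near 0 would mean its agreement is no better than chance, in which case the reported success rates would need to be read with caution.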
Referee: [Benchmark creation and evaluation methodology] Benchmark and evaluation setup (abstract and results sections): the gold-standard benchmark size, inter-annotator agreement for the manual labels, prompt templates, and any statistical significance tests for the small performance margins (e.g., 0.66 vs. 0.65 weighted F1) are not reported; without these, the robustness of both headline performance claims cannot be fully assessed.

Authors: We agree that providing these details is essential for assessing the reliability of our results. In the revision, we will explicitly state the size of our gold-standard benchmark, report the inter-annotator agreement achieved during the manual labeling process, include the prompt templates in the appendix or supplementary material, and conduct and report appropriate statistical significance tests (such as McNemar's test) for the differences in weighted F1 scores and success rates. These additions will address the concerns about robustness. revision: yes
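
The response names McNemar's test. A minimal sketch of its exact (binomial) form follows, assuming each method's output on every benchmark report has been reduced to a correct/incorrect flag; the function name and input format are hypothetical. Note that the test compares paired per-report outcomes, so it applies directly to accuracy or success rates rather than to weighted F1 itself.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired correctness flags.

    Returns (b, c, p_value), where b counts reports only method A got
    right and c counts reports only method B got right.
    """
    pairs = list(zip(correct_a, correct_b))
    b = sum(1 for x, y in pairs if x and not y)
    c = sum(1 for x, y in pairs if y and not x)
    n = b + c  # only discordant pairs carry information
    if n == 0:
        return b, c, 1.0
    k = min(b, c)
    # Under the null hypothesis, discordant pairs split 50/50.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)
```

Reports that both methods get right, or both get wrong, do not affect the result, so two methods with close headline scores can still differ significantly, or not at all, depending on how their errors overlap.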

Circularity Check

0 steps flagged

No circularity: empirical results rest on independent benchmark and external judge LLM

The paper reports experimental performance numbers (weighted F1 scores for subclassification and Judge LLM success rates for fix generation) obtained by running vanilla LLMs, RAG, and agentic web search against a manually curated gold-standard benchmark whose labels and no-code fixes are taken from the original bug reports. No equations, fitted parameters, or derivations appear in the provided text. No self-citations are invoked to justify uniqueness theorems, ansatzes, or load-bearing premises. The reported metrics are therefore not reducible by construction to quantities the authors themselves defined or fitted inside the same paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The study is purely empirical and relies on standard machine-learning evaluation practices rather than new mathematical axioms or invented physical entities.

pith-pipeline@v0.9.0 · 5889 in / 1196 out tokens · 34036 ms · 2026-05-19T22:21:35.727535+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: We experimented with vanilla LLMs, Retrieval Augmented Generation, and agentic web search to identify invalid subclasses and generate no-code fixes... measured subclass detection performance with weighted F1-Score, and assessed no-code fix suggestions using BERTScore and Judge LLM success rates.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: The final clusters of invalid subclasses are External System & Dependency Issues, Faulty Configuration, Feature Request, Non-reproducible, Question, Working as Designed, and Wrong Version.

Each tag describes the relation between the paper passage and the cited Recognition theorem.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
