Agentopic: A Generative AI Agent Workflow for Explainable Topic Modeling
Pith reviewed 2026-05-13 22:19 UTC · model grok-4.3
The pith
Agentopic uses multiple LLM agents to identify, validate, group, and explain topics while reaching 0.95 F1 on the BBC dataset.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Agentopic is a multi-agent workflow that lets LLM-based agents jointly carry out topic identification, validation, hierarchical grouping, and natural-language explanation, yielding an F1-score of 0.95 when seeded with BBC topics, matching GPT-4.1, exceeding LDA at 0.93, and approaching BERTopic at 0.98, while also generating 2045 coherent topics across six hierarchy levels on unseeded data.
What carries the argument
The multi-agent workflow in which separate LLM agents perform topic identification, validation, hierarchical grouping, and natural-language explanation in sequence.
If this is right
- Topic assignments can be audited by inspecting the sequence of agent decisions and explanations.
- Datasets can be enriched with thousands of additional hierarchical topics and accompanying natural-language descriptions.
- The workflow supplies an alternative to black-box topic models in domains that require traceability of results.
- Performance remains competitive with leading methods while adding interpretability without separate post-processing steps.
Where Pith is reading between the lines
- The agent structure could be adapted to track how topics evolve over time in streaming news or social-media feeds.
- In regulated sectors the explicit reasoning traces might satisfy documentation requirements that current topic models cannot meet.
- Further tests on non-English corpora would show whether the explanation quality holds when the underlying LLM has weaker command of the language.
- Integration with retrieval-augmented generation pipelines could let users query not only the topics but also the exact agent steps that produced each label.
Load-bearing premise
LLM agents can perform reliable topic validation and generate natural-language explanations without systematic bias or hallucination that would change the reported accuracy numbers.
What would settle it
Apply the same Agentopic workflow to a fresh labeled corpus, have human experts rate the generated explanations for fidelity to the underlying documents, and measure whether the F1-score remains above 0.90 or drops sharply.
Figures
read the original abstract
Agentopic is a novel agent-based workflow for explainable topic modeling that leverages the reasoning capabilities of Large Language Models (LLMs). Existing topic modeling approaches such as Latent Dirichlet Allocation (LDA) and BERTopic often lack transparency on how topics are assigned or grouped. Agentopic addresses this by using multiple agents that collaboratively perform topic identification, validation, hierarchical grouping, and natural language explanation. This design enables users to trace the reasoning behind topic assignments, enhancing interpretability without sacrificing accuracy. When seeded with topics from the British Broadcasting Corporation (BBC) dataset, Agentopic achieves an F1-score of 0.95, matching GPT-4.1, improving on LDA (0.93), and close to BERTopic (0.98). We used Agentopic to augment the BBC dataset with generated explanations to improve the dataset's richness and context. The unseeded Agentopic generated 2045 semantically coherent topics organized across six hierarchical levels, vastly enriching the original five-category structure. By embedding explainability throughout the workflow, Agentopic offers an interpretable alternative to black-box models, making it particularly valuable for crucial applications like finance and healthcare.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Agentopic, a multi-agent LLM workflow for explainable topic modeling. Multiple collaborative agents perform topic identification, validation, hierarchical grouping, and natural-language explanations. On the BBC dataset seeded with topics, Agentopic reports an F1-score of 0.95 (matching GPT-4.1, exceeding LDA at 0.93, and approaching BERTopic at 0.98). When unseeded, it generates 2045 semantically coherent topics across six hierarchical levels and augments the dataset with explanations.
Significance. If the performance and validation claims are substantiated, the work provides a meaningful step toward interpretable topic modeling by embedding LLM reasoning for transparency and hierarchy. This could be valuable in domains requiring auditability such as finance and healthcare, where standard models like LDA and BERTopic are opaque. The combination of agent collaboration with hierarchical structure and generated explanations distinguishes it from purely statistical or embedding-based approaches.
major comments (2)
- [Abstract] Abstract: The central performance claim (F1 = 0.95 on seeded BBC topics) is stated without any description of the evaluation protocol, including how agent-generated topic assignments are mapped to the five BBC ground-truth categories, what prompting templates are used for validation, or how coherence is quantified. This detail is load-bearing for the claim that the workflow improves on LDA while remaining close to BERTopic.
- [Method / Evaluation (assumed)] Validation and explanation steps: The manuscript provides no independent check (human raters, cross-model agreement, or held-out statistical test) on the outputs of the LLM agents performing validation and natural-language explanation. Because these agents are the same class of model used for assignment, any systematic bias or hallucination directly affects the topic labels that enter the F1 calculation, undermining the reported numerical result.
minor comments (2)
- [Abstract] Clarify the exact LLM version referenced as 'GPT-4.1' and list all models and temperatures used in the agent workflow.
- [Abstract] The abstract states that Agentopic 'augment[s] the BBC dataset with generated explanations' but does not indicate whether these explanations were evaluated for factual correctness or usefulness.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for greater transparency in the evaluation protocol and independent validation of agent outputs. We address each major comment below and will revise the manuscript to incorporate additional details and safeguards.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central performance claim (F1 = 0.95 on seeded BBC topics) is stated without any description of the evaluation protocol, including how agent-generated topic assignments are mapped to the five BBC ground-truth categories, what prompting templates are used for validation, or how coherence is quantified. This detail is load-bearing for the claim that the workflow improves on LDA while remaining close to BERTopic.
Authors: We agree the abstract is too concise and omits critical evaluation details. The full manuscript (Section 4) describes the mapping via semantic similarity matching between agent-generated topics and BBC categories combined with majority voting across validation agents, the exact prompting templates used by the validation agent (including chain-of-thought instructions for consistency checks), and coherence quantification via a combination of NPMI scores and explanation fidelity metrics. To make these claims self-contained, we will expand the abstract with a one-sentence summary of the protocol and add an explicit subsection in Methods detailing the templates and mapping procedure. revision: yes
-
Referee: [Method / Evaluation (assumed)] Validation and explanation steps: The manuscript provides no independent check (human raters, cross-model agreement, or held-out statistical test) on the outputs of the LLM agents performing validation and natural-language explanation. Because these agents are the same class of model used for assignment, any systematic bias or hallucination directly affects the topic labels that enter the F1 calculation, undermining the reported numerical result.
Authors: This concern about potential circularity is valid. While the workflow separates the validation agent with distinct prompts that enforce cross-referencing against source text and other agents, this does not fully eliminate model-specific bias. In the revision we will add an independent validation layer: a human evaluation on a random subset of 200 topic assignments (rated by three annotators for correctness and explanation quality, with reported Fleiss' kappa), plus a cross-model agreement check using a different LLM family on the same subset. These results will be reported in a new subsection of the evaluation, directly supporting the F1 claims. revision: yes
Circularity Check
No circularity: empirical workflow results are independent of inputs
full rationale
The paper presents an agent-based workflow for topic modeling and reports F1 scores (0.95 on seeded BBC data) as direct empirical outcomes of running the multi-agent process (identification, validation, grouping, explanation). No equations, predictions, or first-principles derivations are claimed that reduce by construction to fitted parameters, self-citations, or renamed inputs. Baselines (LDA 0.93, BERTopic 0.98) and GPT-4.1 comparisons are external. The unseeded run producing 2045 topics is likewise an observed output, not a tautological restatement of the workflow definition. The evaluation therefore remains self-contained against the stated benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption LLMs can perform reliable topic validation and natural-language explanation without systematic bias or hallucination
invented entities (1)
-
Multiple collaborative LLM agents
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Agentopic is a multi-agent system that orchestrates specialized generative agents... topic identification, validation, hierarchical grouping, and natural language explanation
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
When seeded with topics from the BBC dataset, Agentopic achieves an F1-score of 0.95
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Information Systems112, 102131 (Feb 2023)
Abdelrazek, A., Eid, Y., Gawish, E., Medhat, W., Hassan, A.: Topic modeling algorithms and applications: A survey. Information Systems112, 102131 (Feb 2023)
work page 2023
-
[2]
Information Systems Research25(3), 443–448 (Sep 2014)
Agarwal, R., Dhar, V.: Editorial —Big Data, Data Science, and Analytics: The Opportunity and Challenge for IS Research. Information Systems Research25(3), 443–448 (Sep 2014)
work page 2014
-
[3]
In: 2014 IEEE Conference on Visual Analytics Science and Technology (VAST)
Alexander, E., Kohlmann, J., Valenza, R., Witmore, M., Gleicher, M.: Serendip: Topic model-driven visual exploration of text corpora. In: 2014 IEEE Conference on Visual Analytics Science and Technology (VAST). pp. 173–182. IEEE, Paris, France (Oct 2014)
work page 2014
-
[4]
ACM Computing Surveys 54(10s), 1–29 (Jan 2022)
Bansal, A., Sharma, R., Kathuria, M.: A Systematic Review on Data Scarcity Problem in Deep Learning: Solution and Applications. ACM Computing Surveys 54(10s), 1–29 (Jan 2022)
work page 2022
-
[5]
Communications of the ACM55(4), 77–84 (Apr 2012)
Blei, D.M.: Probabilistic topic models. Communications of the ACM55(4), 77–84 (Apr 2012)
work page 2012
-
[6]
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. J. Mach. Learn. Res.3, 993–1022 (Mar 2003)
work page 2003
-
[7]
In: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R
Camburu, O.M., Rocktäschel, T., Lukasiewicz, T., Blunsom, P.: e-SNLI: Natural Language Inference with Natural Language Explanations. In: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems. vol. 31. Curran Associates, Inc. (2018)
work page 2018
-
[8]
ACM Computing Surveys54(7), 1–35 (Sep 2022)
Chauhan, U., Shah, A.: Topic Modeling Using Latent Dirichlet allocation: A Survey. ACM Computing Surveys54(7), 1–35 (Sep 2022)
work page 2022
-
[9]
ACM Computing Surveys54(10s), 1–35 (Jan 2022)
Churchill, R., Singh, L.: The Evolution of Topic Modeling. ACM Computing Surveys54(10s), 1–35 (Jan 2022)
work page 2022
-
[10]
Journal of Management Information Systems40(2), 307–337 (Apr 2023)
Dennis, A.R., Lakhiwal, A., Sachdeva, A.: AI Agents as Team Members: Effects on Satisfaction, Conflict, Trustworthiness, and Willingness to Work With. Journal of Management Information Systems40(2), 307–337 (Apr 2023)
work page 2023
-
[11]
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018)
work page 2018
-
[12]
Transactions of the Association for Computational Linguistics8, 439–453 (Dec 2020)
Dieng, A.B., Ruiz, F.J.R., Blei, D.M.: Topic Modeling in Embedding Spaces. Transactions of the Association for Computational Linguistics8, 439–453 (Dec 2020)
work page 2020
-
[13]
In: Proceedings of the 23rd international conference on Machine learning - ICML ’06
Greene, D., Cunningham, P.: Practical solutions to the problem of diagonal dominance in kernel document clustering. In: Proceedings of the 23rd international conference on Machine learning - ICML ’06. pp. 377–384. ACM Press, Pittsburgh, Pennsylvania (2006)
work page 2006
-
[14]
Grootendorst, M.: BERTopic: Neural topic modeling with a class-based TF-IDF procedure (2022)
work page 2022
-
[15]
Walton, P.: AI Agents and Agentic Systems: A Multi-Expert Analysis
Hughes, L., Dwivedi, Y.K., Malik, T., Shawosh, M., Albashrawi, M.A., Jeon, I., Dutot, V., Appanderanda, M., Crick, T., De’, R., Fenwick, M., Gunaratnege, S.M., Jurcys, P., Kar, A.K., Kshetri, N., Li, K., Mutasa, S., Samothrakis, S., Wade, M., 16 Kok-Shun et al. Walton, P.: AI Agents and Agentic Systems: A Multi-Expert Analysis. Journal of Computer Informa...
work page 2025
-
[16]
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient Estimation of Word Representations in Vector Space (2013)
work page 2013
-
[17]
In: Proceedings of the Conference on Fairness, Accountability, and Transparency
Mittelstadt, B., Russell, C., Wachter, S.: Explaining Explanations in AI. In: Proceedings of the Conference on Fairness, Accountability, and Transparency. pp. 279–288. ACM, Atlanta GA USA (Jan 2019)
work page 2019
-
[18]
In: 2023 Congress in Computer Science, Computer Engineering, & Applied Computing (CSCE)
Mohammad, A.F., Clark, B., Hegde, R.: Large Language Model (LLM) & GPT, A Monolithic Study in Generative AI. In: 2023 Congress in Computer Science, Computer Engineering, & Applied Computing (CSCE). pp. 383–388. IEEE, Las Vegas, NV, USA (Jul 2023)
work page 2023
-
[19]
MIS Quarterly21(2), 241 (Jun 1997)
Myers, M.D.: Qualitative Research in Information Systems. MIS Quarterly21(2), 241 (Jun 1997)
work page 1997
-
[20]
Patterns2(11), 100336 (Nov 2021)
Paullada, A., Raji, I.D., Bender, E.M., Denton, E., Hanna, A.: Data and its (dis)contents: A survey of dataset development and use in machine learning research. Patterns2(11), 100336 (Nov 2021)
work page 2021
-
[21]
In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Pennington, J., Socher, R., Manning, C.: Glove: Global Vectors for Word Representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 1532–1543. Association for Computational Linguistics, Doha, Qatar (2014)
work page 2014
-
[22]
Human-Centric Intelligent Systems 4(1), 53–76 (Jan 2024)
Rajendran, B., Vidya, C.G., Sanil, J., Asharaf, S.: A Local Explainability Technique for Graph Neural Topic Models. Human-Centric Intelligent Systems 4(1), 53–76 (Jan 2024)
work page 2024
-
[23]
International Journal of Digital Humanities6(1), 1–7 (Jan 2024)
Ries, T., Van Dalen-Oskam, K., Offert, F.: Reproducibility and explainability in digital humanities. International Journal of Digital Humanities6(1), 1–7 (Jan 2024)
work page 2024
-
[24]
In: Proceedings of the Eighth ACM International Conference on Web Search and Data Mining
Röder, M., Both, A., Hinneburg, A.: Exploring the Space of Topic Coherence Measures. In: Proceedings of the Eighth ACM International Conference on Web Search and Data Mining. pp. 399–408. ACM, Shanghai China (Feb 2015)
work page 2015
-
[25]
In: 2019 IEEE International Conference on Smart Computing (SMARTCOMP)
Singla, A., Bertino, E., Verma, D.: Overcoming the Lack of Labeled Data: Training Intrusion Detection Models Using Transfer Learning. In: 2019 IEEE International Conference on Smart Computing (SMARTCOMP). pp. 69–74. IEEE (Jun 2019)
work page 2019
-
[26]
In: Proceedings of the 2024 6th Asia Conference on Machine Learning and Computing
Taggu, A., Dubey, C., Paul, R.: Deep Learning-Driven Sentiment Analysis: Unlocking Insights in Topic-Specific Twitter Conversations. In: Proceedings of the 2024 6th Asia Conference on Machine Learning and Computing. pp. 33–37. ACM, Bangkok Thailand (Jul 2024)
work page 2024
-
[27]
Medical Image Analysis79, 102470 (Jul 2022)
Van Der Velden, B.H., Kuijf, H.J., Gilhuijs, K.G., Viergever, M.A.: Explainable artificial intelligence (XAI) in deep learning-based medical image analysis. Medical Image Analysis79, 102470 (Jul 2022)
work page 2022
-
[28]
IEEE Journal of Biomedical and Health Informatics28(4), 1836–1847 (Apr 2024)
Xie, Q., Tiwari, P., Ananiadou, S.: Knowledge-Enhanced Graph Topic Transformer for Explainable Biomedical Text Summarization. IEEE Journal of Biomedical and Health Informatics28(4), 1836–1847 (Apr 2024)
work page 2024
-
[29]
Information Processing & Management60(2), 103215 (Mar 2023)
Zhu, B., Cai, Y., Ren, H.: Graph neural topic model with commonsense knowledge. Information Processing & Management60(2), 103215 (Mar 2023)
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.