arxiv: 2605.00833 · v1 · submitted 2026-04-02 · 💻 cs.LG · cs.AI

Agentopic: A Generative AI Agent Workflow for Explainable Topic Modeling

Pith reviewed 2026-05-13 22:19 UTC · model grok-4.3

keywords topic modeling · large language models · multi-agent systems · explainable AI · hierarchical clustering · natural language explanations · BBC dataset · generative agents

The pith

Agentopic uses multiple LLM agents to identify, validate, group, and explain topics while reaching 0.95 F1 on the BBC dataset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Agentopic as a workflow in which several specialized agents collaborate to perform topic modeling with built-in traceability. Agents handle identification of topics from text, validation of those topics, hierarchical grouping, and generation of natural-language explanations for the assignments. This setup is tested on the BBC news collection, where the seeded version matches the performance of GPT-4.1 and improves on LDA while remaining close to BERTopic. The authors further demonstrate that the same workflow can expand the original five-category dataset into thousands of finer-grained, hierarchically organized topics accompanied by explanations. The central goal is to deliver topic models whose reasoning steps can be inspected directly rather than treated as opaque outputs.

Core claim

Agentopic is a multi-agent workflow that lets LLM-based agents jointly carry out topic identification, validation, hierarchical grouping, and natural-language explanation, yielding an F1-score of 0.95 when seeded with BBC topics, matching GPT-4.1, exceeding LDA at 0.93, and approaching BERTopic at 0.98, while also generating 2045 coherent topics across six hierarchy levels on unseeded data.

What carries the argument

The multi-agent workflow in which separate LLM agents perform topic identification, validation, hierarchical grouping, and natural-language explanation in sequence.
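
A minimal Python sketch of such a sequential workflow follows. The four agent functions are hypothetical stand-ins for LLM calls (plain keyword rules here), and `TopicRecord`, `run_workflow`, and the lookup tables are names assumed for illustration, not the authors' implementation. The point is only to show how each step can append to a trace that remains inspectable afterwards.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class TopicRecord:
    """State passed from agent to agent; `trace` records every decision."""
    document: str
    topic: str = ""
    valid: bool = False
    parent: str = ""
    explanation: str = ""
    trace: List[str] = field(default_factory=list)


Agent = Callable[[TopicRecord], TopicRecord]

# Hypothetical lookup tables standing in for LLM judgments.
KEYWORD_TOPICS = {"match": "football", "goal": "football",
                  "shares": "stock markets", "profit": "stock markets"}
PARENT_TOPICS = {"football": "sport", "stock markets": "business"}


def identify(rec: TopicRecord) -> TopicRecord:
    # Pick the first topic whose keyword occurs in the text.
    text = rec.document.lower()
    rec.topic = next((t for k, t in KEYWORD_TOPICS.items() if k in text), "unknown")
    rec.trace.append(f"identify: {rec.topic}")
    return rec


def validate(rec: TopicRecord) -> TopicRecord:
    # Accept any topic other than the "unknown" fallback.
    rec.valid = rec.topic != "unknown"
    rec.trace.append(f"validate: {rec.valid}")
    return rec


def group(rec: TopicRecord) -> TopicRecord:
    # Attach the topic to a parent category in the hierarchy.
    rec.parent = PARENT_TOPICS.get(rec.topic, "uncategorised")
    rec.trace.append(f"group: {rec.parent} > {rec.topic}")
    return rec


def explain(rec: TopicRecord) -> TopicRecord:
    # Produce a human-readable justification from the earlier steps.
    rec.explanation = (f"Assigned to '{rec.parent} > {rec.topic}' "
                       f"(validated: {rec.valid}).")
    rec.trace.append("explain: done")
    return rec


def run_workflow(document: str,
                 agents: Sequence[Agent] = (identify, validate, group, explain)
                 ) -> TopicRecord:
    rec = TopicRecord(document)
    for agent in agents:  # strictly sequential, as in the description above
        rec = agent(rec)
    return rec


if __name__ == "__main__":
    result = run_workflow("United won the match with a late goal")
    print(result.explanation)
    print(result.trace)
```

Keeping the trace on the record itself, rather than in a separate log, is one simple way to make every assignment auditable step by step.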

If this is right

  • Topic assignments can be audited by inspecting the sequence of agent decisions and explanations.
  • Datasets can be enriched with thousands of additional hierarchical topics and accompanying natural-language descriptions.
  • The workflow supplies an alternative to black-box topic models in domains that require traceability of results.
  • Performance remains competitive with leading methods while adding interpretability without separate post-processing steps.
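
To make the idea of a hierarchy with attached explanations concrete, here is a small Python sketch of one possible data structure. The class `TopicNode`, the helper functions, and the example topics are all assumptions for illustration; the paper's actual storage format is not described in this summary.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class TopicNode:
    """One topic in the hierarchy, with its natural-language explanation."""
    name: str
    explanation: str = ""
    children: List["TopicNode"] = field(default_factory=list)

    def add(self, child: "TopicNode") -> "TopicNode":
        self.children.append(child)
        return child


def depth(node: TopicNode) -> int:
    """Number of levels from this node down to its deepest descendant."""
    return 1 + max((depth(c) for c in node.children), default=0)


def count_topics(node: TopicNode) -> int:
    """Total number of topics in the subtree rooted at this node."""
    return 1 + sum(count_topics(c) for c in node.children)


def paths(node: TopicNode, prefix: Tuple[str, ...] = ()) -> Iterator[Tuple[str, ...]]:
    """Yield the root-to-node path for every topic, parents first."""
    current = prefix + (node.name,)
    yield current
    for child in node.children:
        yield from paths(child, current)


# Hypothetical example hierarchy.
sport = TopicNode("sport", "Articles about competitive games.")
football = sport.add(TopicNode("football", "Mentions matches, goals, clubs."))
football.add(TopicNode("transfers", "Discusses players moving between clubs."))
sport.add(TopicNode("tennis", "Mentions tournaments, sets, seeds."))

if __name__ == "__main__":
    print(depth(sport), count_topics(sport))
    for p in paths(sport):
        print(" > ".join(p))
```

With a structure like this, figures such as the number of topics and the number of hierarchy levels can be computed directly from the tree instead of being reported by hand.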

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The agent structure could be adapted to track how topics evolve over time in streaming news or social-media feeds.
  • In regulated sectors the explicit reasoning traces might satisfy documentation requirements that current topic models cannot meet.
  • Further tests on non-English corpora would show whether the explanation quality holds when the underlying LLM has weaker command of the language.
  • Integration with retrieval-augmented generation pipelines could let users query not only the topics but also the exact agent steps that produced each label.

Load-bearing premise

LLM agents can perform reliable topic validation and generate natural-language explanations without systematic bias or hallucination that would change the reported accuracy numbers.

What would settle it

Apply the same Agentopic workflow to a fresh labeled corpus, have human experts rate the generated explanations for fidelity to the underlying documents, and measure whether the F1-score remains above 0.90 or drops sharply.
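
The F1 part of that check can be sketched in a few lines of Python. This summary does not say whether the paper's F1 is macro-, micro-, or weighted-averaged, so the macro average used here is an assumption, as are the function names and the toy labels.

```python
from collections import Counter
from typing import Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        precision = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        recall = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores.append(f1)
    return sum(scores) / len(scores)


def stays_above(score: float, threshold: float = 0.90) -> bool:
    """True if the score clears the 0.90 bar mentioned in the text."""
    return score >= threshold


if __name__ == "__main__":
    gold = ["sport", "sport", "business", "tech"]
    predicted = ["sport", "business", "business", "tech"]
    score = macro_f1(gold, predicted)
    print(round(score, 4), stays_above(score))
```

The human rating of explanation fidelity is a separate measurement and is not covered by this sketch.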

Figures

Figures reproduced from arXiv: 2605.00833 by Brice Valentin Kok-Shun, David Sundaram, Gabrielle Peko, Johnny Chan.

Figure 1: Agentopic workflow: multi-agent orchestration for explainable topic modeling.

Figure 2: Section of the Agentopic hierarchy showing sports-related topics. In terms of coverage, Agentopic successfully recovers and expands upon the topic space of the original dataset. It uncovers a broader spectrum of subtopics that reflect diverse subject areas within each category. This expanded coverage is particularly evident in the fine-grained segmentation of each domain. For example, Entertainment is brok…

Original abstract

Agentopic is a novel agent-based workflow for explainable topic modeling that leverages the reasoning capabilities of Large Language Models (LLMs). Existing topic modeling approaches such as Latent Dirichlet Allocation (LDA) and BERTopic often lack transparency on how topics are assigned or grouped. Agentopic addresses this by using multiple agents that collaboratively perform topic identification, validation, hierarchical grouping, and natural language explanation. This design enables users to trace the reasoning behind topic assignments, enhancing interpretability without sacrificing accuracy. When seeded with topics from the British Broadcasting Corporation (BBC) dataset, Agentopic achieves an F1-score of 0.95, matching GPT-4.1, improving on LDA (0.93), and close to BERTopic (0.98). We used Agentopic to augment the BBC dataset with generated explanations to improve the dataset's richness and context. The unseeded Agentopic generated 2045 semantically coherent topics organized across six hierarchical levels, vastly enriching the original five-category structure. By embedding explainability throughout the workflow, Agentopic offers an interpretable alternative to black-box models, making it particularly valuable for crucial applications like finance and healthcare.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Agentopic, a multi-agent LLM workflow for explainable topic modeling. Multiple collaborative agents perform topic identification, validation, hierarchical grouping, and natural-language explanations. On the BBC dataset seeded with topics, Agentopic reports an F1-score of 0.95 (matching GPT-4.1, exceeding LDA at 0.93, and approaching BERTopic at 0.98). When unseeded, it generates 2045 semantically coherent topics across six hierarchical levels and augments the dataset with explanations.

Significance. If the performance and validation claims are substantiated, the work provides a meaningful step toward interpretable topic modeling by embedding LLM reasoning for transparency and hierarchy. This could be valuable in domains requiring auditability such as finance and healthcare, where standard models like LDA and BERTopic are opaque. The combination of agent collaboration with hierarchical structure and generated explanations distinguishes it from purely statistical or embedding-based approaches.

major comments (2)
  1. [Abstract] The central performance claim (F1 = 0.95 on seeded BBC topics) is stated without any description of the evaluation protocol, including how agent-generated topic assignments are mapped to the five BBC ground-truth categories, what prompting templates are used for validation, or how coherence is quantified. This detail is load-bearing for the claim that the workflow improves on LDA while remaining close to BERTopic.
  2. [Method / Evaluation (assumed)] Validation and explanation steps: The manuscript provides no independent check (human raters, cross-model agreement, or held-out statistical test) on the outputs of the LLM agents performing validation and natural-language explanation. Because these agents are the same class of model used for assignment, any systematic bias or hallucination directly affects the topic labels that enter the F1 calculation, undermining the reported numerical result.
minor comments (2)
  1. [Abstract] Clarify the exact LLM version referenced as 'GPT-4.1' and list all models and temperatures used in the agent workflow.
  2. [Abstract] The abstract states that Agentopic 'augment[s] the BBC dataset with generated explanations' but does not indicate whether these explanations were evaluated for factual correctness or usefulness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater transparency in the evaluation protocol and independent validation of agent outputs. We address each major comment below and will revise the manuscript to incorporate additional details and safeguards.

  1. Referee: [Abstract] The central performance claim (F1 = 0.95 on seeded BBC topics) is stated without any description of the evaluation protocol, including how agent-generated topic assignments are mapped to the five BBC ground-truth categories, what prompting templates are used for validation, or how coherence is quantified. This detail is load-bearing for the claim that the workflow improves on LDA while remaining close to BERTopic.

    Authors: We agree the abstract is too concise and omits critical evaluation details. The full manuscript (Section 4) describes the mapping via semantic similarity matching between agent-generated topics and BBC categories combined with majority voting across validation agents, the exact prompting templates used by the validation agent (including chain-of-thought instructions for consistency checks), and coherence quantification via a combination of NPMI scores and explanation fidelity metrics. To make these claims self-contained, we will expand the abstract with a one-sentence summary of the protocol and add an explicit subsection in Methods detailing the templates and mapping procedure. revision: yes

  2. Referee: [Method / Evaluation (assumed)] Validation and explanation steps: The manuscript provides no independent check (human raters, cross-model agreement, or held-out statistical test) on the outputs of the LLM agents performing validation and natural-language explanation. Because these agents are the same class of model used for assignment, any systematic bias or hallucination directly affects the topic labels that enter the F1 calculation, undermining the reported numerical result.

    Authors: This concern about potential circularity is valid. While the workflow separates the validation agent with distinct prompts that enforce cross-referencing against source text and other agents, this does not fully eliminate model-specific bias. In the revision we will add an independent validation layer: a human evaluation on a random subset of 200 topic assignments (rated by three annotators for correctness and explanation quality, with reported Fleiss' kappa), plus a cross-model agreement check using a different LLM family on the same subset. These results will be reported in a new subsection of the evaluation, directly supporting the F1 claims. revision: yes
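
For readers unfamiliar with the agreement statistic mentioned in the second response, the following Python sketch computes Fleiss' kappa from a table of rating counts. The function name and the toy data are assumptions for illustration; the sketch uses the standard textbook definition and says nothing about how the authors would actually run their evaluation.

```python
import math
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who put item i into category j. Every item needs the same rater count."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("at least two raters per item are required")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must be rated by the same number of raters")
    n_categories = len(counts[0])

    # Share of all ratings that fell into each category.
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters)
           for j in range(n_categories)]
    # Observed agreement per item, then averaged over items.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_items
    # Agreement expected by chance.
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        return math.nan  # undefined when every rating is in a single category
    return (p_bar - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    # Toy data: 4 topic assignments, 3 annotators, categories (correct, incorrect).
    unanimous = [[3, 0], [0, 3], [3, 0], [0, 3]]
    split = [[2, 1], [1, 2], [2, 1], [1, 2]]
    print(fleiss_kappa(unanimous))
    print(fleiss_kappa(split))
```

Values near 1 indicate strong agreement among annotators, while values at or below 0 indicate agreement no better than chance.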

Circularity Check

0 steps flagged

No circularity: empirical workflow results are independent of inputs

The paper presents an agent-based workflow for topic modeling and reports F1 scores (0.95 on seeded BBC data) as direct empirical outcomes of running the multi-agent process (identification, validation, grouping, explanation). No equations, predictions, or first-principles derivations are claimed that reduce by construction to fitted parameters, self-citations, or renamed inputs. Baselines (LDA 0.93, BERTopic 0.98) and GPT-4.1 comparisons are external. The unseeded run producing 2045 topics is likewise an observed output, not a tautological restatement of the workflow definition. The evaluation therefore remains self-contained against the stated benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the premise that current LLMs possess sufficient reasoning capability to act as reliable topic validators and explainers; no free parameters are explicitly fitted in the abstract, but the workflow implicitly depends on prompt engineering choices and model selection.

axioms (1)
  • [domain assumption] LLMs can perform reliable topic validation and natural-language explanation without systematic bias or hallucination
    Invoked to justify the F1 scores and generated explanations; appears in the description of the agent workflow.
invented entities (1)
  • Multiple collaborative LLM agents [no independent evidence]
    purpose: To decompose topic modeling into traceable steps of identification, validation, grouping, and explanation
    New workflow component introduced to achieve explainability; no independent evidence provided beyond the reported F1 numbers.

pith-pipeline@v0.9.0 · 5507 in / 1434 out tokens · 28007 ms · 2026-05-13T22:19:52.230193+00:00 · methodology


