pith. machine review for the scientific record.

arxiv: 2604.07562 · v2 · submitted 2026-04-08 · 💻 cs.CL · cs.AI · cs.CY · cs.LG

Recognition: unknown

Reasoning-Based Refinement of Unsupervised Text Clusters with LLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:35 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CY · cs.LG
keywords unsupervised clustering · LLM reasoning · cluster refinement · coherence verification · redundancy adjudication · label grounding · social media analysis · semantic validation

The pith

Large language models refine unsupervised text clusters by serving as semantic judges to verify coherence, remove redundancies, and ground labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Unsupervised methods for grouping text often produce clusters that lack coherence or have labels that do not reflect the content well, making them hard to use for understanding large text collections. This paper introduces a framework that applies large language models in three reasoning stages to refine the results of any unsupervised clustering algorithm. The stages include checking whether a cluster's summary is supported by its member texts, deciding whether to merge or reject overlapping clusters, and generating and consolidating labels in an unsupervised way. This matters for applications like analyzing social media because it improves the quality of the clusters and their labels without needing any labeled training data. The approach shows better results than standard topic models and other recent methods on data from two different platforms.

Core claim

The paper claims that LLMs can act as semantic judges to validate and restructure outputs from arbitrary unsupervised clustering algorithms through coherence verification, redundancy adjudication, and label grounding. This leads to clusters with improved coherence and labels that align better with human judgments, as shown in evaluations on real-world social media data from two platforms with distinct interaction models. The design avoids using LLMs for embeddings and instead focuses on reasoning to mitigate common failures of unsupervised methods.

What carries the argument

The reasoning-based refinement framework with its three stages of coherence verification against member texts, redundancy adjudication for semantic overlap, and two-stage unsupervised label grounding.
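The control flow of those three stages can be sketched as a small pipeline. This is a minimal illustration, not the paper's implementation: the LLM judges are abstracted as pluggable callables, and `judge_coherent`, `judge_merge`, and `make_label` are hypothetical names.

```python
from typing import Callable

Cluster = dict  # {"summary": str, "texts": list[str]}

def refine(clusters: list[Cluster],
           judge_coherent: Callable[[str, list[str]], bool],
           judge_merge: Callable[[str, str], bool],
           make_label: Callable[[list[str]], str]) -> list[Cluster]:
    # Stage 1: coherence verification. Drop clusters whose summary
    # the judge deems unsupported by the member texts.
    kept = [c for c in clusters if judge_coherent(c["summary"], c["texts"])]

    # Stage 2: redundancy adjudication. Greedily merge each cluster
    # into the first accepted cluster it semantically overlaps.
    merged: list[Cluster] = []
    for c in kept:
        for m in merged:
            if judge_merge(m["summary"], c["summary"]):
                m["texts"].extend(c["texts"])
                break
        else:
            merged.append({**c, "texts": list(c["texts"])})

    # Stage 3: label grounding. Generate an interpretable label
    # for each refined cluster.
    for m in merged:
        m["label"] = make_label(m["texts"])
    return merged
```

In the paper the judges are LLM prompts; making them plain callables here also makes each stage testable in isolation.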

If this is right

  • Refined clusters exhibit greater coherence and fewer redundancies than those produced by classical topic models or embedding-based approaches.
  • Cluster labels achieve higher alignment with human interpretations despite the lack of gold-standard annotations.
  • The framework maintains effectiveness across different social media platforms and under matched temporal and volume conditions.
  • LLM-based reasoning provides a general mechanism for validating unsupervised semantic structures in large text collections.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method could be applied to other domains involving unsupervised grouping, such as document collections in specific fields, to enhance interpretability.
  • The separation between the clustering step and the refinement step allows users to choose any base clustering algorithm and still benefit from the LLM validation.
  • Future improvements in LLM reasoning abilities would directly translate to better cluster refinements without modifying the framework itself.
  • Such refinement techniques might help in creating more trustworthy summaries or visualizations of large-scale text data for non-expert users.

Load-bearing premise

That large language models can reliably judge cluster coherence, semantic redundancy, and suitable labels in an unsupervised manner that works across various data sources and initial clustering qualities.

What would settle it

Conduct a human evaluation study on a fresh set of social media posts where participants rate the coherence and label quality of both original and LLM-refined clusters; if the refined clusters do not receive significantly higher ratings, the claimed improvements would not hold.
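One concrete way to score such a study: collect per-cluster ratings for the original and refined clusters and compare the two samples with a rank-based test. Below is a minimal Mann-Whitney U statistic (no tie correction, no p-value; a full analysis would use a statistics package):

```python
def mann_whitney_u(refined: list[float], original: list[float]) -> float:
    """U statistic counting how often a refined-cluster rating beats
    an original-cluster rating (ties count half). Values well above
    len(refined) * len(original) / 2 favor the refined clusters."""
    wins = sum(1 for r in refined for o in original if r > o)
    ties = sum(1 for r in refined for o in original if r == o)
    return wins + 0.5 * ties
```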

Figures

Figures reproduced from arXiv: 2604.07562 by Tunazzina Islam.

Figure 1
Figure 1. Overview of our framework. Unsupervised clustering generates initial cluster proposals that are often noisy. We treat these clusters as hypotheses and use LLMs as semantic judges to (1) verify coherence, (2) adjudicate redundancy, and (3) generate interpretable labels via a two-stage grounding process, producing refined, coherent, and distinct clusters.
Figure 2
Figure 2. Example of an incoherent cluster from X.
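Per the paper, the redundancy stage embeds each cluster summary with SBERT and merges clusters whose cosine similarity exceeds a threshold. Given precomputed summary embeddings (in practice from `sentence-transformers`), the grouping step reduces to the sketch below; the greedy chaining is a simplification, and the 0.85 default mirrors the label-grouping threshold the paper reports, which may differ from the summary-merge threshold.

```python
import numpy as np

def group_redundant(emb: np.ndarray, threshold: float = 0.85) -> list[int]:
    """Assign each cluster-summary embedding a group id; clusters whose
    cosine similarity exceeds `threshold` share a group (greedy chaining)."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = unit @ unit.T  # pairwise cosine similarities of unit rows
    group = list(range(len(emb)))
    for i in range(len(emb)):
        for j in range(i + 1, len(emb)):
            if sim[i, j] >= threshold:
                group[j] = group[i]  # j joins i's group
    return group
```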
Figure 3
Figure 3. Example of merged cluster summaries.
Figure 4
Figure 4. Comparing intra-cluster similarity across HDBSCAN, SBERT-refinement, and LLM-refinement.
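Figure 4's comparison rests on an intra-cluster similarity score. A common definition, assumed here since the paper's exact formula is not shown on this page, is the mean pairwise cosine similarity of a cluster's member embeddings:

```python
import numpy as np

def intra_cluster_similarity(emb: np.ndarray) -> float:
    """Mean pairwise cosine similarity among a cluster's member
    embeddings; higher means a tighter, more coherent cluster."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = unit @ unit.T
    i, j = np.triu_indices(len(emb), k=1)  # distinct pairs only
    return float(sim[i, j].mean())
```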
Figure 5
Figure 5. UMAP projection of vegan discourse from X (green) and Bluesky (blue) with broader themes.
Figure 6
Figure 6. Prompt templates (shown as zero-shot).
Figure 7
Figure 7. Prompt example of mapping text→theme (Bluesky dataset). The black segment is the input prompt and the blue segment is the output generated by the LLM.
Figure 8
Figure 8. Distribution of assigned labels/themes using GPT-4o on X (green) and Bluesky (blue) datasets.
Figure 9
Figure 9. Theme distribution in balanced corpora.
read the original abstract

Unsupervised methods are widely used to induce latent semantic structure from large text collections, yet their outputs often contain incoherent, redundant, or poorly grounded clusters that are difficult to validate without labeled data. We propose a reasoning-based refinement framework that leverages large language models (LLMs) not as embedding generators, but as semantic judges that validate and restructure the outputs of arbitrary unsupervised clustering algorithms. Our framework introduces three reasoning stages: (i) coherence verification, where LLMs assess whether cluster summaries are supported by their member texts; (ii) redundancy adjudication, where candidate clusters are merged or rejected based on semantic overlap; and (iii) label grounding, where clusters are assigned interpretable labels through a two-stage process that generates and consolidates semantically similar labels in a fully unsupervised manner. This design decouples representation learning from structural validation and mitigates the common failure modes of embedding-only approaches. We evaluate the framework in real-world social media corpora from two platforms with distinct interaction models, demonstrating consistent improvements in cluster coherence and human-aligned labeling quality over classical topic models and recent representation-based baselines. Human evaluation shows strong agreement with LLM-generated labels, despite the absence of gold-standard annotations. We further conduct robustness analysis under matched temporal and volume conditions to assess cross-platform stability. Beyond empirical gains, our results suggest that LLM-based reasoning can serve as a general mechanism for validating and refining unsupervised semantic structure, enabling more reliable and interpretable analysis of large text collections without supervision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a reasoning-based refinement framework for unsupervised text clusters that uses LLMs to perform coherence verification, redundancy adjudication, and label grounding. Evaluated on social media corpora from two platforms, it claims consistent improvements in cluster coherence and human-aligned labeling quality over classical topic models and representation-based baselines, with strong human agreement on LLM-generated labels despite no gold-standard annotations.

Significance. If the empirical claims hold, this work could provide a practical method for enhancing the reliability and interpretability of unsupervised semantic clustering in text data, particularly useful for large-scale analysis of social media content. The separation of representation learning from structural validation addresses a key limitation of embedding-based approaches. The cross-platform robustness analysis is a positive aspect for assessing generalizability.

major comments (2)
  1. [Abstract] The abstract asserts 'consistent improvements in cluster coherence and human-aligned labeling quality' and 'strong agreement with LLM-generated labels' without providing any quantitative metrics, statistical tests, specific baseline details, or error analysis. This lack of evidence prevents verification of the central empirical claim.
  2. [Evaluation] The reliance on human agreement with LLM labels does not independently confirm improved semantic structure or correctness of the refined clusters. Both humans and the LLM could be responding to surface-level cues or stylistic patterns rather than verifying that member texts support the summaries or that merges preserve distinct semantics. Additional intrinsic metrics or gold-standard comparisons are needed to address this validation gap.
minor comments (2)
  1. [Abstract] Clarify what specific unsupervised clustering algorithms were used as input to the framework, as 'arbitrary' suggests generality but experiments likely use particular ones.
  2. The robustness analysis under matched temporal and volume conditions is mentioned but its results are not detailed; consider expanding on any observed differences or stability measures.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and describe the revisions we will make to improve the clarity, rigor, and verifiability of our empirical claims.

read point-by-point responses
  1. Referee: [Abstract] The abstract asserts 'consistent improvements in cluster coherence and human-aligned labeling quality' and 'strong agreement with LLM-generated labels' without providing any quantitative metrics, statistical tests, specific baseline details, or error analysis. This lack of evidence prevents verification of the central empirical claim.

    Authors: We agree that the abstract would benefit from greater specificity to allow immediate assessment of the central claims. In the revised manuscript, we will expand the abstract to include key quantitative results from our experiments (e.g., reported coherence score improvements and human agreement percentages), mention the primary baselines (classical topic models and representation-based methods), and note the use of statistical tests where applicable. This change will make the abstract self-contained while preserving its brevity. revision: yes

  2. Referee: [Evaluation] The reliance on human agreement with LLM labels does not independently confirm improved semantic structure or correctness of the refined clusters. Both humans and the LLM could be responding to surface-level cues or stylistic patterns rather than verifying that member texts support the summaries or that merges preserve distinct semantics. Additional intrinsic metrics or gold-standard comparisons are needed to address this validation gap.

    Authors: We appreciate this concern about potential superficial biases in the evaluation. Our human annotation guidelines explicitly required judges to check semantic support between member texts and cluster summaries/labels, as well as semantic distinctness after merges; high inter-annotator agreement was observed under these instructions. Nevertheless, we acknowledge the value of supplementary validation. In the revision, we will incorporate additional intrinsic metrics (such as within-cluster similarity and coherence measures like NPMI) into the evaluation section and expand the limitations discussion to address the absence of gold-standard annotations for these unsupervised social media datasets. We maintain that the multi-stage reasoning framework combined with targeted human validation offers a practical solution in label-free settings, but we will strengthen the empirical presentation as suggested. revision: partial
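For reference, the NPMI coherence measure mentioned above scores a word pair by its normalized pointwise mutual information, ranging from -1 (never co-occur) through 0 (independent) to 1 (always co-occur). A minimal version over corpus probabilities, with smoothing and top-word selection omitted:

```python
import math

def npmi(p_ij: float, p_i: float, p_j: float) -> float:
    """Normalized PMI of a word pair from its joint probability p_ij
    and marginal probabilities p_i, p_j (all assumed nonzero)."""
    return math.log(p_ij / (p_i * p_j)) / -math.log(p_ij)
```

A topic's coherence is then typically the average NPMI over its top-word pairs.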

Circularity Check

0 steps flagged

No circularity: empirical framework with external evaluation

full rationale

The paper proposes an LLM-based refinement framework with three explicit reasoning stages (coherence verification, redundancy adjudication, label grounding) applied to outputs of arbitrary unsupervised clustering algorithms. All claims of improvement are grounded in empirical results on two external social-media corpora, including human agreement metrics and cross-platform robustness checks under matched conditions. No equations, fitted parameters, or self-defined quantities appear in the derivation. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing premises. The central claims therefore do not reduce to the inputs by construction; the method is presented as an independent empirical procedure whose validity rests on observable outputs rather than tautological re-labeling of its own components.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the untested premise that current LLMs possess sufficient semantic judgment capability for the three tasks; no free parameters or new entities are introduced in the abstract.

axioms (1)
  • domain assumption LLMs can act as reliable semantic judges for text cluster coherence and redundancy without labeled data
    Framework design and evaluation claims depend on this capability being accurate and consistent.

pith-pipeline@v0.9.0 · 5559 in / 1204 out tokens · 57353 ms · 2026-05-10T17:35:44.953496+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

56 extracted references · 14 canonical work pages · 5 internal anchors


  3. [3]

    Dimo Angelov. 2020. Top2vec: Distributed representations of topics. arXiv preprint arXiv:2008.09470

  4. [4]

    David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. the Journal of machine Learning research

  5. [5]

    Su Lin Blodgett and 1 others. 2020. Language (technology) is power: A critical survey of “bias” in nlp. In ACL

  6. [6]

Jordan Boyd-Graber, David Mimno, and David Newman. 2014. Care and feeding of topic models: Problems, diagnostics, and improvements. Handbook of mixed membership models and their applications, pages 225--255

  7. [7]

    Alexander Brady and Tunazzina Islam. 2025. Latent topic synthesis: Leveraging llms for electoral ad analysis. arXiv preprint arXiv:2510.15125

  8. [8]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, and 1 others. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877--1901

  9. [9]

    Jonathan Chang, Sean Gerrish, Chong Wang, Jordan Boyd-Graber, and David Blei. 2009. Reading tea leaves: How humans interpret topic models. Advances in neural information processing systems, 22

  10. [10]

William G Cochran. 1952. The χ² test of goodness of fit. The Annals of Mathematical Statistics

  11. [11]

    Jacob Cohen. 1960. A coefficient of agreement for nominal scales. EPM

  12. [12]

    Shih-Chieh Dai, Aiping Xiong, and Lun-Wei Ku. 2023. Llm-in-the-loop: Leveraging large language model for thematic analysis. arXiv preprint arXiv:2310.15100

  13. [13]

David L Davies and Donald W Bouldin. 1979. A cluster separation measure. IEEE transactions on pattern analysis and machine intelligence, (2):224--227

  14. [14]

    Scott Deerwester, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American society for information science, 41(6):391--407

  15. [15]

    Bosheng Ding, Chengwei Qin, Linlin Liu, Yew Ken Chia, Boyang Li, Shafiq Joty, and Lidong Bing. 2023. Is gpt-3 a good data annotator? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11173--11195

  16. [16]

Kawin Ethayarajh. 2019. https://doi.org/10.18653/v1/D19-1006 How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCN...

  17. [17]

Fabrizio Gilardi, Meysam Alizadeh, and Maël Kubli. 2023. Chatgpt outperforms crowd workers for text-annotation tasks. Proceedings of the National Academy of Sciences, 120(30):e2305016120

  18. [18]

    Maarten Grootendorst. 2022. Bertopic: Neural topic modeling with a class-based tf-idf procedure. arXiv preprint arXiv:2203.05794

  19. [19]

    Liangjie Hong and Brian D Davison. 2010. Empirical study of topic modeling in twitter. In Proceedings of the first workshop on social media analytics, pages 80--88

  20. [20]

    Patrik O Hoyer. 2004. Non-negative matrix factorization with sparseness constraints. Journal of machine learning research, 5(Nov):1457--1469

  21. [21]

    Fan Huang, Haewoon Kwak, and Jisun An. 2023. Is chatgpt better than human annotators? potential and limitations of chatgpt in explaining implicit hate speech. In Companion proceedings of the ACM web conference 2023, pages 294--297

  22. [22]

    Tunazzina Islam. 2019. Yoga-veganism: Correlation mining of twitter health data. arXiv preprint arXiv:1906.07668

  23. [23]

    Tunazzina Islam. 2026. Who gets which message? auditing demographic bias in llm-generated targeted text. arXiv preprint arXiv:2601.17172

  24. [24]

Tunazzina Islam and Dan Goldwasser. 2025a. Can llms assist annotators in identifying morality frames? Case study on vaccination debate on social media. In Proceedings of the 17th ACM Web Science Conference 2025, pages 169--178

  25. [25]

Tunazzina Islam and Dan Goldwasser. 2025b. Discovering latent themes in social media messaging: A machine-in-the-loop approach integrating llms. In Proceedings of the International AAAI Conference on Web and Social Media, volume 19, pages 859--884

  26. [26]

Tunazzina Islam and Dan Goldwasser. 2025c. https://doi.org/10.18653/v1/2025.findings-emnlp.857 Post-hoc study of climate microtargeting on social media ads with LLMs: Thematic insights and fairness evaluation. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 15838--15859, Suzhou, China. Association for Computational Linguistics

  27. [27]

Tunazzina Islam and Dan Goldwasser. 2025d. Uncovering latent arguments in social media messaging by employing llms-in-the-loop strategy. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 7397--7429

  28. [28]

Albert Q Jiang and others. 2023. Mistral 7b. arXiv preprint arXiv:2310.06825

  29. [29]

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199--22213

  30. [30]

Damir Korenčić, Strahil Ristov, and Jan Šnajder. 2018. Document-based topic coherence measures for news media text. Expert systems with Applications, 114:357--373

  31. [31]

    William H Kruskal and W Allen Wallis. 1952. Use of ranks in one-criterion variance analysis. Journal of the American statistical Association, 47(260):583--621

  32. [32]

    Michelle S Lam, Janice Teoh, James A Landay, Jeffrey Heer, and Michael S Bernstein. 2024. Concept induction: Analyzing unstructured text with high-level concepts using lloom. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pages 1--28

  33. [33]

    Daniel D Lee and H Sebastian Seung. 1999. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788--791

  34. [34]

    Henry B Mann and Donald R Whitney. 1947. On a test of whether one of two random variables is stochastically larger than the other. The annals of mathematical statistics, pages 50--60

  35. [35]

    Leland McInnes, John Healy, Steve Astels, and 1 others. 2017. hdbscan: Hierarchical density based clustering. J. Open Source Softw., 2(11):205

  36. [36]

    Leland McInnes, John Healy, and James Melville. 2018. Umap: Uniform manifold approximation and projection for dimension reduction. JOSS

  37. [37]

    David Mimno, Hanna Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum. 2011. Optimizing semantic coherence in topic models. In Proceedings of the 2011 conference on empirical methods in natural language processing, pages 262--272

  38. [38]

    Elaheh Momeni, Shanika Karunasekera, Palash Goyal, and Kristina Lerman. 2018. Modeling evolution of topics in large-scale temporal text corpora. In Proceedings of the International AAAI Conference on Web and Social Media, volume 12

  39. [39]

Davoud Moulavi, Pablo A Jaskowiak, Ricardo JGB Campello, Arthur Zimek, and Jörg Sander. 2014. Density-based clustering validation. In Proceedings of the 2014 SIAM international conference on data mining, pages 839--847. SIAM

  40. [40]

    Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernandez Abrego, Ji Ma, Vincent Zhao, Yi Luan, Keith Hall, Ming-Wei Chang, and 1 others. 2022. Large dual encoders are generalizable retrievers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9844--9855

  41. [41]

OpenAI. 2024. Hello GPT-4o. https://openai.com/index/hello-gpt-4o/

  42. [42]

    Chau Minh Pham, Alexander Hoyle, Simeng Sun, Philip Resnik, and Mohit Iyyer. 2023. Topicgpt: A prompt-based topic modeling framework. arXiv preprint arXiv:2311.01449

  43. [43]

Hamed Rahimi, David Mimno, Jacob Hoover, Hubert Naacke, Camelia Constantin, and Bernd Amann. 2024. https://doi.org/10.18653/v1/2024.findings-eacl.123 Contextualized topic coherence metrics. In Findings of the Association for Computational Linguistics: EACL 2024, pages 1760--1773, St. Julian's, Malta. Association for Computational Linguistics

  44. [44]

Nitin Ramrakhiyani, Sachin Pawar, Swapnil Hingmire, and Girish Palshikar. 2017. https://aclanthology.org/E17-2070/ Measuring topic coherence through optimal word buckets. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 437--442, Valencia, Spain. Association fo...

  45. [45]

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In EMNLP-IJCNLP

  46. [46]

    Peter J Rousseeuw. 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics, 20:53--65

  47. [47]

    PK Srijith, Mark Hepple, Kalina Bontcheva, and Daniel Preotiuc-Pietro. 2017. Sub-story detection in twitter with hierarchical dirichlet processes. Information Processing & Management, 53(4):989--1003

  48. [48]

    Yee Teh, Michael Jordan, Matthew Beal, and David Blei. 2004. Sharing clusters among related groups: Hierarchical dirichlet processes. Advances in neural information processing systems, 17

  49. [49]

    Hugo Touvron and 1 others. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971

  50. [50]

    Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533

  51. [51]

    Shuohang Wang, Yang Liu, Yichong Xu, Chenguang Zhu, and Michael Zeng. 2021. Want to reduce labeling cost? gpt-3 can help. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4195--4205

  52. [52]

    Zhiqiang Wang, Yiran Pang, and Yanbin Lin. 2023. Large language models are zero-shot text classifiers. arXiv preprint arXiv:2312.01044

  53. [53]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, and 1 others. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824--24837

  54. [54]

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models. In The eleventh international conference on learning representations

  55. [55]

    Michele Zappavigna. 2012. Discourse of twitter and social media. Discourse of Twitter and Social Media

  56. [56]

    Wayne Xin Zhao, Jing Jiang, Jianshu Weng, Jing He, Ee-Peng Lim, Hongfei Yan, and Xiaoming Li. 2011. Comparing twitter and traditional media using topic models. In European conference on information retrieval, pages 338--349. Springer