pith. machine review for the scientific record.

arxiv: 2305.19118 · v4 · submitted 2023-05-30 · 💻 cs.CL

Recognition: 2 theorem links · Lean Theorem

Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate

Pith reviewed 2026-05-13 23:56 UTC · model grok-4.3

classification 💻 cs.CL
keywords: large language models · multi-agent debate · divergent thinking · degeneration of thought · self-reflection · reasoning tasks · commonsense translation · arithmetic reasoning

The pith

Large language models overcome stuck reasoning by having multiple agents argue tit-for-tat under a judge instead of reflecting alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Self-reflection causes LLMs to lock into an initial answer once they gain confidence, even when that answer is wrong, because later reflection fails to produce genuinely new ideas. The paper introduces a Multi-Agent Debate framework in which separate LLM agents present opposing arguments in a back-and-forth exchange while a judge LLM oversees the process and selects a final solution. Experiments on commonsense machine translation and counter-intuitive arithmetic reasoning show that this debate setup produces better results than reflection-based methods. The authors also find that debate performance depends on stopping at an adaptive point and keeping the level of disagreement moderate rather than extreme.

Core claim

The Multi-Agent Debate framework encourages divergent thinking in LLMs by placing multiple agents in a tit-for-tat argumentative state, with a judge managing the exchange to reach a final solution, thereby addressing the Degeneration-of-Thought problem that limits self-reflection on tasks requiring deep contemplation.

What carries the argument

The Multi-Agent Debate process in which LLM agents generate opposing arguments in a tit-for-tat dynamic and a separate judge LLM synthesizes them into a final answer.
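
The process described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `run_debate`, `DebateResult`, `Agent`, and `Judge` are hypothetical, the agents and judge are plain callables standing in for LLM calls, and the judge signals the adaptive break by returning an answer instead of `None`.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Turn = Tuple[str, str]  # (speaker, argument)
# Hypothetical interfaces: in a real system these would wrap LLM calls.
Agent = Callable[[str, List[Turn]], str]
Judge = Callable[[str, List[Turn], bool], Optional[str]]


@dataclass
class DebateResult:
    answer: str
    rounds_used: int
    transcript: List[Turn]


def run_debate(question: str, affirmative: Agent, negative: Agent,
               judge: Judge, max_rounds: int = 3) -> DebateResult:
    """Alternate opposing arguments until the judge settles on an answer."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    transcript: List[Turn] = []
    for round_no in range(1, max_rounds + 1):
        # Each side sees the full transcript so it can rebut the other.
        transcript.append(("affirmative", affirmative(question, transcript)))
        transcript.append(("negative", negative(question, transcript)))
        # Adaptive break: the judge may stop early; it must decide in the last round.
        must_decide = round_no == max_rounds
        verdict = judge(question, transcript, must_decide)
        if verdict is not None:
            return DebateResult(verdict, round_no, transcript)
    raise RuntimeError("judge returned no answer in the final round")


# Toy stand-ins for LLM-backed agents and judge (assumptions for the demo).
def toy_affirmative(question: str, transcript: List[Turn]) -> str:
    return "the answer is 10"


def toy_negative(question: str, transcript: List[Turn]) -> str:
    return "the answer is 5"


def toy_judge(question: str, transcript: List[Turn], must_decide: bool) -> Optional[str]:
    # Decide once two full rounds have been heard, or when forced.
    if len(transcript) >= 4 or must_decide:
        return transcript[-1][1]
    return None


result = run_debate("toy question", toy_affirmative, toy_negative, toy_judge)
print(result.answer, result.rounds_used)
```

Passing the agents and judge as callables keeps the loop independent of any particular model API, so the same skeleton can be used to swap judge models when probing the fairness concern raised in the abstract.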

If this is right

  • MAD improves performance over self-reflection on commonsense machine translation and counter-intuitive arithmetic reasoning.
  • Effective MAD requires an adaptive stopping point for the debate and only a modest level of tit-for-tat intensity.
  • When the debating agents are backed by different LLMs, an LLM judge may not evaluate them fairly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same debate structure could be tested on other tasks that reward considering multiple perspectives, such as planning or creative writing.
  • If the judge bias problem is confirmed, replacing the judge with a human or a rule-based aggregator becomes a direct next step.
  • Scaling the number of agents beyond the small groups tested here might increase the chance of surfacing overlooked alternatives.

Load-bearing premise

The judge LLM can evaluate and combine the agents' arguments fairly without itself becoming stuck in an initial view.

What would settle it

Apply the same MAD setup to the reported datasets and obtain accuracy no higher than self-reflection baselines, or observe the judge consistently favoring one agent's first position regardless of counter-arguments.
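
Both checks reduce to simple counts over logged debates. The sketch below assumes a hypothetical log format in which each debate is a dict with `affirmative_opening`, `negative_opening`, and `final` keys; the function names are illustrative and do not come from the paper or its code release.

```python
from typing import Dict, List


def accuracy(predictions: List[str], gold: List[str]) -> float:
    """Exact-match accuracy; use it for both MAD and the self-reflection baseline."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and equal in length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def first_position_rate(debates: List[Dict[str, str]], side: str) -> float:
    """Fraction of debates whose final answer equals `side`'s opening answer."""
    if not debates:
        raise ValueError("debates must be non-empty")
    key = f"{side}_opening"
    return sum(d["final"] == d[key] for d in debates) / len(debates)


# Made-up records, for illustration only.
debates = [
    {"affirmative_opening": "10", "negative_opening": "5", "final": "10"},
    {"affirmative_opening": "10", "negative_opening": "5", "final": "5"},
    {"affirmative_opening": "3", "negative_opening": "4", "final": "3"},
    {"affirmative_opening": "7", "negative_opening": "8", "final": "7"},
]
print(first_position_rate(debates, "affirmative"))
```

A rate near 1.0 for one side, sustained across many debates and independent of which side is actually correct, would be the judge-bias signal described above.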

Original abstract

Modern large language models (LLMs) like ChatGPT have shown remarkable performance on general language tasks but still struggle on complex reasoning tasks, which drives the research on cognitive behaviors of LLMs to explore human-like problem-solving strategies. Along this direction, one representative strategy is self-reflection, which asks an LLM to refine the solution with the feedback generated by itself iteratively. However, our study shows that such reflection-style methods suffer from the Degeneration-of-Thought (DoT) problem: once the LLM has established confidence in its solutions, it is unable to generate novel thoughts later through reflection even if its initial stance is incorrect. To address the DoT problem, we propose a Multi-Agent Debate (MAD) framework, in which multiple agents express their arguments in the state of "tit for tat" and a judge manages the debate process to obtain a final solution. Clearly, our MAD framework encourages divergent thinking in LLMs which would be helpful for tasks that require deep levels of contemplation. Experiment results on two challenging datasets, commonsense machine translation and counter-intuitive arithmetic reasoning, demonstrate the effectiveness of our MAD framework. Extensive analyses suggest that the adaptive break of debate and the modest level of "tit for tat" state are required for MAD to obtain good performance. Moreover, we find that LLMs might not be a fair judge if different LLMs are used for agents. Code is available at https://github.com/Skytliang/Multi-Agents-Debate.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper identifies the Degeneration-of-Thought (DoT) problem in self-reflection methods for LLMs on complex reasoning tasks. It proposes a Multi-Agent Debate (MAD) framework in which multiple agents engage in tit-for-tat arguments managed by a judge LLM to produce a final solution. Experiments on commonsense machine translation and counter-intuitive arithmetic reasoning datasets are reported to demonstrate effectiveness, with additional analyses on adaptive debate length and tit-for-tat intensity.

Significance. If the results hold under tighter controls, the MAD framework provides a concrete procedural approach to mitigating DoT and encouraging divergent thinking in LLMs. The open-sourced code and empirical evaluation on two challenging tasks constitute a useful contribution to the study of LLM reasoning strategies.

major comments (3)
  1. [Abstract and Experiments] The central claim requires that the judge LLM synthesizes the debate without inheriting DoT bias. The manuscript notes unfairness when different LLMs are used for agents but does not report controls that hold the judge model fixed while varying agent diversity or initial stance strength (Abstract; Experiments section).
  2. [Experiments] Statistical significance, exact baseline implementations, prompt sensitivity, and judge-bias controls are not detailed, leaving the reported gains on the two datasets difficult to interpret or reproduce (Experiments section).
  3. [Analyses] The claim that an adaptive break and modest tit-for-tat level are required for good performance lacks quantitative thresholds or effect-size tables showing how performance degrades outside those regimes (Analyses section).
minor comments (1)
  1. All prompts and exact debate templates should be included in an appendix to support reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We appreciate the referee's detailed feedback on our manuscript. We believe the suggested revisions will significantly strengthen the paper by providing more rigorous controls and quantitative analyses. Below we respond point-by-point to the major comments.

Point-by-point responses
  1. Referee: [Abstract and Experiments] The central claim requires that the judge LLM synthesizes the debate without inheriting DoT bias. The manuscript notes unfairness when different LLMs are used for agents but does not report controls that hold the judge model fixed while varying agent diversity or initial stance strength (Abstract; Experiments section).

    Authors: We thank the referee for highlighting this important aspect. While we observed that using different LLMs for agents can lead to unfair judgments, we agree that explicit controls holding the judge fixed are necessary to isolate the effect of agent diversity. In the revised manuscript, we will include additional experiments where the judge model is fixed (e.g., using GPT-4 as judge) and systematically vary the agent models and the strength of initial stances. This will test more directly whether the judge synthesizes the debate without inheriting DoT bias.
    revision: yes

  2. Referee: [Experiments] Statistical significance, exact baseline implementations, prompt sensitivity, and judge-bias controls are not detailed, leaving the reported gains on the two datasets difficult to interpret or reproduce (Experiments section).

    Authors: We acknowledge the need for more rigorous reporting. In the revision, we will provide: (1) statistical significance tests (e.g., p-values from paired t-tests or bootstrap) for the performance gains; (2) exact prompt templates and baseline implementations with links to code; (3) analysis of prompt sensitivity by varying key prompt elements; and (4) additional judge-bias controls as mentioned above. These details will be added to the Experiments section to enhance reproducibility.
    revision: yes

  3. Referee: [Analyses] The claim that an adaptive break and modest tit-for-tat level are required for good performance lacks quantitative thresholds or effect-size tables showing how performance degrades outside those regimes (Analyses section).

    Authors: We agree that quantitative support would strengthen this claim. We will add effect-size tables and plots in the Analyses section showing performance as a function of debate length (number of rounds) and tit-for-tat intensity levels. This will include thresholds where performance degrades, such as when debate continues too long or tit-for-tat is too aggressive, leading to degeneration.
    revision: yes
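
The bootstrap option mentioned in the second response could look like the sketch below. It is one common form of the paired bootstrap, not a procedure taken from the paper: per-example correctness (1 or 0) for the baseline and the system is resampled with replacement, and the returned value estimates how often the system's gain vanishes. The function name and defaults are assumptions.

```python
import random
from typing import Sequence


def paired_bootstrap_p(baseline_correct: Sequence[int],
                       system_correct: Sequence[int],
                       n_resamples: int = 10_000,
                       seed: int = 0) -> float:
    """Share of resamples in which the system does not beat the baseline."""
    n = len(baseline_correct)
    if n == 0 or n != len(system_correct):
        raise ValueError("inputs must be non-empty and equal in length")
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    # Per-example difference: +1 system-only correct, -1 baseline-only correct.
    diffs = [s - b for s, b in zip(system_correct, baseline_correct)]
    not_better = 0
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping pairs together.
        total = sum(diffs[rng.randrange(n)] for _ in range(n))
        if total <= 0:
            not_better += 1
    return not_better / n_resamples


# Made-up correctness vectors, for illustration only.
baseline = [1, 0, 0, 1, 0, 1, 0, 0]
system = [1, 1, 0, 1, 1, 1, 0, 1]
print(paired_bootstrap_p(baseline, system, n_resamples=2000))
```

Because the resampling keeps each example's baseline and system outcomes paired, the test accounts for the two methods being evaluated on the same items, which an unpaired comparison of accuracies would ignore.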

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper defines the MAD framework as a procedural multi-agent interaction with a judge, without any equations, fitted parameters, or mathematical derivations. Central claims rest on empirical results from two external datasets (commonsense machine translation and counter-intuitive arithmetic reasoning) compared to baselines, with no reduction of outputs to inputs by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked in a load-bearing manner for the core argument; the DoT observation and MAD proposal are presented as independent contributions evaluated externally.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that structured disagreement among LLM instances produces net gains in reasoning quality; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption: Multiple LLM agents in tit-for-tat debate can generate novel thoughts that a single agent cannot produce via self-reflection.
    Invoked to justify why the framework overcomes DoT; appears in the motivation and method description.
  • domain assumption: An LLM judge can reliably select the best solution from the debate transcript.
    Required for the final output step; noted as potentially problematic when different LLMs are used.

pith-pipeline@v0.9.0 · 5585 in / 1265 out tokens · 43183 ms · 2026-05-13T23:56:00.948856+00:00 · methodology

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Predictive Maps of Multi-Agent Reasoning: A Successor-Representation Spectrum for LLM Communication Topologies

    cs.MA 2026-05 unverdicted novelty 7.0

    Successor-representation spectra of row-stochastic communication operators predict perturbation robustness, consensus speed, and error accumulation in multi-agent LLM topologies, with condition number showing perfect ...

  2. EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement

    cs.CV 2026-05 unverdicted novelty 7.0

    EditRefiner uses a perception-reasoning-action-evaluation agent loop and the EditFHF-15K human feedback dataset to refine text-guided image edits more accurately than prior methods.

  3. When Does Critique Improve AI-Assisted Theoretical Physics? SCALAR: Structured Critic--Actor Loop for Agentic Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    Structured critic-actor loops improve AI performance on theoretical physics reasoning tasks, with benefits strongest in asymmetric model pairings using constructive feedback.

  4. Learning to Interrupt in Language-based Multi-agent Communication

    cs.CL 2026-04 unverdicted novelty 7.0

    HANDRAISER learns optimal interruption points in multi-agent LLM communication using estimated future reward and cost, achieving 32.2% lower communication cost with comparable or better task results across games, sche...

  5. What Do AI Agents Talk About? Discourse and Architectural Constraints in the First AI-Only Social Network

    cs.CL 2026-03 unverdicted novelty 7.0

    Discourse among AI agents on Moltbook is largely determined by architectural constraints like context windows and identity files, appearing as social learning but actually short-horizon contextual conditioning.

  6. Predictive Maps of Multi-Agent Reasoning: A Successor-Representation Spectrum for LLM Communication Topologies

    cs.MA 2026-05 unverdicted novelty 6.0

    Spectral features of the successor representation matrix for multi-agent LLM communication topologies predict robustness to perturbations, consensus formation, and error accumulation, with an extension to account for ...

  7. The Reasoning Trap: An Information-Theoretic Bound on Closed-System Multi-Step LLM Reasoning

    cs.CL 2026-05 unverdicted novelty 6.0

    Closed-system multi-step LLM reasoning is subject to an information-theoretic bound where mutual information with evidence decreases, preserving accuracy while eroding faithfulness, with EGSR recovering it on SciFact ...

  8. Preserving Disagreement: Architectural Heterogeneity and Coherence Validation in Multi-Agent Policy Simulation

    cs.MA 2026-04 unverdicted novelty 6.0

    Architectural heterogeneity across 7-9B models reduces first-choice concentration in policy simulations (70.9% to 46.1% and 46.0% to 22.9%), while coherence validation shows a scenario-dependent tradeoff.

  9. Preregistered Belief Revision Contracts

    cs.AI 2026-04 unverdicted novelty 6.0

    PBRC is a contract protocol that enforces evidential belief updates in deliberative multi-agent systems and proves it prevents conformity-driven false cascades under conservative fallbacks.

  10. PolySwarm: A Multi-Agent Large Language Model Framework for Prediction Market Trading and Latency Arbitrage

    cs.AI 2026-04 unverdicted novelty 6.0

    PolySwarm aggregates predictions from 50 LLM personas for Polymarket trading using Bayesian combination and divergence metrics, outperforming single models in calibration while adding latency arbitrage via CEX price models.

  11. Large Language Models Cannot Self-Correct Reasoning Yet

    cs.CL 2023-10 unverdicted novelty 6.0

    LLMs cannot reliably self-correct reasoning mistakes using only their internal capabilities and often degrade in performance without external feedback.

  12. Robust Multi-Agent LLMs under Byzantine Faults

    cs.MA 2026-05 unverdicted novelty 5.0

    SAC is a decentralized iterative filter-and-refine protocol that achieves (F+1)-robustness in LLM multi-agent systems, suppressing Byzantine influence and improving performance on reasoning benchmarks where prior meth...

  13. When Independent Sampling Outperforms Agentic Reasoning

    cs.LG 2026-05 unverdicted novelty 5.0

    On Codeforces problems, independent k-shot sampling achieves better accuracy-cost and accuracy-query tradeoffs than agentic reasoning, even with prompt caching.

  14. 12 Angry AI Agents: Evaluating Multi-Agent LLM Decision-Making Through Cinematic Jury Deliberation

    cs.AI 2026-05 unverdicted novelty 5.0

    Twelve LLM agents in a 12 Angry Men jury setup almost always end in hung juries due to anchoring, with Llama-4-Scout showing more vote changes than GPT-4o, suggesting RLHF alignment intensity limits deliberative flexibility.

  15. Mesh Memory Protocol: Semantic Infrastructure for Multi-Agent LLM Systems

    cs.MA 2026-04 unverdicted novelty 5.0

    MMP defines a seven-field CMB schema, role-based SVAF evaluation, content-hash lineage, and remix storage to enable traceable cross-session collaboration among autonomous LLM agents.

  16. Learning from AVA: Early Lessons from a Curated and Trustworthy Generative AI for Policy and Development Research

    cs.HC 2026-04 unverdicted novelty 5.0

    AVA is a specialized GenAI platform for development policy research that provides verifiable syntheses from World Bank reports and is associated with 2.4-3.9 hours of weekly time savings in a large-scale user evaluation.

  17. Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning

    cs.CL 2026-04 unverdicted novelty 5.0

    APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.

  18. Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs

    cs.CL 2026-04 unverdicted novelty 5.0

    FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.

  19. Act or Escalate? Evaluating Escalation Behavior in Automation with Language Models

    cs.LG 2026-03 unverdicted novelty 5.0

    Language models display model-specific escalation thresholds in uncertain decisions that are not explained by scale or architecture, and supervised fine-tuning on explicit uncertainty reasoning produces robust, genera...

  20. Multi-Agent Systems: From Classical Paradigms to Large Foundation Model-Enabled Futures

    cs.AI 2026-04 unverdicted novelty 4.0

    A survey comparing classical multi-agent systems with large foundation model-enabled multi-agent systems, showing how the latter enables semantic-level collaboration and greater adaptability.

  21. Reducing Hallucination in Enterprise AI Workflows via Hybrid Utility Minimum Bayes Risk (HUMBR)

    cs.LG 2026-04 unverdicted novelty 4.0

    HUMBR reduces LLM hallucinations in enterprise workflows by using a hybrid semantic-lexical utility within minimum Bayes risk decoding to identify consensus outputs, with derived error bounds and reported outperforman...

  22. The Rise and Potential of Large Language Model Based Agents: A Survey

    cs.AI 2023-09 accept novelty 4.0

    The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.

  23. A Case-Driven Multi-Agent Framework for E-Commerce Search Relevance

    cs.IR 2026-05 unverdicted novelty 3.0

    A case-driven multi-agent system automates the full pipeline of bad-case detection, annotation, and resolution for e-commerce search relevance using Annotator, Optimizer, and User agents plus supporting components.

Reference graph

Works this paper leans on

286 extracted references · 286 canonical work pages · cited by 22 Pith papers · 7 internal anchors

  1. [1]

    Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

    Answering Questions by Meta-Reasoning over Multiple Chains of Thought , author=. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

  2. [3]

    A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity , author=. Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  3. [7]

    Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing , pages=

    Solving General Arithmetic Word Problems , author=. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing , pages=

  4. [12]

    Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology , pages=

    Generative agents: Interactive simulacra of human behavior , author=. Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology , pages=

  5. [13]

    Advances in neural information processing systems , volume=

    Large language models are zero-shot reasoners , author=. Advances in neural information processing systems , volume=

  6. [15]

    Advances in Neural Information Processing Systems , volume=

    Tree of thoughts: Deliberate problem solving with large language models , author=. Advances in Neural Information Processing Systems , volume=

  7. [18]

    Advances in neural information processing systems , volume=

    Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in neural information processing systems , volume=

  8. [19]

    Advances in Neural Information Processing Systems , volume=

    Reflexion: Language agents with verbal reinforcement learning , author=. Advances in Neural Information Processing Systems , volume=

  9. [20]

    Advances in Neural Information Processing Systems , volume=

    Self-refine: Iterative refinement with self-feedback , author=. Advances in Neural Information Processing Systems , volume=

  10. [21]

    GPT-4 Technical Report

    Gpt-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=

  11. [23]

    Transactions of the Association for Computational Linguistics , volume=

    Exploring human-like translation strategy with large language models , author=. Transactions of the Association for Computational Linguistics , volume=. 2024 , publisher=

  12. [25]

    International Conference on Machine Learning , pages=

    The unreasonable effectiveness of few-shot learning for machine translation , author=. International Conference on Machine Learning , pages=. 2023 , organization=

  13. [26]

    Interactive-Chain-Prompting: Ambiguity Resolution for Crosslingual Conditional Generation with Interaction , author=. Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  14. [27]

    CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing

    Critic: Large language models can self-correct with tool-interactive critiquing , author=. arXiv preprint arXiv:2305.11738 , year=

  15. [28]

    Advances in Neural Information Processing Systems , volume=

    Eliciting thinking hierarchy without a prior , author=. Advances in Neural Information Processing Systems , volume=

  16. [30]

    Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

    Question Answering as Programming for Solving Time-Sensitive Questions , author=. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

  17. [31]

    The 61st Annual Meeting Of The Association For Computational Linguistics , year=

    Solving Math Word Problems via Cooperative Reasoning induced Language Models , author=. The 61st Annual Meeting Of The Association For Computational Linguistics , year=

  18. [32]

    Philosophical Explorations , volume=

    Does reflection lead to wise choices? , author=. Philosophical Explorations , volume=. 2011 , publisher=

  19. [33]

    , author=

    Metacognition and Reflection by Interdisciplinary Experts: Insights from Cognitive Science and Philosophy. , author=. Issues in Interdisciplinary Studies , volume=. 2017 , publisher=

  20. [34]

    On the reliability of watermarks for large language mod- els.arXiv preprint arXiv:2306.04634, 2023

    On the reliability of watermarks for large language models , author=. arXiv preprint arXiv:2306.04634 , year=

  21. [35]

    arXiv:2308.10848

    Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors in agents , author=. arXiv preprint arXiv:2308.10848 , year=

  22. [36]

    Advances in Neural Information Processing Systems , volume=

    Camel: Communicative agents for" mind" exploration of large language model society , author=. Advances in Neural Information Processing Systems , volume=

  23. [37]

    ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

    Chateval: Towards better llm-based evaluators through multi-agent debate , author=. arXiv preprint arXiv:2308.07201 , year=

  24. [38]

    ChatDev: Communicative Agents for Software Development

    Communicative agents for software development , author=. arXiv preprint arXiv:2307.07924 , year=

  25. [39]

    Thinking, fast and slow , author=

  26. [40]

    Findings of the Association for Computational Linguistics: EMNLP 2023 , pages=

    Towards Making the Most of ChatGPT for Machine Translation , author=. Findings of the Association for Computational Linguistics: EMNLP 2023 , pages=

  27. [41]

    Transactions of the Association for Computational Linguistics , volume=

    Lost in the middle: How language models use long contexts , author=. Transactions of the Association for Computational Linguistics , volume=. 2024 , publisher=

  28. [42]

    Proceedings of the Eighth Conference on Machine Translation , pages=

    Findings of the WMT 2023 Shared Task on Machine Translation with Terminologies , author=. Proceedings of the Eighth Conference on Machine Translation , pages=

  29. [43]

    Proceedings of the Sixth Conference on Machine Translation , pages=

    Findings of the WMT shared task on machine translation using terminologies , author=. Proceedings of the Sixth Conference on Machine Translation , pages=

  30. [44]

    Proceedings of the AAAI Conference on Artificial Intelligence , volume=

    Meta-cotgan: A meta cooperative training paradigm for improving adversarial text generation , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

  31. [45]

    Transactions on Machine Learning Research , year=

    Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models , author=. Transactions on Machine Learning Research , year=

  32. [46]

    Findings of the Association for Computational Linguistics: ACL 2023 , pages=

    Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them , author=. Findings of the Association for Computational Linguistics: ACL 2023 , pages=

  33. [47]

    Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pages=

    Learning to solve arithmetic word problems with verb categorization , author=. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pages=

  34. [48]

    Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. 2023. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Confer...

  35. [49]

    Lisa Bortolotti. 2011. Does reflection lead to wise choices? Philosophical Explorations, 14(3):297--313

  36. [50]

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168

  37. [51]

    Kahneman Daniel. 2017. Thinking, fast and slow. Farrar, Straus and Giroux

  38. [52]

    Shizhe Diao, Pengcheng Wang, Yong Lin, and Tong Zhang. 2023. Active prompting with chain-of-thought for large language models. arXiv preprint arXiv:2302.12246

  39. [53]

    Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. 2023. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325

  40. [54]

    Yao Fu, Hao Peng, Tushar Khot, and Mirella Lapata. 2023. Improving language model negotiation with self-play and in-context learning from ai feedback. arXiv preprint arXiv:2305.10142

  41. [55]

    Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. 2022. Complexity-based prompting for multi-step reasoning. arXiv preprint arXiv:2210.00720

  42. [56]

    Xavier Garcia, Yamini Bansal, Colin Cherry, George Foster, Maxim Krikun, Melvin Johnson, and Orhan Firat. 2023. The unreasonable effectiveness of few-shot learning for machine translation. In International Conference on Machine Learning, pages 10867--10878. PMLR

  43. [57]

    Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen. 2023. Critic: Large language models can self-correct with tool-interactive critiquing

  44. [58]

    Jie He, Tao Wang, Deyi Xiong, and Qun Liu. 2020. https://doi.org/10.18653/v1/2020.findings-emnlp.327 The box is in the pen: Evaluating commonsense reasoning in neural machine translation . In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3662--3672, Online. Association for Computational Linguistics

  45. [59]

    Zhiwei He, Tian Liang, Wenxiang Jiao, Zhuosheng Zhang, Yujiu Yang, Rui Wang, Zhaopeng Tu, Shuming Shi, and Xing Wang. 2024. Exploring human-like translation strategy with large language models. Transactions of the Association for Computational Linguistics, 12:229--246

  46. [60]

    Amr Hendy, Mohamed Abdelrehim, Amr Sharaf, Vikas Raunak, Mohamed Gabr, Hitokazu Matsushita, Young Jin Kim, Mohamed Afify, and Hany Hassan Awadalla. 2023. How good are gpt models at machine translation? a comprehensive evaluation. arXiv preprint arXiv:2302.09210

  47. [61]

    Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. 2014. Learning to solve arithmetic word problems with verb categorization. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 523--533

  48. [62]

    Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, Shuming Shi, and Zhaopeng Tu. 2023. Is chatgpt a good translator? yes with gpt-4 as the engine. arXiv preprint arXiv:2301.08745

  49. [63]

    Machiel Keestra. 2017. Metacognition and reflection by interdisciplinary experts: Insights from cognitive science and philosophy. Issues in Interdisciplinary Studies, 35:121--169

  50. [64]

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199--22213

  51. [65]

    Yuqing Kong, Yunqi Li, Yubo Zhang, Zhihuan Huang, and Jinzhao Wu. 2022. Eliciting thinking hierarchy without a prior. Advances in Neural Information Processing Systems, 35:13329--13341

  52. [66]

    Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157--173

  53. [67]

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. 2024. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36

  54. [68]

    Joon Sung Park, Joseph O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1--22

  55. [69]

    Jonathan Pilault, Xavier Garcia, Arthur Bražinskas, and Orhan Firat. 2023. Interactive-chain-prompting: Ambiguity resolution for crosslingual conditional generation with interaction. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computati...

  56. [70]

    Subhro Roy and Dan Roth. 2015. Solving general arithmetic word problems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1743--1752

  57. [71]

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2024. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36

  58. [72]

    Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. 2023. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research

  59. [73]

    Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, et al. 2023. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023, pages 13003--13051

  60. [74]

    Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. 2023. Large language models are not fair evaluators. arXiv preprint arXiv:2305.17926

  61. [75]

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171

  62. [76]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824--24837

  63. [77]

    Haoran Wu, Wenxuan Wang, Yuxuan Wan, Wenxiang Jiao, and Michael Lyu. 2023. ChatGPT or Grammarly? Evaluating ChatGPT on grammatical error correction benchmark. arXiv preprint arXiv:2303.13648

  64. [78]

    Kai Xiong, Xiao Ding, Yixin Cao, Ting Liu, and Bing Qin. 2023. Diving into the inter-consistency of large language models: An insightful analysis through debate. arXiv preprint arXiv:2305.11595

  65. [79]

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2024. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36

  66. [80]

    Haiyan Yin, Dingcheng Li, Xu Li, and Ping Li. 2020. Meta-CoTGAN: A meta cooperative training paradigm for improving adversarial text generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 9466--9473

  67. [81]

    Ori Yoran, Tomer Wolfson, Ben Bogin, Uri Katz, Daniel Deutch, and Jonathan Berant. 2023. Answering questions by meta-reasoning over multiple chains of thought. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5942--5966

  68. [82]

    Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. 2022. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493

  69. [83]

    Chuanyang Zheng, Zhengying Liu, Enze Xie, Zhenguo Li, and Yu Li. 2023. Progressive-hint prompting improves reasoning in large language models. arXiv preprint arXiv:2304.09797

  70. [84]

    Xinyu Zhu, Junjie Wang, Lin Zhang, Yuxiang Zhang, Yongfeng Huang, Jiaxing Zhang, Yujiu Yang, et al. 2023a. Solving math word problems via cooperative reasoning induced language models. In The 61st Annual Meeting Of The Association For Computational Linguistics

  71. [85]

    Xinyu Zhu, Cheng Yang, Bei Chen, Siheng Li, Jian-Guang Lou, and Yujiu Yang. 2023b. Question answering as programming for solving time-sensitive questions. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12775--12790

  72. [86]

    Xizhou Zhu, Yuntao Chen, Hao Tian, Chenxin Tao, Weijie Su, Chenyu Yang, Gao Huang, Bin Li, Lewei Lu, Xiaogang Wang, et al. 2023c. Ghost in the Minecraft: Generally capable agents for open-world enviroments via large language models with text-based knowledge and memory. arXiv preprint arXiv:2305.17144

  73. [87]

    Proceedings of the Ninth Workshop on Noisy and User-generated Text (W-NUT 2024). 2024

  74. [88]

    Mathias Creutz. 2024. Correcting Challenging Finnish Learner Texts With Claude, GPT-3.5 and GPT-4 Large Language Models

  75. [89]

    Shuguang Chen, Leonardo Neves, and Thamar Solorio. 2024. Context-aware Adversarial Attack on Named Entity Recognition

  76. [90]

    Maja Popovic, Ekaterina Lapshinova-Koltunski, and Maarit Koponen. 2024. Effects of different types of noise in user-generated reviews on human and machine translations including ChatGPT

  77. [91]

    Anton Lavrouk, Ian Ligon, Jonathan Zheng, Tarek Naous, Wei Xu, and Alan Ritter. 2024. Stanceosaurus 2.0 - Classifying Stance Towards Russian and Spanish Misinformation

  78. [92]

    Kazi Elahi, Tasnuva Rahman, Shakil Shahriar, Samir Sarker, Md. Shawon, and G. M. Shibli. 2024. A Comparative Analysis of Noise Reduction Methods in Sentiment Analysis on Noisy Bangla Texts

  79. [93]

    Baber Khalid, Shuyang Dai, Tara Taghavi, and Sungjin Lee. 2024. Label Supervised Contrastive Learning for Imbalanced Text Classification in Euclidean and Hyperbolic Embedding Spaces

  80. [94]

    Tyler Bikaun, Melinda Hodkiewicz, and Wei Liu. 2024. MaintNorm: A corpus and benchmark model for lexical normalisation and masking of industrial maintenance short text

Showing first 80 references.