
arxiv: 2509.24765 · v8 · submitted 2025-09-29 · 💻 cs.AI

Semantic-Aware Logical Reasoning via a Semiotic Framework

Pith reviewed 2026-05-18 12:54 UTC · model grok-4.3

classification 💻 cs.AI
keywords logical reasoning · semiotic square · large language models · multi-perspective analysis · RepublicQA · semantic complexity · automated deduction

The pith

LogicAgent combines a semiotic square for multi-perspective semantics with deduction and verification to improve logical reasoning in language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LogicAgent, a framework guided by the semiotic square that analyzes propositions from several semantic angles at once. It pairs this analysis with automated deduction steps and reflective checks to handle longer chains of reasoning. A new benchmark called RepublicQA tests these abilities with abstract, philosophically grounded statements that include contrary and contradictory forms at college-level reading difficulty. Results show consistent gains on this benchmark and on established ones like ProntoQA and FOLIO. The work matters because most current systems falter when both the meaning is ambiguous and the logical steps are deep.

Core claim

LogicAgent integrates the semiotic square to perform multi-perspective semantic analysis and combines it with automated deduction plus reflective verification, allowing large language models to manage logical complexity more effectively across deeper reasoning chains on tasks that mix semantic and logical difficulty.

What carries the argument

The semiotic square, which organizes semantic relations among a proposition, its contrary, its contradictory, and its subcontrary to enable structured multi-perspective examination of meaning.
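
To make the four-position structure concrete, here is a minimal Python sketch of a semiotic square as a data structure. It is an illustration only, not the paper's implementation: the class name `SemioticSquare`, the `relations` method, and the `consistent` helper are hypothetical names introduced here, and the propositions are plain strings rather than first-order-logic formulas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SemioticSquare:
    """Four positions of a Greimas-style square (hypothetical container)."""
    s1: str       # primary proposition
    s2: str       # contrary of s1
    not_s1: str   # contradictory of s1
    not_s2: str   # contradictory of s2

    def relations(self):
        """Return (source, relation, target) triples among the positions."""
        return [
            (self.s1, "contrary", self.s2),
            (self.s1, "contradictory", self.not_s1),
            (self.s2, "contradictory", self.not_s2),
            (self.not_s1, "subcontrary", self.not_s2),
            (self.s1, "implies", self.not_s2),
            (self.s2, "implies", self.not_s1),
        ]


def consistent(truth):
    """Check a truth assignment {position name: bool} against the square.

    Contradictories must take opposite values; contraries cannot both be
    true (they may both be false).
    """
    contradictories_ok = (truth["s1"] != truth["not_s1"]) and (
        truth["s2"] != truth["not_s2"]
    )
    contraries_ok = not (truth["s1"] and truth["s2"])
    return contradictories_ok and contraries_ok


square = SemioticSquare(
    s1="The city is just",
    s2="The city is unjust",
    not_s1="The city is not just",
    not_s2="The city is not unjust",
)
for source, relation, target in square.relations():
    print(f"{source!r} --{relation}--> {target!r}")
```

Under these two constraints the subcontraries can never both be false, so the check does not need a separate rule for them.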

If this is right

  • Language models gain the ability to track conflicting stances within the same reasoning task rather than collapsing them early.
  • Benchmarks that jointly vary semantic depth and logical length become necessary for realistic evaluation of reasoning systems.
  • The same semiotic structure can be reused to generate or verify reasoning traces that explicitly account for alternative meanings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could transfer to domains such as policy analysis or case law where one proposition must be examined against its logical opposites.
  • Future work could test whether the square structure helps models avoid common semantic pitfalls like scope ambiguity in natural-language premises.
  • If the integration scales, it suggests a general route for adding lightweight symbolic scaffolds to purely neural reasoning pipelines.

Load-bearing premise

The semiotic square supplies a reliable structure for breaking down semantic relations that can be usefully combined with deduction and verification inside language-model reasoning loops.

What would settle it

Run LogicAgent and a version without the semiotic-square component on a fresh set of abstract propositions with systematically varied contrary and contradictory forms; if the full system shows no measurable gain in accuracy or chain length, the central integration claim does not hold.
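
One way such a comparison could be scored is sketched below. It assumes per-item predictions from the full system and from the ablated variant on the same items, and it covers accuracy only, not chain length. The function names and the paired-bootstrap procedure are choices made for this illustration; the paper does not specify a statistical test.

```python
import random


def accuracy(predictions, gold):
    """Fraction of items where the prediction matches the gold label."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def paired_bootstrap_gain(full_preds, ablated_preds, gold,
                          n_resamples=2000, seed=0):
    """Compare two systems evaluated on the same items.

    Returns (observed_gain, share_non_positive), where share_non_positive is
    the fraction of bootstrap resamples in which the full system does not
    beat the ablated one.
    """
    rng = random.Random(seed)
    n = len(gold)
    observed = accuracy(full_preds, gold) - accuracy(ablated_preds, gold)
    non_positive = 0
    for _ in range(n_resamples):
        # Resample item indices with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        full_acc = sum(full_preds[i] == gold[i] for i in idx) / n
        ablated_acc = sum(ablated_preds[i] == gold[i] for i in idx) / n
        if full_acc - ablated_acc <= 0:
            non_positive += 1
    return observed, non_positive / n_resamples


# Toy labels for illustration only; these are not results from the paper.
gold = ["True", "False", "True", "False"]
full = ["True", "False", "True", "False"]
ablated = ["True", "False", "False", "True"]
gain, share = paired_bootstrap_gain(full, ablated, gold)
print(f"observed gain: {gain:.2f}, resamples with no gain: {share:.3f}")
```

A large share of resamples with no gain would point toward the "no measurable gain" outcome described above; a small share would support the integration claim.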

Figures

Figures reproduced from arXiv: 2509.24765 by Junqing Yu, Junxi Sheng, Wei Yang, Wenbing Li, Xinglang Zhang, Yi-Ping Phoebe Chen, Yunyao Zhang, Zikai Song.

Figure 1: Overview of LogicAgent and the proposed RepublicQA benchmark. (Top-left) RepublicQA features abstract, philosophical propositions from Plato's Republic with diverse contextual premises, enabling multiple semantic interpretations. (Bottom-left) LogicAgent consists of three stages. (Top-right) A multi-step reasoning process explores contraries and contradictions when S1 is indeterminate. (Bottom-right)…
Figure 2: Greimas' Semiotic Square: illustrating contraries (S1 vs. S2), contradictions (S1 vs. ¬S1, S2 vs. ¬S2), and implications (S1 ⇒ ¬S2, S2 ⇒ ¬S1). Greimas' Semiotic Square. The Greimas' Semiotic Square (Greimas et al., 1982) is a foundational construct in structuralist semantics that organizes conceptual contraries and contradictions into a four-element structure, enabling fine-grained reasoning over meaning…
Figure 3: Overview of the LogicAgent framework. The agent processes a natural language proposition through three stages. (1) Semantic Structuring Stage constructs a Greimas' Semiotic Square, generating four interrelated propositions: the primary proposition S1, its contradiction ¬S1, the contrary S2, and the contradiction of the contrary ¬S2. These are verified for FOL-consistency using a CFG-based parser. (2) Log…
Figure 4: Complexity metrics comparison. Red is our benchmark. Current benchmarks primarily focus on logical complexity while largely overlooking semantic complexity, resulting in limited coverage of abstraction, contextual ambiguity, and nuanced meaning. To address this gap, we construct RepublicQA, a benchmark designed to jointly capture logical depth and semantic breadth reasoning. Benchmark Construction. R…
Figure 5: Ablation studies: (a) input modalities and (b) reasoning efficiency.
Figure 6: An example CFG parse tree for the FOL rule
Figure 7: Answer distribution across different benchmarks.
Figure 8: Analysis of Philosophical Concepts: (a) frequency distribution of concepts, (b) overall
Figure 9: Overall and relation-specific accuracy across datasets. (Bar chart over FOLIO, ProntoQA, ProofWriter, ProverQA, and RepublicQA; y-axis: Proportion; series: Contradictory Proportion, Contrary Proportion.)
Original abstract

Logical reasoning is a fundamental capability of large language models. However, existing studies often overlook the interaction between logical complexity and semantic complexity, leading to systems that struggle with abstract propositions, ambiguous contexts, and conflicting stances that are central to human reasoning. We propose LogicAgent, a semiotic-square-guided framework that jointly addresses these two axes of difficulty. The semiotic square provides a principled structure for multi-perspective semantic analysis, and LogicAgent integrates automated deduction with reflective verification to manage logical complexity across deeper reasoning chains. To support evaluation under these conditions, we introduce RepublicQA, a benchmark that couples semantic complexity with logical depth. RepublicQA reaches college-level semantic difficulty (FKGL 11.94), contains philosophically grounded abstract propositions with systematically constructed contrary and contradictory forms, and offers a semantically rich setting for assessing logical reasoning in large language models. Experiments show that LogicAgent achieves state-of-the-art performance on RepublicQA with a 6.25 percent average improvement over strong baselines, and generalizes effectively to mainstream logical reasoning benchmarks including ProntoQA, ProofWriter, FOLIO, and ProverQA, achieving an additional 7.05 percent average gain. These results demonstrate the effectiveness of semiotic-grounded multi-perspective reasoning in enhancing logical performance. Code is available at https://github.com/AI4SS/Logic-Agent.
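
For reference, the FKGL figure quoted in the abstract is the Flesch-Kincaid Grade Level. Its standard definition is reproduced below; how the authors counted words, sentences, and syllables for RepublicQA is not stated here, so the formula is a reading aid rather than a reconstruction of their computation.

```latex
\mathrm{FKGL} = 0.39 \cdot \frac{\text{total words}}{\text{total sentences}}
              + 11.8 \cdot \frac{\text{total syllables}}{\text{total words}}
              - 15.59
```

The score rises with longer sentences and with more syllables per word, and it is intended to map onto US school grade levels.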

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes LogicAgent, a semiotic-square-guided framework that combines multi-perspective semantic analysis with automated deduction and reflective verification to improve logical reasoning in LLMs under conditions of high semantic and logical complexity. It introduces the RepublicQA benchmark, which features college-level semantic difficulty (FKGL 11.94), philosophically grounded abstract propositions, and systematically constructed contrary/contradictory forms. Experiments report that LogicAgent achieves SOTA performance on RepublicQA (6.25% average improvement over strong baselines) and generalizes to ProntoQA, ProofWriter, FOLIO, and ProverQA (additional 7.05% average gain). Code is released publicly.

Significance. If the results hold after addressing isolation concerns, the work would represent a meaningful step toward integrating semiotic structures with automated reasoning pipelines in LLMs, offering a structured way to handle ambiguous and conflicting semantic contexts that current systems often overlook. The new RepublicQA benchmark and public code release are concrete strengths that could support follow-on research in semantic-aware logical reasoning.

major comments (1)
  1. [Framework and Experiments] Framework description and experimental evaluation: The central claim attributes the 6.25% RepublicQA gain and 7.05% cross-benchmark improvement specifically to the semiotic-square-guided multi-perspective analysis. However, no ablation is reported that removes or replaces the semiotic square (contrary/contradictory forms and multi-perspective semantic analysis) while retaining the automated deduction and reflective verification steps. This leaves open whether the gains arise from the semiotic component or from reflective verification alone, which is load-bearing for the paper's attribution of effectiveness to the semiotic framework.
minor comments (1)
  1. [Abstract] Abstract and experimental setup: More explicit details on baseline implementations, exact prompting templates, and statistical controls (e.g., number of runs, variance) would strengthen reproducibility claims, even with code release.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thorough review and constructive criticism. The concern about isolating the contribution of the semiotic square is well-taken and points to a genuine gap in the current experimental design. We address this point directly below and outline the planned revision.

Point-by-point responses
  1. Referee: [Framework and Experiments] Framework description and experimental evaluation: The central claim attributes the 6.25% RepublicQA gain and 7.05% cross-benchmark improvement specifically to the semiotic-square-guided multi-perspective analysis. However, no ablation is reported that removes or replaces the semiotic square (contrary/contradictory forms and multi-perspective semantic analysis) while retaining the automated deduction and reflective verification steps. This leaves open whether the gains arise from the semiotic component or from reflective verification alone, which is load-bearing for the paper's attribution of effectiveness to the semiotic framework.

    Authors: We agree that the manuscript would be strengthened by an ablation that removes or replaces the semiotic square (including the contrary/contradictory forms and multi-perspective semantic analysis) while keeping the automated deduction and reflective verification components intact. The existing baselines compare LogicAgent against methods that lack the full pipeline, but they do not isolate the semiotic component from reflective verification in the manner described. To address this directly, we will add a targeted ablation study in the revised version. This study will evaluate a variant that retains deduction and verification but substitutes a non-semiotic multi-perspective prompt or removes the structured contrary/contradictory analysis. The new results will be reported alongside the existing experiments to clarify the specific contribution of the semiotic framework to the observed gains on RepublicQA and the other benchmarks. revision: yes

Circularity Check

0 steps flagged

No circularity: new framework and benchmark validated on external benchmarks

Full rationale

The paper introduces LogicAgent as a novel semiotic-square-guided framework and RepublicQA as a new benchmark with college-level semantic difficulty and philosophically grounded propositions. Performance gains (6.25% on RepublicQA, 7.05% on cross-benchmarks) are reported via direct empirical comparison to strong baselines on ProntoQA, ProofWriter, FOLIO, and ProverQA. No equations, fitted parameters, or self-referential definitions appear in the abstract or described derivation; the semiotic square is presented as an imported principled structure rather than derived from the results themselves. The central claims rest on experimental outcomes and external benchmark generalization, making the derivation self-contained against independent data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the effectiveness of a newly introduced framework and benchmark whose core structuring assumption (semiotic square utility) and evaluation setting have no independent external validation cited.

axioms (1)
  • domain assumption: The semiotic square provides a principled structure for multi-perspective semantic analysis.
    Invoked directly as the foundation for LogicAgent in the abstract description of the framework.
invented entities (2)
  • LogicAgent (no independent evidence)
    purpose: Semiotic-square-guided framework integrating deduction and reflective verification
    Newly proposed system whose performance gains are demonstrated only within this work.
  • RepublicQA (no independent evidence)
    purpose: Benchmark coupling semantic complexity with logical depth via abstract propositions and contrary/contradictory forms
    Newly introduced evaluation dataset whose construction and difficulty claims are internal to the paper.




Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval

    cs.CV 2026-04 unverdicted novelty 7.0

    TEMA is the first framework for multi-modification composed image retrieval, using entity mapping to improve accuracy on both new complex datasets and existing benchmarks while balancing efficiency.

  2. IntervenSim: Intervention-Aware Social Network Simulation for Opinion Dynamics

    cs.SI 2026-04 unverdicted novelty 7.0

    IntervenSim is an intervention-aware social network simulation that couples source interventions with crowd interactions in a feedback loop, improving MAPE by 41.6% and DTW by 66.9% over prior static frameworks on rea...

  3. OmniTrend: Content-Context Modeling for Scalable Social Popularity Prediction

    cs.CV 2026-04 unverdicted novelty 6.0

    OmniTrend predicts popularity by combining separate content attractiveness and contextual exposure predictors using cross-modal and exogenous signals.

  4. HotComment: A Benchmark for Evaluating Popularity of Online Comments

    cs.AI 2026-04 unverdicted novelty 6.0

    HotComment is a new multimodal benchmark that quantifies online comment popularity via content quality assessment, interaction-based prediction, and agent-simulated user engagement, accompanied by the StyleCmt stylist...

  5. Towards Disentangled Preference Optimization Dynamics: Suppress the Loser, Preserve the Winner

    cs.LG 2026-04 unverdicted novelty 6.0

    A unified incentive-score decomposition of preference optimization reveals the disentanglement band condition and reward calibration method that enables suppressing losers while preserving winners in LLM training.

  6. Coupling Macro Dynamics and Micro States for Long-Horizon Social Simulation

    cs.SI 2026-04 unverdicted novelty 6.0

    MF-MDP enables stable long-horizon social simulations by coupling micro-level individual opinion states with macro-level collective dynamics, achieving up to 40,000 interactions with 75% lower KL divergence than baselines.

  7. Seeing Further and Wider: Joint Spatio-Temporal Enlargement for Micro-Video Popularity Prediction

    cs.MM 2026-04 unverdicted novelty 5.0

    A new joint spatio-temporal enlargement model for micro-video popularity prediction using frame scoring for long sequences and a topology-aware memory bank for unbounded historical associations.

  8. CurEvo: Curriculum-Guided Self-Evolution for Video Understanding

    cs.CV 2026-04 unverdicted novelty 4.0

    CurEvo integrates curriculum guidance into self-evolution to structure autonomous improvement of video understanding models, yielding gains on VideoQA benchmarks.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · cited by 8 Pith papers · 13 internal anchors

  2. [2]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  3. [3]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023

  4. [4]

    Nltk: the natural language toolkit

    Steven Bird. Nltk: the natural language toolkit. In Proceedings of the COLING/ACL 2006 interactive presentation sessions, pp. 69–72, 2006

  5. [5]

    Autoagents: A framework for automatic agent generation

    Guangyao Chen, Siwei Dong, Yu Shu, Ge Zhang, Jaward Sesay, Börje F Karlsson, Jie Fu, and Yemin Shi. Autoagents: A framework for automatic agent generation. arXiv preprint arXiv:2309.17288, 2023

  6. [6]

    Asymptotically unambitious artificial general intelligence

    Michael Cohen, Badri Vellambi, and Marcus Hutter. Asymptotically unambitious artificial general intelligence. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp. 2467–2476, 2020

  7. [7]

    Semcoder: Training code language models with comprehensive semantics reasoning

    Yangruibo Ding, Jinjun Peng, Marcus Min, Gail Kaiser, Junfeng Yang, and Baishakhi Ray. Semcoder: Training code language models with comprehensive semantics reasoning. Advances in Neural Information Processing Systems, 37:60275–60308, 2024

  8. [8]

    Agent AI: Surveying the Horizons of Multimodal Interaction

    Zane Durante, Qiuyuan Huang, Naoki Wake, Ran Gong, Jae Sung Park, Bidipta Sarkar, Rohan Taori, Yusuke Noda, Demetri Terzopoulos, Yejin Choi, et al. Agent ai: Surveying the horizons of multimodal interaction. arXiv preprint arXiv:2401.03568, 2024

  9. [9]

    Deep SE(3)-equivariant geometric reasoning for precise placement tasks

    Ben Eisner, Yi Yang, Todor Davchev, Mel Vecerik, Jonathan Scholz, and David Held. Deep SE(3)-equivariant geometric reasoning for precise placement tasks. arXiv preprint arXiv:2404.13478, 2024

  10. [10]

    Pal: Program-aided language models

    Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models. In International Conference on Machine Learning, pp. 10764–10799. PMLR, 2023

  11. [11]

    Linguistic complexity: Locality of syntactic dependencies

    Edward Gibson. Linguistic complexity: Locality of syntactic dependencies. Cognition, 68(1):1–76, 1998

  12. [12]

    On meaning: Selected writings in semiotic theory

    Algirdas Julien Greimas. On meaning: Selected writings in semiotic theory. 1987

  13. [13]

    Maupassant: The semiotics of text

    Algirdas Julien Greimas. Maupassant: The semiotics of text. 1988

  14. [14]

    Semiotics and language: An analytical dictionary

    Algirdas Julien Greimas, Joseph Courtés, Larry Crist, and Daniel Patte. Semiotics and language: An analytical dictionary. Indiana University Press Bloomington, 1982

  15. [15]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023

  16. [16]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  17. [17]

    Folio: Natural language reasoning with first-order logic

    Simeng Han, Hailey Schoelkopf, Yilun Zhao, Zhenting Qi, Martin Riddell, Wenfei Zhou, James Coady, David Peng, Yujie Qiao, Luke Benson, et al. Folio: Natural language reasoning with first-order logic. arXiv preprint arXiv:2209.00840, 2022

  18. [18]

    Plan-then-execute: An empirical study of user trust and team performance when using llm agents as a daily assistant

    Gaole He, Gianluca Demartini, and Ujwal Gadiraju. Plan-then-execute: An empirical study of user trust and team performance when using llm agents as a daily assistant. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pp. 1–22, 2025

  19. [19]

    MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

    Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, et al. Metagpt: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 3(4):6, 2023

  20. [20]

    Sf2t: Self-supervised fragment finetuning of video-llms for fine-grained understanding

    Yangliu Hu, Zikai Song, Na Feng, Yawei Luo, Junqing Yu, Yi-Ping Phoebe Chen, and Wei Yang. Sf2t: Self-supervised fragment finetuning of video-llms for fine-grained understanding. arXiv preprint arXiv:2504.07745, 2025

  21. [21]

    Towards Reasoning in Large Language Models: A Survey

    Jie Huang and Kevin Chen-Chuan Chang. Towards reasoning in large language models: A survey. arXiv preprint arXiv:2212.10403, 2022

  22. [22]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

  23. [23]

    Coupled mamba: Enhanced multi-modal fusion with coupled state space model

    Wenbing Li, Hang Zhou, Junqing Yu, Zikai Song, and Wei Yang. Coupled mamba: Enhanced multi-modal fusion with coupled state space model. arXiv preprint arXiv:2405.18014, 2024

  24. [24]

    Code as policies: Language model programs for embodied control

    Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 9493–9500. IEEE, 2023

  25. [25]

    Taskmatrix.ai: Completing tasks by connecting foundation models with millions of apis

    Yaobo Liang, Chenfei Wu, Ting Song, Wenshan Wu, Yan Xia, Yu Liu, Yang Ou, Shuai Lu, Lei Ji, Shaoguang Mao, et al. Taskmatrix.ai: Completing tasks by connecting foundation models with millions of apis. Intelligent Computing, 3:0063, 2024

  26. [26]

    GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

    Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models. arXiv preprint arXiv:2410.05229, 2024

  27. [27]

    Logic-LM: Empowering Large Language Models With Symbolic Solvers for Faithful Logical Reasoning

    Liangming Pan, Alon Albalak, Xinyi Wang, and William Yang Wang. Logic-lm: Empowering large language models with symbolic solvers for faithful logical reasoning. arXiv preprint arXiv:2305.12295, 2023

  28. [28]

    Generative agents: Interactive simulacra of human behavior

    Joon Sung Park, Joseph O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th annual acm symposium on user interface software and technology, pp. 1–22, 2023

  29. [29]

    Multi-logieval: Towards evaluating multi-step logical reasoning ability of large language models

    Nisarg Patel, Mohith Kulkarni, Mihir Parmar, Aashna Budhiraja, Mutsumi Nakamura, Neeraj Varshney, and Chitta Baral. Multi-logieval: Towards evaluating multi-step logical reasoning ability of large language models. arXiv preprint arXiv:2406.17169, 2024

  30. [30]

    Gorilla: Large language model connected with massive apis

    Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. Gorilla: Large language model connected with massive apis. Advances in Neural Information Processing Systems, 37:126544–126565, 2024

  31. [31]

    Critical and reflective thinking: A philosophical perspective

    Richard W Paul. Critical and reflective thinking: A philosophical perspective. In Dimensions of thinking and cognitive instruction, pp. 445–494. Routledge, 2013

  32. [32]

    Large language models meet symbolic provers for logical reasoning evaluation

    Chengwen Qi, Ren Ma, Bowen Li, He Du, Binyuan Hui, Jinwang Wu, Yuanjun Laili, and Conghui He. Large language models meet symbolic provers for logical reasoning evaluation. arXiv preprint arXiv:2502.06563, 2025

  33. [33]

    Divide and translate: Compositional first-order logic translation and verification for complex logical reasoning

    Hyun Ryu, Gyeongman Kim, Hyemin S Lee, and Eunho Yang. Divide and translate: Compositional first-order logic translation and verification for complex logical reasoning. arXiv preprint arXiv:2410.08047, 2024

  34. [34]

    Language models are greedy reasoners: A systematic formal analysis of chain-of-thought

    Abulhair Saparov and He He. Language models are greedy reasoners: A systematic formal analysis of chain-of-thought. arXiv preprint arXiv:2210.01240, 2022

  35. [35]

    An introduction to formal logic

    Peter Smith. An introduction to formal logic. Cambridge University Press, 2003

  36. [36]

    Transformer tracking with cyclic shifting window attention

    Zikai Song, Junqing Yu, Yi-Ping Phoebe Chen, and Wei Yang. Transformer tracking with cyclic shifting window attention. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8791–8800, 2022

  37. [37]

    Compact transformer tracker with correlative masked modeling

    Zikai Song, Run Luo, Junqing Yu, Yi-Ping Phoebe Chen, and Wei Yang. Compact transformer tracker with correlative masked modeling. In Proceedings of the AAAI conference on artificial intelligence, volume 37, pp. 2321–2329, 2023

  38. [38]

    Autogenic language embedding for coherent point tracking

    Zikai Song, Ying Tang, Run Luo, Lintao Ma, Junqing Yu, Yi-Ping Phoebe Chen, and Wei Yang. Autogenic language embedding for coherent point tracking. In Proceedings of the 32nd ACM International Conference on Multimedia, pp. 2021–2030, 2024

  39. [39]

    Temporal coherent object flow for multi-object tracking

    Zikai Song, Run Luo, Lintao Ma, Ying Tang, Yi-Ping Phoebe Chen, Junqing Yu, and Wei Yang. Temporal coherent object flow for multi-object tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 6978–6986, 2025

  40. [40]

    Proofwriter: Generating implications, proofs, and abductive statements over natural language

    Oyvind Tafjord, Bhavana Dalvi Mishra, and Peter Clark. Proofwriter: Generating implications, proofs, and abductive statements over natural language. arXiv preprint arXiv:2012.13048, 2020

  41. [41]

    Ambiguity, polysemy, and vagueness

    David Tuggy. Ambiguity, polysemy, and vagueness. 1993

  42. [42]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  43. [43]

    Mathcoder: Seamless code integration in llms for enhanced mathematical reasoning

    Ke Wang, Houxing Ren, Aojun Zhou, Zimu Lu, Sichun Luo, Weikang Shi, Renrui Zhang, Linqi Song, Mingjie Zhan, and Hongsheng Li. Mathcoder: Seamless code integration in llms for enhanced mathematical reasoning. arXiv preprint arXiv:2310.03731, 2023

  44. [44]

    CANDLE: Iterative conceptualization and instantiation distillation from large language models for commonsense reasoning

    Weiqi Wang, Tianqing Fang, Chunyang Li, Haochen Shi, Wenxuan Ding, Baixuan Xu, Zhaowei Wang, Jiaxin Bai, Xin Liu, Cheng Jiayang, Chunkit Chan, and Yangqiu Song. CANDLE: Iterative conceptualization and instantiation distillation from large language models for commonsense reasoning. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of th...

  45. [45]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022

  46. [46]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

  47. [47]

    Aristotle: Mastering logical reasoning with a logic-complete decompose-search-resolve framework

    Jundong Xu, Hao Fei, Meng Luo, Qian Liu, Liangming Pan, William Yang Wang, Preslav Nakov, Mong-Li Lee, and Wynne Hsu. Aristotle: Mastering logical reasoning with a logic-complete decompose-search-resolve framework. arXiv preprint arXiv:2412.16953, 2024a

  48. [48]

    Faithful logical reasoning via symbolic chain-of-thought

    Jundong Xu, Hao Fei, Liangming Pan, Qian Liu, Mong-Li Lee, and Wynne Hsu. Faithful logical reasoning via symbolic chain-of-thought. arXiv preprint arXiv:2405.18357, 2024b

  49. [49]

    An Yang, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoyan Huang, Jiandong Jiang, Jianhong Tu, Jianwei Zhang, Jingren Zhou, et al. Qwen2.5-1M technical report. arXiv preprint arXiv:2501.15383, 2025

  50. [50]

    Harnessing the power of large language models for natural language to first-order logic translation

    Yuan Yang, Siheng Xiong, Ali Payani, Ehsan Shareghi, and Faramarz Fekri. Harnessing the power of large language models for natural language to first-order logic translation. arXiv preprint arXiv:2305.15541, 2023

  51. [51]

    Tree of thoughts: Deliberate problem solving with large language models

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems, 36:11809--11822, 2023

  52. [52]

    Mvp: Winning solution to smp challenge 2025 video track

    Liliang Ye, Yunyao Zhang, Yafeng Wu, Yi-Ping Phoebe Chen, Junqing Yu, Wei Yang, and Zikai Song. Mvp: Winning solution to smp challenge 2025 video track. arXiv preprint arXiv:2507.00950, 2025

  53. [53]

    Why prompt design matters and works: A complexity analysis of prompt search space in llms

    Xiang Zhang, Juntai Cao, Jiaqi Wei, Chenyu You, and Dujian Ding. Why prompt design matters and works: A complexity analysis of prompt search space in llms. arXiv preprint arXiv:2503.10084, 2025a

  54. [54]

    GA-S^3: Comprehensive social network simulation with group agents

    Yunyao Zhang, Zikai Song, Hang Zhou, Wenfeng Ren, Yi-Ping Phoebe Chen, Junqing Yu, and Wei Yang. GA-S^3: Comprehensive social network simulation with group agents. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.), Findings of the Association for Computational Linguistics: ACL 2025, pp. 8950--8970, Vienna, Austria, Ju...

  55. [55]

    Semantics-aware bert for language understanding

    Zhuosheng Zhang, Yuwei Wu, Hai Zhao, Zuchao Li, Shuailiang Zhang, Xi Zhou, and Xiang Zhou. Semantics-aware bert for language understanding. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp. 9628--9635, 2020

  56. [56]

    Explicit planning helps language models in logical reasoning

    Hongyu Zhao, Kangrui Wang, Mo Yu, and Hongyuan Mei. Explicit planning helps language models in logical reasoning. arXiv preprint arXiv:2303.15714, 2023a

  57. [57]

    A Survey of Large Language Models

    Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 1(2), 2023b

  58. [58]

    Exploring the role of reasoning structures for constructing proofs in multi-step natural language reasoning with large language models

    Zi'ou Zheng, Christopher Malon, Martin Renqiang Min, and Xiaodan Zhu. Exploring the role of reasoning structures for constructing proofs in multi-step natural language reasoning with large language models. arXiv preprint arXiv:2410.08436, 2024

  59. [59]

    Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

    Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022
