Evaluating the Logical Reasoning Ability of ChatGPT and GPT-4

Hanmeng Liu, Ruoxi Ning, Zhiyang Teng, Jian Liu, Qiji Zhou, Yue Zhang · 2023 · arXiv 2304.03439

6 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

Learning to Ask: When LLM Agents Meet Unclear Instruction

cs.CL · 2024-08-31 · unverdicted · novelty 6.0

Introduces NoisyToolBench benchmark and Ask-when-Needed framework to improve LLM tool-use performance when user instructions are unclear or incomplete.

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

cs.SE · 2024-03-12 · unverdicted · novelty 6.0

LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

Novelty-based Tree-of-Thought Search for LLM Reasoning and Planning

cs.AI · 2026-05-07 · unverdicted · novelty 5.0

Novelty estimation via LLM prompts enables pruning in Tree-of-Thought search, reducing overall token usage on language planning benchmarks.

Grounding Multi-Hop Reasoning in Structural Causal Models via Group Relative Policy Optimization

cs.AI · 2026-05-02 · unverdicted · novelty 5.0

SCM-GRPO grounds multi-hop fact verification in structural causal models and applies GRPO reinforcement learning to optimize reasoning chain length, outperforming baselines on HoVer and EX-FEVER.

Generalizing Numerical Reasoning in Table Data through Operation Sketches and Self-Supervised Learning

cs.LG · 2026-04-23 · unverdicted · novelty 5.0

TaNOS decouples table semantics from numerical structure via anonymization, sketches, and program-first self-supervision, yielding 80.13% FinQA accuracy with 10% data and near-zero cross-domain gap versus over 10pp for standard fine-tuning.

A Goal-Oriented Chatbot for Engaging the Elderly Through Family Photo Conversations

cs.HC · 2026-02-10 · unverdicted · novelty 4.0

A goal-oriented chatbot uses family photos to generate W-questions and open prompts for elderly reminiscence, analyzes topics to suggest follow-up photos, and supplies caregivers with conversation insights via a web portal.

citing papers explorer

Showing 6 of 6 citing papers.

Learning to Ask: When LLM Agents Meet Unclear Instruction · cs.CL · 2024-08-31 · unverdicted · none · ref 9
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code · cs.SE · 2024-03-12 · unverdicted · none · ref 276
Novelty-based Tree-of-Thought Search for LLM Reasoning and Planning · cs.AI · 2026-05-07 · unverdicted · none · ref 5
Grounding Multi-Hop Reasoning in Structural Causal Models via Group Relative Policy Optimization · cs.AI · 2026-05-02 · unverdicted · none · ref 21
Generalizing Numerical Reasoning in Table Data through Operation Sketches and Self-Supervised Learning · cs.LG · 2026-04-23 · unverdicted · none · ref 4
A Goal-Oriented Chatbot for Engaging the Elderly Through Family Photo Conversations · cs.HC · 2026-02-10 · unverdicted · none · ref 13
