”Evaluating the logical reasoning ability of chatgpt and gpt-4.” arXiv preprint arXiv:2304.03439 (2023)

Hanmeng Liu, Ruoxi Ning, Zhiyang Teng, Jian Liu, Qiji Zhou, Yue Zhang · 2023 · arXiv 2304.03439

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it

representative citing papers

Learning to Ask: When LLM Agents Meet Unclear Instruction

cs.CL · 2024-08-31 · unverdicted · novelty 6.0

Introduces NoisyToolBench benchmark and Ask-when-Needed framework to improve LLM tool-use performance when user instructions are unclear or incomplete.

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

cs.SE · 2024-03-12 · unverdicted · novelty 6.0

LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

Novelty-based Tree-of-Thought Search for LLM Reasoning and Planning

cs.AI · 2026-05-07 · unverdicted · novelty 5.0

Novelty estimation via LLM prompts enables pruning in Tree-of-Thought search, reducing overall token usage on language planning benchmarks.

Generalizing Numerical Reasoning in Table Data through Operation Sketches and Self-Supervised Learning

cs.LG · 2026-04-23 · unverdicted · novelty 5.0

TaNOS decouples table semantics from numerical structure via anonymization, sketches, and program-first self-supervision, yielding 80.13% FinQA accuracy with 10% data and near-zero cross-domain gap versus over 10pp for standard fine-tuning.

A Goal-Oriented Chatbot for Engaging the Elderly Through Family Photo Conversations

cs.HC · 2026-02-10 · unverdicted · novelty 4.0

A goal-oriented chatbot uses family photos to generate W-questions and open prompts for elderly reminiscence, analyzes topics to suggest follow-up photos, and supplies caregivers with conversation insights via a web portal.

Grounding Multi-Hop Reasoning in Structural Causal Models via Group Relative Policy Optimization

cs.AI · 2026-05-02

citing papers explorer

Showing 1 of 1 citing paper after filters.

Learning to Ask: When LLM Agents Meet Unclear Instruction cs.CL · 2024-08-31 · unverdicted · none · ref 9
Introduces NoisyToolBench benchmark and Ask-when-Needed framework to improve LLM tool-use performance when user instructions are unclear or incomplete.

”Evaluating the logical reasoning ability of chatgpt and gpt-4.” arXiv preprint arXiv:2304.03439 (2023)

fields

years

verdicts

representative citing papers

citing papers explorer