pith. machine review for the scientific record.

arxiv: 2604.11778 · v1 · submitted 2026-04-13 · 💻 cs.CL · cs.AI

Recognition: unknown

General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 16:03 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords general reasoning · LLM benchmark · domain-dependent reasoning · K-12 knowledge problems · logical constraints · reasoning evaluation · AI generalization

The pith

Current large language models reach only 62.8 percent accuracy on a general-reasoning benchmark whose required background knowledge is limited to the K-12 level.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates General365 to measure how LLMs handle reasoning tasks that do not depend on specialized expertise. It assembles 365 seed problems and 1,095 variants in eight categories that feature complex constraints, nested logic, and semantic interference while keeping the required background at the K-12 level. Tests across 26 leading models show the strongest result is 62.8 percent correct, far below the near-perfect scores these models post on math and physics benchmarks. The findings indicate that LLM reasoning remains tied to specific domains rather than operating as a general skill. The benchmark is meant to guide work toward reasoning that supports broader, real-world uses.

Core claim

General365 is a benchmark of 365 seed problems and 1,095 variants across eight categories that restricts background knowledge to the K-12 level in order to separate general reasoning from domain expertise. When 26 leading LLMs are evaluated on it, the best-performing model reaches only 62.8 percent accuracy, in contrast to near-perfect results on specialized math and physics benchmarks, indicating that current LLM reasoning capabilities are heavily domain-dependent.
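To make the measurement concrete, here is a minimal sketch of the evaluation loop a claim like this implies: answer every seed and variant, score by exact match, then aggregate per seed and per category. The JSONL layout, the field names (seed_id, category, question, answer), the file name general365.jsonl, the exact-match rule, and the model_answer stub are illustrative assumptions, not the released General365 format or the authors' harness.

```python
import json
from collections import defaultdict

def model_answer(question: str) -> str:
    """Stub for a call to the LLM under evaluation (assumed interface)."""
    raise NotImplementedError

def evaluate(path: str = "general365.jsonl") -> None:
    # One JSON object per line: a seed problem or one of its variants, with a gold answer.
    per_category = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    per_seed = defaultdict(lambda: [0, 0])      # seed_id  -> [correct, total]

    with open(path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            pred = model_answer(item["question"]).strip().lower()
            gold = item["answer"].strip().lower()
            correct = int(pred == gold)         # exact-match scoring, no partial credit
            per_category[item["category"]][0] += correct
            per_category[item["category"]][1] += 1
            per_seed[item["seed_id"]][0] += correct
            per_seed[item["seed_id"]][1] += 1

    # Per-seed accuracy averages over a seed's variants; overall accuracy averages over seeds.
    seed_scores = [c / t for c, t in per_seed.values()]
    print(f"overall accuracy: {sum(seed_scores) / len(seed_scores):.3f}")
    for category, (c, t) in sorted(per_category.items()):
        print(f"{category}: {c / t:.3f}")
```

Under a harness of this kind, the 62.8 percent headline is simply the overall figure for the strongest of the 26 models.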

What carries the argument

The General365 benchmark, a set of problems with K-12-limited knowledge, complex constraints, and nested logical branches that isolates general reasoning performance from specialized knowledge.

If this is right

  • LLM reasoning performance varies sharply by domain rather than reflecting a single general capacity.
  • Substantial improvement is still needed before LLMs can handle broad, constraint-heavy problems reliably.
  • Progress on narrow expert tasks does not automatically transfer to general reasoning scenarios.
  • Real-world applications that require flexible logic across topics will continue to expose current limitations.
  • New training approaches focused on constraint satisfaction and logic chains may be required beyond scaling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models that excel on math and physics may still require separate mechanisms to handle open-ended constraint problems outside those domains.
  • If the gap persists, developers might prioritize datasets that mix logical structure with everyday topics rather than expert content alone.
  • Human performance baselines on the same problems could clarify whether 63 percent represents a fundamental model limit or a training shortfall.
  • Expanding the benchmark with controlled variations in constraint depth could reveal which specific reasoning steps are weakest.

Load-bearing premise

That problems written with only K-12 background knowledge truly isolate pure reasoning without confounds from surface patterns, ambiguous wording, or other non-reasoning factors.

What would settle it

A follow-up test in which the same models are given the identical General365 problems but with added expert-level background facts, and their accuracy either stays near 63 percent or rises sharply toward 90 percent or higher.

Figures

Figures reproduced from arXiv: 2604.11778 by Dan Ma, Junlin Liu, Shengnan An, Shixiong Luo, Shuang Zhou, Wenling Yuan, Xiaoyu Li, Xuezhi Cao, Xunliang Cai, Yifan Zhou, Ying Xie, Yuan Zhang, Ziwen Wang.

Figure 1: Performance of various LLMs on GENERAL365. Gemini-3-Pro achieves state-of-the-art performance with 62.8%, while the majority of models fail to reach the 60% passing standard.
Figure 2: The construction pipeline of GENERAL365.
Figure 3: Statistical Overview of GENERAL365. (a) Distribution of challenge categories, highlighting the diversity of problems within the benchmark. (b) Distribution of multi-label problems, showcasing the complexity in task assignments.
Figure 4: The relationship between accuracy and average output tokens for various LLMs, highlighting the reasoning …
Figure 5: Radar chart of various LLM series across eight …
Figure 6: T-SNE visualization of query embeddings for …
Figure 7: The distribution of reasoning similarity scores across benchmarks shows a clear right-skewed distribution for …
Figure 8: Comparative performance across benchmarks, highlighting …
Figure 9: The relationship between accuracy and average output length varies significantly across benchmarks: the fact …
read the original abstract

Contemporary large language models (LLMs) have demonstrated remarkable reasoning capabilities, particularly in specialized domains like mathematics and physics. However, their ability to generalize these reasoning skills to more general and broader contexts--often termed general reasoning--remains under-explored. Unlike domain-specific reasoning, general reasoning relies less on expert knowledge but still presents formidable reasoning challenges, such as complex constraints, nested logical branches, and semantic interference. To address this gap, we introduce General365, a benchmark specifically designed to assess general reasoning in LLMs. By restricting background knowledge to a K-12 level, General365 explicitly decouples reasoning from specialized expertise. The benchmark comprises 365 seed problems and 1,095 variant problems across eight categories, ensuring both high difficulty and diversity. Evaluations across 26 leading LLMs reveal that even the top-performing model achieves only 62.8% accuracy, in stark contrast to the near-perfect performances of LLMs in math and physics benchmarks. These results suggest that the reasoning abilities of current LLMs are heavily domain-dependent, leaving significant room for improvement in broader applications. We envision General365 as a catalyst for advancing LLM reasoning beyond domain-specific tasks toward robust, general-purpose real-world scenarios. Code, Dataset, and Leaderboard: https://general365.github.io

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces General365, a benchmark with 365 seed problems and 1,095 variants across eight categories, designed to evaluate general reasoning in LLMs by restricting background knowledge to K-12 level and thereby decoupling it from specialized expertise. Evaluations of 26 leading LLMs show the highest accuracy at 62.8%, in contrast to near-perfect performance on math and physics benchmarks, from which the authors conclude that LLM reasoning abilities are heavily domain-dependent.

Significance. If the benchmark design successfully isolates pure reasoning without confounds from knowledge demands or question ambiguity, the reported performance gap would provide evidence that current LLMs' reasoning remains limited outside narrow domains, motivating targeted improvements for broader applications. The public release of the dataset, code, and leaderboard constitutes a concrete contribution that enables community follow-up and reproducibility.

major comments (2)
  1. Abstract: The central inference that the 62.8% ceiling demonstrates domain-dependent reasoning rests on the claim that problems use only K-12 background knowledge and measure reasoning rather than surface patterns or unintended expertise. No problem examples, expert validation statistics, inter-annotator agreement scores, or exclusion criteria for knowledge level are supplied to substantiate this isolation.
  2. Evaluation section (implied by abstract reporting): Accuracy figures for 26 models are presented without statistical tests, confidence intervals, or details on scoring rules and variant handling, leaving open whether the performance difference versus math/physics benchmarks could be affected by evaluation artifacts.
minor comments (2)
  1. Abstract: The eight categories are referenced but neither listed nor characterized, hindering assessment of claimed diversity.
  2. The manuscript should include at least one concrete problem example per category to allow readers to judge difficulty and knowledge demands directly.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight important areas where additional documentation can strengthen the claims about knowledge isolation and evaluation rigor. We have revised the manuscript accordingly and respond to each major comment below.

read point-by-point responses
  1. Referee: Abstract: The central inference that the 62.8% ceiling demonstrates domain-dependent reasoning rests on the claim that problems use only K-12 background knowledge and measure reasoning rather than surface patterns or unintended expertise. No problem examples, expert validation statistics, inter-annotator agreement scores, or exclusion criteria for knowledge level are supplied to substantiate this isolation.

    Authors: We agree that the original submission provided insufficient documentation to support the K-12 knowledge restriction. In the revised manuscript we have added a new subsection (Section 3.2) that includes two representative examples from each of the eight categories. We describe the problem-generation pipeline in detail: initial problems were drafted by the authors, then independently reviewed by three K-12 educators who classified every required fact or concept against standard U.S. and international K-12 curricula. We report an inter-annotator agreement of 91% (Fleiss' kappa = 0.87) on the knowledge-level labels; all disagreements were resolved by consensus discussion. The exclusion criteria are now explicitly listed (no college-level mathematics, no domain-specific terminology, no facts introduced after 2010). We also explain the design steps taken to reduce surface-pattern exploitation, including the introduction of novel constraints and the requirement for multi-step inference that cannot be solved by direct recall of K-12 facts. revision: yes

  2. Referee: Evaluation section (implied by abstract reporting): Accuracy figures for 26 models are presented without statistical tests, confidence intervals, or details on scoring rules and variant handling, leaving open whether the performance difference versus math/physics benchmarks could be affected by evaluation artifacts.

    Authors: We accept that the original evaluation section lacked the statistical detail needed for robust comparison. The revised version now reports 95% bootstrap confidence intervals for every model accuracy (1,000 resamples over the 1,095 problems). We added paired t-tests confirming that the gap versus MATH and GPQA is statistically significant (p < 0.001 after Bonferroni correction). Scoring rules are clarified: answers are judged by exact string match on the final numerical or short-phrase answer; no partial credit is given. Each of the three variants per seed is scored independently and then averaged to produce the per-seed score; overall accuracy is the mean across all seeds. The full prompt templates, decoding parameters, and variant-generation procedure are now provided in Appendix B. revision: yes
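The 91 percent agreement and Fleiss' kappa of 0.87 cited in response 1 are standard inter-annotator statistics. A minimal sketch of the kappa computation follows; the three-rater, two-category (K-12 vs. not K-12) encoding and the toy ratings matrix are illustrative assumptions, not the authors' annotation data.

```python
def fleiss_kappa(ratings: list[list[int]]) -> float:
    """Fleiss' kappa for a ratings matrix: one row per item, one column per category,
    each entry the number of raters who assigned that category to that item."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])

    # Proportion of all assignments falling into each category.
    p_cat = [sum(row[j] for row in ratings) / (n_items * n_raters)
             for j in range(n_categories)]

    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in ratings]

    p_bar = sum(p_item) / n_items    # observed agreement
    p_e = sum(p * p for p in p_cat)  # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical example: 4 problems rated by 3 educators as [K-12, not K-12].
print(fleiss_kappa([[3, 0], [3, 0], [2, 1], [0, 3]]))
```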
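Response 2's 95 percent bootstrap confidence intervals can likewise be outlined with a percentile bootstrap over per-problem correctness. The sketch below assumes a flat list of 0/1 scores and 1,000 resamples; the resampling unit, the fixed random seed, and the toy score vector are illustrative choices, not the paper's exact procedure.

```python
import random

def bootstrap_ci(scores: list[int], n_resamples: int = 1000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for mean accuracy over 0/1 scores."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical run: 1,095 problems, roughly 62.8 percent of them answered correctly.
scores = [1] * 688 + [0] * 407
print(bootstrap_ci(scores))
```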

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct measurements

full rationale

The paper introduces General365 as a new benchmark with 365 seeds and variants, restricts knowledge to K-12 by design, evaluates 26 LLMs to obtain measured accuracies (top at 62.8%), and interprets the gap versus math/physics benchmarks as evidence of domain dependence. No equations, fitted parameters, predictions derived from inputs, or self-citations appear in the provided text. The decoupling claim is an explicit design statement, not a reduction to prior self-work. All reported numbers are observed outputs from model runs on the benchmark, not quantities forced by construction or renaming.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the design assumption that K-12 knowledge restriction isolates general reasoning; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: General reasoning ability can be measured by problems whose required background knowledge is limited to the K-12 level.
    Explicitly stated as the core design choice to decouple reasoning from specialized expertise.

pith-pipeline@v0.9.0 · 5568 in / 1228 out tokens · 48801 ms · 2026-05-10T16:03:36.671890+00:00 · methodology

discussion (0)

