pith. machine review for the scientific record.

arxiv: 2604.06196 · v1 · submitted 2026-03-12 · 💻 cs.CL · cs.AI · cs.LO

Recognition: no theorem link

Consistency-Guided Decoding with Proof-Driven Disambiguation for Three-Way Logical Question Answering

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 11:24 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LO
keywords three-way logical QA · consistency-guided decoding · negation consistency · proof-driven disambiguation · FOLIO benchmark · large language models · Unknown predictions · first-order logic

The pith

A consistency layer that checks answers to a hypothesis and its negation, plus targeted binary probes, lifts three-way logic QA accuracy by up to 16 percent relative to the base model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models often give contradictory labels when asked about a statement and its direct negation, or they default to Unknown even when the premises logically support one side. The paper introduces CGD-PD, a test-time procedure that runs the same three-way classifier on both the hypothesis and its mechanical negation, forces the pair into a consistent label when possible, and then uses a handful of binary entailment queries to resolve lingering Unknown cases. If the method works as described, it produces higher accuracy and fewer undecided answers on first-order logic problems without any retraining of the underlying model. A sympathetic reader would care because the fixes target two concrete, recurring failure modes in logical reasoning benchmarks while adding only four or five extra model calls on average.

Core claim

The paper claims that projecting the outputs of a 3-way classifier on a hypothesis H and its mechanical negation onto a negation-consistent decision, followed by selective binary entailment probes to resolve remaining Unknown outcomes, produces consistent accuracy gains of up to 16 percent relative to the base model on the first-order-logic fields of the FOLIO benchmark while also lowering the rate of Unknown predictions.

What carries the argument

CGD-PD, the test-time layer that queries a 3-way classifier on both H and its negation, projects the pair to a consistent label when possible, and invokes proof-driven binary entailment probes for disambiguation.
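The projection step can be made concrete. Below is a minimal sketch of the negation-consistent projection over the paper's three-label scheme (True/False/Unknown); the `project` function and its tie-breaking rules are one reading of the described behavior, not the authors' implementation.

```python
# Deterministic label mapping under negation: True <-> False, Unknown <-> Unknown.
NEGATE = {"True": "False", "False": "True", "Unknown": "Unknown"}

def project(label_h: str, label_not_h: str) -> str:
    """Project the (H, ~H) label pair onto a negation-consistent decision.

    Returns the final label for H, or "Unknown" when the pair is
    irreconcilable and must fall through to the binary entailment probes.
    """
    mirrored = NEGATE[label_not_h]  # what the answer on ~H implies for H
    if label_h == mirrored:
        return label_h              # pair is already consistent
    if label_h == "Unknown":
        return mirrored             # trust the decisive answer on ~H
    if label_not_h == "Unknown":
        return label_h              # trust the decisive answer on H
    return "Unknown"                # direct contradiction: defer to probes
```

Read as a decision table: (True, False) and (False, True) pass through unchanged, an Unknown paired with a decisive label inherits the decisive side, and (True, True) or (False, False) are contradictions left for the probe stage.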

If this is right

  • Frontier LLMs become more reliable on three-way logical tasks by avoiding contradictory answers to opposite statements.
  • The rate of Unknown predictions drops, so more questions receive decisive True or False labels when the premises support one side.
  • The gains appear across multiple large language models without any change to their training.
  • Average inference cost stays low at roughly four to five model calls per instance.
  • The same consistency-plus-probe pattern could apply to other three-way classification settings that involve negation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same negation-consistency check might reduce errors in domains such as common-sense reasoning where opposite statements are easy to generate.
  • Post-hoc correction of this kind can compensate for gaps in how LLMs internalize logical negation rules from training data.
  • Combining the layer with chain-of-thought prompting could further improve the quality of the binary entailment probes.
  • Evaluating the method on logical problems drawn from sources other than FOLIO would test whether the observed gains generalize.

Load-bearing premise

Mechanically negating the hypothesis preserves its logical structure without artifacts, and the targeted binary entailment probes can reliably resolve Unknown cases without introducing new inconsistencies or selection bias.
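To see why this premise is load-bearing, consider a deliberately naive surface negation. The `negate` helper below is a hypothetical sketch, not the authors' procedure: it flips a copula but ignores quantifier scope, which is exactly where a mechanical negation can stop being a logical dual.

```python
def negate(hypothesis: str) -> str:
    """Surface-level negation: toggle the first copula, else wrap the sentence."""
    if " is not " in hypothesis:
        return hypothesis.replace(" is not ", " is ", 1)
    if " is " in hypothesis:
        return hypothesis.replace(" is ", " is not ", 1)
    return "It is not the case that " + hypothesis[0].lower() + hypothesis[1:]

# For an atomic claim the surface negation matches the logical dual:
#   "Rina is a student."  ->  "Rina is not a student."
# For a quantified claim it need not: the dual of "All students pass" is
# "Some student does not pass", not "All students do not pass"; this is
# the quantifier interaction the referee asks the authors to audit.
```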

What would settle it

A head-to-head evaluation on the FOLIO first-order logic subset in which CGD-PD produces no accuracy gain over the base model or leaves the rate of negation-inconsistent answers unchanged would falsify the central claim.

Figures

Figures reproduced from arXiv: 2604.06196 by Jiaheng Su, Ming Hou, Tianyi Huang, Yutong Zhang, Ziling Zhang.

Figure 1. Validation accuracy on FOLIO (FOL fields).

Figure 2. Validation Unknown rate on FOLIO (FOL fields). CGD-PD reduces Unknown predictions on both GPT-5.2 and Claude Sonnet 4.5. The paired bootstrap confidence intervals exclude zero for both models: for GPT-5.2, CGD-PD yields a +4.4 point accuracy gain (95% CI: +1.5 to +7.4) and a −3.4 point reduction in Unknown rate (95% CI: −6.4 to −0.5); for Claude Sonnet 4.5, accuracy improves by +6.9…

Figure 3. Row-normalized confusion matrices on the validation split. Each cell shows the row-normalized rate.
read the original abstract

Three-way logical question answering (QA) assigns $True/False/Unknown$ to a hypothesis $H$ given a premise set $S$. While modern large language models (LLMs) can be accurate on isolated examples, we identify two recurring failure modes in 3-way logic QA: (i) negation inconsistency, where answers to $H$ and $\neg H$ violate the deterministic label mapping, and (ii) epistemic $Unknown$, where the model predicts $Unknown$ due to uncertainty or instability even when $S$ entails one side. We present CGD-PD, a lightweight test-time layer that (a) queries a single 3-way classifier on both $H$ and a mechanically negated form of $H$, (b) projects the pair onto a negation-consistent decision when possible, and (c) invokes a proof-driven disambiguation step that uses targeted binary entailment probes to selectively resolve $Unknown$ outcomes, requiring only an average of 4-5 model calls. On the FOLIO benchmark's first-order-logic fields, CGD-PD yields consistent gains across frontier LLMs, with relative improvements in accuracy of up to 16% over the base model, while also reducing $Unknown$ predictions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces CGD-PD, a lightweight test-time method for three-way logical QA (True/False/Unknown) on a hypothesis H given premises S. It queries a base 3-way classifier on both H and a mechanically negated ~H, projects the pair to a negation-consistent label when possible, and invokes targeted binary entailment probes to resolve remaining Unknown cases, averaging 4-5 model calls. On FOLIO's first-order-logic fields, it reports consistent gains across frontier LLMs with relative accuracy improvements up to 16% and fewer Unknown predictions.

Significance. If the reported gains are confirmed with rigorous baselines and controls, CGD-PD would offer a practical, training-free layer for improving logical consistency and reducing epistemic uncertainty in LLM reasoning pipelines, with low overhead and applicability to existing models.

major comments (2)
  1. [Abstract] Abstract: The central claim of 'relative improvements in accuracy of up to 16%' is presented without absolute baseline accuracies for each model, standard deviations, statistical significance tests, or error breakdowns by category (e.g., negation inconsistency vs. epistemic Unknown), leaving the empirical support for the method's effectiveness only weakly substantiated.
  2. [Method] Consistency projection step (described in the method): The pipeline assumes that a surface-level mechanical negation of H produces a logically dual ~H whose entailment status w.r.t. S is exactly inverted; however, in FOL negation interacts with quantifiers and predicates, and no verification, counterexample analysis, or robustness check is provided to rule out systematic artifacts that could bias the projection and inflate apparent gains.
minor comments (2)
  1. [Abstract] The average number of model calls (4-5) is stated without a breakdown by dataset or model, which would help assess efficiency claims.
  2. [Method] A formal pseudocode or decision table for the projection from the (H, ~H) label pair to the final output would improve clarity of the consistency-guided step.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the empirical claims require stronger substantiation through absolute metrics and statistical analysis, and that the negation projection step would benefit from explicit robustness verification. We will revise the manuscript accordingly to address both points.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim of 'relative improvements in accuracy of up to 16%' is presented without absolute baseline accuracies for each model, standard deviations, statistical significance tests, or error breakdowns by category (e.g., negation inconsistency vs. epistemic Unknown), leaving the empirical support for the method's effectiveness only weakly substantiated.

    Authors: We agree that absolute baseline accuracies, standard deviations, significance tests, and error breakdowns are necessary to fully substantiate the claims. In the revised manuscript we will add a results table reporting per-model absolute accuracies (with and without CGD-PD), standard deviations across runs where applicable, p-values from paired significance tests, and a breakdown of remaining errors into negation-inconsistency versus epistemic-Unknown categories. These additions will be placed in the main results section and referenced from the abstract. revision: yes

  2. Referee: [Method] Consistency projection step (described in the method): The pipeline assumes that a surface-level mechanical negation of H produces a logically dual ~H whose entailment status w.r.t. S is exactly inverted; however, in FOL negation interacts with quantifiers and predicates, and no verification, counterexample analysis, or robustness check is provided to rule out systematic artifacts that could bias the projection and inflate apparent gains.

    Authors: We acknowledge that mechanical negation of a first-order hypothesis does not always yield a perfect logical dual when quantifiers or predicates interact in complex ways. While the FOLIO benchmark statements are constructed to support straightforward negation, we will add a dedicated robustness subsection that (i) samples 200 examples and manually verifies the logical duality of the generated ~H, (ii) reports the fraction of cases where the mechanical negation fails to invert entailment status, and (iii) measures the impact on final accuracy when those cases are excluded. This analysis will quantify any systematic bias and justify the projection step. revision: yes

Circularity Check

0 steps flagged

No circularity: method is algorithmic over external model calls

full rationale

The derivation consists of a test-time procedure that issues queries to a base 3-way classifier and binary entailment probes, then applies deterministic projection rules for consistency and disambiguation. No parameters are fitted to the FOLIO data, no predictions reduce to inputs by construction, and no self-citations supply load-bearing uniqueness theorems or ansatzes. The pipeline depends on external LLM outputs rather than redefining its own quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract introduces no free parameters, axioms, or invented entities beyond standard use of an existing LLM classifier and entailment probes.

pith-pipeline@v0.9.0 · 5534 in / 1069 out tokens · 44479 ms · 2026-05-15T11:24:35.750788+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages · 1 internal anchor

  1. [1] Steven Cho, Stefano Ruberto, and Valerio Terragni. Metamorphic testing of large language models for natural language processing. In 2025 IEEE International Conference on Software Maintenance and Evolution (ICSME), pages 174–186. IEEE.

  2. [2] Dorottya Demszky, Kelvin Guu, and Percy Liang. Transforming question answering datasets into natural language inference datasets. Preprint, arXiv:1809.02922.

  3. [3] Yonatan Geifman and Ran El-Yaniv. 2017. Selective classification for deep neural networks. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.