VERGE: Formal Refinement and Guidance Engine for Verifiable LLM Reasoning

Vikash Singh , Darion Cassel , Nathaniel Weir , Nick Feng , Sam Bayless

Authors on Pith no claims yet

classification 💻 cs.CL cs.AI

keywords logicalclaimsformalreasoningverificationanswersapproachconsensus

read the original abstract

Despite the syntactic fluency of Large Language Models (LLMs), ensuring their logical correctness in high-stakes domains remains a fundamental challenge. We present a neurosymbolic framework that combines LLMs with SMT solvers to produce verification-guided answers through iterative refinement. Our approach decomposes LLM outputs into atomic claims, autoformalizes them into first-order logic, and verifies their logical consistency using automated theorem proving. We introduce three key innovations: (1) multi-model consensus via formal semantic equivalence checking to ensure logic-level alignment between candidates, eliminating the syntactic bias of surface-form metrics, (2) semantic routing that directs different claim types to appropriate verification strategies: symbolic solvers for logical claims and LLM ensembles for commonsense reasoning, and (3) precise logical error localization via Minimal Correction Subsets (MCS), which pinpoint the exact subset of claims to revise, transforming binary failure signals into actionable feedback. Our framework classifies claims by their logical status and aggregates multiple verification signals into a unified score with variance-based penalty. The system iteratively refines answers using structured feedback until acceptance criteria are met or convergence is achieved. This hybrid approach delivers formal guarantees where possible and consensus verification elsewhere, advancing trustworthy AI. With the GPT-OSS-120B model, VERGE demonstrates an average performance uplift of 18.7% at convergence across a set of reasoning benchmarks compared to single-pass approaches.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TraceFix: Repairing Agent Coordination Protocols with TLA+ Counterexamples
cs.AI 2026-05 conditional novelty 7.0

TraceFix repairs LLM-generated multi-agent protocols via TLA+ counterexamples to achieve full verification on all tested tasks and higher completion rates than prompt-only baselines.
MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents
cs.CL 2026-05 unverdicted novelty 7.0

MANTRA automatically synthesizes SMT-validated compliance benchmarks for LLM agents from natural language manuals and tool schemas, producing 285 tasks across 6 domains with minimal human effort.
Reliability-Gated Source Anchoring for Continual Test-Time Adaptation
cs.LG 2026-05 unverdicted novelty 6.0

RMemSafe gates source anchoring via entropy in CTTA, reducing error by 1.05pp on ResNet-50 when source accuracy collapses and showing shallower degradation slope than prior methods.