FregeLogic at SemEval 2026 Task 11: A Hybrid Neuro-Symbolic Architecture for Content-Robust Syllogistic Validity Prediction
Pith reviewed 2026-05-10 05:11 UTC · model grok-4.3
The pith
A hybrid of language model ensembles and a formal logic solver improves syllogism validity judgments by routing disagreements to exact verification.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that deferring to formal verification on cases where an ensemble of five language model classifiers disagrees achieves 94.3 percent accuracy, a content effect of 2.85, and a combined score of 41.88 in nested 5-fold cross-validation. That is a 2.76-point improvement in combined score over the pure ensemble baseline of 39.12. The gain comes mainly from a 16 percent reduction in content effect (3.39 to 2.85); accuracy rises by only 0.9 percent.
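The two derived figures in this claim can be re-checked from the reported numbers alone. The snippet below is only an arithmetic sanity check; the variable names are illustrative and nothing here reproduces the paper's evaluation.

```python
# Figures as reported in the paper's abstract.
ensemble_combined = 39.12
hybrid_combined = 41.88
ensemble_content_effect = 3.39
hybrid_content_effect = 2.85

# Absolute gain in combined score, in points.
combined_gain = hybrid_combined - ensemble_combined

# Relative reduction in content effect, as a fraction of the ensemble value.
relative_ce_reduction = (
    ensemble_content_effect - hybrid_content_effect
) / ensemble_content_effect

print(f"combined-score gain: {combined_gain:.2f} points")
print(f"content-effect reduction: {relative_ce_reduction:.1%}")
```

The relative reduction comes out at about 15.9 percent, which the paper rounds to 16 percent.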
What carries the argument
A hybrid neuro-symbolic architecture in which disagreement among the language model classifiers triggers a formal logic solver, whose verdict breaks the tie on disputed validity predictions.
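A minimal sketch of this routing idea follows. It rests on assumptions the summary does not confirm: "disagreement" is taken to mean any non-unanimous vote, and the ensemble majority is used as a fallback when the solver returns no verdict. The function name `route` and the `solver` callback are hypothetical, not the authors' code.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def route(votes: Sequence[bool], solver: Callable[[], Optional[bool]]) -> bool:
    """Return the ensemble verdict when unanimous, otherwise defer to the solver.

    Assumption: any split vote counts as disagreement.
    Assumption: if the solver yields None (no verdict), fall back to the majority.
    """
    if not votes:
        raise ValueError("at least one classifier vote is required")
    majority = Counter(votes).most_common(1)[0][0]
    if len(set(votes)) == 1:
        return majority  # unanimous ensemble: the solver is never called
    verdict = solver()
    return majority if verdict is None else verdict


# Unanimous votes bypass the solver; a 3-2 split defers to it.
print(route([True] * 5, solver=lambda: False))
print(route([True, True, True, False, False], solver=lambda: False))
```

Passing the solver as a callback means the formal check only runs on disputed items, which matches the low-overhead framing of the approach.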
Load-bearing premise
Disagreement among the language models reliably marks content-biased errors that the formal solver will correct without introducing new mistakes.
What would settle it
If applying the formal solver to the disagreement cases produces lower accuracy on those items than the ensemble majority vote, the claimed benefit of the hybrid routing would not hold.
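The check described here could be sketched as below, assuming per-item records of the five classifier votes, the solver verdict, and the gold label are available. The `Item` record, the function name, and the toy data are hypothetical and exist only for illustration; they are not results from the paper.

```python
from typing import List, NamedTuple, Optional, Tuple


class Item(NamedTuple):
    votes: Tuple[bool, ...]  # one validity vote per classifier
    solver: bool             # verdict of the formal solver
    gold: bool               # ground-truth validity label


def compare_on_disputed(
    items: List[Item],
) -> Tuple[int, Optional[float], Optional[float]]:
    """Return (count, majority accuracy, solver accuracy) on non-unanimous items."""
    disputed = [it for it in items if len(set(it.votes)) > 1]
    n = len(disputed)
    if n == 0:
        return 0, None, None
    majority_hits = sum(
        (2 * sum(it.votes) > len(it.votes)) == it.gold for it in disputed
    )
    solver_hits = sum(it.solver == it.gold for it in disputed)
    return n, majority_hits / n, solver_hits / n


# Made-up toy records, for illustration only.
toy = [
    Item((True, True, True, True, True), True, True),       # unanimous, skipped
    Item((True, True, True, False, False), False, False),   # majority wrong, solver right
    Item((True, True, False, False, False), False, False),  # both right
    Item((True, True, True, True, False), False, True),     # majority right, solver wrong
]
n, majority_acc, solver_acc = compare_on_disputed(toy)
print(n, majority_acc, solver_acc)
```

The hybrid routing is only supported if the solver accuracy on the disputed subset exceeds the majority-vote accuracy on the same items.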
Original abstract
We present FregeLogic, a hybrid neuro-symbolic system for SemEval-2026 Task 11 (Subtask 1), which addresses syllogistic validity prediction while reducing content effects on predictions. Our approach combines an ensemble of five LLM classifiers, spanning three open-weights models (Llama 4 Maverick, Llama 4 Scout, and Qwen3-32B) paired with varied prompting strategies, with a Z3 SMT solver that serves as a formal logic tiebreaker. The central hypothesis is that LLM disagreement within the ensemble signals likely content-biased errors, where real-world believability interferes with logical judgment. By deferring to Z3's structurally-grounded formal verification on these disputed cases, our system achieves 94.3% accuracy with a content effect of 2.85 and a combined score of 41.88 in nested 5-fold cross-validation on the dataset (N=960). This represents a 2.76-point improvement in combined score over the pure ensemble (39.12), with a 0.9% accuracy gain, driven by a 16% reduction in content effect (3.39 to 2.85). Adopting structured-output API calls for Z3 extraction reduced failure rates from ~22% to near zero, and an Aristotelian encoding with existence axioms was validated against task annotations. Our results suggest that targeted neuro-symbolic integration, applying formal methods precisely where ensemble consensus is lowest, can improve the combined accuracy-plus-content-effect metric used by this task.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents FregeLogic, a hybrid neuro-symbolic system for SemEval-2026 Task 11 Subtask 1 on syllogistic validity prediction. It ensembles five LLM classifiers, built from three open-weights models (Llama 4 Maverick, Llama 4 Scout, and Qwen3-32B) paired with varied prompts, and uses a Z3 SMT solver as a formal tiebreaker exclusively on ensemble disagreement cases, under the hypothesis that such disagreements flag content-biased errors. On the N=960 dataset via nested 5-fold cross-validation, the system reports 94.3% accuracy, a content effect of 2.85, and a combined score of 41.88, a 2.76-point gain over the pure ensemble (39.12) with a 0.9% accuracy increase and a 16% content-effect reduction. Structured-output API calls for Z3 extraction and an Aristotelian encoding with existence axioms are also described.
Significance. If the central hypothesis holds, the work offers a targeted, low-overhead neuro-symbolic pattern that applies formal verification only where neural consensus is weakest, yielding measurable gains on a task metric that jointly penalizes inaccuracy and content sensitivity. The nested CV protocol and near-zero Z3 failure rate after structured outputs are practical strengths. The result would support broader exploration of selective symbolic intervention in reasoning tasks prone to believability biases, provided the gains can be causally attributed to Z3 rather than ancillary factors.
major comments (2)
- [Results section] Results section (and abstract): the 0.9% accuracy lift and 16% content-effect drop are reported only in aggregate (94.3% accuracy, 2.85 content effect, 41.88 combined score). No table, figure, or subsection isolates the LLM-disagreement subset to report Z3's accuracy against ground truth on those exact instances, nor compares it to the accuracy the ensemble majority vote would have achieved on the same cases. Without this breakdown the causal claim that Z3 is correcting content-biased errors cannot be verified and the improvement could stem from prompting variation or sampling.
- [Methodology section] Methodology section: the exact prompting templates and variation strategies for the five LLMs are described at a high level only. Because the central hypothesis ties improvement to disagreement patterns, the absence of reproducible prompt text prevents independent verification that the observed disagreement rate and subsequent Z3 corrections are not artifacts of particular prompt choices.
minor comments (3)
- [Abstract and Results] The precise formula or operational definition used to compute the 'content effect' metric (reported as 3.39 for the ensemble and 2.85 for the hybrid) is not stated in the abstract or results; a brief equation or reference would aid interpretation of the 16% reduction.
- [Results section] Error analysis is limited to aggregate numbers; a confusion matrix or case-level breakdown of the errors that remain after Z3 intervention would strengthen the presentation.
- [Figure 1] Figure 1 (system diagram) would benefit from explicit annotation of the disagreement-detection threshold and the exact Z3 encoding used for Aristotelian syllogisms.
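The breakdown requested in the second minor comment could start from a plain confusion matrix over the final predictions. The sketch below assumes binary validity labels. The function name, cell names, and toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Dict, Sequence


def confusion_matrix(gold: Sequence[bool], predicted: Sequence[bool]) -> Dict[str, int]:
    """Count each (gold, predicted) combination for binary validity labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    counts = Counter(zip(gold, predicted))
    return {
        "valid_predicted_valid": counts[(True, True)],
        "valid_predicted_invalid": counts[(True, False)],
        "invalid_predicted_valid": counts[(False, True)],
        "invalid_predicted_invalid": counts[(False, False)],
    }


# Made-up labels, for illustration only.
gold = [True, True, False, False, False]
predicted = [True, False, False, True, False]
matrix = confusion_matrix(gold, predicted)
print(matrix)
```

Splitting the same counts by whether the conclusion is believable would show which cell the remaining content-biased errors fall into, provided the dataset exposes such a field.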
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We agree that the manuscript would benefit from greater transparency on both the disagreement-subset performance and the exact prompting strategies to better support the central hypothesis. We address each major comment below and will incorporate the suggested revisions.
- Referee: [Results section] Results section (and abstract): the 0.9% accuracy lift and 16% content-effect drop are reported only in aggregate (94.3% accuracy, 2.85 content effect, 41.88 combined score). No table, figure, or subsection isolates the LLM-disagreement subset to report Z3's accuracy against ground truth on those exact instances, nor compares it to the accuracy the ensemble majority vote would have achieved on the same cases. Without this breakdown the causal claim that Z3 is correcting content-biased errors cannot be verified and the improvement could stem from prompting variation or sampling.
  Authors: We acknowledge that the current aggregate reporting leaves the causal contribution of Z3 on disagreement cases unverified. We will add a new table in the Results section (and update the abstract accordingly) that isolates the disagreement subset from the nested 5-fold CV, reporting (i) the number of such instances, (ii) ensemble majority-vote accuracy on them, and (iii) Z3 accuracy on the same instances against ground truth. This will directly test whether Z3 improves over the ensemble on precisely those cases and help rule out alternative explanations such as prompting variation.
  Revision: yes
- Referee: [Methodology section] Methodology section: the exact prompting templates and variation strategies for the five LLMs are described at a high level only. Because the central hypothesis ties improvement to disagreement patterns, the absence of reproducible prompt text prevents independent verification that the observed disagreement rate and subsequent Z3 corrections are not artifacts of particular prompt choices.
  Authors: We agree that full prompt text is required for reproducibility and for confirming that the observed disagreement patterns are not prompt-specific artifacts. In the revised manuscript we will include the complete, verbatim prompting templates for all five configurations (Llama 4 Maverick, Llama 4 Scout, and Qwen3-32B with their respective variations) in a dedicated appendix. This will enable readers to replicate the ensemble and independently assess the disagreement rates.
  Revision: yes
Circularity Check
No significant circularity; empirical CV metrics are independent of inputs
The paper reports aggregate accuracy, content-effect, and combined-score results from nested 5-fold cross-validation on the external SemEval-2026 Task 11 dataset (N=960). These metrics are computed against task-provided ground-truth annotations and the externally defined combined score; no equations, fitted parameters, or self-citations are invoked to derive the reported gains. The hybrid architecture (LLM ensemble + Z3 tiebreaker) is described procedurally, and the central hypothesis is evaluated by the observed deltas rather than being true by construction. No load-bearing step reduces to a self-definition, renamed fit, or author-prior ansatz.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the Z3 SMT solver, with an Aristotelian encoding and existence axioms, correctly captures the syllogistic validity judgments required by the task.
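To show what existence axioms change, the sketch below checks syllogistic validity by brute force over Venn-region models. This enumeration is a stand-in for the Z3 encoding, whose exact form is not given here. The sketch also assumes that the existence axioms require all three terms to be non-empty. The names `holds`, `is_valid`, and the statement tuples are hypothetical.

```python
from itertools import product

TERMS = ("S", "M", "P")
# A Venn region records membership in S, M and P, e.g. (True, False, True).
REGIONS = list(product([False, True], repeat=3))


def holds(stmt, inhabited):
    """Evaluate a categorical statement (form, X, Y) in a model.

    A model is the list of Venn regions that contain at least one individual.
    """
    form, x, y = stmt
    i, j = TERMS.index(x), TERMS.index(y)
    if form == "A":  # All X are Y
        return all(r[j] for r in inhabited if r[i])
    if form == "E":  # No X are Y
        return not any(r[i] and r[j] for r in inhabited)
    if form == "I":  # Some X are Y
        return any(r[i] and r[j] for r in inhabited)
    if form == "O":  # Some X are not Y
        return any(r[i] and not r[j] for r in inhabited)
    raise ValueError(f"unknown form: {form}")


def is_valid(premises, conclusion, existential_import=True):
    """Valid iff no model satisfies the premises while falsifying the conclusion."""
    for mask in product([False, True], repeat=len(REGIONS)):
        inhabited = [r for r, keep in zip(REGIONS, mask) if keep]
        if existential_import and not all(
            any(r[k] for r in inhabited) for k in range(len(TERMS))
        ):
            continue  # assumed existence axiom: every term is non-empty
        if all(holds(p, inhabited) for p in premises) and not holds(conclusion, inhabited):
            return False  # counterexample model found
    return True


# Barbara: All M are P, All S are M, therefore All S are P.
barbara = ([("A", "M", "P"), ("A", "S", "M")], ("A", "S", "P"))
# Darapti: All M are P, All M are S, therefore Some S are P.
darapti = ([("A", "M", "P"), ("A", "M", "S")], ("I", "S", "P"))

print(is_valid(*barbara))
print(is_valid(*darapti))
print(is_valid(*darapti, existential_import=False))
```

Darapti is valid only when the middle term is guaranteed to be non-empty, so whether the encoding includes existence axioms changes the label for such forms. That is why the match between the encoding and the task annotations is load-bearing.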