ITLC at SemEval-2026 Task 11: Normalization and Deterministic Parsing for Formal Reasoning in LLMs

Joanito Agili Lopo; Muhammad Ravi Shulthan Habibi; Samuel Cahyawijaya; Tack Hwa Wong; Wicaksono Leksono Muhamad

arxiv: 2603.02676 · v2 · submitted 2026-03-03 · 💻 cs.CL · cs.AI


Pith reviewed 2026-05-15 17:40 UTC · model grok-4.3


keywords: content effects · syllogism normalization · deterministic parsing · LLM reasoning · multilingual benchmark · formal logic · SemEval task


The pith

Transforming syllogisms to canonical logical forms and applying deterministic parsing reduces content effects in LLM reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models frequently exhibit content effects that skew their reasoning on syllogisms, particularly in multilingual settings. The paper introduces a method that first abstracts these problems into standardized logical representations before using deterministic parsing to assess validity. This approach was tested on the SemEval-2026 Task 11 benchmark and secured top-five positions in all subtasks. It demonstrates a practical way to mitigate biases without relying on extensive model retraining or internal activation adjustments.

Core claim

By normalizing syllogisms into canonical logical representations and following with deterministic parsing, the method preserves logical validity while substantially diminishing content effects, achieving top-5 rankings across subtasks on the multilingual benchmark as a simpler alternative to fine-tuning.

What carries the argument

Normalization to canonical logical representations followed by deterministic parsing, which extracts structure from natural language input to determine validity without content influence.
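
The deterministic step can be pictured as a mood-and-figure lookup. The sketch below is an illustration only, not the authors' implementation: the names `Prop`, `VALID_MOODS`, `mood_and_figure`, and `is_valid` are hypothetical, and the table lists the 15 moods that are valid without existential import, which may differ from the table used in the paper.

```python
from typing import NamedTuple, Optional, Tuple


class Prop(NamedTuple):
    form: str       # "A", "E", "I" or "O"
    subject: str
    predicate: str


# Assumption: unconditionally valid moods per figure (no existential import).
VALID_MOODS = {
    1: {"AAA", "EAE", "AII", "EIO"},
    2: {"EAE", "AEE", "EIO", "AOO"},
    3: {"IAI", "AII", "OAO", "EIO"},
    4: {"AEE", "IAI", "EIO"},
}


def mood_and_figure(p1: Prop, p2: Prop, c: Prop) -> Optional[Tuple[str, int]]:
    """Return (mood, figure), or None if the input is not a three-term syllogism."""
    s, p = c.subject, c.predicate
    if s == p:
        return None
    # The major premise mentions the conclusion's predicate, the minor its subject.
    if p in p1[1:] and s in p2[1:]:
        major, minor = p1, p2
    elif p in p2[1:] and s in p1[1:]:
        major, minor = p2, p1
    else:
        return None
    middle = set(major[1:]) - {p}
    if len(middle) != 1 or set(minor[1:]) - {s} != middle:
        return None
    m = next(iter(middle))
    # The figure depends only on where the middle term sits in each premise.
    figure = {
        (True, False): 1,   # M-P, S-M
        (False, False): 2,  # P-M, S-M
        (True, True): 3,    # M-P, M-S
        (False, True): 4,   # P-M, M-S
    }[(major.subject == m, minor.subject == m)]
    return major.form + minor.form + c.form, figure


def is_valid(p1: Prop, p2: Prop, c: Prop) -> bool:
    found = mood_and_figure(p1, p2, c)
    return found is not None and found[0] in VALID_MOODS[found[1]]
```

Because the verdict depends only on the forms and the term positions, swapping "mammals" for a nonsense word cannot change it, which is the property the pipeline relies on.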

If this is right

The method delivers competitive performance on multilingual formal reasoning tasks.
It reduces reliance on complex fine-tuning for bias mitigation.
The approach can serve as an alternative to activation-level interventions in LLMs.
Logical validity is maintained through the abstraction and parsing steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar normalization techniques could extend to other types of logical reasoning problems.
Hybrid systems might combine LLMs for language understanding with deterministic parsers for logic evaluation.
Further tests on varied language sets could show how well the reduction in content effects holds.

Load-bearing premise

That mapping syllogisms to canonical logical representations and then parsing them deterministically will keep all logical information intact and eliminate content biases without creating new mistakes.

What would settle it

A test case where a syllogism is judged invalid by the method but is actually valid according to standard logic, or where content variations still lead to different validity judgments for equivalent structures.
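
One way to probe the second kind of failure is to group syllogisms by their abstract structure and flag any group that receives mixed verdicts. The sketch below assumes propositions are already given as `(form, subject, predicate)` triples; `abstract_signature`, `content_sensitive_groups`, and the toy `biased_judge` are hypothetical names used for illustration, not part of the paper.

```python
from collections import defaultdict


def abstract_signature(props):
    """Replace content terms with placeholders in order of first appearance."""
    names = {}
    signature = []
    for form, subj, pred in props:
        for term in (subj, pred):
            names.setdefault(term, f"T{len(names) + 1}")
        signature.append((form, names[subj], names[pred]))
    return tuple(signature)


def content_sensitive_groups(cases, judge):
    """Return signatures whose members received different verdicts from `judge`."""
    verdicts = defaultdict(set)
    for props in cases:
        verdicts[abstract_signature(props)].add(judge(props))
    return [sig for sig, seen in verdicts.items() if len(seen) > 1]


# Two syllogisms with the same structure but different content.
believable = [("A", "dogs", "mammals"), ("A", "mammals", "animals"), ("A", "dogs", "animals")]
unbelievable = [("A", "fish", "trees"), ("A", "trees", "rocks"), ("A", "fish", "rocks")]


def biased_judge(props):
    # Toy stand-in for a content-biased model: accepts only a familiar conclusion.
    return props[-1][2] == "animals"


flagged = content_sensitive_groups([believable, unbelievable], biased_judge)
```

Here `flagged` contains the shared signature, because the toy judge disagrees with itself on structurally identical inputs. The signature is sensitive to premise order, so a fuller check would canonicalize the order first.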

Figures

Figures reproduced from arXiv: 2603.02676 by Joanito Agili Lopo, Muhammad Ravi Shulthan Habibi, Samuel Cahyawijaya, Tack Hwa Wong, Wicaksono Leksono Muhamad.

**Figure 2.** Content-effect reduction in English-only and …

**Figure 3.** LLM-only prompt for retrieve the validity and relevant premise directly

**Figure 4.** Norm prompt for normalize sentences into standard categorical form

**Figure 5.** EPN prompt for Subtask 3 for extract subject term

**Figure 6.** EPN prompt for Subtask 4 for extract and filter relevant premise and conclusion

**Figure 7.** EPN prompt with google translated sentence for Subtask 4 for extract and filter relevant premise and conclusion

Original abstract

Large language models suffer from content effects in reasoning tasks, particularly in multi-lingual contexts. We introduce a novel method that reduces these biases through explicit structural abstraction that transforms syllogisms into canonical logical representations and applies deterministic parsing to determine validity. Evaluated on the SemEval-2026 Task 11 multilingual benchmark, our approach achieves top-5 rankings across all subtasks while substantially reducing content effects and offering a competitive alternative to complex fine-tuning or activation-level interventions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a straightforward shared-task system paper that applies normalization plus deterministic parsing to cut content effects on a multilingual syllogism benchmark and reports top-5 ranks, but the evidence is still thin.


The main takeaway is that the authors normalize syllogisms into canonical logical forms to strip surface content, then run deterministic parsing to judge validity. On the SemEval-2026 Task 11 multilingual benchmark this gets them top-5 across subtasks and they position it as a lighter alternative to fine-tuning or activation edits. That is the concrete contribution: a simple structural pipeline that works on an existing shared task without new theory or heavy compute. It is useful for anyone who needs a reproducible way to reduce content bias in formal reasoning across languages. The approach is honest about sticking to established techniques rather than claiming a breakthrough.

The soft spots are the missing numbers. The write-up states rankings and reduced content effects but does not show the size of the reduction, the baselines used, or an error analysis that would confirm the canonical forms preserve validity instead of introducing new failures. Without those details it is difficult to know how much of the gain comes from the method versus the task design. The central assumption, that explicit abstraction plus deterministic steps reliably lowers bias without information loss, needs checking against actual examples and failure cases.

This paper is mainly for teams working on shared-task systems or practical multilingual reasoning pipelines. A reader who wants new formal results or broad theoretical claims will not find them here. It deserves peer review so referees can request the quantitative breakdowns, implementation details, and any code that would let others reproduce the content-effect measurements.

Referee Report

1 major / 0 minor

Summary. The paper introduces a method for reducing content effects in LLMs on multilingual reasoning tasks by applying explicit structural abstraction to convert syllogisms into canonical logical representations, followed by deterministic parsing to assess validity. Evaluated on the SemEval-2026 Task 11 benchmark, the approach is reported to achieve top-5 rankings across all subtasks while providing a competitive alternative to fine-tuning or activation-level methods.

Significance. If the reported rankings and reduction in content effects are substantiated, the work demonstrates a lightweight, interpretable pipeline that leverages normalization and deterministic parsing to mitigate biases in formal reasoning, offering a practical baseline for shared-task systems in multilingual settings.

major comments (1)

Abstract: The claims of top-5 rankings across subtasks and substantial reduction in content effects lack any supporting quantitative metrics, baseline comparisons, error analysis, or details on how content effects were measured, rendering the central empirical assertions unverifiable from the manuscript text.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for greater transparency in the abstract. We address the concern directly below and will revise the manuscript accordingly.


Referee: Abstract: The claims of top-5 rankings across subtasks and substantial reduction in content effects lack any supporting quantitative metrics, baseline comparisons, error analysis, or details on how content effects were measured, rendering the central empirical assertions unverifiable from the manuscript text.

Authors: We agree that the abstract as currently written does not embed the supporting numbers or methodological details, making the claims difficult to verify at a glance. The full manuscript reports the rankings in Table 1, the content-effect reduction (measured as the accuracy gap between content-laden and normalized syllogisms) in Section 4.2 with explicit deltas, baseline comparisons against fine-tuning and activation-steering systems in Section 4.3, and error analysis in Section 5. To resolve the issue, we will expand the abstract to include the key quantitative results (specific subtask rankings and the measured reduction percentage) and a one-sentence description of the content-effect metric, while retaining the overall length constraints.

Revision: yes
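
The metric named in the simulated rebuttal, an accuracy gap between content-laden and normalized inputs, can be written down in a few lines. This is a sketch of that description only: the function names and the toy labels are hypothetical, and the shared task's official content-effect metric may be defined differently.

```python
def accuracy(predictions, gold):
    """Fraction of predictions that match the gold validity labels."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and equally long")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def content_effect_gap(raw_predictions, normalized_predictions, gold):
    """Accuracy on normalized inputs minus accuracy on content-laden inputs."""
    return accuracy(normalized_predictions, gold) - accuracy(raw_predictions, gold)


# Toy labels for illustration only; these are not results from the paper.
gold = [True, False, True, False]
raw = [True, True, True, True]            # model swayed by believable content
normalized = [True, False, True, False]   # same items after abstraction
gap = content_effect_gap(raw, normalized, gold)
```

A positive `gap` means the model is more accurate once content is abstracted away; a gap near zero would suggest normalization changed little for that model.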

Circularity Check

0 steps flagged

No significant circularity detected


The paper presents a system for SemEval-2026 Task 11 that applies explicit structural abstraction to canonical logical forms followed by deterministic parsing. All performance claims rest on external benchmark rankings and measured reductions in content effects, with no equations, fitted parameters, or self-citations that reduce the central result to its own inputs by construction. The derivation chain is self-contained against the shared-task evaluation and does not invoke any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The abstract does not introduce or fit any free parameters. The method implicitly relies on standard logical validity rules for syllogisms.

axioms (1)

standard math: Standard first-order logic rules determine the validity of syllogisms once they are normalized to canonical form.
Deterministic parsing presupposes classical logic axioms for entailment checking.
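
The classical-logic assumption can be checked independently of any lookup table by enumerating models. For three terms, the truth of a categorical statement depends only on which of the eight Venn regions are inhabited, so 256 candidate models cover every case. The sketch below is an illustration under that standard semantics (no existential import); `holds` and `entails` are hypothetical names and this is not taken from the paper.

```python
from itertools import product

# A Venn region is a triple of booleans: membership in (S, M, P).
REGIONS = list(product([False, True], repeat=3))
IDX = {"S": 0, "M": 1, "P": 2}


def holds(form, x, y, inhabited):
    """Truth of a categorical statement given the list of inhabited regions."""
    some_x_and_y = any(r[IDX[x]] and r[IDX[y]] for r in inhabited)
    some_x_not_y = any(r[IDX[x]] and not r[IDX[y]] for r in inhabited)
    if form == "A":   # All X are Y
        return not some_x_not_y
    if form == "E":   # No X are Y
        return not some_x_and_y
    if form == "I":   # Some X are Y
        return some_x_and_y
    if form == "O":   # Some X are not Y
        return some_x_not_y
    raise ValueError(f"unknown form: {form}")


def entails(premises, conclusion):
    """True if no model satisfies every premise while falsifying the conclusion."""
    for mask in product([False, True], repeat=len(REGIONS)):
        inhabited = [r for r, keep in zip(REGIONS, mask) if keep]
        if all(holds(*p, inhabited) for p in premises) and not holds(*conclusion, inhabited):
            return False
    return True


barbara = entails([("A", "M", "P"), ("A", "S", "M")], ("A", "S", "P"))
darapti = entails([("A", "M", "P"), ("A", "M", "S")], ("I", "S", "P"))
```

Under this semantics `barbara` holds while `darapti` does not, because an empty middle term satisfies both premises without producing any S that is P. A system that grants existential import would accept Darapti, so the choice of semantics is a design decision that a validity table must state.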

pith-pipeline@v0.9.0 · 5399 in / 1087 out tokens · 76811 ms · 2026-05-15T17:40:05.905665+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery and Peano structure from Law of Logic
Tag: unclear (relation between the paper passage and the cited Recognition theorem)
Paper passage: transforms syllogisms into canonical logical representations... deterministic parsing... mood–figure pair... valid moods Vk... mood ∈ Vfig

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking (D=3 forcing)
Tag: unclear (relation between the paper passage and the cited Recognition theorem)
Paper passage: Lookup table for valid moods by figure (AAA, EAE, ... for figures 1-4)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages

[1]

Language models show human-like content effects on reasoning tasks

Generalized quantifiers as a source of error in multilingual NLU benchmarks. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4875–4893. Association for Computational Linguistics. Ishita Dasgupta, Andrew K. Lampinen, Stephanie C. Y. Chan, Hannah R. S...

work page arXiv 2022
[2]

In Findings of the Association for Computational Linguistics: ACL 2025, pages 10074–10095, Vienna, Austria

Reasoning circuits in language models: A mechanistic interpretation of syllogistic inference. In Findings of the Association for Computational Linguistics: ACL 2025, pages 10074–10095, Vienna, Austria. Association for Computational Linguistics. Joanito Agili Lopo, Muhammad Ravi Shulthan Habibi, Tack Hwa Wong, Muhammad Ilham Ghozali, Fajri Koto, Genta I...

work page 2025
[3]

In Findings of the Association for Computational Linguistics: ACL 2024, pages 16063–16077

Exploring reasoning biases in large language models through syllogism: Insights from the neubaroco dataset. In Findings of the Association for Computational Linguistics: ACL 2024, pages 16063–16077. Terence Parsons. 2014. Articulating Medieval Logic. Oxford University Press, Oxford. Graham Priest. 2008. An Introduction to Non-Classical Logic: From If t...

work page arXiv 2024
[4]

is ” with “ are

Association for Computational Linguistics. Jundong Xu, Hao Fei, Liangming Pan, Qian Liu, Mong-Li Lee, and Wynne Hsu. 2024. Faithful logical reasoning via symbolic chain-of-thought. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13326–13365, Bangkok, Thailand. Association f...

work page 2024
[5]

Therefore

All sentences before the final "Therefore" are premises. The sentence after "Therefore" is the conclusion

work page
[6]

Internally rewrite each statement into standard categorical form: - All X are Y (A) - No X are Y (E) - Some X are Y (I) - Some X are not Y (O) Handle paraphrases such as: - Every X is Y - Not a single X is Y - At least one X is not Y - Double negations

work page
[7]

Determine whether the conclusion NECESSARILY follows from a subset of the premises under classical categorical logic

work page
[8]

- Return their indexes (0-based)

If the argument is valid: - Identify the MINIMAL set of premises required to entail the conclusion. - Return their indexes (0-based). - Indexing is based on order of appearance in the text. - Do NOT include unused premises

work page
[9]

If the argument is invalid: - validity = false - relevant_premises = []

work page
[10]

validity

Do NOT explain reasoning. Output JSON ONLY. STRICT REQUIREMENTS: - Output must be valid JSON. - No explanation. - No markdown. - No extra keys. - Only "validity" and "relevant_premises". ---------------------------------------- OUTPUT FORMAT: {{ "validity": true or false, "relevant_premises": [int, int] }} ---------------------------------------- SYLLOGIS...

work page
[11]

Extract subj/pred per sentence

work page
[12]

Count term distribution

work page
[13]

Fix distribution if needed (replace or tag)

work page
[14]

detected_language

Output OUTPUT (JSON only, no markdown): {{ "detected_language": "<lang>", "reasoning": "<extract, count, fix, output>", "english": "P1. P2. Therefore, C." }} EXAMPLES: Input: "All goyangi are dongmul. Some dongmul are not poyuryu. Therefore, some goyangi are not poyuryu." reasoning: "Extract: P1 subj=goyangi pred=dongmul. P2 subj=dongmul pred=poyuryu. C s...

work page
[23]

detected_language

Output final structured form OUTPUT (JSON only, no markdown): { "detected_language": "<lang>", "reasoning": "<selection + extraction + term_count>", "english": "P1. P2. Therefore, C." } Figure 6: EPN prompt for Subtask 4 for extract and filter relevant premise and conclusion EPN prompt with google translated sentence for Relevance Premises , Multilingual ...

work page
[24]

All X are not Y

FORMAT Each sentence must follow: [Quantifier] [native_subject] are [native_predicate] Allowed quantifiers: All / No / Some / Some...not Never output: "All X are not Y"

work page
[25]

- Do NOT replace terms

DO NOT MODIFY LOGIC - Do NOT swap subject and predicate. - Do NOT replace terms. - Do NOT merge synonyms. - Do NOT normalize or repair the argument. - Do NOT balance term distribution. - Do NOT split polysemous words. - Do NOT introduce [s] or [g] tags. Extract the argument exactly as written

work page
[26]

GOOGLE TRANSLATE CHECK (Fidelity Only) Mentally compare with Google Translate ONLY to: - Verify quantifier accuracy - Verify negation scope - Verify copula meaning

work page
[27]

- Singular/plural variants count as the same term

TERM HANDLING - Copy subject and predicate verbatim. - Singular/plural variants count as the same term. - Descriptive phrases remain descriptive phrases. - Identity statements (X are X) remain unchanged

work page
[28]

- A classical syllogism should have exactly 3 distinct terms

TERM COUNT CHECK (Diagnostic Only) After extraction: - Count distinct terms. - A classical syllogism should have exactly 3 distinct terms. - If not, DO NOT repair. - Simply report the count in reasoning

work page
[29]

- Ignore rhetorical commentary

SENTENCE SELECTION If more than 3 sentences are present: - Select the two premises and the conclusion that form the main argument. - Ignore rhetorical commentary. REASONING FORMAT:

work page
[30]

Identify selected sentences (P1, P2, C)

work page
[31]

Extract subject and predicate

work page
[32]

Count distinct terms (no fixing)

work page
[33]

detected_language

Output final structured form OUTPUT (JSON only, no markdown): { "detected_language": "<lang>", "reasoning": "<selection + extraction + term_count>", "english": "P1. P2. Therefore, C." } Figure 7: EPN prompt with google translated sentence for Subtask 4 for extract and filter relevant premise and conclusion

work page
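
Several of the extracted prompt fragments above describe mechanical steps: split the argument at the final "Therefore", rewrite each sentence into one of the four categorical forms, and count distinct terms. In the paper these steps are delegated to an LLM prompt; the rule-based sketch below only illustrates the target representation for simple English inputs. All names and regular expressions are hypothetical, and it merges neither singular/plural variants nor synonyms.

```python
import re

# Hypothetical paraphrase patterns for the four categorical forms.
# "O" is tried before "I" so that "Some X are not Y" is not read as "Some X are (not Y)".
PATTERNS = [
    ("O", re.compile(r"^(?:some|at least one) (.+?) (?:are|is) not (.+)$", re.I)),
    ("I", re.compile(r"^(?:some|at least one) (.+?) (?:are|is) (.+)$", re.I)),
    ("E", re.compile(r"^(?:no|not a single) (.+?) (?:are|is) (.+)$", re.I)),
    ("A", re.compile(r"^(?:all|every) (.+?) (?:are|is) (.+)$", re.I)),
]


def split_argument(text):
    """Sentences before the final 'Therefore' are premises; the rest is the conclusion."""
    markers = list(re.finditer(r"therefore", text, re.I))
    if not markers:
        raise ValueError("no 'Therefore' marker found")
    last = markers[-1]
    premises = [s.strip() for s in text[:last.start()].split(".") if s.strip()]
    conclusion = text[last.end():].strip(" ,.")
    return premises, conclusion


def normalize(sentence):
    """Map a sentence to (form, subject, predicate), or None if no pattern applies."""
    cleaned = sentence.strip().rstrip(".")
    for form, pattern in PATTERNS:
        match = pattern.match(cleaned)
        if match:
            return form, match.group(1).strip().lower(), match.group(2).strip().lower()
    return None


def distinct_terms(props):
    """Set of terms used across all normalized propositions."""
    return {term for _, subj, pred in props for term in (subj, pred)}


text = ("All goyangi are dongmul. Some dongmul are not poyuryu. "
        "Therefore, some goyangi are not poyuryu.")
premises, conclusion = split_argument(text)
props = [normalize(s) for s in premises + [conclusion]]
```

For the example sentence quoted in the prompt fragments, `props` holds an A, an O, and an O proposition over three distinct terms, which matches the diagnostic in the term-count fragment that a classical syllogism should have exactly three.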
