arxiv: 2607.01927 · v1 · pith:7CXA5ZRQnew · submitted 2026-07-02 · 💻 cs.CL · cs.AI

TUDUM: A Turkish-Thinking Reasoning Pipeline for Qwen3.5-27B

Pith reviewed 2026-07-03 15:07 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords Turkish reasoning · thinking models · supervised fine-tuning · GRPO reinforcement learning · Qwen model adaptation · multilingual reasoning traces · LoRA fine-tuning

The pith

SFT on Turkish reasoning examples followed by GRPO produces consistent Turkish traces in the visible thinking block of Qwen3.5-27B.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a pipeline for shifting the explicit reasoning trace of a 27B thinking model from English-centered to Turkish. It starts with supervised fine-tuning on nearly 16,000 Turkish reasoning examples using LoRA, then applies GRPO reinforcement learning on filtered Turkish math problems. A sympathetic reader would care because many models answer non-English prompts while keeping internal scratchpads in English, which limits genuine multilingual reasoning. The results show SFT shortens responses and increases Turkish consistency at the cost of benchmark accuracy, while RL recovers some performance without surpassing the base model overall.

Core claim

TUDUM treats the generated <think>...</think> block as a trainable behavior. Starting from unsloth/Qwen3.5-27B, supervised fine-tuning on 15,991 Turkish reasoning examples makes the model generate shorter and more consistently Turkish reasoning traces, at the cost of benchmark accuracy. Subsequent GRPO reinforcement learning on a proxy-filtered Turkish mathematics environment recovers some mathematical performance, most visibly on AIME24 at the best early checkpoint, but the Macro-6 average does not exceed that of the base model.

What carries the argument

The <think>...</think> block treated as a trainable behavior that can be shifted toward Turkish via SFT on reasoning examples and GRPO on Turkish math.
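As a rough illustration of what "shifting the trace toward Turkish" could mean in measurable terms, the sketch below extracts the <think>...</think> block from a completion and estimates how Turkish it looks. The abstract does not describe the paper's actual consistency metric, so the function names, the regular expression, and the tiny marker-word lists are all assumptions made for illustration only.

```python
import re

# Matches the first <think>...</think> block, across newlines.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

# Tiny illustrative marker-word lists (an assumption, not from the paper).
TR_WORDS = {"ve", "bir", "bu", "için", "ile", "olarak", "ise", "çünkü", "yani"}
EN_WORDS = {"the", "and", "is", "of", "to", "so", "we", "then", "because", "let"}


def extract_think(text):
    """Return the contents of the first closed <think> block, or ''."""
    match = THINK_RE.search(text)
    return match.group(1).strip() if match else ""


def turkish_share(trace):
    """Share of marker words that are Turkish; None if no marker word occurs."""
    tokens = re.findall(r"[^\W\d_]+", trace.lower())
    tr = sum(token in TR_WORDS for token in tokens)
    en = sum(token in EN_WORDS for token in tokens)
    return tr / (tr + en) if (tr + en) else None


if __name__ == "__main__":
    sample = "<think>Bu bir denklem ve çözüm için x bulunur.</think>Cevap: 4"
    print(turkish_share(extract_think(sample)))
```

A surface heuristic like this only describes the visible trace. It says nothing about whether the underlying computation is Turkish, which is the distinction raised under the load-bearing premise.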

If this is right

  • SFT alone reduces average response length and thinking exhaustion while increasing Turkish consistency in the trace.
  • GRPO reinforcement learning partially restores benchmark accuracy, particularly on AIME24 at the best early checkpoint.
  • The full pipeline does not produce uniform gains across all benchmarks and does not exceed the base model on the reported Macro-6 average.
  • The released step-50 model provides a publicly available checkpoint for further Turkish-thinking adaptation.
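For readers unfamiliar with GRPO, the core idea in its original formulation (DeepSeekMath) is to score each sampled completion relative to the other completions drawn for the same prompt. The sketch below shows that group-relative normalization under simple assumptions: binary correctness rewards, population standard deviation, and a small epsilon. The paper describes its method only as "GRPO-family", so its exact variant, reward design, and hyperparameters may differ from this.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled completions.

    Each advantage is (reward - group mean) / (group std + eps).
    Using the population std and this eps value are assumptions.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four completions for one math prompt; 1.0 means the answer was verified correct.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # A group where every completion gets the same reward carries no signal.
    print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

A group whose rewards are all equal yields zero advantages, which is one common motivation for filtering prompts by difficulty before RL. Whether the paper's proxy filter works this way is not stated in the abstract.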

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same SFT-plus-GRPO approach could be applied to other low-resource languages by collecting parallel reasoning data.
  • Controlling the language of the visible scratchpad may indirectly influence the model's choice of internal computation language across tasks.
  • If the visible trace language affects downstream answer quality, future work could measure whether Turkish traces improve or degrade performance on Turkish-specific cultural or linguistic benchmarks.

Load-bearing premise

That training the visible reasoning trace on Turkish examples will make the model perform its actual problem-solving computation in Turkish rather than merely producing Turkish output while retaining English-centered internal steps.

What would settle it

A controlled test where the model solves Turkish math problems with fully Turkish traces yet shows evidence of internal English translation steps, such as through hidden-state analysis or English-only intermediate calculations.

Figures

Figures reproduced from arXiv: 2607.01927 by Bahaeddin Turkoglu, Baran Bingol.

Figure 1: The TÜDÜM training pipeline. The contribution is the full adaptation

Original abstract

This paper presents TUDUM (Türkçe Düşünen Üretken Model), a project pipeline for adapting a Qwen-family 27B thinking model toward Turkish reasoning. The central problem is not only to answer Turkish prompts in Turkish, but to make the explicit reasoning trace itself Turkish. A thinking model may translate a Turkish prompt into an English-centered internal or visible scratchpad, solve the problem mostly in English, and only localize the final answer. TUDUM instead treats the generated <think>...</think> block as a trainable behavior. The pipeline starts from the project base checkpoint unsloth/Qwen3.5-27B, applies supervised fine-tuning (SFT) on 15,991 Turkish reasoning examples using LoRA adapters, and then applies GRPO-family reinforcement learning on a proxy-filtered Turkish mathematics environment. The results are mixed. SFT made the model shorter and more consistently Turkish in its reasoning behavior, with large reductions in average response length and thinking exhaustion, but reduced benchmark accuracy. RL recovered some mathematical performance, especially AIME24 at the best early checkpoint, yet did not uniformly improve all benchmarks and did not exceed the base model on the reported Macro-6 average. The contribution is therefore best framed as a technically honest Turkish-thinking reasoning pipeline and evaluation, not as a claim of state-of-the-art Turkish reasoning. The released step-50 model is publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The manuscript presents TUDUM, a pipeline adapting the unsloth/Qwen3.5-27B thinking model for Turkish reasoning. It applies SFT with LoRA on 15,991 Turkish reasoning examples to train the <think> block behavior, followed by GRPO RL on a proxy-filtered Turkish mathematics environment. Results show SFT yields shorter responses and more consistent Turkish traces but lowers benchmark accuracy; RL provides partial recovery (notably on AIME24 at early checkpoints) without exceeding the base model on the reported Macro-6 average. The step-50 checkpoint is released publicly, with the contribution framed as a technically honest pipeline rather than an SOTA claim.

Significance. If the empirical results hold, the work offers a reproducible pipeline for adapting thinking-model reasoning traces to a target language, with transparent reporting of length reductions, consistency gains, and mixed accuracy effects. It contributes to multilingual LLM research by treating the visible <think> block as trainable behavior and releasing the model for further study.

major comments (1)
  1. [Abstract and §1] Abstract and §1 (Introduction): The claim that SFT+GRPO produces 'consistently Turkish reasoning traces' (as opposed to English-internal solving followed by Turkish localization of the visible trace) is central but untested. The manuscript itself flags the latter failure mode for thinking models yet provides no diagnostic (e.g., forced-English ablation, Turkish terminology consistency probes, or cross-lingual reasoning checks) to distinguish the interpretations. Mixed benchmark outcomes are compatible with either reading and do not resolve the distinction.
minor comments (3)
  1. [Results] Results section: A consolidated table reporting all Macro-6 component scores, length statistics, and checkpoint comparisons (including the step-50 model) would improve clarity and allow direct verification of the 'did not exceed base model' statement.
  2. [Methods] Methods: Hyperparameter details for the GRPO stage (learning rate, batch size, proxy filter criteria) and the exact composition of the 15,991-example SFT dataset (source, filtering, language verification) are needed for reproducibility.
  3. [Evaluation] Evaluation: Clarify whether 'thinking exhaustion' and 'response length' metrics are computed on the full output or only the <think> block, and report variance across multiple runs.
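The third minor comment can be made concrete with a small sketch that reports length separately for the <think> block and for the full output, and summarizes a metric across repeated runs. The definitions here are assumptions: length is counted in whitespace-separated words rather than model tokens, and "thinking exhaustion" is taken to mean a <think> block that is opened but never closed. The manuscript's own definitions may differ.

```python
from statistics import mean, stdev

OPEN, CLOSE = "<think>", "</think>"


def split_output(text):
    """Split a completion into (think, answer, exhausted).

    'exhausted' is True when <think> is opened but never closed
    (an assumed definition of thinking exhaustion).
    """
    start = text.find(OPEN)
    if start == -1:
        return "", text, False
    body_start = start + len(OPEN)
    end = text.find(CLOSE, body_start)
    if end == -1:
        return text[body_start:], "", True
    return text[body_start:end], text[end + len(CLOSE):], False


def run_stats(outputs):
    """Mean think-block length, mean full length, and exhaustion rate for one run."""
    think_len, full_len, exhausted = [], [], 0
    for text in outputs:
        think, answer, cut = split_output(text)
        n_think = len(think.split())
        think_len.append(n_think)
        full_len.append(n_think + len(answer.split()))
        exhausted += cut
    return {
        "think_len": mean(think_len),
        "full_len": mean(full_len),
        "exhaustion_rate": exhausted / len(outputs),
    }


def across_runs(values):
    """Mean and sample standard deviation of one metric over repeated runs."""
    return mean(values), (stdev(values) if len(values) > 1 else 0.0)
```

Reporting both lengths side by side would show whether the shortening attributed to SFT comes from the trace, the final answer, or both.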

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address the single major comment below and agree that the central claim requires clarification.

Point-by-point responses
  1. Referee: [Abstract and §1] Abstract and §1 (Introduction): The claim that SFT+GRPO produces 'consistently Turkish reasoning traces' (as opposed to English-internal solving followed by Turkish localization of the visible trace) is central but untested. The manuscript itself flags the latter failure mode for thinking models yet provides no diagnostic (e.g., forced-English ablation, Turkish terminology consistency probes, or cross-lingual reasoning checks) to distinguish the interpretations. Mixed benchmark outcomes are compatible with either reading and do not resolve the distinction.

    Authors: We agree that the manuscript does not include diagnostics capable of distinguishing genuine Turkish-internal reasoning from English-internal solving followed by Turkish localization of the visible <think> block. Observable improvements in Turkish token consistency and trace length are compatible with both interpretations, and the mixed benchmark results do not resolve the ambiguity. We will revise the abstract and Section 1 to explicitly acknowledge this limitation, replace the phrasing 'consistently Turkish reasoning traces' with the more precise 'reliably Turkish-visible reasoning traces,' and note that distinguishing the two mechanisms remains an open question for future work. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical pipeline report with no derivations or fitted predictions

The paper describes an experimental pipeline (SFT on 15,991 Turkish examples followed by GRPO) and reports mixed benchmark outcomes without any equations, mathematical derivations, parameter fits, or predictions that reduce to inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked to justify core claims. The work is self-contained as a technical report on observable training behavior and does not rely on self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no equations, fitted constants, or new entities are described. Free parameters, axioms, and invented entities cannot be enumerated from the provided text.
