TUDUM: A Turkish-Thinking Reasoning Pipeline for Qwen3.5-27B

Bahaeddin Turkoglu; Baran Bingol

arXiv: 2607.01927 · v1 · pith:7CXA5ZRQnew · submitted 2026-07-02 · cs.CL · cs.AI

Pith reviewed 2026-07-03 15:07 UTC · model grok-4.3

keywords Turkish reasoning · thinking models · supervised fine-tuning · GRPO reinforcement learning · Qwen model adaptation · multilingual reasoning traces · LoRA fine-tuning

The pith

SFT on Turkish reasoning examples followed by GRPO produces consistent Turkish traces in the visible thinking block of Qwen3.5-27B.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a pipeline for shifting the explicit reasoning trace of a 27B thinking model from English-centered to Turkish. It starts with supervised fine-tuning on nearly 16,000 Turkish reasoning examples using LoRA, then applies GRPO reinforcement learning on filtered Turkish math problems. A sympathetic reader would care because many models answer non-English prompts while keeping internal scratchpads in English, which limits genuine multilingual reasoning. The results show SFT shortens responses and increases Turkish consistency at the cost of benchmark accuracy, while RL recovers some performance without surpassing the base model overall.
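To make the LoRA part of the pipeline concrete, the sketch below counts how many trainable parameters a low-rank adapter adds to a single weight matrix compared with updating the full matrix. The layer size and rank are hypothetical values chosen for illustration; the abstract does not state the adapter rank or which layers were adapted.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA freezes W and learns a low-rank update B @ A, where
    B has shape (d_out, rank) and A has shape (rank, d_in).
    """
    return rank * (d_out + d_in)


def full_params(d_out: int, d_in: int) -> int:
    """Parameters updated when the whole matrix is fine-tuned."""
    return d_out * d_in


# Hypothetical layer size and rank, NOT taken from the paper.
d_out, d_in, rank = 4096, 4096, 16
adapter = lora_trainable_params(d_out, d_in, rank)
dense = full_params(d_out, d_in)
ratio = adapter / dense
print(f"adapter={adapter} dense={dense} ratio={ratio:.4%}")
```

Under these assumed sizes the adapter trains well under one percent of the matrix, which is why LoRA is a common choice for adapting a 27B model on a modest dataset.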

Core claim

TUDUM treats the generated <think>...</think> block as a trainable behavior. Starting from unsloth/Qwen3.5-27B, supervised fine-tuning on 15,991 Turkish reasoning examples makes the model generate shorter and more consistently Turkish reasoning traces. Subsequent GRPO reinforcement learning on a proxy-filtered Turkish mathematics environment recovers some mathematical performance, most visibly on AIME24 at the best early checkpoint, but the Macro-6 average does not exceed that of the base model.
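GRPO, as introduced in DeepSeekMath, scores each sampled answer relative to the other answers drawn for the same prompt instead of relying on a learned value baseline. The sketch below shows only that group-relative normalization step. The binary correctness rewards, the function name, and the use of the population standard deviation are assumptions for illustration; the paper describes its method as "GRPO-family", so its exact variant may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    eps guards against division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four sampled answers to one Turkish math prompt,
# rewarded 1.0 if the final answer is correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)

# A group where every sample gets the same reward carries no learning signal.
print(group_relative_advantages([1.0, 1.0, 1.0]))
```

Correct samples receive positive advantages and incorrect ones negative, so a prompt only contributes gradient when its samples disagree. This is one reason an RL environment might be filtered to problems of intermediate difficulty, although the abstract does not say what its proxy filter selects for.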

What carries the argument

The <think>...</think> block treated as a trainable behavior that can be shifted toward Turkish via SFT on reasoning examples and GRPO on Turkish math.

If this is right

SFT alone reduces average response length and thinking exhaustion while increasing Turkish consistency in the trace.
GRPO reinforcement learning partially restores benchmark accuracy, particularly on AIME24 at the best early checkpoint.
The full pipeline does not produce uniform gains across all benchmarks and does not exceed the base model on the reported Macro-6 average.
The released step-50 model provides a publicly available checkpoint for further Turkish-thinking adaptation.
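The claims above depend on some measure of how Turkish the trace is. The abstract does not say how consistency was measured, so the following is only a crude proxy under stated assumptions: it extracts the <think> block and compares votes from a tiny hand-picked Turkish word list (plus letters such as ğ, ı, ş) against a tiny English word list. The word lists, function names, and scoring rule are all hypothetical, and a real evaluation would more likely use a proper language identifier.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

# Illustrative placeholders, not a vetted lexicon.
TURKISH_LETTERS = set("çğıöşü")
TR_WORDS = {"ve", "bir", "bu", "için", "ile", "olarak", "değil", "ise", "çünkü", "yani"}
EN_WORDS = {"the", "and", "is", "of", "to", "we", "so", "then", "because", "let"}


def extract_think(text: str) -> str:
    """Return the content of the first <think>...</think> block, or ''."""
    match = THINK_RE.search(text)
    return match.group(1) if match else ""


def turkish_share(trace: str):
    """Fraction of language-marked words that look Turkish, or None if none."""
    # Map dotted capital I to plain i first; str.lower() would otherwise
    # emit a combining dot that splits the word.
    words = re.findall(r"[^\W\d_]+", trace.replace("İ", "i").lower())
    tr = sum(1 for w in words if w in TR_WORDS or TURKISH_LETTERS & set(w))
    en = sum(1 for w in words if w in EN_WORDS)
    total = tr + en
    return tr / total if total else None


turkish = "<think>Bu bir denklem, çünkü x ve y için toplam 10 olmalı.</think>Cevap: 10"
mixed = "<think>Let x be the sayı. Then toplam is 10.</think>Cevap: 10"
print(turkish_share(extract_think(turkish)))
print(turkish_share(extract_think(mixed)))
```

A surface score like this only describes the visible trace. It cannot tell whether the underlying computation changed language, which is the gap the referee report raises.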

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same SFT-plus-GRPO approach could be applied to other low-resource languages by collecting parallel reasoning data.
Controlling the language of the visible scratchpad may indirectly influence the model's choice of internal computation language across tasks.
If the visible trace language affects downstream answer quality, future work could measure whether Turkish traces improve or degrade performance on Turkish-specific cultural or linguistic benchmarks.

Load-bearing premise

That training the visible reasoning trace on Turkish examples will make the model perform its actual problem-solving computation in Turkish rather than merely producing Turkish output while retaining English-centered internal steps.

What would settle it

A controlled test where the model solves Turkish math problems with fully Turkish traces yet shows evidence of internal English translation steps, such as through hidden-state analysis or English-only intermediate calculations.

Figures

Figures reproduced from arXiv: 2607.01927 by Bahaeddin Turkoglu, Baran Bingol.

Original abstract

This paper presents TUDUM (Türkçe Düşünen Üretken Model), a project pipeline for adapting a Qwen-family 27B thinking model toward Turkish reasoning. The central problem is not only to answer Turkish prompts in Turkish, but to make the explicit reasoning trace itself Turkish. A thinking model may translate a Turkish prompt into an English-centered internal or visible scratchpad, solve the problem mostly in English, and only localize the final answer. TUDUM instead treats the generated <think>...</think> block as a trainable behavior. The pipeline starts from the project base checkpoint unsloth/Qwen3.5-27B, applies supervised fine-tuning (SFT) on 15,991 Turkish reasoning examples using LoRA adapters, and then applies GRPO-family reinforcement learning on a proxy-filtered Turkish mathematics environment. The results are mixed. SFT made the model shorter and more consistently Turkish in its reasoning behavior, with large reductions in average response length and thinking exhaustion, but reduced benchmark accuracy. RL recovered some mathematical performance, especially AIME24 at the best early checkpoint, yet did not uniformly improve all benchmarks and did not exceed the base model on the reported Macro-6 average. The contribution is therefore best framed as a technically honest Turkish-thinking reasoning pipeline and evaluation, not as a claim of state-of-the-art Turkish reasoning. The released step-50 model is publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TUDUM is a straightforward empirical report on SFT+GRPO for Turkish reasoning traces in Qwen3.5-27B, honest about mixed results but without tests for whether internal computation actually switched languages.

The paper takes the visible <think> block in a thinking model and trains it directly on Turkish examples. The authors start from unsloth/Qwen3.5-27B, run LoRA SFT on 15,991 Turkish reasoning traces, then apply GRPO on a Turkish math setup. SFT shortens responses and makes the traces consistently Turkish, but lowers accuracy. GRPO recovers some ground on AIME24 at early checkpoints without lifting the overall Macro-6 average above the base model.

What stands out is the lack of overclaim. The abstract frames the work as a pipeline report rather than a new method, and they release the step-50 checkpoint. That transparency is useful when most adaptation papers hide the trade-offs.

The main gap is the missing check on internal language. The abstract itself notes that models can solve in English and only localize the output, yet the experiments offer no ablation or probe to separate that from genuine Turkish reasoning. The accuracy numbers are compatible with either story, so the central behavioral claim stays untested.

This is for people building language-specific reasoning systems who need a concrete recipe and realistic numbers rather than a new algorithm. The methods are standard, the data scale is modest, and the results are incremental, but the honest reporting makes the pipeline worth examining in detail.

I would send it to peer review. The implementation choices and evaluation setup deserve referee scrutiny even if the core claim needs stronger diagnostics.

Referee Report

1 major / 3 minor

Summary. The manuscript presents TUDUM, a pipeline adapting the unsloth/Qwen3.5-27B thinking model for Turkish reasoning. It applies SFT with LoRA on 15,991 Turkish reasoning examples to train the <think> block behavior, followed by GRPO RL on a proxy-filtered Turkish mathematics environment. Results show SFT yields shorter responses and more consistent Turkish traces but lowers benchmark accuracy; RL provides partial recovery (notably on AIME24 at early checkpoints) without exceeding the base model on the reported Macro-6 average. The step-50 checkpoint is released publicly, with the contribution framed as a technically honest pipeline rather than an SOTA claim.

Significance. If the empirical results hold, the work offers a reproducible pipeline for adapting thinking-model reasoning traces to a target language, with transparent reporting of length reductions, consistency gains, and mixed accuracy effects. It contributes to multilingual LLM research by treating the visible <think> block as trainable behavior and releasing the model for further study.

major comments (1)

[Abstract and §1 (Introduction)] The claim that SFT+GRPO produces 'consistently Turkish reasoning traces' (as opposed to English-internal solving followed by Turkish localization of the visible trace) is central but untested. The manuscript itself flags the latter failure mode for thinking models yet provides no diagnostic (e.g., forced-English ablation, Turkish terminology consistency probes, or cross-lingual reasoning checks) to distinguish the interpretations. Mixed benchmark outcomes are compatible with either reading and do not resolve the distinction.

minor comments (3)

[Results] A consolidated table reporting all Macro-6 component scores, length statistics, and checkpoint comparisons (including the step-50 model) would improve clarity and allow direct verification of the 'did not exceed base model' statement.
[Methods] Hyperparameter details for the GRPO stage (learning rate, batch size, proxy filter criteria) and the exact composition of the 15,991-example SFT dataset (source, filtering, language verification) are needed for reproducibility.
[Evaluation] Clarify whether the 'thinking exhaustion' and 'response length' metrics are computed on the full output or only the <think> block, and report variance across multiple runs.
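The evaluation comment can be illustrated with a small sketch that reports length on both the full output and the <think> block, and summarizes each metric as mean and standard deviation across runs. Everything here is an assumption for illustration: whitespace-separated words stand in for tokenizer counts, "thinking exhaustion" is taken to mean a response that opens <think> but never closes it, and the sample responses are invented placeholders rather than model outputs.

```python
import re
from statistics import mean, stdev

# Capture the think content up to the closing tag, or to end of text
# when the tag is missing (assumed here to signal an exhausted budget).
THINK_RE = re.compile(r"<think>(.*?)(</think>|\Z)", re.DOTALL)


def measure(response: str) -> dict:
    """Length of the full output and of the think block, plus exhaustion flag."""
    match = THINK_RE.search(response)
    think = match.group(1) if match else ""
    closed = bool(match) and match.group(2) == "</think>"
    return {
        "full_len": len(response.split()),
        "think_len": len(think.split()),
        "exhausted": bool(match) and not closed,
    }


def summarize(runs: list) -> dict:
    """Per-metric (mean, sample std) over runs; needs at least two runs."""
    per_run = {"full_len": [], "think_len": [], "exhausted": []}
    for responses in runs:
        rows = [measure(r) for r in responses]
        for key in per_run:
            per_run[key].append(mean([float(row[key]) for row in rows]))
    return {key: (mean(vals), stdev(vals)) for key, vals in per_run.items()}


# Placeholder responses for two hypothetical evaluation runs.
run_a = ["<think>bir iki üç</think> Cevap 6", "<think>bir iki"]
run_b = ["<think>bir</think> 1", "<think>iki</think> 2"]
summary = summarize([run_a, run_b])
for metric, (avg, spread) in summary.items():
    print(f"{metric}: mean={avg:.3f} std={spread:.3f}")
```

Reporting both lengths side by side makes it visible whether a shorter response comes from a shorter trace or from a shorter final answer.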

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address the single major comment below and agree that the central claim requires clarification.

Referee: [Abstract and §1 (Introduction)] The claim that SFT+GRPO produces 'consistently Turkish reasoning traces' (as opposed to English-internal solving followed by Turkish localization of the visible trace) is central but untested. The manuscript itself flags the latter failure mode for thinking models yet provides no diagnostic (e.g., forced-English ablation, Turkish terminology consistency probes, or cross-lingual reasoning checks) to distinguish the interpretations. Mixed benchmark outcomes are compatible with either reading and do not resolve the distinction.

Authors: We agree that the manuscript does not include diagnostics capable of distinguishing genuine Turkish-internal reasoning from English-internal solving followed by Turkish localization of the visible <think> block. Observable improvements in Turkish token consistency and trace length are compatible with both interpretations, and the mixed benchmark results do not resolve the ambiguity. We will revise the abstract and Section 1 to explicitly acknowledge this limitation, replace the phrasing 'consistently Turkish reasoning traces' with the more precise 'reliably Turkish-visible reasoning traces,' and note that distinguishing the two mechanisms remains an open question for future work.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical pipeline report with no derivations or fitted predictions

The paper describes an experimental pipeline (SFT on 15,991 Turkish examples followed by GRPO) and reports mixed benchmark outcomes without any equations, mathematical derivations, parameter fits, or predictions that reduce to inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked to justify core claims. The work is self-contained as a technical report on observable training behavior and does not rely on self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no equations, fitted constants, or new entities are described. Free parameters, axioms, and invented entities cannot be enumerated from the provided text.
