TUDUM: A Turkish-Thinking Reasoning Pipeline for Qwen3.5-27B
Pith reviewed 2026-07-03 15:07 UTC · model grok-4.3
The pith
SFT on Turkish reasoning examples followed by GRPO produces consistent Turkish traces in the visible thinking block of Qwen3.5-27B.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TUDUM treats the generated <think>...</think> block as a trainable behavior. Starting from unsloth/Qwen3.5-27B, supervised fine-tuning on 15,991 Turkish reasoning examples yields shorter and more consistently Turkish reasoning traces, at the cost of reduced benchmark accuracy. Subsequent GRPO reinforcement learning on a proxy-filtered Turkish mathematics environment recovers some mathematical performance, most visibly on AIME24 at the best early checkpoint, but the Macro-6 average does not exceed that of the base model.
What carries the argument
The <think>...</think> block treated as a trainable behavior that can be shifted toward Turkish via SFT on reasoning examples and GRPO on Turkish math.
If this is right
- SFT alone reduces average response length and thinking exhaustion while increasing Turkish consistency in the trace.
- GRPO reinforcement learning partially restores benchmark accuracy, particularly on AIME24 at the best early checkpoint.
- The full pipeline does not produce uniform gains across all benchmarks and does not exceed the base model on the reported Macro-6 average.
- The released step-50 model provides a publicly available checkpoint for further Turkish-thinking adaptation.
Where Pith is reading between the lines
- The same SFT-plus-GRPO approach could be applied to other low-resource languages by collecting parallel reasoning data.
- Controlling the language of the visible scratchpad may indirectly influence the model's choice of internal computation language across tasks.
- If the visible trace language affects downstream answer quality, future work could measure whether Turkish traces improve or degrade performance on Turkish-specific cultural or linguistic benchmarks.
Load-bearing premise
That training the visible reasoning trace on Turkish examples will make the model perform its actual problem-solving computation in Turkish rather than merely producing Turkish output while retaining English-centered internal steps.
What would settle it
A controlled test where the model solves Turkish math problems with fully Turkish traces yet shows evidence of internal English translation steps, such as through hidden-state analysis or English-only intermediate calculations.
Original abstract
This paper presents TUDUM (Türkçe Düşünen Üretken Model), a project pipeline for adapting a Qwen-family 27B thinking model toward Turkish reasoning. The central problem is not only to answer Turkish prompts in Turkish, but to make the explicit reasoning trace itself Turkish. A thinking model may translate a Turkish prompt into an English-centered internal or visible scratchpad, solve the problem mostly in English, and only localize the final answer. TUDUM instead treats the generated <think>...</think> block as a trainable behavior. The pipeline starts from the project base checkpoint unsloth/Qwen3.5-27B, applies supervised fine-tuning (SFT) on 15,991 Turkish reasoning examples using LoRA adapters, and then applies GRPO-family reinforcement learning on a proxy-filtered Turkish mathematics environment. The results are mixed. SFT made the model shorter and more consistently Turkish in its reasoning behavior, with large reductions in average response length and thinking exhaustion, but reduced benchmark accuracy. RL recovered some mathematical performance, especially AIME24 at the best early checkpoint, yet did not uniformly improve all benchmarks and did not exceed the base model on the reported Macro-6 average. The contribution is therefore best framed as a technically honest Turkish-thinking reasoning pipeline and evaluation, not as a claim of state-of-the-art Turkish reasoning. The released step-50 model is publicly available.
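The abstract names "GRPO-family reinforcement learning" without stating the exact variant. As a rough illustration of the idea shared by that family, the sketch below computes group-relative advantages: each sampled answer to a prompt is scored against the mean and spread of its own sampling group, so no separate value model is needed. The function name, the epsilon, the use of the population standard deviation, and the binary rewards are all assumptions made for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples for the same prompt.

    Assumption: population standard deviation plus a small epsilon; real
    GRPO-family implementations differ in these details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical correctness rewards for four sampled answers to one prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every sample in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal, which is one plausible reason a pipeline would filter its training environment by difficulty.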
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents TUDUM, a pipeline adapting the unsloth/Qwen3.5-27B thinking model for Turkish reasoning. It applies SFT with LoRA on 15,991 Turkish reasoning examples to train the <think> block behavior, followed by GRPO RL on a proxy-filtered Turkish mathematics environment. Results show SFT yields shorter responses and more consistent Turkish traces but lowers benchmark accuracy; RL provides partial recovery (notably on AIME24 at early checkpoints) without exceeding the base model on the reported Macro-6 average. The step-50 checkpoint is released publicly, with the contribution framed as a technically honest pipeline rather than an SOTA claim.
Significance. If the empirical results hold, the work offers a reproducible pipeline for adapting thinking-model reasoning traces to a target language, with transparent reporting of length reductions, consistency gains, and mixed accuracy effects. It contributes to multilingual LLM research by treating the visible <think> block as trainable behavior and releasing the model for further study.
major comments (1)
- [Abstract and §1] Abstract and §1 (Introduction): The claim that SFT+GRPO produces 'consistently Turkish reasoning traces' (as opposed to English-internal solving followed by Turkish localization of the visible trace) is central but untested. The manuscript itself flags the latter failure mode for thinking models yet provides no diagnostic (e.g., forced-English ablation, Turkish terminology consistency probes, or cross-lingual reasoning checks) to distinguish the interpretations. Mixed benchmark outcomes are compatible with either reading and do not resolve the distinction.
minor comments (3)
- [Results] Results section: A consolidated table reporting all Macro-6 component scores, length statistics, and checkpoint comparisons (including the step-50 model) would improve clarity and allow direct verification of the 'did not exceed base model' statement.
- [Methods] Methods: Hyperparameter details for the GRPO stage (learning rate, batch size, proxy filter criteria) and the exact composition of the 15,991-example SFT dataset (source, filtering, language verification) are needed for reproducibility.
- [Evaluation] Evaluation: Clarify whether 'thinking exhaustion' and 'response length' metrics are computed on the full output or only the <think> block, and report variance across multiple runs.
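The ambiguity raised in the last comment can be made concrete with a small sketch that reports length separately for the <think> block and for the final answer. It assumes whitespace-separated words as a stand-in for tokenizer counts, and it assumes that "thinking exhaustion" means a trace that never emits its closing </think> tag; the paper's actual definitions are not given in this review. All function and field names are hypothetical.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_output(text):
    """Return (thinking, answer, unterminated) for one model output."""
    match = THINK_RE.search(text)
    if match:
        return match.group(1).strip(), text[match.end():].strip(), False
    start = text.find("<think>")
    if start != -1:
        # Opening tag without a closing tag: the trace was cut off.
        return text[start + len("<think>"):].strip(), "", True
    return "", text.strip(), False

def length_report(text):
    """Word counts for the trace, the answer, and the full output."""
    thinking, answer, unterminated = split_output(text)
    think_words = len(thinking.split())
    answer_words = len(answer.split())
    return {
        "think_words": think_words,
        "answer_words": answer_words,
        "total_words": think_words + answer_words,
        "unterminated": unterminated,
    }

report = length_report("<think>iki artı iki dört eder</think>Cevap: 4")
```

Reporting both `think_words` and `total_words` side by side would let a reader see whether a length reduction comes from the trace, the answer, or both.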
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address the single major comment below and agree that the central claim requires clarification.
- Referee: [Abstract and §1] Abstract and §1 (Introduction): The claim that SFT+GRPO produces 'consistently Turkish reasoning traces' (as opposed to English-internal solving followed by Turkish localization of the visible trace) is central but untested. The manuscript itself flags the latter failure mode for thinking models yet provides no diagnostic (e.g., forced-English ablation, Turkish terminology consistency probes, or cross-lingual reasoning checks) to distinguish the interpretations. Mixed benchmark outcomes are compatible with either reading and do not resolve the distinction.
Authors: We agree that the manuscript does not include diagnostics capable of distinguishing genuine Turkish-internal reasoning from English-internal solving followed by Turkish localization of the visible <think> block. Observable improvements in Turkish token consistency and trace length are compatible with both interpretations, and the mixed benchmark results do not resolve the ambiguity. We will revise the abstract and Section 1 to explicitly acknowledge this limitation, replace the phrasing 'consistently Turkish reasoning traces' with the more precise 'reliably Turkish-visible reasoning traces,' and note that distinguishing the two mechanisms remains an open question for future work.
Revision: yes
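To show what a "Turkish-visible" measurement can and cannot establish, here is a minimal surface-level sketch. It counts the share of words in a trace that contain Turkish-specific letters and the share that match a tiny English function-word list. Both lists and the function name are hypothetical, and this is not the paper's consistency metric. As the rebuttal concedes, a statistic of this kind describes only the visible text and says nothing about internal computation.

```python
import re
import unicodedata

# Letters used in Turkish orthography but not in English.
TURKISH_LETTERS = set("çğıöşüÇĞİÖŞÜ")
# Tiny, hypothetical list of English function words; extend as needed.
ENGLISH_HINTS = {"the", "and", "is", "of", "to", "we", "so", "then", "let", "that"}

def visible_language_stats(trace):
    """Surface-level language statistics for a visible reasoning trace."""
    text = unicodedata.normalize("NFC", trace)
    words = re.findall(r"[^\W\d_]+", text)  # runs of letters only
    if not words:
        return {"words": 0, "turkish_marked": 0.0, "english_hint": 0.0}
    turkish = sum(1 for w in words if TURKISH_LETTERS & set(w))
    english = sum(1 for w in words if w.lower() in ENGLISH_HINTS)
    return {
        "words": len(words),
        "turkish_marked": turkish / len(words),
        "english_hint": english / len(words),
    }

stats = visible_language_stats("Önce iki sayıyı toplayalım, then we check the result")
```

Many Turkish words contain no distinctive letters (for example "iki"), so `turkish_marked` is a lower bound on Turkish content; a proper language-identification model would be a better choice for real evaluation.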
Circularity Check
No circularity: empirical pipeline report with no derivations or fitted predictions
The paper describes an experimental pipeline (SFT on 15,991 Turkish examples followed by GRPO) and reports mixed benchmark outcomes without any equations, mathematical derivations, parameter fits, or predictions that reduce to inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked to justify core claims. The work is self-contained as a technical report on observable training behavior and does not rely on self-referential steps.