Jupiter-N Technical Report

George Drayson

arxiv: 2604.17429 · v1 · submitted 2026-04-19 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-10 05:41 UTC · model grok-4.3

keywords post-training, large language models, Welsh language, catastrophic forgetting, agentic capability, cultural alignment, hybrid reasoning, sovereign post-training

The pith

Post-training an open 120B model with curated synthetic data and a forgetting-prevention mix produces large targeted gains in Welsh, terminal use, and instruction following while retaining general capabilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Jupiter-N as a hybrid reasoning model derived from Nemotron 3 Super through post-training focused on agentic skills, UK cultural alignment, and Welsh language support. It relies on the Forget-Me-Not framework to combine uncertainty-curated trajectories, culturally grounded synthetic data, and parallel Welsh corpora while mixing reasoning and non-reasoning traces to avoid capability loss. Reported results include substantial improvements over the base model on Welsh benchmarks, terminal benchmarks, and instruction-following tasks. The authors describe the full pipeline, including public release of weights and datasets, as a reusable template that can be adapted for other languages and cultures by swapping in local data sources.

Core claim

What carries the argument

The Forget-Me-Not framework, a data curation strategy that mixes on-policy synthetic replay with off-policy task data and balances reasoning with non-reasoning traces to prevent catastrophic forgetting during targeted post-training.
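The mixing idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the pool names, the `build_mixture` helper, and the equal weights are all hypothetical, since the mixture ratios used for Jupiter-N are not given here.

```python
import random
from typing import Dict, List


def build_mixture(
    pools: Dict[str, List[dict]],
    weights: Dict[str, float],
    total: int,
    seed: int = 0,
) -> List[dict]:
    """Sample a training mix from named pools according to target weights."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    rng = random.Random(seed)
    mix: List[dict] = []
    for name, weight in weights.items():
        count = round(total * weight)
        # Sample with replacement so small pools can still fill their quota.
        mix.extend(rng.choices(pools[name], k=count))
    rng.shuffle(mix)
    return mix


# Hypothetical pools: "replay" examples are generated by the base model itself
# (on-policy); "task" examples come from external sources (off-policy).
pools = {
    "replay_reasoning": [{"source": "replay", "reasoning": True, "id": i} for i in range(50)],
    "replay_direct": [{"source": "replay", "reasoning": False, "id": i} for i in range(50)],
    "task_reasoning": [{"source": "task", "reasoning": True, "id": i} for i in range(50)],
    "task_direct": [{"source": "task", "reasoning": False, "id": i} for i in range(50)],
}
# Illustrative weights only (assumption); the paper's actual ratios are not stated here.
weights = {
    "replay_reasoning": 0.25,
    "replay_direct": 0.25,
    "task_reasoning": 0.25,
    "task_direct": 0.25,
}
mixture = build_mixture(pools, weights, total=200)
```

Keeping the four pools separate makes both balances explicit: replay versus task data, and reasoning versus non-reasoning traces.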

If this is right

Substituting cultural knowledge, institutional corpora, and target languages in the same pipeline yields an equivalent pipeline for any country.
Hybrid reasoning ability remains intact when the training mix includes both reasoning and non-reasoning traces.
Public release of all weights and post-training datasets under open licences enables direct reproduction and further adaptation.
Agentic performance rises through the addition of uncertainty-curated trajectories in the synthetic data.
Cultural alignment improves when synthetic data is grounded in the target norms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could support more decentralized creation of region-specific AI systems by allowing nations to adapt open base models independently.
Similar curation methods might extend to other specialized domains such as legal or scientific reasoning where capability preservation matters.
Ablation studies on the individual data components would clarify which parts of the mixture drive the observed gains versus the preservation effect.
Community follow-up work could test the pipeline on different base models to assess how widely the forgetting-prevention technique applies.

Load-bearing premise

The Forget-Me-Not data curation and synthetic mixture successfully preserves the base model's hybrid reasoning and general capabilities without any loss.

What would settle it

If Jupiter-N scores lower than the original Nemotron on general benchmarks such as standard MMLU or non-Welsh ARC-Easy, that would show loss of base capabilities.
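A retention check of this kind could look like the sketch below. The `find_regressions` helper, the benchmark keys, the one-point tolerance, and every score are hypothetical placeholders chosen for illustration; none of them are results from the paper.

```python
from typing import Dict, List


def find_regressions(
    base: Dict[str, float],
    tuned: Dict[str, float],
    tolerance: float = 1.0,
) -> List[str]:
    """Return benchmarks where the tuned model trails the base by more than `tolerance` points."""
    shared = sorted(set(base) & set(tuned))
    return [name for name in shared if base[name] - tuned[name] > tolerance]


# Placeholder scores for illustration only; these are NOT results from the paper.
base_scores = {"mmlu": 80.0, "arc_easy_en": 95.0, "gsm8k": 90.0}
tuned_scores = {"mmlu": 79.5, "arc_easy_en": 92.0, "gsm8k": 90.5}

regressions = find_regressions(base_scores, tuned_scores)
print(regressions)  # → ['arc_easy_en']
```

The tolerance is a design choice: a small allowance avoids flagging run-to-run noise as forgetting, but it should be set from measured variance rather than picked by hand.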

Figures

Figures reproduced from arXiv: 2604.17429 by George Drayson.

**Figure 3.** Prompt template for English→Welsh translation of the synthetic Welsh chat dataset. Code, URLs, and mathematical notation are preserved verbatim. Sampling settings: temperature 0.7, top-p 0.8, top-k 20, presence penalty 1.5. Both user and assistant fields are independently translated with the prompt shown in …
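The per-field translation described in the Figure 3 caption might be wired up roughly as follows. Only the four sampling values come from the caption; the `translate_turn` helper, the `user`/`assistant` record keys, and the `fake_translate` stand-in are assumptions made for this sketch, and no real model API is called.

```python
from typing import Callable, Dict

# Sampling settings as listed in the Figure 3 caption.
SAMPLING = {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "presence_penalty": 1.5}


def translate_turn(
    turn: Dict[str, str],
    translate: Callable[[str, Dict[str, float]], str],
) -> Dict[str, str]:
    """Translate the user and assistant fields in two separate calls."""
    return {
        "user": translate(turn["user"], SAMPLING),
        "assistant": translate(turn["assistant"], SAMPLING),
    }


def fake_translate(text: str, sampling: Dict[str, float]) -> str:
    # Hypothetical stand-in for a real LLM call; it only tags the text
    # so the data flow is visible.
    return f"[cy] {text}"


turn = {"user": "What is the capital of Wales?", "assistant": "Cardiff."}
translated = translate_turn(turn, fake_translate)
```

Passing the translator in as a callable keeps the sketch independent of any particular inference library.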

**Figure 4.** Training loss over the single-epoch fine-tuning

Original abstract

We present Jupiter-N, a hybrid reasoning model post-trained from Nemotron 3 Super, a fully open-source 120 billion parameter LLM. We target three objectives: (1) agentic capability via uncertainty-curated trajectories; (2) UK cultural alignment via synthetic data grounded in cultural norms; and (3) Welsh language support via parallel corpora and LLM-translated Welsh conversations. Our data curation strategy carefully preserves the base model's capabilities: using our Forget-Me-Not framework, we mix on-policy synthetic replay with off-policy task data to mitigate catastrophic forgetting, and include a mixture of reasoning and non-reasoning traces to maintain Nemotron's hybrid reasoning ability. Jupiter-N achieves standout gains over Nemotron in Welsh (+18 on ARC-Easy, +5.25 on MMLU-Lite), terminal-use (+9.1 on Terminal Bench 2) and instruction following (+4.4 on IFBench), while retaining the base model capabilities. We frame this work as a reproducible template for sovereign post-training: substituting cultural knowledge, institutional corpora, and target languages produces an equivalent pipeline for any country. All model weights and all post-training datasets are publicly released under open licences.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Jupiter-N, a 120B-parameter hybrid reasoning model obtained by post-training the open-source Nemotron 3 Super base model. It targets three goals: agentic capability through uncertainty-curated trajectories, UK cultural alignment via synthetic data grounded in cultural norms, and Welsh language support via parallel corpora and translated conversations. The central technical contribution is the Forget-Me-Not data curation strategy, which mixes on-policy synthetic replay with off-policy task data and includes both reasoning and non-reasoning traces to mitigate catastrophic forgetting and preserve the base model's hybrid reasoning ability. The paper reports concrete benchmark gains over Nemotron (Welsh ARC-Easy +18, MMLU-Lite +5.25, Terminal Bench 2 +9.1, IFBench +4.4) while claiming retention of general capabilities, and positions the work as a reproducible template for sovereign post-training, with all weights and datasets released publicly under open licenses.

Significance. If the reported gains and capability retention are substantiated by rigorous evaluation, the work would offer a practical, fully open example of targeted post-training for language and cultural adaptation at 120B scale. The public release of both model weights and the complete post-training datasets is a notable strength that directly supports reproducibility and extension to other languages or regions. The Forget-Me-Not mixture approach, if shown to be effective, could serve as a concrete template for balancing specialization and retention in large-model post-training.

major comments (2)

[Abstract] The central claim that the Forget-Me-Not framework 'successfully preserves the base model's hybrid reasoning ability' and 'retains the base model capabilities' is load-bearing for the paper's contribution, yet the text supplies no ablations (with vs. without Forget-Me-Not), no before/after scores on the base Nemotron evaluation suite, and no quantitative retention metrics on non-targeted general capabilities.
[Abstract] The specific numerical gains (+18 on Welsh ARC-Easy, +5.25 on MMLU-Lite, +9.1 on Terminal Bench 2, +4.4 on IFBench) are presented without any description of evaluation protocol, number of runs, statistical tests, or comparison to alternative post-training mixtures, which is required to attribute the improvements to the proposed data strategy rather than other factors.

minor comments (1)

[Abstract] The phrase 'sovereign post-training' is used without a precise definition or citation; a short clarifying sentence in the introduction would improve accessibility.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback. We agree that strengthening the evidence for capability retention and clarifying evaluation details will improve the manuscript. We address each major comment below and indicate the revisions we will make.

Referee: [Abstract] The central claim that the Forget-Me-Not framework 'successfully preserves the base model's hybrid reasoning ability' and 'retains the base model capabilities' is load-bearing for the paper's contribution, yet the text supplies no ablations (with vs. without Forget-Me-Not), no before/after scores on the base Nemotron evaluation suite, and no quantitative retention metrics on non-targeted general capabilities.

Authors: We acknowledge that the abstract's claims on retention would be more robust with explicit supporting evidence. The full manuscript reports post-training performance on multiple general benchmarks without observed degradation relative to the base model, but we agree that direct before/after comparisons on the Nemotron evaluation suite and quantitative retention metrics are needed. We will add a dedicated table in the revised Experiments section showing base Nemotron 3 Super versus Jupiter-N scores on key non-targeted benchmarks (including MMLU, GSM8K, and HumanEval subsets) to provide these metrics. We did not conduct a complete ablation removing the Forget-Me-Not mixture, as the on-policy replay component was central to our training pipeline; however, we will expand the existing comparisons to alternative data mixtures to better isolate the contribution of the retention strategy. revision: partial

Referee: [Abstract] The specific numerical gains (+18 on Welsh ARC-Easy, +5.25 on MMLU-Lite, +9.1 on Terminal Bench 2, +4.4 on IFBench) are presented without any description of evaluation protocol, number of runs, statistical tests, or comparison to alternative post-training mixtures, which is required to attribute the improvements to the proposed data strategy rather than other factors.

Authors: We agree that the abstract alone does not convey the evaluation details. The full manuscript describes the evaluation protocol in the Experiments section, including use of standard benchmark implementations, but we will revise the abstract to include a brief reference to this section and add a footnote summarizing the protocol. We will also report the number of runs (five independent evaluations for stochastic tasks, with means and standard deviations) and note where statistical tests were applied. Expanded comparisons to alternative post-training data mixtures are already present in Section 4; we will ensure these are clearly linked to the reported gains to support attribution to the Forget-Me-Not strategy. revision: yes
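The reporting described in this response (a mean and standard deviation over five independent evaluations) can be computed as in the sketch below. The `summarise_runs` helper is hypothetical and the five scores are placeholders, not figures from the paper; the sample standard deviation is one reasonable choice that the response does not specify.

```python
from statistics import mean, stdev
from typing import List, Tuple


def summarise_runs(scores: List[float]) -> Tuple[float, float]:
    """Return the mean and sample standard deviation of per-run scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return mean(scores), stdev(scores)


# Placeholder scores for five runs; these are NOT results from the paper.
runs = [60.0, 62.0, 61.0, 59.0, 63.0]
avg, sd = summarise_runs(runs)
print(f"{avg:.2f} ± {sd:.2f}")  # → 61.00 ± 1.58
```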

standing simulated objections not resolved

A full ablation study with versus without the Forget-Me-Not framework was not performed, due to the prohibitive computational cost of additional 120B-scale training runs.

Circularity Check

0 steps flagged

Empirical training report with no derivations or self-referential logic

The paper is a technical report on post-training an LLM (Jupiter-N from Nemotron base) using data curation strategies like Forget-Me-Not mixing of on-policy and off-policy data. It reports measured benchmark gains (e.g., +18 on ARC-Easy Welsh) as experimental outcomes against external suites, with no equations, derivations, parameter fits presented as predictions, or self-citations that bear the central claims. All load-bearing steps are direct empirical measurements, rendering the work self-contained with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on standard assumptions in LLM post-training about data mixing preventing forgetting and on the quality of synthetic cultural and language data; the Forget-Me-Not framework is newly introduced here.

axioms (2)

domain assumption: Mixing on-policy synthetic replay with off-policy task data mitigates catastrophic forgetting during post-training.
Core premise of the Forget-Me-Not framework described in the abstract.
domain assumption: LLM-translated Welsh conversations and synthetic UK cultural data accurately represent target norms without introducing artifacts.
Used to create the parallel corpora and cultural alignment data.

invented entities (1)

Forget-Me-Not framework (no independent evidence)
purpose: Data curation strategy that mixes synthetic replay with task data to preserve base model capabilities.
Newly presented in this work as the key mechanism for avoiding forgetting while adding targeted skills.

pith-pipeline@v0.9.0 · 5491 in / 1509 out tokens · 62461 ms · 2026-05-10T05:41:33.714208+00:00 · methodology

