Jupiter-N Technical Report
Pith reviewed 2026-05-10 05:41 UTC · model grok-4.3
The pith
Post-training an open 120B model with curated synthetic data and a forgetting-prevention mix produces large targeted gains in Welsh, terminal use, and instruction following while retaining general capabilities.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present Jupiter-N, a hybrid reasoning model post-trained from Nemotron 3 Super, a fully open-source 120 billion parameter LLM. We target three objectives: (1) agentic capability via uncertainty-curated trajectories; (2) UK cultural alignment via synthetic data grounded in cultural norms; and (3) Welsh language support via parallel corpora and LLM-translated Welsh conversations. Our data curation strategy carefully preserves the base model's capabilities: using our Forget-Me-Not framework, we mix on-policy synthetic replay with off-policy task data to mitigate catastrophic forgetting, and include a mixture of reasoning and non-reasoning traces to maintain Nemotron's hybrid reasoning.
What carries the argument
The Forget-Me-Not framework, a data curation strategy that mixes on-policy synthetic replay with off-policy task data and balances reasoning with non-reasoning traces to prevent catastrophic forgetting during targeted post-training.
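As a rough illustration of what such a mixture could look like in code, the sketch below samples a training set from two pools while holding a fixed share of reasoning traces in each. It is not the authors' implementation: the function name `build_mixture`, the `reasoning` flag on each example, and the fractions are all hypothetical, since the summary does not state the actual ratios or sampling scheme.

```python
import random


def build_mixture(replay, task, replay_fraction=0.5,
                  reasoning_fraction=0.5, size=1000, seed=0):
    """Sample a training mix from two pools (hypothetical sketch).

    replay: on-policy examples generated by the base model itself.
    task:   off-policy examples for the new target skills.
    Each example is a dict with a boolean "reasoning" key.
    The fractions are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    n_replay = round(size * replay_fraction)
    quotas = {"replay": n_replay, "task": size - n_replay}
    pools = {"replay": replay, "task": task}

    mix = []
    for source, quota in quotas.items():
        n_reasoning = round(quota * reasoning_fraction)
        for want_reasoning, n in ((True, n_reasoning),
                                  (False, quota - n_reasoning)):
            candidates = [ex for ex in pools[source]
                          if ex["reasoning"] == want_reasoning]
            # Sample with replacement so a small pool does not cap the quota.
            picked = rng.choices(candidates, k=n) if candidates else []
            mix.extend({**ex, "source": source} for ex in picked)

    rng.shuffle(mix)
    return mix
```

Tagging each sampled example with its `source` makes it straightforward to audit the realised proportions after shuffling.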
If this is right
- Substituting cultural knowledge, institutional corpora, and target languages in the same pipeline produces an equivalent model for any country.
- Hybrid reasoning ability remains intact when the training mix includes both reasoning and non-reasoning traces.
- Public release of all weights and post-training datasets under open licences enables direct reproduction and further adaptation.
- Agentic performance rises through the addition of uncertainty-curated trajectories in the synthetic data.
- Cultural alignment improves when synthetic data is grounded in the target norms.
Where Pith is reading between the lines
- The approach could support more decentralized creation of region-specific AI systems by allowing nations to adapt open base models independently.
- Similar curation methods might extend to other specialized domains such as legal or scientific reasoning where capability preservation matters.
- Ablation studies on the individual data components would clarify which parts of the mixture drive the observed gains versus the preservation effect.
- Community follow-up work could test the pipeline on different base models to assess how widely the forgetting-prevention technique applies.
Load-bearing premise
The Forget-Me-Not data curation and synthetic mixture successfully preserves the base model's hybrid reasoning and general capabilities without any loss.
What would settle it
If Jupiter-N scores lower than the original Nemotron on general benchmarks such as standard MMLU or non-Welsh ARC-Easy, that would show loss of base capabilities.
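A minimal sketch of that check, assuming per-benchmark scores for both models are already available as plain dictionaries. The helper name `retention_report` and the `tolerance` parameter are hypothetical, and any scores passed in would have to come from actual evaluation runs; none are supplied here.

```python
def retention_report(base_scores, tuned_scores, tolerance=0.0):
    """Compare tuned-model scores against the base model per benchmark.

    base_scores / tuned_scores: dicts mapping benchmark name to score.
    tolerance: drop (in score points) tolerated before flagging a
               regression; an assumed knob, not taken from the paper.
    """
    report = {}
    for name, base in base_scores.items():
        if name not in tuned_scores:
            continue  # benchmark not evaluated on the tuned model
        delta = tuned_scores[name] - base
        report[name] = {"delta": delta, "regressed": delta < -tolerance}
    return report
```

Any benchmark flagged `regressed` on a non-targeted suite would count as evidence against the retention claim.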
Original abstract
We present Jupiter-N, a hybrid reasoning model post-trained from Nemotron 3 Super, a fully open-source 120 billion parameter LLM. We target three objectives: (1) agentic capability via uncertainty-curated trajectories; (2) UK cultural alignment via synthetic data grounded in cultural norms; and (3) Welsh language support via parallel corpora and LLM-translated Welsh conversations. Our data curation strategy carefully preserves the base model's capabilities: using our Forget-Me-Not framework, we mix on-policy synthetic replay with off-policy task data to mitigate catastrophic forgetting, and include a mixture of reasoning and non-reasoning traces to maintain Nemotron's hybrid reasoning ability. Jupiter-N achieves standout gains over Nemotron in Welsh (+18 on ARC-Easy, +5.25 on MMLU-Lite), terminal-use (+9.1 on Terminal Bench 2) and instruction following (+4.4 on IFBench), while retaining the base model capabilities. We frame this work as a reproducible template for sovereign post-training: substituting cultural knowledge, institutional corpora, and target languages produces an equivalent pipeline for any country. All model weights and all post-training datasets are publicly released under open licences.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Jupiter-N, a 120B-parameter hybrid reasoning model obtained by post-training the open-source Nemotron 3 Super base model. It targets three goals: agentic capability through uncertainty-curated trajectories, UK cultural alignment via synthetic data grounded in cultural norms, and Welsh language support via parallel corpora and translated conversations. The central technical contribution is the Forget-Me-Not data curation strategy, which mixes on-policy synthetic replay with off-policy task data and includes both reasoning and non-reasoning traces to mitigate catastrophic forgetting and preserve the base model's hybrid reasoning ability. The paper reports concrete benchmark gains over Nemotron (Welsh ARC-Easy +18, MMLU-Lite +5.25, Terminal Bench 2 +9.1, IFBench +4.4) while claiming retention of general capabilities, and positions the work as a reproducible template for sovereign post-training, with all weights and datasets released publicly under open licenses.
Significance. If the reported gains and capability retention are substantiated by rigorous evaluation, the work would offer a practical, fully open example of targeted post-training for language and cultural adaptation at 120B scale. The public release of both model weights and the complete post-training datasets is a notable strength that directly supports reproducibility and extension to other languages or regions. The Forget-Me-Not mixture approach, if shown to be effective, could serve as a concrete template for balancing specialization and retention in large-model post-training.
major comments (2)
- [Abstract] Abstract: the central claim that the Forget-Me-Not framework 'successfully preserves the base model's hybrid reasoning ability' and 'retains the base model capabilities' is load-bearing for the paper's contribution, yet the text supplies no ablations (with vs. without Forget-Me-Not), no before/after scores on the base Nemotron evaluation suite, and no quantitative retention metrics on non-targeted general capabilities.
- [Abstract] Abstract: the specific numerical gains (+18 on Welsh ARC-Easy, +5.25 on MMLU-Lite, +9.1 on Terminal Bench 2, +4.4 on IFBench) are presented without any description of evaluation protocol, number of runs, statistical tests, or comparison to alternative post-training mixtures, which is required to attribute the improvements to the proposed data strategy rather than other factors.
minor comments (1)
- [Abstract] The phrase 'sovereign post-training' is used without a precise definition or citation; a short clarifying sentence in the introduction would improve accessibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We agree that strengthening the evidence for capability retention and clarifying evaluation details will improve the manuscript. We address each major comment below and indicate the revisions we will make.
- Referee: [Abstract] Abstract: the central claim that the Forget-Me-Not framework 'successfully preserves the base model's hybrid reasoning ability' and 'retains the base model capabilities' is load-bearing for the paper's contribution, yet the text supplies no ablations (with vs. without Forget-Me-Not), no before/after scores on the base Nemotron evaluation suite, and no quantitative retention metrics on non-targeted general capabilities.
  Authors: We acknowledge that the abstract's claims on retention would be more robust with explicit supporting evidence. The full manuscript reports post-training performance on multiple general benchmarks without observed degradation relative to the base model, but we agree that direct before/after comparisons on the Nemotron evaluation suite and quantitative retention metrics are needed. We will add a dedicated table in the revised Experiments section showing base Nemotron 3 Super versus Jupiter-N scores on key non-targeted benchmarks (including MMLU, GSM8K, and HumanEval subsets) to provide these metrics. We did not conduct a complete ablation removing the Forget-Me-Not mixture, as the on-policy replay component was central to our training pipeline; however, we will expand the existing comparisons to alternative data mixtures to better isolate the contribution of the retention strategy.
  revision: partial
- Referee: [Abstract] Abstract: the specific numerical gains (+18 on Welsh ARC-Easy, +5.25 on MMLU-Lite, +9.1 on Terminal Bench 2, +4.4 on IFBench) are presented without any description of evaluation protocol, number of runs, statistical tests, or comparison to alternative post-training mixtures, which is required to attribute the improvements to the proposed data strategy rather than other factors.
  Authors: We agree that the abstract alone does not convey the evaluation details. The full manuscript describes the evaluation protocol in the Experiments section, including use of standard benchmark implementations, but we will revise the abstract to include a brief reference to this section and add a footnote summarizing the protocol. We will also report the number of runs (five independent evaluations for stochastic tasks, with means and standard deviations) and note where statistical tests were applied. Expanded comparisons to alternative post-training data mixtures are already present in Section 4; we will ensure these are clearly linked to the reported gains to support attribution to the Forget-Me-Not strategy.
  revision: yes
- A full ablation study with versus without the Forget-Me-Not framework was not performed, due to the prohibitive computational cost of additional 120B-scale training runs.
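The run-level reporting promised in the rebuttal (five independent evaluations with means and standard deviations) could be summarised along the following lines. This is a sketch under stated assumptions: the helper names are hypothetical, and the two-standard-error threshold is a crude heuristic chosen for illustration, not a statistical test the authors say they used.

```python
from statistics import mean, stdev


def summarise_runs(scores):
    """Return run count, mean, and sample standard deviation."""
    std = stdev(scores) if len(scores) > 1 else 0.0
    return {"n": len(scores), "mean": mean(scores), "std": std}


def gain_exceeds_noise(base_runs, tuned_runs, k=2.0):
    """Check whether the mean gain is larger than k standard errors.

    k=2.0 is an assumed rule of thumb; a proper analysis would use a
    significance test suited to the benchmark and run count.
    """
    b = summarise_runs(base_runs)
    t = summarise_runs(tuned_runs)
    gain = t["mean"] - b["mean"]
    # Standard error of the difference between two independent means.
    noise = (b["std"] ** 2 / b["n"] + t["std"] ** 2 / t["n"]) ** 0.5
    return gain, gain > k * noise
```

With only five runs per model the standard-error estimate is itself noisy, so a result near the threshold should be treated as inconclusive.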
Circularity Check
Empirical training report with no derivations or self-referential logic
The paper is a technical report on post-training an LLM (Jupiter-N from Nemotron base) using data curation strategies like Forget-Me-Not mixing of on-policy and off-policy data. It reports measured benchmark gains (e.g., +18 on ARC-Easy Welsh) as experimental outcomes against external suites, with no equations, derivations, parameter fits presented as predictions, or self-citations that bear the central claims. All load-bearing steps are direct empirical measurements, rendering the work self-contained with no reduction of outputs to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Mixing on-policy synthetic replay with off-policy task data mitigates catastrophic forgetting during post-training.
- domain assumption: LLM-translated Welsh conversations and synthetic UK cultural data accurately represent target norms without introducing artifacts.
invented entities (1)
- Forget-Me-Not framework: no independent evidence