Phoenix-VL 1.5 Medium Technical Report
Pith reviewed 2026-05-12 04:38 UTC · model grok-4.3
The pith
Phoenix-VL 1.5 Medium is a 123B-parameter natively multimodal model that reaches state-of-the-art results on Singapore multimodal, legal, and policy benchmarks after localized training on 1T+ tokens while staying competitive on global benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Phoenix-VL 1.5 Medium achieves state-of-the-art performance for its size on Singapore multimodal, legal, and government policy benchmarks while remaining globally competitive on general multimodal intelligence, multilingual, and STEM benchmarks.
Load-bearing premise
That the novel human-annotated Singapore multimodal dataset and curated 22B-token corpus sufficiently represent the target domain, and that benchmark scores accurately reflect real-world utility without hidden biases or overfitting to the evaluation sets.
Original abstract
We introduce Phoenix-VL 1.5 Medium, a 123B-parameter natively multimodal and multilingual foundation model adapted to regional languages and the Singapore context. Developed as a sovereign AI asset, it demonstrates that deep domain adaptation can be achieved with minimal degradation to broad-spectrum intelligence and alignment. Continued pretraining was performed on Mistral Medium 3.1 using a localized 1-trillion-token multimodal corpus, followed by a 250-billion-token long-context extension phase. Subsequent post-training incorporated a novel human-annotated Singapore multimodal dataset and a curated textual corpus on Singapore culture, knowledge, and legislation, totaling 22 billion tokens. Model alignment was then performed on an additional 5 billion tokens through Online Direct Preference Optimization. Phoenix-VL 1.5 Medium achieves state-of-the-art performance for its size on Singapore multimodal, legal, and government policy benchmarks while remaining globally competitive on general multimodal intelligence, multilingual, and STEM benchmarks. We also introduce a novel evaluation suite encompassing localized knowledge benchmarks and an institutionally aligned model behavior and safety framework. We report the data curation principles and training methodology, and highlight benchmark and inference performance.
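The abstract names Online Direct Preference Optimization as the alignment method but does not spell out the objective. As a point of reference, here is a minimal PyTorch sketch of the standard DPO loss that online variants build on; this is not the authors' implementation, and all argument names are illustrative.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective on a batch of preference pairs.

    Each input is the summed log-probability of a full response under
    either the trainable policy or a frozen reference model, shape (batch,).
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative log-sigmoid of the reward margin: minimized when the
    # policy prefers the chosen response by a wide margin.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

In the online variant, the chosen/rejected pairs are sampled from the current policy and labeled during training rather than drawn from a fixed preference dataset.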
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Phoenix-VL 1.5 Medium, a 123B-parameter natively multimodal and multilingual foundation model adapted to regional languages and the Singapore context. It describes continued pretraining on Mistral Medium 3.1 using a 1-trillion-token multimodal corpus followed by a 250-billion-token long-context extension, post-training on a novel human-annotated Singapore multimodal dataset plus a curated 22-billion-token textual corpus on Singapore culture, knowledge, and legislation, and 5 billion tokens of alignment via Online Direct Preference Optimization. The model is claimed to achieve state-of-the-art performance for its size on Singapore multimodal, legal, and government policy benchmarks while remaining globally competitive on general multimodal intelligence, multilingual, and STEM benchmarks. A novel evaluation suite for localized knowledge and institutionally aligned model behavior and safety is introduced, along with details on data curation principles, training methodology, and benchmark and inference performance.
Significance. If the performance claims hold, the work is significant for demonstrating that targeted domain adaptation on a localized corpus can produce a sovereign AI asset with limited degradation to general capabilities. This has implications for multilingual and culturally specific AI development. The introduction of a novel evaluation suite and explicit reporting of curation and training details provide a useful template for future regional adaptation efforts.
Major comments (2)
- [Benchmark Results] The central SOTA claim for Singapore multimodal, legal, and government policy benchmarks is load-bearing but unsupported by any numerical scores, baselines, error bars, or ablation results in the abstract or summary. The manuscript must include explicit comparison tables (e.g., in the benchmark results section) showing exact metrics against relevant models to allow verification of the 'state-of-the-art for its size' assertion.
- [Data and Post-Training] The novel human-annotated Singapore multimodal dataset and 22B-token curated corpus are central to the post-training adaptation claim, yet no details are supplied on annotation guidelines, inter-annotator agreement, dataset size breakdown, or contamination checks against the evaluation sets. This leaves the weakest assumption unaddressed and risks undermining the domain-representation argument.
Minor comments (2)
- [Abstract] The abstract states that benchmark and inference performance are highlighted but supplies no concrete metrics; adding one or two key numbers would improve readability without altering length.
- [Training Methodology] Clarify the relationship between the 1-trillion-token continued-pretraining corpus and the separate 22-billion-token post-training corpus to prevent scale confusion; the stated per-phase budgets are recapped in the sketch after this list.
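For orientation, the per-phase token budgets stated in the abstract can be recapped directly; the substance of the comment is that the relationship between these sequential phases should be made explicit in the methodology section. A sketch using only the abstract's numbers:

```python
# Per-phase token budgets as stated in the abstract (units: tokens).
phases = {
    "continued_pretraining": 1_000e9,   # localized multimodal corpus
    "long_context_extension": 250e9,
    "post_training": 22e9,              # Singapore multimodal + textual corpus
    "alignment_online_dpo": 5e9,
}
# The post-training corpus is about 2% the size of the continued-pretraining
# corpus -- the scale relationship the comment asks the authors to state.
print(phases["post_training"] / phases["continued_pretraining"])  # 0.022
```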
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review. The comments help clarify how to strengthen the presentation of our results and data practices. We agree that explicit numerical support for the SOTA claims and greater transparency on the post-training datasets will improve verifiability. We address each major comment below and will incorporate the requested additions in the revised manuscript.
Point-by-point responses
-
Referee: [Benchmark Results] The central SOTA claim for Singapore multimodal, legal, and government policy benchmarks is load-bearing but unsupported by any numerical scores, baselines, error bars, or ablation results in the abstract or summary. The manuscript must include explicit comparison tables (e.g., in the benchmark results section) showing exact metrics against relevant models to allow verification of the 'state-of-the-art for its size' assertion.
Authors: We acknowledge that the abstract and high-level summary do not contain the numerical tables. The full manuscript reports detailed benchmark results in Section 4, including comparisons on Singapore-specific and global benchmarks. To directly address the concern, we will revise the abstract and add a consolidated results table (with exact metrics, baselines, and, where available, error bars) at the beginning of the benchmark section. We will also expand the ablation discussion on the contribution of the localized pretraining and post-training phases. Revision: yes.
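The response promises error bars where available. One standard way to produce them for an accuracy-style benchmark is a nonparametric bootstrap over per-example scores; the sketch below assumes that construction (it is not the authors' procedure, and all names are illustrative).

```python
import numpy as np

def bootstrap_ci(per_example_scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Mean benchmark score with a (1 - alpha) bootstrap confidence interval.

    per_example_scores: one correctness (0/1) or graded score per eval item.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_example_scores, dtype=float)
    n = len(scores)
    # Resample items with replacement; record the mean of each resample.
    means = np.array([rng.choice(scores, size=n, replace=True).mean()
                      for _ in range(n_resamples)])
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (lo, hi)
```

A conservative reading for the SOTA claim: if two models' intervals on the same benchmark do not overlap, the ranking is unlikely to be resampling noise.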
-
Referee: [Data and Post-Training] The novel human-annotated Singapore multimodal dataset and 22B-token curated corpus are central to the post-training adaptation claim, yet no details are supplied on annotation guidelines, inter-annotator agreement, dataset size breakdown, or contamination checks against the evaluation sets. This leaves the weakest assumption unaddressed and risks undermining the domain-representation argument.
Authors: We agree that additional specifics are warranted. The current manuscript summarizes data curation principles and reports the overall token counts, but does not provide the granular details requested. In the revision we will expand the data section to include: (1) annotation guidelines and quality-control procedures, (2) inter-annotator agreement statistics, (3) a breakdown of the 22B-token corpus by category (culture, legislation, multimodal pairs, etc.), and (4) explicit contamination analysis against the evaluation benchmarks. These additions will be placed in the post-training data subsection. Revision: yes.
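Two of the promised additions, inter-annotator agreement and contamination analysis, have standard forms that referees can check against. A minimal sketch of both, assuming categorical annotation labels and whitespace tokenization; for a 22B-token corpus the training-side n-grams would in practice live in a hashed or Bloom-filter index rather than an in-memory set.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    n = len(labels_a)
    assert n == len(labels_b) and n > 0
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:  # degenerate case: both annotators constant and identical
        return 1.0
    return (observed - expected) / (1 - expected)

def is_contaminated(eval_text, train_ngrams, n=13):
    """Flag an eval item that shares any n-gram with the training corpus
    (the GPT-3-style 13-gram overlap heuristic).

    train_ngrams: precomputed set of n-gram tuples over the training data.
    """
    toks = eval_text.lower().split()
    return any(tuple(toks[i:i + n]) in train_ngrams
               for i in range(len(toks) - n + 1))
```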
Circularity Check
No significant circularity
Full rationale
The paper is a technical report on model training and empirical benchmark results. It describes continued pretraining on a 1-trillion token corpus, post-training on a 22B token Singapore-specific dataset, and reports performance on various benchmarks. No mathematical derivations, equations, predictions from fitted parameters, or self-referential definitions are present. Claims rest on held-out evaluation metrics rather than any reduction to inputs by construction. No load-bearing steps match the enumerated circularity patterns.