Phoenix-VL 1.5 Medium Technical Report
Pith reviewed 2026-05-12 04:38 UTC · model grok-4.3
The pith
Phoenix-VL 1.5 Medium is a 123B-parameter natively multimodal model that reaches state-of-the-art results on Singapore multimodal, legal, and policy benchmarks after localized training on 1T+ tokens while staying competitive on global benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Phoenix-VL 1.5 Medium achieves state-of-the-art performance for its size on Singapore multimodal, legal, and government policy benchmarks while remaining globally competitive on general multimodal intelligence, multilingual, and STEM benchmarks.
Load-bearing premise
That the novel human-annotated Singapore multimodal dataset and curated 22B-token corpus sufficiently represent the target domain, and that benchmark scores accurately reflect real-world utility without hidden biases or overfitting to the evaluation sets.
Original abstract
We introduce Phoenix-VL 1.5 Medium, a 123B-parameter natively multimodal and multilingual foundation model adapted to regional languages and the Singapore context. Developed as a sovereign AI asset, it demonstrates that deep domain adaptation can be achieved with minimal degradation to broad-spectrum intelligence and alignment. Continued pretraining was performed on Mistral Medium 3.1 using a localized 1-trillion-token multimodal corpus, followed by a 250-billion-token long-context extension phase. Subsequent post-training incorporated a novel human-annotated Singapore multimodal dataset and a curated textual corpus on Singapore culture, knowledge, and legislation, totaling 22 billion tokens. Model alignment was then performed on an additional 5 billion tokens through Online Direct Preference Optimization. Phoenix-VL 1.5 Medium achieves state-of-the-art performance for its size on Singapore multimodal, legal, and government policy benchmarks while remaining globally competitive on general multimodal intelligence, multilingual, and STEM benchmarks. We also introduce a novel evaluation suite encompassing localized knowledge benchmarks and an institutionally aligned model behavior and safety framework. We report the data curation principles and training methodology, and highlight benchmark and inference performance.
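The abstract names Online Direct Preference Optimization as the alignment method but does not spell out the objective. As a point of reference, here is a minimal PyTorch sketch of the standard DPO loss that online variants build on; this is not the authors' implementation, and all argument names are illustrative.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective on a batch of preference pairs.

    Each input is the summed log-probability of a full response under
    either the trainable policy or a frozen reference model, shape (batch,).
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative log-sigmoid of the reward margin: minimized when the
    # policy prefers the chosen response by a wide margin.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

In the online variant, the chosen/rejected pairs are sampled from the current policy and labeled during training rather than drawn from a fixed preference dataset.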
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Phoenix-VL 1.5 Medium, a 123B-parameter natively multimodal and multilingual foundation model adapted to regional languages and the Singapore context. It describes continued pretraining on Mistral Medium 3.1 using a 1-trillion-token multimodal corpus followed by a 250-billion-token long-context extension, post-training on a novel human-annotated Singapore multimodal dataset plus a curated 22-billion-token textual corpus on Singapore culture, knowledge, and legislation, and 5 billion tokens of alignment via Online Direct Preference Optimization. The model is claimed to achieve state-of-the-art performance for its size on Singapore multimodal, legal, and government policy benchmarks while remaining globally competitive on general multimodal intelligence, multilingual, and STEM benchmarks. A novel evaluation suite for localized knowledge and institutionally aligned model behavior and safety is introduced, along with details on data curation principles, training methodology, and benchmark and inference performance.
Significance. If the performance claims hold, the work is significant for demonstrating that targeted domain adaptation on a localized corpus can produce a sovereign AI asset with limited degradation to general capabilities. This has implications for multilingual and culturally specific AI development. The introduction of a novel evaluation suite and explicit reporting of curation and training details provide a useful template for future regional adaptation efforts.
Major comments (2)
- [Benchmark Results] The central SOTA claim for Singapore multimodal, legal, and government policy benchmarks is load-bearing but unsupported by any numerical scores, baselines, error bars, or ablation results in the abstract or summary. The manuscript must include explicit comparison tables (e.g., in the benchmark results section) showing exact metrics against relevant models to allow verification of the 'state-of-the-art for its size' assertion.
- [Data and Post-Training] The novel human-annotated Singapore multimodal dataset and 22B-token curated corpus are central to the post-training adaptation claim, yet no details are supplied on annotation guidelines, inter-annotator agreement, dataset size breakdown, or contamination checks against the evaluation sets. This leaves the weakest assumption unaddressed and risks undermining the domain-representation argument.
Minor comments (2)
- [Abstract] The abstract states that benchmark and inference performance are highlighted but supplies no concrete metrics; adding one or two key numbers would improve readability without altering length.
- [Training Methodology] Clarify the relationship between the 1-trillion-token continued-pretraining corpus and the separate 22-billion-token post-training corpus to prevent scale confusion; the stated per-phase budgets are recapped in the sketch after this list.
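For orientation, the per-phase token budgets stated in the abstract can be recapped directly; the substance of the comment is that the relationship between these sequential phases should be made explicit in the methodology section. A sketch using only the abstract's numbers:

```python
# Per-phase token budgets as stated in the abstract (units: tokens).
phases = {
    "continued_pretraining": 1_000e9,   # localized multimodal corpus
    "long_context_extension": 250e9,
    "post_training": 22e9,              # Singapore multimodal + textual corpus
    "alignment_online_dpo": 5e9,
}
# The post-training corpus is about 2% the size of the continued-pretraining
# corpus -- the scale relationship the comment asks the authors to state.
print(phases["post_training"] / phases["continued_pretraining"])  # 0.022
```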
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review. The comments help clarify how to strengthen the presentation of our results and data practices. We agree that explicit numerical support for the SOTA claims and greater transparency on the post-training datasets will improve verifiability. We address each major comment below and will incorporate the requested additions in the revised manuscript.
Point-by-point responses
-
Referee: [Benchmark Results] The central SOTA claim for Singapore multimodal, legal, and government policy benchmarks is load-bearing but unsupported by any numerical scores, baselines, error bars, or ablation results in the abstract or summary. The manuscript must include explicit comparison tables (e.g., in the benchmark results section) showing exact metrics against relevant models to allow verification of the 'state-of-the-art for its size' assertion.
Authors: We acknowledge that the abstract and high-level summary do not contain the numerical tables. The full manuscript reports detailed benchmark results in Section 4, including comparisons on Singapore-specific and global benchmarks. To directly address the concern, we will revise the abstract and add a consolidated results table (with exact metrics, baselines, and, where available, error bars) at the beginning of the benchmark section. We will also expand the ablation discussion on the contribution of the localized pretraining and post-training phases. Revision: yes.
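The response promises error bars where available. One standard way to produce them for an accuracy-style benchmark is a nonparametric bootstrap over per-example scores; the sketch below assumes that construction (it is not the authors' procedure, and all names are illustrative).

```python
import numpy as np

def bootstrap_ci(per_example_scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Mean benchmark score with a (1 - alpha) bootstrap confidence interval.

    per_example_scores: one correctness (0/1) or graded score per eval item.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_example_scores, dtype=float)
    n = len(scores)
    # Resample items with replacement; record the mean of each resample.
    means = np.array([rng.choice(scores, size=n, replace=True).mean()
                      for _ in range(n_resamples)])
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (lo, hi)
```

A conservative reading for the SOTA claim: if two models' intervals on the same benchmark do not overlap, the ranking is unlikely to be resampling noise.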
-
Referee: [Data and Post-Training] The novel human-annotated Singapore multimodal dataset and 22B-token curated corpus are central to the post-training adaptation claim, yet no details are supplied on annotation guidelines, inter-annotator agreement, dataset size breakdown, or contamination checks against the evaluation sets. This leaves the weakest assumption unaddressed and risks undermining the domain-representation argument.
Authors: We agree that additional specifics are warranted. The current manuscript summarizes data curation principles and reports the overall token counts, but does not provide the granular details requested. In the revision we will expand the data section to include: (1) annotation guidelines and quality-control procedures, (2) inter-annotator agreement statistics, (3) a breakdown of the 22B-token corpus by category (culture, legislation, multimodal pairs, etc.), and (4) explicit contamination analysis against the evaluation benchmarks. These additions will be placed in the post-training data subsection. Revision: yes.
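Two of the promised additions, inter-annotator agreement and contamination analysis, have standard forms that referees can check against. A minimal sketch of both, assuming categorical annotation labels and whitespace tokenization; for a 22B-token corpus the training-side n-grams would in practice live in a hashed or Bloom-filter index rather than an in-memory set.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    n = len(labels_a)
    assert n == len(labels_b) and n > 0
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:  # degenerate case: both annotators constant and identical
        return 1.0
    return (observed - expected) / (1 - expected)

def is_contaminated(eval_text, train_ngrams, n=13):
    """Flag an eval item that shares any n-gram with the training corpus
    (the GPT-3-style 13-gram overlap heuristic).

    train_ngrams: precomputed set of n-gram tuples over the training data.
    """
    toks = eval_text.lower().split()
    return any(tuple(toks[i:i + n]) in train_ngrams
               for i in range(len(toks) - n + 1))
```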
Circularity Check
No significant circularity
Full rationale
The paper is a technical report on model training and empirical benchmark results. It describes continued pretraining on a 1-trillion token corpus, post-training on a 22B token Singapore-specific dataset, and reports performance on various benchmarks. No mathematical derivations, equations, predictions from fitted parameters, or self-referential definitions are present. Claims rest on held-out evaluation metrics rather than any reduction to inputs by construction. No load-bearing steps match the enumerated circularity patterns.