Recognition: unknown
Valley3: Scaling Omni Foundation Models for E-commerce
Pith reviewed 2026-05-09 14:50 UTC · model grok-4.3
The pith
Valley3 uses a four-stage training pipeline to build an omni model that unifies text, image, video, and audio understanding for e-commerce tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Valley3 extends existing vision-language models through a four-stage omni e-commerce continued pre-training pipeline that sequentially develops audio understanding, cross-modal instruction-following, e-commerce domain knowledge, and long-context reasoning. Post-training then adds one non-thinking mode, three levels of thinking, and agentic search capabilities. The resulting model is claimed to consistently outperform strong baselines on in-house and open-source e-commerce benchmarks spanning six tasks while remaining competitive on general-domain benchmarks.
What carries the argument
The four-stage continued pre-training pipeline that progressively adds audio understanding, cross-modal capabilities, e-commerce knowledge, and long-context reasoning before post-training for controllable reasoning modes.
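To make the staged schedule concrete, here is a minimal sketch of what a sequential four-stage driver could look like. The stage names, dataset mixtures, frozen-module choices, and the `train_stage` interface are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a four-stage continued pre-training schedule.
# Stage specs, mixtures, and `train_stage` are assumptions for illustration.

from dataclasses import dataclass, field

@dataclass
class StageSpec:
    name: str
    data_mixture: dict[str, float]              # dataset name -> sampling weight
    frozen_modules: list[str] = field(default_factory=list)
    max_context: int = 8_192

# One stage per capability, run sequentially on the same checkpoint.
PIPELINE = [
    StageSpec("audio_alignment",
              {"asr_pairs": 0.6, "audio_captions": 0.4},
              frozen_modules=["vision_encoder", "llm"]),
    StageSpec("cross_modal_instructions",
              {"av_instructions": 0.5, "vqa": 0.3, "text_sft": 0.2}),
    StageSpec("ecommerce_knowledge",
              {"product_pages": 0.5, "short_video_qa": 0.3, "reviews": 0.2}),
    StageSpec("long_context_reasoning",
              {"long_docs": 0.6, "multi_video": 0.4},
              max_context=128_000),
]

def run_pipeline(checkpoint, train_stage):
    """Thread one checkpoint through all four stages in order."""
    for spec in PIPELINE:
        checkpoint = train_stage(checkpoint, spec)  # returns updated weights
    return checkpoint
```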
Load-bearing premise
The four-stage pipeline and post-training steps deliver the full set of claimed audio, cross-modal, domain, and reasoning capabilities without hidden performance trade-offs or efficiency losses.
What would settle it
An independent, multilingual test set of short-video e-commerce queries: the claim would fail if Valley3 showed no improvement over baselines there, or dropped sharply on general reasoning tasks.
Original abstract
In this work, we present Valley3, an omni multimodal large language model (MLLM) developed for diverse global e-commerce tasks, with unified understanding and reasoning capabilities across text, images, video, and audio. A key feature of Valley3 is its native multilingual audio capability for e-commerce, developed by extending vision-language models to better support crucial audio-visual tasks, particularly in short-video scenarios. To achieve this, we carefully design a four-stage omni e-commerce continued pre-training pipeline, through which Valley3 progressively acquires audio understanding, cross-modal instruction-following, e-commerce domain knowledge, and long-context reasoning capabilities, ultimately evolving into an omni model for diverse e-commerce scenarios. Then, we further improve Valley3 through post-training to encourage long-chain reasoning with controllable reasoning modes, enabling one non-thinking mode and three distinct levels of thinking, thereby balancing inference efficiency in simple scenarios with deep reasoning for complex applications. Moreover, we equip Valley3 with agentic search capabilities to proactively invoke search tools and acquire task-relevant information for e-commerce deep research tasks. To comprehensively assess the capabilities of Valley3, we construct an omni e-commerce benchmark spanning 6 tasks. Experimental results show that Valley3 consistently outperforms strong baselines on our in-house and open-source e-commerce benchmarks, while remaining competitive on general-domain benchmarks.
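The controllable reasoning modes described above imply some serving-time switch between a non-thinking path and three thinking depths. The sketch below shows one plausible shape for such an interface; the mode names, token budgets, and control-tag convention are assumptions, since the abstract does not specify them.

```python
# Hedged sketch of the "one non-thinking mode + three thinking levels"
# interface. Mode names, token budgets, and the control-tag convention
# are assumptions for illustration only, not the paper's API.

from enum import Enum

class ReasoningMode(Enum):
    NO_THINK = 0      # direct answer, lowest latency
    THINK_LOW = 1
    THINK_MEDIUM = 2
    THINK_HIGH = 3    # deepest reasoning trace, highest latency

# Assumed per-mode budget for the hidden reasoning trace, in tokens.
THINK_BUDGET = {
    ReasoningMode.NO_THINK: 0,
    ReasoningMode.THINK_LOW: 512,
    ReasoningMode.THINK_MEDIUM: 2048,
    ReasoningMode.THINK_HIGH: 8192,
}

def build_request(user_query: str, mode: ReasoningMode) -> dict:
    """Pack a query with a mode control tag, as one plausible serving API."""
    return {
        "messages": [
            {"role": "system", "content": f"/reasoning:{mode.name.lower()}"},
            {"role": "user", "content": user_query},
        ],
        "max_reasoning_tokens": THINK_BUDGET[mode],
    }

# Simple queries take the cheap path; deep research tasks the expensive one.
print(build_request("Is this listing a phone case?", ReasoningMode.NO_THINK))
```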
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Valley3, an omni multimodal large language model for global e-commerce tasks with unified capabilities across text, images, video, and audio. It describes a four-stage continued pre-training pipeline that progressively builds audio understanding, cross-modal instruction-following, e-commerce domain knowledge, and long-context reasoning, followed by post-training to enable controllable reasoning modes (one non-thinking mode and three thinking levels) plus agentic search. The model is evaluated on a newly constructed omni e-commerce benchmark spanning 6 tasks, with claims of consistent outperformance over strong baselines on in-house and open-source e-commerce benchmarks while remaining competitive on general-domain benchmarks.
Significance. If the performance claims and pipeline effectiveness are substantiated, the work would offer a practical template for scaling omni models in a high-value domain, particularly by extending vision-language models to native multilingual audio for short-video e-commerce scenarios. The controllable reasoning modes and agentic search features address real deployment trade-offs between efficiency and depth. However, the absence of supporting ablations and benchmark details currently limits the ability to gauge broader impact or reproducibility.
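As a point of reference for the agentic search feature, a common realization is a bounded reason-act loop in which the model either emits an answer or requests a tool call. The sketch below assumes hypothetical `model_step` and `search_tool` callables and is not the authors' implementation.

```python
# Minimal sketch of an agentic search loop: the model decides when to call
# a search tool and folds results back into context. `model_step` and
# `search_tool` are hypothetical callables, not the paper's interfaces.

def deep_research(query, model_step, search_tool, max_turns=6):
    """Alternate model reasoning with tool calls until an answer is emitted."""
    context = [{"role": "user", "content": query}]
    for _ in range(max_turns):
        action = model_step(context)  # -> {"type": "search"|"answer", ...}
        if action["type"] == "answer":
            return action["content"]
        # Model asked for evidence: run the tool and append the observation.
        results = search_tool(action["query"])
        context.append({"role": "tool", "content": results})
    return None  # turn budget exhausted without a final answer
```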
Major comments (3)
- [Abstract, Experiments] The central claim that Valley3 'consistently outperforms strong baselines on our in-house and open-source e-commerce benchmarks' cannot be evaluated because the manuscript supplies no details on benchmark construction, task definitions, baseline selection criteria, statistical tests, or error bars.
- [Method] The assertion that the four-stage omni e-commerce continued pre-training pipeline 'progressively acquires audio understanding, cross-modal instruction-following, e-commerce domain knowledge, and long-context reasoning' is load-bearing for the performance claims, yet no ablation studies, stage-wise metrics, intermediate checkpoint results, or single-stage baseline comparisons are provided to isolate each stage's contribution or rule out trade-offs.
- [Post-training, Experiments] The controllable reasoning modes (non-thinking plus three thinking levels) and agentic search are presented as improvements, but no quantitative comparisons (e.g., accuracy vs. latency trade-offs or ablations over the reasoning modes) are reported to demonstrate that these additions deliver the claimed balance without degrading other capabilities.
Minor comments (2)
- [Introduction] The repeated use of 'omni' would benefit from an explicit definition or scope statement early in the introduction to clarify which modalities and capabilities are included versus excluded.
- [Experiments] Figure and table captions in the experimental section could be expanded to include exact metric definitions and baseline model versions for easier cross-reference.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We agree that additional details and analyses are needed to strengthen the substantiation and reproducibility of our claims. We will revise the manuscript accordingly, as detailed in the point-by-point responses below.
Point-by-point responses
-
Referee: [Abstract, Experiments] The central claim that Valley3 'consistently outperforms strong baselines on our in-house and open-source e-commerce benchmarks' cannot be evaluated because the manuscript supplies no details on benchmark construction, task definitions, baseline selection criteria, statistical tests, or error bars.
Authors: We agree that the current level of detail limits independent evaluation. In the revised manuscript, we will expand the Experiments section with a dedicated subsection on benchmark construction. This will describe data sources and curation for the 6 tasks in the omni e-commerce benchmark, precise task definitions, how in-house data was sampled from real global e-commerce scenarios, and adaptations of open-source datasets for multimodal inputs. We will also specify baseline selection criteria (strong open-source and proprietary models matched on scale and modality support), report results with error bars from multiple runs, and include statistical significance tests (e.g., paired t-tests) to support the outperformance claims. revision: yes
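For the promised significance testing, a paired test over per-example scores is the natural fit, since both models are scored on the same benchmark items. A minimal sketch using `scipy.stats.ttest_rel` follows; the score arrays and significance threshold are placeholders.

```python
# Sketch of the proposed paired significance test, assuming per-example
# scores for Valley3 and one baseline on the same benchmark items.

import numpy as np
from scipy.stats import ttest_rel

def compare_models(valley3_scores, baseline_scores, alpha=0.05):
    a = np.asarray(valley3_scores, dtype=float)
    b = np.asarray(baseline_scores, dtype=float)
    t_stat, p_value = ttest_rel(a, b)          # paired: same items, two models
    mean_gain = (a - b).mean()
    stderr = (a - b).std(ddof=1) / np.sqrt(len(a))  # for an error bar
    return {
        "mean_gain": mean_gain,
        "stderr": stderr,
        "p_value": p_value,
        # two-sided p below alpha AND a positive mean gain
        "significant": p_value < alpha and mean_gain > 0,
    }
```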
-
Referee: [Method] The assertion that the four-stage omni e-commerce continued pre-training pipeline 'progressively acquires audio understanding, cross-modal instruction-following, e-commerce domain knowledge, and long-context reasoning' is load-bearing for the performance claims, yet no ablation studies, stage-wise metrics, intermediate checkpoint results, or single-stage baseline comparisons are provided to isolate each stage's contribution or rule out trade-offs.
Authors: The pipeline design is motivated by incremental capability building, but we acknowledge that ablations are essential to isolate effects. In the revision, we will add a new ablation subsection reporting stage-wise metrics on representative tasks (e.g., audio QA, cross-modal VQA, domain-specific reasoning). We will include performance of intermediate checkpoints after each stage and comparisons to single-stage or partial-pipeline baselines. These results will demonstrate the progressive gains and confirm that later stages do not introduce trade-offs in earlier-acquired capabilities. revision: yes
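The proposed ablation reduces to a stage-by-probe score grid in which a drop down any column signals forgetting of an earlier-acquired capability. A minimal sketch follows; the stage names, probe names, and `evaluate` callable are illustrative assumptions.

```python
# Sketch of a stage-wise ablation grid: evaluate the checkpoint saved after
# each stage on every capability probe. Names and `evaluate` are placeholders.

STAGES = ["audio", "cross_modal", "ecommerce", "long_context"]
PROBES = ["audio_qa", "cross_modal_vqa", "domain_reasoning", "general_vqa"]

def ablation_grid(checkpoints, evaluate):
    """checkpoints: stage name -> model; evaluate(model, probe) -> score."""
    return {
        stage: {probe: evaluate(checkpoints[stage], probe) for probe in PROBES}
        for stage in STAGES
    }

def regressions(grid, tolerance=0.01):
    """Flag any probe whose score falls between consecutive stages."""
    flags = []
    for prev, curr in zip(STAGES, STAGES[1:]):
        for probe in PROBES:
            if grid[curr][probe] < grid[prev][probe] - tolerance:
                flags.append((prev, curr, probe))
    return flags
```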
-
Referee: [Post-training, Experiments] The controllable reasoning modes (non-thinking plus three thinking levels) and agentic search are presented as improvements, but no quantitative comparisons (e.g., accuracy vs. latency trade-offs or ablations over the reasoning modes) are reported to demonstrate that these additions deliver the claimed balance without degrading other capabilities.
Authors: We will revise the post-training and Experiments sections to include the requested quantitative evidence. Specifically, we will report accuracy versus latency trade-offs for the non-thinking mode and each of the three thinking levels across task types, plus ablations measuring the isolated impact of agentic search on deep research tasks. These additions will show that the controllable modes enable flexible efficiency-depth balancing while preserving or improving base performance, with no degradation on standard benchmarks. revision: yes
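The promised trade-off analysis amounts to sweeping each reasoning mode over a fixed task set and recording accuracy and latency per mode, one operating point each. The sketch below assumes a hypothetical `run_model` interface and exact-match scoring.

```python
# Sketch of an accuracy-versus-latency sweep over reasoning modes.
# `run_model` and the mode labels are placeholders, not the paper's API.

import time

MODES = ["no_think", "think_low", "think_medium", "think_high"]

def sweep_modes(tasks, run_model):
    """tasks: list of (input, gold); run_model(input, mode) -> prediction."""
    points = []
    for mode in MODES:
        correct, elapsed = 0, 0.0
        for x, gold in tasks:
            t0 = time.perf_counter()
            pred = run_model(x, mode)
            elapsed += time.perf_counter() - t0
            correct += int(pred == gold)   # assumed exact-match scoring
        points.append({
            "mode": mode,
            "accuracy": correct / len(tasks),
            "mean_latency_s": elapsed / len(tasks),
        })
    return points  # plot as a Pareto curve: latency on x, accuracy on y
```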
Circularity Check
No circularity: empirical training pipeline and benchmark results
Full rationale
The paper describes a four-stage continued pre-training pipeline and post-training steps for Valley3, then reports outperformance on in-house and open-source e-commerce benchmarks plus competitiveness on general benchmarks. No mathematical derivations, equations, or first-principles results are presented. No parameters are fitted to a subset and then called predictions. No self-citations are invoked as load-bearing uniqueness theorems or to smuggle in ansatzes. The central claims are validated externally via benchmark comparisons rather than reducing to self-referential definitions or inputs by construction, so the argument is free of circularity.