Recognition: unknown
Valley3: Scaling Omni Foundation Models for E-commerce
Pith reviewed 2026-05-09 14:50 UTC · model grok-4.3
The pith
Valley3 uses a four-stage training pipeline to build an omni model that unifies text, image, video, and audio understanding for e-commerce tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Valley3 extends existing vision-language models through a four-stage omni e-commerce continued pre-training pipeline that sequentially develops audio understanding, cross-modal instruction-following, e-commerce domain knowledge, and long-context reasoning. Post-training then adds one non-thinking mode, three levels of thinking, and agentic search capabilities. The resulting model is claimed to consistently outperform strong baselines on in-house and open-source e-commerce benchmarks spanning six tasks while remaining competitive on general-domain benchmarks.
What carries the argument
The four-stage continued pre-training pipeline that progressively adds audio understanding, cross-modal capabilities, e-commerce knowledge, and long-context reasoning before post-training for controllable reasoning modes.
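To make the staged schedule concrete, here is a minimal sketch of what a sequential four-stage driver could look like. The stage names, dataset mixtures, frozen-module choices, and the `train_stage` interface are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a four-stage continued pre-training schedule.
# Stage specs, mixtures, and `train_stage` are assumptions for illustration.

from dataclasses import dataclass, field

@dataclass
class StageSpec:
    name: str
    data_mixture: dict[str, float]              # dataset name -> sampling weight
    frozen_modules: list[str] = field(default_factory=list)
    max_context: int = 8_192

# One stage per capability, run sequentially on the same checkpoint.
PIPELINE = [
    StageSpec("audio_alignment",
              {"asr_pairs": 0.6, "audio_captions": 0.4},
              frozen_modules=["vision_encoder", "llm"]),
    StageSpec("cross_modal_instructions",
              {"av_instructions": 0.5, "vqa": 0.3, "text_sft": 0.2}),
    StageSpec("ecommerce_knowledge",
              {"product_pages": 0.5, "short_video_qa": 0.3, "reviews": 0.2}),
    StageSpec("long_context_reasoning",
              {"long_docs": 0.6, "multi_video": 0.4},
              max_context=128_000),
]

def run_pipeline(checkpoint, train_stage):
    """Thread one checkpoint through all four stages in order."""
    for spec in PIPELINE:
        checkpoint = train_stage(checkpoint, spec)  # returns updated weights
    return checkpoint
```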
Load-bearing premise
The four-stage pipeline and post-training steps deliver the full set of claimed audio, cross-modal, domain, and reasoning capabilities without hidden performance trade-offs or efficiency losses.
What would settle it
An independent, multilingual test set of short-video e-commerce queries: the claim would fail if Valley3 showed no improvement over baselines there, or dropped sharply on general reasoning tasks.
Original abstract
In this work, we present Valley3, an omni multimodal large language model (MLLM) developed for diverse global e-commerce tasks, with unified understanding and reasoning capabilities across text, images, video, and audio. A key feature of Valley3 is its native multilingual audio capability for e-commerce, developed by extending vision-language models to better support crucial audio-visual tasks, particularly in short-video scenarios. To achieve this, we carefully design a four-stage omni e-commerce continued pre-training pipeline, through which Valley3 progressively acquires audio understanding, cross-modal instruction-following, e-commerce domain knowledge, and long-context reasoning capabilities, ultimately evolving into an omni model for diverse e-commerce scenarios. Then, we further improve Valley3 through post-training to encourage long-chain reasoning with controllable reasoning modes, enabling one non-thinking mode and three distinct levels of thinking, thereby balancing inference efficiency in simple scenarios with deep reasoning for complex applications. Moreover, we equip Valley3 with agentic search capabilities to proactively invoke search tools and acquire task-relevant information for e-commerce deep research tasks. To comprehensively assess the capabilities of Valley3, we construct an omni e-commerce benchmark spanning 6 tasks. Experimental results show that Valley3 consistently outperforms strong baselines on our in-house and open-source e-commerce benchmarks, while remaining competitive on general-domain benchmarks.
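The controllable reasoning modes described above imply some serving-time switch between a non-thinking path and three thinking depths. The sketch below shows one plausible shape for such an interface; the mode names, token budgets, and control-tag convention are assumptions, since the abstract does not specify them.

```python
# Hedged sketch of the "one non-thinking mode + three thinking levels"
# interface. Mode names, token budgets, and the control-tag convention
# are assumptions for illustration only, not the paper's API.

from enum import Enum

class ReasoningMode(Enum):
    NO_THINK = 0      # direct answer, lowest latency
    THINK_LOW = 1
    THINK_MEDIUM = 2
    THINK_HIGH = 3    # deepest reasoning trace, highest latency

# Assumed per-mode budget for the hidden reasoning trace, in tokens.
THINK_BUDGET = {
    ReasoningMode.NO_THINK: 0,
    ReasoningMode.THINK_LOW: 512,
    ReasoningMode.THINK_MEDIUM: 2048,
    ReasoningMode.THINK_HIGH: 8192,
}

def build_request(user_query: str, mode: ReasoningMode) -> dict:
    """Pack a query with a mode control tag, as one plausible serving API."""
    return {
        "messages": [
            {"role": "system", "content": f"/reasoning:{mode.name.lower()}"},
            {"role": "user", "content": user_query},
        ],
        "max_reasoning_tokens": THINK_BUDGET[mode],
    }

# Simple queries take the cheap path; deep research tasks the expensive one.
print(build_request("Is this listing a phone case?", ReasoningMode.NO_THINK))
```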
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Valley3, an omni multimodal large language model for global e-commerce tasks with unified capabilities across text, images, video, and audio. It describes a four-stage continued pre-training pipeline that progressively builds audio understanding, cross-modal instruction-following, e-commerce domain knowledge, and long-context reasoning, followed by post-training to enable controllable reasoning modes (one non-thinking mode and three thinking levels) plus agentic search. The model is evaluated on a newly constructed omni e-commerce benchmark spanning 6 tasks, with claims of consistent outperformance over strong baselines on in-house and open-source e-commerce benchmarks while remaining competitive on general-domain benchmarks.
Significance. If the performance claims and pipeline effectiveness are substantiated, the work would offer a practical template for scaling omni models in a high-value domain, particularly by extending vision-language models to native multilingual audio for short-video e-commerce scenarios. The controllable reasoning modes and agentic search features address real deployment trade-offs between efficiency and depth. However, the absence of supporting ablations and benchmark details currently limits the ability to gauge broader impact or reproducibility.
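As a point of reference for the agentic search feature, a common realization is a bounded reason-act loop in which the model either emits an answer or requests a tool call. The sketch below assumes hypothetical `model_step` and `search_tool` callables and is not the authors' implementation.

```python
# Minimal sketch of an agentic search loop: the model decides when to call
# a search tool and folds results back into context. `model_step` and
# `search_tool` are hypothetical callables, not the paper's interfaces.

def deep_research(query, model_step, search_tool, max_turns=6):
    """Alternate model reasoning with tool calls until an answer is emitted."""
    context = [{"role": "user", "content": query}]
    for _ in range(max_turns):
        action = model_step(context)  # -> {"type": "search"|"answer", ...}
        if action["type"] == "answer":
            return action["content"]
        # Model asked for evidence: run the tool and append the observation.
        results = search_tool(action["query"])
        context.append({"role": "tool", "content": results})
    return None  # turn budget exhausted without a final answer
```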
Major comments (3)
- [Abstract, Experiments] The central claim that Valley3 'consistently outperforms strong baselines on our in-house and open-source e-commerce benchmarks' cannot be evaluated because the manuscript supplies no details on benchmark construction, task definitions, baseline selection criteria, statistical tests, or error bars.
- [Method] The assertion that the four-stage omni e-commerce continued pre-training pipeline 'progressively acquires audio understanding, cross-modal instruction-following, e-commerce domain knowledge, and long-context reasoning' is load-bearing for the performance claims, yet no ablation studies, stage-wise metrics, intermediate checkpoint results, or single-stage baseline comparisons are provided to isolate each stage's contribution or rule out trade-offs.
- [Post-training, Experiments] The controllable reasoning modes (non-thinking plus three thinking levels) and agentic search are presented as improvements, but no quantitative comparisons (e.g., accuracy vs. latency trade-offs or ablations over the reasoning modes) are reported to demonstrate that these additions deliver the claimed balance without degrading other capabilities.
Minor comments (2)
- [Introduction] The repeated use of 'omni' would benefit from an explicit definition or scope statement early in the introduction to clarify which modalities and capabilities are included versus excluded.
- [Experiments] Figure and table captions in the experimental section could be expanded to include exact metric definitions and baseline model versions for easier cross-reference.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We agree that additional details and analyses are needed to strengthen the substantiation and reproducibility of our claims. We will revise the manuscript accordingly, as detailed in the point-by-point responses below.
Point-by-point responses
-
Referee: [Abstract, Experiments] The central claim that Valley3 'consistently outperforms strong baselines on our in-house and open-source e-commerce benchmarks' cannot be evaluated because the manuscript supplies no details on benchmark construction, task definitions, baseline selection criteria, statistical tests, or error bars.
Authors: We agree that the current level of detail limits independent evaluation. In the revised manuscript, we will expand the Experiments section with a dedicated subsection on benchmark construction. This will describe data sources and curation for the 6 tasks in the omni e-commerce benchmark, precise task definitions, how in-house data was sampled from real global e-commerce scenarios, and adaptations of open-source datasets for multimodal inputs. We will also specify baseline selection criteria (strong open-source and proprietary models matched on scale and modality support), report results with error bars from multiple runs, and include statistical significance tests (e.g., paired t-tests) to support the outperformance claims. revision: yes
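For the promised significance testing, a paired test over per-example scores is the natural fit, since both models are scored on the same benchmark items. A minimal sketch using `scipy.stats.ttest_rel` follows; the score arrays and significance threshold are placeholders.

```python
# Sketch of the proposed paired significance test, assuming per-example
# scores for Valley3 and one baseline on the same benchmark items.

import numpy as np
from scipy.stats import ttest_rel

def compare_models(valley3_scores, baseline_scores, alpha=0.05):
    a = np.asarray(valley3_scores, dtype=float)
    b = np.asarray(baseline_scores, dtype=float)
    t_stat, p_value = ttest_rel(a, b)          # paired: same items, two models
    mean_gain = (a - b).mean()
    stderr = (a - b).std(ddof=1) / np.sqrt(len(a))  # for an error bar
    return {
        "mean_gain": mean_gain,
        "stderr": stderr,
        "p_value": p_value,
        # two-sided p below alpha AND a positive mean gain
        "significant": p_value < alpha and mean_gain > 0,
    }
```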
-
Referee: [Method] The assertion that the four-stage omni e-commerce continued pre-training pipeline 'progressively acquires audio understanding, cross-modal instruction-following, e-commerce domain knowledge, and long-context reasoning' is load-bearing for the performance claims, yet no ablation studies, stage-wise metrics, intermediate checkpoint results, or single-stage baseline comparisons are provided to isolate each stage's contribution or rule out trade-offs.
Authors: The pipeline design is motivated by incremental capability building, but we acknowledge that ablations are essential to isolate effects. In the revision, we will add a new ablation subsection reporting stage-wise metrics on representative tasks (e.g., audio QA, cross-modal VQA, domain-specific reasoning). We will include performance of intermediate checkpoints after each stage and comparisons to single-stage or partial-pipeline baselines. These results will demonstrate the progressive gains and confirm that later stages do not introduce trade-offs in earlier-acquired capabilities. revision: yes
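The proposed ablation reduces to a stage-by-probe score grid in which a drop down any column signals forgetting of an earlier-acquired capability. A minimal sketch follows; the stage names, probe names, and `evaluate` callable are illustrative assumptions.

```python
# Sketch of a stage-wise ablation grid: evaluate the checkpoint saved after
# each stage on every capability probe. Names and `evaluate` are placeholders.

STAGES = ["audio", "cross_modal", "ecommerce", "long_context"]
PROBES = ["audio_qa", "cross_modal_vqa", "domain_reasoning", "general_vqa"]

def ablation_grid(checkpoints, evaluate):
    """checkpoints: stage name -> model; evaluate(model, probe) -> score."""
    return {
        stage: {probe: evaluate(checkpoints[stage], probe) for probe in PROBES}
        for stage in STAGES
    }

def regressions(grid, tolerance=0.01):
    """Flag any probe whose score falls between consecutive stages."""
    flags = []
    for prev, curr in zip(STAGES, STAGES[1:]):
        for probe in PROBES:
            if grid[curr][probe] < grid[prev][probe] - tolerance:
                flags.append((prev, curr, probe))
    return flags
```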
-
Referee: [Post-training, Experiments] The controllable reasoning modes (non-thinking plus three thinking levels) and agentic search are presented as improvements, but no quantitative comparisons (e.g., accuracy vs. latency trade-offs or ablations over the reasoning modes) are reported to demonstrate that these additions deliver the claimed balance without degrading other capabilities.
Authors: We will revise the post-training and Experiments sections to include the requested quantitative evidence. Specifically, we will report accuracy versus latency trade-offs for the non-thinking mode and each of the three thinking levels across task types, plus ablations measuring the isolated impact of agentic search on deep research tasks. These additions will show that the controllable modes enable flexible efficiency-depth balancing while preserving or improving base performance, with no degradation on standard benchmarks. revision: yes
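The promised trade-off analysis amounts to sweeping each reasoning mode over a fixed task set and recording accuracy and latency per mode, one operating point each. The sketch below assumes a hypothetical `run_model` interface and exact-match scoring.

```python
# Sketch of an accuracy-versus-latency sweep over reasoning modes.
# `run_model` and the mode labels are placeholders, not the paper's API.

import time

MODES = ["no_think", "think_low", "think_medium", "think_high"]

def sweep_modes(tasks, run_model):
    """tasks: list of (input, gold); run_model(input, mode) -> prediction."""
    points = []
    for mode in MODES:
        correct, elapsed = 0, 0.0
        for x, gold in tasks:
            t0 = time.perf_counter()
            pred = run_model(x, mode)
            elapsed += time.perf_counter() - t0
            correct += int(pred == gold)   # assumed exact-match scoring
        points.append({
            "mode": mode,
            "accuracy": correct / len(tasks),
            "mean_latency_s": elapsed / len(tasks),
        })
    return points  # plot as a Pareto curve: latency on x, accuracy on y
```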
Circularity Check
No circularity: empirical training pipeline and benchmark results
Full rationale
The paper describes a four-stage continued pre-training pipeline and post-training steps for Valley3, then reports outperformance on in-house and open-source e-commerce benchmarks plus competitiveness on general benchmarks. No mathematical derivations, equations, or first-principles results are presented. No parameters are fitted to a subset and then called predictions. No self-citations are invoked as load-bearing uniqueness theorems or to smuggle in ansatzes. The central claims are validated externally via benchmark comparisons rather than reducing to self-referential definitions or inputs by construction, so the argument is free of circularity.