arxiv: 2605.09343 · v1 · submitted 2026-05-10 · 💻 cs.AI

SKG-VLA: Scene Knowledge Graph Priors for Structured Scene Semantics and Multimodal Reasoning for Decision Making

Zeyu Li , Lei Li This is my paper

Pith reviewed 2026-05-12 04:01 UTC · model grok-4.3

classification 💻 cs.AI

keywords scene knowledge graphmultimodal decision makingcomplaint handlingstructured scene semanticspolicy-grounded reasoningdata synthesismultimodal alignment

0 comments

The pith

A Scene Knowledge Graph unifies complaint evidence and policies to improve multimodal decision accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SKG-VLA to handle decision making in complaint systems by representing each case as a structured scene captured in a Scene Knowledge Graph. This graph links entities, evidence items such as narratives and screenshots, policy clauses, events, states, and relations into one structure. A synthesis pipeline then produces scene descriptions, graph generalizations, question-answer pairs, and recommendations, while three-stage training embeds these priors into a multimodal model. Experiments demonstrate gains in policy-grounded reasoning, accuracy, long-tail generalization, and robustness to missing evidence. A reader would care because existing systems rely on shallow classification over isolated inputs and miss the cross-dependencies that affect fair outcomes.

Core claim

Each complaint case is modeled as a structured scene whose decision-relevant semantics are captured in a Scene Knowledge Graph that organizes complaint entities, evidence items, policy clauses, temporal events, transactional states, and action-relevant relations into a unified graph. Based on the SKG, a data synthesis pipeline generates complaint scene descriptions, rule-consistent graph generalizations, question-answer supervision, and decision recommendations. A three-stage training strategy of domain-adaptive pre-training, task-oriented instruction fine-tuning, and end-to-end multimodal alignment then injects these structured scene priors into a multimodal decision model.

What carries the argument

The Scene Knowledge Graph (SKG), a unified graph that organizes complaint entities, evidence items, policy clauses, temporal events, transactional states, and action-relevant relations to capture decision-relevant semantics.

If this is right

Policy-grounded reasoning strengthens because the model can follow explicit relations among evidence and clauses.
Complaint decision accuracy rises through integrated use of multimodal evidence and rules.
Long-tail generalization improves via rule-consistent graph generalizations produced by the synthesis pipeline.
Robustness under incomplete evidence increases as the graph encodes dependencies even when some inputs are absent.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same graph-structuring approach could extend to other domains that combine heterogeneous records with explicit rules, such as insurance claims or regulatory compliance checks.
The synthesis pipeline may reduce the need for large manually labeled datasets by generating supervision directly from the SKG.
Decision paths through the graph could provide traceable explanations for why a particular recommendation was reached.

Load-bearing premise

A Scene Knowledge Graph can organize heterogeneous evidence, policy clauses, and relations into a unified structure that captures decision-relevant semantics without significant loss or bias.

What would settle it

Re-running the experiments on the constructed complaint scene dataset using a multimodal baseline model without SKG priors and observing no consistent improvements in policy-grounded reasoning or decision accuracy.

Figures

Figures reproduced from arXiv: 2605.09343 by Lei Li, Zeyu Li.

**Figure 2.** Figure 2: Overview of the SKG-VLA framework for multimodal complaint decision making. Starting from multimodal complaint [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: Illustration of the significance of structured scene [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: Overview of the proposed complaint scene dataset. [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

**Figure 5.** Figure 5: Comparison between the original multimodal com [PITH_FULL_IMAGE:figures/full_fig_p004_5.png] view at source ↗

read the original abstract

Decision making in large-scale complaint handling systems increasingly relies on heterogeneous evidence, including complaint narratives, screenshots, order metadata, historical interactions, and platform policies. Existing complaint understanding systems mainly perform shallow classification or template matching over isolated modalities, while underutilizing explicit scene structure, rule knowledge, and cross-evidence dependencies. To address this limitation, we present SKG-VLA for multimodal complaint decision making. The core idea is to model each case as a structured complaint scene and represent its decision-relevant semantics with a \emph{Scene Knowledge Graph} (SKG), which organizes complaint entities, evidence items, policy clauses, temporal events, transactional states, and action-relevant relations into a unified graph. Based on SKG, we build a data synthesis pipeline that generates complaint scene descriptions, rule-consistent graph generalizations, question-answer supervision, and decision recommendations. We further construct a large-scale complaint scene dataset with both text-only and multimodal in-domain benchmarks. Finally, we adopt a three-stage training strategy -- domain-adaptive pre-training, task-oriented instruction fine-tuning, and end-to-end multimodal alignment -- to inject structured scene priors into a multimodal decision model. Experiments show that SKG-VLA consistently improves policy-grounded reasoning, complaint decision accuracy, long-tail generalization, and robustness under incomplete evidence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SKG-VLA adds a Scene Knowledge Graph prior for structuring complaint cases in multimodal decision models, but the synthesis pipeline lacks validation details and the experiments provide no numbers or baselines to back the gains.

read the letter

This paper introduces SKG-VLA to handle multimodal complaint decision making by representing each case as a structured scene in a Scene Knowledge Graph. The graph pulls together entities, evidence items, policy clauses, events, states, and relations, then feeds into a synthesis pipeline that creates scene descriptions, rule-consistent graphs, QA supervision, and recommendations. A new complaint dataset is built with text and multimodal benchmarks, and a three-stage training process (domain-adaptive pre-training, instruction fine-tuning, and multimodal alignment) injects the graph priors into the model. Experiments are said to show better policy-grounded reasoning, accuracy, long-tail generalization, and robustness to incomplete evidence. The framing of complaints as scenes with explicit rule knowledge is new for this applied area and moves past simple classification or template matching. The pipeline and training stages are laid out clearly at a conceptual level, which gives a practical path for combining knowledge graphs with vision-language models on real decision tasks. The main weaknesses are in the supporting evidence. No quantitative results, baselines, or error breakdowns appear, so the size and reliability of the improvements cannot be judged. The data synthesis pipeline is described only at a high level, with no accuracy metrics on entity or relation extraction, no handling details for ambiguous or conflicting inputs, and no human validation of the resulting graphs. That leaves the stress-test concern intact: if the graphs contain systematic errors or biases, especially on long-tail or incomplete cases, the three-stage training would optimize against distorted structure rather than actual scene semantics. This work targets applied researchers and engineers working on multimodal decision support in customer service or similar domains. Readers who want concrete examples of knowledge graph integration for structured reasoning will find the setup useful. It deserves peer review because the core approach is coherent and addresses a genuine practical gap, even though the current version needs stronger validation of the pipeline and full experimental reporting to stand up.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes SKG-VLA, a multimodal decision-making framework for large-scale complaint handling that models each case as a structured complaint scene represented by a Scene Knowledge Graph (SKG). The SKG organizes heterogeneous evidence (narratives, screenshots, metadata, policies, events) into entities, relations, and policy clauses. A data synthesis pipeline generates scene descriptions, rule-consistent graph generalizations, QA supervision, and decision recommendations; a large-scale complaint scene dataset (text-only and multimodal) is constructed; and a three-stage training strategy (domain-adaptive pre-training, task-oriented instruction fine-tuning, end-to-end multimodal alignment) injects the structured priors into a vision-language model. Experiments are claimed to show consistent gains in policy-grounded reasoning, decision accuracy, long-tail generalization, and robustness to incomplete evidence.

Significance. If the empirical claims hold after proper validation, the work would offer a concrete advance in integrating explicit knowledge-graph priors with multimodal models for policy-constrained decision tasks. The construction of a dedicated complaint-scene dataset and the three-stage training regimen that progressively aligns graph structure with vision-language reasoning are positive contributions that could be reusable in other domains requiring rule consistency over heterogeneous evidence. The emphasis on long-tail and incomplete-evidence robustness addresses a practically important gap.

major comments (2)

[Abstract / Data Synthesis Pipeline] Abstract / Data Synthesis Pipeline: The pipeline is described only conceptually (generation of 'rule-consistent graph generalizations' and QA supervision) with no reported metrics on entity/relation extraction accuracy, handling of ambiguous or conflicting evidence, or human validation of the resulting SKGs. Because the central claims of improved policy-grounded reasoning and robustness under incomplete evidence rest on the SKG faithfully encoding decision-relevant semantics without significant loss or bias, the absence of these details is load-bearing.
[Experiments] Experiments section: The abstract asserts that SKG-VLA 'consistently improves' policy-grounded reasoning, decision accuracy, long-tail generalization, and robustness, yet supplies no quantitative results, baselines, ablation studies, error analysis, or implementation details. Without these, the magnitude and reliability of the reported gains cannot be assessed.

minor comments (1)

[Abstract] Abstract: The phrase 'rule-consistent graph generalizations' is introduced without a preceding definition or example, which reduces immediate clarity for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed report. The two major comments identify important gaps in the presentation of the data synthesis pipeline and the experimental results. We address each point below and will revise the manuscript to incorporate the requested details.

read point-by-point responses

Referee: [Abstract / Data Synthesis Pipeline] Abstract / Data Synthesis Pipeline: The pipeline is described only conceptually (generation of 'rule-consistent graph generalizations' and QA supervision) with no reported metrics on entity/relation extraction accuracy, handling of ambiguous or conflicting evidence, or human validation of the resulting SKGs. Because the central claims of improved policy-grounded reasoning and robustness under incomplete evidence rest on the SKG faithfully encoding decision-relevant semantics without significant loss or bias, the absence of these details is load-bearing.

Authors: We agree that the current description of the data synthesis pipeline is primarily conceptual and that quantitative validation is necessary to substantiate the central claims. In the revised manuscript we will add a dedicated subsection that reports entity and relation extraction accuracies (measured against held-out human annotations), describes the rule-based and LLM-assisted procedures used to resolve ambiguous or conflicting evidence, and presents the protocol and results of a human validation study on the generated SKGs. These additions will directly address the load-bearing requirement for demonstrating faithful encoding of decision-relevant semantics. revision: yes
Referee: [Experiments] Experiments section: The abstract asserts that SKG-VLA 'consistently improves' policy-grounded reasoning, decision accuracy, long-tail generalization, and robustness, yet supplies no quantitative results, baselines, ablation studies, error analysis, or implementation details. Without these, the magnitude and reliability of the reported gains cannot be assessed.

Authors: We acknowledge that the Experiments section in the submitted manuscript does not contain the full quantitative results, baseline comparisons, ablation studies, error analysis, or implementation details needed to evaluate the claimed improvements. This omission limits assessability. In the revision we will expand the section to include all requested elements: numerical performance tables, baseline comparisons, ablations isolating the contribution of the Scene Knowledge Graph priors, error analysis focused on long-tail and incomplete-evidence cases, and complete implementation specifications. The underlying experiments were performed as summarized in the abstract; the details were omitted due to length constraints and will now be fully documented. revision: yes

Circularity Check

0 steps flagged

No circularity: forward pipeline from SKG construction to empirical evaluation.

full rationale

The paper presents a descriptive system pipeline—defining the Scene Knowledge Graph to organize evidence and policies, synthesizing data and QA pairs from it, constructing benchmarks, and applying three-stage training—followed by experimental results on accuracy and generalization. No equations, uniqueness theorems, or first-principles derivations appear that would reduce any claimed prediction or result to its own inputs by construction. The central claims rest on empirical outcomes rather than self-definitional fits or self-citation chains, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Ledger populated from abstract only; full paper may contain additional details on parameters or assumptions.

axioms (1)

domain assumption Heterogeneous complaint evidence and policy clauses can be unified into a Scene Knowledge Graph that preserves decision-relevant semantics and relations.
This modeling choice is presented as the core of the approach in the abstract.

invented entities (1)

Scene Knowledge Graph (SKG) no independent evidence
purpose: To organize complaint entities, evidence items, policy clauses, temporal events, transactional states, and action-relevant relations into a unified graph for structured reasoning.
New representation introduced as the central prior for the complaint scene.

pith-pipeline@v0.9.0 · 5529 in / 1418 out tokens · 84663 ms · 2026-05-12T04:01:24.959343+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We represent this complaint scene with a Scene Knowledge Graph G=(V,E,A), which organizes complaint entities, evidence items, policy clauses, temporal events, transactional states, and action-relevant relations into a unified graph.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 10 internal anchors

[1]

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al

work page
[2]

Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems35 (2022), 23716–23736

work page 2022
[3]

Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2023. Self-rag: Learning to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations

work page 2023
[4]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. 2025. Qwen2.5-VL Technical Rep...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusinol, Ernest Valveny, CV Jawahar, and Dimosthenis Karatzas. 2019. Scene text visual question answering. InProceedings of the IEEE/CVF international conference on computer vision. 4291–4301

work page 2019
[6]

Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gasic. 2018. Multiwoz-a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 conference on empirical methods in natural language processing. 5016–5026

work page 2018
[7]

Consumer Financial Protection Bureau. 2025. Consumer complaint database

work page 2025
[8]

Meng Chen, Ruixue Liu, Lei Shen, Shaozu Yuan, Jingyan Zhou, Youzheng Wu, Xiaodong He, and Bowen Zhou. 2020. The jddc corpus: A large-scale multi-turn chinese dialogue dataset for e-commerce customer service. InProceedings of the Twelfth Language Resources and Evaluation Conference. 459–466

work page 2020
[9]

Tri Dao. 2023. Flashattention-2: Faster attention with better parallelism and work partitioning.arXiv preprint arXiv:2307.08691

work page internal anchor Pith review Pith/arXiv arXiv 2023
[10]

Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al

work page
[11]

InProceedings of the Computer Vision and Pattern Recognition Conference

Molmo and pixmo: Open weights and open data for state-of-the-art vision- language models. InProceedings of the Computer Vision and Pattern Recognition Conference. 91–104

work page
[12]

Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. 2024. From local to global: A graph rag approach to query-focused summarization.arXiv preprint arXiv:2404.16130(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. 2023. Mme: A comprehensive evaluation benchmark for multimodal large language models.arXiv preprint arXiv:2306.13394(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[14]

Peiheng Gao, Ning Sun, Xuefeng Wang, Chen Yang, and Ričardas Zitikis. 2023. Nlp-based detection of systematic anomalies among the narratives of consumer complaints.arXiv preprint arXiv:2308.11138(2023)

work page arXiv 2023
[15]

Bernal J Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. 2024. Hipporag: Neurobiologically inspired long-term memory for large language models.Advances in neural information processing systems37 (2024), 59532– 59569

work page 2024
[16]

Xiaoxin He, Yijun Tian, Yifei Sun, Nitesh V Chawla, Thomas Laurent, Yann LeCun, Xavier Bresson, and Bryan Hooi. 2024. G-retriever: Retrieval-augmented generation for textual graph understanding and question answering.Advances in Neural Information Processing Systems37, 132876–132907

work page 2024
[17]

Anwen Hu, Haiyang Xu, Liang Zhang, Jiabo Ye, Ming Yan, Ji Zhang, Qin Jin, Fei Huang, and Jingren Zhou. 2025. mplug-docowl2: High-resolution compressing for ocr-free multi-page document understanding. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 5817–5834

work page 2025
[18]

Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei. 2022. Layoutlmv3: Pre-training for document ai with unified text and image masking. InProceedings of the 30th ACM international conference on multimedia. 4083–4091

work page 2022
[19]

Jinhao Jiang, Kun Zhou, Zican Dong, Keming Ye, Wayne Xin Zhao, and Ji-Rong Wen. 2023. Structgpt: A general framework for large language model to reason over structured data. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 9237–9251

work page 2023
[20]

Geewook Kim, Teakgyu Hong, Moonbin Yim, JeongYeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. 2022. Ocr-free document understanding transformer. InEuropean Confer- ence on Computer Vision. Springer, 498–517

work page 2022
[21]

Satwik Kottur, Seungwhan Moon, Alborz Geramifard, and Babak Damavandi

work page
[22]

InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

SIMMC 2.0: A task-oriented dialog dataset for immersive multimodal conversations. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 4903–4912

work page 2021
[23]

Kenton Lee, Mandar Joshi, Iulia Raluca Turc, Hexiang Hu, Fangyu Liu, Julian Mar- tin Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. 2023. Pix2struct: Screenshot parsing as pretraining for visual lan- guage understanding. InInternational Conference on Machine Learning. PMLR, 18893–18912

work page 2023
[24]

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rock- täschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems33 (2020), 9459–9474

work page 2020
[25]

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning. PMLR, 19730–19742

work page 2023
[26]

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2024. Improved baselines with visual instruction tuning. (2024), 26296–26306

work page 2024
[27]

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual in- struction tuning.Advances in neural information processing systems36 (2023), 34892–34916

work page 2023
[28]

Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Qi Ju, Haotang Deng, and Ping Wang. 2020. K-bert: Enabling language representation with knowledge graph. In Proceedings of the AAAI conference on artificial intelligence, Vol. 34. 2901–2908

work page 2020
[29]

Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101

work page internal anchor Pith review Pith/arXiv arXiv 2017
[30]

Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. 2021. Docvqa: A dataset for vqa on document images. InProceedings of the IEEE/CVF winter conference on applications of computer vision. 2200–2209

work page 2021
[31]

Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty

work page
[32]

In2019 international conference on document analysis and recognition (ICDAR)

Ocr-vqa: Visual question answering by reading text in images. In2019 international conference on document analysis and recognition (ICDAR). IEEE, 947–952

work page
[33]

Konstantinos I Roumeliotis, Nikolaos D Tselikas, and Dimitrios K Nasiopoulos

work page
[34]

Think before you classify: The rise of reasoning large language models for consumer complaint detection and classification.Electronics14, 6 (2025), 1070

work page 2025
[35]

Amrita Saha, Mitesh Khapra, and Karthik Sankaranarayanan. 2018. Towards building large scale multimodal domain-aware conversation systems. 32, 1 (2018)

work page 2018
[36]

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools.Advances in neural information processing systems36, 68539–68551

work page 2023
[37]

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-lm: Training multi-billion parameter language models using model parallelism.arXiv preprint arXiv:1909.08053(2019)

work page internal anchor Pith review Pith/arXiv arXiv 2019
[38]

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. 2019. Towards vqa models that can read. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 8317–8326

work page 2019
[39]

Xiaobo Tang, Hao Mou, Jiangnan Liu, and Xin Du. 2021. Research on automatic labeling of imbalanced texts of customer complaints based on text enhancement and layer-by-layer semantic matching.Scientific Reports11, 1 (2021), 11849

work page 2021
[40]

Qwen Team. 2023. Qwen-VL: A Versatile Vision-Language Model for Under- standing, Localization, Text Reading, and Beyond.arXiv preprint arXiv:2308.12966 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[41]

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. 2024. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

Xiaozhi Wang, Tianyu Gao, Zhaocheng Zhu, Zhengyan Zhang, Zhiyuan Liu, Juanzi Li, and Jian Tang. 2021. KEPLER: A unified model for knowledge embed- ding and pre-trained language representation.Transactions of the Association for Computational Linguistics9 (2021), 176–194

work page 2021
[43]

Haoran Wei, Chenglong Liu, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, et al. 2024. General ocr theory: Towards ocr-2.0 via a unified end-to-end model.arXiv preprint arXiv:2409.01704 (2024)

work page internal anchor Pith review arXiv 2024
[44]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reason- ing in large language models.Advances in neural information processing systems 35 (2022), 24824–24837

work page 2022
[45]

Yilin Wen, Zifeng Wang, and Jimeng Sun. 2024. Mindmap: Knowledge graph prompting sparks graph of thoughts in large language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 10370–10388

work page 2024
[46]

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, 9 Li and Li Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao ...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[47]

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models. InThe eleventh international conference on learning representations

work page 2022
[48]

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. 2024. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 9556–9567

work page 2024
[49]

Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, et al. 2025. Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 15134–15186

work page 2025
[50]

Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019. ERNIE: Enhanced language representation with informative entities. InProceedings of the 57th annual meeting of the association for computational linguistics. 1441–1451

work page 2019
[51]

Nan Zhao, Haoran Li, Youzheng Wu, Xiaodong He, and Bowen Zhou. 2021. The JDDC 2.0 corpus: A large-scale multimodal multi-turn chinese dialogue dataset for e-commerce customer service.arXiv preprint arXiv:2109.12913(2021). 10

work page arXiv 2021