Adaptive Multimodal Agents-Based Framework for Automatic Workflow Execution

Mario Luca Bernardi; Marta Cimitile; Susanna Cifani

arxiv: 2605.28607 · v1 · pith:F4TYHOCBnew · submitted 2026-05-27 · 💻 cs.AI · cs.CL

Adaptive Multimodal Agents-Based Framework for Automatic Workflow Execution

Susanna Cifani , Mario Luca Bernardi , Marta Cimitile This is my paper

Pith reviewed 2026-06-29 11:44 UTC · model grok-4.3

classification 💻 cs.AI cs.CL

keywords multimodal agentsworkflow executiontopological knowledge baseadaptive RAGmulti-agent frameworktask decompositiongraph-based navigation

0 comments

The pith

A multi-agent framework builds a topological knowledge base from fragmented logs to support adaptive workflow navigation via RAG and verification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a two-phase multimodal multi-agent system for executing complex workflows that current linear episode approaches cannot handle well in changing conditions. An offline discovery phase adaptively assembles a graph-like topological knowledge base directly from available execution logs. At runtime, agents apply adaptive retrieval-augmented generation over this fixed graph and run a closed-loop collaborative verification protocol to correct errors and adjust paths. The result is claimed to deliver stronger task decomposition and sustained reliability even when training data is scarce.

Core claim

The authors claim that constructing a topological knowledge base from fragmented execution logs in an offline phase, then performing inference with Adaptive RAG over the resulting graph together with a closed-loop collaborative verification protocol, produces automatic workflow execution that captures transition topology and therefore works reliably in novel or non-stationary scenarios.

What carries the argument

The two-phase pipeline: offline adaptive construction of a topological knowledge base from logs, followed by inference-time Adaptive RAG on the graph plus closed-loop collaborative verification.

If this is right

Agents can decompose tasks more effectively by consulting the graph rather than treating each sequence as an isolated episode.
The system maintains semantic awareness and reliability without requiring large amounts of additional training data.
Navigation remains possible in non-stationary environments because the graph encodes transition topology.
The closed-loop verification step enables dynamic self-correction during execution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same log-to-graph construction could be applied to other sequential decision domains that produce execution traces.
If the graph can be incrementally updated from new logs, the framework might support continuous adaptation without full offline rebuilds.
Combining the topological base with direct GUI perception methods could further reduce reliance on structured metadata.

Load-bearing premise

Fragmented execution logs contain enough structure for an adaptive process to build a topological knowledge base that captures the transition topology needed for new scenarios.

What would settle it

Run the framework on a workflow whose execution logs are too sparse or unstructured to yield a usable topological graph; measure whether navigation accuracy and self-correction drop sharply compared with linear baselines.

Figures

Figures reproduced from arXiv: 2605.28607 by Mario Luca Bernardi, Marta Cimitile, Susanna Cifani.

read the original abstract

Modern information systems require autonomous agents capable of navigating complex workflows, yet current methodologies often struggle with the transition from structured metadata parsing to general environmental perception. While the integration of MLLMs has enabled agents to interact directly with GUIs, existing approaches typically treat task sequences as discrete, linear episodes. This fragmentation prevents agents from capturing the underlying transition topology, limiting their effectiveness in novel or non-stationary scenarios. To address this, we propose a novel multimodal multi-agent framework that achieves automatic workflow execution through a distinct two-phase pipeline. First, during an offline discovery phase, the architecture adaptively constructs a topological knowledge base from fragmented execution logs. During inference, agents leverage Adaptive Retrieval-Augmented Generation (RAG) over this fixed, pre-established graph, coupled with a closed-loop collaborative verification protocol to dynamically self-correct and navigate. This graph-based approach facilitates superior task decomposition and adaptive navigation performance. We validate our framework in a real-world context, demonstrating its ability to maintain high reliability and semantic awareness even with limited training data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper sketches a two-phase graph-plus-RAG agent pipeline for workflows but supplies no numbers or method details to support its performance claims.

read the letter

The main thing to know is that the abstract asserts superior task decomposition and high reliability from real-world validation, yet gives no quantitative results, baselines, or even a sketch of how the topological graph gets built from logs.

What the paper does is propose a concrete pipeline: offline adaptive construction of a fixed topological knowledge base from fragmented execution logs, then inference via Adaptive RAG over that graph plus closed-loop multi-agent verification for self-correction. This directly targets the limitation of treating tasks as isolated linear episodes, which is a real issue in current multimodal GUI agents. The framing of non-stationary scenarios and limited training data is clear and reasonable.

The soft spot is the missing evidence. The stress-test concern lands: nothing shows that the offline phase recovers reusable transitions rather than local sequences, or that the resulting graph actually helps on novel workflows. The assumption that fragmented logs contain enough structure for generalization is stated but not tested or even described algorithmically. Without that, the claimed advantage does not follow from the text.

This is for researchers working on agent architectures for information-system workflows. A reader hunting for high-level design patterns around graphs and RAG might pick up an idea or two, but the absence of data means it offers little to cite or replicate.

I would send it to peer review if the full paper contains the omitted experiments, algorithm details, and comparisons; the core idea is coherent enough to be worth referee time even if heavy revision is needed.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes a multimodal multi-agent framework for automatic workflow execution via a two-phase pipeline: an offline discovery phase that adaptively constructs a topological knowledge base from fragmented execution logs, followed by an inference phase that uses Adaptive RAG over the resulting fixed graph together with closed-loop collaborative verification for dynamic self-correction. The authors assert that this graph-based approach yields superior task decomposition and adaptive navigation, and that real-world validation demonstrates high reliability and semantic awareness even with limited training data.

Significance. If the empirical claims and generalization properties were substantiated with quantitative evidence, the work could offer a practical advance over linear-episode agent designs for non-stationary GUI workflows. At present the significance cannot be assessed because the central performance assertions rest on unshown validation results.

major comments (2)

[Abstract] Abstract: the assertions of 'superior task decomposition and adaptive navigation performance' and 'high reliability' after 'real-world validation' are presented without any quantitative metrics, baselines, success rates, error bars, or ablation studies, rendering the central claims unevaluable.
[Abstract] Abstract (offline discovery phase): the claim that fragmented execution logs suffice to 'adaptively construct a topological knowledge base' that captures reusable transition topology for novel or non-stationary scenarios is stated without an algorithm, formal definition of the topology, measure of log fragmentation, or any experiment showing improvement over linear baselines on out-of-distribution workflows.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and valuable feedback. We agree that the abstract must be revised to include quantitative support for its claims so that the central assertions are immediately evaluable. The full manuscript already contains the supporting experimental details, algorithms, and results; we will ensure these are properly highlighted in the abstract and any necessary clarifications are added. We respond to each major comment below.

read point-by-point responses

Referee: [Abstract] Abstract: the assertions of 'superior task decomposition and adaptive navigation performance' and 'high reliability' after 'real-world validation' are presented without any quantitative metrics, baselines, success rates, error bars, or ablation studies, rendering the central claims unevaluable.

Authors: We acknowledge that the abstract as currently written does not contain the quantitative metrics, baselines, or ablation results needed to evaluate the claims. The manuscript reports these results in the experimental sections (including success rates on real-world workflows, comparisons against linear-episode baselines, and component ablations). In the revised version we will update the abstract to report the key quantitative findings (e.g., overall success rate, improvement margins, and statistical details) so the claims become directly evaluable from the abstract. revision: yes
Referee: [Abstract] Abstract (offline discovery phase): the claim that fragmented execution logs suffice to 'adaptively construct a topological knowledge base' that captures reusable transition topology for novel or non-stationary scenarios is stated without an algorithm, formal definition of the topology, measure of log fragmentation, or any experiment showing improvement over linear baselines on out-of-distribution workflows.

Authors: Section 3.1 of the manuscript presents the algorithm for adaptive topological knowledge-base construction, the formal graph definition (nodes as workflow states, edges as verified transitions), and the fragmentation metric (number of disconnected execution traces per workflow). Section 5 reports experiments demonstrating improved performance on out-of-distribution and non-stationary workflows relative to linear baselines. We will add a concise reference to the algorithm, formal definition, and experimental evidence in the revised abstract. revision: yes

Circularity Check

0 steps flagged

No circularity: framework proposal lacks derivations or self-referential reductions

full rationale

The paper describes a two-phase multimodal agent framework (offline discovery of a topological knowledge base from logs, followed by Adaptive RAG + closed-loop verification) but contains no equations, fitted parameters, predictions, or first-principles derivations. The abstract and described pipeline present design choices and empirical validation claims without any step that defines a quantity in terms of itself or renames a fitted input as a prediction. No self-citation chains or uniqueness theorems are invoked as load-bearing elements. The derivation chain is therefore self-contained as an architectural proposal rather than a mathematical reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are specified in the abstract; the framework itself is described at the level of components rather than formal postulates.

pith-pipeline@v0.9.1-grok · 5705 in / 1139 out tokens · 33177 ms · 2026-06-29T11:44:28.003134+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

28 extracted references · 11 canonical work pages · 7 internal anchors

[1]

Conversing with business process-aware large language models: the BPLLM framework.J

Mario Luca Bernardi, Angelo Casciani, Marta Cimitile, and Andrea Marrella. Conversing with business process-aware large language models: the BPLLM framework.J. Intell. Inf. Syst., 62(6):1607–1629, 2024

2024
[2]

M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. M3-Embedding: Multi-linguality, multi-functionality, multi- granularity text embeddings through self-knowledge distillation.arXiv preprint arXiv:2402.03216, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

Pg-agent: An agent powered by page graph, 2025

Weizhi Chen, Ziwei Wang, Leyang Yang, Sheng Zhou, Xiaoxuan Tang, Jiajun Bu, Yong Li, and Wei Jiang. Pg-agent: An agent powered by page graph, 2025

2025
[4]

Seeclick: Harnessing gui grounding for advanced visual gui agents

Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. Seeclick: Harnessing gui grounding for advanced visual gui agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), pages 9313–9332, 2024

2024
[5]

Mind2Web: Towards a generalist agent for the web

Xiang Deng, Yu Gu, Boyu Zheng, Shijie Chen, Samuel Stevens, Xuehai Wang, Huan Sun, and Yu Su. Mind2Web: Towards a generalist agent for the web. InAdvances in Neural Information Processing Systems, 2023

2023
[6]

A Survey on In-context Learning

Qingxiu Dong et al. A survey on in-context learning.arXiv preprint arXiv:2301.00234, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

From Local to Global: A Graph RAG Approach to Query-Focused Summarization

Darren Edge et al. From local to global: A graph RAG approach to query-focused summarization. InarXiv preprint arXiv:2404.16130, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

Retrieval-Augmented Generation for Large Language Models: A Survey

Yunfan Gao et al. Retrieval-augmented generation for large language models: A survey.arXiv preprint arXiv:2312.10997, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

Guan et al

X. Guan et al. Topological perception in LLM-based agents: Beyond linear traces.Journal of Artificial Intelligence Research, 2024

2024
[10]

Cogagent: A visual language model for gui agents

Wenyi Hong et al. Cogagent: A visual language model for gui agents. arXiv preprint arXiv:2312.08939, 2023

work page arXiv 2023
[11]

Leveraging large language models for bim-based automated compliance checking.Automation in Construction, 182:106707, 2026

Odin Iversen and Lizhen Huang. Leveraging large language models for bim-based automated compliance checking.Automation in Construction, 182:106707, 2026

2026
[12]

A survey on knowledge graphs: Representation, acquisition, and applications.IEEE Transactions on Neural Networks and Learning Systems, 33(2):494–514, 2021

Shaoxiong Ji et al. A survey on knowledge graphs: Representation, acquisition, and applications.IEEE Transactions on Neural Networks and Learning Systems, 33(2):494–514, 2021

2021
[13]

Visualwebarena: A multimodal benchmark for generalist visual agents on the web

Jing Yu Koh et al. Visualwebarena: A multimodal benchmark for generalist visual agents on the web. InProceedings of the ACL, 2024

2024
[14]

Retrieval-augmented generation for knowledge-intensive nlp tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. InAdvances in Neural Information Processing Systems (NeurIPS), volume 33, pages 9459–9474, 2020

2020
[15]

Llm-powered gui agents in phone automation: Surveying progress and prospects.arXiv, 2025

Guangyi Liu, Pengxiang Zhao, Liang Liu, Yaxuan Guo, Han Xiao, Weifeng Lin, Yuxiang Chai, Yue Han, Shuai Ren, Hao Wang, Xiaoyu Liang, Wenhao Wang, Tianze Wu, Linghao Li, Hao Wang, Guanjing Xiong, Yong Liu, and Hongsheng Li. Llm-powered gui agents in phone automation: Surveying progress and prospects.arXiv, 2025

2025
[16]

Liu et al

R. Liu et al. Webllama: Bridging everyday language and web navigation with large language models.arXiv preprint arXiv:2402.05116, 2024

work page arXiv 2024
[17]

Guiodyssey: A comprehensive dataset for cross-app gui navigation on mobile devices, 2025

Quanfeng Lu, Wenqi Shao, Zitao Liu, Lingxiao Du, Fanqing Meng, Boxuan Li, Botong Chen, Siyuan Huang, Kaipeng Zhang, and Ping Luo. Guiodyssey: A comprehensive dataset for cross-app gui navigation on mobile devices, 2025

2025
[18]

Mishra et al

A. Mishra et al. Multimodal large language models for gui agents: A survey.arXiv preprint arXiv:2402.00001, 2024

work page arXiv 2024
[19]

Unifying large language models and knowledge graphs: A roadmap.IEEE Transactions on Knowledge and Data Engineering, 2024

Shirui Pan et al. Unifying large language models and knowledge graphs: A roadmap.IEEE Transactions on Knowledge and Data Engineering, 2024

2024
[20]

Androidinthewild: A large-scale dataset for android device control.Advances in Neural Information Processing Systems, 36:59708–59728, 2023

Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. Androidinthewild: A large-scale dataset for android device control.Advances in Neural Information Processing Systems, 36:59708–59728, 2023

2023
[21]

Sokhansanj, James R

Mohammad Saleh Refahi, Gavin Hearne, Harrison Muller, Kieran Lynch, Bahrad A. Sokhansanj, James R. Brown, and Gail Rosen. Fast and scalable gene embedding search: A comparative study of FAISS and ScaNN.arXiv preprint arXiv:2507.16978, 2025

work page arXiv 2025
[22]

LLaVA-Video: Video Instruction Tuning With Synthetic Data

Peng Wang, An Yang, Jiamang Qui, et al. Qwen2-vl: To see real-world understanding as humans do.arXiv preprint arXiv:2410.02713, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.Advances in Neural Information Processing Systems, 37:52040–52094, 2024

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.Advances in Neural Information Processing Systems, 37:52040–52094, 2024

2024
[24]

Mind2web: Towards a generalist agent for the web

Xiang Yang, Jiang Chen, et al. Mind2web: Towards a generalist agent for the web. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

2023
[25]

Webshop: Towards scalable real-world web interaction with grounded language agents

Shunyu Yao et al. Webshop: Towards scalable real-world web interaction with grounded language agents. InAdvances in Neural Information Processing Systems (NeurIPS), 2022

2022
[26]

QA-GNN: Reasoning with language models and knowledge graphs for question answering.Proceedings of the NAACL-HLT, 2022

Michihiro Yasunaga et al. QA-GNN: Reasoning with language models and knowledge graphs for question answering.Proceedings of the NAACL-HLT, 2022

2022
[27]

A Survey on Multimodal Large Language Models

Shukang Yin, Chaoyuan Fu, Sirui Zhao, Ke Xu, Kai Wang, Dianbo Sui, Yunhua Shen, Ning Li, Xing Sun, and Shan Lin. A survey on multimodal large language models.arXiv preprint arXiv:2306.13549, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[28]

GPT-4V(ision) is a Generalist Web Agent, if Grounded

Boyu Zheng et al. Gpt-4v(ision) is a generalist web agent, if grounded. arXiv preprint arXiv:2401.01614, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[1] [1]

Conversing with business process-aware large language models: the BPLLM framework.J

Mario Luca Bernardi, Angelo Casciani, Marta Cimitile, and Andrea Marrella. Conversing with business process-aware large language models: the BPLLM framework.J. Intell. Inf. Syst., 62(6):1607–1629, 2024

2024

[2] [2]

M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. M3-Embedding: Multi-linguality, multi-functionality, multi- granularity text embeddings through self-knowledge distillation.arXiv preprint arXiv:2402.03216, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

Pg-agent: An agent powered by page graph, 2025

Weizhi Chen, Ziwei Wang, Leyang Yang, Sheng Zhou, Xiaoxuan Tang, Jiajun Bu, Yong Li, and Wei Jiang. Pg-agent: An agent powered by page graph, 2025

2025

[4] [4]

Seeclick: Harnessing gui grounding for advanced visual gui agents

Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. Seeclick: Harnessing gui grounding for advanced visual gui agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), pages 9313–9332, 2024

2024

[5] [5]

Mind2Web: Towards a generalist agent for the web

Xiang Deng, Yu Gu, Boyu Zheng, Shijie Chen, Samuel Stevens, Xuehai Wang, Huan Sun, and Yu Su. Mind2Web: Towards a generalist agent for the web. InAdvances in Neural Information Processing Systems, 2023

2023

[6] [6]

A Survey on In-context Learning

Qingxiu Dong et al. A survey on in-context learning.arXiv preprint arXiv:2301.00234, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[7] [7]

From Local to Global: A Graph RAG Approach to Query-Focused Summarization

Darren Edge et al. From local to global: A graph RAG approach to query-focused summarization. InarXiv preprint arXiv:2404.16130, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[8] [8]

Retrieval-Augmented Generation for Large Language Models: A Survey

Yunfan Gao et al. Retrieval-augmented generation for large language models: A survey.arXiv preprint arXiv:2312.10997, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[9] [9]

Guan et al

X. Guan et al. Topological perception in LLM-based agents: Beyond linear traces.Journal of Artificial Intelligence Research, 2024

2024

[10] [10]

Cogagent: A visual language model for gui agents

Wenyi Hong et al. Cogagent: A visual language model for gui agents. arXiv preprint arXiv:2312.08939, 2023

work page arXiv 2023

[11] [11]

Leveraging large language models for bim-based automated compliance checking.Automation in Construction, 182:106707, 2026

Odin Iversen and Lizhen Huang. Leveraging large language models for bim-based automated compliance checking.Automation in Construction, 182:106707, 2026

2026

[12] [12]

A survey on knowledge graphs: Representation, acquisition, and applications.IEEE Transactions on Neural Networks and Learning Systems, 33(2):494–514, 2021

Shaoxiong Ji et al. A survey on knowledge graphs: Representation, acquisition, and applications.IEEE Transactions on Neural Networks and Learning Systems, 33(2):494–514, 2021

2021

[13] [13]

Visualwebarena: A multimodal benchmark for generalist visual agents on the web

Jing Yu Koh et al. Visualwebarena: A multimodal benchmark for generalist visual agents on the web. InProceedings of the ACL, 2024

2024

[14] [14]

Retrieval-augmented generation for knowledge-intensive nlp tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. InAdvances in Neural Information Processing Systems (NeurIPS), volume 33, pages 9459–9474, 2020

2020

[15] [15]

Llm-powered gui agents in phone automation: Surveying progress and prospects.arXiv, 2025

Guangyi Liu, Pengxiang Zhao, Liang Liu, Yaxuan Guo, Han Xiao, Weifeng Lin, Yuxiang Chai, Yue Han, Shuai Ren, Hao Wang, Xiaoyu Liang, Wenhao Wang, Tianze Wu, Linghao Li, Hao Wang, Guanjing Xiong, Yong Liu, and Hongsheng Li. Llm-powered gui agents in phone automation: Surveying progress and prospects.arXiv, 2025

2025

[16] [16]

Liu et al

R. Liu et al. Webllama: Bridging everyday language and web navigation with large language models.arXiv preprint arXiv:2402.05116, 2024

work page arXiv 2024

[17] [17]

Guiodyssey: A comprehensive dataset for cross-app gui navigation on mobile devices, 2025

Quanfeng Lu, Wenqi Shao, Zitao Liu, Lingxiao Du, Fanqing Meng, Boxuan Li, Botong Chen, Siyuan Huang, Kaipeng Zhang, and Ping Luo. Guiodyssey: A comprehensive dataset for cross-app gui navigation on mobile devices, 2025

2025

[18] [18]

Mishra et al

A. Mishra et al. Multimodal large language models for gui agents: A survey.arXiv preprint arXiv:2402.00001, 2024

work page arXiv 2024

[19] [19]

Unifying large language models and knowledge graphs: A roadmap.IEEE Transactions on Knowledge and Data Engineering, 2024

Shirui Pan et al. Unifying large language models and knowledge graphs: A roadmap.IEEE Transactions on Knowledge and Data Engineering, 2024

2024

[20] [20]

Androidinthewild: A large-scale dataset for android device control.Advances in Neural Information Processing Systems, 36:59708–59728, 2023

Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. Androidinthewild: A large-scale dataset for android device control.Advances in Neural Information Processing Systems, 36:59708–59728, 2023

2023

[21] [21]

Sokhansanj, James R

Mohammad Saleh Refahi, Gavin Hearne, Harrison Muller, Kieran Lynch, Bahrad A. Sokhansanj, James R. Brown, and Gail Rosen. Fast and scalable gene embedding search: A comparative study of FAISS and ScaNN.arXiv preprint arXiv:2507.16978, 2025

work page arXiv 2025

[22] [22]

LLaVA-Video: Video Instruction Tuning With Synthetic Data

Peng Wang, An Yang, Jiamang Qui, et al. Qwen2-vl: To see real-world understanding as humans do.arXiv preprint arXiv:2410.02713, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[23] [23]

Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.Advances in Neural Information Processing Systems, 37:52040–52094, 2024

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.Advances in Neural Information Processing Systems, 37:52040–52094, 2024

2024

[24] [24]

Mind2web: Towards a generalist agent for the web

Xiang Yang, Jiang Chen, et al. Mind2web: Towards a generalist agent for the web. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

2023

[25] [25]

Webshop: Towards scalable real-world web interaction with grounded language agents

Shunyu Yao et al. Webshop: Towards scalable real-world web interaction with grounded language agents. InAdvances in Neural Information Processing Systems (NeurIPS), 2022

2022

[26] [26]

QA-GNN: Reasoning with language models and knowledge graphs for question answering.Proceedings of the NAACL-HLT, 2022

Michihiro Yasunaga et al. QA-GNN: Reasoning with language models and knowledge graphs for question answering.Proceedings of the NAACL-HLT, 2022

2022

[27] [27]

A Survey on Multimodal Large Language Models

Shukang Yin, Chaoyuan Fu, Sirui Zhao, Ke Xu, Kai Wang, Dianbo Sui, Yunhua Shen, Ning Li, Xing Sun, and Shan Lin. A survey on multimodal large language models.arXiv preprint arXiv:2306.13549, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[28] [28]

GPT-4V(ision) is a Generalist Web Agent, if Grounded

Boyu Zheng et al. Gpt-4v(ision) is a generalist web agent, if grounded. arXiv preprint arXiv:2401.01614, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024