ReGRPO: Reflection-Augmented Policy Optimization for Tool-Using Agents

Binjie Zhang; Mike Zheng Shou

arxiv: 2606.31392 · v1 · pith:A3XCM5R7new · submitted 2026-06-30 · 💻 cs.AI

Pith reviewed 2026-07-01 05:21 UTC · model grok-4.3

classification 💻 cs.AI

keywords: ReGRPO · Reflection-of-Thought · group relative policy optimization · tool-using agents · vision-language models · failure recovery · multi-step tasks · near-miss actions

The pith

ReGRPO trains tool-using agents to reflect on near-miss failures via structured triplets and group-relative optimization to improve recovery.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ReGRPO to address fragility in tool-augmented vision-language models during multi-step tasks. It builds a reflective data engine that runs near-miss actions to collect failure observations and formats them as Reflection-of-Thought triplets for initial supervised fine-tuning. The method then jointly optimizes reflection tokens and corrective actions inside local trajectories using group-relative advantages while adding a reflection-cost term to limit unnecessary reflection. Experiments on GTA and GAIA show consistent gains over strong open-source baselines under identical backbones and tool suites. A sympathetic reader would care because this supplies step-level recovery signals that standard success-only SFT and sparse trajectory rewards lack.

Core claim

ReGRPO learns reflection-guided correction in tool-using agents. It starts with a structured reflective data engine that executes near-miss actions to collect grounded failure observations, then builds Reflection-of-Thought triplets (ErrorType, Evidence, FixPlan) paired with corrected actions for warm-start SFT. It optimizes reflection tokens and corrective actions jointly within local trajectories using group-relative advantages and includes a reflection-cost term to reduce unnecessary reflection.
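As a rough illustration of the data format described above, the sketch below models one Reflection-of-Thought record and splits it into a prompt and a target for warm-start SFT. The three triplet fields come from the paper; everything else (the class names, the `failed_action` layout, the `WrongTool` label, the example task, and the JSON serialization) is an assumption made for this example and is not taken from the released RoT data.

```python
from dataclasses import dataclass, asdict
import json


@dataclass(frozen=True)
class Reflection:
    """The (ErrorType, Evidence, FixPlan) triplet named in the paper."""
    error_type: str
    evidence: str
    fix_plan: str


@dataclass(frozen=True)
class RoTRecord:
    """One hypothetical warm-start example: failed step, reflection, fix."""
    task: str
    failed_action: dict       # assumed shape: {"tool": ..., "args": {...}}
    failure_observation: str  # what the tool actually returned
    reflection: Reflection
    corrected_action: dict


def to_sft_example(rec: RoTRecord) -> dict:
    """Split a record into a prompt (context) and a target (what to learn)."""
    prompt = json.dumps(
        {
            "task": rec.task,
            "action": rec.failed_action,
            "observation": rec.failure_observation,
        },
        sort_keys=True,
    )
    target = json.dumps(
        {
            "reflection": asdict(rec.reflection),
            "corrected_action": rec.corrected_action,
        },
        sort_keys=True,
    )
    return {"prompt": prompt, "target": target}


# Illustrative record modeled on the OCR-on-a-face-image case in Figure 1.
record = RoTRecord(
    task="Who is the person in the photo?",  # made-up task text
    failed_action={"tool": "ocr", "args": {"image": "face.jpg"}},
    failure_observation="",  # OCR returned empty text
    reflection=Reflection(
        error_type="WrongTool",  # hypothetical label
        evidence="OCR returned empty text on a face image.",
        fix_plan="Switch to face detection.",
    ),
    corrected_action={"tool": "face_detection", "args": {"image": "face.jpg"}},
)
example = to_sft_example(record)
```

Keeping the failed action and its observation in the prompt, and the reflection plus corrected action in the target, mirrors the claim that the model should learn to diagnose and repair a step rather than only imitate successful ones.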

What carries the argument

Reflection-of-Thought triplets (ErrorType, Evidence, FixPlan) paired with corrected actions, optimized via group-relative policy optimization plus a reflection-cost term.
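To make the optimization signal concrete, here is a minimal sketch of a group-relative advantage with a reflection cost. It uses the standard GRPO idea of standardizing rewards within a sampled group, A_i = (r_i - mean) / std. How the reflection-cost term enters is not specified in the abstract, so this sketch assumes the simplest option: subtract `cost_weight` times the number of reflection tokens from each reward before standardizing. The function name, `cost_weight`, its default value, and the use of the population standard deviation are all assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, reflection_tokens,
                              cost_weight=0.01, eps=1e-8):
    """Standardize cost-adjusted rewards within one sampled group.

    rewards[i]           -- scalar task reward of rollout i
    reflection_tokens[i] -- reflection tokens emitted by rollout i
    cost_weight          -- hypothetical weight on the reflection cost
    """
    if len(rewards) != len(reflection_tokens):
        raise ValueError("rewards and reflection_tokens must align")
    # Assumed form of the penalty: linear in reflection length.
    adjusted = [r - cost_weight * n
                for r, n in zip(rewards, reflection_tokens)]
    mu = mean(adjusted)
    sigma = pstdev(adjusted)
    # eps keeps the division defined when every rollout scores the same.
    return [(a - mu) / (sigma + eps) for a in adjusted]


# Four rollouts sampled from the same failed step: two recover, two do not.
rewards = [1.0, 1.0, 0.0, 0.0]
reflection_tokens = [20, 80, 0, 60]
advantages = group_relative_advantages(rewards, reflection_tokens)
```

Under these assumptions, the two successful rollouts both receive positive advantages, but the one that reflected for 20 tokens is preferred over the one that needed 80, and a failed rollout that reflected at length is penalized more than one that did not reflect at all.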

Load-bearing premise

Collecting grounded failure observations via near-miss actions and structuring them as Reflection-of-Thought triplets supplies a sufficiently rich training signal to enable reliable recovery, and adding a reflection-cost term will not degrade performance on successful trajectories.

What would settle it

Training ReGRPO on GTA or GAIA and measuring no gain or a drop versus baselines that omit the reflection components would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.31392 by Binjie Zhang, Mike Zheng Shou.

**Figure 1.** Pipeline of the Structured Reflective Data Engine. Given a task trajectory, we first induce a tool failure (for example, OCR on a face image returns empty text). A teacher model (GPT-4o by default) then generates a structured Reflection-of-Thought with explicit ErrorType, Evidence, and FixPlan, which explains the failure and proposes the next action (for example, switching to face detection). The agent exe…

**Figure 2.** Overview of ReGRPO. (1) Structured Reflective Data Engine: from multimodal inputs and a successful action, synthesize a near-miss failure (wrong crop/tool/argument), execute it to obtain grounded failure observations (e.g., empty OCR or tool error), then use a teacher VLM (e.g., GPT-4o) to generate a structured Reflection-of-Thought triplet (ErrorType, Evidence, FixPlan). Pair the reflection with the corre…

**Figure 3.** Comparison of PPO [19], DPO [17], GRPO [20], and ReGRPO (ours). PPO and DPO optimize actions or preferences without treating reflection as a decision variable; GRPO reduces variance via group-relative rewards; ReGRPO further includes reflection in the optimized trajectory to provide stronger recovery-oriented supervision for failed steps.

**Figure 4.** Figure-level evidence for ReGRPO, instantiated on a verbatim synthesized RoT record (0FLZe2lb_rot_s0_GroundingDrift). (a) RoT augments the MAT SFT format with explicit reflective fields (a reflection triplet and a corrected_action). (b) The real (ErrorType, Evidence, FixPlan) reflection diagnoses a silent grounding failure (the tool answers about the cutting board rather than the smoothie) and prescribes a r…

**Figure 5.** Teacher-derived verifier subscores. At training time (s_a, s_g, s_p) are computed deterministically from the teacher’s RoT metadata via tool/argument-signature matching, the replay success flag, and a grounded-reflection check (s_g requires grounded evidence and s_p > 0); no GPT-4o (or any LLM) is queried in the RL loop. The GPT-4o teacher is used only offline to synthesize the RoT reflections.
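The Figure 5 caption describes three deterministic subscores, which can be sketched as below. The caption names the ingredients (tool/argument-signature matching, a replay success flag, and a grounded-reflection check that also requires a positive third subscore) but does not say which ingredient feeds which subscore. The mapping used here, the binary 0/1 values, and the substring test for "grounded evidence" are assumptions for illustration only.

```python
def action_signature(action: dict) -> tuple:
    """Canonical (tool, sorted args) signature, independent of key order."""
    return (action["tool"], tuple(sorted(action.get("args", {}).items())))


def verifier_subscores(predicted_action, teacher_action, replay_succeeded,
                       evidence, failure_observation):
    """Return (s_a, s_g, s_p) without calling any LLM."""
    # s_a (assumed): exact tool/argument-signature match with the teacher.
    matches = action_signature(predicted_action) == action_signature(teacher_action)
    s_a = 1.0 if matches else 0.0
    # s_p (assumed): the replay success flag stored in the RoT metadata.
    s_p = 1.0 if replay_succeeded else 0.0
    # Grounding check (assumed): the evidence quotes the real observation.
    grounded = bool(failure_observation) and failure_observation in evidence
    # s_g requires grounded evidence AND s_p > 0, as the caption states.
    s_g = 1.0 if (grounded and s_p > 0) else 0.0
    return s_a, s_g, s_p


teacher = {"tool": "face_detection", "args": {"image": "face.jpg"}}
scores = verifier_subscores(
    predicted_action={"args": {"image": "face.jpg"}, "tool": "face_detection"},
    teacher_action=teacher,
    replay_succeeded=True,
    evidence="OCR failed with 'ToolError: no text regions found' on a face image.",
    failure_observation="ToolError: no text regions found",
)
```

Because every input is precomputed metadata, a check of this shape is cheap enough to run on every rollout, which is consistent with the caption's statement that no LLM is queried inside the RL loop.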

**Figure 6.** Input image for Case A1 (image_318.jpg): the menu the agent grounds and reads before solving the constrained selection.

**Figure 7.** Input image for Case A2 (image_417.jpg): the depicted product the agent recognizes and resolves to its manufacturer before retrieving the CEO.

Original abstract

Tool-augmented vision-language models (VLMs) can solve multimodal, multi-step tasks by calling external tools, yet they remain fragile in practice. Existing works have two common gaps. Supervised fine-tuning (SFT) is built mostly on successful trajectories and offers little signal for recovery after tool failures, while sparse trajectory-level RL rewards provide limited guidance on which step failed and how to repair it. We introduce ReGRPO (Reflection-augmented Group Relative Policy Optimization), a framework that learns reflection-guided correction in tool-using agents. ReGRPO starts with a structured reflective data engine: we execute near-miss actions to collect grounded failure observations, then build Reflection-of-Thought triplets (ErrorType, Evidence, FixPlan) paired with corrected actions for warm-start SFT. We then optimize reflection tokens and corrective actions jointly within local trajectories using group-relative advantages, and include a reflection-cost term to reduce unnecessary reflection. Experiments on GTA and GAIA show that, under the same backbone and tool suite, ReGRPO consistently outperforms strong open-source baselines and achieves the best results among the compared open-source controllers. Code and RoT data are available at https://github.com/showlab/ReGRPO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ReGRPO turns near-miss failures into structured reflection triplets and folds them into GRPO with a cost penalty, producing measurable gains on GTA and GAIA under matched backbones.

ReGRPO gives tool-using VLMs a way to learn recovery by collecting near-miss executions, packaging the failures as Reflection-of-Thought triplets, and then training reflections and actions together inside group-relative policy optimization with an added penalty for extra reflection steps.

The concrete new piece is the data engine that deliberately triggers near-miss actions to generate grounded failure observations, then converts them into (ErrorType, Evidence, FixPlan) triplets for warm-start SFT before the joint GRPO stage. That pipeline is not just restated prior work; it supplies a denser signal than trajectory-level rewards alone. Releasing the code and the RoT dataset is also useful for anyone who wants to inspect or extend the method.
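The near-miss collection step can be pictured with a small sketch like the one below. It perturbs a known-good action into a wrong-tool or wrong-argument variant, executes it, and records whatever the tool returns or raises as the grounded failure observation. The function names, the perturbation choices, and the `fake_execute` stand-in are hypothetical; the paper's engine also covers wrong crops and runs real tools, neither of which is modeled here.

```python
import copy
import random


def synthesize_near_miss(action, tool_names, rng):
    """Perturb a successful action into a near-miss variant."""
    near_miss = copy.deepcopy(action)  # never mutate the good action
    kind = rng.choice(["wrong_tool", "wrong_argument"])
    if kind == "wrong_tool":
        # Assumes at least one alternative tool is available.
        others = [t for t in tool_names if t != action["tool"]]
        near_miss["tool"] = rng.choice(others)
    else:
        # Assumes the action has at least one argument to corrupt.
        key = rng.choice(sorted(near_miss["args"]))
        near_miss["args"][key] = None
    return kind, near_miss


def collect_failure(action, execute, tool_names, rng):
    """Run the near-miss and keep the real observation it produces."""
    kind, near_miss = synthesize_near_miss(action, tool_names, rng)
    try:
        observation = execute(near_miss)
    except Exception as exc:  # tool errors are observations too
        observation = f"{type(exc).__name__}: {exc}"
    return {
        "perturbation": kind,
        "near_miss": near_miss,
        "observation": observation,
        "corrected_action": action,
    }


def fake_execute(action):
    """Stand-in tool runner; a real engine would call the actual tool."""
    if action["args"].get("image") is None:
        raise ValueError("missing image")
    if action["tool"] == "ocr":
        return ""  # OCR finds no text on a face image
    return "1 face found"


good_action = {"tool": "face_detection", "args": {"image": "face.jpg"}}
result = collect_failure(
    good_action, fake_execute, ["ocr", "face_detection"], random.Random(0)
)
```

The point of executing the perturbed action, rather than just labeling it wrong, is that the stored observation is what the tool really produced, which is what a later reflection can cite as evidence.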

The experiments report consistent outperformance over open-source baselines on GTA and GAIA with the same backbone and tool set. The stress-test found no internal contradiction in the argument, and the central claim rests on a coherent sequence of steps rather than hidden fitting.

Soft spots are limited. The reflection-cost weight is a free parameter that must be chosen, and the near-miss collection procedure could be sensitive to how the execution engine is implemented. Without the full ablation tables it is hard to quantify exactly how much each component drives the final numbers, but nothing in the described pipeline looks load-bearing or circular.

This paper is aimed at people building multimodal agents that must handle tool failures in practice. Readers who care about concrete training recipes for recovery will find the details and artifacts worth their time. It deserves a serious referee because the method is testable, the data are released, and the reported gains are framed under controlled comparisons.

Referee Report

0 major / 3 minor

Summary. The paper introduces ReGRPO, a framework for tool-augmented VLMs that addresses limited recovery signals in SFT and sparse RL by collecting near-miss failure data into Reflection-of-Thought triplets (ErrorType, Evidence, FixPlan) for warm-start SFT, followed by joint GRPO optimization of reflection tokens and corrective actions with an added reflection-cost term. Experiments claim consistent outperformance over open-source baselines on GTA and GAIA under matched backbones and tool suites, with code and data released.

Significance. If the empirical gains hold, the work supplies a concrete mechanism for generating grounded recovery signals from near-miss trajectories and integrating them into group-relative optimization, which could improve robustness of multi-step tool use. Explicit release of code and RoT data is a positive contribution to reproducibility.

minor comments (3)

[Abstract] The abstract states that ReGRPO 'consistently outperforms' baselines but provides no numerical deltas, success rates, or statistical significance; adding these values (or directing readers to the relevant table) would strengthen the claim.
The description of the reflection-cost term is introduced without an explicit equation or hyperparameter schedule; including the precise formulation and how its weight was chosen would improve clarity.
Figure or table captions should explicitly state the backbone model and tool suite used for each compared method to make the 'under the same backbone' claim immediately verifiable.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of ReGRPO, the recognition of its contribution to generating grounded recovery signals, and the recommendation for minor revision. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity

The paper presents ReGRPO as a pipeline of near-miss data collection into Reflection-of-Thought triplets, warm-start SFT, and joint GRPO optimization with an added reflection-cost term. No equations, fitted parameters, or predictions are shown that reduce by construction to the method's own inputs or self-citations. The performance claims rest on external experimental benchmarks (GTA, GAIA) rather than any self-referential derivation or renaming of known results. The derivation chain is therefore self-contained against external evaluation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

Abstract-only review yields limited visibility; the ledger records only elements explicitly named in the abstract.

free parameters (1)

reflection-cost term weight
Mentioned as included to reduce unnecessary reflection; no value or fitting procedure given.

axioms (1)

domain assumption: Group-relative advantages within local trajectories supply finer-grained learning signal than sparse trajectory-level rewards
Stated as motivation for the optimization step.

invented entities (1)

Reflection-of-Thought triplets (ErrorType, Evidence, FixPlan) · no independent evidence
purpose: Structured representation of failure observations for warm-start SFT
New data format introduced by the paper; no independent evidence outside this work.

pith-pipeline@v0.9.1-grok · 5740 in / 1323 out tokens · 26107 ms · 2026-07-01T05:21:05.420502+00:00 · methodology

Reference graph

Works this paper leans on

31 extracted references · 20 canonical work pages · 10 internal anchors

[1] Chen, X., Lin, M., Schärli, N., Zhou, D.: Teaching large language models to self-debug (2023), https://arxiv.org/abs/2304.05128

[2] Choi, W., Kim, W.K., Yoo, M., Woo, H.: Embodied cot distillation from llm to off-the-shelf agents. In: ICML. pp. 8702–8721 (2024)

[3] Fan, Y., Ma, X., Wu, R., Du, Y., Li, J., Gao, Z., Li, Q.: Videoagent: A memory-augmented multimodal agent for video understanding. In: ECCV (2024)

[4] Gao, Z., Du, Y., Zhang, X., Ma, X., Han, W., Zhu, S.C., Li, Q.: Clova: A closed-loop visual assistant with tool usage and update. In: CVPR. pp. 13258–13268 (2024)

[5] Gao, Z., Zhang, B., Li, P., Ma, X., Yuan, T., Fan, Y., Wu, Y., Jia, Y., Zhu, S.C., Li, Q.: Multi-modal agent tuning: Building a vlm-driven agent for efficient tool usage. arXiv preprint arXiv:2412.15606 (2024)

[6] Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al.: Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)

[7] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: Lora: Low-rank adaptation of large language models. In: ICLR (2022)

[8] Li, F., Zhang, R., Zhang, H., Zhang, Y., Li, B., Li, W., Ma, Z., Li, C.: Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models. arXiv preprint arXiv:2407.07895 (2024)

[9] Li, P., Gao, Z., Zhang, B., Mi, Y., Ma, X., Shi, C., Yuan, T., Wu, Y., Jia, Y., Zhu, S.C., et al.: Iterative tool usage exploration for multimodal agents via step-wise preference tuning. arXiv preprint arXiv:2504.21561 (2025)

[10] Liao, Y.H., Mahmood, R., Fidler, S., Acuna, D.: Can feedback enhance semantic grounding in large vision-language models. arXiv preprint arXiv:2404.06510 (2024)

[11] Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., Lee, Y.J.: Llava-next: Improved reasoning, ocr, and world knowledge (January 2024), https://llava-vl.github.io/blog/2024-01-30-llava-next/

[12] Liu, S., Cheng, H., Liu, H., Zhang, H., Li, F., Ren, T., Zou, X., Yang, J., Su, H., Zhu, J., et al.: Llava-plus: Learning to use tools for creating multimodal agents. arXiv preprint arXiv:2311.05437 (2023)

[13] Liu, X., Zhang, T., Gu, Y., Iong, I.L., Xu, Y., Song, X., Zhang, S., Lai, H., Liu, X., Zhao, H., et al.: Visualagentbench: Towards large multimodal models as visual foundation agents. arXiv preprint arXiv:2408.06327 (2024)

[14] Lu, H., Liu, W., Zhang, B., Wang, B., Dong, K., Liu, B., Sun, J., Ren, T., Li, Z., Yang, H., et al.: Deepseek-vl: towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525 (2024)

[15] Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., Gupta, S., Majumder, B.P., Hermann, K., Welleck, S., Yazdanbakhsh, A., Clark, P.: Self-refine: Iterative refinement with self-feedback (2023), https://arxiv.org/abs/2303.17651

[16] Mialon, G., Fourrier, C., Swift, C., Wolf, T., LeCun, Y., Scialom, T.: Gaia: a benchmark for general ai assistants. arXiv preprint arXiv:2311.12983 (2023)

[17] Rafailov, R., Sharma, A., Mitchell, E., Manning, C.D., Ermon, S., Finn, C.: Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems 36, 53728–53741 (2023)

[18] Sasazawa, Y., Sogawa, Y.: Layout generation agents with large language models. arXiv preprint arXiv:2405.08037 (2024)

[19] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)

[20] Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al.: Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024)

[21] Shen, Y., Song, K., Tan, X., Zhang, W., Ren, K., Yuan, S., Lu, W., Li, D., Zhuang, Y.: Taskbench: Benchmarking large language models for task automation. arXiv preprint arXiv:2311.18760 (2023)

[22] Shinn, N., Cassano, F., Berman, E., Gopinath, A., Narasimhan, K., Yao, S.: Reflexion: Language agents with verbal reinforcement learning (2023), https://arxiv.org/abs/2303.11366

[23] Surís, D., Menon, S., Vondrick, C.: Vipergpt: Visual inference via python execution for reasoning. In: ICCV. pp. 11888–11898 (2023)

[24] Wang, C., Luo, W., Chen, Q., Mai, H., Guo, J., Dong, S., Xuan, X., Li, Z., Ma, L., Gao, S.: Mllm-tool: A multimodal large language model for tool agent learning. arXiv preprint arXiv:2401.10727 (2024)

[25] Wang, J., Ma, Z., Li, Y., Zhang, S., Chen, C., Chen, K., Le, X.: Gta: A benchmark for general tool agents. In: NeurIPS (2024)

[26] Wang, Z., Li, A., Li, Z., Liu, X.: Genartist: Multimodal llm as an agent for unified image generation and editing. arXiv preprint arXiv:2407.05600 (2024)

[27] Wang, Z., Xie, E., Li, A., Wang, Z., Liu, X., Li, Z.: Divide and conquer: Language models can plan and self-correct for compositional text-to-image generation. arXiv preprint arXiv:2401.15688 (2024)

[28] Xiong, T., Wang, X., Guo, D., Ye, Q., Fan, H., Gu, Q., Huang, H., Li, C.: Llava-critic: Learning to evaluate multimodal models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 13618–13628 (2025)

[29] Yin, D., Brahman, F., Ravichander, A., Chandu, K., Chang, K.W., Choi, Y., Lin, B.Y.: Agent lumos: Unified and modular training for open-source language agents. In: ICML. pp. 12380–12403 (2024)

[30] Zhang, J., Lan, T., Murthy, R., Liu, Z., Yao, W., Tan, J., Hoang, T., Yang, L., Feng, Y., Liu, Z., et al.: Agentohana: Design unified data and training pipeline for effective agent learning. arXiv preprint arXiv:2402.15506 (2024)

[31] Zheng, B., Gou, B., Kil, J., Sun, H., Su, Y.: Gpt-4v(ision) is a generalist web agent, if grounded. In: ICML. pp. 61349–61385 (2024)
