Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools

Chengxuan Qian; Dapeng Zhang; Jing Tang; Lei Sun; Rui Chen; Shuo Li; Xiangxiang Chu; Xiangyan Qu; Yiwei Wang; Yujun Cai; Zhenlong Yuan

arxiv: 2510.08480 · v2 · submitted 2025-10-09 · 💻 cs.CV

Zhenlong Yuan, Xiangyan Qu, Chengxuan Qian, Rui Chen, Jing Tang, Lei Sun, Xiangxiang Chu, Dapeng Zhang, Yiwei Wang, Yujun Cai, Shuo Li

Pith reviewed 2026-05-18 08:38 UTC · model grok-4.3

classification: 💻 cs.CV

keywords: open-vocabulary action recognition · video understanding · sub-motion decomposition · tool-augmented reinforcement learning · multimodal large language models · cross-modal hallucination · hierarchical reward

The pith

Video-STAR decomposes video actions into sub-motions and augments MLLMs with tool-using reinforcement learning to improve open-vocabulary recognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Video-STAR as a way to move beyond treating video actions as single units in multimodal models. It breaks each action into smaller discriminative sub-motions for more precise matching and lets the model call external tools on demand through reinforcement learning. A hierarchical reward balances how often tools are used, how relevant the sub-motions are, and how coherent the overall reasoning stays. This setup is meant to reduce text-driven hallucinations and shift the model toward evidence grounded in the actual video frames. Results on HMDB-51, UCF-101, Something-Something V2, Kinetics-400, and Kinetics-600 show gains over prior methods at distinguishing similar actions.

Core claim

Video-STAR harmonizes contextual sub-motion decomposition with tool-augmented reinforcement learning for open-vocabulary action recognition. Actions are decomposed into discriminative sub-motions that support fine-grained matching; domain-specific tools are invoked dynamically for cross-modal interleaving. A hierarchical reward that weighs tool-usage efficiency, sub-motion relevance, and structural coherence lets the model learn to prioritize useful patterns without explicit supervision, moving from text-centric priors to visually grounded inference.
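The three reward components named above can be pictured as a single scalar score per sampled response. The sketch below is illustrative only: the weight values, the accuracy term, the gating of auxiliary terms on a correct answer, and every name (`RewardWeights`, `hierarchical_reward`, and the argument names) are assumptions made for this example, since the abstract does not specify how the components are computed or combined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights; the source does not state their values.
    tool_efficiency: float = 0.2
    submotion_relevance: float = 0.3
    structural_coherence: float = 0.1


def hierarchical_reward(
    correct: bool,
    tools_called: int,
    max_tools: int,
    relevant_submotions: int,
    total_submotions: int,
    well_formed: bool,
    weights: RewardWeights = RewardWeights(),
) -> float:
    """Combine an accuracy term with three auxiliary terms into one scalar."""
    accuracy = 1.0 if correct else 0.0
    # Fewer tool calls score higher; 1.0 when no tool is available or called.
    efficiency = 1.0 - tools_called / max_tools if max_tools > 0 else 1.0
    # Fraction of listed sub-motions judged relevant to the predicted action.
    relevance = (
        relevant_submotions / total_submotions if total_submotions > 0 else 0.0
    )
    # 1.0 when the response follows the expected output structure.
    coherence = 1.0 if well_formed else 0.0
    auxiliary = (
        weights.tool_efficiency * efficiency
        + weights.submotion_relevance * relevance
        + weights.structural_coherence * coherence
    )
    # Assumed gating: auxiliary terms only count when the answer is correct,
    # so a well-formatted but wrong response earns nothing.
    return accuracy + (auxiliary if correct else 0.0)
```

Under these assumed weights, a correct answer that used one of four tools, with half of its sub-motions relevant and a well-formed output, scores about 1.4, while any incorrect answer scores 0.0.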

What carries the argument

Contextual sub-motion decomposition combined with tool-augmented reinforcement learning guided by a hierarchical reward.

If this is right

Fine-grained action distinctions become feasible without category-specific training data.
Cross-modal hallucinations decrease because visual sub-motion evidence is interleaved with text reasoning.
The model acquires category-specific reasoning capacity through autonomous tool use rather than hand-crafted prompts.
Performance improves on both small and large benchmarks including Kinetics-400 and Kinetics-600.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same decomposition-plus-tool pattern could be tested on video question answering or temporal localization tasks.
If tool calls remain reliable, the approach might lower the amount of paired video-text data needed for new domains.
Hierarchical rewards that explicitly penalize unnecessary tool use could transfer to other agent-style vision-language systems.

Load-bearing premise

Actions can be broken into stable, discriminative sub-motions that improve matching and that external tools can be invoked without adding new errors.

What would settle it

The claim would be undermined if, on a held-out video dataset containing many near-identical actions, the full Video-STAR pipeline showed no accuracy gain, or a higher hallucination rate, compared with a version that skips sub-motion decomposition and tool calls.
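The comparison described here is a paired ablation: the same held-out clips are scored by the full pipeline and by a variant without sub-motion decomposition and tool calls. A minimal sketch of the bookkeeping follows; the function names and the dictionary keys are hypothetical, and the predictions are assumed to be plain action-label strings aligned by clip.

```python
def accuracy(predictions, labels):
    """Fraction of clips whose predicted action equals the label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("need at least one labelled clip")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def compare_ablation(full_preds, ablated_preds, labels):
    """Paired comparison of the full pipeline against an ablated variant."""
    triples = list(zip(full_preds, ablated_preds, labels))
    # Clips that only one of the two systems gets right (discordant pairs).
    full_only = sum(f == y and a != y for f, a, y in triples)
    ablated_only = sum(a == y and f != y for f, a, y in triples)
    return {
        "full_accuracy": accuracy(full_preds, labels),
        "ablated_accuracy": accuracy(ablated_preds, labels),
        "full_only_correct": full_only,
        "ablated_only_correct": ablated_only,
    }
```

Reporting the discordant counts alongside the two accuracies shows whether a gain comes from many clips or a handful, which matters on small benchmarks.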

Figures

Figures reproduced from arXiv: 2510.08480 by Chengxuan Qian, Dapeng Zhang, Jing Tang, Lei Sun, Rui Chen, Shuo Li, Xiangxiang Chu, Xiangyan Qu, Yiwei Wang, Yujun Cai, Zhenlong Yuan.

**Figure 1.** Key insight of Video-STAR. (a) MLLMs + CoT is prone to hallucinations due to overreliance on text-centric reasoning while ignoring visual cues. (b) MLLMs + Tool-Augmented CoT mitigates hallucinations by integrating domain-specific tools to extract visual information. However, both (a) and (b) lack category-specific reasoning capabilities and struggle to distinguish semantically similar or complex actions.…

**Figure 2.** Pipeline of Video-STAR. (i) Introduce a three-stage sub-motion logic chain to construct tool-augmented reasoning data that decomposes actions into discriminative sub-motions. (ii) Pretrain the MLLM on structured reasoning chains and fine-tune it for domain-specific adaptation. (iii) Adopt the GRPO algorithm for reinforcement learning, which optimizes a hierarchical reward function considering both tool-u…

**Figure 3.** Tool Library. Given the input video, Video-STAR adopts YOLO 11 for human detection & pose estimation, and the Qwen API for action explanation & video description. Stage 2: Result Integration & Prediction. The second stage refines action recognition by integrating tool outputs R with original inputs. For the visual tools Tp and Td, extracted features F are concatenated with raw video frame…

**Figure 4.** Case Study between Qwen2.5-VL-3B and our Video-STAR-3B. Qwen2.5-VL-3B misclassifies action "turn" as "smile", while our Video-STAR-3B accurately identifies the correct action.

**Figure 5.** More Case Study between Qwen2.5-VL-3B and our Video-STAR-3B. Qwen2.5-VL-3B misclassifies "push" as "play", while our Video-STAR-3B accurately identifies the correct action. B RELATED WORK. B.1 OPEN-VOCABULARY ACTION RECOGNITION. Building on the robust feature extraction capabilities of CLIP’s pre-training, researchers have devised numerous video recognition methods leveraging its cross-modal alignment stre…

**Figure 6.** More Case Study between Qwen2.5-VL-3B and our Video-STAR-3B. Our Video-STAR-3B can accurately identify the correct action. B.2 MULTIMODAL LLMS REASONING. The advancement of deep learning algorithms has led to the development of numerous self-supervised learning methods (Yang et al., 2021; 2024), vision-language models (Song et al., 2025; Chen et al., 2023), multimodal balance (Qian et al., 2025b;a), image re…

Original abstract

Multimodal large language models (MLLMs) have demonstrated remarkable potential in bridging visual and textual reasoning, yet their reliance on text-centric priors often limits their ability to disentangle semantically similar actions in open-vocabulary scenarios. To address this, we propose Video-STAR, a framework that harmonizes contextual sub-motion decomposition with tool-augmented reinforcement learning for open-vocabulary action recognition (OVAR). Unlike prior methods that treat actions as monolithic entities, our approach innovatively decomposes actions into discriminative sub-motions for fine-grained matching while dynamically invoking domain-specific tools for cross-modal interleaving, thereby enabling category-specific reasoning capacity and reducing cross-modal hallucination. Moreover, by designing a hierarchical reward that balances tool-usage efficiency, sub-motion relevance, and structural coherence in reasoning, our method autonomously leverages external tools to prioritize sub-motion patterns without explicit supervision, transitioning from text-centric reasoning to visually grounded inference. Extensive evaluations on HMDB-51, UCF-101, SSv2, Kinetics-400, and Kinetics-600 datasets demonstrate our state-of-the-art performance, outperforming existing methods in distinguishing fine-grained actions and handling cross-modal hallucination, validating our excellent robustness and generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Video-STAR combines sub-motion decomposition with tool-augmented RL and a hierarchical reward to target hallucinations in open-vocabulary video action recognition, but the abstract gives no numbers or ablations to check the claims.

Video-STAR is trying to fix how video MLLMs handle open-vocabulary actions by splitting them into sub-motions and using RL to decide when to call tools for extra info. The hierarchical reward is meant to keep the tool use efficient while staying relevant and coherent. The new angle is this mix of decomposition and autonomous tool calling without needing labeled supervision for the policy.

It claims SOTA on the standard action datasets and better robustness against hallucinations. If the experiments show clear gains from the sub-motion part and the tool integration, that could be a useful practical tweak for fine-grained recognition tasks. The main issue is that the abstract doesn't include any actual numbers or ablation results, so it's hard to tell how much better it really is or whether the reward actually prevents bad tool calls from making things worse. The stress test raises a good point there, and without seeing the full methods or error analysis, that remains a potential weak spot.

This paper is for folks working on video understanding and multimodal LLMs. Someone looking for ways to apply RL in this area might get ideas from it. I'd say send it for peer review so the experiments can be checked properly.

Referee Report

2 major / 1 minor

Summary. The paper introduces Video-STAR, a framework for open-vocabulary action recognition that decomposes video actions into discriminative sub-motions and employs tool-augmented reinforcement learning. A hierarchical reward balances tool-usage efficiency, sub-motion relevance, and structural coherence to enable autonomous tool invocation, reduce cross-modal hallucination, and shift from text-centric to visually grounded inference in MLLMs. The central claim is state-of-the-art performance with improved robustness on HMDB-51, UCF-101, SSv2, Kinetics-400, and Kinetics-600.

Significance. If the empirical results and ablations hold, the combination of sub-motion decomposition with unsupervised tool-augmented RL could offer a practical route to grounded reasoning in video MLLMs, particularly for fine-grained open-vocabulary tasks. The absence of explicit supervision on tool accuracy is a notable design choice that, if shown to work reliably, would strengthen the case for autonomous tool use in multimodal systems.

major comments (2)

[Abstract] The SOTA and reduced-hallucination claims are stated without any accuracy numbers, error bars, ablation tables, or dataset-split details, which are load-bearing for evaluating whether the hierarchical reward actually delivers the reported gains on SSv2 and Kinetics-600.

[Framework description, paragraph on hierarchical reward] No mechanism is specified that prevents the learned policy from invoking tools whose outputs are noisy or irrelevant; an incorrect tool result could introduce new cross-modal errors rather than reduce hallucination, directly affecting the robustness claim for fine-grained actions.

minor comments (1)

[Framework] The notation for sub-motion decomposition and tool interleaving should be formalized with explicit equations or pseudocode to clarify how the reward components are computed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications and indicate planned revisions to improve the manuscript.

Referee: [Abstract] The SOTA and reduced-hallucination claims are stated without any accuracy numbers, error bars, ablation tables, or dataset-split details, which are load-bearing for evaluating whether the hierarchical reward actually delivers the reported gains on SSv2 and Kinetics-600.

Authors: We agree that the abstract would benefit from greater specificity to support the claims. In the revised manuscript we will add key quantitative results (e.g., accuracy deltas on SSv2 and Kinetics-600), reference the ablation tables, and note the standard train/test splits used, thereby making the contribution of the hierarchical reward more transparent. revision: yes

Referee: [Framework description, paragraph on hierarchical reward] No mechanism is specified that prevents the learned policy from invoking tools whose outputs are noisy or irrelevant; an incorrect tool result could introduce new cross-modal errors rather than reduce hallucination, directly affecting the robustness claim for fine-grained actions.

Authors: The hierarchical reward is constructed precisely to discourage noisy tool use: its tool-usage-efficiency and sub-motion-relevance terms penalize invocations that fail to improve reasoning coherence or final recognition accuracy. The RL objective therefore trains the policy to avoid low-reward (i.e., noisy or irrelevant) tool calls. We will revise the framework section to state this incentive structure more explicitly and to reference supporting ablation results that show reduced erroneous tool invocations. revision: partial
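The Figure 2 caption names GRPO as the reinforcement-learning algorithm. In GRPO, several responses are sampled for the same input and each response's reward is normalized against the mean and standard deviation of its own group, so responses with below-average reward (for example, ones padded with unhelpful tool calls) receive a negative advantage. The sketch below shows only that normalization step; the function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions of this example rather than details taken from the paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled response's reward against its own group."""
    if not rewards:
        raise ValueError("need at least one sampled response")
    mean = statistics.fmean(rewards)
    # Population standard deviation over the group (an assumed choice).
    std = statistics.pstdev(rewards)
    # eps keeps the division finite when every response has the same reward.
    return [(r - mean) / (std + eps) for r in rewards]
```

When all responses in a group earn the same reward, every advantage is zero and that group contributes no learning signal, which is one reason reward terms need to separate good tool use from bad within a group.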

Circularity Check

0 steps flagged

No significant circularity in claimed derivation

The paper describes an empirical ML framework (sub-motion decomposition + hierarchical-reward RL for tool invocation) whose central claims are SOTA numbers on external benchmarks (HMDB-51, UCF-101, SSv2, Kinetics-400/600). No equations, fitted parameters, or self-citations are presented in the supplied text that reduce any reported performance metric or robustness claim to quantities defined from the same data or prior author work by construction. The hierarchical reward is introduced as a design choice whose effectiveness is assessed experimentally rather than assumed tautologically. This matches the default expectation for an applied computer-vision paper whose results rest on held-out dataset evaluations rather than a closed logical loop.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

Review based on abstract only; the central claim rests on the unproven premise that sub-motion decomposition is feasible and beneficial for all target actions plus the assumption that external tools can be reliably selected and applied by the RL agent.

free parameters (1)

hierarchical reward weights: Balances tool-usage efficiency, sub-motion relevance, and structural coherence; values not specified in abstract.

axioms (2)

domain assumption: Actions can be decomposed into discriminative sub-motions that support fine-grained cross-modal matching. Invoked as the core innovation in the framework description.

domain assumption: Domain-specific tools exist that can be dynamically invoked to interleave modalities and reduce hallucination. Central to the tool-augmented RL component.

invented entities (1)

Video-STAR framework (no independent evidence)
purpose: Harmonizes sub-motion decomposition with tool-augmented reinforcement learning for OVAR.
Newly proposed system; no independent evidence provided beyond the abstract claim.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: hierarchical reward that balances tool-usage efficiency, sub-motion relevance, and structural coherence

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: sub-motion decomposition into discriminative primitives

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ConeSep: Cone-based Robust Noise-Unlearning Compositional Network for Composed Image Retrieval
cs.CV · 2026-04 · unverdicted · novelty 7.0
ConeSep tackles noisy triplet correspondences in composed image retrieval by introducing geometric fidelity quantization to locate noise, negative boundary learning for semantic opposites, and targeted unlearning via ...

Dual-Pathway Circuits of Object Hallucination in Vision-Language Models
cs.CV · 2026-05 · unverdicted · novelty 6.0
Vision-language models contain identifiable grounding and hallucination pathways; suppressing the latter reduces object hallucinations by up to 76% while preserving accuracy.

Air-Know: Arbiter-Calibrated Knowledge-Internalizing Robust Network for Composed Image Retrieval
cs.CV · 2026-04 · unverdicted · novelty 6.0
Air-Know decouples MLLM-based external arbitration from proxy learning via knowledge internalization and dual-stream training to overcome noisy triplet correspondence in composed image retrieval.

IndusAgent: Reinforcing Open-Vocabulary Industrial Anomaly Detection with Agentic Tools
cs.CV · 2026-05 · unverdicted · novelty 5.0
IndusAgent achieves state-of-the-art zero-shot performance on industrial anomaly benchmarks by using a custom Indus-CoT dataset, dynamic tool orchestration, and gated RL to optimize anomaly classification, localizatio...

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · cited by 4 Pith papers

[1] Matching candidate actions: Push, Pullup, Punch

[2] push" as (2021)
Pattern comparison for each candidate: - Push: [Score: 9/10] - The arms' forceful movement and the legs' forward motion align with the definition of pushing. The torso's slight bend further supports this action, as it helps in exerting force on the object. - Pullup: [Score: 4/10] - While the arms' movement is similar, the legs' forward motion and the over...

[3] can significantly enhance reasoning capabilities, as exemplified by OpenAI-o1 (Jaech et al.,

[4] handshake (2025)
and DeepSeek-R1 (Guo et al., 2025). Several works have extended these paradigms to multimodal language models (MLLMs) for tasks such as mathematical and scientific image VQA (Peng et al., 2025; Huang et al., 2025; Lu et al., 2024), image segmentation and grounding (Liu et al., 2025; Bai et al., 2025b; Shen et al., 2025), video spatial or temporal ground...

[5] **Pose Estimation Tool**: Adds skeleton keypoints and connections to show body pose and joint movements

[6] **Person Detection Tool**: Adds bounding boxes around detected persons to highlight human subjects

[7] **Noun Explanation Tool**: Provides detailed explanations of action types to help with classification

[8] **Video Description Tool**: Provides description of input video to help better understand video content Please analyze the video using step-by-step reasoning and decide for each tool independently: Analysis Requirements:

[9] **For Pose Estimation**: Assess whether body joint movements and pose details are crucial for identifying this action

[10] **For Person Detection**: Evaluate whether clearly identifying and localizing the person(s) would help with action recognition

[11] **For Noun Explanation**: Consider whether you need detailed explanations of action categories to make accurate classification

[12] **For Video Description**: Judge whether you require a detailed description concerning video to help identifying action Output Format: <think>step-by-step reasoning process:

[13] Video content analysis: [describe what you observe in the video]

[14] Pose estimation evaluation: [assess whether joint/skeleton info would help]

[15] Person detection evaluation: [assess whether person localization would help]

[16] Noun explanation evaluation: [assess whether action category details would help]

[17] "") Second Round Prompts: SECOND_TURN_TEMPLATE = (
Final tool selection reasoning: [explain your decisions for each tool ] </think> <action> <human>yes or no</human> <pose>yes or no</pose> <action>yes or no</action> <video>yes or no</video> </action> """) Second Round Prompts: SECOND_TURN_TEMPLATE = (""" I will provide you with a video and ask you to identify the human action shown. There is also an annot...

[18] Identify all body parts showing significant motion in the video (e.g., arms, legs, torso)

[19] For each identified body part: - Note its movement direction (up/down, rotational, etc.) - Record contact points with objects (if any)

[20] List them by importance from top to bottom **Stage 2: Action Candidate Selection**

[21] Extract the key movement patterns from its description in given action types

[22] pulling)
Compare with the video’s observed: - Temporal sequence of movements (which body part moves first) - Interaction patterns between body parts - Force application points (e.g., hand gripping vs. pulling)

[23] Generate 2-3 candidate actions that best match the observed body part movements **Stage 3: Matching Scoring**

[24] Match these observations to the action definitions in given action types

[25] Score each candidate based on: (Maximum Score: 10) - Body part involvement precision - Movement pattern similarity - Object interaction consistency Output Format: <think>step-by-step reasoning process:

[26] Observed body parts and movement characteristics: - [Body Part 1]: [Direction/Contact Description] - [Body Part 2]: [Direction/Contact Description]

[27] Matching candidate actions: - [Candidate 1]: Matches [Body Part A] [Movement Type] - [Candidate 2]: Matches [Body Part B] [Movement Type]

[28] E LLM CLARIFICATION We clarify the role of Large Language Models (LLMs) in the development of this manuscript
Pattern comparison for each candidate: - [Candidate 1]: [Score] - [Matching Details] - [Candidate 2]: [Score] - [Matching Details] </think> <answer>action-type</answer> """) This structured prompt ensures our dataset contains diverse and causally plausible reasoning processes, which are critical for cultivating the model’s foundational perception and pl...
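Entry [17] preserves the output format of the first-round tool-selection prompt: a `<think>` block followed by an `<action>` wrapper holding four yes/no tags. A response in that format could be parsed as sketched below. This is an illustration, not the authors' code: the function name is hypothetical, and the mapping of tags to tools (`human` for person detection, `pose` for pose estimation, the inner `action` for noun explanation, `video` for video description) is inferred from the tool list in entries [5] to [8].

```python
import re

# Tag names as they appear in the prompt's output format.
TOOL_TAGS = ("human", "pose", "action", "video")


def parse_tool_selection(response: str) -> dict:
    """Return {tag: bool} for the four yes/no tool decisions in a response."""
    # Drop the reasoning so tags mentioned inside <think> are not matched.
    body = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL)
    decisions = {}
    for tag in TOOL_TAGS:
        # The inner <action> tag shares its name with the wrapper, so match
        # only tags whose content is exactly "yes" or "no".
        match = re.search(
            rf"<{tag}>\s*(yes|no)\s*</{tag}>", body, flags=re.IGNORECASE
        )
        if match is None:
            raise ValueError(f"missing or malformed <{tag}> decision")
        decisions[tag] = match.group(1).lower() == "yes"
    return decisions
```

Raising on a missing tag, rather than defaulting to "no", makes malformed responses visible; a structural-coherence check could use the same failure as its signal.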
