arxiv: 2510.08480 · v2 · submitted 2025-10-09 · 💻 cs.CV

Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools

Pith reviewed 2026-05-18 08:38 UTC · model grok-4.3

classification 💻 cs.CV
keywords open-vocabulary action recognition · video understanding · sub-motion decomposition · tool-augmented reinforcement learning · multimodal large language models · cross-modal hallucination · hierarchical reward

The pith

Video-STAR decomposes video actions into sub-motions and augments MLLMs with tool-using reinforcement learning to improve open-vocabulary recognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Video-STAR as a way to move beyond treating video actions as single units in multimodal models. It breaks each action into smaller discriminative sub-motions for more precise matching and lets the model call external tools on demand through reinforcement learning. A hierarchical reward balances how often tools are used, how relevant the sub-motions are, and how coherent the overall reasoning stays. This setup is meant to reduce text-driven hallucinations and shift the model toward evidence grounded in the actual video frames. Results on HMDB-51, UCF-101, Something-Something V2, Kinetics-400, and Kinetics-600 show gains over prior methods at distinguishing similar actions.

Core claim

Video-STAR harmonizes contextual sub-motion decomposition with tool-augmented reinforcement learning for open-vocabulary action recognition. Actions are decomposed into discriminative sub-motions that support fine-grained matching; domain-specific tools are invoked dynamically for cross-modal interleaving. A hierarchical reward that weighs tool-usage efficiency, sub-motion relevance, and structural coherence lets the model learn to prioritize useful patterns without explicit supervision, moving from text-centric priors to visually grounded inference.
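The three reward terms named above can be made concrete with a small sketch. This is only an illustration under stated assumptions: it combines the terms as a plain weighted sum, and the weights, the efficiency definition, and every name (`RewardWeights`, `tool_efficiency`, `hierarchical_reward`) are hypothetical. The text reviewed here does not give the paper's actual formulation or weight values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights; the source does not state the values used.
    tool_efficiency: float = 0.2
    submotion_relevance: float = 0.5
    structural_coherence: float = 0.3


def tool_efficiency(num_calls: int, num_useful_calls: int) -> float:
    """Fraction of tool calls judged useful; 1.0 when no tool was called."""
    if not 0 <= num_useful_calls <= num_calls:
        raise ValueError("need 0 <= num_useful_calls <= num_calls")
    if num_calls == 0:
        return 1.0
    return num_useful_calls / num_calls


def hierarchical_reward(
    num_calls: int,
    num_useful_calls: int,
    relevance: float,
    coherence: float,
    weights: RewardWeights = RewardWeights(),
) -> float:
    """Weighted sum of three terms, each expected to lie in [0, 1]."""
    for name, value in (("relevance", relevance), ("coherence", coherence)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return (
        weights.tool_efficiency * tool_efficiency(num_calls, num_useful_calls)
        + weights.submotion_relevance * relevance
        + weights.structural_coherence * coherence
    )
```

Under this assumed form, a rollout that makes four tool calls of which only one helps scores lower than a rollout that makes a single useful call, all else equal. That is the kind of pressure toward economical tool use the core claim describes.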

What carries the argument

Contextual sub-motion decomposition combined with tool-augmented reinforcement learning guided by a hierarchical reward.

If this is right

  • Fine-grained action distinctions become feasible without category-specific training data.
  • Cross-modal hallucinations decrease because visual sub-motion evidence is interleaved with text reasoning.
  • The model acquires category-specific reasoning capacity through autonomous tool use rather than hand-crafted prompts.
  • Performance improves on both small and large benchmarks including Kinetics-400 and Kinetics-600.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decomposition-plus-tool pattern could be tested on video question answering or temporal localization tasks.
  • If tool calls remain reliable, the approach might lower the amount of paired video-text data needed for new domains.
  • Hierarchical rewards that explicitly penalize unnecessary tool use could transfer to other agent-style vision-language systems.

Load-bearing premise

Actions can be broken into stable, discriminative sub-motions that improve matching, and external tools can be invoked without adding new errors.

What would settle it

The claim would be refuted if, on a held-out video dataset containing many near-identical actions, the full Video-STAR pipeline showed no accuracy gain, or a higher hallucination rate, relative to a version that skips sub-motion decomposition and tool calls.
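One way such a comparison could be scored is a paired bootstrap over per-video correctness flags. This gives an interval on the accuracy gap rather than a single number. The sketch below is an assumption about how a reader might run the check, not a procedure taken from the paper; the function name and inputs are hypothetical.

```python
import random
from typing import Sequence


def paired_bootstrap_accuracy_gap(
    full_correct: Sequence[bool],
    ablated_correct: Sequence[bool],
    num_resamples: int = 2000,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Return (observed gap, lower, upper) for accuracy(full) - accuracy(ablated).

    Both sequences hold one correctness flag per video, in the same video order.
    The bounds are the 2.5th and 97.5th percentiles of the resampled gaps.
    """
    if len(full_correct) != len(ablated_correct) or not full_correct:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(full_correct)
    # Per-video difference: +1 if only the full pipeline is right,
    # -1 if only the ablated pipeline is right, 0 if they agree.
    diffs = [int(f) - int(a) for f, a in zip(full_correct, ablated_correct)]
    observed = sum(diffs) / n
    rng = random.Random(seed)
    gaps = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(num_resamples))
    lower = gaps[int(0.025 * num_resamples)]
    upper = gaps[int(0.975 * num_resamples) - 1]
    return observed, lower, upper
```

An interval that contains zero on the near-identical-action subset would be the "no accuracy gain" outcome described above. A hallucination-rate comparison could reuse the same function with flags for "hallucinated" in place of "correct".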

Figures

Figures reproduced from arXiv: 2510.08480 by Chengxuan Qian, Dapeng Zhang, Jing Tang, Lei Sun, Rui Chen, Shuo Li, Xiangxiang Chu, Xiangyan Qu, Yiwei Wang, Yujun Cai, Zhenlong Yuan.

Figure 1: Key insight of Video-STAR. (a) MLLMs + CoT is prone to hallucinations due to over-reliance on text-centric reasoning while ignoring visual cues. (b) MLLMs + Tool-Augmented CoT mitigates hallucinations by integrating domain-specific tools to extract visual information. However, both (a) and (b) lack category-specific reasoning capabilities and struggle to distinguish semantically similar or complex actions.…
Figure 2: Pipeline of Video-STAR. (i) Introduce a three-stage sub-motion logic chain to construct tool-augmented reasoning data that decomposes actions into discriminative sub-motions. (ii) Pre-train the MLLMs on structured reasoning chains and fine-tune it for domain-specific adaptation. (iii) Adopt the GRPO algorithm for reinforcement learning, which optimizes a hierarchical reward function considering both tool-u…
Figure 3: Tool Library. Given the input video, Video-STAR adopts YOLO 11 for human detection and pose estimation, and the Qwen API for action explanation and video description.
Figure 4: Case Study between Qwen2.5-VL-3B and our Video-STAR-3B. Qwen2.5-VL-3B misclassifies action "turn" as "smile", while our Video-STAR-3B accurately identifies the correct action.
Figure 5: More Case Study between Qwen2.5-VL-3B and our Video-STAR-3B. Qwen2.5-VL-3B misclassifies "push" as "play", while our Video-STAR-3B accurately identifies the correct action.
Figure 6: More Case Study between Qwen2.5-VL-3B and our Video-STAR-3B. Our Video-STAR-3B can accurately identify the correct action.
Multimodal large language models (MLLMs) have demonstrated remarkable potential in bridging visual and textual reasoning, yet their reliance on text-centric priors often limits their ability to disentangle semantically similar actions in open-vocabulary scenarios. To address this, we propose Video-STAR, a framework that harmonizes contextual sub-motion decomposition with tool-augmented reinforcement learning for open-vocabulary action recognition (OVAR). Unlike prior methods that treat actions as monolithic entities, our approach innovatively decomposes actions into discriminative sub-motions for fine-grained matching while dynamically invoking domain-specific tools for cross-modal interleaving, thereby enabling category-specific reasoning capacity and reducing cross-modal hallucination. Moreover, by designing a hierarchical reward that balances tool-usage efficiency, sub-motion relevance, and structural coherence in reasoning, our method autonomously leverages external tools to prioritize sub-motion patterns without explicit supervision, transmitting from text-centric reasoning to visually grounded inference. Extensive evaluations on HMDB-51, UCF-101, SSv2, Kinetics-400, and Kinetics-600 datasets demonstrate our state-of-the-art performance, outperforming existing methods in distinguishing fine-grained actions and handling cross-modal hallucination, validating our excellent robustness and generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Video-STAR, a framework for open-vocabulary action recognition that decomposes video actions into discriminative sub-motions and employs tool-augmented reinforcement learning. A hierarchical reward balances tool-usage efficiency, sub-motion relevance, and structural coherence to enable autonomous tool invocation, reduce cross-modal hallucination, and shift from text-centric to visually grounded inference in MLLMs. The central claim is state-of-the-art performance with improved robustness on HMDB-51, UCF-101, SSv2, Kinetics-400, and Kinetics-600.

Significance. If the empirical results and ablations hold, the combination of sub-motion decomposition with unsupervised tool-augmented RL could offer a practical route to grounded reasoning in video MLLMs, particularly for fine-grained open-vocabulary tasks. The absence of explicit supervision on tool accuracy is a notable design choice that, if shown to work reliably, would strengthen the case for autonomous tool use in multimodal systems.

major comments (2)
  1. [Abstract] The SOTA and reduced-hallucination claims are stated without any accuracy numbers, error bars, ablation tables, or dataset-split details, which are load-bearing for evaluating whether the hierarchical reward actually delivers the reported gains on SSv2 and Kinetics-600.
  2. [Framework description, paragraph on hierarchical reward] No mechanism is specified that prevents the learned policy from invoking tools whose outputs are noisy or irrelevant; an incorrect tool result could introduce new cross-modal errors rather than reduce hallucination, directly affecting the robustness claim for fine-grained actions.
minor comments (1)
  1. [Framework] The notation for sub-motion decomposition and tool interleaving should be formalized with explicit equations or pseudocode to clarify how the reward components are computed.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications and indicate planned revisions to improve the manuscript.

  1. Referee: [Abstract] The SOTA and reduced-hallucination claims are stated without any accuracy numbers, error bars, ablation tables, or dataset-split details, which are load-bearing for evaluating whether the hierarchical reward actually delivers the reported gains on SSv2 and Kinetics-600.

    Authors: We agree that the abstract would benefit from greater specificity to support the claims. In the revised manuscript we will add key quantitative results (e.g., accuracy deltas on SSv2 and Kinetics-600), reference the ablation tables, and note the standard train/test splits used, thereby making the contribution of the hierarchical reward more transparent. revision: yes

  2. Referee: [Framework description, paragraph on hierarchical reward] No mechanism is specified that prevents the learned policy from invoking tools whose outputs are noisy or irrelevant; an incorrect tool result could introduce new cross-modal errors rather than reduce hallucination, directly affecting the robustness claim for fine-grained actions.

    Authors: The hierarchical reward is constructed precisely to discourage noisy tool use: its tool-usage-efficiency and sub-motion-relevance terms penalize invocations that fail to improve reasoning coherence or final recognition accuracy. The RL objective therefore trains the policy to avoid low-reward (i.e., noisy or irrelevant) tool calls. We will revise the framework section to state this incentive structure more explicitly and to reference supporting ablation results that show reduced erroneous tool invocations. revision: partial
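The Figure 2 caption names GRPO as the RL algorithm. Its central step, as commonly described, is to standardize each rollout's reward against the other rollouts sampled for the same input. The sketch below shows that step only, so the incentive in the rebuttal can be seen in numbers. The reward values in the usage note are made up, and using the population standard deviation with a small `eps` is an implementation choice assumed here, not something the source specifies.

```python
from statistics import mean, pstdev
from typing import Sequence


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-8
) -> list[float]:
    """Standardize rewards within one group of rollouts for the same input.

    Each advantage is (reward - group mean) / (group std + eps), so rollouts
    that score below their group's mean receive a negative advantage.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For example, with illustrative group rewards `[0.9, 0.9, 0.3]`, where the third rollout made a tool call that hurt its reward, the third advantage is negative and the first two are positive. The policy update therefore pushes probability away from the rollout with the unhelpful call. This only works to the extent that the reward actually drops when a tool output is noisy, which is the gap the referee's second comment points at.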

Circularity Check

0 steps flagged

No significant circularity in claimed derivation

The paper describes an empirical ML framework (sub-motion decomposition + hierarchical-reward RL for tool invocation) whose central claims are SOTA numbers on external benchmarks (HMDB-51, UCF-101, SSv2, Kinetics-400/600). No equations, fitted parameters, or self-citations are presented in the supplied text that reduce any reported performance metric or robustness claim to quantities defined from the same data or prior author work by construction. The hierarchical reward is introduced as a design choice whose effectiveness is assessed experimentally rather than assumed tautologically. This matches the default expectation for an applied computer-vision paper whose results rest on held-out dataset evaluations rather than a closed logical loop.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

Review based on abstract only; the central claim rests on the unproven premise that sub-motion decomposition is feasible and beneficial for all target actions plus the assumption that external tools can be reliably selected and applied by the RL agent.

free parameters (1)
  • hierarchical reward weights
    Balances tool-usage efficiency, sub-motion relevance, and structural coherence; values not specified in abstract.
axioms (2)
  • [domain assumption] Actions can be decomposed into discriminative sub-motions that support fine-grained cross-modal matching
    Invoked as the core innovation in the framework description.
  • [domain assumption] Domain-specific tools exist that can be dynamically invoked to interleave modalities and reduce hallucination
    Central to the tool-augmented RL component.
invented entities (1)
  • Video-STAR framework (no independent evidence)
    purpose: Harmonizes sub-motion decomposition with tool-augmented reinforcement learning for OVAR
    Newly proposed system; no independent evidence provided beyond the abstract claim.

pith-pipeline@v0.9.0 · 5774 in / 1540 out tokens · 40943 ms · 2026-05-18T08:38:44.960274+00:00 · methodology



Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ConeSep: Cone-based Robust Noise-Unlearning Compositional Network for Composed Image Retrieval

    cs.CV 2026-04 unverdicted novelty 7.0

    ConeSep tackles noisy triplet correspondences in composed image retrieval by introducing geometric fidelity quantization to locate noise, negative boundary learning for semantic opposites, and targeted unlearning via ...

  2. Dual-Pathway Circuits of Object Hallucination in Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Vision-language models contain identifiable grounding and hallucination pathways; suppressing the latter reduces object hallucinations by up to 76% while preserving accuracy.

  3. Air-Know: Arbiter-Calibrated Knowledge-Internalizing Robust Network for Composed Image Retrieval

    cs.CV 2026-04 unverdicted novelty 6.0

    Air-Know decouples MLLM-based external arbitration from proxy learning via knowledge internalization and dual-stream training to overcome noisy triplet correspondence in composed image retrieval.

  4. IndusAgent: Reinforcing Open-Vocabulary Industrial Anomaly Detection with Agentic Tools

    cs.CV 2026-05 unverdicted novelty 5.0

    IndusAgent achieves state-of-the-art zero-shot performance on industrial anomaly benchmarks by using a custom Indus-CoT dataset, dynamic tool orchestration, and gated RL to optimize anomaly classification, localizatio...
