ScriptHOI: Learning Scripted State Transitions for Open-Vocabulary Human-Object Interaction Detection

Bao Ngoc Le; Linh Chi Vo; Minh Anh Nguyen; Quang Huy Tran; Suiyang Guang; Tuan Kiet Pham

arxiv: 2605.05057 · v3 · pith:4PH2U7EJnew · submitted 2026-05-06 · 💻 cs.CV

ScriptHOI: Learning Scripted State Transitions for Open-Vocabulary Human-Object Interaction Detection

Minh Anh Nguyen , Quang Huy Tran , Bao Ngoc Le , SuiYang Guang , Tuan Kiet Pham , Linh Chi Vo This is my paper

Pith reviewed 2026-05-13 01:40 UTC · model grok-4.3

classification 💻 cs.CV

keywords human-object interaction detectionopen-vocabulary learningstate transitionsscripted interactionspartial label learningvision-language modelsaffordance modeling

0 comments

The pith

Decomposing interaction phrases into state slots verifies multiple visual cues and improves rare and unseen human-object interaction detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current open-vocabulary detectors often predict actions from object presence and typical co-occurrences, such as guessing cut cake from a knife and cake without checking hand position or actual contact. ScriptHOI instead models each phrase as a soft scripted state transition split across six slots for body-role, contact, geometry, affordance, motion, and object state. A tokenizer turns visual human-object pairs into state tokens, while a matcher computes coverage and conflict scores to adjust logits and add training constraints. Interval partial-label learning handles missing annotations, and a contrast loss prevents object-only shortcuts. The approach yields gains on infrequent and novel interactions plus fewer false positives driven by affordance mismatches.

Core claim

ScriptHOI represents each interaction phrase as a soft scripted state transition decomposed into body-role, contact, geometry, affordance, motion, and object-state slots. A visual state tokenizer parses each detected human-object pair into corresponding state tokens, and a slot-wise matcher estimates both script coverage and script conflict to calibrate HOI logits, expose missing visual evidence, and provide training constraints for incomplete annotations through interval partial-label learning and counterfactual script contrast loss.

What carries the argument

The soft scripted state transition, which breaks an interaction phrase into six slots to jointly check whether visual evidence supports the action rather than relying on object affordance alone.

If this is right

Higher accuracy on rare and unseen interaction classes in benchmarks like HICO-DET and V-COCO.
Fewer false positives from cases where object affordance suggests an action the visual states do not support.
Logit calibration that raises or lowers scores according to how completely a script is visually realized.
Training signals that bound probabilities for unannotated candidates instead of treating them as strict negatives.
Reduced reliance on object-only cues through losses that swap individual script slots.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The slot structure could transfer to video settings by adding temporal consistency checks across frames.
Real-world deployment might benefit from scripts that also encode scene context to handle cluttered environments.
Extending coverage to multi-person or tool-use sequences would test whether the same calibration logic scales.

Load-bearing premise

The visual state tokenizer can reliably parse human-object pairs into accurate tokens across all six slots, and the resulting coverage and conflict scores give valid calibration without new biases or overlooked cues.

What would settle it

A dataset of human-object pairs with expert-annotated states for each slot where the tokenizer matches labels at chance level and rare-class gains disappear on held-out splits.

Figures

Figures reproduced from arXiv: 2605.05057 by Bao Ngoc Le, Linh Chi Vo, Minh Anh Nguyen, Quang Huy Tran, Suiyang Guang, Tuan Kiet Pham.

**Figure 1.** Figure 1: Overall framework of ScriptHOI. The visual branch parses a detected human-object pair into state tokens, while the language view at source ↗

read the original abstract

Open-vocabulary human-object interaction (HOI) detection requires recognizing interaction phrases that may not appear as annotated categories during training. Recent vision-language HOI detectors improve semantic transfer by matching human-object features with text embeddings, but their predictions are often dominated by object affordance and phrase-level co-occurrence. As a result, a model may predict \textit{cut cake} from the presence of a knife and a cake without verifying whether the hand, tool, target, contact pattern, and object state jointly support the action. We propose \textbf{ScriptHOI}, a structured framework that represents each interaction phrase as a soft scripted state transition. Rather than treating a phrase as a single class token, ScriptHOI decomposes it into body-role, contact, geometry, affordance, motion, and object-state slots. A visual state tokenizer parses each detected human-object pair into corresponding state tokens, and a slot-wise matcher estimates both script coverage and script conflict. These two quantities calibrate HOI logits, expose missing visual evidence, and provide training constraints for incomplete annotations. To avoid suppressing valid but unannotated interactions, we further introduce interval partial-label learning, which constrains unannotated candidates with script-derived lower and upper probability bounds instead of assigning closed-world negatives. A counterfactual script contrast loss swaps individual script slots to discourage object-only shortcuts. Experiments on HICO-DET, V-COCO, and open-vocabulary HOI splits show that ScriptHOI improves rare and unseen interaction recognition while substantially reducing affordance-conflict false positives.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ScriptHOI decomposes HOI phrases into six state slots with a visual tokenizer and script coverage/conflict to reduce affordance shortcuts, but the gains likely trace more to the new losses than to the scripts themselves.

read the letter

ScriptHOI represents each interaction as a soft scripted state transition broken into body-role, contact, geometry, affordance, motion, and object-state slots. A visual state tokenizer turns detected human-object pairs into tokens, a slot-wise matcher computes coverage and conflict to adjust logits, and two new losses handle partial labels and discourage object-only shortcuts. The abstract and results claim clearer gains on rare and unseen classes plus fewer affordance false positives on HICO-DET, V-COCO, and open-vocab splits.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces ScriptHOI, a framework for open-vocabulary HOI detection that represents each interaction as a soft scripted state transition decomposed into six slots (body-role, contact, geometry, affordance, motion, object-state). A visual state tokenizer parses detected human-object pairs into state tokens; a slot-wise matcher then computes script coverage and script conflict to calibrate logits, expose missing evidence, and supply training constraints. Interval partial-label learning replaces closed-world negatives with script-derived probability bounds, and a counterfactual script contrast loss discourages object-only shortcuts. Experiments on HICO-DET, V-COCO, and open-vocabulary splits report gains on rare/unseen classes and fewer affordance-conflict false positives.

Significance. If the structured components prove load-bearing, the work offers a concrete mechanism for injecting state-transition logic into vision-language HOI detectors, addressing the well-known problem of affordance and co-occurrence shortcuts. The interval partial-label learning and script-derived bounds directly target incomplete annotation, a persistent issue in HOI benchmarks. The paper ships no machine-checked proofs or parameter-free derivations, but the explicit decomposition into slots and the counterfactual contrast loss constitute reproducible design choices that could be tested on other structured-prediction tasks.

major comments (3)

[§4.2] §4.2 (visual state tokenizer): the manuscript provides no independent quantitative evaluation of tokenizer accuracy on the six slots (e.g., per-slot precision/recall against held-out state annotations). Without this, it is impossible to determine whether script coverage and conflict estimates are reliable or whether they simply add auxiliary supervision that any multi-task detector could exploit.
[§5.3] §5.3 (ablation on script coverage/conflict): the reported gains on rare/unseen splits are not isolated from the auxiliary losses; an ablation that removes only the coverage/conflict calibration while retaining the tokenizer and contrast loss is missing. This leaves open the possibility that improvements derive from regularization rather than enforced state-transition logic.
[§3.3] §3.3 (interval partial-label learning): the derivation of lower/upper probability bounds from script coverage is not shown to be unbiased with respect to the original annotation distribution. If scripts are manually authored, incomplete script coverage could systematically under-estimate valid but unscripted interactions, undermining the claim that the method avoids suppressing unannotated positives.

minor comments (2)

[§3.1] Notation for the six state slots is introduced in the abstract but the precise token vocabulary size and embedding dimension for each slot are not stated until the implementation details; moving this information to §3.1 would improve readability.
[Figure 2] Figure 2 (slot-wise matcher diagram) uses the same color for 'coverage' and 'conflict' arrows; distinct colors or hatching would reduce visual ambiguity.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and indicate planned revisions to the manuscript where appropriate.

read point-by-point responses

Referee: [§4.2] §4.2 (visual state tokenizer): the manuscript provides no independent quantitative evaluation of tokenizer accuracy on the six slots (e.g., per-slot precision/recall against held-out state annotations). Without this, it is impossible to determine whether script coverage and conflict estimates are reliable or whether they simply add auxiliary supervision that any multi-task detector could exploit.

Authors: We agree that an independent quantitative evaluation of the tokenizer on the six slots would strengthen the claims regarding the reliability of coverage and conflict estimates. The current manuscript does not include held-out state annotations for these slots, as generating them would require substantial new labeling effort outside the paper's scope. The tokenizer is trained end-to-end, and its utility is shown through overall gains on rare/unseen HOI classes plus qualitative reductions in affordance conflicts. We will add qualitative visualizations of tokenizer outputs on example pairs in the revision. revision: partial
Referee: [§5.3] §5.3 (ablation on script coverage/conflict): the reported gains on rare/unseen splits are not isolated from the auxiliary losses; an ablation that removes only the coverage/conflict calibration while retaining the tokenizer and contrast loss is missing. This leaves open the possibility that improvements derive from regularization rather than enforced state-transition logic.

Authors: The referee correctly notes the missing ablation. We will add an experiment that removes only the script coverage and conflict calibration while retaining the visual state tokenizer and counterfactual contrast loss. This will isolate whether gains arise from state-transition logic versus general regularization. revision: yes
Referee: [§3.3] §3.3 (interval partial-label learning): the derivation of lower/upper probability bounds from script coverage is not shown to be unbiased with respect to the original annotation distribution. If scripts are manually authored, incomplete script coverage could systematically under-estimate valid but unscripted interactions, undermining the claim that the method avoids suppressing unannotated positives.

Authors: The bounds are derived conservatively from script coverage to avoid hard negatives on unannotated candidates. We do not provide a formal proof of unbiasedness, but the design uses loose intervals to accommodate potential unscripted interactions, and experiments show gains on rare classes without suppressing annotated positives. We will expand §3.3 with this rationale and empirical support. revision: partial

standing simulated objections not resolved

Independent quantitative per-slot evaluation of the visual state tokenizer, due to absence of held-out state annotations in the current experimental setup.

Circularity Check

0 steps flagged

Low circularity: script coverage/conflict computed from visual tokenizer rather than target labels

full rationale

The framework decomposes phrases into slots, uses a visual state tokenizer on detected pairs to produce tokens, then computes coverage and conflict from those tokens to calibrate logits. These quantities are derived from visual inputs and the proposed tokenizer, not defined directly from HOI class labels by construction. Interval partial-label learning and counterfactual contrast loss add constraints without reducing the central claims to fitted inputs or self-citations. No load-bearing self-citation, uniqueness theorem, or ansatz smuggling appears in the derivation chain. The paper is self-contained against external benchmarks with independent visual processing steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 3 invented entities

The central claim rests on domain assumptions about interaction decomposability and introduces several new entities without independent evidence provided in the abstract.

axioms (2)

domain assumption Human-object interactions can be decomposed into the six slots of body-role, contact, geometry, affordance, motion, and object-state.
This decomposition is the foundation of the ScriptHOI representation.
domain assumption A visual state tokenizer can parse detected human-object pairs into corresponding state tokens.
Required for the slot-wise matcher to operate.

invented entities (3)

script coverage no independent evidence
purpose: Estimates the degree to which visual evidence supports the interaction script
New quantity used to calibrate HOI logits and expose missing evidence.
script conflict no independent evidence
purpose: Identifies inconsistencies or missing visual support for the script
Used alongside coverage to adjust predictions.
interval partial-label learning no independent evidence
purpose: Provides probability bounds for unannotated interaction candidates during training
New constraint to avoid treating valid but unlabeled interactions as negatives.

pith-pipeline@v0.9.0 · 5594 in / 1522 out tokens · 52545 ms · 2026-05-13T01:40:01.161646+00:00 · methodology

Review history (2 revisions) →

ScriptHOI: Learning Scripted State Transitions for Open-Vocabulary Human-Object Interaction Detection

Core claim

What carries the argument

If this is right

Where Pith is reading between the lines

Load-bearing premise

What would settle it

discussion (0)