ScriptHOI: Learning Scripted State Transitions for Open-Vocabulary Human-Object Interaction Detection
Pith reviewed 2026-05-13 01:40 UTC · model grok-4.3
The pith
Decomposing interaction phrases into state slots lets a detector verify multiple visual cues jointly, which improves detection of rare and unseen human-object interactions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ScriptHOI represents each interaction phrase as a soft scripted state transition decomposed into body-role, contact, geometry, affordance, motion, and object-state slots. A visual state tokenizer parses each detected human-object pair into corresponding state tokens, and a slot-wise matcher estimates both script coverage and script conflict to calibrate HOI logits, expose missing visual evidence, and provide training constraints for incomplete annotations through interval partial-label learning and counterfactual script contrast loss.
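As a rough illustration of the calibration step described above, the following Python sketch computes a coverage score and a conflict score from per-slot token distributions and shifts an HOI logit accordingly. Only the six slot names come from the paper. The overlap-based scoring rule, the `forbidden` token sets, the weights `alpha` and `beta`, and all token names are assumptions made for illustration; this summary does not state the paper's actual formulas.

```python
from typing import Dict, Set, Tuple

# The six slots named in the paper; everything else below is hypothetical.
SLOTS = ("body_role", "contact", "geometry", "affordance", "motion", "object_state")

Dist = Dict[str, float]  # token -> probability within one slot


def coverage_and_conflict(
    expected: Dict[str, Dist],
    forbidden: Dict[str, Set[str]],
    observed: Dict[str, Dist],
) -> Tuple[float, float]:
    """Return (coverage, conflict), averaged over the slots the script specifies."""
    slots = [s for s in SLOTS if s in expected]
    coverage = 0.0
    conflict = 0.0
    for slot in slots:
        obs = observed.get(slot, {})
        # Coverage: probability mass shared by the script and the visual tokens.
        coverage += sum(min(p, obs.get(tok, 0.0)) for tok, p in expected[slot].items())
        # Conflict: visual probability mass on tokens the script rules out.
        conflict += sum(obs.get(tok, 0.0) for tok in forbidden.get(slot, set()))
    n = max(len(slots), 1)
    return coverage / n, conflict / n


def calibrate(logit: float, coverage: float, conflict: float,
              alpha: float = 1.0, beta: float = 1.0) -> float:
    """Assumed additive calibration: reward coverage, penalize conflict."""
    return logit + alpha * coverage - beta * conflict


# Toy "cut cake" script where the image shows a knife near an intact cake.
expected = {"contact": {"hand_holds_tool": 1.0}, "object_state": {"partially_cut": 1.0}}
forbidden = {"contact": {"no_contact"}}
observed = {
    "contact": {"no_contact": 0.9, "hand_holds_tool": 0.1},
    "object_state": {"intact": 1.0},
}
coverage, conflict = coverage_and_conflict(expected, forbidden, observed)
calibrated = calibrate(2.0, coverage, conflict)
```

In this toy case the script is barely realized and strongly contradicted, so the calibrated logit ends up below the raw logit of 2.0, which is the qualitative behavior the core claim describes.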
What carries the argument
The soft scripted state transition, which breaks an interaction phrase into six slots to jointly check whether visual evidence supports the action rather than relying on object affordance alone.
If this is right
- Higher accuracy on rare and unseen interaction classes in benchmarks like HICO-DET and V-COCO.
- Fewer false positives from cases where object affordance suggests an action the visual states do not support.
- Logit calibration that raises or lowers scores according to how completely a script is visually realized.
- Training signals that bound probabilities for unannotated candidates instead of treating them as strict negatives.
- Reduced reliance on object-only cues through losses that swap individual script slots.
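The difference between bounding an unannotated candidate and treating it as a strict negative can be sketched in a few lines of Python. The hinge-style penalty below is an assumed form chosen for clarity, not the paper's stated loss, and the bounds are passed in directly because this summary does not say how they are derived from the script.

```python
import math


def closed_world_negative_loss(p: float) -> float:
    """Binary cross-entropy against label 0: penalizes any nonzero probability."""
    return -math.log(1.0 - p)


def interval_penalty(p: float, lower: float, upper: float) -> float:
    """Zero inside [lower, upper]; grows linearly with the distance outside it."""
    if not 0.0 <= lower <= upper <= 1.0:
        raise ValueError("bounds must satisfy 0 <= lower <= upper <= 1")
    return max(0.0, lower - p) + max(0.0, p - upper)


# A plausible but unannotated candidate predicted at p = 0.5.
inside = interval_penalty(0.5, lower=0.2, upper=0.8)   # within the bounds
above = interval_penalty(0.9, lower=0.2, upper=0.8)    # too confident
below = interval_penalty(0.1, lower=0.2, upper=0.8)    # suppressed too far
strict = closed_world_negative_loss(0.5)               # closed-world treatment
```

The closed-world loss pushes the prediction toward zero even when it sits at 0.5, whereas the interval penalty leaves it alone as long as it stays inside the script-derived range.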
Where Pith is reading between the lines
- The slot structure could transfer to video settings by adding temporal consistency checks across frames.
- Real-world deployment might benefit from scripts that also encode scene context to handle cluttered environments.
- Extending coverage to multi-person or tool-use sequences would test whether the same calibration logic scales.
Load-bearing premise
The visual state tokenizer can reliably parse human-object pairs into accurate tokens across all six slots, and the resulting coverage and conflict scores give valid calibration without new biases or overlooked cues.
What would settle it
A dataset of human-object pairs with expert-annotated states for each slot, on which the tokenizer matches the labels only at chance level and the rare-class gains disappear on held-out splits.
Original abstract
Open-vocabulary human-object interaction (HOI) detection requires recognizing interaction phrases that may not appear as annotated categories during training. Recent vision-language HOI detectors improve semantic transfer by matching human-object features with text embeddings, but their predictions are often dominated by object affordance and phrase-level co-occurrence. As a result, a model may predict "cut cake" from the presence of a knife and a cake without verifying whether the hand, tool, target, contact pattern, and object state jointly support the action. We propose ScriptHOI, a structured framework that represents each interaction phrase as a soft scripted state transition. Rather than treating a phrase as a single class token, ScriptHOI decomposes it into body-role, contact, geometry, affordance, motion, and object-state slots. A visual state tokenizer parses each detected human-object pair into corresponding state tokens, and a slot-wise matcher estimates both script coverage and script conflict. These two quantities calibrate HOI logits, expose missing visual evidence, and provide training constraints for incomplete annotations. To avoid suppressing valid but unannotated interactions, we further introduce interval partial-label learning, which constrains unannotated candidates with script-derived lower and upper probability bounds instead of assigning closed-world negatives. A counterfactual script contrast loss swaps individual script slots to discourage object-only shortcuts. Experiments on HICO-DET, V-COCO, and open-vocabulary HOI splits show that ScriptHOI improves rare and unseen interaction recognition while substantially reducing affordance-conflict false positives.
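The slot-swapping idea behind the counterfactual script contrast loss can be illustrated with a small Python sketch. The hinge form, the `margin` value, the toy scorers, and the token names are all hypothetical; the abstract only states that individual script slots are swapped to discourage object-only shortcuts.

```python
from typing import Callable, Dict

Script = Dict[str, str]  # slot name -> expected token (hypothetical hard-token form)


def swap_slot(script: Script, slot: str, new_token: str) -> Script:
    """Return a copy of the script with exactly one slot replaced."""
    if slot not in script:
        raise KeyError(slot)
    counterfactual = dict(script)
    counterfactual[slot] = new_token
    return counterfactual


def contrast_loss(score: Callable[[Script], float], script: Script,
                  counterfactual: Script, margin: float = 0.5) -> float:
    """Hinge loss: the true script should outscore its counterfactual by `margin`."""
    return max(0.0, margin - (score(script) - score(counterfactual)))


# Tokens a (hypothetical) visual state tokenizer produced for one human-object pair.
observed = {"contact": "hand_holds_tool", "object_state": "partially_cut"}


def slot_aware_score(script: Script) -> float:
    """Fraction of script slots that match the observed tokens."""
    return sum(observed.get(s) == t for s, t in script.items()) / len(script)


def object_only_score(script: Script) -> float:
    """A shortcut scorer that ignores the slots entirely."""
    return 1.0


cut_cake = {"contact": "hand_holds_tool", "object_state": "partially_cut"}
counterfactual = swap_slot(cut_cake, "contact", "no_contact")

loss_slot_aware = contrast_loss(slot_aware_score, cut_cake, counterfactual)
loss_shortcut = contrast_loss(object_only_score, cut_cake, counterfactual)
```

A scorer that reacts to the swapped slot incurs no loss, while the object-only scorer cannot separate the script from its counterfactual and pays the full margin.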
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ScriptHOI, a framework for open-vocabulary HOI detection that represents each interaction as a soft scripted state transition decomposed into six slots (body-role, contact, geometry, affordance, motion, object-state). A visual state tokenizer parses detected human-object pairs into state tokens; a slot-wise matcher then computes script coverage and script conflict to calibrate logits, expose missing evidence, and supply training constraints. Interval partial-label learning replaces closed-world negatives with script-derived probability bounds, and a counterfactual script contrast loss discourages object-only shortcuts. Experiments on HICO-DET, V-COCO, and open-vocabulary splits report gains on rare/unseen classes and fewer affordance-conflict false positives.
Significance. If the structured components prove load-bearing, the work offers a concrete mechanism for injecting state-transition logic into vision-language HOI detectors, addressing the well-known problem of affordance and co-occurrence shortcuts. The interval partial-label learning and script-derived bounds directly target incomplete annotation, a persistent issue in HOI benchmarks. The paper ships no machine-checked proofs or parameter-free derivations, but the explicit decomposition into slots and the counterfactual contrast loss constitute reproducible design choices that could be tested on other structured-prediction tasks.
major comments (3)
- [§4.2] §4.2 (visual state tokenizer): the manuscript provides no independent quantitative evaluation of tokenizer accuracy on the six slots (e.g., per-slot precision/recall against held-out state annotations). Without this, it is impossible to determine whether script coverage and conflict estimates are reliable or whether they simply add auxiliary supervision that any multi-task detector could exploit.
- [§5.3] §5.3 (ablation on script coverage/conflict): the reported gains on rare/unseen splits are not isolated from the auxiliary losses; an ablation that removes only the coverage/conflict calibration while retaining the tokenizer and contrast loss is missing. This leaves open the possibility that improvements derive from regularization rather than enforced state-transition logic.
- [§3.3] §3.3 (interval partial-label learning): the derivation of lower/upper probability bounds from script coverage is not shown to be unbiased with respect to the original annotation distribution. If scripts are manually authored, incomplete script coverage could systematically under-estimate valid but unscripted interactions, undermining the claim that the method avoids suppressing unannotated positives.
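The per-slot tokenizer evaluation requested in the first major comment could look roughly like the Python sketch below. It assumes a hypothetical set of held-out records of the form (slot, predicted token, gold token), which the manuscript does not currently provide, and it reports macro-averaged precision and recall per slot. The record format and token names are illustrative only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, str, str]  # (slot, predicted_token, gold_token)


def per_slot_precision_recall(records: Iterable[Record]) -> Dict[str, Tuple[float, float]]:
    """Return slot -> (macro precision, macro recall) over the tokens seen in that slot."""
    true_pos = defaultdict(int)
    pred_count = defaultdict(int)
    gold_count = defaultdict(int)
    tokens = defaultdict(set)
    for slot, pred, gold in records:
        pred_count[(slot, pred)] += 1
        gold_count[(slot, gold)] += 1
        tokens[slot].update((pred, gold))
        if pred == gold:
            true_pos[(slot, gold)] += 1
    result = {}
    for slot, toks in tokens.items():
        # Precision is defined only for tokens that were predicted at least once,
        # recall only for tokens that appear in the gold labels.
        precisions = [true_pos[(slot, t)] / pred_count[(slot, t)]
                      for t in toks if pred_count[(slot, t)]]
        recalls = [true_pos[(slot, t)] / gold_count[(slot, t)]
                   for t in toks if gold_count[(slot, t)]]
        result[slot] = (sum(precisions) / len(precisions), sum(recalls) / len(recalls))
    return result


records = [
    ("contact", "holding", "holding"),
    ("contact", "holding", "touching"),
    ("contact", "touching", "touching"),
    ("motion", "static", "static"),
]
metrics = per_slot_precision_recall(records)
```

Reporting these numbers per slot would show whether low-quality slots (for example motion in still images) drive the coverage and conflict estimates.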
minor comments (2)
- [§3.1] Notation for the six state slots is introduced in the abstract but the precise token vocabulary size and embedding dimension for each slot are not stated until the implementation details; moving this information to §3.1 would improve readability.
- [Figure 2] Figure 2 (slot-wise matcher diagram) uses the same color for 'coverage' and 'conflict' arrows; distinct colors or hatching would reduce visual ambiguity.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below and indicate planned revisions to the manuscript where appropriate.
- Referee: [§4.2] §4.2 (visual state tokenizer): the manuscript provides no independent quantitative evaluation of tokenizer accuracy on the six slots (e.g., per-slot precision/recall against held-out state annotations). Without this, it is impossible to determine whether script coverage and conflict estimates are reliable or whether they simply add auxiliary supervision that any multi-task detector could exploit.
  Authors: We agree that an independent quantitative evaluation of the tokenizer on the six slots would strengthen the claims regarding the reliability of coverage and conflict estimates. The current manuscript does not include held-out state annotations for these slots, as generating them would require substantial new labeling effort outside the paper's scope. The tokenizer is trained end-to-end, and its utility is shown through overall gains on rare/unseen HOI classes plus qualitative reductions in affordance conflicts. We will add qualitative visualizations of tokenizer outputs on example pairs in the revision.
  Revision: partial
- Referee: [§5.3] §5.3 (ablation on script coverage/conflict): the reported gains on rare/unseen splits are not isolated from the auxiliary losses; an ablation that removes only the coverage/conflict calibration while retaining the tokenizer and contrast loss is missing. This leaves open the possibility that improvements derive from regularization rather than enforced state-transition logic.
  Authors: The referee correctly notes the missing ablation. We will add an experiment that removes only the script coverage and conflict calibration while retaining the visual state tokenizer and counterfactual contrast loss. This will isolate whether gains arise from state-transition logic versus general regularization.
  Revision: yes
- Referee: [§3.3] §3.3 (interval partial-label learning): the derivation of lower/upper probability bounds from script coverage is not shown to be unbiased with respect to the original annotation distribution. If scripts are manually authored, incomplete script coverage could systematically under-estimate valid but unscripted interactions, undermining the claim that the method avoids suppressing unannotated positives.
  Authors: The bounds are derived conservatively from script coverage to avoid hard negatives on unannotated candidates. We do not provide a formal proof of unbiasedness, but the design uses loose intervals to accommodate potential unscripted interactions, and experiments show gains on rare classes without suppressing annotated positives. We will expand §3.3 with this rationale and empirical support.
  Revision: partial
- Independent quantitative per-slot evaluation of the visual state tokenizer, due to absence of held-out state annotations in the current experimental setup.
Circularity Check
Low circularity: script coverage/conflict computed from visual tokenizer rather than target labels
The framework decomposes phrases into slots, uses a visual state tokenizer on detected pairs to produce tokens, then computes coverage and conflict from those tokens to calibrate logits. These quantities are derived from visual inputs and the proposed tokenizer, not defined directly from HOI class labels by construction. Interval partial-label learning and counterfactual contrast loss add constraints without reducing the central claims to fitted inputs or self-citations. No load-bearing self-citation, uniqueness theorem, or ansatz smuggling appears in the derivation chain. The paper is self-contained against external benchmarks with independent visual processing steps.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Human-object interactions can be decomposed into the six slots of body-role, contact, geometry, affordance, motion, and object-state.
- domain assumption: A visual state tokenizer can parse detected human-object pairs into corresponding state tokens.
invented entities (3)
- script coverage: no independent evidence
- script conflict: no independent evidence
- interval partial-label learning: no independent evidence