{"paper":{"title":"ScriptHOI: Learning Scripted State Transitions for Open-Vocabulary Human-Object Interaction Detection","license":"http://creativecommons.org/licenses/by-sa/4.0/","headline":"Decomposing interaction phrases into state slots verifies multiple visual cues and improves rare and unseen human-object interaction detection.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Bao Ngoc Le, Linh Chi Vo, Minh Anh Nguyen, Quang Huy Tran, Suiyang Guang, Tuan Kiet Pham","submitted_at":"2026-05-06T15:52:35Z","abstract_excerpt":"Open-vocabulary human-object interaction (HOI) detection requires recognizing interaction phrases that may not appear as annotated categories during training. Recent vision-language HOI detectors improve semantic transfer by matching human-object features with text embeddings, but their predictions are often dominated by object affordance and phrase-level co-occurrence. As a result, a model may predict \\textit{cut cake} from the presence of a knife and a cake without verifying whether the hand, tool, target, contact pattern, and object state jointly support the action. We propose \\textbf{Scrip"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Experiments on HICO-DET, V-COCO, and open-vocabulary HOI splits show that ScriptHOI improves rare and unseen interaction recognition while substantially reducing affordance-conflict false positives.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the visual state tokenizer can reliably parse human-object pairs into accurate state tokens for the six slots and that script coverage and conflict estimates provide valid calibration without introducing new biases or missing critical visual cues.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"ScriptHOI improves rare and unseen HOI recognition by decomposing phrases into state slots, using visual tokenization and slot-wise matching for script coverage and conflict to calibrate predictions and constrain training on incomplete labels.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Decomposing interaction phrases into state slots verifies multiple visual cues and improves rare and unseen human-object interaction detection.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"6dcdb51ff20351e7ad5a0cfdc9f67c1dcdc67178829585d942dcadf506f08850"},"source":{"id":"2605.05057","kind":"arxiv","version":3},"verdict":{"id":"6f667bd3-8145-4ad6-bfe3-f36dba8bfad8","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-13T01:40:01.161646Z","strongest_claim":"Experiments on HICO-DET, V-COCO, and open-vocabulary HOI splits show that ScriptHOI improves rare and unseen interaction recognition while substantially reducing affordance-conflict false positives.","one_line_summary":"ScriptHOI improves rare and unseen HOI recognition by decomposing phrases into state slots, using visual tokenization and slot-wise matching for script coverage and conflict to calibrate predictions and constrain training on incomplete labels.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the visual state tokenizer can reliably parse human-object pairs into accurate state tokens for the six slots and that script coverage and conflict estimates provide valid calibration without introducing new biases or missing critical visual cues.","pith_extraction_headline":"Decomposing interaction phrases into state slots verifies multiple visual cues and improves rare and unseen human-object interaction detection."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2605.05057/integrity.json","findings":[],"available":true,"detectors_run":[{"name":"ai_meta_artifact","ran_at":"2026-05-20T10:38:08.925251Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_title_agreement","ran_at":"2026-05-19T21:31:19.711658Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_compliance","ran_at":"2026-05-19T13:52:20.973232Z","status":"completed","version":"1.0.0","findings_count":0}],"snapshot_sha256":"2e9bee609381b8443b1946e3510c11be6ba4736b92645b2b2c03a9038f2a4e72"},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}