{"paper":{"title":"Are Video Reasoning Models Ready to Go Outside?","license":"http://creativecommons.org/licenses/by/4.0/","headline":"A consistency-based training method called ROVA improves video reasoning models' accuracy by at least 24 percent under real-world video disturbances.","cross_cats":["cs.AI"],"primary_cat":"cs.CV","authors_text":"Changgyu Boo, Jaehong Yoon, Yangfan He","submitted_at":"2026-03-11T11:10:52Z","abstract_excerpt":"In real-world deployment, vision-language models often encounter disturbances such as weather, occlusion, and camera motion. Under such conditions, their understanding and reasoning degrade substantially, revealing a gap between clean, controlled (i.e., unperturbed) evaluation settings and real-world robustness. To address this limitation, we propose ROVA, a novel training framework that improves robustness by modeling a robustness-aware consistency reward under spatio-temporal corruptions. ROVA introduces a difficulty-aware online training strategy that prioritizes informative samples based o"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"ROVA effectively mitigates performance degradation, boosting relative accuracy by at least 24% and reasoning by over 9% compared with baseline models (QWen2.5/3-VL, InternVL2.5, Embodied-R). These gains transfer to clean standard benchmarks, yielding consistent improvements.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the specific spatio-temporal corruptions injected into PVRBench faithfully represent the distribution of real-world disturbances and that the self-reflective difficulty estimation does not introduce systematic bias in sample selection.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"ROVA boosts video reasoning model accuracy by at least 24% and reasoning by over 9% under realistic perturbations via a new consistency reward and adaptive training, while PVRBench reveals up to 35% drops in existing models.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A consistency-based training method called ROVA improves video reasoning models' accuracy by at least 24 percent under real-world video disturbances.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"30e4223c1a65938739e9c9287f0b8e2ec1a97ee72693e656e43c391e022aa12b"},"source":{"id":"2603.10652","kind":"arxiv","version":3},"verdict":{"id":"31dae356-4c65-4159-bcf6-2dd3460b07d6","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T13:43:25.056614Z","strongest_claim":"ROVA effectively mitigates performance degradation, boosting relative accuracy by at least 24% and reasoning by over 9% compared with baseline models (QWen2.5/3-VL, InternVL2.5, Embodied-R). These gains transfer to clean standard benchmarks, yielding consistent improvements.","one_line_summary":"ROVA boosts video reasoning model accuracy by at least 24% and reasoning by over 9% under realistic perturbations via a new consistency reward and adaptive training, while PVRBench reveals up to 35% drops in existing models.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the specific spatio-temporal corruptions injected into PVRBench faithfully represent the distribution of real-world disturbances and that the self-reflective difficulty estimation does not introduce systematic bias in sample selection.","pith_extraction_headline":"A consistency-based training method called ROVA improves video reasoning models' accuracy by at least 24 percent under real-world video disturbances."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2603.10652/integrity.json","findings":[],"available":true,"detectors_run":[],"snapshot_sha256":"c28c3603d3b5d939e8dc4c7e95fa8dfce3d595e45f758748cecf8e644a296938"},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":2,"snapshot_sha256":"b888605d871c9db4a22d7da66f0109db732707ecd01cc3715810f61647d0477d"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}