Vision-language models for vision tasks: A survey
3 Pith papers cite this work.
citing papers explorer
- Re:Verse -- Can Your VLM Read a Manga?
  Current VLMs excel at individual manga panel interpretation but systematically fail at temporal causality and cross-panel cohesion in long-form narratives.
- Mobile GUI Agents under Real-world Threats: Are We There Yet?
  Introduces an app-content instrumentation framework and benchmark showing that examined GUI agents suffer 42.0% and 36.1% average misleading rates from third-party content in dynamic and static tests respectively.
- Jointly Learning Predicates and Actions Enables Zero-Shot Skill Composition
  PACTS jointly models action trajectories and predicate belief trajectories in a single generative policy, enabling zero-shot skill composition via symbolic planning without retraining.