FIKA-Bench shows that the best large multimodal models and tool-using agents reach only 25.1% accuracy on fine-grained knowledge acquisition, with failures driven by wrong retrieval and poor visual judgment.
Visualwebarena: Evaluating multimodal agents on realistic visual web tasks
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 3representative citing papers
State-conditioned commitment depth in a vision-language policy Pareto-dominates fixed-depth baselines on Sliding Puzzle and Sokoban, raising solve rates by up to 12.5 points while using 25% fewer actions and beating larger models.
The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment interventions.
citing papers explorer
-
FIKA-Bench: From Fine-grained Recognition to Fine-Grained Knowledge Acquisition
FIKA-Bench shows that the best large multimodal models and tool-using agents reach only 25.1% accuracy on fine-grained knowledge acquisition, with failures driven by wrong retrieval and poor visual judgment.
-
When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning
State-conditioned commitment depth in a vision-language policy Pareto-dominates fixed-depth baselines on Sliding Puzzle and Sokoban, raising solve rates by up to 12.5 points while using 25% fewer actions and beating larger models.
-
Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability
The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment interventions.