EO-Gym supplies an executable multimodal environment and 9k-trajectory benchmark that turns Earth Observation into a tool-using, multi-step reasoning task, revealing that current VLMs struggle on temporal and cross-sensor workflows while fine-tuning lifts Pass@3 from 0.49 to 0.74.
Title resolution pending
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 3verdicts
UNVERDICTED 3roles
background 1polarities
background 1representative citing papers
WebTestPilot symbolizes GUI elements to infer contextual oracles for end-to-end web testing from natural language specs, reporting 99% task completion and 96% precision/recall on a new bug-injected benchmark.
ANX introduces a protocol-first design with 3EX architecture that cuts token consumption by 47-66% and execution time by 58% versus prior methods in form-filling tests.
citing papers explorer
-
EO-Gym: A Multimodal, Interactive Environment for Earth Observation Agents
EO-Gym supplies an executable multimodal environment and 9k-trajectory benchmark that turns Earth Observation into a tool-using, multi-step reasoning task, revealing that current VLMs struggle on temporal and cross-sensor workflows while fine-tuning lifts Pass@3 from 0.49 to 0.74.
-
WebTestPilot: Agentic End-to-End Web Testing against Natural Language Specification by Inferring Oracles with Symbolized GUI Elements
WebTestPilot symbolizes GUI elements to infer contextual oracles for end-to-end web testing from natural language specs, reporting 99% task completion and 96% precision/recall on a new bug-injected benchmark.
-
ANX: Protocol-First Design for AI Agent Interaction with a Supporting 3EX Decoupled Architecture
ANX introduces a protocol-first design with 3EX architecture that cuts token consumption by 47-66% and execution time by 58% versus prior methods in form-filling tests.