Ferret-ui 2: Mastering universal user interface understanding across platforms

Zhangheng Li, Keen You, Haotian Zhang, Di Feng, Harsh Agrawal, Xiujun Li, Mohana Prasad Sathya Moorthy, Jeff Nichols, Yinfei Yang, Zhe Gan · 2024 · arXiv 2410.18967

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

read on arXiv browse 3 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Do Vision-Language-Models show human-like logical problem-solving capability in point and click puzzle games?

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

VLATIM benchmark reveals large VLMs excel at high-level planning in physics puzzles but struggle with precise visual grounding and mouse control, so they lack human-like problem-solving capabilities.

MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, with full release of data and code planned.

FlexServe: A Fast and Secure LLM Serving System for Mobile Devices with Flexible Resource Isolation

cs.CR · 2026-03-10 · unverdicted · novelty 6.0

FlexServe achieves up to 10x faster time-to-first-token for secure LLM inference on mobile devices by using flexible resource isolation in TrustZone compared to standard approaches.

citing papers explorer

Showing 3 of 3 citing papers.

Do Vision-Language-Models show human-like logical problem-solving capability in point and click puzzle games? cs.AI · 2026-05-11 · unverdicted · none · ref 28
VLATIM benchmark reveals large VLMs excel at high-level planning in physics puzzles but struggle with precise visual grounding and mouse control, so they lack human-like problem-solving capabilities.
MolmoWeb: Open Visual Web Agent and Open Data for the Open Web cs.CV · 2026-04-09 · unverdicted · none · ref 55
Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, with full release of data and code planned.
FlexServe: A Fast and Secure LLM Serving System for Mobile Devices with Flexible Resource Isolation cs.CR · 2026-03-10 · unverdicted · none · ref 43
FlexServe achieves up to 10x faster time-to-first-token for secure LLM inference on mobile devices by using flexible resource isolation in TrustZone compared to standard approaches.

Ferret-ui 2: Mastering universal user interface understanding across platforms

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer