SUPERGLASSES: Benchmarking Vision Language Models as Intelligent Agents for AI Smart Glasses
Pith reviewed 2026-05-15 19:16 UTC · model grok-4.3
The pith
A specialized agent for smart glasses vision questions outperforms GPT-4o by 2.19 percent through object detection and targeted web search.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SUPERGLASSES is the first comprehensive VQA benchmark built on real-world data entirely collected by smart glasses devices, comprising 2,422 egocentric image-question pairs spanning 14 image domains and 8 query categories, enriched with full search trajectories and reasoning annotations. SUPERLENS integrates automatic object detection, query decoupling, and multimodal web search for retrieval-augmented answer generation and achieves state-of-the-art performance, outperforming GPT-4o by 2.19 percent on the benchmark.
What carries the argument
SUPERLENS, a multimodal smart glasses agent that performs automatic object detection, followed by query decoupling and multimodal web search, to support retrieval-augmented answer generation.
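The paper does not release the agent code, but the three-stage design can be sketched as a pipeline. All function names and return values below are hypothetical stand-ins, not the actual SUPERLENS implementation:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a detect -> decouple -> search -> answer agent.
# Every component here is a stub; the real SUPERLENS modules are not public.

@dataclass
class AgentResult:
    object_name: str
    sub_queries: list = field(default_factory=list)
    evidence: list = field(default_factory=list)
    answer: str = ""

def detect_object(image, query):
    # Stand-in for open-vocabulary detection: return the exact name of the
    # object the query is about (the step the paper argues must come first).
    return "Toyota Corolla" if "vehicle" in query else "unknown object"

def decouple_query(query, object_name):
    # Split the visual question into retrieval-friendly sub-queries
    # grounded on the detected object.
    return [f"{object_name}: {query}"]

def web_search(sub_queries):
    # Stand-in for multimodal web search (image and text retrieval).
    return [f"snippet for: {q}" for q in sub_queries]

def generate_answer(query, object_name, evidence):
    # Stand-in for retrieval-augmented generation with a VLM.
    return f"answer to '{query}' about {object_name}, from {len(evidence)} snippet(s)"

def run_agent(image, query):
    obj = detect_object(image, query)
    subs = decouple_query(query, obj)
    ev = web_search(subs)
    return AgentResult(obj, subs, ev, generate_answer(query, obj, ev))

result = run_agent(image=None, query="What is the model of this vehicle?")
```

The point of the structure, as the review reads it, is the ordering: retrieval queries are only formed after the object of interest has been named.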
If this is right
- Existing VLMs leave significant performance gaps on realistic smart glasses queries that require precise object identification before external retrieval.
- Automatic object detection must precede knowledge lookup to handle the core challenge of smart glasses VQA.
- Query decoupling combined with multimodal web search improves answer quality over direct generation.
- Traditional multimodal datasets fail to reflect the sequential identification-then-retrieval demands of wearable devices.
Where Pith is reading between the lines
- Developers building smart glasses applications may need custom agent pipelines instead of relying solely on general VLMs.
- Extending the benchmark to continuous video or live interaction streams could expose further limitations in current models.
- Comparable detection-plus-retrieval designs might improve performance in other egocentric or wearable vision settings.
Load-bearing premise
The 2,422 collected pairs and the eight query categories sufficiently capture real smart glasses usage challenges, and the observed performance gain comes from the added detection and search components rather than evaluation choices or model scale.
What would settle it
A general-purpose VLM that reaches or exceeds SUPERLENS accuracy on the SUPERGLASSES benchmark without using the object-detection and query-decoupling pipeline would show the specialized steps are not required.
Original abstract
The rapid advancement of AI-powered smart glasses, one of the hottest wearable devices, has unlocked new frontiers for multimodal interaction, with Visual Question Answering (VQA) over external knowledge sources emerging as a core application. Existing Vision Language Models (VLMs) adapted to smart glasses are typically trained and evaluated on traditional multimodal datasets; however, these datasets lack the variety and realism needed to reflect smart glasses usage scenarios and diverge from their specific challenges, where accurately identifying the object of interest must precede any external knowledge retrieval. To bridge this gap, we introduce SUPERGLASSES, the first comprehensive VQA benchmark built on real-world data entirely collected by smart glasses devices. SUPERGLASSES comprises 2,422 egocentric image-question pairs spanning 14 image domains and 8 query categories, enriched with full search trajectories and reasoning annotations. We evaluate 26 representative VLMs on this benchmark, revealing significant performance gaps. To address the limitations of existing models, we further propose SUPERLENS, a multimodal smart glasses agent that enables retrieval-augmented answer generation by integrating automatic object detection, query decoupling, and multimodal web search. SUPERLENS achieves state-of-the-art performance, outperforming GPT-4o by 2.19%, underscoring the need for task-specific solutions in smart glasses VQA. Our dataset is publicly available at https://huggingface.co/datasets/xandery/SuperGlasses.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SUPERGLASSES, a new VQA benchmark of 2,422 real-world egocentric image-question pairs collected from smart glasses devices across 14 domains and 8 query categories, complete with search trajectories and annotations. It evaluates 26 VLMs on the benchmark, identifies performance gaps, and proposes SUPERLENS, a multimodal agent that integrates automatic object detection, query decoupling, and multimodal web search for retrieval-augmented generation, claiming state-of-the-art results with a 2.19% improvement over GPT-4o.
Significance. If the 2.19% gain is robust and specifically attributable to the three proposed components rather than prompt or protocol differences, the work would supply a useful public benchmark for smart-glasses VQA and illustrate the value of task-specific agent designs. The public dataset release is a clear strength for reproducibility and follow-on research.
major comments (2)
- [SUPERLENS agent description and results] The central claim attributes the 2.19% margin over GPT-4o specifically to automatic object detection + query decoupling + multimodal web search, yet the manuscript contains no ablation tables that disable one component at a time while holding the others fixed. Without these isolations it is impossible to rule out that the delta arises from prompt engineering, base-model choice, or evaluation-protocol details rather than the advertised integrations.
- [Abstract and experimental results] The abstract and results report a 2.19% absolute improvement without statistical significance testing, confidence intervals, or full evaluation-protocol details (exact prompts, temperature, decoding strategy). This information is required to establish that the margin is reliable rather than an artifact of the chosen protocol.
minor comments (2)
- [Abstract] Abstract contains inconsistent hyphenation and spacing in 'SUPER- GLASSES'.
- [Dataset construction] Clarify the rationale for selecting the 8 query categories and whether they adequately sample the full range of real smart-glasses usage scenarios.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments correctly identify gaps in the current manuscript regarding component isolation and statistical rigor. We address each point below and will revise the manuscript to incorporate the requested analyses and details.
Point-by-point responses
-
Referee: [SUPERLENS agent description and results] The central claim attributes the 2.19% margin over GPT-4o specifically to automatic object detection + query decoupling + multimodal web search, yet the manuscript contains no ablation tables that disable one component at a time while holding the others fixed. Without these isolations it is impossible to rule out that the delta arises from prompt engineering, base-model choice, or evaluation-protocol details rather than the advertised integrations.
Authors: We agree that the manuscript lacks explicit ablation studies isolating each component of SUPERLENS. The current version describes the three integrations but does not quantify their individual contributions through controlled ablations. In the revised manuscript we will add ablation tables that systematically disable one component at a time (automatic object detection, query decoupling, and multimodal web search) while holding the base model, prompts, and evaluation protocol fixed, thereby demonstrating the incremental benefit attributable to each module. revision: yes
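The promised leave-one-out ablation can be organized as a small loop: score the full system, then re-score with each component disabled while everything else stays fixed. The component names and the stub scorer below are hypothetical, for illustration only:

```python
COMPONENTS = ("object_detection", "query_decoupling", "web_search")

def evaluate(enabled):
    # Stand-in scorer: a real ablation would run the full agent on the
    # benchmark with only `enabled` components active and return accuracy.
    # The base accuracy and per-component gains here are made up.
    base = 0.50
    gains = {"object_detection": 0.04, "query_decoupling": 0.02, "web_search": 0.05}
    return base + sum(gains[c] for c in enabled)

# Full system first, then leave-one-out rows: disable one component at a time.
rows = [("full", evaluate(COMPONENTS))]
for c in COMPONENTS:
    kept = tuple(k for k in COMPONENTS if k != c)
    rows.append((f"w/o {c}", evaluate(kept)))

for name, acc in rows:
    print(f"{name:24s} {acc:.2%}")
```

Each "w/o" row minus the "full" row then isolates one component's marginal contribution under a fixed base model, prompt set, and evaluation protocol.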
-
Referee: [Abstract and experimental results] The abstract and results report a 2.19% absolute improvement without statistical significance testing, confidence intervals, or full evaluation-protocol details (exact prompts, temperature, decoding strategy). This information is required to establish that the margin is reliable rather than an artifact of the chosen protocol.
Authors: We acknowledge the absence of statistical testing and protocol transparency in the current version. The revised manuscript will include bootstrap confidence intervals and paired significance tests for the 2.19% margin, along with a new appendix that reports the exact prompts, temperature settings, decoding strategy, and all other hyperparameters used for SUPERLENS and the 26 evaluated VLMs, including GPT-4o. These additions will allow readers to reproduce and assess the reliability of the reported gains. revision: yes
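A paired bootstrap over benchmark items is one standard way to attach a confidence interval to an accuracy margin like the reported 2.19%. The sketch below uses synthetic 0/1 correctness vectors (the real per-item scores are not available here); resampling the same item indices for both systems is what makes the comparison paired:

```python
import random

# Paired bootstrap CI for an accuracy margin between two systems scored
# on the same items. Per-item correctness here is synthetic, for illustration.
random.seed(0)
n = 2422  # benchmark size
system_a = [1 if random.random() < 0.62 else 0 for _ in range(n)]
system_b = [1 if random.random() < 0.60 else 0 for _ in range(n)]

def margin(a, b):
    return (sum(a) - sum(b)) / len(a)

deltas = []
for _ in range(2000):
    idx = [random.randrange(n) for _ in range(n)]  # resample items, paired
    deltas.append(margin([system_a[i] for i in idx],
                         [system_b[i] for i in idx]))
deltas.sort()
lo, hi = deltas[int(0.025 * len(deltas))], deltas[int(0.975 * len(deltas))]
print(f"delta = {margin(system_a, system_b):+.2%}, 95% CI [{lo:+.2%}, {hi:+.2%}]")
```

If the resulting interval excludes zero, the margin is unlikely to be a resampling artifact; a paired sign or McNemar test on the same per-item vectors would complement this.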
Circularity Check
No circularity: empirical benchmark results on new dataset
Full rationale
The paper's core claims are direct accuracy measurements (e.g., SUPERLENS outperforming GPT-4o by 2.19% on the 2,422-pair SUPERGLASSES benchmark). No equations, fitted parameters, or derivations are presented that reduce the reported gains to inputs by construction. The benchmark collection, query categories, and SUPERLENS components (object detection, query decoupling, web search) are described as independent engineering choices; performance deltas are evaluated externally against 26 VLMs without self-referential definitions or load-bearing self-citations. This is a standard empirical benchmarking paper with no detectable circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: existing VQA metrics and web-search APIs remain valid when applied to egocentric smart-glasses imagery.