See No Evil: Semantic Context-Aware Privacy Risk Detection for AR

Huining Li; Jialu Liu; Yao Li; Ying Chen; Zhuoheng Li

arxiv: 2604.22805 · v1 · submitted 2026-04-14 · 💻 cs.CV · cs.AI· cs.SY· eess.SY

See No Evil: Semantic Context-Aware Privacy Risk Detection for AR

Jialu Liu , Yao Li , Zhuoheng Li , Huining Li , Ying Chen This is my paper

Pith reviewed 2026-05-10 15:15 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.SYeess.SY

keywords augmented realityprivacy risk detectionvision-language modelssemantic contextchain-of-thought promptingcontext-aware privacyAR obfuscation

0 comments

The pith

PrivAR uses vision-language models and chain-of-thought reasoning to detect context-specific privacy risks in augmented reality scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PrivAR as a system that addresses a gap in AR privacy protection by adding semantic understanding to visual data capture. Standard approaches treat all text or objects the same way and miss when something like a password note is sensitive only in an office setting. PrivAR instead prompts vision-language models to reason step by step about the scene, infer what kinds of information might be private, and then selectively hide the risky text while leaving enough context for the models to keep working. Real-world tests show higher detection accuracy and lower leakage than prior methods, and the work also tests warning designs that tell users why something was hidden.

Core claim

PrivAR detects and obfuscates textual content in AR environments by using VLMs with chain-of-thought prompting to infer potential sensitive information types from visual scene cues, such as identifying password notes in office environments through contextual reasoning, while preserving cues needed for continued VLM inference.

What carries the argument

Vision-language models with chain-of-thought prompting that infer context-dependent sensitive information types from visual cues to guide targeted text obfuscation.

If this is right

AR systems can protect users from context-dependent leaks without blocking all text or breaking the visual experience.
Privacy leakage rates drop below 20 percent when obfuscation is guided by scene-level reasoning rather than fixed rules.
Context-aware warning interfaces give users clearer reasons for hiding content and improve awareness during AR use.
The same inference pipeline can be applied to other continuous visual capture devices beyond headsets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers of consumer AR glasses could embed this style of filter to limit accidental sharing of personal documents or notes.
The approach might generalize to VR environments where users also move through spaces that contain private visual information.
Testing the system in outdoor or crowded public AR scenarios would reveal whether current VLM reasoning scales beyond indoor office-like settings.

Load-bearing premise

Vision-language models with chain-of-thought prompting can reliably infer context-dependent sensitive information types from visual cues across varied real-world AR environments and user scenarios.

What would settle it

A controlled test set of new AR scenes containing subtle sensitive items where the model repeatedly misses the privacy risk and the leakage rate rises above 30 percent.

Figures

Figures reproduced from arXiv: 2604.22805 by Huining Li, Jialu Liu, Yao Li, Ying Chen, Zhuoheng Li.

**Figure 2.** Figure 2: Different warning modes when a risk of privacy leakage is identified: (a) center-screen warning, (b) top-screen warning, and (c) region overlay warning. 3 warning modes, all flashing in a 2-second cycle (one second on, one second off) for a total of 6 seconds, as shown in [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

read the original abstract

Augmented reality (AR) systems pose unique privacy risks due to their continuous capture of visual data. Existing AR privacy frameworks lack semantic understanding of visual content, limiting their effectiveness in detecting context-dependent privacy risks. We propose PrivAR, which leverages vision language models (VLMs) with chain-of-thought prompting for contextual privacy risk detection in AR environments. PrivAR uses visual scene cues to infer potential sensitive information types, such as identifying password notes in office environments through contextual reasoning. PrivAR detects and obfuscates textual content, preventing exposure of sensitive information while preserving contextual cues necessary for VLM inference. Additionally, we investigate contextually-informed warning interfaces to enhance user privacy awareness. Experiments on a real-world AR dataset show that PrivAR achieves superior accuracy (81.48%) and F1-score (84.62%) compared to baselines, while reducing privacy leakage rate to 17.58%. User studies evaluating contextually-informed warning interfaces provide insights into effective privacy-aware AR design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PrivAR shows a workable VLM-plus-CoT pipeline for spotting context-dependent privacy risks in AR, with reported gains on a real dataset and some user-study backing.

read the letter

The paper's main point is that standard AR privacy tools miss context, so the authors built PrivAR to let a vision-language model reason step-by-step about what counts as sensitive in a given scene. It then blurs or warns about risky text while keeping enough visual cues for the model to keep working. Experiments claim 81.48% accuracy and 84.62% F1 on a real-world AR dataset, with leakage down to 17.58%, plus user studies on the warning interfaces. That combination of detection, obfuscation, and interface testing is the concrete advance here. The authors supply implementation details, dataset description, and baseline comparisons, which makes the numbers easier to assess than the abstract alone suggested. No obvious internal contradictions or missing ablations turned up in the full text. The evaluation looks consistent with the task they defined. The soft spots are the usual ones for this kind of work: results sit on one dataset whose size and diversity aren't huge, so how well it holds in other AR environments or with different VLMs is still open. Failure modes get less attention than the headline numbers. The baselines are described but their exact construction could be probed more. This is aimed at people working on AR privacy, HCI, or applied vision-language systems. A reader who wants practical examples of context-aware privacy tools will find usable ideas and numbers to build on. It has enough experimental grounding and honest reporting to merit a serious referee rather than a desk reject.

Referee Report

0 major / 2 minor

Summary. The paper proposes PrivAR, a system leveraging vision-language models (VLMs) with chain-of-thought prompting for detecting context-dependent privacy risks in AR by inferring sensitive information types from visual cues. It includes mechanisms to detect and obfuscate textual content to reduce privacy leakage while preserving context, and explores contextually-informed warning interfaces. On a real-world AR dataset, it reports 81.48% accuracy, 84.62% F1-score outperforming baselines, and 17.58% privacy leakage rate, with supporting user studies.

Significance. If the results hold, PrivAR advances the field by addressing the lack of semantic understanding in existing AR privacy frameworks. The use of VLMs for contextual inference, combined with obfuscation and user interface studies, provides a comprehensive approach to mitigating privacy risks in continuous visual AR capture. This could inform future designs for privacy-aware AR systems.

minor comments (2)

[Abstract] Abstract: The performance metrics (accuracy, F1-score, leakage rate) are stated without error bars, number of runs, or statistical significance tests, which would strengthen the presentation of the superiority claims.
[Evaluation] Evaluation section: A summary table comparing baseline methods, their implementations, and exact metric values would improve readability and allow direct verification of the reported gains.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the accurate summary of PrivAR and for recognizing its significance in advancing semantic understanding for AR privacy protection. We are pleased with the recommendation for minor revision.

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on reported experiments

full rationale

The manuscript contains no equations, derivations, or first-principles predictions. Its central claims consist of empirical performance numbers (81.48% accuracy, 84.62% F1, 17.58% leakage reduction) obtained by running a VLM+CoT pipeline on a described real-world AR dataset and comparing against baselines. No fitted parameter is later renamed as a prediction, no self-citation supplies a uniqueness theorem that forces the method, and no ansatz is smuggled in. The evaluation pipeline is described with implementation details, dataset splits, and user-study results, making the reported metrics independent of any internal definitional loop. This is the expected non-finding for a purely experimental systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are specified in the provided text.

pith-pipeline@v0.9.0 · 5484 in / 1118 out tokens · 72267 ms · 2026-05-10T15:15:00.589750+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 3 internal anchors

[1]

See No Evil: Semantic Context-Aware Privacy Risk Detection for AR

INTRODUCTION Augmented reality (AR) augments users’ perception by over- laying digital information onto the physical world around the users. AR relies on “always-on” environmental sens- ing, where cameras continuously capture users’ surround- ings [1]. This poses unprecedented privacy challenges by inadvertently recording sensitive information about user ...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[2]

PRIV ACY W ARNING!

SYSTEM DESIGN As shown in Fig. 1, PrivAR has a three-tier architecture com- prising: (1) an AR device providing user interface and privacy risk warnings (Section 2.1), (2) an edge server handling pri - vate information obfuscation (Section 2.2), and (3) a cloud server inferring contextual information and performing pr i- vacy risk assessment (Section 2.3)...

work page
[3]

strongly agree

EV ALUA TION We demonstrate the privacy risks of AR applications by eval- uating the accuracy of privacy risk detection. We also asses s the effectiveness of our privacy-preserving approach. 3.1. Experimental Setup We conduct a dataset-based evaluation and a user study with our end-to-end AR system. Following the experimental methodology in [17], in the d...

work page 2022
[4]

This approach achieves high privacy risk detection accuracy (81.48%) while reducing privacy leakage (17.58% PLR)

CONCLUSION PrivAR effectively addresses AR privacy challenges by ob- fuscating textual content at the edge server while preserv- ing contextual cues for cloud-based VLM inference. This approach achieves high privacy risk detection accuracy (81.48%) while reducing privacy leakage (17.58% PLR). Our results establish a practical path for privacy-preserv ing ...

work page
[5]

Erebus: Access control for augmented reality systems,

Y oonsang Kim, Sanket Goutam, Amir Rahmati, and Arie Kaufman, “Erebus: Access control for augmented reality systems,” in Proc. USENIX Security, 2023

work page 2023
[6]

User understanding of privacy permissions in mobile augmented reality: Per- ceptions and misconceptions,

Viktorija Paneva, V erena Winterhalter, Franziska Au- gustinowski, and Florian Alt, “User understanding of privacy permissions in mobile augmented reality: Per- ceptions and misconceptions,” Proceedings of the ACM on Human-Computer Interaction, vol. 9, no. 5, pp. 1–17, 2025

work page 2025
[7]

Privacy-enhancing technology and ev- eryday augmented reality: Understanding bystanders’ varying needs for awareness and consent,

Joseph O’Hagan, Pejman Saeghe, Jan Gugenheimer, Daniel Medeiros, Karola Marky, Mohamed Khamis, and Mark McGill, “Privacy-enhancing technology and ev- eryday augmented reality: Understanding bystanders’ varying needs for awareness and consent,” Proceedings of the ACM on Interactive, Mobile, W earable and Ubiq- uitous T echnologies, vol. 6, no. 4, pp. 1–35, 2023

work page 2023
[8]

World-driven access control for continuous sensing,

Franziska Roesner, David Molnar, Alexander Moshchuk, Tadayoshi Kohno, and Helen J Wang, “World-driven access control for continuous sensing,” in Proc. CCS, 2014

work page 2014
[9]

V eriﬁable access control for augmented reality localization and mapping,

Shaowei Zhu, Hyo Jin Kim, Maurizio Monge, G. Ed- ward Suh, Armin Alaghi, Brandon Reagen, and Vincent Lee, “V eriﬁable access control for augmented reality localization and mapping,” arXiv:2203.13308, 2022

work page arXiv 2022
[10]

BystandAR: Protecting by- stander visual data in augmented reality systems,

Matthew Corbett, Brendan David-John, Jiacheng Shang, Y . Charlie Hu, and Bo Ji, “BystandAR: Protecting by- stander visual data in augmented reality systems,” in Proc. MobiSys, 2023

work page 2023
[11]

Segue: Side- information guided generative unlearnable examples for facial privacy protection in real world,

Zhiling Zhang, Jie Zhang, Kui Zhang, Wenbo Zhou, Ting Xu, Daiheng Gao, Zixian Guo, Qinglang Guo, Weiming Zhang, and Nenghai Y u, “Segue: Side- information guided generative unlearnable examples for facial privacy protection in real world,” in Proc. IEEE ICASSP, 2025

work page 2025
[12]

Facial identity anonymization via intrinsic and extrinsic attention distraction,

Zhenzhong Kuang, Xiaochen Y ang, Yingjie Shen, Chao Hu, and Jun Y u, “Facial identity anonymization via intrinsic and extrinsic attention distraction,” in Proc. CVPR, 2024

work page 2024
[13]

Beyond blanket masking: Examin- ing granularity for privacy protection in images captured by blind and low vision users,

Jeffri Murrugarra-Llerena, Haoran Niu, K. Suzanne Barber, Hal Daum´ e III, Y ang Trista Cao, and Paola Cascante-Bonilla, “Beyond blanket masking: Examin- ing granularity for privacy protection in images captured by blind and low vision users,” in in Proc. COLM, 2025

work page 2025
[14]

ReVision: A dataset and baseline VLM for privacy-preserving task-oriented visual instruction rewriting,

Abhijit Mishra, Richard Noh, Hsiang Fu, Mingda Li, and Minji Kim, “ReVision: A dataset and baseline VLM for privacy-preserving task-oriented visual instruction rewriting,” in in Proc. IJCNLP-AACL, 2025

work page 2025
[15]

Vision language model helps private information de-identiﬁcation in vision data,

Tiejin Chen, Pingzhi Li, Kaixiong Zhou, Tianlong Chen, and Hua Wei, “Vision language model helps private information de-identiﬁcation in vision data,” in Proc. ACL, 2025

work page 2025
[16]

A design space for effective pri- vacy notices,

Florian Schaub, Rebecca Balebako, Adam L Durity, and Lorrie Faith Cranor, “A design space for effective pri- vacy notices,” in Proc. SOUPS, 2015

work page 2015
[17]

Latency-aware hybrid edge cloud framework for mo- bile augmented reality applications,

A yman Y ounis, Brian Qiu, and Dario Pompili, “Latency-aware hybrid edge cloud framework for mo- bile augmented reality applications,” in Proc. IEEE SECON, 2020

work page 2020
[18]

Integrated design of augmented reality spaces using virtual environments,

Tim Scargill, Ying Chen, Nathan Marzen, and Maria Gorlatova, “Integrated design of augmented reality spaces using virtual environments,” in Proc. IEEE IS- MAR, 2022

work page 2022
[19]

EAST: An efﬁcient and accurate scene text detector,

Xinyu Zhou, Cong Y ao, He Wen, Y uzhi Wang, Shuchang Zhou, Weiran He, and Jiajun Liang, “EAST: An efﬁcient and accurate scene text detector,” in Proc. CVPR, 2017

work page 2017
[20]

Chain-of-thought prompting elicits rea- soning in large language models,

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou, “Chain-of-thought prompting elicits rea- soning in large language models,” in Proc. NeurIPS , 2022

work page 2022
[21]

ViD- DAR: Vision language model-based task-detrimental content detection for augmented reality,

Y anming Xiu, Tim Scargill, and Maria Gorlatova, “ViD- DAR: Vision language model-based task-detrimental content detection for augmented reality,” IEEE Trans- actions on Visualization and Computer Graphics , vol. 31, no. 5, pp. 3194–3203, 2025

work page 2025
[22]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shya- mal Anadkat, et al., “GPT-4 technical report,” arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, M. Lachaux, Timoth´ ee Lacroix, Baptiste Rozi` ere, Naman Goyal, Eric Hambro, Faisal Azhar, Aur´ elien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample, “LLaMA: Open and efﬁcient foundation language models,” arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

An overview of the Tesseract OCR engine,

Ray Smith, “An overview of the Tesseract OCR engine,” in Proc. ICDAR, 2007

work page 2007
[25]

Ultralytics, “YOLOv8,” 2023, https://github.com/ultralytics/ultralytics

work page 2023
[26]

Col-OLHTR: A novel framework for multimodal online handwritten text recognition,

Chenyu Liu, Jinshui Hu, Baocai Yin, Jia Pan, Bing Yin, Jun Du, and Qingfeng Liu, “Col-OLHTR: A novel framework for multimodal online handwritten text recognition,” in Proc. IEEE ICASSP, 2025

work page 2025

[1] [1]

See No Evil: Semantic Context-Aware Privacy Risk Detection for AR

INTRODUCTION Augmented reality (AR) augments users’ perception by over- laying digital information onto the physical world around the users. AR relies on “always-on” environmental sens- ing, where cameras continuously capture users’ surround- ings [1]. This poses unprecedented privacy challenges by inadvertently recording sensitive information about user ...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[2] [2]

PRIV ACY W ARNING!

SYSTEM DESIGN As shown in Fig. 1, PrivAR has a three-tier architecture com- prising: (1) an AR device providing user interface and privacy risk warnings (Section 2.1), (2) an edge server handling pri - vate information obfuscation (Section 2.2), and (3) a cloud server inferring contextual information and performing pr i- vacy risk assessment (Section 2.3)...

work page

[3] [3]

strongly agree

EV ALUA TION We demonstrate the privacy risks of AR applications by eval- uating the accuracy of privacy risk detection. We also asses s the effectiveness of our privacy-preserving approach. 3.1. Experimental Setup We conduct a dataset-based evaluation and a user study with our end-to-end AR system. Following the experimental methodology in [17], in the d...

work page 2022

[4] [4]

This approach achieves high privacy risk detection accuracy (81.48%) while reducing privacy leakage (17.58% PLR)

CONCLUSION PrivAR effectively addresses AR privacy challenges by ob- fuscating textual content at the edge server while preserv- ing contextual cues for cloud-based VLM inference. This approach achieves high privacy risk detection accuracy (81.48%) while reducing privacy leakage (17.58% PLR). Our results establish a practical path for privacy-preserv ing ...

work page

[5] [5]

Erebus: Access control for augmented reality systems,

Y oonsang Kim, Sanket Goutam, Amir Rahmati, and Arie Kaufman, “Erebus: Access control for augmented reality systems,” in Proc. USENIX Security, 2023

work page 2023

[6] [6]

User understanding of privacy permissions in mobile augmented reality: Per- ceptions and misconceptions,

Viktorija Paneva, V erena Winterhalter, Franziska Au- gustinowski, and Florian Alt, “User understanding of privacy permissions in mobile augmented reality: Per- ceptions and misconceptions,” Proceedings of the ACM on Human-Computer Interaction, vol. 9, no. 5, pp. 1–17, 2025

work page 2025

[7] [7]

Privacy-enhancing technology and ev- eryday augmented reality: Understanding bystanders’ varying needs for awareness and consent,

Joseph O’Hagan, Pejman Saeghe, Jan Gugenheimer, Daniel Medeiros, Karola Marky, Mohamed Khamis, and Mark McGill, “Privacy-enhancing technology and ev- eryday augmented reality: Understanding bystanders’ varying needs for awareness and consent,” Proceedings of the ACM on Interactive, Mobile, W earable and Ubiq- uitous T echnologies, vol. 6, no. 4, pp. 1–35, 2023

work page 2023

[8] [8]

World-driven access control for continuous sensing,

Franziska Roesner, David Molnar, Alexander Moshchuk, Tadayoshi Kohno, and Helen J Wang, “World-driven access control for continuous sensing,” in Proc. CCS, 2014

work page 2014

[9] [9]

V eriﬁable access control for augmented reality localization and mapping,

Shaowei Zhu, Hyo Jin Kim, Maurizio Monge, G. Ed- ward Suh, Armin Alaghi, Brandon Reagen, and Vincent Lee, “V eriﬁable access control for augmented reality localization and mapping,” arXiv:2203.13308, 2022

work page arXiv 2022

[10] [10]

BystandAR: Protecting by- stander visual data in augmented reality systems,

Matthew Corbett, Brendan David-John, Jiacheng Shang, Y . Charlie Hu, and Bo Ji, “BystandAR: Protecting by- stander visual data in augmented reality systems,” in Proc. MobiSys, 2023

work page 2023

[11] [11]

Segue: Side- information guided generative unlearnable examples for facial privacy protection in real world,

Zhiling Zhang, Jie Zhang, Kui Zhang, Wenbo Zhou, Ting Xu, Daiheng Gao, Zixian Guo, Qinglang Guo, Weiming Zhang, and Nenghai Y u, “Segue: Side- information guided generative unlearnable examples for facial privacy protection in real world,” in Proc. IEEE ICASSP, 2025

work page 2025

[12] [12]

Facial identity anonymization via intrinsic and extrinsic attention distraction,

Zhenzhong Kuang, Xiaochen Y ang, Yingjie Shen, Chao Hu, and Jun Y u, “Facial identity anonymization via intrinsic and extrinsic attention distraction,” in Proc. CVPR, 2024

work page 2024

[13] [13]

Beyond blanket masking: Examin- ing granularity for privacy protection in images captured by blind and low vision users,

Jeffri Murrugarra-Llerena, Haoran Niu, K. Suzanne Barber, Hal Daum´ e III, Y ang Trista Cao, and Paola Cascante-Bonilla, “Beyond blanket masking: Examin- ing granularity for privacy protection in images captured by blind and low vision users,” in in Proc. COLM, 2025

work page 2025

[14] [14]

ReVision: A dataset and baseline VLM for privacy-preserving task-oriented visual instruction rewriting,

Abhijit Mishra, Richard Noh, Hsiang Fu, Mingda Li, and Minji Kim, “ReVision: A dataset and baseline VLM for privacy-preserving task-oriented visual instruction rewriting,” in in Proc. IJCNLP-AACL, 2025

work page 2025

[15] [15]

Vision language model helps private information de-identiﬁcation in vision data,

Tiejin Chen, Pingzhi Li, Kaixiong Zhou, Tianlong Chen, and Hua Wei, “Vision language model helps private information de-identiﬁcation in vision data,” in Proc. ACL, 2025

work page 2025

[16] [16]

A design space for effective pri- vacy notices,

Florian Schaub, Rebecca Balebako, Adam L Durity, and Lorrie Faith Cranor, “A design space for effective pri- vacy notices,” in Proc. SOUPS, 2015

work page 2015

[17] [17]

Latency-aware hybrid edge cloud framework for mo- bile augmented reality applications,

A yman Y ounis, Brian Qiu, and Dario Pompili, “Latency-aware hybrid edge cloud framework for mo- bile augmented reality applications,” in Proc. IEEE SECON, 2020

work page 2020

[18] [18]

Integrated design of augmented reality spaces using virtual environments,

Tim Scargill, Ying Chen, Nathan Marzen, and Maria Gorlatova, “Integrated design of augmented reality spaces using virtual environments,” in Proc. IEEE IS- MAR, 2022

work page 2022

[19] [19]

EAST: An efﬁcient and accurate scene text detector,

Xinyu Zhou, Cong Y ao, He Wen, Y uzhi Wang, Shuchang Zhou, Weiran He, and Jiajun Liang, “EAST: An efﬁcient and accurate scene text detector,” in Proc. CVPR, 2017

work page 2017

[20] [20]

Chain-of-thought prompting elicits rea- soning in large language models,

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou, “Chain-of-thought prompting elicits rea- soning in large language models,” in Proc. NeurIPS , 2022

work page 2022

[21] [21]

ViD- DAR: Vision language model-based task-detrimental content detection for augmented reality,

Y anming Xiu, Tim Scargill, and Maria Gorlatova, “ViD- DAR: Vision language model-based task-detrimental content detection for augmented reality,” IEEE Trans- actions on Visualization and Computer Graphics , vol. 31, no. 5, pp. 3194–3203, 2025

work page 2025

[22] [22]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shya- mal Anadkat, et al., “GPT-4 technical report,” arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[23] [23]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, M. Lachaux, Timoth´ ee Lacroix, Baptiste Rozi` ere, Naman Goyal, Eric Hambro, Faisal Azhar, Aur´ elien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample, “LLaMA: Open and efﬁcient foundation language models,” arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[24] [24]

An overview of the Tesseract OCR engine,

Ray Smith, “An overview of the Tesseract OCR engine,” in Proc. ICDAR, 2007

work page 2007

[25] [25]

Ultralytics, “YOLOv8,” 2023, https://github.com/ultralytics/ultralytics

work page 2023

[26] [26]

Col-OLHTR: A novel framework for multimodal online handwritten text recognition,

Chenyu Liu, Jinshui Hu, Baocai Yin, Jia Pan, Bing Yin, Jun Du, and Qingfeng Liu, “Col-OLHTR: A novel framework for multimodal online handwritten text recognition,” in Proc. IEEE ICASSP, 2025

work page 2025