pith. machine review for the scientific record.

arxiv: 2604.05591 · v1 · submitted 2026-04-07 · 💻 cs.CE · cs.AI · cs.CL · cs.CY · cs.ET

Recognition: 1 theorem link · Lean Theorem

AI-Driven Modular Services for Accessible Multilingual Education in Immersive Extended Reality Settings: Integrating Speech Processing, Translation, and Sign Language Rendering

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:48 UTC · model grok-4.3

classification 💻 cs.CE · cs.AI · cs.CL · cs.CY · cs.ET
keywords: modular AI platform · XR education · multilingual translation · sign language rendering · accessible education · speech processing · virtual reality · AI services

The pith

A modular AI platform combines six services to deliver real-time accessible multilingual education with sign language in immersive XR.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a system that links automatic speech recognition, translation, speech synthesis, emotion detection, dialogue summarization, and International Sign rendering into one XR application. Recorded sign gestures are turned into hand landmarks and then into 3D avatar animations. Separate technical tests on each service show acceptable latency and translation scores, which the authors take as evidence that the full setup can run live and support inclusive language learning. Readers might care because the design uses existing AI pieces in a way that can scale to different languages and user needs without rebuilding everything.
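
To make the gesture pipeline concrete, here is a minimal sketch of the landmark-extraction step the paper describes, using MediaPipe's hand-tracking solution. The file names and the JSON output format are illustrative assumptions, not the authors' actual code.

```python
# Minimal sketch: extract per-frame hand landmarks from a recorded sign
# gesture video with MediaPipe Hands. The output format is an assumption;
# the paper maps such landmarks onto 3D avatar animations downstream.
import json
import cv2
import mediapipe as mp

def extract_landmarks(video_path: str) -> list:
    hands = mp.solutions.hands.Hands(static_image_mode=False, max_num_hands=2)
    frames = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame_bgr = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB input; OpenCV decodes as BGR
        result = hands.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
        frame_hands = []
        for hand in result.multi_hand_landmarks or []:
            # 21 landmarks per hand: normalised x/y plus relative depth z
            frame_hands.append([(lm.x, lm.y, lm.z) for lm in hand.landmark])
        frames.append(frame_hands)
    cap.release()
    hands.close()
    return frames

if __name__ == "__main__":
    keyframes = extract_landmarks("is_gesture.mp4")  # hypothetical corpus file
    with open("is_gesture_landmarks.json", "w") as f:
        json.dump(keyframes, f)
```

The resulting per-frame coordinates are the kind of data a downstream renderer would retarget onto a 3D avatar's hand rig.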

Core claim

These findings establish the viability of orchestrating cross-modal AI services within XR settings for accessible, multilingual language instruction. The modular design permits independent scaling and adaptation to varied educational contexts, providing a foundation for equitable learning solutions aligned with European Union digital accessibility goals.

What carries the argument

The modular platform that runs and connects six AI components—speech recognition, translation, synthesis, emotion classification, summarization, and mapping International Sign gestures to VR avatars—while validating each through isolated benchmarks.
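
As a sketch of what that modularity buys, the snippet below wires stub services behind one common interface so any stage can be swapped without touching the others; the interface and the stubs are our illustration, not the paper's implementation.

```python
# Sketch of a swappable-service pipeline behind a shared interface.
# Real implementations would wrap Whisper, NLLB/EuroLLM, Polly, etc.
from typing import Protocol

class TextService(Protocol):
    def run(self, text: str) -> str: ...

class StubTranslator:
    """Placeholder for an NLLB- or EuroLLM-backed translation service."""
    def __init__(self, target_lang: str):
        self.target_lang = target_lang
    def run(self, text: str) -> str:
        return f"[{self.target_lang}] {text}"  # stub output

class StubSummarizer:
    """Placeholder for a flan-t5-base-samsum summarisation service."""
    def run(self, text: str) -> str:
        return text.split(".")[0] + "."  # stub: keep the first sentence

def pipeline(text: str, services: list[TextService]) -> str:
    # Each stage is independent, so a service can be replaced (say, for a
    # new target language) without rebuilding the rest of the chain.
    for service in services:
        text = service.run(text)
    return text

print(pipeline("Hello class. Today we learn greetings.",
               [StubSummarizer(), StubTranslator("ell_Grek")]))
```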

If this is right

  • Components can be updated or swapped independently for new languages or user groups.
  • The system supports real-time XR use based on the reported latency and BLEU results.
  • It creates a path for inclusive education that reaches both multilingual speakers and sign language users.
  • The approach aligns with existing digital accessibility requirements in education.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Full classroom trials would be needed to check whether the technical numbers translate into actual language learning gains.
  • The same modular combination of existing models could apply to accessibility in training simulations or remote collaboration.
  • More sign language data could improve the gesture-to-avatar step for greater naturalness.

Load-bearing premise

That separate benchmarks of each AI component on latency and accuracy are enough to confirm the integrated system will perform well in real-time XR without full end-to-end user testing.

What would settle it

Running the complete platform with actual learners in XR and measuring combined response times, sign rendering accuracy, translation quality in conversation, and learning gains; results below practical thresholds would show the viability claim does not hold.
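
A toy calculation shows why the integrated measurement matters: even if every component clears its own benchmark, the composed pipeline can miss an XR budget once orchestration overhead is added. All figures below are invented for illustration; the paper reports no integrated numbers.

```python
# Toy end-to-end latency budget. Every figure here is an illustrative
# assumption; the paper benchmarks components only, not the full chain.
stage_ms = {
    "asr": 40.0,             # speech recognition
    "translation": 25.0,     # NLLB / EuroLLM
    "synthesis": 30.0,       # AWS Polly
    "sign_rendering": 20.0,  # landmark-to-avatar mapping
}
orchestration_overhead_ms = 15.0  # queuing, serialisation, network hops

total = sum(stage_ms.values()) + orchestration_overhead_ms
threshold = 100.0  # an often-cited immersion budget; actual XR targets vary

print(f"estimated end-to-end: {total:.0f} ms (budget {threshold:.0f} ms)")
if total > threshold:
    print("components that pass in isolation can still miss the budget together")
```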

Figures

Figures reproduced from arXiv: 2604.05591 by A.J. McCracken, E. Papatheou, I. Karachalios, N.D. Tantaroudas.

Figure 1: High-level overview of the proposed system architecture. AI services are hosted on …
Figure 2: Speech-to-text transcription utilising a Whisper AI wrapper …
Figure 3: Text-to-text translation workflow employing Meta's NLLB model …
Figure 4: Emoticon-based sentiment feedback in the XR environment. The avatar delivers …
Figure 5: API request and response format for the RoBERTa-based sentiment analysis module.
Figure 6: API request and response format for the meeting summarisation module based on the …
Figure 7: Visual sequence of real-time avatar animation driven by extracted gesture landmarks …
Figure 8: Immersive VR-based language learning system. A 3D avatar stands in a virtual class …
Figure 9: MVP demonstration showing multilingual avatar interaction in VR. The avatar delivers …
read the original abstract

This work introduces a modular platform that brings together six AI services, automatic speech recognition via OpenAI Whisper, multilingual translation through Meta NLLB, speech synthesis using AWS Polly, emotion classification with RoBERTa, dialogue summarisation via flan t5 base samsum, and International Sign (IS) rendering through Google MediaPipe. A corpus of IS gesture recordings was processed to derive hand landmark coordinates, which were subsequently mapped onto three dimensional avatar animations inside a virtual reality (VR) environment. Validation comprised technical benchmarking of each AI component, including comparative assessments of speech synthesis providers and multilingual translation models (NLLB 200 and EuroLLM 1.7B variants). Technical evaluations confirmed the suitability of the platform for real time XR deployment. Speech synthesis benchmarking established that AWS Polly delivers the lowest latency at a competitive price point. The EuroLLM 1.7B Instruct variant attained a higher BLEU score, surpassing NLLB. These findings establish the viability of orchestrating cross modal AI services within XR settings for accessible, multilingual language instruction. The modular design permits independent scaling and adaptation to varied educational contexts, providing a foundation for equitable learning solutions aligned with European Union digital accessibility goals.
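
For readers unfamiliar with the abstract's BLEU comparison, this is the kind of measurement involved: a corpus-level score of candidate translations against references. Below is a minimal sketch using sacrebleu; the sentences are invented placeholders and the scores bear no relation to the paper's results.

```python
# Minimal sketch of a corpus-BLEU comparison between two translation
# systems, in the spirit of the NLLB vs EuroLLM comparison; the sentences
# are invented placeholders, not the paper's test set.
import sacrebleu

references = [["The lesson starts at nine.", "Please open your books."]]
system_a = ["The lesson begins at nine.", "Please open your books."]
system_b = ["Lesson start nine.", "Open the books please."]

for name, hyps in [("system A", system_a), ("system B", system_b)]:
    bleu = sacrebleu.corpus_bleu(hyps, references)
    print(f"{name}: BLEU = {bleu.score:.1f}")
```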

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a modular platform integrating six AI services—automatic speech recognition via OpenAI Whisper, multilingual translation via Meta NLLB (and EuroLLM variants), speech synthesis via AWS Polly, emotion classification via RoBERTa, dialogue summarization via flan-t5-base-samsum, and International Sign rendering via Google MediaPipe—for accessible multilingual education in immersive XR/VR settings. It describes processing a corpus of IS gesture recordings into hand landmark coordinates mapped to 3D avatar animations and validates the approach through separate technical benchmarks of each component, including latency comparisons for synthesis providers and BLEU score comparisons for translation models, concluding that the platform is suitable for real-time XR deployment and supports equitable learning aligned with EU digital accessibility goals.

Significance. The modular orchestration of cross-modal AI services for XR-based language instruction represents a practical engineering contribution toward accessible education tools. If end-to-end integration and real-time performance were demonstrated, the work could serve as a foundation for scalable, adaptable systems in educational contexts. The explicit component benchmarking and focus on International Sign rendering are positive elements, but the absence of integrated system validation substantially reduces the strength of the viability claims.

major comments (2)
  1. [Abstract] Abstract: The claim that 'Technical evaluations confirmed the suitability of the platform for real time XR deployment' rests solely on isolated component benchmarks (AWS Polly latency, EuroLLM BLEU scores, MediaPipe landmark mapping). No end-to-end pipeline latency from ASR input through translation, synthesis, summarization, emotion detection, and IS avatar rendering; no synchronization error between audio and 3D hand landmarks; and no orchestration overhead are reported, leaving the leap to real-time XR viability unsubstantiated.
  2. [Abstract] Abstract (and validation description): The manuscript provides no user studies, educational effectiveness metrics, or specified real-time performance thresholds (e.g., maximum acceptable end-to-end latency for immersive XR) to support the stronger claims of enabling 'accessible, multilingual language instruction' and 'equitable learning solutions'. Component-level metrics alone do not establish overall platform suitability for the intended educational use case.
minor comments (2)
  1. [Abstract] Abstract: Inconsistent terminology—'real time' appears without hyphenation while 'real-time' is conventional; standardize throughout.
  2. [Abstract] Abstract: The IS gesture corpus is referenced but no details on its size, recording conditions, or exact mapping procedure to 3D avatars are supplied, hindering reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below, acknowledging limitations where they exist and outlining targeted revisions to strengthen the presentation of our technical contribution.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that 'Technical evaluations confirmed the suitability of the platform for real time XR deployment' rests solely on isolated component benchmarks (AWS Polly latency, EuroLLM BLEU scores, MediaPipe landmark mapping). No end-to-end pipeline latency from ASR input through translation, synthesis, summarization, emotion detection, and IS avatar rendering; no synchronization error between audio and 3D hand landmarks; and no orchestration overhead are reported, leaving the leap to real-time XR viability unsubstantiated.

    Authors: We agree that the abstract overstates the evidence by claiming confirmation of real-time XR suitability based solely on component benchmarks. The manuscript does not report integrated end-to-end latency, synchronization metrics, or orchestration overhead. We will revise the abstract to state that the benchmarks demonstrate the potential of individual components for low-latency operation in XR contexts. We will add an explicit limitations paragraph noting the absence of full-pipeline measurements and provide an aggregated latency estimate derived from the reported component times, along with a discussion of synchronization strategies for future integrated testing. revision: yes

  2. Referee: [Abstract] Abstract (and validation description): The manuscript provides no user studies, educational effectiveness metrics, or specified real-time performance thresholds (e.g., maximum acceptable end-to-end latency for immersive XR) to support the stronger claims of enabling 'accessible, multilingual language instruction' and 'equitable learning solutions'. Component-level metrics alone do not establish overall platform suitability for the intended educational use case.

    Authors: The manuscript is a technical engineering contribution centered on modular AI service integration and component benchmarking; it does not include user studies or educational outcome metrics. We will revise the abstract and conclusion to moderate the language, framing the work as providing a technical foundation for accessible multilingual XR education rather than claiming validated educational effectiveness or equitable learning solutions. We will also reference established XR latency thresholds from the literature (e.g., sub-100 ms end-to-end for immersion) and map our component results against them. Conducting user studies lies outside the current scope and would require separate ethical and resource considerations, which we will note as a future direction. revision: partial

Circularity Check

0 steps flagged

No circularity; viability claims rest on external service benchmarks

full rationale

The paper describes integration of six pre-existing AI services (OpenAI Whisper, Meta NLLB, AWS Polly, RoBERTa, flan-t5, MediaPipe) and reports separate technical benchmarks drawn from those external providers. No equations, derivations, fitted parameters, or internal predictions appear anywhere in the manuscript. The assertion that 'technical evaluations confirmed the suitability of the platform for real time XR deployment' is grounded in cited third-party metrics (latency, BLEU scores) rather than any self-referential construction or renaming of results. The work therefore contains no load-bearing steps that reduce to their own inputs by definition or self-citation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities; the work assembles existing commercial and open-source AI services plus a processed gesture corpus without postulating new quantities or assumptions beyond standard model usage.

pith-pipeline@v0.9.0 · 5553 in / 1193 out tokens · 62553 ms · 2026-05-10T18:48:18.518774+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · tagged: unclear

    Relation between the paper passage and the cited Recognition theorem.

    The proposed platform unifies modular AI-driven services... six AI services: automatic speech recognition via OpenAI Whisper, multilingual translation through Meta NLLB, speech synthesis using AWS Polly, emotion classification with RoBERTa, dialogue summarisation via flan-t5-base-samsum, and International Sign (IS) rendering through Google MediaPipe... Technical evaluations confirmed the platform’s suitability for real-time XR deployment.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 39 canonical work pages · 3 internal anchors

  [1] O. Shliakhtina, T. Kyselova, S. Mudra, Y. Talalay, and A. Oleksiienko. The effectiveness of the grammar translation method for learning English in higher education institutions. Eduweb, 17(3), 2023. doi:10.46502/issn.1856-7576/2023.17.03.12

  [2] H. Wu, H. Su, M. Yan, and Q. Zhuang. Perceptions of grammar-translation method and communicative language teaching method used in English classrooms. Journal of English Language Teaching and Applied Linguistics, 5(2), 2023a. doi:10.32996/jeltal.2023.5.2.12

  [3] R. Divekar et al. Foreign language acquisition via artificial intelligence and extended reality: Design and evaluation. Computer Assisted Language Learning, 35(9):2332–2360, 2021. doi:10.1080/09588221.2021.1879162

  [4] N. Tegoan, S. Wibowo, and S. Grandhi. Application of the extended reality technology for teaching new languages: A systematic review. Applied Sciences, 11(23):11360, 2021. doi:10.3390/app112311360

  [5] P. Panagiotidis. Virtual reality applications and language learning. International Journal for Cross-Disciplinary Subjects in Education, 12:4447–4454, 2021. doi:10.20533/ijcdse.2042.6364.2021.0543

  [6] R. Godwin-Jones. Presence and agency in real and virtual spaces: The promise of extended reality for language learning. Language Learning & Technology, 27(3):6–26, 2023. https://hdl.handle.net/10125/73529

  [7] Y. Zhi and L. Wu. Extended reality in language learning: A cognitive affective model of immersive learning perspective. Frontiers in Psychology, 14, 2023. doi:10.3389/fpsyg.2023.1109025

  [8] ImmerseMe VR. https://immerseme.co/. Accessed: 2025.

  [9] MondlyAR. https://www.mondly.com/. Accessed: 2025.

  [10] C. Garcia, A. Guzman, and D. Sánchez Ruano. Binding AI and XR in design education: Challenges and opportunities with emerging technologies. In Proceedings of the 26th International Conference on Engineering and Product Design Education (EPDE), pages 247–251, 2024. doi:10.35199/EPDE.2024.42

  [11] A. Hartholt, E. Fast, A. Reilly, W. Whitcup, M. Liewer, and S. Mozgai. Ubiquitous virtual humans: A multi-platform framework for embodied AI agents in XR. In 2019 IEEE International Conference on Artificial Intelligence and Virtual Reality (AIVR), pages 308–3084, 2019. doi:10.1109/AIVR46125.2019.00072

  [12] R. Zhang, D. Zou, and G. Cheng. Concepts, affordances, and theoretical frameworks of mixed reality enhanced language learning. Interactive Learning Environments, 32(7):3624–3637, 2023. doi:10.1080/10494820.2023.2187421

  [13] C. L. Taborda, H. Nguyen, and P. Bourdot. Engagement and attention in XR for learning: Literature review. In Virtual Reality and Mixed Reality. EuroXR 2024. Lecture Notes in Computer Science, volume 15445. Springer, Cham, 2025. doi:10.1007/978-3-031-78593-1_13

  [14] J. Taborri, P. Fornai, E. Yeguas-Bolivar, M. D. Redel-Macias, M. Hilzensauer, A. Pecher, M. Leisenberg, A. Melis, and S. Rossi. The use of artificial intelligence for sign language recognition in education: From a literature overview to the ISENSE project. In 2023 IEEE International Conference on Metrology for eXtended Reality, Artificial Intelligence and …

  [15] G. Strobel, T. Schoormann, L. Banh, and F. Möller. Artificial intelligence for sign language translation – a design science research study. Communications of the Association for Information Systems, 53, 2023. doi:10.17705/1cais.05303

  [16] L. Chaudhary, T. Ananthanarayana, E. Hoq, and I. Nwogu. SignNet II: A transformer-based two-way sign language translation model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45:12896–12907, 2022. doi:10.1109/TPAMI.2022.3232389

  [17] P. A. Rodríguez-Correa, A. Valencia-Arias, O. N. Patiño Toro, Y. Oblitas Díaz, and R. Teodori de la Puente. Benefits and development of assistive technologies for deaf people's communication: A systematic review. Frontiers in Education, 8, 2023. doi:10.3389/feduc.2023.1121597

  [18] European Union of the Deaf. EUD position paper: International sign language. https://eud.eu/eud/position-papers/international-signs/, 2018.

  [19] A. Yin, T. Zhong, L. H. Tang, W. Jin, T. Jin, and Z. Zhao. Gloss attention for gloss-free sign language translation. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2551–2562, 2023. doi:10.1109/CVPR52729.2023.00251

  [20] S. Sylaiou, E. Gkagka, C. Fidas, E. Vlachou, G. Lampropoulos, A. Plytas, and V. Nomikou. Use of XR technologies for fostering visitors' experience and inclusion at industrial museums. In Proceedings of the 2nd International Conference of the ACM Greek SIGCHI Chapter (CHI-GREECE '23), pages 1–5, 2023. doi:10.1145/3609987.3610008

  [21] T. Hirzle, F. Müller, F. Draxler, M. Schmitz, P. Knierim, and K. Hornbæk. When XR and AI meet – a scoping review on extended reality and artificial intelligence. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI '23), pages 1–45, 2023. doi:10.1145/3544548.3581072

  [22] N. D. Tantaroudas, A. J. McCracken, I. Karachalios, and E. Papatheou. AI-based services to support language-learning for deaf and hearing individuals in immersive XR settings. In Extended Reality. XR Salento 2025. Lecture Notes in Computer Science, volume 15743. Springer, Cham, 2026a. doi:10.1007/978-3-031-97781-7_17

  [23] N. D. Tantaroudas, A. J. McCracken, I. Karachalios, and E. Papatheou. Enhancing accessibility and inclusivity in business meetings through AI-driven extended reality solutions. In Extended Reality. XR Salento 2025. Lecture Notes in Computer Science, volume 15743. Springer, Cham, 2026b. doi:10.1007/978-3-031-97781-7_6

  [24] N. D. Tantaroudas, A. J. McCracken, I. Karachalios, V. Pastrikakis, and E. Papatheou. Transforming career development through immersive and data-driven solutions. In Extended Reality. XR Salento 2025. Lecture Notes in Computer Science, volume 15742. Springer, Cham, 2026c. doi:10.1007/978-3-031-97778-7_7

  [25] N. D. Tantaroudas, A. J. McCracken, I. Karachalios, and E. Papatheou. INTERACT: AI-powered extended reality platform for inclusive communication with real-time sign language translation and sentiment analysis. Open Research Europe, 6:71, 2026d. doi:10.12688/openreseurope.23201.1. Version 1; peer review: awaiting peer review.

  [26] N. D. Tantaroudas, A. J. McCracken, I. Karachalios, and E. Papatheou. AI-based services for inclusive language learning in immersive XR environments: Speech translation, and sign language integration. Open Research Europe, 6:72, 2026e. doi:10.12688/openreseurope.23214.1. Version 1; peer review: awaiting peer review.

  [27] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever. Robust speech recognition via large-scale weak supervision. In Proceedings of the 40th International Conference on Machine Learning (ICML 2023), PMLR 202, pages 28492–28518, 2023. doi:10.48550/arXiv.2212.04356

  [28] NLLB Team et al. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672, 2022. doi:10.48550/arXiv.2207.04672

  [29] Y. Liu, J. Zhu, J. Zhang, and C. Zong. Bridging the modality gap for speech-to-text translation. arXiv preprint arXiv:2010.14920, 2020. doi:10.48550/arXiv.2010.14920

  [30] M. S. Anwar, B. Shi, V. Goswami, W. Hsu, J. M. Pino, and C. Wang. MuAViC: A multilingual audio-visual corpus for robust speech recognition and robust speech-to-text translation. arXiv preprint arXiv:2303.00628, 2023. doi:10.48550/arXiv.2303.00628

  [31] N. C. Camgöz, O. Koller, S. Hadfield, and R. Bowden. Sign language transformers: Joint end-to-end sign language recognition and translation. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10020–10030, 2020. doi:10.1109/CVPR42600.2020.01004

  [32] B. Zhou, Z. Chen, A. Clapés, J. Wan, Y. Liang, and S. Escalera. Gloss-free sign language translation: Improving from visual-language pretraining. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 20814–20824, 2023. doi:10.1109/ICCV51070.2023.01908

  [33] J. Zheng, Y. Wang, C. Tan, S. Li, G. Wang, and J. Xia. CVT-SLR: Contrastive visual-textual transformation for sign language recognition with variational alignment. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 23141–23150, 2023. doi:10.1109/CVPR52729.2023.02216

  [34] X. Wu, X. Luo, Z. Song, Y. Bai, B. Zhang, and G. Zhang. Ultra-robust and sensitive flexible strain sensor for real-time and wearable sign language translation. Advanced Functional Materials, 33, 2023b. doi:10.1002/adfm.202303504

  [35] C. Lugaresi, J. Tang, H. Nash, C. McClanahan, E. Uboweja, M. Hays, F. Zhang, C.-L. Chang, M.-G. Yong, J. Lee, W.-T. Chang, W. Hua, M. Georg, and M. Grundmann. MediaPipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172, 2019. doi:10.48550/arXiv.1906.08172

  [36] B. Subramanian, B. Olimov, S. M. Naik, et al. An integrated MediaPipe-optimized GRU model for Indian sign language recognition. Scientific Reports, 12:11964, 2022. doi:10.1038/s41598-022-15998-7

  [37] ASL Transformer. https://github.com/bishal7679/ASL-Transformer. Accessed: 2025.

  [38] SpreadTheSign. https://www.spreadthesign.com/en.gb/search/. Accessed: 2025.

  [39] S. Srivastava, S. Singh, Pooja, et al. Continuous sign language recognition system using deep learning with MediaPipe holistic. Wireless Personal Communications, 137:1455–1468, 2024. doi:10.1007/s11277-024-11356-0

  [40] HandSpeak – International Sign Language. https://web.archive.org/web/20150711105152/http://www.handspeak.com/world/isl/index.php?id=151. Accessed: 2025.

  [41] T. Ahmed and P. Devanbu. Few-shot training LLMs for project-specific code-summarization. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, 2022. doi:10.1145/3551349.3559555

  [42] K. Boros and M. Oyamada. Towards large language model organization: A case study on abstractive summarization. In 2023 IEEE International Conference on Big Data (BigData), pages 6109–6112, Sorrento, Italy, 2023. doi:10.1109/BigData59044.2023.10386199

  [43] E. Bozkir, S. Özdel, K. H. C. Lau, M. Wang, H. Gao, and E. Kasneci. Embedding large language models into extended reality: Opportunities and challenges for inclusion, engagement, and privacy. In Proceedings of the 6th ACM Conference on Conversational User Interfaces (CUI '24), pages 1–7, 2024. doi:10.1145/3640794.3665563

  [44] S. Ramprasad, E. Ferracane, and Z. Lipton. Analyzing LLM behavior in dialogue summarization: Unveiling circumstantial hallucination trends. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024), pages 12549–12561, 2024. doi:10.18653/v1/2024.acl-long.677

  [45] F. Barbieri, J. Camacho-Collados, L. Espinosa-Anke, and L. Neves. TweetEval: Unified benchmark and comparative evaluation for tweet classification. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1644–1650, 2020. doi:10.18653/v1/2020.findings-emnlp.148

  [46] Coqui TTS: High-quality text-to-speech synthesis for researchers and developers. https://github.com/coqui-ai/TTS. Accessed: 2025.

  [47] Rhasspy contributors. Piper: A fast, local neural text to speech system. https://github.com/rhasspy/piper, 2023.

  [48] Picovoice. TTS latency benchmark. https://picovoice.ai/docs/benchmark/tts-latency/, 2024.

  [49] P. Schmid. flan-t5-base-samsum. Hugging Face model repository. https://huggingface.co/philschmid/flan-t5-base-samsum, 2022.

  [50] G. Bradski and A. Kaehler. Learning OpenCV: Computer Vision with the OpenCV Library. O'Reilly Media, Sebastopol, CA, 2008.