pith. machine review for the scientific record.

arxiv: 2604.12081 · v1 · submitted 2026-04-13 · 💻 cs.AI

Recognition: unknown

Human-Inspired Context-Selective Multimodal Memory for Social Robots

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:34 UTC · model grok-4.3

classification 💻 cs.AI
keywords social robots · multimodal memory · context-selective storage · episodic memory · emotional salience · human-robot interaction · memory retrieval · scene novelty

The pith

A context-selective multimodal memory architecture enables social robots to store and retrieve personalized episodic experiences based on emotional salience and scene novelty.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a memory system for social robots that selectively captures and recalls both text and visual information from past interactions, prioritizing moments with high emotional impact or novelty. This draws from how humans remember meaningful events to adapt behavior in social contexts. If effective, it would allow robots to generate more natural, grounded, and user-specific dialogue rather than relying on generic or text-only memory. The system associates memories with individual users to support long-term personalized interactions. Evaluations on a dataset of social scenarios show it outperforms baselines in selective storage and retrieval accuracy while running in real time.

Core claim

The context-selective multimodal memory architecture captures textual and visual episodic traces, prioritizes those with high emotional salience or scene novelty, and associates them with individual users to enable socially personalized recall and natural dialogue. On selective storage it achieves a Spearman correlation of 0.506, surpassing human consistency (ρ = 0.415), and its fusion approach improves multimodal retrieval Recall@1 by up to 13%.

What carries the argument

The context-selective multimodal memory architecture that prioritizes memories by emotional salience and scene novelty for multimodal (text and image) episodic traces associated with users.
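
Read concretely, the selection mechanism is a thresholded write over two scores. Below is a minimal sketch of such a store-or-discard gate, assuming salience and novelty arrive already normalized to [0, 1]; the Episode fields, the dict-keyed store, and the threshold value are illustrative assumptions, not the paper's API.

```python
# Hypothetical sketch of the selective-write gate; not the paper's code.
from dataclasses import dataclass, field


@dataclass
class Episode:
    user_id: str        # memories are associated with individual users
    text: str           # textual trace of the interaction
    image_ref: str      # pointer to the visual trace
    salience: float     # emotional salience in [0, 1] (assumed scale)
    novelty: float      # scene novelty in [0, 1] (assumed scale)


@dataclass
class MemoryStore:
    tau: float = 0.6    # storage threshold; a hypothetical value
    episodes: dict = field(default_factory=dict)

    def maybe_store(self, ep: Episode) -> bool:
        """Write only moments whose salience or novelty clears the gate."""
        if max(ep.salience, ep.novelty) < self.tau:
            return False  # low-priority moment: discarded, not stored
        self.episodes.setdefault(ep.user_id, []).append(ep)
        return True
```

The max combination treats either cue as sufficient for storage; the paper may weight or mix the two cues differently.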

If this is right

  • Social robots can produce richer and more relevant responses in conversations by recalling contextually important past events.
  • Personalized memory association with users supports long-term human-robot interaction without losing relevance over time.
  • Real-time performance is maintained, allowing deployment in ongoing interactive settings without delays.
  • Selective storage reduces the load of irrelevant memories compared to storing every interaction equally.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending the approach to include audio or other sensor data could further strengthen recall by capturing additional dimensions of social context.
  • Testing across different cultures or age groups might reveal whether the emotional and novelty cues generalize or require adaptation.
  • Long-term deployment could show whether repeated selective recall improves user trust and engagement in repeated encounters.

Load-bearing premise

That emotional salience and scene novelty can be computed reliably from raw interaction data, so that memory prioritization matches human judgment without systematic bias.
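
Since the page does not name the underlying models, here is one hedged way the novelty half of that premise could be operationalized: score a new frame by its embedding distance to the user's stored memories. A CLIP-style image encoder is an assumption here, not something the source confirms.

```python
# Illustrative novelty score; the paper's actual features are unspecified.
import numpy as np


def scene_novelty(frame_emb: np.ndarray, stored_embs: np.ndarray) -> float:
    """Novelty = 1 - max cosine similarity to any stored embedding.

    frame_emb: (d,) embedding of the incoming frame.
    stored_embs: (n, d) embeddings already in the user's memory store.
    Returns 1.0 for an empty store (everything is novel at first).
    """
    if stored_embs.size == 0:
        return 1.0
    frame = frame_emb / np.linalg.norm(frame_emb)
    stored = stored_embs / np.linalg.norm(stored_embs, axis=1, keepdims=True)
    return float(1.0 - np.max(stored @ frame))
```

Whether such a score keeps matching human judgment under lighting changes, spontaneous dialogue, and user idiosyncrasies is exactly what the referee report below flags as untested.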

What would settle it

The core advantage would be falsified by a new dataset of real human-robot interactions on which the system's selective storage decisions agree with human judgments no better than non-selective baselines do.
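
The comparison reduces to a rank-correlation test; a minimal sketch with placeholder arrays (real inputs would be human storage-worthiness ratings and model scores collected from new HRI recordings):

```python
# Sketch of the falsification test; data values here are placeholders.
import numpy as np
from scipy.stats import spearmanr

human = np.array([0.9, 0.2, 0.7, 0.4, 0.8])     # human ratings per moment
system = np.array([0.85, 0.3, 0.6, 0.35, 0.9])  # selective model's priorities

rng = np.random.default_rng(0)
baseline = rng.random(human.shape[0])           # non-selective (random-priority) baseline

rho_sys, _ = spearmanr(system, human)
rho_base, _ = spearmanr(baseline, human)
# The claimed advantage fails if rho_sys is not reliably above rho_base.
print(f"selective rho = {rho_sys:.3f}, baseline rho = {rho_base:.3f}")
```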

Figures

Figures reproduced from arXiv: 2604.12081 by Hangyeol Kang, Nadia Magnenat Thalmann, Slava Voloshynovskiy.

Figure 1. Overview of the SUMMER architecture for selective multimodal memory in social robots. The perception layer analyzes […]
Figure 2. Overview of the user identification process.
Figure 3. Qualitative comparison of responses generated by the baseline model (left) and the SUMMER-augmented system (right).
Original abstract

Memory is fundamental to social interaction, enabling humans to recall meaningful past experiences and adapt their behavior accordingly based on the context. However, most current social robots and embodied agents rely on non-selective, text-based memory, limiting their ability to support personalized, context-aware interactions. Drawing inspiration from cognitive neuroscience, we propose a context-selective, multimodal memory architecture for social robots that captures and retrieves both textual and visual episodic traces, prioritizing moments characterized by high emotional salience or scene novelty. By associating these memories with individual users, our system enables socially personalized recall and more natural, grounded dialogue. We evaluate the selective storage mechanism using a curated dataset of social scenarios, achieving a Spearman correlation of 0.506, surpassing human consistency (ρ = 0.415) and outperforming existing image memorability models. In multimodal retrieval experiments, our fusion approach improves Recall@1 by up to 13% over unimodal text or image retrieval. Runtime evaluations confirm that the system maintains real-time performance. Qualitative analyses further demonstrate that the proposed framework produces richer and more socially relevant responses than baseline models. This work advances memory design for social robots by bridging human-inspired selectivity and multimodal retrieval to enhance long-term, personalized human-robot interaction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a context-selective multimodal memory architecture for social robots, inspired by cognitive neuroscience. It selectively stores and retrieves textual and visual episodic traces, prioritizing high emotional salience or scene novelty, and associates memories with individual users to enable personalized recall and grounded dialogue. On a curated dataset of social scenarios, the selective storage achieves a Spearman correlation of 0.506 (surpassing human consistency ρ=0.415) and outperforms image memorability models; multimodal fusion improves Recall@1 by up to 13% over unimodal baselines while maintaining real-time performance, with qualitative gains in socially relevant responses.

Significance. If the selectivity and fusion results hold under broader conditions, the work could meaningfully advance memory design in social robotics by integrating human-inspired prioritization with multimodal retrieval, supporting more adaptive long-term HRI. The explicit comparison to human consistency and the reported Recall@1 gains provide a concrete empirical anchor, though the absence of open code, full methodological details, or error bars limits immediate reproducibility and extension.

major comments (2)
  1. [Evaluation] Evaluation section (selective storage experiments): The central claim that the architecture enables 'socially personalized recall' for social robots rests on performance (ρ=0.506) measured exclusively on a curated dataset of social scenarios. No analysis or experiments demonstrate stability of emotional salience and scene novelty scores under distribution shift to live robot camera feeds, variable lighting, spontaneous dialogue, or individual user idiosyncrasies, which is load-bearing for the application to real HRI.
  2. [Multimodal retrieval experiments] Multimodal retrieval experiments: The reported up to 13% Recall@1 gain via fusion is presented without error bars, statistical significance tests, or complete specification of the fusion mechanism, baseline implementations, and dataset splits. This makes it impossible to assess whether the improvement is robust or sensitive to post-hoc modeling choices, directly affecting the strength of the multimodal advantage claim.
minor comments (2)
  1. [Abstract and Evaluation] The abstract and evaluation sections would benefit from explicit statements of the exact models or features used to compute emotional salience and scene novelty, including any hyperparameters.
  2. [Runtime evaluations] Runtime performance claims would be strengthened by reporting hardware specifications and latency distributions rather than a qualitative 'real-time' assertion.
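
For concreteness, the kind of latency report the second minor comment asks for might look like the following; the retrieve callable and query set are placeholders standing in for the system's retrieval path.

```python
# Sketch of distribution-based latency reporting, not the paper's harness.
import time
import statistics


def report_latency(retrieve, queries, warmup: int = 5) -> None:
    """Print p50/p95 latency instead of a flat 'real-time' claim."""
    for q in queries[:warmup]:
        retrieve(q)                                  # warm caches before timing
    latencies_ms = []
    for q in queries:
        t0 = time.perf_counter()
        retrieve(q)
        latencies_ms.append((time.perf_counter() - t0) * 1e3)
    latencies_ms.sort()
    p50 = statistics.median(latencies_ms)
    p95 = latencies_ms[int(0.95 * (len(latencies_ms) - 1))]
    print(f"n={len(latencies_ms)}  p50={p50:.1f} ms  p95={p95:.1f} ms")
```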

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's detailed review and constructive suggestions for improving our manuscript on the context-selective multimodal memory architecture. Below, we provide point-by-point responses to the major comments. We have revised the manuscript to address concerns about experimental details and have added discussions on limitations where appropriate.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section (selective storage experiments): The central claim that the architecture enables 'socially personalized recall' for social robots rests on performance (ρ=0.506) measured exclusively on a curated dataset of social scenarios. No analysis or experiments demonstrate stability of emotional salience and scene novelty scores under distribution shift to live robot camera feeds, variable lighting, spontaneous dialogue, or individual user idiosyncrasies, which is load-bearing for the application to real HRI.

    Authors: We thank the referee for highlighting this important aspect. Our evaluation indeed focuses on a curated dataset of social scenarios to enable direct comparison with human consistency ratings and existing memorability models. This controlled setting allows us to isolate and validate the selectivity mechanism without confounding factors from real-world variability. We recognize that demonstrating robustness under distribution shifts to live robot environments is crucial for practical HRI applications. In the revised manuscript, we have added a dedicated paragraph in the Limitations and Future Work section acknowledging this gap and outlining planned experiments involving live camera feeds, variable conditions, and user studies to assess stability of salience and novelty scores. We believe this provides an honest assessment while maintaining the contributions of the current work. [revision: partial]

  2. Referee: [Multimodal retrieval experiments] Multimodal retrieval experiments: The reported up to 13% Recall@1 gain via fusion is presented without error bars, statistical significance tests, or complete specification of the fusion mechanism, baseline implementations, and dataset splits. This makes it impossible to assess whether the improvement is robust or sensitive to post-hoc modeling choices, directly affecting the strength of the multimodal advantage claim.

    Authors: We agree that these details are essential for evaluating the reliability of the multimodal fusion results. Upon review, we have expanded the relevant section in the revised manuscript to include error bars representing standard deviation over 5 independent runs, results from statistical significance testing (paired t-tests with p-values reported), a complete description of the fusion mechanism as a weighted late fusion of normalized text and image similarity scores, full specifications of the baseline models (including the exact pre-trained models and hyperparameters used), and the precise train/validation/test split ratios (70%/15%/15%) along with how the curated dataset was partitioned. These additions should enable readers to better assess the robustness of the up to 13% Recall@1 improvement. [revision: yes]
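
Taking the rebuttal's description at face value, the fusion and the metric it is judged by could be sketched as below; the per-query min-max normalization and the weight w are assumptions, since the paper's exact choices are not reproduced on this page.

```python
# Hedged sketch of weighted late fusion scored by Recall@1.
import numpy as np


def minmax(x: np.ndarray) -> np.ndarray:
    """Per-query min-max normalization of one similarity row."""
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)


def fused_recall_at_1(text_sim: np.ndarray, image_sim: np.ndarray,
                      gold_idx: np.ndarray, w: float = 0.5) -> float:
    """text_sim, image_sim: (n_queries, n_candidates) similarity matrices;
    gold_idx[i] is the correct candidate for query i. Returns Recall@1."""
    hits = 0
    for t, v, g in zip(text_sim, image_sim, gold_idx):
        fused = w * minmax(t) + (1.0 - w) * minmax(v)
        hits += int(np.argmax(fused) == g)
    return hits / len(gold_idx)
```

Sweeping w on a validation split, together with the reported 5-run error bars, would show how sensitive the up-to-13% gain is to the fusion weight.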

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation on curated data

Full rationale

The paper proposes a context-selective multimodal memory architecture inspired by cognitive neuroscience and reports direct empirical results: Spearman correlation of 0.506 on selective storage (vs. human ρ=0.415) and up to 13% Recall@1 gain from fusion, all measured on a curated dataset of social scenarios. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The metrics are presented as experimental outcomes against external baselines and human consistency, with no reduction of claims to inputs by construction. The work is self-contained as an empirical demonstration.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract introduces the memory architecture as the core contribution without listing explicit free parameters or background axioms; the selectivity criteria (emotional salience, scene novelty) function as domain assumptions.

invented entities (1)
  • context-selective multimodal memory architecture (no independent evidence)
    purpose: Captures and retrieves textual and visual episodic traces prioritized by emotional salience or scene novelty and associated with individual users for personalized recall.
    Presented as the novel system proposed in the paper; no independent evidence or falsifiable prediction outside the reported experiments is given.

pith-pipeline@v0.9.0 · 5519 in / 1216 out tokens · 96337 ms · 2026-05-10T15:34:00.475802+00:00 · methodology

