Hierarchical Long-Term Semantic Memory for LinkedIn's Hiring Agent
Pith reviewed 2026-05-07 13:28 UTC · model grok-4.3
The pith
A schema-aligned hierarchical memory tree lets LLM agents store and retrieve long-term semantic knowledge with over 10% gains in correctness and retrieval quality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that a schema-aligned tree holding long-term semantic memory at multiple granularities, combined with an adaptation mechanism, jointly solves four problems: scalable ingestion, privacy-aware storage, low-latency retrieval, and observable provenance. The reported result is more than 10% higher answer correctness and retrieval F1, together with an improved Pareto frontier between query latency and indexing latency.
What carries the argument
The schema-aligned memory tree that stores semantic knowledge at multiple levels of granularity and incorporates an adaptation mechanism for cross-domain use.
If this is right
- Ingestion of noisy longitudinal behavioral data becomes scalable because the tree grows incrementally along schema paths.
- Storage can remain privacy-aware since only the structured nodes, not raw logs, need to be kept.
- Retrieval latency drops because queries can target the appropriate granularity level instead of scanning everything.
- Provenance stays transparent because every retrieved fact traces back to its originating node and schema.
- The adaptation mechanism allows the same tree structure to be reused in new domains with limited additional tuning.
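The consequences listed above can be made concrete with a minimal sketch. It is not HLTM's implementation, which this page does not describe; `MemoryNode`, `MemoryTree`, `ingest`, `retrieve`, and the representation of a schema as a set of path tuples are all hypothetical names chosen for illustration. The sketch only shows how incremental growth along schema paths, granularity-targeted retrieval, and per-fact provenance could fit together in one structure.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryNode:
    """One node of a hypothetical schema-aligned memory tree."""
    name: str
    facts: list = field(default_factory=list)     # (text, source_id) pairs
    children: dict = field(default_factory=dict)  # child name -> MemoryNode


class MemoryTree:
    def __init__(self, schema):
        # Assumed schema format: a set of allowed paths, each a tuple of names.
        self.schema = set(schema)
        self.root = MemoryNode("root")

    def ingest(self, path, text, source_id):
        """Store a fact at a schema path, creating nodes only as needed."""
        path = tuple(path)
        if path not in self.schema:
            raise ValueError(f"path {path} is not in the schema")
        node = self.root
        for name in path:
            node = node.children.setdefault(name, MemoryNode(name))
        node.facts.append((text, source_id))

    def retrieve(self, path):
        """Return every fact at or below `path`, each with its provenance."""
        node = self.root
        for name in path:
            node = node.children.get(name)
            if node is None:
                return []
        results, stack = [], [(tuple(path), node)]
        while stack:
            current_path, current = stack.pop()
            for text, source_id in current.facts:
                results.append(
                    {"path": current_path, "text": text, "source": source_id}
                )
            for child in current.children.values():
                stack.append((current_path + (child.name,), child))
        return results
```

A query for a short path such as `("preferences",)` returns everything beneath that node, while a longer path such as `("preferences", "location")` narrows the result to one fine-grained node. Each result carries the path and source identifier it came from, which is the kind of traceability the provenance bullet describes.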
Where Pith is reading between the lines
- The tree structure might naturally support selective forgetting or data minimization, which could help satisfy stricter privacy rules without extra engineering.
- Similar hierarchical organization could be tested in other agent settings that accumulate user history, such as personal scheduling or customer-support assistants.
- The latency gains suggest that indexing cost might stay manageable even as the number of users grows, provided the schema remains stable.
- If the adaptation step can be made fully automatic, the framework could reduce the need for per-domain engineering effort.
Load-bearing premise
The schema-aligned memory tree and adaptation mechanism will work across many different applications, and the gains reported on internal data will hold when baselines and data splits are chosen independently.
What would settle it
Applying the same tree construction and retrieval procedure to an independent, publicly available long-term memory benchmark and measuring no improvement in correctness or in the latency trade-off would falsify the central performance claim.
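Checking the latency trade-off in such a test amounts to asking whether a system's operating points are dominated by a baseline's. The sketch below is one way to compute a Pareto frontier over (indexing latency, query latency) pairs, assuming lower is better on both axes. The function name `pareto_frontier` and the tuple layout are illustrative assumptions, and the numbers in the usage note are made up, not taken from the paper.

```python
def pareto_frontier(points):
    """Return the points not dominated by any other point.

    points: list of (name, indexing_latency, query_latency) tuples,
    where lower is better on both latency axes.
    """
    frontier = []
    for name, idx, qry in points:
        # A point is dominated if another point is at least as good on both
        # axes and strictly better on at least one.
        dominated = any(
            (o_idx <= idx and o_qry <= qry) and (o_idx < idx or o_qry < qry)
            for _, o_idx, o_qry in points
        )
        if not dominated:
            frontier.append((name, idx, qry))
    return sorted(frontier, key=lambda p: p[1])
```

For example, with the invented points `("a", 1.0, 5.0)`, `("b", 2.0, 2.0)`, `("c", 3.0, 3.0)`, and `("d", 4.0, 1.0)`, configuration `c` drops out because `b` is faster on both axes. A new system advances the frontier only if at least one of its configurations survives this filter when pooled with the baselines.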
Original abstract
Large Language Model (LLM) agents are increasingly used in real-world products, where personalized and context-aware user interactions are essential. A central enabler of such capabilities is the agent's long-term semantic memory system, which extracts implicit and explicit signals from noisy longitudinal behavioral data, stores them in a structured form, and supports low-latency retrieval. Building industrial-grade long-term memory for LLM agents raises five challenges: scalability, low-latency retrieval, privacy constraints, cross-domain generalizability, and observability. We introduce the Hierarchical Long-Term Semantic Memory (HLTM) framework, which organizes textual data into a schema-aligned memory tree that captures semantic knowledge at multiple levels of granularity, enabling scalable ingestion, privacy-aware storage, low-latency retrieval, and transparent provenance; HLTM further incorporates an adaptation mechanism to generalize across diverse use cases. Extensive evaluations on LinkedIn's Hiring Assistant show that HLTM improves answer correctness and retrieval F1 significantly by more than 10%, while significantly advancing the Pareto frontier between query and indexing latency. HLTM has been deployed in LinkedIn's Hiring Assistant to power core personalization features in production hiring workflows.
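The abstract reports retrieval F1 without defining it on this page. A common set-based definition is the harmonic mean of precision and recall over retrieved versus relevant items; the sketch below assumes that definition, which may differ from the one the paper uses. The function name `retrieval_f1` and the use of item identifiers are illustrative assumptions.

```python
def retrieval_f1(retrieved, relevant):
    """Set-based F1 between retrieved item ids and relevant (gold) item ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    if hits == 0:
        # Covers empty inputs and the no-overlap case without dividing by zero.
        return 0.0
    precision = hits / len(retrieved)
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)
```

As a worked case, retrieving `["a", "b", "c"]` against the gold set `["b", "c", "d", "e"]` gives precision 2/3 and recall 1/2, so F1 is 4/7.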
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Hierarchical Long-Term Semantic Memory (HLTM) framework for LLM agents, which structures longitudinal behavioral data into a schema-aligned memory tree supporting multi-granularity semantic knowledge. This addresses industrial challenges including scalability, low-latency retrieval, privacy, cross-domain generalizability, and observability, with an adaptation mechanism for diverse use cases. Evaluations on LinkedIn's Hiring Assistant data report >10% gains in answer correctness and retrieval F1, plus Pareto improvements in query/indexing latency; the system is deployed in production for personalization in hiring workflows.
Significance. If the empirical results hold under scrutiny, the work is significant for industrial information retrieval and LLM agent systems. It offers a deployable solution to long-term memory challenges with explicit attention to privacy and latency trade-offs, backed by real-world production use at LinkedIn. This provides a concrete reference point for similar personalization tasks in hiring and recommendation domains.
major comments (2)
- [Abstract / Evaluation] Abstract and Evaluation section: The central claims of >10% improvements in answer correctness and retrieval F1 (plus Pareto frontier advance) are stated without any reported details on test-set size, query distribution, baseline definitions (e.g., standard RAG or prior memory systems), statistical tests, ablation results, or data characteristics. This omission is load-bearing because the headline performance gains cannot be verified as robust rather than artifacts of data selection or weak baselines.
- [HLTM Framework] Framework description (likely §3): The adaptation mechanism is asserted to enable generalization across use cases, yet no cross-domain, hold-out, or external validation experiments are described to support this claim, leaving the generalizability assertion unsupported by evidence.
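The first comment asks for statistical tests behind the headline gains. One option, sketched here under assumptions, is paired bootstrap resampling over per-query scores for two systems evaluated on the same queries. Nothing on this page indicates which test the authors used; `paired_bootstrap` and its parameters are hypothetical names, and the returned fraction is only an approximate one-sided indicator of how often system A fails to beat system B.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Fraction of resamples in which A's total score does not exceed B's.

    scores_a, scores_b: per-query scores for the same queries, in the same order.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    not_better = 0
    for _ in range(n_resamples):
        # Resample queries with replacement, keeping each A/B pair together.
        total = sum(diffs[rng.randrange(n)] for _ in range(n))
        if total <= 0:
            not_better += 1
    return not_better / n_resamples
```

A small returned value suggests the gain is unlikely to be an artifact of which queries happened to be sampled. It says nothing about whether the baselines were strong, which is the other half of the referee's concern.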
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important areas for strengthening the presentation of our empirical results and the generalizability discussion. We have revised the manuscript to provide additional context and clarifications while respecting the proprietary constraints of the LinkedIn production data.
- Referee: [Abstract / Evaluation] Abstract and Evaluation section: The central claims of >10% improvements in answer correctness and retrieval F1 (plus Pareto frontier advance) are stated without any reported details on test-set size, query distribution, baseline definitions (e.g., standard RAG or prior memory systems), statistical tests, ablation results, or data characteristics. This omission is load-bearing because the headline performance gains cannot be verified as robust rather than artifacts of data selection or weak baselines.
  Authors: We agree that greater transparency on the experimental setup is warranted. In the revised manuscript, we have expanded the Evaluation section (and updated the abstract for consistency) to report test-set size, high-level query characteristics, explicit baseline definitions (including standard RAG and prior memory systems), statistical significance testing, and ablation studies isolating the contribution of the hierarchical structure and adaptation mechanism. Due to privacy and proprietary constraints, we report aggregated statistics rather than raw query distributions or individual examples.
  Revision: yes
- Referee: [HLTM Framework] Framework description (likely §3): The adaptation mechanism is asserted to enable generalization across use cases, yet no cross-domain, hold-out, or external validation experiments are described to support this claim, leaving the generalizability assertion unsupported by evidence.
  Authors: We acknowledge that the generalizability claim would be strengthened by additional empirical validation. The current work evaluates HLTM on LinkedIn's Hiring Assistant, a complex production setting. The adaptation mechanism is presented in Section 3 as a modular, schema-driven component intended to support diverse domains. In the revision, we have added a new subsection in the Discussion that explicitly addresses design choices supporting generalization, outlines how the mechanism can be applied to other use cases, and states the limitations of validating only within the hiring domain.
  Revision: partial
- Full raw query distributions and per-user data characteristics, which cannot be disclosed due to LinkedIn's privacy policies and data protection regulations.
Circularity Check
No circularity: claims rest on empirical system evaluation without self-referential derivations
The paper describes the HLTM framework as a hierarchical memory tree with schema alignment and an adaptation mechanism for LLM agents in hiring workflows. Central performance claims (>10% gains in correctness and F1, plus Pareto latency improvements) are presented as results of extensive evaluations on LinkedIn's internal Hiring Assistant data and production deployment. No equations, fitted parameters, predictions, or uniqueness theorems appear in the abstract or described structure that could reduce to inputs by construction. No self-citations are invoked as load-bearing justification for core premises. The argument therefore stands as an engineering contribution whose claims rest on measurement rather than on its own definitions.