HYVE: Hybrid Views for LLM Context Engineering over Machine Data

Boris Sobolev; Dev Khanolkar; Fan Bu; Jason Mackay; Jian Tan; Lei Jin; Li Zhang; Yuqing Gao

arxiv: 2604.05400 · v1 · submitted 2026-04-07 · 💻 cs.AI


Pith reviewed 2026-05-10 19:11 UTC · model grok-4.3


keywords: LLM context engineering · machine data · hybrid views · token reduction · observability · preprocessing · postprocessing · structured payloads


The pith

HYVE preprocesses machine data into selective hybrid views stored in a request-scoped datastore to reduce LLM token consumption by 50-90 percent while keeping or raising output quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents HYVE as a preprocessing and postprocessing layer around LLM calls that handles large volumes of logs, metrics, traces, and configuration data. It identifies repetitive patterns in the raw input, materializes them with schema details in a temporary datastore, and supplies the model with compact hybrid columnar and row views instead of full payloads. Postprocessing then queries the datastore or runs a limited follow-up call to restore or synthesize any details left out. This setup targets the brittleness LLMs show with long, nested, repetitive machine data that drives up costs and errors in observability and diagnosis tasks. The reported benchmarks show consistent token savings alongside gains in accuracy on structured outputs like charts.
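As a rough illustration of the first preprocessing step, the sketch below walks a parsed JSON payload and reports every list of objects whose items all share one key set. The function name `find_repetitive_arrays`, the `min_rows` threshold, and the sample payload are assumptions made for this example; the paper's actual detection logic is described only as heuristic and may differ.

```python
import json
from typing import Any, Iterator


def find_repetitive_arrays(
    node: Any, path: str = "$", min_rows: int = 3
) -> Iterator[tuple[str, list[str], int]]:
    """Yield (path, shared_keys, row_count) for lists of dicts with one key set.

    Hypothetical helper: "repetitive" here means every item is a dict and all
    items have exactly the same keys. Real detection may be more permissive.
    """
    if isinstance(node, list):
        if len(node) >= min_rows and all(isinstance(x, dict) for x in node):
            key_sets = {tuple(sorted(x)) for x in node}
            if len(key_sets) == 1:
                yield path, list(key_sets.pop()), len(node)
        # Keep descending so nested repetitive arrays are also found.
        for i, item in enumerate(node):
            yield from find_repetitive_arrays(item, f"{path}[{i}]", min_rows)
    elif isinstance(node, dict):
        for key, value in node.items():
            yield from find_repetitive_arrays(value, f"{path}.{key}", min_rows)


# Illustrative payload, not taken from the paper's benchmarks.
payload = json.loads("""
{"test": {"id": "t-1"},
 "samples": [
   {"ts": 1, "latencyMs": 9.07, "loss": 0},
   {"ts": 2, "latencyMs": 9.02, "loss": 0},
   {"ts": 3, "latencyMs": 9.11, "loss": 0}
 ]}
""")
hits = list(find_repetitive_arrays(payload))
print(hits)
```

Each hit names a candidate region that could be moved into the datastore and replaced by a compact view in the prompt.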

Core claim

HYVE surrounds each model invocation with coordinated preprocessing that detects repetitive structure, stores it in a request-scoped datastore augmented with schema information, and transforms it into hybrid columnar and row-oriented views that expose only the most relevant representation to the LLM; postprocessing either returns the output directly, queries the datastore to recover omitted information, or performs a bounded additional LLM call for SQL-augmented semantic synthesis, yielding 50-90 percent token reduction and maintained or improved quality across knowledge QA, chart generation, anomaly detection, and network troubleshooting workloads.
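A minimal sketch of what "request-scoped datastore augmented with schema information" could look like, assuming an in-memory SQLite database that lives for exactly one request. The standard-library `sqlite3` module stands in for whatever engine the authors use, and `request_scoped_store` and `schema_summary` are hypothetical names introduced for this example.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def request_scoped_store(rows, table="payload"):
    """Open an in-memory database, load the rows, and close it on exit."""
    conn = sqlite3.connect(":memory:")
    try:
        columns = list(rows[0])
        col_defs = ", ".join(f'"{c}"' for c in columns)
        conn.execute(f"CREATE TABLE {table} ({col_defs})")
        placeholders = ", ".join("?" for _ in columns)
        conn.executemany(
            f"INSERT INTO {table} VALUES ({placeholders})",
            [tuple(r[c] for c in columns) for r in rows],
        )
        yield conn
    finally:
        # Nothing outlives the request: closing drops the in-memory database.
        conn.close()


def schema_summary(conn, table="payload"):
    """Return (column, storage_type_of_first_row) pairs as schema metadata."""
    cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    return [
        (c, conn.execute(f'SELECT typeof("{c}") FROM {table} LIMIT 1').fetchone()[0])
        for c in cols
    ]


# Illustrative rows, not taken from the paper.
rows = [
    {"ts": 1, "latencyMs": 9.07, "status": "ok"},
    {"ts": 2, "latencyMs": 9.02, "status": "ok"},
    {"ts": 3, "latencyMs": 9.11, "status": "ok"},
]
with request_scoped_store(rows) as conn:
    summary = schema_summary(conn)
print(summary)
```

The schema summary is the kind of small metadata that could accompany a view in the prompt so that a later SQL query can be written against known column names.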

What carries the argument

The hybrid view mechanism, which converts raw machine-data payloads into coordinated columnar and row-oriented representations held in a request-scoped datastore with schema metadata so that only selected subsets reach the LLM.
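To make the columnar versus row-oriented distinction concrete, the sketch below pivots a list of same-keyed records into a column layout, where each key is written once, and compares serialized sizes. Character count is used as a crude proxy for tokens, which is an assumption of this example. The helper names and the synthetic rows are likewise hypothetical.

```python
import json


def to_columnar(rows):
    """Pivot same-keyed dicts into {key: [values...]}; each key appears once."""
    keys = list(rows[0])
    return {k: [r[k] for r in rows] for k in keys}


def from_columnar(cols):
    """Inverse of to_columnar, showing the pivot itself loses nothing."""
    keys = list(cols)
    n = len(cols[keys[0]])
    return [{k: cols[k][i] for k in keys} for i in range(n)]


# Synthetic, illustrative records.
rows = [{"ts": i, "latencyMs": 9.0, "loss": 0} for i in range(100)]

raw = json.dumps(rows, separators=(",", ":"))
col = json.dumps(to_columnar(rows), separators=(",", ":"))

print("row-oriented chars:", len(raw))
print("columnar chars:    ", len(col))
print("lossless round trip:", from_columnar(to_columnar(rows)) == rows)
```

The saving in this toy case comes purely from not repeating key names; the token reductions reported in the paper presumably also depend on selective exposure, which this sketch does not model.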

If this is right

Token counts drop 50-90 percent on real-world observability and diagnosis workloads while output quality stays the same or rises.
Chart-generation accuracy increases by as much as 132 percent and latency falls by as much as 83 percent on structured generation tasks.
The approach approximates an effectively unbounded context window when prompts are dominated by large machine-data payloads.
The same pipeline applies to knowledge QA, anomaly detection, and multi-step network troubleshooting without task-specific redesign.
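The percentages above mix reductions and relative gains, so a short worked calculation may help. The numbers below are hypothetical and chosen only to land on the headline figures; they are not measurements from the paper. Note that a gain above 100 percent can only be a relative improvement if accuracy is a bounded score, since such a score cannot rise by more than 100 percentage points.

```python
def percent_reduction(before: float, after: float) -> float:
    """Share of the original quantity that was removed, in percent."""
    return 100.0 * (before - after) / before


def percent_improvement(baseline: float, new: float) -> float:
    """Relative gain over the baseline, in percent."""
    return 100.0 * (new - baseline) / baseline


# Hypothetical: a 10,000-token prompt shrinks to 1,000 tokens.
print(percent_reduction(10_000, 1_000))   # 90.0

# Hypothetical: 25 of 100 charts correct at baseline, 58 of 100 afterwards.
print(percent_improvement(25, 58))        # 132.0

# Hypothetical: latency falls from 6000 ms to 1020 ms.
print(percent_reduction(6000, 1020))      # 83.0
```

Under this reading, a 132 percent gain could correspond to a low baseline more than doubling, which is why the baseline value matters when interpreting the claim.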

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the datastore schema capture is extended to streaming updates, HYVE-style layers could support continuous monitoring rather than one-shot queries.
The selective exposure pattern could transfer to other repetitive structured inputs such as large codebases or scientific measurement arrays.
Failure modes would appear most clearly on inputs whose repetitive patterns are irregular or cross multiple schema boundaries that current detection misses.

Load-bearing premise

Preprocessing reliably detects repetitive structure and creates hybrid views that omit nothing the downstream LLM task needs, while postprocessing can accurately recover or synthesize the missing details.

What would settle it

Run the same machine-data input through HYVE and a direct full-context baseline on a task where a subtle repetitive pattern carries a critical diagnostic clue; if the HYVE output misses that clue or requires far more tokens than claimed, the central claim does not hold.
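One cheap precursor to that experiment is to check whether a planted clue even survives into the view that would reach the model. The sketch below is a hypothetical harness: `clue_survives`, the two view builders, and the error code `ERR_7731` are all invented for illustration, and it tests only the view, not the model's answer.

```python
def truncated_view(rows, k=3):
    """A deliberately lossy view: only the first k rows."""
    return rows[:k]


def columnar_view(rows):
    """A lossless pivot: every value kept, keys written once."""
    keys = list(rows[0])
    return {key: [r[key] for r in rows] for key in keys}


def contains_value(view, needle):
    """Recursively search nested dicts and lists for an exact value."""
    if isinstance(view, dict):
        return any(contains_value(v, needle) for v in view.values())
    if isinstance(view, list):
        return any(contains_value(v, needle) for v in view)
    return view == needle


def clue_survives(build_view, rows, clue):
    """True if the clue is still present after the view is built."""
    return contains_value(build_view(rows), clue)


# Fifty near-identical records with one planted anomaly (hypothetical data).
rows = [{"hop": i, "status": "OK"} for i in range(50)]
rows[37]["status"] = "ERR_7731"

print(clue_survives(truncated_view, rows, "ERR_7731"))
print(clue_survives(columnar_view, rows, "ERR_7731"))
```

A view that fails this check can only succeed downstream if postprocessing recovers the value from the datastore, which is the path the full experiment would need to exercise.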

Figures

Figures reproduced from arXiv: 2604.05400 by Boris Sobolev, Dev Khanolkar, Fan Bu, Jason Mackay, Jian Tan, Lei Jin, Li Zhang, Yuqing Gao.

**Figure 3.** Line chart over three years of USD exchange-rate data (778 points per …

**Figure 4.** … the resulting prompt $\mathrm{Rep}(s)$ is assembled as the following hybrid view: $t_1 \oplus \mathrm{Col}(J_1) \oplus t_2 \oplus \mathrm{Col}(J_2) \oplus \cdots \oplus t_{m+1} \oplus \mathrm{Row}(J)$, where $\oplus$ denotes string concatenation

**Figure 5.** Per-sample distributions on the Line chart dataset ( …

**Figure 6.** Per-sample distributions on the Anom dataset (n=797). HYVE …
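The concatenation in the Figure 4 caption can be sketched as follows. The caption is truncated and does not define Col or Row, so this example assumes that Col(J) is a compact columnar JSON rendering of one payload and that Row(J) is a single sample row from each payload. Both readings, and the function names, are assumptions rather than the authors' definitions.

```python
import json


def col(rows):
    """Assumed Col(J): columnar JSON for one payload of same-keyed records."""
    keys = list(rows[0])
    return json.dumps({k: [r[k] for r in rows] for k in keys}, separators=(",", ":"))


def row_sample(payloads):
    """Assumed Row(J): the first record of each payload, row-oriented."""
    return json.dumps([p[0] for p in payloads], separators=(",", ":"))


def assemble(texts, payloads):
    """Build t1 + Col(J1) + ... + tm + Col(Jm) + t(m+1) + Row(J) by concatenation."""
    if len(texts) != len(payloads) + 1:
        raise ValueError("need exactly one more text segment than payloads")
    parts = []
    for text, payload in zip(texts, payloads):
        parts += [text, col(payload)]
    parts += [texts[-1], row_sample(payloads)]
    return "".join(parts)


# Illustrative prompt pieces, not from the paper.
texts = ["Latency: ", " Loss: ", " Summarize. "]
payloads = [
    [{"ts": 1, "ms": 9.0}, {"ts": 2, "ms": 9.1}],
    [{"ts": 1, "loss": 0}, {"ts": 2, "loss": 0}],
]
rep = assemble(texts, payloads)
print(rep)
```

The natural-language segments stay in their original positions, so the model still sees each payload in the context where the prompt introduced it.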

Original abstract

Machine data is central to observability and diagnosis in modern computing systems, appearing in logs, metrics, telemetry traces, and configuration snapshots. When provided to large language models (LLMs), this data typically arrives as a mixture of natural language and structured payloads such as JSON or Python/AST literals. Yet LLMs remain brittle on such inputs, particularly when they are long, deeply nested, and dominated by repetitive structure. We present HYVE (HYbrid ViEw), a framework for LLM context engineering for inputs containing large machine-data payloads, inspired by database management principles. HYVE surrounds model invocation with coordinated preprocessing and postprocessing, centered on a request-scoped datastore augmented with schema information. During preprocessing, HYVE detects repetitive structure in raw inputs, materializes it in the datastore, transforms it into hybrid columnar and row-oriented views, and selectively exposes only the most relevant representation to the LLM. During postprocessing, HYVE either returns the model output directly, queries the datastore to recover omitted information, or performs a bounded additional LLM call for SQL-augmented semantic synthesis. We evaluate HYVE on diverse real-world workloads spanning knowledge QA, chart generation, anomaly detection, and multi-step network troubleshooting. Across these benchmarks, HYVE reduces token usage by 50-90% while maintaining or improving output quality. On structured generation tasks, it improves chart-generation accuracy by up to 132% and reduces latency by up to 83%. Overall, HYVE offers a practical approximation to an effectively unbounded context window for prompts dominated by large machine-data payloads.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

HYVE gives a practical preprocessing loop for repetitive machine data in LLM prompts, with reported token cuts of 50-90 percent, but the evaluation details are too thin to judge how reliably it avoids losing critical information.


HYVE wraps an LLM call with preprocessing that detects repetitive structure in logs, traces, and similar payloads, materializes it in a request-scoped datastore, and feeds the model only hybrid columnar or row views. Postprocessing either returns the output, queries the store, or runs a bounded follow-up call to recover details.

The framework is new in its coordinated use of database-style view materialization specifically for this class of inputs rather than generic compression or summarization tricks. It targets a real pain point: long, nested machine data that makes context windows expensive and LLMs brittle on observability tasks. The reported 50-90 percent token reduction, up to 83 percent latency drop, and accuracy gains on chart generation are the kind of numbers that would matter to people shipping diagnostic tools if they hold up. The paper does a clean job laying out the preprocessing-postprocessing loop and showing results across knowledge QA, anomaly detection, and troubleshooting workloads.

The main soft spot is that the central claim rests on the preprocessing step never dropping task-critical details that the detection heuristics miss. The abstract gives no information on how the relevance scoring works, what the exact baselines were, dataset sizes, or statistical tests, so the 132 percent accuracy lift is hard to interpret. The stress-test concern about unique error codes buried in mostly repetitive JSON is fair; without edge-case analysis or ablation on the detection logic, it is difficult to know whether the gains generalize or are benchmark-specific.

The work is aimed at engineers who already use LLMs on system data and want a concrete way to shrink context without losing too much. A reader building monitoring or diagnosis applications would find the framework description useful even if they end up re-implementing parts. It deserves peer review because the problem is timely and the method is reproducible enough to test, though any referee would likely ask for fuller experimental protocols and failure-mode analysis.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces HYVE, a framework for LLM context engineering over large machine-data payloads (logs, JSON, traces). It surrounds model calls with preprocessing that detects repetitive structure, materializes it in a request-scoped datastore with schema, exposes hybrid columnar/row views selectively to the LLM, and postprocessing that either returns output directly, queries the datastore, or performs bounded SQL-augmented synthesis. On benchmarks spanning knowledge QA, chart generation, anomaly detection, and network troubleshooting, the authors report 50-90% token reduction while maintaining or improving output quality, with up to 132% accuracy gains and 83% latency reduction on structured generation tasks.

Significance. If the empirical claims are substantiated with proper controls, HYVE would offer a practical, database-inspired technique for handling repetitive machine data in LLM prompts, effectively approximating unbounded context without full materialization. The hybrid-view and request-scoped datastore design is a clear strength and could influence context-engineering practices in observability and diagnostics applications.

major comments (2)

[Abstract / Evaluation] Abstract and Evaluation section: quantitative claims of 50-90% token reduction, 132% accuracy improvement, and 83% latency reduction are stated without any description of baselines, statistical tests, error bars, dataset sizes, or exact evaluation protocols, rendering it impossible to assess whether the numbers support the central performance claims.
[Preprocessing] Preprocessing description: the detection of repetitive structure and relevance scoring are characterized as heuristic; no analysis, formal guarantee, or edge-case evaluation is supplied for payloads containing deeply nested or partially repetitive content where task-critical non-repetitive details (e.g., a unique error code inside a mostly-repetitive JSON trace) could be omitted from the datastore and therefore lost to postprocessing recovery.
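The partially repetitive case in the second comment can be illustrated with a small partitioning sketch: records that match the most common key set go to a "regular" group, and everything else is kept as a residual that must not be dropped. The function `split_by_dominant_schema` and the sample records are hypothetical, and this is not a description of what HYVE does.

```python
from collections import Counter


def split_by_dominant_schema(records):
    """Return (dominant_keys, regular, residual) based on the most common key set."""

    def signature(record):
        return tuple(sorted(record))

    counts = Counter(signature(r) for r in records)
    dominant, _ = counts.most_common(1)[0]
    regular = [r for r in records if signature(r) == dominant]
    residual = [r for r in records if signature(r) != dominant]
    return list(dominant), regular, residual


# Twenty uniform spans plus one that carries an extra field (illustrative data).
records = [{"span": i, "ms": 5} for i in range(20)]
records.append({"span": 20, "ms": 5, "errorCode": "E42"})

keys, regular, residual = split_by_dominant_schema(records)
print(keys, len(regular), residual)
```

This catches only structural outliers. A record with the same keys but a unique value, which is the referee's exact example, would land in the regular group, so a value-level check would still be needed.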

minor comments (1)

[Framework Overview] Notation for the hybrid views and datastore schema is introduced without a compact formal definition or diagram that would clarify the columnar versus row-oriented transformations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below, agreeing where revisions are needed to improve clarity and robustness while defending the core contributions based on the manuscript's existing evaluation.


Referee: [Abstract / Evaluation] Abstract and Evaluation section: quantitative claims of 50-90% token reduction, 132% accuracy improvement, and 83% latency reduction are stated without any description of baselines, statistical tests, error bars, dataset sizes, or exact evaluation protocols, rendering it impossible to assess whether the numbers support the central performance claims.

Authors: We agree that the abstract would benefit from a brief mention of the evaluation setup to contextualize the claims. In the full manuscript, Section 4 (Evaluation) provides detailed descriptions of the benchmarks, baselines (including direct LLM prompting, token compression methods like LLMLingua, and other context engineering approaches), dataset sizes (e.g., specific numbers of traces and logs used), evaluation protocols, and statistical significance where applicable. However, to address this directly, we will revise the abstract to include a short summary of the experimental setup and add error bars and p-values to the key result tables in the revised version. The reported improvements are relative to standard prompting baselines on the same tasks.

Revision: yes

Referee: [Preprocessing] Preprocessing description: the detection of repetitive structure and relevance scoring are characterized as heuristic; no analysis, formal guarantee, or edge-case evaluation is supplied for payloads containing deeply nested or partially repetitive content where task-critical non-repetitive details (e.g., a unique error code inside a mostly-repetitive JSON trace) could be omitted from the datastore and therefore lost to postprocessing recovery.

Authors: The preprocessing steps are indeed heuristic, as we note in the manuscript, to balance efficiency and coverage. We have evaluated HYVE on real-world machine data that includes nested structures and mixed repetitive/non-repetitive elements, such as in the network troubleshooting and anomaly detection benchmarks. The hybrid views and request-scoped datastore are designed to allow postprocessing to query for omitted details when needed, and our results show maintained or improved quality, suggesting critical information is preserved. That said, we acknowledge the value of explicit edge-case analysis. In the revision, we will add a new subsection in Section 3 (Preprocessing) discussing potential failure modes for deeply nested content and include additional experiments or examples demonstrating recovery via postprocessing.

Revision: yes
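The recovery path the authors describe, where postprocessing queries the datastore for details the model never saw, might look roughly like the sketch below. It uses the standard-library `sqlite3` module as a stand-in engine; the table name `spans`, the error code, and `recover_rare_statuses` are hypothetical. Recovery of this kind works only if the detail was materialized in the store in the first place, which is the referee's point.

```python
import sqlite3

# 200 mostly identical rows with one rare status (illustrative data).
rows = [(i, "OK") for i in range(1, 201)]
rows[149] = (150, "ERR_7731")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spans (hop INTEGER, status TEXT)")
conn.executemany("INSERT INTO spans VALUES (?, ?)", rows)

# What a compact view might expose to the model: just the first few rows.
preview = conn.execute(
    "SELECT hop, status FROM spans ORDER BY hop LIMIT 3"
).fetchall()


def recover_rare_statuses(conn, max_count=1):
    """Find status values that occur at most max_count times, with first hop."""
    return conn.execute(
        "SELECT status, MIN(hop), COUNT(*) FROM spans "
        "GROUP BY status HAVING COUNT(*) <= ? ORDER BY status",
        (max_count,),
    ).fetchall()


print(preview)
print(recover_rare_statuses(conn))
```

The preview contains no trace of the rare status, yet the follow-up query surfaces it without sending the full 200 rows to the model.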

Circularity Check

0 steps flagged

No circularity: empirical performance claims with no derivations or self-referential reductions


The paper describes the HYVE framework for preprocessing machine data into hybrid views and postprocessing LLM outputs, with all central claims consisting of measured benchmark results (50-90% token reduction, up to 132% accuracy gain, 83% latency reduction) on specific workloads. No equations, first-principles derivations, fitted parameters, or predictions appear anywhere in the text. The preprocessing detection logic and relevance scoring are presented as heuristics without any claim that they are derived from or equivalent to the reported outcomes. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The evaluation is therefore self-contained against external benchmarks and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The approach rests on the assumption that machine data contains detectable repetitive structure that can be losslessly materialized and selectively re-exposed; no free parameters are explicitly fitted in the abstract, but the framework itself is an invented engineering artifact.

axioms (1)

Domain assumption: LLMs perform better or at least as well when given selectively chosen hybrid views of structured data rather than raw nested payloads.
This is the core premise that justifies the preprocessing step and is invoked throughout the described workflow.

invented entities (2)

HYVE framework (no independent evidence)
Purpose: Coordinated preprocessing and postprocessing layer for LLM context engineering on machine data.
Newly proposed end-to-end system not previously described.

Request-scoped datastore (no independent evidence)
Purpose: Temporary storage for materialized structure and schema to enable later recovery or synthesis.
Invented component central to the hybrid-view approach.




work page
[61]

[Cards Analysis](#cards-analysis)

work page
[62]

[Conversations Historical Context]

work page
[63]

AIC 10 - Sep 27, 2025 5:42 PM,

[Conclusion](#conclusion) ## Introduction This report presents an analysis of the board titled "AIC 10 - Sep 27, 2025 5:42 PM," providing insights into its components... ## Board Overview - **Name**: AIC 10 - Sep 27, 2025 5:42 PM - **Description**: The board is identified by its timestamp, suggesting it may be part of a series or larger project... G. Canv...

work page 2025
[64]

context": {

Runbook Step: Network Identification:Task Context: You are an expert in networking with a CCIE certification. You are helping with running a net- work troubleshooting run-book. Steps involve data gathering, analyzing the data and setting variables, running commands against the network, and making decisions based on the data. Input Data (excerpt): {"contex...

work page 2025
[65]

status":

ThousandEyes Analysis Summary:Task Context: You are an expert in networking with a CCIE certification. You are helping with running a network troubleshooting run-book. Please summarize the fol- lowing response that was obtained calling an API and output it in a markdown format. API Response (excerpt): {"status": "success", "message": "The analysis cannot ...

work page
[66]

Location

Variable Existence Check:Task Context: You are an expert in networking with a CCIE certification. We are executing a flow chart that corresponds to a run-book used for network trou- bleshooting. We need to figure out the truth value of the expression in the reasoning instruction to execute the flow chart. Reasoning Instruction: If "Location" is set, go to...

work page 2025
[67]

orders": [ {

Order Filtering and Counting:Input Data (excerpt): {"orders": [ {"orderId": "ORD-0001", "customer": {"id": 1, "name": "Valerie Braun", "email": "name.jones@gmail.com"}, "items": [ {"sku": "SKU-OOH73G", "name": "Widget A", "quantity": 2, "price": 29.99}, {"sku": "SKU-PLM12X", "name": "Gadget B", "quantity": 1, "price": 49.99}, ... ], "status": "processing"...

work page
[68]

Expected Answer:63

Employee Aggregation:Question: How many active employees have more than 5 years of experience? Provide only the direct answer, without any additional explanation or formatting. Expected Answer:63

work page
[69]

Expected Answer:8357.79 K

Time-Series Lookup:Question: What was the revenue on 2025-01-04? Provide only the direct answer, without any additional explanation or formatting. Expected Answer:8357.79 K. Hard Reasoning Dataset This appendix provides a representative example from the Hard multi-hop reasoning dataset

work page 2025
[70]

tests": [ {

Multi-Hop Test Discovery:Query: Find a ThousandEyes DNS test (type: dns-server) that is related to the Application (SharePoint) and is run from an agent in the Location (San Francisco). Input Data (excerpt): {"tests": [ {"testId": 264343, "testName": "A - SharePoint - DNS - Internal", "type": "dns-server", "target": "cisco.sharepoint.com", "agents": [ {"a...

work page
[71]

**Read Carefully ** Review the Question, Ground Truth Answer, and Generated Answer thoroughly

work page
[72]

Matches the Ground Truth in factual content and intent with no significant errors or omissions

**Assign a Score (1-5) ** Evaluate the Generated Answer against the Ground Truth Answer using the following rubric: * ** 5 - Excellent **: Fully correct, complete, and clearly articulated. Matches the Ground Truth in factual content and intent with no significant errors or omissions. * ** 4 - Good **: Mostly correct and covers most key points. Minor inacc...

work page
[73]

score": an integer from 1 to 5 *

**Output Final JSON ** Return a valid JSON object with exactly two keys: * "score": an integer from 1 to 5 * "justification": a brief explanation for the score Here are the inputs for you to conduct your evaluation: Question: [BEGIN QUESTION] {question} [END QUESTION] Ground Truth Answer: [BEGIN GROUND TRUTH ANSWER] {ground_truth} [END GROUND TRUTH ANSWER...

work page
[74]

* If parsing fails, assign a score of 1

**Parse and Validate JSON ** * Extract JSON from the SUT Output (strip code fences if present). * If parsing fails, assign a score of 1. * If parsing succeeds, validate against the Schema

work page
[75]

* For each Ground Truth field, check: * Field exists in the SUT

**Compare to Ground Truth ** * Ignore extra fields not in the Ground Truth. * For each Ground Truth field, check: * Field exists in the SUT. * Type matches. * Value matches, using these rules: - Strings: exact match after trimming whitespace. - Numbers/booleans: exact equality. - Arrays (structured data): same length, element- wise equality, same order. -...

work page
[76]

* ** 4 - Good **: Valid JSON; schema-valid; most fields correct with minor omissions; no contradictions

**Assign a Score (1-5) ** * ** 5 - Excellent **: Valid JSON; schema-valid; all fields match (or only negligible paraphrasing). * ** 4 - Good **: Valid JSON; schema-valid; most fields correct with minor omissions; no contradictions. * ** 3 - Fair **: Valid JSON; schema-valid; some fields correct, but notable errors or omissions. * ** 2 - Poor **: Valid JSO...

work page
[77]

score": integer 1-5. *

**Output Final JSON ** * "score": integer 1-5. * "justification": brief explanation citing specific issues. Here are the inputs for you to evaluate: Ground Truth JSON: [BEGIN GROUND TRUTH] {ground_truth} [END GROUND TRUTH] SUT Output (to be parsed and validated): [BEGIN SUT OUTPUT] {sut_output} [END SUT OUTPUT] JSON Schema to validate the SUT Output again...

work page

This appendix is organized into three parts. We first present representative benchmark examples to ground the tasks summarized in the main body, then list the evaluation prompts used ...

CCNA-Level Example (Entry): Question: What command is used on a Windows PC to display IP-to-MAC address mappings? Answer: arp -a

CCNP-Level Example (Advanced): Question: What is Generic Routing Encapsulation (GRE), and what was its original purpose? Answer: GRE is a tunneling protocol that encapsulates packets over an IP-based network. It was originally created to provide transport for non-routable legacy protocols (like IPX) across an IP network.

CCIE-Level Example (Expert, Open-Ended): Question: Cisco Jabber clients need to be able to reach several different applications to provide access to services such as voicemail, meetings, directories, and other functions. Which profile must be configured to provide these services? Answer: Service profile

CCIE-Level Example (Expert, Multiple-Choice): Question: What is the deployment model of the Cisco Secure Network Analytics Cognitive Analytics system? (a) as a plug-in in the Cisco Secure Network Analytics Management Console (b) in the public cloud (SaaS) (c) on-premise as a dedicated appliance (d) on-premise as a virtual machine Answer: (b) in the public ... The service is deployed in the Cisco cloud and other deployment options are not possible.

Expert-Tiered Example (Advanced Topics): Question: In BGP implementations, what attribute is used to influence inbound traffic from neighbouring ASes? Answer: AS-PATH prepending.

B. Runbook Dataset

This appendix provides a representative example from the Runbook dataset.

StackWise Upgrade Troubleshooting: Problem Description:

• Install mode in stackwise standard upgrade procedure
• Install mode in stackwise, if one machine (non-active switch) prompts v-mismatch. How to correctly upgrade this device?
• If using a 4-switch stack, but these 4 switches have different versions, can all switches be upgraded directly to the specified version through the active switch?

Ground-Truth Runbook (excerpt):

StackWise Upgrade Troubleshooting Summary: This playbook outlines the troubleshooting steps for upgrading a Catalyst 9300 series switch stack. The focus is on addressing potential issues encountered during the upgrade process, such as version mismatches and ensuring successful upgrades across all stack members ...

Initial Assessment:
• Access the Active Switch CLI: Establish connectivity to the active switch within the stack. Use Telnet or SSH to access the CLI.
• Verify Stack Status: Execute show stack status to confirm the operational state of the stack.
• Check Software Versions: Run show version on the active switch to identify the current software version.

Version Mismatch Resolution:
• If a non-active switch shows v-mismatch, use install add file <image> activate commit from the active switch.
• The active switch will propagate the image to mismatched members automatically.

Multi-Version Stack Upgrade:
• Yes, all switches can be upgraded from the active switch using install mode.
• Execute install add file flash:<image> activate commit to upgrade all stack members simultaneously.

C. Line Chart Dataset

This appendix provides a representative example from the Line chart dataset.

Line Chart Example: Input Data (excerpt):

[ {"endTs": "2025-06-14T22:32:00Z", "jitter": 0.23, "goodput": 100000, "startTs": "2025-06-14T22:31:00Z", "latencyMs": 9.07, "lossPercent": 0}, {"endTs": "2025-06-14T22:34:00Z", "jitter": 0.11, "goodput": 100000, "startTs": "2025-06-14T22:33:00Z", "latencyMs": 9.02, "lossPercent": 0}, ... ]

Expected Output (excerpt...
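The records in this excerpt repeat the same six keys in every row. As a rough illustration of why a column-oriented view of such a payload can be more compact, the sketch below pivots the two visible rows into columns and compares serialized sizes. The helper name `to_columnar` is hypothetical, character count is only a crude stand-in for token count, and this is not the authors' implementation.

```python
import json

def to_columnar(rows):
    """Pivot a list of flat dicts into {key: [values...]}, preserving row order."""
    keys = list(rows[0].keys()) if rows else []
    return {key: [row.get(key) for row in rows] for key in keys}

# The two rows visible in the excerpt above.
rows = [
    {"endTs": "2025-06-14T22:32:00Z", "jitter": 0.23, "goodput": 100000,
     "startTs": "2025-06-14T22:31:00Z", "latencyMs": 9.07, "lossPercent": 0},
    {"endTs": "2025-06-14T22:34:00Z", "jitter": 0.11, "goodput": 100000,
     "startTs": "2025-06-14T22:33:00Z", "latencyMs": 9.02, "lossPercent": 0},
]

columns = to_columnar(rows)

# Compact serialization of both layouts; each key appears once in the columnar form.
row_chars = len(json.dumps(rows, separators=(",", ":")))
col_chars = len(json.dumps(columns, separators=(",", ":")))
print(columns["latencyMs"])
print(row_chars, col_chars)
```

The saving grows with the number of rows, since each key is written once per column instead of once per record.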

Bar Chart Example: Input Data (excerpt):

[ {"peer": 59, "expansionism": "sediment"}, {"peer": 98, "expansionism": "nonsense"}, {"peer": 81, "expansionism": "cloud"}, ... ]

Expected Output (excerpt):

{"data": {"props": {"data": [ {"stacks": [{"key": "peer", "value": 59}], "category": "sediment"}, {"stacks": [{"key": "peer", "value": 98}], "category": "nonse...
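The mapping from input rows to the visible part of the expected output can be sketched as follows. The function `to_bar_items` and its parameter names are hypothetical, and only the fields shown in the excerpt are reproduced; whatever the full expected output contains beyond the truncation is not modeled here.

```python
def to_bar_items(rows, value_key, category_key):
    """Build one bar item per row: a single-entry 'stacks' list plus a category."""
    return [
        {
            "stacks": [{"key": value_key, "value": row[value_key]}],
            "category": row[category_key],
        }
        for row in rows
    ]

# The three rows visible in the excerpt above.
rows = [
    {"peer": 59, "expansionism": "sediment"},
    {"peer": 98, "expansionism": "nonsense"},
    {"peer": 81, "expansionism": "cloud"},
]

items = to_bar_items(rows, value_key="peer", category_key="expansionism")

# Nesting copied from the visible prefix of the expected output.
chart = {"data": {"props": {"data": items}}}
print(chart["data"]["props"]["data"][0])
```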

Network Path Latency Analysis: Task Instruction: Use the Network path data to identify IF there are any specific high latency nodes. Provide a summary to the user of impacted nodes.

Input Data (excerpt):

{"test": {"testId": "281208", "testName": "Synthetic Network Test", "type": "network"}, "pathVis": [{ "agent": {"agentName": "agent-89", "countryId": "JP"...

Board Report Generation: Task Instruction: Given a JSON object that contains the current context of the ‘board’, generate a holistic report that derives key data points, insights, timelines, incident details, resolutions, or root cause analysis. Organize your response with clearly defined sections and a table of contents.

Input Data (excerpt):

{"board": ...

[Introduction](#introduction)
[Board Overview](#board-overview)
[Canvas Details](#canvas-details)
[Cards Analysis](#cards-analysis)
[Conversations Historical Context]
[Conclusion](#conclusion)

## Introduction

This report presents an analysis of the board titled "AIC 10 - Sep 27, 2025 5:42 PM," providing insights into its components...

## Board Overview

- **Name**: AIC 10 - Sep 27, 2025 5:42 PM
- **Description**: The board is identified by its timestamp, suggesting it may be part of a series or larger project...

G. Canv...

Runbook Step: Network Identification: Task Context: You are an expert in networking with a CCIE certification. You are helping with running a network troubleshooting run-book. Steps involve data gathering, analyzing the data and setting variables, running commands against the network, and making decisions based on the data.

Input Data (excerpt):

{"contex...

ThousandEyes Analysis Summary: Task Context: You are an expert in networking with a CCIE certification. You are helping with running a network troubleshooting run-book. Please summarize the following response that was obtained calling an API and output it in a markdown format.

API Response (excerpt):

{"status": "success", "message": "The analysis cannot ...

Variable Existence Check: Task Context: You are an expert in networking with a CCIE certification. We are executing a flow chart that corresponds to a run-book used for network troubleshooting. We need to figure out the truth value of the expression in the reasoning instruction to execute the flow chart.

Reasoning Instruction: If "Location" is set, go to...

Order Filtering and Counting: Input Data (excerpt):

{"orders": [ {"orderId": "ORD-0001", "customer": {"id": 1, "name": "Valerie Braun", "email": "name.jones@gmail.com"}, "items": [ {"sku": "SKU-OOH73G", "name": "Widget A", "quantity": 2, "price": 29.99}, {"sku": "SKU-PLM12X", "name": "Gadget B", "quantity": 1, "price": 49.99}, ... ], "status": "processing"...

Employee Aggregation: Question: How many active employees have more than 5 years of experience? Provide only the direct answer, without any additional explanation or formatting. Expected Answer: 63
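A question of this shape reduces to a filtered count once the records sit in a queryable store. The sketch below illustrates that reduction with Python's built-in `sqlite3` module as a stand-in for whichever SQL engine is used; the table name, column names, and the four sample rows are all hypothetical, since the dataset's schema is not shown in the excerpt, and the result is not meant to reproduce the expected answer above.

```python
import sqlite3

# In-memory database with a hypothetical schema for the employee records.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (name TEXT, status TEXT, years_experience REAL)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("a", "active", 7.5),     # counted
        ("b", "active", 5.0),     # not counted: exactly 5 is not "more than 5"
        ("c", "inactive", 12.0),  # not counted: inactive
        ("d", "active", 6.0),     # counted
    ],
)

# "How many active employees have more than 5 years of experience?"
(count,) = conn.execute(
    "SELECT COUNT(*) FROM employees "
    "WHERE status = 'active' AND years_experience > 5"
).fetchone()
print(count)
```

Computing the count in the store rather than asking the model to scan every row keeps the answer exact regardless of how many records the payload holds.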

Time-Series Lookup: Question: What was the revenue on 2025-01-04? Provide only the direct answer, without any additional explanation or formatting. Expected Answer: 8357.79

K. Hard Reasoning Dataset

This appendix provides a representative example from the Hard multi-hop reasoning dataset.

Multi-Hop Test Discovery: Query: Find a ThousandEyes DNS test (type: dns-server) that is related to the Application (SharePoint) and is run from an agent in the Location (San Francisco).

Input Data (excerpt):

{"tests": [ {"testId": 264343, "testName": "A - SharePoint - DNS - Internal", "type": "dns-server", "target": "cisco.sharepoint.com", "agents": [ {"a...
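The query combines three conditions: a test type, an application, and an agent location. A minimal sketch of that conjunction over a list of test records is given below. The agent field name `location`, the use of a case-insensitive substring match on `testName` to decide that a test relates to an application, the function name `find_tests`, and the sample records with made-up test IDs are all assumptions, because the agent objects are truncated in the excerpt.

```python
def find_tests(tests, test_type, app_keyword, location):
    """Return the testIds of tests that satisfy all three conditions."""
    matches = []
    for test in tests:
        # Condition 1: the test type must match exactly.
        if test.get("type") != test_type:
            continue
        # Condition 2 (assumed heuristic): the application name appears in testName.
        if app_keyword.lower() not in test.get("testName", "").lower():
            continue
        # Condition 3: at least one agent sits in the requested location.
        if any(agent.get("location") == location for agent in test.get("agents", [])):
            matches.append(test["testId"])
    return matches

# Hypothetical records; IDs and agent fields are invented for illustration.
tests = [
    {"testId": 1, "testName": "A - SharePoint - DNS - Internal", "type": "dns-server",
     "agents": [{"agentName": "agent-1", "location": "San Francisco"}]},
    {"testId": 2, "testName": "B - SharePoint - HTTP", "type": "http-server",
     "agents": [{"agentName": "agent-1", "location": "San Francisco"}]},
    {"testId": 3, "testName": "C - SharePoint - DNS - External", "type": "dns-server",
     "agents": [{"agentName": "agent-2", "location": "London"}]},
]

result = find_tests(tests, "dns-server", "SharePoint", "San Francisco")
print(result)
```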

**Read Carefully** Review the Question, Ground Truth Answer, and Generated Answer thoroughly.

**Assign a Score (1-5)** Evaluate the Generated Answer against the Ground Truth Answer using the following rubric:
* **5 - Excellent**: Fully correct, complete, and clearly articulated. Matches the Ground Truth in factual content and intent with no significant errors or omissions.
* **4 - Good**: Mostly correct and covers most key points. Minor inacc...

**Output Final JSON** Return a valid JSON object with exactly two keys:
* "score": an integer from 1 to 5
* "justification": a brief explanation for the score

Here are the inputs for you to conduct your evaluation:
Question: [BEGIN QUESTION] {question} [END QUESTION]
Ground Truth Answer: [BEGIN GROUND TRUTH ANSWER] {ground_truth} [END GROUND TRUTH ANSWER...
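A harness that consumes this judge prompt needs to check that the reply really is a JSON object with exactly the two requested keys and a score in range. The sketch below shows one way to do that; the function name `parse_judge_output` and the choice to raise `ValueError` on any violation are assumptions, not something the prompt prescribes.

```python
import json

def parse_judge_output(text):
    """Validate a judge reply and return (score, justification)."""
    obj = json.loads(text)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(obj, dict) or set(obj) != {"score", "justification"}:
        raise ValueError("expected exactly the keys 'score' and 'justification'")
    score = obj["score"]
    # bool is a subclass of int in Python, so reject True/False explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError("score must be an integer from 1 to 5")
    if not isinstance(obj["justification"], str):
        raise ValueError("justification must be a string")
    return score, obj["justification"]

score, justification = parse_judge_output(
    '{"score": 4, "justification": "Mostly correct."}'
)
print(score, justification)
```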

**Parse and Validate JSON**
* Extract JSON from the SUT Output (strip code fences if present).
* If parsing fails, assign a score of 1.
* If parsing succeeds, validate against the Schema.

**Compare to Ground Truth**
* Ignore extra fields not in the Ground Truth.
* For each Ground Truth field, check:
  * Field exists in the SUT.
  * Type matches.
  * Value matches, using these rules:
    - Strings: exact match after trimming whitespace.
    - Numbers/booleans: exact equality.
    - Arrays (structured data): same length, element-wise equality, same order.
    -...

**Assign a Score (1-5)**
* **5 - Excellent**: Valid JSON; schema-valid; all fields match (or only negligible paraphrasing).
* **4 - Good**: Valid JSON; schema-valid; most fields correct with minor omissions; no contradictions.
* **3 - Fair**: Valid JSON; schema-valid; some fields correct, but notable errors or omissions.
* **2 - Poor**: Valid JSO...

**Output Final JSON**
* "score": integer 1-5.
* "justification": brief explanation citing specific issues.

Here are the inputs for you to evaluate:
Ground Truth JSON: [BEGIN GROUND TRUTH] {ground_truth} [END GROUND TRUTH]
SUT Output (to be parsed and validated): [BEGIN SUT OUTPUT] {sut_output} [END SUT OUTPUT]
JSON Schema to validate the SUT Output again...
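The parsing and comparison rules in this prompt are mechanical enough to sketch in code, which can be useful as a deterministic cross-check on the LLM judge. The sketch below covers only the rules visible in the excerpt (fence stripping, string trimming, exact equality for numbers and booleans, ordered element-wise array comparison, ignoring extra fields). The function names, the recursive treatment of nested objects, and the decision to let an integer equal a float of the same value are assumptions; schema validation and the rubric scoring are not modeled.

```python
import json
import re

# Matches a reply wrapped in a single Markdown code fence, with optional language tag.
FENCE = re.compile(r"^```[a-zA-Z]*\s*\n(.*?)\n?```\s*$", re.DOTALL)

def extract_json(text):
    """Strip a surrounding code fence if present; return parsed JSON or None."""
    text = text.strip()
    fenced = FENCE.match(text)
    if fenced:
        text = fenced.group(1)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None  # the prompt assigns a score of 1 in this case

def values_match(expected, actual):
    if isinstance(expected, str):
        return isinstance(actual, str) and expected.strip() == actual.strip()
    if isinstance(expected, bool) or isinstance(actual, bool):
        # Keep booleans apart from numbers: True must not match 1.
        return type(expected) is type(actual) and expected == actual
    if isinstance(expected, (int, float)):
        return isinstance(actual, (int, float)) and expected == actual
    if isinstance(expected, list):
        return (isinstance(actual, list) and len(expected) == len(actual)
                and all(values_match(e, a) for e, a in zip(expected, actual)))
    if isinstance(expected, dict):  # assumed rule: recurse into nested objects
        return isinstance(actual, dict) and not mismatched_fields(expected, actual)
    return expected == actual  # remaining case: null

def mismatched_fields(ground_truth, sut):
    """List Ground Truth keys that are missing or wrong in the SUT; extras are ignored."""
    if not isinstance(sut, dict):
        return list(ground_truth)
    return [key for key, value in ground_truth.items()
            if key not in sut or not values_match(value, sut[key])]

truth = {"name": "eth0", "up": True, "mtu": 1500, "tags": ["a", "b"]}
reply = '```json\n{"name": " eth0 ", "up": true, "mtu": 1500, "tags": ["b", "a"], "x": 1}\n```'
print(mismatched_fields(truth, extract_json(reply)))
```

Here only `tags` is reported: the padded string matches after trimming, the extra field `x` is ignored, and the array fails because its elements are in a different order.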