pith · machine review for the scientific record

arXiv: 2605.11093 · v1 · submitted 2026-05-11 · 💻 cs.LG · cs.AI · cs.PF · cs.SE · cs.SY · eess.SY

Recognition: 2 theorem links · Lean Theorem

Enabling Performant and Flexible Model-Internal Observability for LLM Inference

Nengneng Yu, Sixian Xiong, Wei Wang, Yibo Zhao, Zaoxing Liu

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 07:12 UTC · model grok-4.3

classification: 💻 cs.LG · cs.AI · cs.PF · cs.SE · cs.SY · eess.SY
keywords: LLM inference · model-internal observability · DMI-Lib · asynchronous tensor staging · GPU-CPU memory abstraction · inference overhead · deep model inspector · Ring^2

The pith

DMI-Lib decouples internal LLM state access from the inference hot path to keep overhead under 7 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DMI-Lib as a system that makes access to a model's internal states during LLM inference a built-in capability rather than an add-on. It does so by creating an asynchronous layer using Ring^2 to move tensors from GPU to CPU and a policy-controlled backend to handle export without touching the main computation. This design supports observation points on many different signals and backends while staying inside GPU memory limits and keeping existing optimizations. A reader would care because inference workloads now routinely need timely internal data for monitoring and debugging, yet most current methods add noticeable slowdowns. Experiments back the approach by reporting 0.4 to 6.8 percent overhead in batch mode and 6 percent on average in online serving, with 2x to 15x lower latency cost than prior tools offering comparable features.
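To make the decoupling concrete, here is a minimal editorial sketch in PyTorch, not DMI-Lib's actual API: a forward hook copies a captured activation to pinned host memory on a side CUDA stream and hands it to a background exporter, so the compute stream never waits on the export path. Every name here (make_hook, export_queue, the exporter thread) is hypothetical.

```python
import queue
import threading
import torch

# Bounded hand-off queue and a dedicated CUDA stream for capture traffic.
export_queue = queue.Queue(maxsize=256)
capture_stream = torch.cuda.Stream()

def make_hook(name):
    def hook(module, inputs, output):
        t = output[0] if isinstance(output, tuple) else output
        capture_stream.wait_stream(torch.cuda.current_stream())  # tensor is fully produced
        with torch.cuda.stream(capture_stream):
            # Asynchronous device-to-host copy into pinned memory; the compute
            # stream keeps running while the copy is in flight. A real system
            # must also keep `t` alive until the copy completes.
            host = torch.empty(t.shape, dtype=t.dtype, device="cpu", pin_memory=True)
            host.copy_(t, non_blocking=True)
            done = torch.cuda.Event()
            done.record(capture_stream)
        try:
            export_queue.put_nowait((name, host, done))  # hand off to the exporter
        except queue.Full:
            pass  # best effort: drop rather than stall inference
    return hook

def exporter():
    while True:
        name, host, done = export_queue.get()
        done.synchronize()  # wait for the copy off the hot path
        # ... persist or stream `host` to a backend here ...

threading.Thread(target=exporter, daemon=True).start()
# Example (hypothetical module path):
# model.model.layers[10].register_forward_hook(make_hook("layer10.hidden"))
```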

Core claim

DMI-Lib treats internal observability as a first-class systems primitive, decoupling it from the inference hot path via an asynchronous observability substrate built from Ring^2, a GPU-CPU memory abstraction for capturing and staging tensors, and a policy-controlled host backend that exports them. This enables the placement of observation points across a rich space of internal signals and diverse inference backends while preserving serving optimizations and adhering to tight GPU memory budgets.

What carries the argument

Ring^2, the GPU-CPU memory abstraction for asynchronous tensor capture and staging, together with the policy-controlled host backend for export.
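The paper presents Ring^2 as a bounded, GPU-CPU split staging layer (Figure 4). As an editorial sketch only, assuming a single producer (the capture path) and a single consumer (the exporter), such a layer might look like the class below; the name StagingRing, the slot sizes, and the drop-when-full behavior are illustrative, not the library's implementation.

```python
import torch

class StagingRing:
    """Editorial sketch of a bounded GPU-to-CPU staging ring in the spirit of
    Ring^2 (not the library's implementation): pre-allocated pinned-host slots,
    filled by asynchronous copies on a side stream, drained by a host consumer.
    Assumes one producer (the capture path) and one consumer (the exporter)."""

    def __init__(self, slots=16, slot_bytes=1 << 22):
        self.buffers = [torch.empty(slot_bytes, dtype=torch.uint8, pin_memory=True)
                        for _ in range(slots)]
        self.events = [torch.cuda.Event() for _ in range(slots)]
        self.meta = [None] * slots
        self.head = 0                       # producer: next slot to fill
        self.tail = 0                       # consumer: next slot to drain
        self.stream = torch.cuda.Stream()   # copies never run on the compute stream

    def try_put(self, name, tensor):
        """Stage one captured tensor; returns False (drop) instead of ever blocking."""
        nxt = (self.head + 1) % len(self.buffers)
        flat = tensor.reshape(-1).view(torch.uint8)      # raw bytes of the capture
        if nxt == self.tail or flat.numel() > self.buffers[0].numel():
            return False                                 # ring full or slot too small
        self.stream.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(self.stream):
            self.buffers[self.head][: flat.numel()].copy_(flat, non_blocking=True)
            self.events[self.head].record()              # marks copy completion
        self.meta[self.head] = (name, tensor.shape, tensor.dtype, flat.numel())
        self.head = nxt
        return True

    def try_get(self):
        """Return the oldest fully staged record, or None if nothing is ready."""
        if self.tail == self.head or not self.events[self.tail].query():
            return None
        name, shape, dtype, nbytes = self.meta[self.tail]
        host = self.buffers[self.tail][:nbytes].clone().view(dtype).reshape(shape)
        self.tail = (self.tail + 1) % len(self.buffers)
        return name, host
```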

If this is right

  • Observation points can be placed flexibly across many internal signals without disrupting the main inference computation.
  • GPU memory budgets and existing serving optimizations remain intact during observability operations.
  • Offline batch inference runs with 0.4 to 6.8 percent added overhead.
  • Moderate online serving runs with roughly 6 percent average overhead.
  • Latency overhead drops by a factor of 2 to 15 compared with prior systems that provide similar observability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decoupling pattern could support continuous internal-state streaming for production safety checks or model calibration without pausing inference.
  • Ring^2-style abstractions may transfer to other latency-sensitive systems that need auxiliary data movement without contention on the critical path.
  • Implementation on newer accelerator types would test whether the GPU-CPU staging logic generalizes beyond current hardware.
  • Connection to distributed tracing tools could turn per-request internal signals into cluster-wide diagnostics.

Load-bearing premise

The design assumes that asynchronous tensor staging via Ring^2 and the policy-controlled backend can be fully decoupled from the inference hot path without introducing data races, correctness errors, or hidden synchronization costs under all realistic serving loads and backends.
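One way to probe this premise, sketched here editorially rather than taken from the paper, is a differential check: stage a tensor through an asynchronous side stream, compare it bitwise against a synchronous .cpu() reference, and confirm that installing the hook leaves the model's logits unchanged. It assumes a Hugging Face-style model whose output exposes .logits and deterministic kernels; the function and variable names are hypothetical.

```python
import torch

@torch.no_grad()
def captured_matches_reference(model, layer, input_ids):
    """Editorial check: does a tensor staged through an asynchronous side stream
    match a synchronous .cpu() copy bit for bit, and do logits stay unchanged
    when the hook is installed? Assumes deterministic kernels."""
    staged = {}
    stream = torch.cuda.Stream()

    def hook(module, inputs, output):
        t = output[0] if isinstance(output, tuple) else output
        stream.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(stream):
            host = torch.empty(t.shape, dtype=t.dtype, device="cpu", pin_memory=True)
            host.copy_(t, non_blocking=True)
            staged["async"] = host
        staged["sync"] = t.detach().cpu()      # blocking reference copy

    baseline = model(input_ids).logits.cpu()   # no hooks installed
    handle = layer.register_forward_hook(hook)
    observed = model(input_ids).logits.cpu()   # same input, hook installed
    handle.remove()
    torch.cuda.synchronize()                   # make sure the async copy landed

    return (torch.equal(staged["async"], staged["sync"]),   # no corruption
            torch.equal(baseline, observed))                 # hot path undisturbed
```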

What would settle it

A high-load online serving test on a new backend that shows either overhead above 7 percent or any data races, tensor corruption, or extra synchronization delays would disprove the low-overhead and correctness claims.
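A minimal sketch of what such a settling measurement could look like, assuming a PyTorch-style backend and CUDA-event timing; repeated runs supply the spread that the referee report below asks for. The repeat count, the hook construction, and the 7 percent budget the result would be compared against are illustrative, not the paper's harness.

```python
import statistics
import torch

@torch.no_grad()
def forward_ms(model, input_ids, hooks=(), repeats=20):
    """Median wall-clock milliseconds for one forward pass, with optional hooks.
    Editorial sketch; hook construction is backend-specific and not shown."""
    handles = [m.register_forward_hook(h) for m, h in hooks]
    times = []
    for _ in range(repeats):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()
        start.record()
        model(input_ids)
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))
    for h in handles:
        h.remove()
    return statistics.median(times), statistics.pstdev(times)

def overhead_pct(model, input_ids, hooks):
    base_ms, _ = forward_ms(model, input_ids)
    obs_ms, _ = forward_ms(model, input_ids, hooks)
    return 100.0 * (obs_ms - base_ms) / base_ms   # compare against the ~7% budget
```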

Figures

Figures reproduced from arXiv: 2605.11093 by Nengneng Yu, Sixian Xiong, Wei Wang, Yibo Zhao, Zaoxing Liu.

Figure 1: Overview of DMI-Lib vs. other methods. When data speed stays below PCIe bandwidth, DMI-Lib stays close to the original inference speed; once overloaded, the effective speed converges toward synchronous offloading.
Figure 2: DMI-Lib overview.
Figure 3: HookPoint execution stack.
Figure 4: Ring^2 workflow.
Figure 5: Example of a best-effort runtime policy under the drop-recent strategy.
Figure 6: Offline performance with limited hooks: 1 hidden-state hook per layer + 2 global hooks (final_ln and logits), for a total of 38/34/42 hooks on Qwen3-4B, Llama3.1-8B, and Qwen3-14B, respectively.
Figure 7: Offline performance with custom hooks: 7 hooks per layer, for a total of 252/224/280 hooks on Qwen3-4B, Llama3.1-8B, and Qwen3-14B, respectively.
Figure 8: Offline performance: (a) tensor-parallel performance; (b) storage ablation study.
Figure 9: Online serving performance: TTFT vs. request rate (req/s) on ShareGPT and WildChat traces for Qwen3-4B, Llama3.1-8B, and Qwen3-14B.
Figure 10: Online serving performance: TPOT.
Figure 11: DMI-Lib behavior when the generation speed of internal tensors exceeds export bandwidth. Each DMI-Lib (x, y) denotes x hook points and a y-sized ring buffer.
Figure 12: Observability tradeoffs at hook and request granularity: (a) hook filtering trades observability for overhead by varying the enabled hook set; (b) request-granular dropping trades request coverage for overhead by varying how many requests remain in the observation path under overload.
Figure 13: Per-step overhead breakdown and ablation study.
Figure 15: Two attention-head patterns observed in Qwen3-0.6B from a single DMI-Lib capture (17-token IOI prompt): (a) layer 2, head 2 exhibits the Duplicate Token pattern from [42], where each query at a repeated token (John, Mary) attends back to that token's first occurrence; (b) layer 19, head 4 exhibits the attention-sink pattern from [45], where every query routes its mass to…
read the original abstract

Today's inference-time workloads increasingly depend on timely access to a model's internal states. We present DMI-Lib, a high-speed deep model inspector that treats internal observability as a first-class systems primitive, decoupling it from the inference hot path via an asynchronous observability substrate built from Ring^2, a GPU-CPU memory abstraction for capturing and staging tensors, and a policy-controlled host backend that exports them. DMI-Lib enables the placement of observation points across a rich space of internal signals and diverse inference backends while preserving serving optimizations and adhering to tight GPU memory budgets. Our experiments demonstrate that DMI-Lib incurs only 0.4%--6.8% overhead in offline batch inference and an average of 6% in moderate online serving, reducing latency overhead by 2x-15x compared to existing baselines with similar observability features. DMI-Lib is open-sourced at https://github.com/ProjectDMX/DMI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces DMI-Lib, a library for model-internal observability during LLM inference. It decouples observation from the hot path via an asynchronous substrate built on Ring^2 (a GPU-CPU tensor staging abstraction) and a policy-controlled host backend, enabling flexible placement of observation points across signals and backends while respecting GPU memory budgets. Experiments report 0.4%--6.8% overhead for offline batch inference, ~6% average overhead for moderate online serving, and 2x--15x latency reduction relative to baselines with comparable observability features.

Significance. If the performance numbers hold under realistic loads and backends, the work would be significant for production LLM serving by making internal-state access a low-cost primitive, directly supporting debugging, interpretability, and safety monitoring without forcing trade-offs against serving optimizations. The open-source release aids reproducibility.

major comments (2)
  1. [§3.1] Ring^2 design: the claim of full decoupling from the inference hot path lacks any latency bounds on GPU-CPU transfers, analysis of ring-buffer contention, or proof of absence of data races under high-frequency observation or diverse tensor sizes; this assumption is load-bearing for all reported overhead figures.
  2. [§4] Experimental evaluation: the headline numbers (0.4%--6.8% offline, 6% online, 2x--15x latency reduction) are presented without error bars, explicit request-rate or tensor-size ranges for the 'moderate' online case, or stress-test results across backends, so the central performance claim cannot be verified from the given data.
minor comments (2)
  1. [Abstract] 'Moderate online serving' is undefined; add concrete QPS or batch-size ranges.
  2. [§3] Notation: Ring^2 is introduced without a formal definition or pseudocode in the main text; move or expand the description from any appendix.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will incorporate revisions to strengthen the analysis and experimental presentation.

read point-by-point responses
  1. Referee: [§3.1] Ring^2 design: the claim of full decoupling from the inference hot path lacks any latency bounds on GPU-CPU transfers, analysis of ring-buffer contention, or proof of absence of data races under high-frequency observation or diverse tensor sizes; this assumption is load-bearing for all reported overhead figures.

    Authors: We agree that §3.1 would benefit from additional quantitative support. The Ring^2 abstraction relies on asynchronous CUDA streams and a lock-free ring buffer with atomic flags for synchronization. In the revision we will add (i) measured PCIe transfer latency bounds across representative tensor sizes, (ii) microbenchmark results quantifying ring-buffer contention at varying observation frequencies, and (iii) an explicit description of the synchronization primitives together with empirical checks for data races. These additions will be supported by new figures and will directly address the load-bearing nature of the decoupling claim; a sketch of the kind of transfer-latency measurement in question appears after these responses. revision: yes

  2. Referee: [§4] Experimental evaluation: the headline numbers (0.4%--6.8% offline, 6% online, 2x--15x latency reduction) are presented without error bars, explicit request-rate or tensor-size ranges for the 'moderate' online case, or stress-test results across backends, so the central performance claim cannot be verified from the given data.

    Authors: We accept that the current experimental section lacks sufficient detail for independent verification. We will revise §4 to report standard error bars from repeated runs, explicitly document the request-rate range (e.g., 10–100 QPS) and tensor-size distributions used for the moderate online case, and include stress-test results on at least two additional backends. These changes will make the reported overhead and latency-reduction figures reproducible and will strengthen the central performance claims. revision: yes
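Purely as an editorial illustration of the transfer-latency microbenchmark promised in the first response, and not the authors' harness: time asynchronous device-to-host copies into pinned memory across a sweep of tensor sizes. The sizes, dtype, and iteration count below are arbitrary placeholders.

```python
import torch

def d2h_copy_ms(numel, dtype=torch.float16, iters=50):
    """Rough per-copy latency of an asynchronous device-to-host transfer into
    pinned memory (editorial sketch, not the paper's benchmark harness)."""
    src = torch.randn(numel, dtype=dtype, device="cuda")
    dst = torch.empty(numel, dtype=dtype, pin_memory=True)
    stream = torch.cuda.Stream()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    with torch.cuda.stream(stream):
        start.record()
        for _ in range(iters):
            dst.copy_(src, non_blocking=True)
        end.record()
    end.synchronize()
    return start.elapsed_time(end) / iters

# Sweep representative capture sizes (in elements) to bound per-tensor staging latency.
for numel in (1 << 14, 1 << 17, 1 << 20, 1 << 23):
    print(f"{numel:>9} elems: {d2h_copy_ms(numel):.3f} ms")
```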

Circularity Check

0 steps flagged

No circularity: performance claims are direct empirical measurements

full rationale

The paper presents a systems design (Ring^2 GPU-CPU abstraction plus policy-controlled backend) and reports overhead numbers (0.4%–6.8% offline, ~6% online, 2×–15× latency reduction) from benchmark experiments. No derivation chain, fitted parameters, or equations exist that reduce to self-inputs. No self-citations are load-bearing for the central claims. The design assumptions about decoupling are stated but not proven mathematically; the reported numbers stand or fall on the measurements themselves, which are external to any internal definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the feasibility of non-blocking GPU-CPU tensor transfer and policy enforcement without correctness impact; no free parameters are explicitly fitted in the abstract, and no new physical entities are postulated.

axioms (1)
  • domain assumption: Asynchronous staging of internal tensors via Ring^2 does not introduce data races or correctness violations under typical inference workloads.
    Invoked in the description of the observability substrate decoupling from the hot path.
invented entities (1)
  • Ring^2 (no independent evidence)
    purpose: GPU-CPU memory abstraction for capturing and staging tensors asynchronously
    New abstraction introduced to enable the low-overhead observability substrate.

pith-pipeline@v0.9.0 · 5489 in / 1362 out tokens · 78551 ms · 2026-05-13T07:12:12.546044+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 8 internal anchors

  1. [1]

Taming throughput-latency tradeoff in LLM inference with Sarathi-Serve

Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. Taming throughput-latency tradeoff in LLM inference with Sarathi-Serve. In 18th USENIX symposium on operating systems design and implementation (OSDI 24), pages 117–134, 2024

  2. [2]

Make every draft count: Hidden state based speculative decoding. arXiv preprint arXiv:2602.21224, 2026

Yuetao Chen, Xuliang Wang, Xinzhou Zheng, Ming Li, Peng Wang, and Hong Xu. Make every draft count: Hidden state based speculative decoding. arXiv preprint arXiv:2602.21224, 2026

  3. [3]

Flexmodel: A framework for interpretability of distributed large language models. arXiv preprint arXiv:2312.03140, 2023

Matthew Choi, Muhammad Adil Asif, John Willes, and David Emerson. Flexmodel: A framework for interpretability of distributed large language models. arXiv preprint arXiv:2312.03140, 2023

  4. [4]

    Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. What does BERT look at? an analysis of BERT’s attention. InProceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, 2019

  5. [5]

    Flashattention: Fast and memory-efficient exact attention with io-awareness.Advances in neural information processing systems, 35:16344–16359, 2022

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness.Advances in neural information processing systems, 35:16344–16359, 2022

  6. [6]

    Minder: Faulty machine detection for large-scale distributed model training

    Yangtao Deng, Xiang Shi, Zhuo Jiang, Xingjian Zhang, Lei Zhang, Zhang Zhang, Bo Li, Zuquan Song, Hang Zhu, Gaohong Liu, et al. Minder: Faulty machine detection for large-scale distributed model training. In22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25), pages 505–521, 2025

  7. [7]

Mycroft: Tracing dependencies in collective communication towards reliable llm training

Yangtao Deng, Lei Zhang, Qinlong Wang, Xiaoyun Zhi, Xinlei Zhang, Zhuo Jiang, Haohan Xu, Lei Wang, Zuquan Song, Gaohong Liu, et al. Mycroft: Tracing dependencies in collective communication towards reliable llm training. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, pages 254–269, 2025

  8. [8]

    Advancing llm safe alignment with safety representation ranking.arXiv preprint arXiv:2505.15710, 2025

    Tianqi Du, Zeming Wei, Quan Chen, Chenheng Zhang, and Yisen Wang. Advancing llm safe alignment with safety representation ranking.arXiv preprint arXiv:2505.15710, 2025

  9. [9]

    A primer on the inner workings of transformer-based language models.arXiv preprint arXiv:2405.00208, 2024

    Javier Ferrando, Gabriele Sarti, Arianna Bisazza, and Marta R Costa-Jussà. A primer on the inner workings of transformer-based language models.arXiv preprint arXiv:2405.00208, 2024

  10. [10]

    Nnsight and ndif: Democratizing access to foundation model internals

    Jaden Fiotto-Kaufman, Alexander R Loftus, Eric Todd, Jannik Brinkmann, Caden Juang, Koyena Pal, Can Rager, Aaron Mueller, Samuel Marks, Arnab Sen Sharma, Francesca Lucchetti, Michael Ripa, Adam Belfki, Nikhil Prakash, Sumeet Multani, Carla Brodley, Arjun Guha, Jonathan Bell, Byron Wallace, and David Bau. Nnsight and ndif: Democratizing access to foundatio...

  11. [11]

    llama.cpp: Llm inference in c/c++.https://github.com/ggml-org/ llama.cpp, 2023

    Georgi Gerganov et al. llama.cpp: Llm inference in c/c++.https://github.com/ggml-org/ llama.cpp, 2023

  12. [12]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  13. [13]

    Mamba: Linear-time sequence modeling with selective state spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. In First conference on language modeling, 2024

  14. [14]

    Gpipe: Efficient training of giant neural networks using pipeline parallelism.Advances in neural information processing systems, 32, 2019

Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems, 32, 2019

  15. [15]

    Text generation inference: A toolkit for routing and serving large language models.https://github.com/huggingface/text-generation-inference, 2023

    Hugging Face. Text generation inference: A toolkit for routing and serving large language models.https://github.com/huggingface/text-generation-inference, 2023

  16. [16]

    Mixtral of Experts

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts.arXiv preprint arXiv:2401.04088, 2024

  17. [17]

    vllm hook v0: A plug-in for programming model internals on vllm.arXiv preprint arXiv:2603.06588, 2026

    Ching-Yun Ko and Pin-Yu Chen. vllm hook v0: A plug-in for programming model internals on vllm.arXiv preprint arXiv:2603.06588, 2026

  18. [18]

Captum: A unified and generic model interpretability library for PyTorch. arXiv preprint arXiv:2009.07896, 2020

Narine Kokhlikyan, Vivek Miglani, Miguel Martin, Edward Wang, Bilal Alsallakh, Jonathan Reynolds, Alexander Melnikov, Natalia Kliushkina, Carlos Araya, Siqi Yan, et al. Captum: A unified and generic model interpretability library for PyTorch. arXiv preprint arXiv:2009.07896, 2020

  19. [19]

    Building production-ready probes for Gemini

    János Kramár, Joshua Engels, Zheng Wang, Bilal Chughtai, Rohin Shah, Neel Nanda, and Arthur Conmy. Building production-ready probes for gemini.arXiv preprint arXiv:2601.11516, 2026

  20. [20]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th symposium on operating systems principles, pages 611–626, 2023

  21. [21]

    GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020

  22. [22]

Eagle-3: Scaling up inference acceleration of large language models via training-time test. arXiv preprint arXiv:2503.01840, 2025

Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle-3: Scaling up inference acceleration of large language models via training-time test. arXiv preprint arXiv:2503.01840, 2025

  23. [23]

    Dart: Diffusion-inspired speculative decoding for fast llm inference.arXiv preprint arXiv:2601.19278, 2026

    Fuliang Liu, Xue Li, Ketai Zhao, Yinxi Gao, Ziyan Zhou, Zhonghui Zhang, Zhibin Wang, Wanchun Dou, Sheng Zhong, and Chen Tian. Dart: Diffusion-inspired speculative decoding for fast llm inference.arXiv preprint arXiv:2601.19278, 2026

  24. [24]

    Detecting high-stakes interactions with activation probes

    Alex McKenzie, Urja Pawar, Phil Blandfort, William Bankes, David Krueger, Ekdeep Singh Lubana, and Dmitrii Krasheninnikov. Detecting high-stakes interactions with activation probes. arXiv preprint arXiv:2506.10805, 2025

  25. [25]

    Transformerlens

Neel Nanda and Joseph Bloom. Transformerlens. https://github.com/TransformerLensOrg/TransformerLens, 2022

  26. [26]

    Pipedream: Generalized pipeline parallelism for dnn training

    Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. Pipedream: Generalized pipeline parallelism for dnn training. InProceedings of the 27th ACM symposium on operating systems principles, 2019

  27. [27]

    TensorRT-LLM.https://github.com/NVIDIA/TensorRT-LLM, 2024

    NVIDIA. TensorRT-LLM.https://github.com/NVIDIA/TensorRT-LLM, 2024

  28. [28]

    Nvidia collective communications library (nccl)

NVIDIA. Nvidia collective communications library (nccl). https://github.com/NVIDIA/nccl, 2026

  29. [29]

    NVIDIA Nsight Systems

NVIDIA Corporation. NVIDIA Nsight Systems. https://developer.nvidia.com/nsight-systems. Accessed: 2026-04-01

  30. [30]

    Triton inference server: An optimized cloud and edge inferencing solution.https://github.com/triton-inference-server/server, 2023

    NVIDIA Corporation. Triton inference server: An optimized cloud and edge inferencing solution.https://github.com/triton-inference-server/server, 2023

  31. [31]

    Pytorch: An imperative style, high-performance deep learning library.Advances in neural information processing systems, 32, 2019

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library.Advances in neural information processing systems, 32, 2019

  32. [32]

    Prometheus, 2026

    Prometheus Authors. Prometheus, 2026. URL https://prometheus.io/. Open-source monitoring system and time series database. Accessed 2026-04-01

  33. [33]

A practical review of mechanistic interpretability for transformer-based language models. arXiv preprint arXiv:2407.02646, 2024

Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, and Ziyu Yao. A practical review of mechanistic interpretability for transformer-based language models. arXiv preprint arXiv:2407.02646, 2024

  34. [34]

    Clickhouse-lightning fast analytics for everyone.Proceedings of the VLDB Endowment, 17(12): 3731–3744, 2024

    Robert Schulze, Tom Schreiber, Ilya Yatsishin, Ryadh Dahimene, and Alexey Milovidov. Clickhouse-lightning fast analytics for everyone.Proceedings of the VLDB Endowment, 17(12): 3731–3744, 2024

  35. [35]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017

  36. [36]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism.arXiv preprint arXiv:1909.08053, 2019

  37. [37]

    Improving instruction-following in language models through activation steering.arXiv preprint arXiv:2410.12877, 2024

    Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi, Eric Horvitz, and Besmira Nushi. Improving instruction-following in language models through activation steering.arXiv preprint arXiv:2410.12877, 2024

  38. [38]

    Extracting and visualizing hidden activations and computational graphs of pytorch models with torchlens.Scientific Reports, 13(1):14375, 2023

    JohnMark Taylor and Nikolaus Kriegeskorte. Extracting and visualizing hidden activations and computational graphs of pytorch models with torchlens.Scientific Reports, 13(1):14375, 2023

  39. [39]

    Sharegpt.https://sharegpt.com/, 2023

    ShareGPT Team. Sharegpt.https://sharegpt.com/, 2023

  40. [40]

    Steering Language Models With Activation Engineering

Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J Vazquez, Ulisse Mini, and Monte MacDiarmid. Steering language models with activation engineering, 2024. URL https://arxiv.org/abs/2308.10248

  41. [41]

    Attention is all you need.Advances in neural information processing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

  42. [42]

    Interpretability in the wild: a circuit for indirect object identification in GPT-2 small

    Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. InThe Eleventh International Conference on Learning Representations, 2023

  43. [43]

    Reliable and resilient collective communication library for llm training and serving.arXiv preprint arXiv:2512.25059, 2025

    Wei Wang, Nengneng Yu, Sixian Xiong, and Zaoxing Liu. Reliable and resilient collective communication library for llm training and serving.arXiv preprint arXiv:2512.25059, 2025

  44. [44]

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the- ar...

  45. [45]

    Efficient streaming language models with attention sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. InThe Twelfth International Conference on Learning Representations, 2024

  46. [46]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  47. [47]

    Orca: A distributed serving system for {Transformer-Based} generative models

    Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for {Transformer-Based} generative models. In16th USENIX symposium on operating systems design and implementation (OSDI 22), pages 521–538, 2022

  48. [48]

    Precise attribute intensity control in large language models via targeted representation editing.arXiv preprint arXiv:2510.12121, 2025

    Rongzhi Zhang, Liqin Ye, Yuzhao Heng, Xiang Chen, Tong Yu, Lingkai Kong, Sudheer Chava, and Chao Zhang. Precise attribute intensity control in large language models via targeted representation editing.arXiv preprint arXiv:2510.12121, 2025

  49. [49]

    Icr probe: Tracking hidden state dynamics for reliable hallucination detection in llms

    Zhenliang Zhang, Xinyu Hu, Huixuan Zhang, Junzhe Zhang, and Xiaojun Wan. Icr probe: Tracking hidden state dynamics for reliable hallucination detection in llms. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 17986–18002, 2025

  50. [50]

WildChat: 1M ChatGPT interaction logs in the wild. arXiv preprint arXiv:2405.01470, 2024

Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng. WildChat: 1M ChatGPT interaction logs in the wild. arXiv preprint arXiv:2405.01470, 2024

  51. [51]

    Sglang: Efficient execution of structured language model programs.Advances in neural information processing systems, 37: 62557–62583, 2024

Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody H Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. Sglang: Efficient execution of structured language model programs. Advances in neural information processing systems, 37: 62557–62583, 2024

  52. [52]

    {DistServe}: Disaggregating prefill and decoding for goodput-optimized large language model serving

    Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. {DistServe}: Disaggregating prefill and decoding for goodput-optimized large language model serving. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 193–210, 2024

  53. [53]

    Representation Engineering: A Top-Down Approach to AI Transparency

Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. Representation engineering: A top-down approach to ai transparency. arXiv preprint arXiv:2310.01405, 2023