ContextEcho: A Benchmark for Persona Drift in Long Agentic-Coding Sessions

Bill Zhao; Changwei Liu; Xianzhong Ding; Yangyang Yu

arxiv: 2605.24279 · v1 · pith:RQCOQW3Enew · submitted 2026-05-22 · 💻 cs.CL · cs.SE

ContextEcho: A Benchmark for Persona Drift in Long Agentic-Coding Sessions

Xianzhong Ding , Yangyang Yu , Changwei Liu , Bill Zhao This is my paper

Pith reviewed 2026-06-30 15:15 UTC · model grok-4.3

classification 💻 cs.CL cs.SE

keywords persona driftagentic codinglong contextbenchmarklanguage model evaluationcontext compactionidentity probesfrontier models

0 comments

The pith

Persona drift occurs generally across frontier models in long agentic-coding sessions, resists compaction, and yields to single-shot anchoring.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ContextEcho, a benchmark designed to track how language models' initial personas change over thousands of turns in realistic coding sessions involving tools. It applies this to sessions of 3,746 to 9,716 turns and tests 23 models, finding that drift appears across different organizations rather than being limited to specific model families. The work also shows that standard in-session compaction does not consistently restore the original persona, whereas inserting a single anchor prompt does, and that the drift alters behavior differently in tool-using versus pure chat modes. This matters because production deployments run in exactly these long sessions, so evaluations on short interactions may miss user-visible changes.

Core claim

ContextEcho shows that a frontier language model's acknowledged helpful programming assistant persona does not survive long agentic-coding sessions. After hours of tool-using debugging, models begin asserting preferences they initially hedged. The benchmark uses a 25-probe identity suite and a snapshot-then-probe protocol on three anonymized Claude Code sessions to establish that persona drift is general across organizations, that compaction does not reliably reset it, and that a single-shot anchor restores the trained register. It further finds mode-dependent downstream effects on tool continuation and formatting.

What carries the argument

The 25-probe identity suite paired with a snapshot-then-probe protocol that forks conversation state to measure drift without perturbing the main session.

Load-bearing premise

The 25-probe identity suite and snapshot-then-probe protocol accurately measure persona drift without the measurement process itself perturbing the session or introducing artifacts that affect the observed drift.

What would settle it

Running the 25-probe suite on models after long sessions and finding no measurable shift from the initial persona, or finding that compaction consistently returns models to their starting register across the tested sessions.

Figures

Figures reproduced from arXiv: 2605.24279 by Bill Zhao, Changwei Liu, Xianzhong Ding, Yangyang Yu.

**Figure 1.** Figure 1: ContextEcho probe-detected persona drift across a 9,643-turn Claude Code session. (a) Behavioral persona space: 6 deterministic linguistic features extracted from each response . The 4-point LLM-judge label is held out of the PCA features and used only to color points; the cluster separation is therefore identified on signals the judge does not see, reducing the plausibility of judge-circular artefacts. Ve… view at source ↗

**Figure 2.** Figure 2: The ContextEcho probe suite: 25 probes across 5 categories, with verbatim text and per-category drift gap. ∆ is the per-category mean drift gap (filler − claude judge score, positive = drift) averaged across the 6 cross-organization drifters with full 12-position data. Relational (+0.63) and Coding-Self (+0.61) carry the largest drift; Identity (+0.09, mechanically factual) carries the least. Long probes a… view at source ↗

**Figure 3.** Figure 3: We make three observations. First, among the [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 3.** Figure 3: Persona drift across 23 frontier models from 10 organizations. Each row is one model; markers are filler-arm (⃝) and claude-arm (▲) mean judge scores on the 5-coding-self sub-battery with 95% clustered bootstrap CIs. Right-margin ∆ is the drift gap (filler − claude); bold ∆ marks |∆| ≥ 0.30. Yellow shading marks reasoning-tier models; blue shading marks non-reasoning-tier models. Hollow markers indicate th… view at source ↗

**Figure 4.** Figure 4: A-anchor restores the deployed-Assistant register across all 23 targets. Markers and 95% clustered bootstrap CIs: ⃝ filler-arm; ▲ claude-arm (drift); ■ claude-arm + A-anchor. Rows sorted by drift gap; yellow shading marks reasoning-tier and blue shading marks non-reasoning-tier models; hollow markers mark pilot (npos=1) rows. Q5: Deployment Cost and Mode Dependence. To assess whether unmitigated drift affe… view at source ↗

**Figure 5.** Figure 5: Drift breaks contracts and inflates tokens; A-anchor recovers both. Left: compliance rate on S2. Right: length ratio vs. filler (log scale). Markers and 95% clustered bootstrap CIs: ⃝ filler; ▲ claude-arm (drift); ■ claude-arm + A-anchor. Right margins quantify drift drop (claude vs. filler) and anchor recovery; for the length ratio, values <1× indicate the anchor response is shorter than the filler-arm re… view at source ↗

**Figure 6.** Figure 6: [Robustness] Panel-wide drift on the full 25-probe identity battery, all 23 targets. Same forest plot conventions as [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗

**Figure 7.** Figure 7: [Cross-session] Probe-judge trajectories on Sonnet 4.5 across 3 donated sessions. Each panel: 12 measurement positions, 5-coding-self probes per position, paired filler-arm control. The drift gap is annotated per panel; ⋆ marks |gap| ≥ 0.30. The Session 3 result (non-coding domain) rules out a coding-specific register artifact. Target Session 1 Session 2 Session 3 Sonnet 4.6 6.83× 8.67× 7.78× Sonnet 4.5 7.… view at source ↗

**Figure 8.** Figure 8: accompanies the §3.2 claim that A-anchor immunizes at least 20 subsequent unanchored turns on Sonnet 4.5. 0 100 101 102 N Unanchored Turns Inserted Between A-Anchor and Probe 0 1 2 3 Mean Judge Score (0=Drifted 3=Fully Assistant) filler-arm baseline (1.47) claude-arm drift baseline (0.83) individual probe scores (n = 5 per offset) mean (anchor + N unanchored turns) [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗

**Figure 9.** Figure 9: Anchor-size sensitivity at P5, 5 coding-self probes per cell. Bars are mean judge score per (target, size). Small (∼ 30 tokens, V0 identity sentence only) is sufficient to peg the rubric ceiling on Sonnet 4.6, Sonnet 4.5, and Haiku 4.5; reaches 2.80 on Opus 4.1. Medium (the shipped ∼ 75-token V0 + V2 recipe) and large (∼ 200-token V0 + V2 plus 2 extra format demos) yield comparable scores on the Anthropic … view at source ↗

**Figure 10.** Figure 10: Drift-onset curves: 4 Anthropic targets on the 5-coding-self sub-battery, 8 log-spaced turn positions in the pre-C1 regime, n=25 per cell. Markers and 95% bootstrap CIs: ■ drift gap (filler − claude), log-spaced x-axis. Red dashed line at |∆|=0.30 marks the drift threshold used in the body. Three distinct onset profiles within one family: Sonnet 4.5 shows drift at turn 1 (+0.68, immediate onset); Sonnet 4… view at source ↗

**Figure 11.** Figure 11: [Substrate steering] Qwen 3 32B dose-response on Lu et al.’s Assistant Axis. As steering dose increases (x-axis), the activation projection toward the Assistant cluster recovers (blue), but the visible probe judge score does not track the recovery (red). Surface re-anchoring (A-anchor) and substrate steering operate on decoupled signals on this target. M Downstream cost: SWE-Bench and TerminalBench detail… view at source ↗

**Figure 12.** Figure 12: [Cross-judge] Sonnet 4.6 (primary) vs. GPT-5 (audit) on the panel-wide 5-codingself battery. n = 190 paired scores at the P5 position across 19 panel targets. Exact agreement 61.1%, within-one 93.7%, Cohen κ=0.42, Spearman ρ=0.75. The panel-wide drift gap is directionconsistent across judges: +0.32 on Sonnet, +0.27 on GPT-5. 23 [PITH_FULL_IMAGE:figures/full_fig_p023_12.png] view at source ↗

read the original abstract

A frontier language model's acknowledged "helpful programming assistant" persona does not survive long agentic-coding sessions in the deployment regime that production products actually run. After hours of tool-using debugging, a model that initially hedges preferences ("I don't have preferences") may begin asserting them ("Python - the feedback loop is instant..."), revealing user-visible drift that deployer evaluations may miss. Existing persona-stability studies focus on short dialogues and report little shift, leaving real-world code-generation regimes - thousands of tool-using turns, compaction, and hours-long sessions - largely uncharacterized. We introduce ContextEcho, a benchmark and reusable harness for measuring persona drift at deployment scale. It combines a 25-probe identity suite, a snapshot-then-probe protocol that forks conversation state without perturbing the main session, complementary judged and judge-free measurement surfaces, and three anonymized Claude Code sessions spanning 3,746-9,716 turns. Across 23 frontier models, ContextEcho shows that persona drift is general across organizations rather than family-specific, that in-session compaction does not reliably reset it, and that a single-shot anchor restores the trained register across measured targets. It also reveals mode-dependent downstream effects: while drift can facilitate tool-using continuation, in tool-free chat it breaks formatting contracts and inflates output length. Overall, ContextEcho provides researchers and deployers an open-source framework to audit whether the persona a model ships with is the persona users encounter at session end, across chat-completions API targets and without retraining.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ContextEcho targets a genuine gap in long-session persona evaluation but the snapshot-probe protocol needs explicit validation before the drift claims can be trusted.

read the letter

The useful part is the focus on thousands-of-turn agentic coding sessions with compaction and tool use, which short-dialogue persona papers have not covered. The benchmark supplies a 25-probe identity suite, a claimed non-perturbing snapshot-then-probe method, and three real anonymized sessions of 3k–9k turns. It reports that drift appears across 23 models rather than being family-specific, that compaction does not reset it, and that a single anchor prompt can restore the original register, plus some mode-dependent effects on output length and formatting.

Those observations line up with what deployers of coding agents actually see, and the open harness is a concrete step toward reusable measurement. The work is also free of fitted parameters or circular derivations.

The main weakness is that the abstract supplies no tables, exclusion criteria, or statistical details, and the full methods for forking state across different APIs are not visible here. The non-perturbation claim for the probe turns is load-bearing; if the probes themselves leak into the trajectory or alter compaction behavior, the measured drift could be an artifact. Without control experiments or implementation specifics, it is difficult to judge how much of the reported pattern is real versus measurement-induced.

This is aimed at teams that ship or evaluate long-running coding agents. It is worth sending to peer review once the methods section is complete, because the underlying problem is practical and the benchmark structure could be refined into something solid. Until then, treat the headline results as preliminary.

Referee Report

3 major / 2 minor

Summary. The paper introduces ContextEcho, a benchmark and open-source harness for measuring persona drift in long agentic-coding sessions. It combines a 25-probe identity suite, a snapshot-then-probe protocol that forks state without perturbing the main session, judged and judge-free metrics, and three anonymized Claude Code sessions (3,746–9,716 turns). Applied to 23 frontier models, the work claims that persona drift is general across organizations rather than family-specific, that in-session compaction does not reliably reset drift, that a single-shot anchor restores the trained register, and that drift produces mode-dependent downstream effects on tool continuation, formatting, and output length.

Significance. If the measurement protocol is shown to be non-perturbing, ContextEcho would fill a clear gap between short-dialogue persona studies and the multi-thousand-turn, tool-using regimes actually used in production coding agents. The reusable harness and the finding that drift is organization-general (rather than model-family-specific) would give deployers a concrete auditing tool and could shift evaluation practices away from short-context tests. The open-source release and the reported restoration effect of a single anchor are concrete strengths that would make the benchmark immediately usable by others.

major comments (3)

[Abstract] Abstract and implied Methods: the central claim that the snapshot-then-probe protocol measures drift without perturbing the main session is load-bearing for every reported result (drift generality, compaction failure, anchor restoration). No implementation details are supplied on how forking is achieved across 23 distinct API targets, nor are any control experiments reported that test whether probe turns leak into or alter the 3k–9k-turn trajectories.
[Abstract] Abstract: results are stated across 23 models and three sessions, yet the provided text supplies no data tables, per-model or per-session statistics, exclusion criteria, or inter-rater agreement numbers for the judged metrics. Without these, the quantitative support for the headline claims cannot be evaluated.
[Abstract] Abstract: the assertion that drift is 'general across organizations rather than family-specific' requires explicit cross-family statistical comparison; the current text gives no indication of how family membership was defined or what test was used to support the 'rather than' claim.

minor comments (2)

[Abstract] The abstract refers to 'complementary judged and judge-free measurement surfaces' without defining either surface or how they are combined.
[Abstract] Session lengths are given as ranges (3,746-9,716 turns) but the exact turn counts and compaction points for each of the three sessions are not stated.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive review and for recognizing the potential value of ContextEcho as a benchmark. We address each major comment below with specific plans for revision where the manuscript requires strengthening.

read point-by-point responses

Referee: [Abstract] Abstract and implied Methods: the central claim that the snapshot-then-probe protocol measures drift without perturbing the main session is load-bearing for every reported result (drift generality, compaction failure, anchor restoration). No implementation details are supplied on how forking is achieved across 23 distinct API targets, nor are any control experiments reported that test whether probe turns leak into or alter the 3k–9k-turn trajectories.

Authors: We agree that the abstract is concise and that the current Methods section does not supply sufficient implementation detail on the forking mechanism or control experiments to fully substantiate the non-perturbing claim. In the revised manuscript we will expand Section 3.2 with explicit descriptions of the API-specific forking procedures used across the 23 targets, include pseudocode for the snapshot-then-probe process, and add a dedicated control-experiment subsection reporting quantitative checks (e.g., continuation metrics and token-level divergence) confirming that probe turns do not leak into or alter the main trajectories. revision: yes
Referee: [Abstract] Abstract: results are stated across 23 models and three sessions, yet the provided text supplies no data tables, per-model or per-session statistics, exclusion criteria, or inter-rater agreement numbers for the judged metrics. Without these, the quantitative support for the headline claims cannot be evaluated.

Authors: The supplementary materials contain per-model and per-session statistics together with exclusion criteria, but these are not referenced or summarized in the main text. We will add a new main-text table (Table 2) summarizing key per-model drift rates, per-session statistics, and exclusion criteria, and we will report inter-rater agreement for the judged metrics (Cohen’s kappa) in the revised Results section. revision: yes
Referee: [Abstract] Abstract: the assertion that drift is 'general across organizations rather than family-specific' requires explicit cross-family statistical comparison; the current text gives no indication of how family membership was defined or what test was used to support the 'rather than' claim.

Authors: Family membership was defined by the developing organization. The current text does not present the requested statistical comparison. In the revision we will add an explicit mixed-effects model analysis (Section 4.2) that tests family as a factor, reports the associated p-values and variance components, and thereby supports or qualifies the “rather than family-specific” phrasing. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark with no derivation chain or fitted quantities

full rationale

This is a measurement benchmark paper introducing ContextEcho with a 25-probe suite and snapshot-then-probe protocol. The provided text contains no equations, no fitted parameters, no predictions derived from inputs, and no self-citations used to justify core claims. Results are presented as direct empirical observations across 23 models on anonymized sessions. The non-perturbation assumption for the protocol is a methodological claim about validity, not a reduction of any result to its own inputs by construction. No patterns of self-definitional, fitted-input, or self-citation circularity apply.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical benchmark paper; no free parameters, mathematical axioms, or invented physical entities are introduced.

pith-pipeline@v0.9.1-grok · 5808 in / 1017 out tokens · 25699 ms · 2026-06-30T15:15:26.984616+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Measuring What Persists: Conditioning Mechanisms and a Geometric Framework for AI Agent Identity
cs.AI 2026-06 unverdicted novelty 4.0

Presents a geometric framework for measuring AI agent identity via √JSD spaces and magnitude homology, identifies two conditioning mechanisms, and attributes apparent drift to padding artifacts rather than context length.

Reference graph

Works this paper leans on

100 extracted references · 60 canonical work pages · cited by 1 Pith paper · 25 internal anchors

[1]

Abdulhai, R

M. Abdulhai, R. Cheng, D. Clay, T. Althoff, and S. Levine. Consistently simulating human personas with multi-turn reinforcement learning.arXiv preprint arXiv:2511.00222, 2025. 9

work page arXiv 2025
[2]

Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct

C. Ackerman and N. Panickssery. Inspection and control of self-generated-text recognition ability in Llama3-8b-Instruct.arXiv preprint arXiv:2410.02064, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

Many-shot jailbreaking.Anthropic technical report, 2024

Cem Anil, Esin Durmus, Mrinank Sharma, Joe Benton, Sandipan Kundu, Joshua Batson, Nina Rimsky, Meg Tong, et al. Many-shot jailbreaking.Anthropic technical report, 2024

2024
[4]

Findings from a pilot anthropic-OpenAI alignment evaluation exercise

Anthropic and OpenAI. Findings from a pilot anthropic-OpenAI alignment evaluation exercise. alignment.anthropic.com/2025/openai-findings, 2025

2025
[5]

Detecting and preventing distillation attacks

Anthropic Trust and Safety. Detecting and preventing distillation attacks. Anthropic blog post / news disclosure, 2026. Discloses 16M Claude API queries from suspected distillation campaigns by DeepSeek, Moonshot, MiniMax

2026
[6]

Refusal in Language Models Is Mediated by a Single Direction

A. Arditi, O. Obeso, A. Syed, D. Paleka, and N. Panickssery. Refusal in language models is mediated by a single direction.arXiv preprint arXiv:2406.11717, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

Y . Bai, X. Lv, J. Zhang, et al. LongBench: A bilingual, multitask benchmark for long context understanding.arXiv preprint arXiv:2308.14508, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[9]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[10]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI feedback.arXiv preprint arXiv:2212.08073, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[11]

Betley, X

J. Betley, X. Bao, M. Soto, A. Sztyber-Betley, and J. Chua. Tell me about yourself: LLMs are aware of their learned behaviors.arXiv preprint arXiv:2501.11120, 2025

work page arXiv 2025
[12]

F. J. Binder, J. Chua, T. Korbak, H. Sleight, and J. Hughes. Looking inward: Language models can learn about themselves by introspection.arXiv preprint arXiv:2410.13787, 2024

work page arXiv 2024
[13]

Discovering latent knowledge in language models without supervision.ICLR, 2023

Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision.ICLR, 2023

2023
[14]

Membership inference attacks from first principles

Nicholas Carlini, Steve Chien, Milad Nasr, Shuang Song, Andreas Terzis, and Florian Tramer. Membership inference attacks from first principles. InIEEE S&P, 2022

2022
[15]

J. Chen, X. Wang, R. Xu, et al. From persona to personalization: A survey on role-playing language agents.arXiv preprint arXiv:2404.18231, 2024

work page arXiv 2024
[16]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[17]

Persona Vectors: Monitoring and Controlling Character Traits in Language Models

Runjin Chen, Andy Arditi, Henry Sleight, et al. Persona vectors: Monitoring and controlling character traits in language models.arXiv preprint arXiv:2507.21509, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

W. Chiang, L. Zheng, Y . Sheng, et al. Chatbot arena: An open platform for evaluating LLMs by human preference.arXiv preprint arXiv:2403.04132, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

Examining identity drift in conversations of llm agents.arXiv preprint arXiv:2412.00804, 2024

Junhyuk Choi, Yeseon Hong, Minju Kim, and Bugeun Kim. Examining identity drift in conversations of llm agents.arXiv preprint arXiv:2412.00804, 2024. 10

work page arXiv 2024
[21]

PaLM: Scaling language modeling with pathways

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, et al. PaLM: Scaling language modeling with pathways. 2023

2023
[22]

J. Chua, E. Rees, H. Batra, et al. Bias-augmented consistency training reduces biased reasoning in chain-of-thought.arXiv preprint arXiv:2403.05518, 2024

work page arXiv 2024
[23]

Rossi, Viet Dac Lai, David Seunghyun Yoon, Dilek Hakkani-Tür, and Trung Bui

V . Dongre, R. A. Rossi, V . D. Lai, D. S. Yoon, D. Hakkani-Tür, and T. Bui. Drift no more? context equilibria in multi-turn llm interactions.arXiv preprint arXiv:2510.07777, 2025

work page arXiv 2025
[24]

Dunefsky, P

J. Dunefsky, P. Chlenski, and N. Nanda. Transcoders find interpretable LLM feature circuits. arXiv preprint arXiv:2406.11944, 2024

work page arXiv 2024
[25]

Toy models of superposition.Transformer Circuits Thread, 2022

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition.Transformer Circuits Thread, 2022

2022
[26]

Fanous et al

A. Fanous et al. SycEval: Evaluating LLM sycophancy.arXiv preprint arXiv:2502.08177, 2025

work page arXiv 2025
[27]

Insights into llm long-context failures: when transformers know but don’t tell

Muhan Gao, TaiMing Lu, Kuai Yu, Adam Byerly, and Daniel Khashabi. Insights into llm long-context failures: when transformers know but don’t tell. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 7611–7625, 2024

2024
[28]

Datasheets for datasets.Communications of the ACM, 64(12):86–92, 2021

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets.Communications of the ACM, 64(12):86–92, 2021

2021
[29]

Ghandeharioun, A

A. Ghandeharioun, A. Yuan, M. Guerard, E. Reif, M. A. Lepori, and L. Dixon. Who’s asking? user personas and the mechanics of latent misalignment.NeurIPS 2024 / arXiv preprint arXiv:2406.12094, 2024

work page arXiv 2024
[30]

A Survey on LLM-as-a-Judge

Jiawei Gu, Xuhui Jiang, Zhichao Shi, et al. A survey on LLM-as-a-judge.arXiv preprint arXiv:2411.15594, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[31]

When attention sink emerges in language models: An empirical view

Xiangming Gu, Tianyu Pang, Chao Du, et al. When attention sink emerges in language models: An empirical view. InInternational Conference on Learning Representations (ICLR), 2025

2025
[32]

C. Han, Q. Wang, H. Peng, W. Xiong, and Y . Chen. LM-Infinite: Zero-shot extreme length generalization for large language models.arXiv preprint arXiv:2308.16137, 2023

work page arXiv 2023
[33]

Context rot: How increasing input tokens impacts LLM performance

Kelly Hong, Anton Troynikov, and Jeff Huber. Context rot: How increasing input tokens impacts LLM performance. Chroma Research Technical Report, 2025

2025
[34]

RULER: What's the Real Context Size of Your Long-Context Language Models?

C. Hsieh, S. Sun, S. Kriman, et al. RULER: What’s the real context size of your long-context language models?arXiv preprint arXiv:2404.06654, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[35]

LLMLingua: Compressing prompts for accelerated inference of large language models.arXiv preprint arXiv:2310.05736, 2023

Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. LLMLingua: Compressing prompts for accelerated inference of large language models.arXiv preprint arXiv:2310.05736, 2023

work page arXiv 2023
[36]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? InInternational Conference on Learning Representations (ICLR), 2024

2024
[37]

ACON: Optimizing Context Compression for Long-horizon LLM Agents

Minki Kang, Wei-Ning Chen, Dongge Han, Huseyin A. Inan, Lukas Wutschitz, Yanzhi Chen, Robert Sim, and Saravan Rajmohan. ACON: Optimizing context compression for long-horizon LLM agents.arXiv preprint arXiv:2510.00615, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[38]

Siegel, Nitya Nadgir, and Arvind Narayanan

Sayash Kapoor, Benedikt Stroebl, Zachary Siegel, Nitya Nadgir, and Arvind Narayanan. Ai agents that matter. InarXiv preprint arXiv:2407.01502, 2024

work page arXiv 2024
[39]

AgentBench: Evaluating llms as agents

Sayash Kapoor, Benedikt Stroebl, Zachary S Siegel, Nitya Nadgir, and Arvind Narayanan. AgentBench: Evaluating llms as agents. InICLR, 2024. 11

2024
[40]

H. R. Kirk, A. Whitefield, P. Röttger, et al. The PRISM alignment dataset: What participatory, representative and individualised human feedback reveals about the subjective and multicultural alignment of large language models.arXiv preprint arXiv:2404.16019, 2024

work page arXiv 2024
[41]

Large language models are zero-shot reasoners.Advances in Neural Information Processing Systems, 2022

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners.Advances in Neural Information Processing Systems, 2022

2022
[42]

LLMs Get Lost In Multi-Turn Conversation

P. Laban, H. Hayashi, Y . Zhou, and J. Neville. Llms get lost in multi-turn conversation.arXiv preprint arXiv:2505.06120, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

M. Levy, A. Jacoby, and Y . Goldberg. Same task, more tokens: the impact of input length on the reasoning performance of large language models.arXiv preprint arXiv:2402.14848, 2024

work page arXiv 2024
[44]

Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in Neural Information Processing Systems, 2020

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in Neural Information Processing Systems, 2020

2020
[45]

Measuring and controlling instruction (in)stability in language model dialogs

Kenneth Li, Tianle Liu, Naomi Bashkansky, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Measuring and controlling instruction (in)stability in language model dialogs. InFirst Conference on Language Modeling, 2024

2024
[46]

Measuring and controlling instruction (in)stability in language model dialogs.arXiv preprint arXiv:2402.10962, 2024

Kenneth Li, Tianle Liu, Naomi Bashkansky, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Measuring and controlling instruction (in)stability in language model dialogs.arXiv preprint arXiv:2402.10962, 2024

work page arXiv 2024
[47]

The complexity trap: Simple observation masking is as efficient as LLM summarization for agent context management.arXiv preprint arXiv:2508.21433, 2025

Tobias Lindenbauer, Igor Slinko, Ludwig Felder, Egor Bogomolov, and Yaroslav Zharov. The complexity trap: Simple observation masking is as efficient as LLM summarization for agent context management.arXiv preprint arXiv:2508.21433, 2025

work page arXiv 2025
[48]

J. Lindsey. Emergent introspective awareness in large language models.arXiv preprint arXiv:2601.01828, 2026

work page arXiv 2026
[49]

Lost in the middle: How language models use long contexts.Transactions of the association for computational linguistics, 12:157–173, 2024

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.Transactions of the association for computational linguistics, 12:157–173, 2024

2024
[50]

Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang

Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.Transactions of the Association for Computational Linguistics, 12, 2024

2024
[51]

Jailbreaking black box large language models in twenty queries

Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. Jailbreaking black box large language models in twenty queries. InNeurIPS, 2024

2024
[52]

Agenteval: Holistic evaluation of llm agents.arXiv preprint arXiv:2403.16965, 2024

Xinran Liu, Yifan Wang, and Wei Chen. Agenteval: Holistic evaluation of llm agents.arXiv preprint arXiv:2403.16965, 2024

work page arXiv 2024
[53]

Gpteval: Nlg evaluation using gpt-4 with better human alignment.EMNLP, 2023

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. Gpteval: Nlg evaluation using gpt-4 with better human alignment.EMNLP, 2023

2023
[54]

The assis- tant axis: Situating and stabilizing the default persona of language models.arXiv preprint arXiv:2601.10387, 2026

Christina Lu, Jack Gallagher, Jonathan Michala, Kyle Fish, and Jack Lindsey. The assis- tant axis: Situating and stabilizing the default persona of language models.arXiv preprint arXiv:2601.10387, 2026

work page arXiv 2026
[55]

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

S. Marks and M. Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets.arXiv preprint arXiv:2310.06824, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[56]

Rossi, Se- unghyun Yoon, and Hinrich Schütze

Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Trung Bui, Ryan A. Rossi, Se- unghyun Yoon, and Hinrich Schütze. NoLiMa: Long-context evaluation beyond literal matching.arXiv preprint arXiv:2502.05167, 2025

work page arXiv 2025
[57]

N. Mu, J. Lu, M. Lavery, and D. Wagner. A closer look at system prompt robustness.arXiv preprint arXiv:2502.12197, 2025. 12

work page arXiv 2025
[58]

Progress measures for grokking via mechanistic interpretability.ICLR, 2023

Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability.ICLR, 2023

2023
[59]

Zoom in: An introduction to circuits.Distill, 2020

Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits.Distill, 2020

2020
[60]

Training language models to follow instructions with human feedback.Advances in Neural Information Processing Systems, 35:27730–27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in Neural Information Processing Systems, 35:27730–27744, 2022

2022
[61]

MemGPT: Towards LLMs as Operating Systems

Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. MemGPT: Towards LLMs as operating systems.arXiv preprint arXiv:2310.08560, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[62]

Llm evaluators recognize and favor their own generations.NeurIPS, 2024

Arjun Panickssery, Samuel R Bowman, and Shi Feng. Llm evaluators recognize and favor their own generations.NeurIPS, 2024

2024
[63]

Gorilla: Large Language Model Connected with Massive APIs

Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. Gorilla: Large language model connected with massive apis.arXiv preprint arXiv:2305.15334, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[64]

Discovering Language Model Behaviors with Model-Written Evaluations

E. Perez, S. Ringer, K. Lukoši¯ut˙e, et al. Discovering language model behaviors with model- written evaluations.arXiv preprint arXiv:2212.09251, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[65]

Steering llama 2 via contrastive activation addition

Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan MacDiarmid, Thomas Maxwell, and Alexander Matt Turner. Steering llama 2 via contrastive activation addition. InACL, 2024

2024
[66]

Can you trust LLM judgments? reliability of LLM-as-a-judge.arXiv preprint arXiv:2412.12509, 2024

Kayla Schroeder and Zach Wood-Doughty. Can you trust LLM judgments? reliability of LLM-as-a-judge.arXiv preprint arXiv:2412.12509, 2024

work page arXiv 2024
[67]

Persona-driven sycophancy in large language models.arXiv preprint arXiv:2402.08471, 2024

Nikhil Shah, Alexander Wei, and Aaryan Bhattacharya. Persona-driven sycophancy in large language models.arXiv preprint arXiv:2402.08471, 2024

work page arXiv 2024
[68]

Scalable and transferable black-box jailbreaks for language models via persona modulation.arXiv preprint arXiv:2311.03348, 2023

Rusheb Shah, Quentin Feuillade-Montixi, Soroush Pour, Arush Tagade, Stephen Casper, and Javier Rando. Scalable and transferable black-box jailbreaks for language models via persona modulation.arXiv preprint arXiv:2311.03348, 2023

work page arXiv 2023
[69]

Shanahan, K

M. Shanahan, K. McDonell, and L. Reynolds. Role-play with large language models.arXiv preprint arXiv:2305.16367, 2023

work page arXiv 2023
[70]

Y . Shao, L. Li, J. Dai, and X. Qiu. Character-LLM: A trainable agent for role-playing.arXiv preprint arXiv:2310.10158, 2023

work page arXiv 2023
[71]

Towards controllable biases in language generation.Findings of the EMNLP, 2020

Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. Towards controllable biases in language generation.Findings of the EMNLP, 2020

2020
[72]

L. Shi, C. Ma, W. Liang, X. Diao, and W. Ma. Judging the judges: A systematic study of position bias in LLM-as-a-Judge.arXiv preprint arXiv:2406.07791, 2024

work page arXiv 2024
[73]

Beyond the imitation game: Quantifying and extrapolating the capabilities of language models

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research, 2023

2023
[74]

Terminal-bench: A benchmark for ai agents in terminal environments.https://www.tbench.ai/, 2025

Stanford NLP Group and Laude Institute. Terminal-bench: A benchmark for ai agents in terminal environments.https://www.tbench.ai/, 2025. Accessed 2026-05-07

2025
[75]

V . K. Suresh. Two-faced social agents: Context collapse in role-conditioned large language models.arXiv preprint arXiv:2511.15573, 2025

work page arXiv 2025
[76]

UL2: Unifying language learning paradigms

Yi Tay, Mostafa Dehghani, Vinh Q Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won Chung, Dara Bahri, Tal Schuster, Steven Huaijin Zheng, et al. UL2: Unifying language learning paradigms. InICLR, 2023. 13

2023
[77]

Scaling monosemanticity: Extracting interpretable features from claude 3 sonnet.Anthropic technical report, 2024

Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Adam Jermyn, et al. Scaling monosemanticity: Extracting interpretable features from claude 3 sonnet.Anthropic technical report, 2024

2024
[78]

Memorization without overfitting: Analyzing the training dynamics of large language models

Kushal Tirumala, Aram Markosyan, Luke Zettlemoyer, and Armen Aghajanyan. Memorization without overfitting: Analyzing the training dynamics of large language models. InNeurIPS, 2022

2022
[79]

Persistent instability in LLM’s personality measurements: Effects of scale, reasoning, and conversation history

Tommaso Tosato, Saskia Helbling, Yorguin-Jose Mantilla-Ramos, et al. Persistent instability in LLM’s personality measurements: Effects of scale, reasoning, and conversation history. arXiv preprint arXiv:2508.04826, 2025

work page arXiv 2025
[80]

Steering Language Models With Activation Engineering

Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J Vazquez, Ulisse Mini, and Monte MacDiarmid. Activation addition: Steering language models without optimization. InarXiv preprint arXiv:2308.10248, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[81]

Self-consistency improves chain of thought reasoning in language models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. InICLR, 2023

2023
[82]

Role-play with large language models.Nature, 2024

Yuxin Wang, Akari Mishra, et al. Role-play with large language models.Nature, 2024

2024

Showing first 80 references.

[1] [1]

Abdulhai, R

M. Abdulhai, R. Cheng, D. Clay, T. Althoff, and S. Levine. Consistently simulating human personas with multi-turn reinforcement learning.arXiv preprint arXiv:2511.00222, 2025. 9

work page arXiv 2025

[2] [2]

Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct

C. Ackerman and N. Panickssery. Inspection and control of self-generated-text recognition ability in Llama3-8b-Instruct.arXiv preprint arXiv:2410.02064, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

Many-shot jailbreaking.Anthropic technical report, 2024

Cem Anil, Esin Durmus, Mrinank Sharma, Joe Benton, Sandipan Kundu, Joshua Batson, Nina Rimsky, Meg Tong, et al. Many-shot jailbreaking.Anthropic technical report, 2024

2024

[4] [4]

Findings from a pilot anthropic-OpenAI alignment evaluation exercise

Anthropic and OpenAI. Findings from a pilot anthropic-OpenAI alignment evaluation exercise. alignment.anthropic.com/2025/openai-findings, 2025

2025

[5] [5]

Detecting and preventing distillation attacks

Anthropic Trust and Safety. Detecting and preventing distillation attacks. Anthropic blog post / news disclosure, 2026. Discloses 16M Claude API queries from suspected distillation campaigns by DeepSeek, Moonshot, MiniMax

2026

[6] [6]

Refusal in Language Models Is Mediated by a Single Direction

A. Arditi, O. Obeso, A. Syed, D. Paleka, and N. Panickssery. Refusal in language models is mediated by a single direction.arXiv preprint arXiv:2406.11717, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[7] [7]

Y . Bai, X. Lv, J. Zhang, et al. LongBench: A bilingual, multitask benchmark for long context understanding.arXiv preprint arXiv:2308.14508, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[8] [9]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[9] [10]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI feedback.arXiv preprint arXiv:2212.08073, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[10] [11]

Betley, X

J. Betley, X. Bao, M. Soto, A. Sztyber-Betley, and J. Chua. Tell me about yourself: LLMs are aware of their learned behaviors.arXiv preprint arXiv:2501.11120, 2025

work page arXiv 2025

[11] [12]

F. J. Binder, J. Chua, T. Korbak, H. Sleight, and J. Hughes. Looking inward: Language models can learn about themselves by introspection.arXiv preprint arXiv:2410.13787, 2024

work page arXiv 2024

[12] [13]

Discovering latent knowledge in language models without supervision.ICLR, 2023

Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision.ICLR, 2023

2023

[13] [14]

Membership inference attacks from first principles

Nicholas Carlini, Steve Chien, Milad Nasr, Shuang Song, Andreas Terzis, and Florian Tramer. Membership inference attacks from first principles. InIEEE S&P, 2022

2022

[14] [15]

J. Chen, X. Wang, R. Xu, et al. From persona to personalization: A survey on role-playing language agents.arXiv preprint arXiv:2404.18231, 2024

work page arXiv 2024

[15] [16]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[16] [17]

Persona Vectors: Monitoring and Controlling Character Traits in Language Models

Runjin Chen, Andy Arditi, Henry Sleight, et al. Persona vectors: Monitoring and controlling character traits in language models.arXiv preprint arXiv:2507.21509, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[17] [18]

Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

W. Chiang, L. Zheng, Y . Sheng, et al. Chatbot arena: An open platform for evaluating LLMs by human preference.arXiv preprint arXiv:2403.04132, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[18] [20]

Examining identity drift in conversations of llm agents.arXiv preprint arXiv:2412.00804, 2024

Junhyuk Choi, Yeseon Hong, Minju Kim, and Bugeun Kim. Examining identity drift in conversations of llm agents.arXiv preprint arXiv:2412.00804, 2024. 10

work page arXiv 2024

[19] [21]

PaLM: Scaling language modeling with pathways

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, et al. PaLM: Scaling language modeling with pathways. 2023

2023

[20] [22]

J. Chua, E. Rees, H. Batra, et al. Bias-augmented consistency training reduces biased reasoning in chain-of-thought.arXiv preprint arXiv:2403.05518, 2024

work page arXiv 2024

[21] [23]

Rossi, Viet Dac Lai, David Seunghyun Yoon, Dilek Hakkani-Tür, and Trung Bui

V . Dongre, R. A. Rossi, V . D. Lai, D. S. Yoon, D. Hakkani-Tür, and T. Bui. Drift no more? context equilibria in multi-turn llm interactions.arXiv preprint arXiv:2510.07777, 2025

work page arXiv 2025

[22] [24]

Dunefsky, P

J. Dunefsky, P. Chlenski, and N. Nanda. Transcoders find interpretable LLM feature circuits. arXiv preprint arXiv:2406.11944, 2024

work page arXiv 2024

[23] [25]

Toy models of superposition.Transformer Circuits Thread, 2022

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition.Transformer Circuits Thread, 2022

2022

[24] [26]

Fanous et al

A. Fanous et al. SycEval: Evaluating LLM sycophancy.arXiv preprint arXiv:2502.08177, 2025

work page arXiv 2025

[25] [27]

Insights into llm long-context failures: when transformers know but don’t tell

Muhan Gao, TaiMing Lu, Kuai Yu, Adam Byerly, and Daniel Khashabi. Insights into llm long-context failures: when transformers know but don’t tell. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 7611–7625, 2024

2024

[26] [28]

Datasheets for datasets.Communications of the ACM, 64(12):86–92, 2021

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets.Communications of the ACM, 64(12):86–92, 2021

2021

[27] [29]

Ghandeharioun, A

A. Ghandeharioun, A. Yuan, M. Guerard, E. Reif, M. A. Lepori, and L. Dixon. Who’s asking? user personas and the mechanics of latent misalignment.NeurIPS 2024 / arXiv preprint arXiv:2406.12094, 2024

work page arXiv 2024

[28] [30]

A Survey on LLM-as-a-Judge

Jiawei Gu, Xuhui Jiang, Zhichao Shi, et al. A survey on LLM-as-a-judge.arXiv preprint arXiv:2411.15594, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[29] [31]

When attention sink emerges in language models: An empirical view

Xiangming Gu, Tianyu Pang, Chao Du, et al. When attention sink emerges in language models: An empirical view. InInternational Conference on Learning Representations (ICLR), 2025

2025

[30] [32]

C. Han, Q. Wang, H. Peng, W. Xiong, and Y . Chen. LM-Infinite: Zero-shot extreme length generalization for large language models.arXiv preprint arXiv:2308.16137, 2023

work page arXiv 2023

[31] [33]

Context rot: How increasing input tokens impacts LLM performance

Kelly Hong, Anton Troynikov, and Jeff Huber. Context rot: How increasing input tokens impacts LLM performance. Chroma Research Technical Report, 2025

2025

[32] [34]

RULER: What's the Real Context Size of Your Long-Context Language Models?

C. Hsieh, S. Sun, S. Kriman, et al. RULER: What’s the real context size of your long-context language models?arXiv preprint arXiv:2404.06654, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[33] [35]

LLMLingua: Compressing prompts for accelerated inference of large language models.arXiv preprint arXiv:2310.05736, 2023

Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. LLMLingua: Compressing prompts for accelerated inference of large language models.arXiv preprint arXiv:2310.05736, 2023

work page arXiv 2023

[34] [36]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? InInternational Conference on Learning Representations (ICLR), 2024

2024

[35] [37]

ACON: Optimizing Context Compression for Long-horizon LLM Agents

Minki Kang, Wei-Ning Chen, Dongge Han, Huseyin A. Inan, Lukas Wutschitz, Yanzhi Chen, Robert Sim, and Saravan Rajmohan. ACON: Optimizing context compression for long-horizon LLM agents.arXiv preprint arXiv:2510.00615, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[36] [38]

Siegel, Nitya Nadgir, and Arvind Narayanan

Sayash Kapoor, Benedikt Stroebl, Zachary Siegel, Nitya Nadgir, and Arvind Narayanan. Ai agents that matter. InarXiv preprint arXiv:2407.01502, 2024

work page arXiv 2024

[37] [39]

AgentBench: Evaluating llms as agents

Sayash Kapoor, Benedikt Stroebl, Zachary S Siegel, Nitya Nadgir, and Arvind Narayanan. AgentBench: Evaluating llms as agents. InICLR, 2024. 11

2024

[38] [40]

H. R. Kirk, A. Whitefield, P. Röttger, et al. The PRISM alignment dataset: What participatory, representative and individualised human feedback reveals about the subjective and multicultural alignment of large language models.arXiv preprint arXiv:2404.16019, 2024

work page arXiv 2024

[39] [41]

Large language models are zero-shot reasoners.Advances in Neural Information Processing Systems, 2022

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners.Advances in Neural Information Processing Systems, 2022

2022

[40] [42]

LLMs Get Lost In Multi-Turn Conversation

P. Laban, H. Hayashi, Y . Zhou, and J. Neville. Llms get lost in multi-turn conversation.arXiv preprint arXiv:2505.06120, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[41] [43]

M. Levy, A. Jacoby, and Y . Goldberg. Same task, more tokens: the impact of input length on the reasoning performance of large language models.arXiv preprint arXiv:2402.14848, 2024

work page arXiv 2024

[42] [44]

Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in Neural Information Processing Systems, 2020

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in Neural Information Processing Systems, 2020

2020

[43] [45]

Measuring and controlling instruction (in)stability in language model dialogs

Kenneth Li, Tianle Liu, Naomi Bashkansky, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Measuring and controlling instruction (in)stability in language model dialogs. InFirst Conference on Language Modeling, 2024

2024

[44] [46]

Measuring and controlling instruction (in)stability in language model dialogs.arXiv preprint arXiv:2402.10962, 2024

Kenneth Li, Tianle Liu, Naomi Bashkansky, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Measuring and controlling instruction (in)stability in language model dialogs.arXiv preprint arXiv:2402.10962, 2024

work page arXiv 2024

[45] [47]

The complexity trap: Simple observation masking is as efficient as LLM summarization for agent context management.arXiv preprint arXiv:2508.21433, 2025

Tobias Lindenbauer, Igor Slinko, Ludwig Felder, Egor Bogomolov, and Yaroslav Zharov. The complexity trap: Simple observation masking is as efficient as LLM summarization for agent context management.arXiv preprint arXiv:2508.21433, 2025

work page arXiv 2025

[46] [48]

J. Lindsey. Emergent introspective awareness in large language models.arXiv preprint arXiv:2601.01828, 2026

work page arXiv 2026

[47] [49]

Lost in the middle: How language models use long contexts.Transactions of the association for computational linguistics, 12:157–173, 2024

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.Transactions of the association for computational linguistics, 12:157–173, 2024

2024

[48] [50]

Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang

Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.Transactions of the Association for Computational Linguistics, 12, 2024

2024

[49] [51]

Jailbreaking black box large language models in twenty queries

Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. Jailbreaking black box large language models in twenty queries. InNeurIPS, 2024

2024

[50] [52]

Agenteval: Holistic evaluation of llm agents.arXiv preprint arXiv:2403.16965, 2024

Xinran Liu, Yifan Wang, and Wei Chen. Agenteval: Holistic evaluation of llm agents.arXiv preprint arXiv:2403.16965, 2024

work page arXiv 2024

[51] [53]

Gpteval: Nlg evaluation using gpt-4 with better human alignment.EMNLP, 2023

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. Gpteval: Nlg evaluation using gpt-4 with better human alignment.EMNLP, 2023

2023

[52] [54]

The assis- tant axis: Situating and stabilizing the default persona of language models.arXiv preprint arXiv:2601.10387, 2026

Christina Lu, Jack Gallagher, Jonathan Michala, Kyle Fish, and Jack Lindsey. The assis- tant axis: Situating and stabilizing the default persona of language models.arXiv preprint arXiv:2601.10387, 2026

work page arXiv 2026

[53] [55]

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

S. Marks and M. Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets.arXiv preprint arXiv:2310.06824, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[54] [56]

Rossi, Se- unghyun Yoon, and Hinrich Schütze

Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Trung Bui, Ryan A. Rossi, Se- unghyun Yoon, and Hinrich Schütze. NoLiMa: Long-context evaluation beyond literal matching.arXiv preprint arXiv:2502.05167, 2025

work page arXiv 2025

[55] [57]

N. Mu, J. Lu, M. Lavery, and D. Wagner. A closer look at system prompt robustness.arXiv preprint arXiv:2502.12197, 2025. 12

work page arXiv 2025

[56] [58]

Progress measures for grokking via mechanistic interpretability.ICLR, 2023

Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability.ICLR, 2023

2023

[57] [59]

Zoom in: An introduction to circuits.Distill, 2020

Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits.Distill, 2020

2020

[58] [60]

Training language models to follow instructions with human feedback.Advances in Neural Information Processing Systems, 35:27730–27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in Neural Information Processing Systems, 35:27730–27744, 2022

2022

[59] [61]

MemGPT: Towards LLMs as Operating Systems

Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. MemGPT: Towards LLMs as operating systems.arXiv preprint arXiv:2310.08560, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[60] [62]

Llm evaluators recognize and favor their own generations.NeurIPS, 2024

Arjun Panickssery, Samuel R Bowman, and Shi Feng. Llm evaluators recognize and favor their own generations.NeurIPS, 2024

2024

[61] [63]

Gorilla: Large Language Model Connected with Massive APIs

Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. Gorilla: Large language model connected with massive apis.arXiv preprint arXiv:2305.15334, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[62] [64]

Discovering Language Model Behaviors with Model-Written Evaluations

E. Perez, S. Ringer, K. Lukoši¯ut˙e, et al. Discovering language model behaviors with model- written evaluations.arXiv preprint arXiv:2212.09251, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[63] [65]

Steering llama 2 via contrastive activation addition

Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan MacDiarmid, Thomas Maxwell, and Alexander Matt Turner. Steering llama 2 via contrastive activation addition. InACL, 2024

2024

[64] [66]

Can you trust LLM judgments? reliability of LLM-as-a-judge.arXiv preprint arXiv:2412.12509, 2024

Kayla Schroeder and Zach Wood-Doughty. Can you trust LLM judgments? reliability of LLM-as-a-judge.arXiv preprint arXiv:2412.12509, 2024

work page arXiv 2024

[65] [67]

Persona-driven sycophancy in large language models.arXiv preprint arXiv:2402.08471, 2024

Nikhil Shah, Alexander Wei, and Aaryan Bhattacharya. Persona-driven sycophancy in large language models.arXiv preprint arXiv:2402.08471, 2024

work page arXiv 2024

[66] [68]

Scalable and transferable black-box jailbreaks for language models via persona modulation.arXiv preprint arXiv:2311.03348, 2023

Rusheb Shah, Quentin Feuillade-Montixi, Soroush Pour, Arush Tagade, Stephen Casper, and Javier Rando. Scalable and transferable black-box jailbreaks for language models via persona modulation.arXiv preprint arXiv:2311.03348, 2023

work page arXiv 2023

[67] [69]

Shanahan, K

M. Shanahan, K. McDonell, and L. Reynolds. Role-play with large language models.arXiv preprint arXiv:2305.16367, 2023

work page arXiv 2023

[68] [70]

Y . Shao, L. Li, J. Dai, and X. Qiu. Character-LLM: A trainable agent for role-playing.arXiv preprint arXiv:2310.10158, 2023

work page arXiv 2023

[69] [71]

Towards controllable biases in language generation.Findings of the EMNLP, 2020

Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. Towards controllable biases in language generation.Findings of the EMNLP, 2020

2020

[70] [72]

L. Shi, C. Ma, W. Liang, X. Diao, and W. Ma. Judging the judges: A systematic study of position bias in LLM-as-a-Judge.arXiv preprint arXiv:2406.07791, 2024

work page arXiv 2024

[71] [73]

Beyond the imitation game: Quantifying and extrapolating the capabilities of language models

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research, 2023

2023

[72] [74]

Terminal-bench: A benchmark for ai agents in terminal environments.https://www.tbench.ai/, 2025

Stanford NLP Group and Laude Institute. Terminal-bench: A benchmark for ai agents in terminal environments.https://www.tbench.ai/, 2025. Accessed 2026-05-07

2025

[73] [75]

V . K. Suresh. Two-faced social agents: Context collapse in role-conditioned large language models.arXiv preprint arXiv:2511.15573, 2025

work page arXiv 2025

[74] [76]

UL2: Unifying language learning paradigms

Yi Tay, Mostafa Dehghani, Vinh Q Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won Chung, Dara Bahri, Tal Schuster, Steven Huaijin Zheng, et al. UL2: Unifying language learning paradigms. InICLR, 2023. 13

2023

[75] [77]

Scaling monosemanticity: Extracting interpretable features from claude 3 sonnet.Anthropic technical report, 2024

Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Adam Jermyn, et al. Scaling monosemanticity: Extracting interpretable features from claude 3 sonnet.Anthropic technical report, 2024

2024

[76] [78]

Memorization without overfitting: Analyzing the training dynamics of large language models

Kushal Tirumala, Aram Markosyan, Luke Zettlemoyer, and Armen Aghajanyan. Memorization without overfitting: Analyzing the training dynamics of large language models. InNeurIPS, 2022

2022

[77] [79]

Persistent instability in LLM’s personality measurements: Effects of scale, reasoning, and conversation history

Tommaso Tosato, Saskia Helbling, Yorguin-Jose Mantilla-Ramos, et al. Persistent instability in LLM’s personality measurements: Effects of scale, reasoning, and conversation history. arXiv preprint arXiv:2508.04826, 2025

work page arXiv 2025

[78] [80]

Steering Language Models With Activation Engineering

Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J Vazquez, Ulisse Mini, and Monte MacDiarmid. Activation addition: Steering language models without optimization. InarXiv preprint arXiv:2308.10248, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[79] [81]

Self-consistency improves chain of thought reasoning in language models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. InICLR, 2023

2023

[80] [82]

Role-play with large language models.Nature, 2024

Yuxin Wang, Akari Mishra, et al. Role-play with large language models.Nature, 2024

2024