Written by AI, Managed by AI: Semantic Space Control and Index Sickness Elimination Across 391 Consecutive Sessions

Hui Zhang; Shuren Song

arXiv: 2606.19121 · v2 · pith:BW7KKPOInew · submitted 2026-06-17 · 💻 cs.SE · cs.CL · cs.HC


Pith reviewed 2026-06-26 20:21 UTC · model grok-4.3

classification 💻 cs.SE · cs.CL · cs.HC

keywords index sickness · baseline-log physical separation · Pang Principle · phantom legislation · LLM collaboration · semantic drift · prompt volume reduction · AI-managed projects


The pith

In one 391-session project, baseline-log physical separation cut AI instruction volume by about 75 percent, and no index sickness recurrence was observed over the following 150 or so sessions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that the standard response to conceptual drift in long LLM projects, adding more symbolic rules and identifiers, produces the opposite of the intended effect. LLMs shift from understanding business semantics to self-referential reasoning inside the symbol system, creating outputs that are internally consistent yet disconnected from reality; the authors label this pattern index sickness and its extreme form phantom legislation. They advance the Pang Principle that natural language carrying explicit purpose transmits higher information quality than symbolic expression, then introduce baseline-log physical separation as the concrete mechanism that follows from the principle. In their 391-session software project the separation reduced instruction volume by roughly three quarters, after which index sickness did not return across the next 150 sessions.

Core claim

In a one-month, 391-session software project the authors observed that accumulating symbolic identifiers and defensive rules caused LLMs to abandon genuine business semantics and retreat to internally consistent but physically disconnected outputs, a failure they name index sickness with its canonical form phantom legislation. This observation supports the Pang Principle that natural language with explicit purpose conveys greater information quality than symbolic systems. Implementing baseline-log physical separation as the corresponding engineering mechanism reduced AI instruction volume by approximately 75 percent, after which no recurrence of index sickness appeared in the subsequent 150 sessions.

What carries the argument

Baseline-Log Physical Separation, the practice of maintaining baseline instructions and execution logs in physically separate spaces to enforce the Pang Principle and block self-referential index sickness.
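The abstract does not spell out how the separation is implemented, so the following is only a minimal sketch of one possible reading: the baseline is a single file that is overwritten when decisions change, session history goes to a separate directory, and only the purpose statement plus the baseline are assembled into the prompt. The class name `Workspace`, the file name `baseline.md`, the `logs/` directory, and the sample strings are all hypothetical and do not come from the paper.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path


class Workspace:
    """Hypothetical layout: one baseline file plus a separate log directory."""

    def __init__(self, root: Path) -> None:
        self.baseline = root / "baseline.md"  # current agreed state only
        self.log_dir = root / "logs"          # session history, kept out of prompts
        self.log_dir.mkdir(parents=True, exist_ok=True)

    def set_baseline(self, text: str) -> None:
        # Overwrite, never append: superseded decisions leave the baseline entirely.
        self.baseline.write_text(text, encoding="utf-8")

    def append_log(self, session: int, note: str) -> Path:
        # History accumulates here, one file per session.
        path = self.log_dir / f"session-{session:04d}.log"
        stamp = datetime.now(timezone.utc).isoformat()
        with path.open("a", encoding="utf-8") as fh:
            fh.write(f"{stamp} {note}\n")
        return path

    def build_context(self, purpose: str) -> str:
        # Only the purpose statement and the baseline are assembled for the model;
        # nothing under logs/ is read here.
        baseline = self.baseline.read_text(encoding="utf-8")
        return f"Purpose: {purpose}\n\n{baseline}"


def demo() -> str:
    with tempfile.TemporaryDirectory() as tmp:
        ws = Workspace(Path(tmp))
        ws.set_baseline("Checkout uses the legacy cart API.")
        ws.append_log(1, "Decided to drop the legacy cart API.")
        ws.set_baseline("Checkout uses the shared component library.")
        return ws.build_context("Add a coupon field to checkout.")


if __name__ == "__main__":
    print(demo())
```

In this sketch the deprecated wording survives only in the log file, so it cannot reach the assembled context unless someone reads the log deliberately. Whether the authors' mechanism works this way is an assumption.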

If this is right

Accumulating symbolic rules beyond a complexity threshold causes LLMs to produce phantom legislation instead of accurate outputs.
Baseline-log physical separation reduces total AI instruction volume by about 75 percent while preserving output quality.
Index sickness does not recur once baseline instructions and logs occupy physically separate spaces.
The Pang Principle supplies a general rule for preferring natural-language purpose statements over symbolic identifier systems in long-horizon LLM work.
Physical separation provides a lightweight engineering control that avoids the need for ever-larger context windows or defensive prompt layers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same separation could be applied in non-software domains such as research note-taking or ongoing content generation to test whether instruction volume drops and semantic drift is reduced.
Prompt design guidelines might shift from adding formal constraints to explicitly stating purpose in plain language whenever possible.
The single-case design leaves open whether the result would hold if baseline-log separation were introduced as the sole change in a controlled replication.
The Pang Principle suggests a measurable way to compare information quality between natural-language and symbolic prompt variants in future experiments.

Load-bearing premise

The observed drop in instruction volume and absence of index sickness recurrence are caused by baseline-log physical separation rather than by other unmeasured changes in project practices during the single case.

What would settle it

A second project run under otherwise identical conditions but without baseline-log physical separation that still shows the same 75 percent volume reduction and no index sickness would falsify the claimed causal role of the separation.

Original abstract

The prevailing engineering intuition for addressing conceptual drift in long-horizon LLM collaboration is to trade more formal constraints for more reliable outputs -- designing symbolic identifier systems, accumulating defensive rules in System Prompts, expanding context windows. Our engineering record shows that in long-horizon settings, this direction may produce effects contrary to design intent. Using action research methods in a real software project (Bang-v3) spanning approximately one month and 391 collaborative sessions, we document and analyze the failure process of these strategies. When the symbolic system exceeds a complexity threshold, LLMs do not become more accurate -- instead, they abandon genuine understanding of business semantics, retreat to self-referential reasoning within the symbolic layer, and generate outputs that appear internally consistent but are physically disconnected from reality. We name this failure pattern "Index Sickness," and its canonical manifestation "Phantom Legislation." We name the underlying principle the "Pang Principle (Semantic Vitality Law)": natural language carrying explicit purpose conveys far greater information quality than symbolic expression. From this, we design and validate its physical engineering mechanism: "Baseline-Log Physical Separation." In the same project, this mechanism reduced AI Instructions volume by ~75%, and across the subsequent ~150 sessions, no recurrence of Index Sickness was observed. A bilingual companion version (Chinese) is included as supplementary material.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Single-case action research on LLM prompt bloat in one project names some patterns but cannot isolate whether the proposed fix caused the reported improvements.


The core observation here is a 391-session record from one software project where accumulating symbolic rules in prompts led to what the authors call Index Sickness and Phantom Legislation. They report that switching to Baseline-Log Physical Separation cut AI instruction volume by about 75 percent and produced no recurrence over the next 150 sessions. That is the main empirical claim.

The paper does document a long, real-world collaboration in unusual detail. Most prompt-engineering work relies on short experiments; seeing the same system run across a full month of sessions gives a sense of how issues compound that shorter studies miss. The naming of the failure modes is also straightforward and may help others recognize similar drift when it appears.

The soft spot is the design itself. All claims rest on a single trajectory with no control condition, no pre-specified metrics for other practice changes, and no replication. The Pang Principle is extracted from the same sessions that later validate the mechanism, so the reported success is hard to separate from team learning, project shifts, or unlogged adjustments that happened at the same time. The 75 percent reduction and zero-recurrence statements therefore remain descriptive rather than causal.

The underlying preference for natural language over growing symbolic layers is already discussed in prompt literature, so the novelty sits mainly in the specific names and the physical-separation tactic applied to this case.

This is useful reading for practitioners running persistent LLM agents in software projects who want concrete examples of long-horizon drift. It is not yet strong enough for broad engineering recommendations or for academic claims about general mechanisms.

I would not send it to peer review without at least a second independent case or a controlled comparison that isolates the intervention from concurrent changes.

Referee Report

2 major / 0 minor

Summary. The manuscript reports on an action research project (Bang-v3) involving 391 LLM collaboration sessions, documenting the emergence of 'Index Sickness' when symbolic identifier systems in prompts exceed a complexity threshold, causing LLMs to engage in self-referential reasoning disconnected from business semantics. It introduces the 'Pang Principle' stating that natural language with explicit purpose conveys greater information quality than symbolic expressions, and proposes 'Baseline-Log Physical Separation' as an engineering mechanism. The paper claims this mechanism reduced AI Instructions volume by approximately 75% and prevented recurrence of Index Sickness over the subsequent 150 sessions.

Significance. If the findings hold beyond this single case, they would challenge the common practice of accumulating formal constraints in LLM system prompts for long-horizon tasks and suggest prioritizing semantic baselines. The work provides a concrete example of failure modes in AI-managed software projects and a potential mitigation strategy, which could inform practices in AI-assisted development if replicated.

major comments (2)

[Abstract] Abstract: The headline empirical claim that Baseline-Log Physical Separation produced a ~75% drop in AI Instructions volume and eliminated Index Sickness recurrence over ~150 sessions rests on a single action-research trajectory without a control arm, pre-specified measurement protocol for concurrent practice changes, statistical analysis, or independent replication, leaving the causal attribution unisolated from team maturation or unlogged adjustments.
[Abstract] Abstract: The Pang Principle is derived directly from observations in the Bang-v3 sessions, and the validation of Baseline-Log Physical Separation as its mechanism occurs in the identical set of sessions, so the reported success reduces to a post-hoc interpretation of the input data rather than an independent test.
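The absence of statistical analysis can be made concrete with a small calculation. The sketch below is not taken from the manuscript or from this report; it assumes, purely for illustration, that the roughly 150 post-intervention sessions behave like independent trials with a fixed per-session recurrence probability, which a single evolving project is unlikely to satisfy. Under that assumption it computes the largest recurrence rate still compatible with observing zero events, alongside the familiar "rule of three" approximation. The function names are hypothetical.

```python
def zero_event_upper_bound(n_trials: int, confidence: float = 0.95) -> float:
    """One-sided upper bound on a per-trial event rate after zero events in n_trials."""
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    alpha = 1.0 - confidence
    # Solve (1 - p) ** n_trials = alpha for p.
    return 1.0 - alpha ** (1.0 / n_trials)


def rule_of_three(n_trials: int) -> float:
    """Common approximation to the 95 percent bound: 3 / n."""
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    return 3.0 / n_trials


if __name__ == "__main__":
    n = 150  # approximate session count reported in the abstract
    print(f"exact bound: {zero_event_upper_bound(n):.4f}")
    print(f"rule of three: {rule_of_three(n):.4f}")
```

Under these assumptions, zero recurrences in 150 sessions is compatible with a per-session recurrence rate of up to about 2 percent at 95 percent confidence. The calculation bounds the rate only and says nothing about what caused the absence of recurrence.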

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for identifying key methodological limitations in our single-case action research report. We address each major comment below, agree where the critique is accurate, and indicate revisions to better qualify the claims as exploratory observations rather than controlled causal findings.


Referee: [Abstract] Abstract: The headline empirical claim that Baseline-Log Physical Separation produced a ~75% drop in AI Instructions volume and eliminated Index Sickness recurrence over ~150 sessions rests on a single action-research trajectory without a control arm, pre-specified measurement protocol for concurrent practice changes, statistical analysis, or independent replication, leaving the causal attribution unisolated from team maturation or unlogged adjustments.

Authors: We agree that the study consists of a single action-research trajectory without a control arm, pre-specified measurement protocol, statistical analysis, or independent replication. The reported ~75% reduction and absence of recurrence are direct observations from the project logs rather than isolated causal effects. We will revise the abstract to explicitly frame these as observed outcomes within this specific case and to note the inability to separate effects from concurrent team changes.

revision: yes

Referee: [Abstract] Abstract: The Pang Principle is derived directly from observations in the Bang-v3 sessions, and the validation of Baseline-Log Physical Separation as its mechanism occurs in the identical set of sessions, so the reported success reduces to a post-hoc interpretation of the input data rather than an independent test.

Authors: This assessment is accurate: both the Pang Principle and the Baseline-Log Physical Separation mechanism were formulated and applied within the same 391-session trajectory. This structure is inherent to the action-research approach used, in which theory and intervention emerge from ongoing practice. We will revise the abstract and relevant sections to state explicitly that the reported outcomes constitute within-trajectory observations rather than an independent test.

revision: yes

Circularity Check

1 step flagged

Pang Principle and Baseline-Log mechanism named and validated from identical 391-session observations

specific steps

self-definitional [Abstract]
"We name this failure pattern "Index Sickness," and its canonical manifestation "Phantom Legislation." We name the underlying principle the "Pang Principle (Semantic Vitality Law)": natural language carrying explicit purpose conveys far greater information quality than symbolic expression. From this, we design and validate its physical engineering mechanism: "Baseline-Log Physical Separation." In the same project, this mechanism reduced AI Instructions volume by ~75%, and across the subsequent ~150 sessions, no recurrence of Index Sickness was observed."

The principle is extracted from the project's observed failures; the mechanism is then 'validated' by the same project's subsequent sessions. The 75% reduction and zero recurrence are therefore re-descriptions of the input trajectory rather than independent tests.

full rationale

The paper's central chain defines Index Sickness and Pang Principle directly from the Bang-v3 action-research trajectory, then reports the Baseline-Log intervention's 75% reduction and zero recurrence as validation within the same sessions. No control arm, pre-specified protocol, or external replication isolates the effect; the reported success is therefore a post-hoc reading of the input data. This matches self-definitional and fitted-input-called-prediction patterns with load-bearing impact on both diagnosis and remedy.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claim rests on the authors' interpretation of their own single-project experience and on newly introduced terms whose definitions are internal to the report.

axioms (1)

ad hoc to paper · Natural language carrying explicit purpose conveys far greater information quality than symbolic expression (Pang Principle)
Introduced in the abstract as the underlying principle derived from the observed failure process.

invented entities (3)

Index Sickness · no independent evidence
purpose: Name for the failure pattern in which LLMs abandon business semantics for self-referential symbolic reasoning
New term coined to describe the observed behavior when symbolic systems exceed a complexity threshold.

Phantom Legislation · no independent evidence
purpose: Canonical manifestation of Index Sickness
New term for the specific form of internally consistent but reality-disconnected outputs.

Baseline-Log Physical Separation · no independent evidence
purpose: Engineering mechanism implementing the Pang Principle
Newly proposed separation technique validated only within the reported project.

pith-pipeline@v0.9.1-grok · 5773 in / 1496 out tokens · 23753 ms · 2026-06-26T20:21:53.297730+00:00 · methodology


Reference graph

Works this paper leans on

31 extracted references · 4 linked inside Pith

[6]

Effective Context Engineering for AI Agents

Rajasekaran, P., Dixon, E., Ryan, C., & Hadfield, J. (Anthropic Applied AI Team). (2025). "Effective Context Engineering for AI Agents." Anthropic Engineering Blog, Sep 29, 2025. https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents

2025
[9]

Lost in the Middle: How Language Models Use Long Contexts

Liu, N.F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2024). "Lost in the Middle: How Language Models Use Long Contexts." Transactions of the Association for Computational Linguistics, 12, 157–173. https://aclanthology.org/2024.tacl-1.9/

2024
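[16]

Index Sickness

Booch, G. (1994). Object-Oriented Analysis and Design with Applications (2nd ed.). Benjamin/Cummings. This Article Was Written by AI: Semantic Space Managed by AI and "Index Sickness" Eliminated Across 391 Consecutive Sessions. Hui Zhang, 深圳市云溪科技有限公司, zhanghui@ecloudriver.com; Shuren Song, Information Technology Center, Tsinghua University, songsr@tsinghua.edu.cn. Abstract: When addressing conceptual drift in long-horizon large language model (LLM) collaboration, the prevailing engineering intuition in industry is to trade more elaborate formal constraints for more reliable outputs: designing symbolic identifier systems for task entities, continually stacking error-prevention rules in the System Prompt, and using ever larger context...

1994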
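[17]

Naming and dissecting "Index Sickness": documenting the failure of supply-side formalization strategies in long-horizon LLM collaboration, naming it, and analyzing the cognitive mechanism behind it.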
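[18]

Pang Principle (Semantic Vitality Law)

Naming and documenting the "Pang Principle (Semantic Vitality Law)": in the context of human-AI collaborative engineering, giving an explicit statement and empirical examination of the overlooked common-sense observation that "natural language carrying explicit purpose has far higher information quality than symbolic expression."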
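[19]

Proposing a consumption-side activation perspective: activating the AI's working state with natural language plus purpose is a consumption-side strategy. An explicit purpose narrows the AI's space of semantic interpretation, so that the AI preferentially activates purpose-relevant method knowledge and directly locates purpose-relevant project materials, without secondary filtering through a retrieval layer. This forms a directional complement to the current improvement path, which is dominated by supply-side approaches.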
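[20]

Baseline-Log Physical Separation

Designing and validating the "Baseline-Log Physical Separation" mechanism: providing a lightweight cross-session state management scheme that does not depend on model upgrades, and reporting its effect in a real project. 1.3 Artifact and Methodology Statement. The writing process of this paper constitutes a second case for the paper's argument, but what it demonstrates is not that "Index Sickness was cured again" (the writing of this paper was small in scale and never fell ill, so no basis for comparison exists); rather, it is a live demonstration of "AI vitality" itself. The collaboration mode throughout the writing process was as follows: the Owner was responsible for providing direction, making judgments, and proposing corrections; the AI collaborator was responsible for maintaining the full cross-session context, participating in the derivation of the argument chain, locating relevant content, and organizing the written expression. Every intervention by the Owner started from a natural-language statement of purpose, which is exactly the working state of the mechanism that the Pang Principle describes. This paper is not only a paper about CSF but also a document produced while CSF was running; the two corroborate each other. It should be noted here that all text in this paper is by the AI collaborator (...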
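[21]

Multi-hop resolution imposes cognitive load: each time the AI processes an identifier, it must first look up its mapped meaning and then substitute it into the current task. Repeated cross-file table lookups consume cognitive bandwidth that should have gone to understanding the business logic.
[22]

Rigid propagation of errors: once the mapping of a symbol is resolved incorrectly, all subsequent inferences based on that symbol are "corrected" self-consistently within the symbol system, forming a fictional closed loop entirely detached from the real business logic.
[23]

The human correction channel is cut off: the high cognitive threshold of the symbol system means that the Owner, who holds the business intuition, cannot judge by intuition whether the AI's output is correct and discovers errors only at physical execution.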
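[24]

Literal stickiness of historical context: even when the Owner explicitly declares an old scheme deprecated, in a long-horizon context that has not been reset, the deprecated field names and interface structures still retain relatively high weight in the weight allocation of the Transformer self-attention mechanism, quietly seeping into new design documents and causing cross-contamination between old and new terminology as well as consistency gaps. For the self-attention mechanism, historical content in the context does not have its attention weight lowered because it "has been declared deprecated"; physically clearing the historical content is the only reliable means of blocking this contamination. Relying on a System Prompt declaration to "please ignore the old scheme" is mechanistically ineffective. The engineering log of the project's 136th session (DEVNOTES D-108) records a typical case of this mechanism: while wrapping up across sessions, the AI, working from residual impression, completed the precise field label TD-01 as "024 反向闯关插件" (024 reverse level-challenge plugin), whereas TD-01 actually = the H5 shared component library; the error spread to 13 cross-file ...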
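[25]

A minimal control variable needs to be introduced that converges the direction of interpretation without sacrificing expressive flexibility (Corollary 2).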
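[26]

Purpose

"Purpose" semantically corresponds to the core constraint of the current task and is the most efficient convergence control variable (Corollary 3). This yields the semantic space control formula: natural language + purpose = maximum information quality; natural language − purpose = ambiguity noise. As the Owner pointed out in a later theoretical refinement: "Natural language + purpose not only activates information-use efficiency from the consumption side; more importantly, it uses the right information to activate the AI's vitality." The basic supply-side/consumption-side distinction was established earlier; here it is developed at the theoretical level. The shared assumption of supply-side strategies is that "the more complete and determinate the input, the more reliable the output"; symbolic identifier systems, rule stacking, RAG retrieval injection, and enlarged context windows all belong to this category. RAG is also supply-side: although its core mechanism is choosing which documents to inject into the AI, the source of the AI's cognitive activation is still the externally injected document fragments rather than its own prior knowledge activated directly by purpose; "the retrieval layer...

2026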
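[27]

How to help AI remember more

Such formalization attempts jointly confirm that long-horizon context degradation is a real engineering challenge; but their prescription (more complex middleware, more elaborate symbolic structures) runs opposite to the direction of this paper. They work under the problem assumption of "how to help AI remember more." The diagnosis of this paper is the opposite: the problem is not that the AI remembers too little, but that the active context is filled with historical noise that should not be there. The divergence between the two prescriptions is rooted in how the problem is framed. Whether the two directions are complementary rather than opposed cannot yet be judged from experimental data; this is one of the most immediate topics for follow-up work. The industrial practice direction represented by Context Engineering [6] addresses a supply-side problem: how to dynamically select which information to inject into and clear from the context window, including techniques such as compaction, external notes, and sub-agent collaboration. This is adjacent to, but does not overlap with, the structural problem identified in this paper (the physical cohabitation of baseline and historical discussion); as complementary dimensions they together form the complete engineering picture of long-horizon human-AI collaboration. This...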
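[28]

and Hutchins's research on distributed cognition [15] jointly point out that cognitive processes do not end at the boundary of the individual mind but extend into the overall system formed by artifacts, documents, and the surrounding environment. In the context of human-AI collaboration, the baseline and the log are precisely the physical carriers of this extended cognitive system: they are not patches on the AI's memory but a cognitive control structure shared by the whole human-AI team. Booch pointed out well before the LLM era that software development is in essence a social activity: building a mental model shared across the team is more fundamental than any elaborate formal specification [16]. When one member of the collaborating team is a large language model, the weight of this judgment only increases. The engineering findings recorded in this paper are a concrete engineering landing of this philosophical position in the human-AI collaboration setting, with complete logs and session records available for external verification. This study is built on action-research data from a single project, and its generality awaits follow-up validation across more projects, more collaborators, and more LLM versions. "...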
[29]

Promptware Engineering: Software Engineering for Prompt-Enabled Systems

Chen, Z. et al. (2026). "Promptware Engineering: Software Engineering for Prompt-Enabled Systems." ACM Transactions on Software Engineering and Methodology (TOSEM). arXiv:2503.02400

arXiv 2026
[30]

Git Context Controller: Manage the Context of LLM-based Agents like Git

Wu, J. et al. (2025). "Git Context Controller: Manage the Context of LLM-based Agents like Git." arXiv:2508.00031.

arXiv 2025
[31]

CoMem: Context Management with A Decoupled Long-Context Model

Zhang, Y., Dong, C., Jin, S., Yu, C., Cui, H., Jin, H., Zhang, X., Bonab, H., Lockard, C., et al. (2026). "CoMem: Context Management with A Decoupled Long-Context Model." arXiv:2605.30842

Pith/arXiv arXiv 2026
[32]

REAL: A Reasoning-Enhanced Graph Framework for Long-Term Memory Management of LLMs

Lu, K., Chen, L., Jiang, G., Qin, Z., Liu, Y., & Zhang, W. (2026). "REAL: A Reasoning-Enhanced Graph Framework for Long-Term Memory Management of LLMs." arXiv:2606.10694

Pith/arXiv arXiv 2026
[33]

GAM: Hierarchical Graph-based Agentic Memory for LLM Agents

Wu, Z., Zhang, H., Lin, F., Xu, W., Xu, X., Chen, Y., Zou, H.P., Chen, S., Zhang, W., et al. (2026). "GAM: Hierarchical Graph-based Agentic Memory for LLM Agents." arXiv:2604.12285.

Pith/arXiv arXiv 2026
[34]

Effective Context Engineering for AI Agents

Rajasekaran, P., Dixon, E., Ryan, C., & Hadfield, J. (Anthropic Applied AI Team). (2025). "Effective Context Engineering for AI Agents." Anthropic Engineering Blog, Sep 29, 2025. https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents

2025
[35]

Hong, K., Troynikov, A., & Huber, J. (2025). Context Rot: How Increasing Input Tokens Impacts LLM Performance. Chroma Technical Report. https://trychroma.com/research/context-rot

2025
[36]

The Limits of Long-Context Reasoning in Automated Bug Fixing

Raju, R., Ji, M., Upasani, S., Li, B., & Thakker, U. (2026). "The Limits of Long-Context Reasoning in Automated Bug Fixing." ICLR 2026 ICBINB Workshop. arXiv:2602.16069

arXiv 2026
[37]

Lost in the Middle: How Language Models Use Long Contexts

Liu, N.F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2024). "Lost in the Middle: How Language Models Use Long Contexts." Transactions of the Association for Computational Linguistics, 12, 157–173. https://aclanthology.org/2024.tacl-1.9/

2024
[38]

SkillReducer: Optimizing LLM Agent Skills for Token Efficiency

Gao, Y., Li, Z., Yuanyuanyuan, Ji, Z., Ma, P., & Wang, S. (2026). "SkillReducer: Optimizing LLM Agent Skills for Token Efficiency." arXiv:2603.29919.

Pith/arXiv arXiv 2026
[39]

Teaching Action Research

Staron, M. (2024). "Teaching Action Research." EMSE Edu Book. arXiv:2408.02399

arXiv 2024
[40]

Wohlin, C., Runeson, P., Höst, M., Ohlsson, M.C., Regnell, B., & Wesslén, A. (2012). Experimentation in Software Engineering. Springer

2012
[41]

Generative AI and Empirical Software Engineering: A Paradigm Shift

Treude, C. & Storey, M. (2025). "Generative AI and Empirical Software Engineering: A Paradigm Shift." AIware 2025. arXiv:2502.08108.

arXiv 2025
[42]

The Extended Mind

Clark, A. & Chalmers, D. (1998). "The Extended Mind." Analysis, 58(1), 7–19

1998
[43]

Hutchins, E. (1995). Cognition in the Wild. MIT Press

1995
[44]

Booch, G. (1994). Object-Oriented Analysis and Design with Applications (2nd ed.). Benjamin/Cummings

1994
