Context Pruning for Coding Agents via Multi-Rubric Latent Reasoning
Pith reviewed 2026-05-19 16:10 UTC · model grok-4.3
The pith
LaMR decomposes code relevance into separate semantic and dependency models to prune agent context without losing performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LaMR is a structured pruning framework that decomposes code relevance into two interpretable quality dimensions, semantic evidence and dependency support, each modeled by a dedicated CRF with dimension-specific transition dynamics. A mixture-of-experts gating network dynamically weights the per-rubric emissions conditioned on the query, and a final CRF layer on the fused emissions produces the aggregate keep-or-prune decision. Multi-rubric labels are derived from the existing training corpus via AST-based program analysis, which simultaneously denoises the teacher's binary labels.
What carries the argument
The LaMR framework, which uses two separate CRFs for semantic evidence and dependency support together with a mixture-of-experts gate that fuses their emissions before a final decision CRF.
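The fusion step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the two-label layout (0 = prune, 1 = keep), the softmax gate, and all numeric scores are assumptions made for the example, and the per-rubric CRF layers are reduced to plain emission scores.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_emissions(per_rubric, gate_logits):
    """Weight each rubric's per-line [prune, keep] scores by a softmax gate and sum.

    per_rubric: one list per rubric, each holding [score_prune, score_keep] per line.
    gate_logits: one logit per rubric (assumed to come from a query-conditioned gate).
    """
    weights = softmax(gate_logits)
    n_lines = len(per_rubric[0])
    return [
        [sum(w * rubric[i][label] for w, rubric in zip(weights, per_rubric))
         for label in (0, 1)]
        for i in range(n_lines)
    ]

def viterbi(emissions, transitions):
    """Best label sequence for a linear-chain CRF; transitions[a][b] scores a -> b."""
    n_labels = len(transitions)
    score = list(emissions[0])
    back = []
    for em in emissions[1:]:
        new_score, pointers = [], []
        for b in range(n_labels):
            best_a = max(range(n_labels), key=lambda a: score[a] + transitions[a][b])
            pointers.append(best_a)
            new_score.append(score[best_a] + transitions[best_a][b] + em[b])
        score = new_score
        back.append(pointers)
    label = max(range(n_labels), key=lambda b: score[b])
    path = [label]
    for pointers in reversed(back):
        label = pointers[label]
        path.append(label)
    return path[::-1]

# Hypothetical scores for five code lines.
semantic = [[1, 0], [0, 2], [0, 2], [1, 0], [1, 0]]     # favors keeping lines 1-2
dependency = [[0, 2], [1, 0], [1, 0], [1, 0], [1, 0]]   # favors keeping line 0
final_transitions = [[0.2, 0.0], [0.0, 0.2]]            # mild preference for staying put

balanced = viterbi(fuse_emissions([semantic, dependency], [0.0, 0.0]), final_transitions)
semantic_only = viterbi(fuse_emissions([semantic, dependency], [10.0, -10.0]), final_transitions)
print(balanced, semantic_only)
```

With an even gate the dependency line is retained alongside the semantic span; when the gate shifts almost all weight to the semantic rubric, that line is pruned. The example is meant only to show why a query-conditioned gate can change the final decision without changing either rubric's scores.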
If this is right
- LaMR wins 12 of 16 head-to-head multi-turn comparisons on the four evaluated benchmarks.
- The method saves up to 31 percent more tokens than prior pruners on multi-turn agent tasks.
- Exact Match improves by up to 3.5 points on single-turn tasks while performance remains competitive with full context.
- Context denoising from the multi-rubric approach frequently raises task accuracy above the unpruned baseline.
Where Pith is reading between the lines
- The same separation of relevance dimensions could be tested on long-document question answering or retrieval-augmented generation outside code.
- Adding further rubrics for aspects such as runtime behavior or security properties might extend the framework without new human labels.
- Lower token budgets from pruning could allow agents to maintain longer interaction histories within fixed compute limits.
- The observed outperformance over full context suggests that selective context may become a general principle for noisy long-context agent tasks.
Load-bearing premise
That AST-based program analysis can reliably generate labels for the two relevance dimensions and denoise the original binary labels without systematic bias or missing key patterns.
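To make the premise concrete, here is one way AST analysis could recover dependency-support lines for a Python file. It is a sketch under stated assumptions: the paper's actual derivation rules are not given on this page, the function name `dependency_support_lines` is hypothetical, and the heuristic follows only one hop from the kept lines to top-level style definitions (imports, assignments, function and class headers).

```python
import ast

def dependency_support_lines(source, semantic_lines):
    """Return lines that define names read on the given semantic-evidence lines."""
    tree = ast.parse(source)
    # Names that are read (Load context) on lines already marked as semantic evidence.
    used = {
        node.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Name)
        and isinstance(node.ctx, ast.Load)
        and node.lineno in semantic_lines
    }
    support = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            bound = {(a.asname or a.name).split(".")[0] for a in node.names}
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            bound = {node.name}
        elif isinstance(node, ast.Assign):
            bound = {t.id for t in node.targets if isinstance(t, ast.Name)}
        else:
            continue
        if bound & used:
            support.add(node.lineno)
    # Lines already kept as semantic evidence are not reported again.
    return sorted(support - set(semantic_lines))

SOURCE = """import os
import json

LIMIT = 10

def load(path):
    with open(path) as fh:
        return json.load(fh)

def report(path):
    data = load(path)
    return data[:LIMIT]
"""

# Suppose the teacher marked the body of report() (lines 11-12) as relevant.
print(dependency_support_lines(SOURCE, {11, 12}))  # [4, 6]
```

The sketch also shows the kind of gap the premise is exposed to: the `import json` line that `load` depends on is not recovered, because the heuristic does not follow dependencies transitively, and nothing in it looks at the query text.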
What would settle it
A controlled experiment on a new benchmark in which LaMR-pruned contexts produce consistently lower task success rates than the corresponding full contexts would falsify the claim that the pruning preserves or improves performance.
Original abstract
LLM-powered coding agents spend the majority of their token budget reading repository files, yet much of the retrieved code is irrelevant to the task at hand. Existing learned pruners compress this context with a single-objective sequence labeler, collapsing all facets of code relevance into one score and one transition matrix. We show that this formulation creates a modeling bottleneck: a single CRF transition prior must serve heterogeneous retention patterns, including contiguous semantic spans and sparse structural support lines. We propose LaMR (Latent Multi-Rubric), a structured pruning framework that decomposes code relevance into two interpretable quality dimensions, semantic evidence and dependency support, each modeled by a dedicated CRF with dimension-specific transition dynamics. A mixture-of-experts gating network dynamically weights the per-rubric emissions conditioned on the query, and a final CRF layer on the fused emissions produces the aggregate keep-or-prune decision. To supervise each dimension without additional annotation cost, we derive multi-rubric labels from the existing training corpus via AST-based program analysis, simultaneously denoising the teacher's binary labels. By effectively filtering distracting noise, LaMR frequently matches or even outperforms unpruned full-context baselines. Experiments on four benchmarks (SWE-Bench Verified, SWE-QA, LCC, LongCodeQA) show that LaMR wins 12 of 16 head-to-head multi-turn comparisons. It saves up to 31% more tokens on multi-turn agent tasks and improves Exact Match by up to +3.5 on single-turn tasks, while performance is frequently enhanced by denoising the context, and any remaining drops are marginal.
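The bottleneck the abstract attributes to a single transition prior can be illustrated with a small score calculation. The matrices and labelings below are invented for the illustration and are not taken from the paper; `path_score` is a hypothetical helper that computes the unnormalized linear-chain CRF score of one labeling.

```python
def path_score(labels, emissions, transitions):
    """Unnormalized linear-chain CRF score of one keep (1) / prune (0) labeling."""
    score = emissions[0][labels[0]]
    for prev, cur, em in zip(labels, labels[1:], emissions[1:]):
        score += transitions[prev][cur] + em[cur]
    return score

n_lines = 6
flat = [[0.0, 0.0] for _ in range(n_lines)]   # emissions with no preference at all

sticky = [[1.0, -1.0], [-1.0, 1.0]]           # rewards staying in the same label
neutral = [[0.0, 0.0], [0.0, 0.0]]            # no transition preference

span = [0, 1, 1, 1, 0, 0]                     # contiguous block, three lines kept
sparse = [1, 0, 1, 0, 1, 0]                   # isolated lines, three lines kept

gap_sticky = path_score(span, flat, sticky) - path_score(sparse, flat, sticky)
gap_neutral = path_score(span, flat, neutral) - path_score(sparse, flat, neutral)
print(gap_sticky, gap_neutral)  # 6.0 0.0
```

Both labelings keep three lines, yet a prior tuned for contiguous spans puts the sparse pattern six points behind before any evidence is seen. A single matrix has to pick one point on that trade-off, which is the motivation the abstract gives for separate per-rubric transition dynamics.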
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes LaMR (Latent Multi-Rubric), a structured pruning framework for LLM coding agents. It decomposes code relevance into two dimensions—semantic evidence and dependency support—each modeled by a dedicated CRF with its own transition dynamics. A mixture-of-experts gating network weights the per-rubric emissions based on the query, and a final CRF produces the keep/prune decisions. Multi-rubric labels are derived via AST-based program analysis from the existing training corpus to supervise the rubrics and denoise the original binary teacher labels. On four benchmarks (SWE-Bench Verified, SWE-QA, LCC, LongCodeQA), LaMR wins 12 of 16 head-to-head multi-turn comparisons, saves up to 31% more tokens on multi-turn tasks, and improves Exact Match by up to +3.5 on single-turn tasks, often matching or outperforming unpruned full-context baselines.
Significance. If the central claims hold after addressing supervision validation, LaMR could meaningfully advance efficient context management for repository-scale coding agents by replacing monolithic relevance scoring with interpretable, dimension-specific structured models. The approach of deriving multi-rubric supervision from existing AST analysis without new annotations is a practical strength that could generalize to other structured prediction tasks in code.
major comments (2)
- [Section 3.2] Section 3.2 (Multi-Rubric Label Derivation): The central claim that the two-rubric decomposition plus denoising outperforms single-CRF baselines rests on the assumption that AST-based program analysis yields reliable per-dimension supervision. However, the manuscript provides no quantitative validation or error analysis showing that syntactic dependencies and structural spans align with query-conditioned semantic relevance; lexical matches to query keywords in non-called helpers, for example, would be invisible to the AST and could systematically misalign the separate CRF transition matrices and MoE gating. This makes it unclear whether reported token savings and Exact-Match gains are attributable to the multi-rubric architecture or to the particular denoising heuristic.
- [Section 4.2] Section 4.2 (Main Results, Table 2): The reported 12/16 head-to-head wins and up to +3.5 Exact-Match improvement are presented without ablation isolating the contribution of the dual-CRF structure and query-conditioned gating from the effect of label denoising alone. A single-CRF baseline trained on the same denoised labels would be required to establish that the architectural decomposition itself is load-bearing for the gains.
minor comments (2)
- [Figure 2] Figure 2: The diagram of the fused final CRF would benefit from explicit arrows showing how the MoE-weighted emissions are concatenated before the final transition matrix is applied.
- [Section 4.1] Section 4.1: The description of the four benchmarks could include a brief note on average context lengths and typical repository sizes to contextualize the token-saving claims.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments correctly identify areas where additional evidence would strengthen the claims regarding label quality and architectural contributions. We address each major comment below and will revise the manuscript accordingly.
- Referee: [Section 3.2] Section 3.2 (Multi-Rubric Label Derivation): The central claim that the two-rubric decomposition plus denoising outperforms single-CRF baselines rests on the assumption that AST-based program analysis yields reliable per-dimension supervision. However, the manuscript provides no quantitative validation or error analysis showing that syntactic dependencies and structural spans align with query-conditioned semantic relevance; lexical matches to query keywords in non-called helpers, for example, would be invisible to the AST and could systematically misalign the separate CRF transition matrices and MoE gating. This makes it unclear whether reported token savings and Exact-Match gains are attributable to the multi-rubric architecture or to the particular denoising heuristic.
  Authors: We agree that explicit quantitative validation of the AST-derived multi-rubric labels would provide stronger support for the supervision strategy. The current manuscript relies on the established reliability of AST analysis for capturing dependencies and structural spans, combined with the observation that LaMR frequently matches or exceeds full-context baselines, as indirect evidence that the labels are effective for denoising. However, we acknowledge the potential for misalignment in cases such as lexical matches outside called functions. In the revised manuscript we will add a dedicated error analysis subsection in Section 3.2, including a small-scale manual inspection of label alignment on sampled queries and discussion of edge cases.
  Revision: yes
- Referee: [Section 4.2] Section 4.2 (Main Results, Table 2): The reported 12/16 head-to-head wins and up to +3.5 Exact-Match improvement are presented without ablation isolating the contribution of the dual-CRF structure and query-conditioned gating from the effect of label denoising alone. A single-CRF baseline trained on the same denoised labels would be required to establish that the architectural decomposition itself is load-bearing for the gains.
  Authors: We concur that an ablation isolating the dual-CRF plus MoE gating from the denoising effect alone is necessary to attribute gains specifically to the multi-rubric architecture. The existing experiments compare against full-context and other pruners but do not report a single-CRF model trained on the identical denoised labels. We will add this baseline to the revised Table 2 (or a new ablation table) and update the discussion in Section 4.2 to quantify the incremental benefit of the structured multi-rubric decomposition.
  Revision: yes
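The manual label inspection promised in the first response could report per-rubric agreement between AST-derived labels and hand-assigned labels. The sketch below shows one plausible form of that check; the function name, the metrics chosen, and the sample labels are assumptions for illustration, not results or procedures from the paper.

```python
def label_agreement(ast_labels, manual_labels):
    """Compare AST-derived keep (1) / prune (0) labels against manual labels, line by line."""
    if len(ast_labels) != len(manual_labels):
        raise ValueError("label sequences must cover the same lines")
    pairs = list(zip(ast_labels, manual_labels))
    tp = sum(1 for a, m in pairs if a == 1 and m == 1)
    fp = sum(1 for a, m in pairs if a == 1 and m == 0)
    fn = sum(1 for a, m in pairs if a == 0 and m == 1)
    matches = sum(1 for a, m in pairs if a == m)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "agreement": matches / len(pairs) if pairs else 0.0,
    }

# Made-up labels for six lines of one sampled query.
result = label_agreement([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
print(result)
```

Low recall on the semantic rubric for queries with keyword matches in non-called helpers would be direct evidence of the misalignment the referee describes.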
Circularity Check
No significant circularity in derivation chain
The paper's core supervision derives multi-rubric labels for semantic evidence and dependency support directly from AST-based program analysis applied to the existing training corpus; this process is external and independent of the model's parameters, outputs, or fitted quantities. The LaMR architecture (separate CRFs per rubric, query-conditioned MoE gating, and final fused CRF) is then trained on these AST-derived labels to produce keep-or-prune decisions, with the original teacher binary labels denoised as a byproduct of the same AST analysis. Downstream performance claims (token savings, Exact Match gains, head-to-head wins on SWE-Bench etc.) are evaluated empirically on held-out benchmarks rather than being forced by construction from the inputs. No self-definitional reductions, fitted inputs renamed as predictions, load-bearing self-citations, or ansatz smuggling appear in the described chain; the method remains falsifiable through the reported experiments.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: AST-based program analysis produces reliable multi-rubric labels that denoise binary teacher labels without new biases.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  We propose LaMR (Latent Multi-Rubric), a structured pruning framework that decomposes code relevance into two interpretable quality dimensions, semantic evidence and dependency support, each modeled by a dedicated CRF with dimension-specific transition dynamics.
- IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking (unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Using Abstract Syntax Tree (AST) analysis, we extract dimension-specific labels from the existing masks and recover structurally necessary lines
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik R Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.
- [2] Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. OpenHands: An Open Platform for AI Soft...
- [3] Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying LLM-based Software Engineering Agents. 2024.
- [4] Yuhang Wang, Yuling Shi, Mo Yang, Rongrui Zhang, Shilin He, Heng Lian, Yuting Chen, Siyu Ye, Kai Cai, and Xiaodong Gu. Swe-pruner: Self-adaptive context pruning for coding agents, 2026.
- [5] Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172, 2023.
- [6] Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Llmlingua: Compressing prompts for accelerated inference of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
- [7] Yucheng Li, Bo Dong, Frank Guerin, and Chenghua Lin. Compressing context to enhance inference efficiency of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
- [8] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yixin Dai, Jiawei Sun, Haofen Wang, Haofen Wang, et al. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2(1):32, 2023.
- [9] Yuling Shi, Yichun Qian, Hongyu Zhang, Beijun Shen, and Xiaodong Gu. Longcodezip: Compress long context for code language models. arXiv preprint arXiv:2510.00446, 2025.
- [10] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip HS Torr. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE international conference on computer vision, pages 1529–1537, 2015.
[11]
UniXcoder: Unified Cross-Modal Pre-training for Code Representation
Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming Zhou, and Jian Yin. Unixcoder: Unified cross-modal pre-training for code representation.arXiv preprint arXiv:2203.03850, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[12]
Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models
Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, et al. Qwen3 embedding: Advancing text embedding and reranking through foundation models.arXiv preprint arXiv:2506.05176, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[13]
Swe-bench: Can language models resolve real-world github issues? In ICLR, 2024
Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. Swe-bench: Can language models resolve real-world github issues? In ICLR, 2024
work page 2024
-
[14]
SWE-QA: Can Language Models Answer Repository-level Code Questions?
Weihan Peng, Yuling Shi, Yuhang Wang, Xinyun Zhang, Beijun Shen, and Xiaodong Gu. Swe-qa: Can language models answer repository-level code questions?arXiv preprint arXiv:2509.14635, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[15]
Longcoder: A long-range pre-trained language model for code completion
Daya Guo, Canwen Xu, Nan Duan, Jian Yin, and Julian McAuley. Longcoder: A long-range pre-trained language model for code completion. InInternational Conference on Machine Learning, pages 12098–12107. PMLR, 2023
work page 2023
-
[16]
Longcodebench: Evaluating coding llms at 1m context windows.arXiv preprint arXiv:2505.07897, 2025
Stefano Rando, Luca Romani, Alessio Sampieri, Luca Franco, John Yang, Yuta Kyuragi, Fabio Galasso, and Tatsunori Hashimoto. Longcodebench: Evaluating coding llms at 1m context windows.arXiv preprint arXiv:2505.07897, 2025
-
[17]
Llmlingua-2: Data distillation for efficient and faithful task-agnostic prompt compression
Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Menglin Xia, Xufang Luo, Jue Zhang, Qingwei Lin, Victor Rühle, Yuqing Yang, Chin-Yew Lin, et al. Llmlingua-2: Data distillation for efficient and faithful task-agnostic prompt compression. InFindings of the Association for Computational Linguistics ACL 2024, pages 963–981, 2024
work page 2024
-
[18]
Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression
Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024
work page 2024
-
[19]
Learning to compress prompts with gist tokens
Jesse Mu, Xiang Li, and Noah Goodman. Learning to compress prompts with gist tokens. Advances in Neural Information Processing Systems, 36:19327–19352, 2023
work page 2023
-
[20]
Repocoder: Repository-level code completion through iterative retrieval and generation
Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian- Guang Lou, and Weizhu Chen. Repocoder: Repository-level code completion through iterative retrieval and generation. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2471–2484, 2023
work page 2023
-
[21]
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020
work page 2020
-
[22]
Diet code is healthy: Simpli- fying programs for pre-trained models of code
Zhaowei Zhang, Hongyu Zhang, Beijun Shen, and Xiaodong Gu. Diet code is healthy: Simpli- fying programs for pre-trained models of code. InProceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 1073–1084, 2022
work page 2022
-
[23]
Lei Zhang, Yunshui Li, Jiaming Li, Xiaobo Xia, Jiaxi Yang, Run Luo, Minzheng Wang, Longze Chen, Junhao Liu, and Min Yang. Hierarchical Context Pruning: Optimizing Real-World Code Completion with Repository-Level Pretrained Code LLMs, 2024
work page 2024
-
[24]
Yan Wang, Xiaoning Li, Tien N Nguyen, Shaohua Wang, Chao Ni, and Ling Ding. Natural is the best: Model-agnostic code simplification for pre-trained large language models.Proceedings of the ACM on Software Engineering, 1(FSE):586–608, 2024
work page 2024
-
[25]
RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems
Tianyang Liu, Canwen Xu, and Julian McAuley. Repobench: Benchmarking repository-level code auto-completion systems.arXiv preprint arXiv:2306.03091, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[26]
Xiaodong Gu, Hongyu Zhang, and Sunghun Kim. Deep code search. InProceedings of the 40th International Conference on Software Engineering, pages 933–944, 2018. 11
work page 2018
-
[27]
Lipeng Xie, Sen Huang, Zhuo Zhang, Anni Zou, Yunpeng Zhai, Dingchao Ren, Kezun Zhang, Haoyuan Hu, Boyin Liu, Haoran Chen, et al. Auto-rubric: Learning from implicit weights to explicit rubrics for reward modeling.arXiv preprint arXiv:2510.17314, 2025
-
[28]
Interpretable prefer- ences via multi-objective reward modeling and mixture-of-experts
Haoxiang Wang, Wei Xiong, Tengyang Xie, Han Zhao, and Tong Zhang. Interpretable prefer- ences via multi-objective reward modeling and mixture-of-experts. InFindings of the Associa- tion for Computational Linguistics: EMNLP 2024, pages 10582–10592, 2024
work page 2024
-
[29]
LLMs Get Lost In Multi-Turn Conversation
Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, and Jennifer Neville. LLMs Get Lost In Multi-Turn Conversation. 2025
work page 2025
-
[30]
Miao Lu, Weiwei Sun, Weihua Du, Zhan Ling, Xuesong Yao, Kang Liu, and Jiecao Chen. Scaling llm multi-turn rl with end-to-end summarization-based context management.arXiv preprint arXiv:2510.06727, 2025
- [31]
-
[32]
Claude code: Built for developers, 2025
Anthropic. Claude code: Built for developers, 2025
work page 2025
-
[33]
Scaling long-horizon llm agent via context-folding.arXiv preprint arXiv:2510.11967, 2025
Weiwei Sun, Miao Lu, Zhan Ling, Kang Liu, Xuesong Yao, Yiming Yang, and Jiecao Chen. Scaling long-horizon llm agent via context-folding.arXiv preprint arXiv:2510.11967, 2025
-
[34]
Tobias Lindenbauer, Igor Slinko, Ludwig Felder, Egor Bogomolov, and Yaroslav Zharov. The complexity trap: Simple observation masking is as efficient as llm summarization for agent context management.arXiv preprint arXiv:2508.21433, 2025
-
[35]
Minki Kang, Wei-Ning Chen, Dongge Han, Huseyin A Inan, Lukas Wutschitz, Yanzhi Chen, Robert Sim, and Saravan Rajmohan. Acon: Optimizing context compression for long-horizon llm agents.arXiv preprint arXiv:2510.00615, 2025
-
[36]
Agentfold: Long-horizon web agents with proactive context management
Rui Ye, Zhongwang Zhang, Kuan Li, Huifeng Yin, Zhengwei Tao, Yida Zhao, Liangcai Su, Liwen Zhang, Zile Qiao, Xinyu Wang, Pengjun Xie, Fei Huang, Siheng Chen, Jingren Zhou, and Yong Jiang. Agentfold: Long-horizon web agents with proactive context management. arXiv preprint arXiv:2510.24699, 2025
-
[37]
Guangya Wan, Mingyang Ling, Xiaoqi Ren, Rujun Han, Sheng Li, and Zizhao Zhang. Compass: Enhancing agent long-horizon reasoning with evolving context.arXiv preprint arXiv:2510.08790, 2025. 12 A Related Work Prompt and code-context compression.Token-level pruning methods such as LLMLingua [ 6, 18], Selective-Context [7], and gist-token distillation [19] com...