FeatX: Editing Software by Editing Features for Repository-Level Code Evolution

Bing Xie; Lu Zhang; Xianlin Zhao; Xutian Li; Yanzhen Zou; Yifeng Zhu

arxiv: 2606.31206 · v1 · pith:XWQUXFEQnew · submitted 2026-06-30 · 💻 cs.SE

FeatX: Editing Software by Editing Features for Repository-Level Code Evolution

Xutian Li , Yifeng Zhu , Xianlin Zhao , Yanzhen Zou , Lu Zhang , Bing Xie This is my paper

Pith reviewed 2026-07-01 04:54 UTC · model grok-4.3

classification 💻 cs.SE

keywords feature-oriented editingrepository-level evolutionLLM agent workflowhierarchical feature mappingcode patch generationuser study evaluationsoftware maintenance tool

0 comments

The pith

FeatX lets users edit repository features instead of code, using extracted hierarchies to guide an LLM agent into patches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

FeatX extracts a hierarchical epic-feature structure from an existing repository to create explicit mappings from features to code locations. A three-stage Evolution Agent then converts user changes at the feature level into concrete code patches without requiring manual prompt iteration or context selection. The approach is evaluated through a controlled user study and replay experiments on 38 real-world feature-editing commits, where it reduces cognitive load and improves usability over vanilla ChatGPT while delivering a 42.6 percent relative gain in function-level modification localization F1 at a total cost of $0.07. The central shift is from code-centric interaction to feature-centric interaction for repository-scale evolution tasks.

Core claim

By building a hierarchical epic-feature structure with explicit feature-to-code mappings and routing edits through a three-stage Evolution Agent, FeatX converts high-level feature modifications directly into repository code patches, achieving better localization accuracy and lower user effort than direct LLM prompting.

What carries the argument

Hierarchical epic-feature structure with explicit feature-to-code mappings, which supplies the three-stage Evolution Agent with the context needed to translate feature edits into patches.

If this is right

Developers perform repository-level changes without manually selecting or maintaining prompt context.
Function-level modification localization reaches 42.6 percent higher F1 than strong LLM baselines.
Total interaction cost stays at $0.07 even for multi-file feature changes.
Cognitive load and usability metrics improve measurably over vanilla ChatGPT in controlled studies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same structure-extraction plus staged-agent pattern could apply to non-code artifacts such as documentation or configuration hierarchies.
If feature mappings prove unstable across frequent refactors, the workflow would require periodic re-extraction or human verification steps.
Combining the extracted structure with version-control history could support feature-level diffing and rollback without additional tooling.

Load-bearing premise

The hierarchical epic-feature structure extracted from the repository provides sufficiently accurate and stable feature-to-code mappings for the three-stage Evolution Agent to produce correct patches without substantial manual correction.

What would settle it

Running the same 38 real commits through FeatX after replacing the feature extraction step with a method that produces systematically inaccurate mappings, then measuring whether the F1 improvement and usability gains disappear.

Figures

Figures reproduced from arXiv: 2606.31206 by Bing Xie, Lu Zhang, Xianlin Zhao, Xutian Li, Yanzhen Zou, Yifeng Zhu.

**Figure 1.** Figure 1: The overall design of FeatX. We evaluate FeatX along two dimensions: cognitive load and function-level modification localization accuracy. Across a user study and quantitative experiments on real-world feature-editing tasks, FeatX shows that: (1) compared with vanilla ChatGPT, it significantly reduces NASA-TLX cognitive load (12.5 → 7.4, 41% reduction) and improves usability (73 → 84, 15% increase); and (2… view at source ↗

**Figure 2.** Figure 2: Usage workflow of FeatX. delta with the expanded code context. Based on this joint signal, the agent localizes the change intent to a small set of concrete code regions. It then generates a fine-grained modification plan for each localized region, including required edits, dependencies, and execution order. The reasoning process and planning results are displayed in the Agent Panel. 2.2.3 Concrete Code Mod… view at source ↗

**Figure 3.** Figure 3: NASA-TLX cognitive load ratings [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Individual SUS scores comparison. 4.1.2 Usability Improvement. SUS consists of 10 items assessing overall system usability [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

read the original abstract

Large language models (LLMs) are increasingly used for software evolution, yet most interaction paradigms remain code-centric and require manual context management and prompt iteration. We present FeatX, a feature-oriented tool for editing software by editing features. Given an existing repository, FeatX extracts a hierarchical epic-feature structure with explicit feature-to-code mappings, then invokes a three-stage Evolution Agent to translate feature edits into code patches. The workflow is exposed through four coordinated panels. Across a controlled user study and replay experiments on 38 real-world feature-editing commits, FeatX significantly reduces cognitive load and improves usability compared with vanilla ChatGPT. It also achieves a 42.6\% relative improvement in function-level modification localization F1 over strong LLM baselines, at substantially lower cost (\$0.07 in total). The tool and collected dataset are available at https://github.com/a496263365/FeatX/tree/demo, with a demonstration video at https://youtu.be/OZqKZ4Ii-yM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FeatX introduces a feature-hierarchy workflow for LLM repo edits with reported localization and usability gains, but lacks direct measurement of mapping accuracy.

read the letter

The core idea is shifting from code-centric LLM prompts to editing at the level of extracted features. FeatX pulls a hierarchical epic-feature structure from a repo, builds explicit mappings to functions, and runs a three-stage agent to convert feature changes into patches, all surfaced in four UI panels.

What stands out is the concrete release: the tool, a dataset of 38 real feature-editing commits, and a user study showing lower cognitive load and better usability than plain ChatGPT, plus a 42.6% relative F1 lift on function-level localization at $0.07 total cost. Releasing the code and data is useful for follow-up work in this corner of SE tooling.

The main gap is the missing check on the mappings themselves. The central claim depends on those feature-to-code links being accurate enough for the agent to succeed, yet the paper gives no precision or recall numbers for how well the extracted mappings match the actual changed locations in the 38 commits. Without that, it's hard to tell whether the gains come from the feature abstraction or just from the panels and prompting. Methods details on the extraction and agent stages also look thin in the abstract.

This is for researchers building LLM assistants for repository evolution. It has enough of a new technique and some empirical signal to warrant peer review, though the mapping validation would need to be added or clarified.

Referee Report

1 major / 1 minor

Summary. The manuscript presents FeatX, a system for repository-level code evolution that extracts a hierarchical epic-feature structure with explicit feature-to-code mappings from a given repository. It then employs a three-stage Evolution Agent to convert feature-level edits into code patches, presented via four coordinated UI panels. The approach is evaluated in a controlled user study and through replay experiments on 38 real-world feature-editing commits, reporting significant reductions in cognitive load and usability improvements over vanilla ChatGPT, a 42.6% relative improvement in function-level modification localization F1 score over strong LLM baselines, and a total cost of $0.07. The tool and dataset are made available.

Significance. If the empirical claims hold, FeatX represents a meaningful shift from code-centric to feature-centric interaction paradigms for LLM-assisted software maintenance, potentially reducing developer cognitive load in large repositories. The open-sourcing of the tool, dataset, and demonstration video is a clear strength supporting reproducibility.

major comments (1)

[Replay experiments on 38 commits] The central claim that the extracted hierarchical epic-feature structure enables the three-stage Evolution Agent to produce correct patches without substantial manual correction depends on the accuracy of the feature-to-code mappings. The manuscript reports results on 38 real commits and a user study but supplies no precision, recall, or other quantitative metrics measuring how well the automatically extracted mappings align with the locations actually modified in the ground-truth commits (see abstract and replay-experiments description).

minor comments (1)

[Abstract] The abstract asserts quantitative gains and a user study without any accompanying detail on methods, statistical tests, baseline construction, or error analysis.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of FeatX's significance and reproducibility. We address the single major comment point-by-point below.

read point-by-point responses

Referee: [Replay experiments on 38 commits] The central claim that the extracted hierarchical epic-feature structure enables the three-stage Evolution Agent to produce correct patches without substantial manual correction depends on the accuracy of the feature-to-code mappings. The manuscript reports results on 38 real commits and a user study but supplies no precision, recall, or other quantitative metrics measuring how well the automatically extracted mappings align with the locations actually modified in the ground-truth commits (see abstract and replay-experiments description).

Authors: We agree that direct quantitative metrics on mapping accuracy would strengthen validation of the central claim. The replay experiments measure end-to-end patch correctness (via the reported F1 gains and commit reproduction), which depends on the mappings, but we did not include explicit precision/recall comparing extracted mappings to ground-truth modified locations in the 38 commits. In the revised manuscript we will add this analysis. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical claims rest on external user study and commit replays

full rationale

The paper presents an empirical tool evaluation with no equations, fitted parameters, or derivation chain. Results derive from a controlled user study and replay of 38 real-world commits, with no self-definitional mappings, predictions that reduce to inputs by construction, or load-bearing self-citations. The feature-to-code extraction step is an implementation detail whose accuracy is externally testable against the ground-truth commits; it does not create a closed loop within the reported claims. This is the normal case of a self-contained empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied systems paper; no mathematical axioms, free parameters, or invented physical entities are invoked.

pith-pipeline@v0.9.1-grok · 5716 in / 1089 out tokens · 28646 ms · 2026-07-01T04:54:27.515979+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

14 extracted references · 8 canonical work pages · 1 internal anchor

[1]

John Brooke et al. 1996. SUS-A quick and dirty usability scale.Usability evaluation in industry189, 194 (1996), 4–7

1996
[2]

2004.User Stories Applied: For Agile Software Development

Mike Cohn. 2004.User Stories Applied: For Agile Software Development. Addison Wesley Longman Publishing Co., Inc., USA

2004
[3]

R. L. Glass. 2001. Frequently forgotten fundamental facts about software engi- neering.IEEE Software18, 3 (2001), 112–111. doi:10.1109/MS.2001.922739

work page doi:10.1109/ms.2001.922739 2001
[4]

Hart and Lowell E

Sandra G. Hart and Lowell E. Staveland. 1988. Development of NASA-TLX (Task Load Index): Results of Empirical and Theoretical Research. InHuman Mental Workload, Peter A. Hancock and Najmedin Meshkati (Eds.). Advances in Psychology, Vol. 52. North-Holland, 139–183. doi:10.1016/S0166-4115(08)62386-9

work page doi:10.1016/s0166-4115(08)62386-9 1988
[5]

JavaParser Team. 2023. JavaParser. https://javaparser.org/ Accessed: 2025

2023
[6]

Junwei Liu, Chen Xu, Chong Wang, et al. 2026. EvoDev: An Iterative Feature- Driven Framework for End-to-End Software Development with LLM-based Agents. arXiv:2511.02399 [cs.SE]

work page internal anchor Pith review Pith/arXiv arXiv 2026
[7]

George A. Miller. 1956. The magical number seven, plus or minus two: Some limits on our capacity for processing information.Psychological Review63, 2 (1956), 81–97

1956
[8]

Siru Ouyang, Wenhao Yu, Kaixin Ma, Zilin Xiao, Zhihan Zhang, Mengzhao Jia, Jiawei Han, Hongming Zhang, and Dong Yu. 2025. RepoGraph: Enhancing AI Software Engineering with Repository-level Code Graph. InThe Thirteenth International Conference on Learning Representations. https://openreview.net/ forum?id=dw9VUsSHGB

2025
[9]

Kensen Shi, Deniz Altınbüken, Saswat Anand, Mihai Christodorescu, Katja Grün- wedel, Alexa Koenings, Sai Naidu, Anurag Pathak, Marc Rasi, Fredde Ribeiro, Brandon Ruffin, Siddhant Sanyam, Maxim Tabachnyk, Sara Toth, Roy Tu, Tobias Welp, Pengcheng Yin, Manzil Zaheer, Satish Chandra, and Charles Sutton. 2025. Natural Language Outlines for Code: Literate Prog...

work page doi:10.1145/3696630.3728541 2025
[10]

Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. 2025. De- mystifying LLM-Based Software Engineering Agents.Proc. ACM Softw. Eng.2, FSE (June 2025). doi:10.1145/3715754

work page doi:10.1145/3715754 2025
[11]

Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2025. SWE-agent: agent-computer interfaces enable automated software engineering. InProceedings of the 38th International Conference on Neural Information Processing Systems (NIPS ’24). Curran Associates Inc., Red Hook, NY, USA

2025
[12]

Ryan Yen, Jian Zhao, and Daniel Vogel. 2025. Code Shaping: Iterative Code Editing with Free-form AI-Interpreted Sketching. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems (CHI ’25). Association for Computing Machinery, New York, NY, USA. doi:10.1145/3706598.3713822

work page doi:10.1145/3706598.3713822 2025
[13]

J. D. Zamfirescu-Pereira, Eunice Jun, Michael Terry, Qian Yang, and Bjoern Hartmann. 2025. Beyond Code Generation: LLM-supported Exploration of the Program Design Space. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems (CHI ’25). Association for Computing Machinery, New York, NY, USA. doi:10.1145/3706598.3714154

work page doi:10.1145/3706598.3714154 2025
[14]

Yifeng Zhu, Xianlin Zhao, Xutian Li, Yanzhen Zou, Haizhuo Yuan, Yue Wang, and Bing Xie. 2025. RepoSummary: Feature-Oriented Summarization and Documentation Generation for Code Repositories. arXiv:2510.11039 [cs.SE] https://arxiv.org/abs/2510.11039

work page arXiv 2025

[1] [1]

John Brooke et al. 1996. SUS-A quick and dirty usability scale.Usability evaluation in industry189, 194 (1996), 4–7

1996

[2] [2]

2004.User Stories Applied: For Agile Software Development

Mike Cohn. 2004.User Stories Applied: For Agile Software Development. Addison Wesley Longman Publishing Co., Inc., USA

2004

[3] [3]

R. L. Glass. 2001. Frequently forgotten fundamental facts about software engi- neering.IEEE Software18, 3 (2001), 112–111. doi:10.1109/MS.2001.922739

work page doi:10.1109/ms.2001.922739 2001

[4] [4]

Hart and Lowell E

Sandra G. Hart and Lowell E. Staveland. 1988. Development of NASA-TLX (Task Load Index): Results of Empirical and Theoretical Research. InHuman Mental Workload, Peter A. Hancock and Najmedin Meshkati (Eds.). Advances in Psychology, Vol. 52. North-Holland, 139–183. doi:10.1016/S0166-4115(08)62386-9

work page doi:10.1016/s0166-4115(08)62386-9 1988

[5] [5]

JavaParser Team. 2023. JavaParser. https://javaparser.org/ Accessed: 2025

2023

[6] [6]

Junwei Liu, Chen Xu, Chong Wang, et al. 2026. EvoDev: An Iterative Feature- Driven Framework for End-to-End Software Development with LLM-based Agents. arXiv:2511.02399 [cs.SE]

work page internal anchor Pith review Pith/arXiv arXiv 2026

[7] [7]

George A. Miller. 1956. The magical number seven, plus or minus two: Some limits on our capacity for processing information.Psychological Review63, 2 (1956), 81–97

1956

[8] [8]

Siru Ouyang, Wenhao Yu, Kaixin Ma, Zilin Xiao, Zhihan Zhang, Mengzhao Jia, Jiawei Han, Hongming Zhang, and Dong Yu. 2025. RepoGraph: Enhancing AI Software Engineering with Repository-level Code Graph. InThe Thirteenth International Conference on Learning Representations. https://openreview.net/ forum?id=dw9VUsSHGB

2025

[9] [9]

Kensen Shi, Deniz Altınbüken, Saswat Anand, Mihai Christodorescu, Katja Grün- wedel, Alexa Koenings, Sai Naidu, Anurag Pathak, Marc Rasi, Fredde Ribeiro, Brandon Ruffin, Siddhant Sanyam, Maxim Tabachnyk, Sara Toth, Roy Tu, Tobias Welp, Pengcheng Yin, Manzil Zaheer, Satish Chandra, and Charles Sutton. 2025. Natural Language Outlines for Code: Literate Prog...

work page doi:10.1145/3696630.3728541 2025

[10] [10]

Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. 2025. De- mystifying LLM-Based Software Engineering Agents.Proc. ACM Softw. Eng.2, FSE (June 2025). doi:10.1145/3715754

work page doi:10.1145/3715754 2025

[11] [11]

Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2025. SWE-agent: agent-computer interfaces enable automated software engineering. InProceedings of the 38th International Conference on Neural Information Processing Systems (NIPS ’24). Curran Associates Inc., Red Hook, NY, USA

2025

[12] [12]

Ryan Yen, Jian Zhao, and Daniel Vogel. 2025. Code Shaping: Iterative Code Editing with Free-form AI-Interpreted Sketching. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems (CHI ’25). Association for Computing Machinery, New York, NY, USA. doi:10.1145/3706598.3713822

work page doi:10.1145/3706598.3713822 2025

[13] [13]

J. D. Zamfirescu-Pereira, Eunice Jun, Michael Terry, Qian Yang, and Bjoern Hartmann. 2025. Beyond Code Generation: LLM-supported Exploration of the Program Design Space. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems (CHI ’25). Association for Computing Machinery, New York, NY, USA. doi:10.1145/3706598.3714154

work page doi:10.1145/3706598.3714154 2025

[14] [14]

Yifeng Zhu, Xianlin Zhao, Xutian Li, Yanzhen Zou, Haizhuo Yuan, Yue Wang, and Bing Xie. 2025. RepoSummary: Feature-Oriented Summarization and Documentation Generation for Code Repositories. arXiv:2510.11039 [cs.SE] https://arxiv.org/abs/2510.11039

work page arXiv 2025