Trust-Calibrated Code Review: A Participatory Design Study of Review Workflows for LLM-Generated Multi-File Changes
Pith reviewed 2026-06-28 13:44 UTC · model grok-4.3
The pith
Reviewing LLM-generated multi-file code changes is a trust-calibration problem rather than a diffing problem.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that reviewing LLM-generated multi-file changes centers on trust calibration. It proposes a three-level review workflow consisting of overview, file-analysis, and code snippet review, underpinned by seven design constructs: chunk, risk-per-line, risk-per-file, judge, walk-through, zooming in/out, and security cage. These provide a framework for tools that surface risk and confidence signals at the granularity at which developers allocate attention. Survey responses indicated that the workflow levels received above-neutral scores and that many participants anticipated reduced overall review effort and trust-assessment effort.
What carries the argument
The three-level review workflow (overview, file-analysis, code snippet review) supported by seven design constructs that surface risk and confidence signals at the granularity at which developers allocate attention.
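To make the three levels concrete, the following is a minimal sketch of how a tool might organize risk signals so that a reviewer can zoom from overview to file to snippet. It is not the paper's prototype: the class and function names, the risk scores in [0, 1], the 0.5 threshold, and the choice to aggregate risk by taking the maximum are all assumptions made for illustration. The source does not say how risk is computed or aggregated.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Chunk:
    """A contiguous group of changed lines reviewed as one unit."""
    start_line: int
    line_risks: list[float]  # hypothetical risk-per-line scores in [0, 1]

    def risk(self) -> float:
        # Assumption: a chunk is as risky as its riskiest line.
        return max(self.line_risks, default=0.0)


@dataclass
class FileChange:
    """All chunks that the generated change touches in one file."""
    path: str
    chunks: list[Chunk] = field(default_factory=list)

    def risk(self) -> float:
        # Assumption: risk-per-file is the maximum chunk risk.
        return max((c.risk() for c in self.chunks), default=0.0)


def overview(files: list[FileChange]) -> list[FileChange]:
    """Level 1: files ordered from highest to lowest risk."""
    return sorted(files, key=lambda f: f.risk(), reverse=True)


def file_analysis(file: FileChange, threshold: float = 0.5) -> list[Chunk]:
    """Level 2: chunks of one file whose risk meets the threshold."""
    return [c for c in file.chunks if c.risk() >= threshold]


def snippet_review(chunk: Chunk, threshold: float = 0.5) -> list[tuple[int, float]]:
    """Level 3: (line number, risk) pairs that deserve a close read."""
    return [
        (chunk.start_line + offset, risk)
        for offset, risk in enumerate(chunk.line_risks)
        if risk >= threshold
    ]
```

Taking the maximum keeps a single high-risk line visible at every zoom level. A mean or weighted sum would be an equally plausible choice, with the trade-off that one dangerous line in a large file could be averaged away.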
If this is right
- Tool designers gain a conceptual framework for building AI-ready code review tools.
- The workflow levels can surface risk and confidence signals matched to how developers direct their attention.
- A majority of surveyed practitioners expected the approach to reduce overall review effort.
- Over half expected a reduction in the effort needed to assess trust in the generated changes.
Where Pith is reading between the lines
- The same emphasis on trust calibration could extend to reviewing LLM output in domains other than code, such as documentation or test cases.
- Integrating the constructs into existing IDEs might change the default review process from line-by-line diff inspection to risk-focused navigation.
- Future evaluations could test whether the constructs improve detection of specific failure modes like security issues or incomplete multi-file consistency.
- The constructs might serve as a basis for new evaluation metrics that score LLM code generators by how much review effort they demand.
Load-bearing premise
The workflow and design constructs developed from sessions with a small group of practitioners will produce lower review effort and better trust calibration when built into tools used by a wider population of developers.
What would settle it
A controlled study that measures actual review time, error detection rates, and trust accuracy when developers use tools built on the three-level workflow versus standard diff tools would falsify the claim if the new tools show no improvement or increased effort.
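The measurements named above can be sketched as simple per-condition summaries. The example below is one possible operationalization, not a procedure from the paper: the `ReviewOutcome` record, the field names, and the definition of trust accuracy as agreement between the reviewer's accept/reject decision and ground-truth correctness are all assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewOutcome:
    accepted: bool   # reviewer approved the generated change
    correct: bool    # change was actually defect-free (ground truth)
    seconds: float   # time the reviewer spent on the change


def trust_metrics(outcomes: list[ReviewOutcome]) -> dict[str, float | None]:
    """Summarize one study condition (for example, new tool or standard diff)."""
    n = len(outcomes)
    if n == 0:
        raise ValueError("at least one review outcome is required")

    defective = [o for o in outcomes if not o.correct]
    return {
        # Decision matched reality: accepted a correct change or rejected a defective one.
        "trust_accuracy": sum(o.accepted == o.correct for o in outcomes) / n,
        # Over-trust: accepted a defective change.
        "over_trust_rate": sum(o.accepted and not o.correct for o in outcomes) / n,
        # Under-trust: rejected a correct change.
        "under_trust_rate": sum(not o.accepted and o.correct for o in outcomes) / n,
        # Share of defective changes that the reviewer rejected; None if there were none.
        "defect_detection_rate": (
            sum(not o.accepted for o in defective) / len(defective) if defective else None
        ),
        "mean_seconds": sum(o.seconds for o in outcomes) / n,
    }
```

Computing these summaries separately for each condition and comparing them would show whether a tool built on the workflow changes review time, detection, or calibration. Reporting over-trust and under-trust separately matters because a tool could raise overall accuracy while still shifting reviewers toward accepting defective changes.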
Original abstract
Background: Developers increasingly review multi-file code changes generated by LLM-based agents, yet no validated end-to-end workflow or IDE tooling design exists for this scenario. Aims: We investigate (RQ1) the challenges developers face when reviewing LLM-generated multi-file changes and (RQ2) how developers envision effective workflows for this task. Method: In collaboration with JetBrains, we conducted a participatory design study structured using the double-diamond design process with Discover, Define, Develop, and Deliver phases. Industry practitioners participated in the Discover phase (N=17); seven of these returned for the Develop phase. The Define phase was an author-led synthesis. The Deliver phase produced a conceptual design and a high-fidelity semi-interactive prototype evaluated through a follow-up survey with N=43 practitioners. Results: Participants identified trust-calibration as the central challenge. The study yielded a three-level review workflow (overview, file-analysis, code snippet review) supported by seven design constructs (chunk, risk-per-line, risk-per-file, judge, walk-through, zooming in/out, and security cage). In the validation survey, all three workflow levels scored above the neutral midpoint (means 3.50–3.91 on a five-point scale). Of the respondents, 63% expected reduced overall review effort, and 52% reduced trust-assessment effort, relative to their current tools. These findings suggest that the design constructs indicate a positive direction for future tool development. Conclusions: Reviewing LLM-generated multi-file changes is a trust-calibration problem rather than a diffing problem. The three-level workflow and the seven constructs we report give tool designers a conceptual framework for building AI-ready code review tools that surface risk and confidence signals at the granularity at which developers allocate attention.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports a participatory design study (double-diamond process) with 17 industry practitioners in the Discover phase (7 returning for Develop), author-led synthesis in Define, and a validation survey with 43 practitioners. It identifies trust-calibration as the central challenge in reviewing LLM-generated multi-file changes, derives a three-level workflow (overview, file-analysis, code snippet review) and seven design constructs (chunk, risk-per-line, risk-per-file, judge, walk-through, zooming in/out, security cage), and reports survey means of 3.50–3.91 with 63% and 52% expecting reduced effort, concluding that the constructs supply a conceptual framework for AI-ready code review tools.
Significance. If the framework translates to implemented tools, the work supplies a practitioner-grounded conceptual model that reframes code review tooling around risk and confidence signals at the granularity of developer attention rather than traditional diffs. The participatory method and positive directional survey responses provide initial ecological grounding for future tool development in this emerging area.
major comments (2)
- [Methods (Define phase)] Methods (Discover/Define phases): The manuscript provides no description of the qualitative synthesis process (e.g., thematic coding, affinity diagramming, or author consensus procedure) used to extract the seven specific design constructs from the N=17 session transcripts. This derivation step is load-bearing for the central claim that the constructs form a usable framework.
- [Results (survey)] Results (validation survey): No details are given on survey instrument construction, sampling/recruitment method, response rate, statistical procedures for the reported means (3.50–3.91), or handling of potential selection bias from the JetBrains collaboration. These omissions limit verification that the survey supports the positive-direction conclusion.
minor comments (2)
- [Abstract] Abstract: The phrasing 'no validated end-to-end workflow' is slightly inconsistent with the paper's own contribution of a conceptual (not yet implemented or field-tested) workflow.
- [Conclusions] The paper could more explicitly state the scoped nature of the claims (directional indication rather than demonstrated effectiveness) in the Conclusions to align with the study design.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and note the revisions we will make to improve methodological transparency.
- Referee: [Methods (Define phase)] Methods (Discover/Define phases): The manuscript provides no description of the qualitative synthesis process (e.g., thematic coding, affinity diagramming, or author consensus procedure) used to extract the seven specific design constructs from the N=17 session transcripts. This derivation step is load-bearing for the central claim that the constructs form a usable framework.
Authors: We agree that the current description of the Define phase is too brief and does not adequately document how the seven design constructs were derived. In the revised manuscript we will expand the Methods section with a dedicated subsection on the author-led synthesis. This will detail the process of reviewing session transcripts, identifying recurring themes related to trust calibration and review granularity, the use of affinity diagramming to group observations, and the iterative author consensus meetings that produced the final set of constructs and the three-level workflow. revision: yes
- Referee: [Results (survey)] Results (validation survey): No details are given on survey instrument construction, sampling/recruitment method, response rate, statistical procedures for the reported means (3.50–3.91), or handling of potential selection bias from the JetBrains collaboration. These omissions limit verification that the survey supports the positive-direction conclusion.
Authors: We acknowledge that the validation survey section lacks the requested methodological details. The revised manuscript will add a dedicated subsection describing: (1) how the survey items were constructed directly from the three workflow levels and seven constructs identified in the participatory design sessions; (2) the recruitment channels, including professional networks and the JetBrains collaboration; (3) the number of invitations sent and the resulting response rate; (4) the descriptive statistical procedures used to compute the reported means; and (5) steps taken to mitigate and disclose potential selection bias. These additions will allow readers to evaluate the survey's role as directional validation more rigorously. revision: yes
Circularity Check
No significant circularity detected
This is a qualitative participatory design study whose central claims (trust-calibration as the core challenge, plus the three-level workflow and seven constructs) are derived directly from session transcripts with 17 practitioners and validated by independent survey responses from 43 others. No equations, fitted parameters, self-citations, or uniqueness theorems appear in the derivation chain; the results rest on external participant input rather than any reduction to the authors' prior outputs or internal definitions. The scoped claim that the constructs 'indicate a positive direction' does not assert empirical effectiveness in deployed tools, avoiding any self-referential leap.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Participatory design with industry practitioners produces valid insights for tool design that generalize beyond the sample.