"I didn't Make the Micro Decisions": Measuring, Inducing, and Exposing Goal-Level AI Contributions in Collaboration
Pith reviewed 2026-05-21 04:40 UTC · model grok-4.3
The pith
A framework called CoTrace traces AI contributions to goal formation in human collaborations and finds models shape only 11-26 percent of high-level goals.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CoTrace decomposes explicit goals into verifiable requirements and traces both direct contributions and indirect influences across dialogue turns. When applied to 638 real-world collaboration logs, models account for 11-26% of goal-shaping contribution yet contribute substantially more to introducing lower-level concrete requirements and various indirect influences. Interaction design choices affect model goal-shaping behavior in controlled simulations, and exposing users to the resulting analyses shifts their perceived contributions by nearly 2 points on a 5-point scale.
What carries the argument
CoTrace, the goal-level attribution framework that decomposes explicit goals into verifiable requirements and traces direct and indirect contributions across dialogue turns.
If this is right
- Interaction design choices can be tuned to increase or decrease the extent of model goal-shaping.
- Users hold systematically miscalibrated views of their own contributions in AI-assisted work.
- Showing users goal-level attribution data shifts their perceived contributions by nearly two points on a five-point scale, which may help correct those miscalibrations.
Where Pith is reading between the lines
- Similar tracing methods could help users in other domains calibrate how much they rely on AI during planning or creative tasks.
- The approach might be adapted to evaluate contribution patterns in group settings without AI, such as team project logs.
- Designers could embed lightweight versions of the decomposition step into real-time collaboration interfaces to surface contributions as work unfolds.
Load-bearing premise
Decomposing explicit goals into verifiable requirements captures contributions completely and without systematic bias or omission of subtle influences.
What would settle it
Applying the same decomposition and tracing process to a fresh set of collaboration logs and obtaining goal-shaping percentages outside the 11-26% range, or finding no shift in user perceptions after exposure to the analysis, would challenge the central results.
Original abstract
As large language models (LLMs) increasingly shape how users form, refine, and extend their goals, attributing contributions in human-AI collaboration becomes critical for users calibrating their own reliance and for evaluators assessing AI-assisted work. Yet existing methods focus on final artifacts, missing the process through which goals themselves are jointly shaped. We introduce a goal-level attribution framework, CoTrace, that decomposes explicit goals into verifiable requirements and traces both direct contributions and indirect influences across dialogue turns. Applying CoTrace to 638 real-world collaboration logs, we find that while models account for only 11-26% of goal-shaping contribution, they contribute substantially more on introducing lower-level concrete requirements, and make various kinds of indirect contributions. Through controlled simulations, we show that interaction design choices significantly affect model goal-shaping behavior. In a user study, exposing participants to goal-level analyses shifts their perceived contributions by nearly 2 points on a 5-point scale, revealing systematic miscalibration in how users understand their own AI-assisted work.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces CoTrace, a goal-level attribution framework that decomposes explicit goals into verifiable requirements and traces both direct contributions and indirect influences across dialogue turns in human-AI collaboration. Applied to 638 real-world logs, it reports that models account for 11-26% of goal-shaping contribution while contributing more to lower-level concrete requirements and various indirect effects. Controlled simulations demonstrate that interaction design choices affect model goal-shaping behavior. A user study finds that exposing participants to goal-level analyses shifts their perceived contributions by nearly 2 points on a 5-point scale, indicating miscalibration in users' understanding of AI-assisted work.
Significance. If the CoTrace attribution method proves reliable, the work could meaningfully advance understanding of how LLMs shape goals in collaboration, with implications for user calibration of reliance and for designing better human-AI interfaces. The use of real collaboration logs, simulations testing design choices, and a user study on perception shifts provides a multi-method approach that strengthens potential impact in HCI and AI ethics. The focus on process-level tracing rather than final artifacts is a clear strength.
major comments (3)
- [Methods (CoTrace application)] Methods section on CoTrace: The central quantitative claims (11-26% goal-shaping contribution and contrasts with lower-level requirements) rest on manual decomposition of goals into verifiable requirements followed by tracing across turns, yet no inter-annotator agreement statistics, validation against an external gold standard, or sensitivity analysis to alternative decompositions are reported. This directly affects reliability of the attribution percentages.
- [Empirical Analysis / Results] Results on 638 logs: The contribution figures and data exclusion rules are presented without confidence intervals, details on coder training, or robustness checks, making it impossible to rule out post-hoc choices or systematic coder bias that could shift the headline 11-26% range.
- [User Study] User study: The nearly 2-point shift on the 5-point scale is reported without sample size, exact statistical test, p-value, or pre-registration details, which is load-bearing for the claim of systematic miscalibration in perceived contributions.
minor comments (3)
- [Abstract] Abstract: Replace the vague 'nearly 2 points' with the precise mean difference and scale anchors for clarity.
- [Figures] Figure captions: Ensure diagrams distinguishing direct vs. indirect contributions include explicit legends and examples from the logs.
- [Related Work] Related work: Add citations to prior work on contribution attribution in collaborative writing or goal-setting systems to better situate CoTrace.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which identifies key areas where additional reporting and validation will improve the transparency and robustness of our claims. We address each major comment below and commit to revisions that directly respond to the concerns raised.
- Referee: [Methods (CoTrace application)] Methods section on CoTrace: The central quantitative claims (11-26% goal-shaping contribution and contrasts with lower-level requirements) rest on manual decomposition of goals into verifiable requirements followed by tracing across turns, yet no inter-annotator agreement statistics, validation against an external gold standard, or sensitivity analysis to alternative decompositions are reported. This directly affects reliability of the attribution percentages.
  Authors: We agree that these details are important for establishing reliability. In the revised manuscript we will add inter-annotator agreement statistics (Cohen's kappa) for both the goal decomposition and tracing steps, along with a sensitivity analysis that applies alternative decompositions and reports the resulting range of contribution percentages. We will also explicitly discuss the absence of an external gold standard, noting the interpretive character of goal decomposition, and describe our annotation protocol and training procedures in greater detail.
  Revision: yes
- Referee: [Empirical Analysis / Results] Results on 638 logs: The contribution figures and data exclusion rules are presented without confidence intervals, details on coder training, or robustness checks, making it impossible to rule out post-hoc choices or systematic coder bias that could shift the headline 11-26% range.
  Authors: We will revise the Results section to include 95% confidence intervals for all reported contribution percentages. We will also add a dedicated subsection describing coder training, qualification criteria, and the exclusion rules applied to the 638 logs. Finally, we will present robustness checks that vary the exclusion criteria and re-compute the 11-26% range to demonstrate that the headline findings are not sensitive to these choices.
  Revision: yes
- Referee: [User Study] User study: The nearly 2-point shift on the 5-point scale is reported without sample size, exact statistical test, p-value, or pre-registration details, which is load-bearing for the claim of systematic miscalibration in perceived contributions.
  Authors: We will expand the User Study section to report the exact sample size, the statistical test used (paired t-test), the associated p-value, and the effect size. We will also clarify the pre-registration status: the analysis plan was fixed prior to data collection even though formal pre-registration was not completed; we will note this limitation transparently while providing the planned analysis details that support the reported perception shift.
  Revision: yes
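Two of the statistics named in the rebuttal, Cohen's kappa and the paired t-test, can be sketched in a few lines. The following is a minimal illustration using only the Python standard library; the function names and any data passed to them are hypothetical, and nothing here reproduces the authors' actual analysis code or data.

```python
from collections import Counter
from math import sqrt
from statistics import mean, stdev


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_observed - p_chance) / (1.0 - p_chance)


def paired_t_statistic(before, after):
    """Return (t, degrees of freedom) for a paired t-test on two ratings lists."""
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")
    diffs = [y - x for x, y in zip(before, after)]
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / sqrt(len(diffs))), len(diffs) - 1
```

With invented toy labels, `cohens_kappa(["user", "user", "model", "model"], ["user", "user", "model", "user"])` evaluates to 0.5: observed agreement is 0.75 and chance agreement is 0.5. The sketch stops at the t statistic; turning it into a p-value needs a t-distribution CDF, for which a library routine such as `scipy.stats.ttest_rel` would be the usual choice.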
Circularity Check
No circularity: empirical attribution via CoTrace is independent of its inputs
The paper introduces CoTrace as a manual decomposition-and-tracing procedure applied to 638 external collaboration logs. The reported 11-26% goal-shaping figures are direct outputs of that tracing process rather than quantities defined in terms of themselves, fitted parameters renamed as predictions, or results justified solely by self-citation. No equations, ansatzes, or uniqueness theorems appear in the provided text that would collapse the central claims back to the framework's own definitions by construction. The derivation chain therefore remains self-contained against the logs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Explicit goals can be decomposed into verifiable requirements that allow reliable tracing of direct and indirect contributions.
invented entities (1)
- CoTrace framework: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: We introduce a goal-level attribution framework, CoTrace, that decomposes explicit goals into verifiable requirements and traces both direct contributions and indirect influences across dialogue turns.
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: role-level contribution of speaker p to requirement r through role ρ is M(p,ρ,r) = ∑_{a∈A_p} 1[role(a)=ρ] I(a→r)
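The role-level contribution formula quoted above, M(p,ρ,r) = ∑_{a∈A_p} 1[role(a)=ρ] I(a→r), can be read as a filtered sum. The sketch below is one possible reading, not the paper's implementation: it assumes A_p is the set of actions taken by speaker p, that each action carries a single role label, and that I(a→r) is a numeric influence weight. The `Action` schema, the role names, and the example values are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    """One traced dialogue action (hypothetical schema, not the paper's)."""
    speaker: str   # p, e.g. "user" or "model"
    role: str      # role(a); the labels used below are invented
    influence: dict = field(default_factory=dict)  # requirement id -> I(a -> r)


def role_contribution(actions, speaker, role, requirement):
    """M(p, rho, r): sum I(a -> r) over speaker p's actions whose role is rho."""
    return sum(
        a.influence.get(requirement, 0.0)  # I(a -> r), 0 if a never touches r
        for a in actions
        if a.speaker == speaker and a.role == role  # a in A_p and role(a) = rho
    )


# Invented example log: two requirements, two speakers, two roles.
example_actions = [
    Action("user", "introduce", {"r1": 1.0}),
    Action("model", "refine", {"r1": 0.5, "r2": 1.0}),
    Action("model", "introduce", {"r2": 1.0}),
    Action("model", "refine", {"r1": 0.25}),
]
```

On this invented log, `role_contribution(example_actions, "model", "refine", "r1")` sums the two refine actions that touch `r1` and gives 0.75. How such per-role values are normalised into the reported goal-shaping percentages is not stated in the passage, so the sketch leaves that step out.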
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] NECESSITY: framed as mandatory (must/need/required/cannot, numeric constraints, explicit include/exclude)
- [2] GROUNDING: directly stated, no inference
- [3] REPLACEABILITY: cannot be swapped without violating success
- [4] Try to / ideally / could / maybe
  BINARY TESTABILITY: reviewer can judge pass/fail
  Operations:
  - **create**: New requirement. Check for revise BEFORE create.
  - **revise**: Modify existing requirement (contradicts/tightens/relaxes/replaces). Use EXACT existing req id in related to (within this outcome).
  - **delete**: Explicitly cancels a previously binding condition.
  Edge cases:
  - Advic...
- [5] Participants complete the human–LLM collaboration task on poe.com using accounts we provide, given the same travel planning task. These logs are later used as the basis for the tool-use session.
- [6] Participants complete the human–human collaboration on Slack, working on the same travel planning task with their assigned partner, allowing within-pair comparison under a shared communication setting. For this setting, we provide a planning template through Slack’s shared tab feature to reduce participants’ writing burden.
- [7] Participants use the analysis tool only with the human–LLM collaboration logs and complete a post-task survey. During this session, participants inspect the previously collected human–LLM logs through the tool and report their experience, perceptions of the tool, and reflections on the collaboration. With 5 pairs, this design yields 10 tool-use survey res...
- [8] We conduct a brief interview after the tool-use session to gather qualitative feedback on how participants interpret the tool outputs, what aspects they find useful or confusing, and how the tool affects their understanding of the LLM’s contributions during collaboration. To mitigate order effects, the session order was counterbalanced: three pairs comple...
- [9] “After using the tool, were the analyses we provided (e.g., goals, contributions, indirect influence etc) already apparent to you, or did the tool help you notice them?”
  Post-survey. After completing tool use, participants again rated perceived contribution and satisfaction.
  1. Perceived contribution to goal shaping. “How much do you think you and the chatb...
- [10] “Comparing your two conversational partners (human vs. chatbot), how did they differ in terms of goal shaping, goal execution, and other aspects?”
- [11] “Comparing your own behavior when you collaborated with a human versus a chatbot, how did it differ in terms of goal shaping, goal execution, and other aspects?”
  F.2 Responses
  We summarize participants’ open-ended responses below. See Figure 9 for participants’ perception ratings.
  [Figure 9: perception ratings on a 0 to 4 axis; panel questions: “How much did AI contribute to shaping your goal?”, “How much did you...”]
- [12] Overall, the tool helped users notice things they had not been explicitly aware of
  • P1: “I did realize new things. I was not doing this cognitive reflection yet.”
  • P2: “I don’t think about those things off the top of my head, so when it’s laid out in front of me...”
  • P5: “The tool helped me notice them better!”
  • P6: “The tool helped me notice”
  2. The t...
- [13] Users appreciated that it followed constraints and adapted to new requirements.
  • P3: “It adjusted promptly when I asked for new recommendations.”
  • P5: “It was nice to give all the constraints at once and to keep iterating on those constraints/guardrails.”
  • P7: “It did a good job looking at my requirements and fitting accordin...