
arxiv: 2605.21363 · v1 · pith:2G44LSXFnew · submitted 2026-05-20 · 💻 cs.CL

"I didn't Make the Micro Decisions": Measuring, Inducing, and Exposing Goal-Level AI Contributions in Collaboration

Pith reviewed 2026-05-21 04:40 UTC · model grok-4.3

classification 💻 cs.CL
keywords human-AI collaboration · goal attribution · LLM contributions · interaction design · user perception · contribution tracing · dialogue analysis

The pith

A framework called CoTrace traces AI contributions to goal formation in human-AI collaboration and finds that models account for only 11-26% of goal-shaping contribution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CoTrace to break down explicit goals into verifiable requirements and track both direct inputs and indirect influences across conversation turns. Analysis of 638 real collaboration logs shows AI plays a smaller role in shaping overall goals but adds more concrete lower-level requirements and various indirect effects. Controlled tests reveal that changes in interaction design alter these patterns, while a user study finds that presenting the goal-level breakdown shifts how much credit participants assign to themselves or the AI by nearly two points on a five-point scale.

Core claim

CoTrace decomposes explicit goals into verifiable requirements and traces both direct contributions and indirect influences across dialogue turns. When applied to 638 real-world collaboration logs, models account for 11-26% of goal-shaping contribution yet contribute substantially more to introducing lower-level concrete requirements and various indirect influences. Interaction design choices affect model goal-shaping behavior in controlled simulations, and exposing users to the resulting analyses shifts their perceived contributions by nearly 2 points on a 5-point scale.

What carries the argument

CoTrace, the goal-level attribution framework that decomposes explicit goals into verifiable requirements and traces direct and indirect contributions across dialogue turns.
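
The distinction between direct and indirect contributions can be made concrete with a small sketch. This is not the paper's implementation: the `Requirement` record, its field names, and the unweighted counting below are illustrative assumptions, and the paper's actual weighting of contributions is not described on this page.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Requirement:
    """Hypothetical record for one verifiable requirement under a goal."""
    req_id: str
    goal_id: str
    creator: str                      # "human" or "llm": who explicitly introduced it
    influencer: Optional[str] = None  # party whose earlier turn prompted it, if any


def direct_share(reqs: List[Requirement], party: str) -> float:
    """Fraction of requirements explicitly created by `party` (direct contribution)."""
    if not reqs:
        return 0.0
    return sum(r.creator == party for r in reqs) / len(reqs)


def indirect_count(reqs: List[Requirement], party: str) -> int:
    """Requirements created by someone else but influenced by `party`."""
    return sum(r.influencer == party and r.creator != party for r in reqs)


# Toy log with made-up entries, not data from the paper.
log = [
    Requirement("r1", "g1", creator="human"),
    Requirement("r2", "g1", creator="human", influencer="llm"),
    Requirement("r3", "g1", creator="llm"),
    Requirement("r4", "g2", creator="human"),
]

print(direct_share(log, "llm"))    # 0.25
print(indirect_count(log, "llm"))  # 1
```

In this toy log the model directly creates one of four requirements and indirectly prompts one more, which is the kind of split the framework is described as tracing across dialogue turns.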

If this is right

  • Interaction design choices can be tuned to increase or decrease the extent of model goal-shaping.
  • Users hold systematically miscalibrated views of their own contributions in AI-assisted work.
  • Providing goal-level attribution data can shift users' perceived contributions by nearly two points on a five-point scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar tracing methods could help users in other domains calibrate how much they rely on AI during planning or creative tasks.
  • The approach might be adapted to evaluate contribution patterns in group settings without AI, such as team project logs.
  • Designers could embed lightweight versions of the decomposition step into real-time collaboration interfaces to surface contributions as work unfolds.

Load-bearing premise

Decomposing explicit goals into verifiable requirements captures contributions completely and without systematic bias or omission of subtle influences.

What would settle it

Applying the same decomposition and tracing process to a fresh set of collaboration logs and obtaining goal-shaping percentages outside the 11-26% range, or finding no shift in user perceptions after exposure to the analysis, would challenge the central results.

Figures

Figures reproduced from arXiv: 2605.21363 by Eunsu Kim, Jessica R. Mindel, Kyungjin Kim, Sherry Tongshuang Wu.

Figure 1: Illustrative overview of COTRACE and its benefits. COTRACE analyzes human and LLM contributions at the goal level by tracing requirement lifecycles, including direct contributions (who explicitly creates a requirement) and indirect contributions (who influences another party to introduce it). It supports measuring goal-shaping behavior, provides a signal for inducing it, and supports user awareness by exp…
Figure 2: Overall goal shaping tendencies. Humans (H) dominate overall shaping (a), while LLM (L) contributions on goal shaping increase as goals become more specific (b). Humans primarily set direction, while models add specificity
Figure 3: Impact of task on requirement generation. Models generate more requirements in technical tasks than in less technical tasks such as writing and planning. while LLMs account for 96–99% of all EXECUTOR mass. This aligns with the instruction-following nature of current LLMs, which are typically guided by human-specified instructions (Ouyang et al., 2022). However, a more nuanced pattern appears along the goa…
Figure 4: Framework Overview. quality (Fragiadakis et al., 2025). Most, however, assume that the user’s goal is fixed in advance. This assumption is especially explicit in simulation-based settings, where collaboration is organized around predefined tasks, requirements, or evaluation criteria (Shao et al., 2025a). Such setups enable controlled comparison, but offer limited visibility into how goals are formulated, …
Figure 5: Tool Validation Survey Results. overly weak relationships as influence links (3/9, 33.3%), such as cases where two entities merely share a broad topic, but are not strongly or directly related. Disagreement on Extracted Requirements B.3.2 Validation from User Study In the user study, participants rated their level of agreement with each component of the analysis (goals, requirements, and indirect influence…
Figure 6: Impact of task characteristics on Requirement generation (Closed vs. Open-ended
Figure 7: Actions users and assistants employ to formulate and execute goals. The action examples shown in the figure are drawn from cases where those actions were actually used in the creation of requirements. Users engage in direct goal-shaping actions (e.g., Request, Constrain, Instruct). In contrast, assistants tend to shape goals either indirectly through advisory actions (e.g., Suggest, Recommend) or silently …
Figure 8: Qualitative examples of requirement embeddings in PCA space, shown for four task types (programming, data analysis, writing, planning). Each point is colored based on whether its surrounding semantic neighborhood is dominated by user-created requirements (user-heavy), assistant-created requirements (assistant-heavy), or a mix of both (mixed). This allows us to examine whether users and assistants created r…
Figure 9: Changes in participants’ perceptions after exposure to
Figure 10: Screenshots of UI and Tutorial we used for Human Study
Figure 10: Screenshots of UI and Tutorial we used for Human Study (continued)
Original abstract

As large language models (LLMs) increasingly shape how users form, refine, and extend their goals, attributing contributions in human-AI collaboration becomes critical for users calibrating their own reliance and for evaluators assessing AI-assisted work. Yet existing methods focus on final artifacts, missing the process through which goals themselves are jointly shaped. We introduce a goal-level attribution framework, CoTrace, that decomposes explicit goals into verifiable requirements and traces both direct contributions and indirect influences across dialogue turns. Applying CoTrace to 638 real-world collaboration logs, we find that while models account for only 11-26% of goal-shaping contribution, they contribute substantially more on introducing lower-level concrete requirements, and make various kinds of indirect contributions. Through controlled simulations, we show that interaction design choices significantly affect model goal-shaping behavior. In a user study, exposing participants to goal-level analyses shifts their perceived contributions by nearly 2 points on a 5-point scale, revealing systematic miscalibration in how users understand their own AI-assisted work.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces CoTrace, a goal-level attribution framework that decomposes explicit goals into verifiable requirements and traces both direct contributions and indirect influences across dialogue turns in human-AI collaboration. Applied to 638 real-world logs, it reports that models account for 11-26% of goal-shaping contribution while contributing more to lower-level concrete requirements and various indirect effects. Controlled simulations demonstrate that interaction design choices affect model goal-shaping behavior. A user study finds that exposing participants to goal-level analyses shifts their perceived contributions by nearly 2 points on a 5-point scale, indicating miscalibration in users' understanding of AI-assisted work.

Significance. If the CoTrace attribution method proves reliable, the work could meaningfully advance understanding of how LLMs shape goals in collaboration, with implications for user calibration of reliance and for designing better human-AI interfaces. The use of real collaboration logs, simulations testing design choices, and a user study on perception shifts provides a multi-method approach that strengthens potential impact in HCI and AI ethics. The focus on process-level tracing rather than final artifacts is a clear strength.

major comments (3)
  1. [Methods (CoTrace application)] Methods section on CoTrace: The central quantitative claims (11-26% goal-shaping contribution and contrasts with lower-level requirements) rest on manual decomposition of goals into verifiable requirements followed by tracing across turns, yet no inter-annotator agreement statistics, validation against an external gold standard, or sensitivity analysis to alternative decompositions are reported. This directly affects reliability of the attribution percentages.
  2. [Empirical Analysis / Results] Results on 638 logs: The contribution figures and data exclusion rules are presented without confidence intervals, details on coder training, or robustness checks, making it impossible to rule out post-hoc choices or systematic coder bias that could shift the headline 11-26% range.
  3. [User Study] User study: The nearly 2-point shift on the 5-point scale is reported without sample size, exact statistical test, p-value, or pre-registration details, which is load-bearing for the claim of systematic miscalibration in perceived contributions.
minor comments (3)
  1. [Abstract] Abstract: Replace the vague 'nearly 2 points' with the precise mean difference and scale anchors for clarity.
  2. [Figures] Figure captions: Ensure diagrams distinguishing direct vs. indirect contributions include explicit legends and examples from the logs.
  3. [Related Work] Related work: Add citations to prior work on contribution attribution in collaborative writing or goal-setting systems to better situate CoTrace.
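
Major comment 1 asks for inter-annotator agreement statistics. One standard choice is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a minimal illustration on made-up labels; the label set ("human" / "llm") and the two-annotator setup are assumptions for the example, not details taken from the paper.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical attribution labels for four requirements (not the paper's data).
annotator_1 = ["human", "human", "llm", "llm"]
annotator_2 = ["human", "llm", "llm", "llm"]
print(cohens_kappa(annotator_1, annotator_2))  # 0.5
```

Here the annotators agree on three of four items (0.75 observed), chance agreement is 0.5, and kappa is therefore 0.5.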

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas where additional reporting and validation will improve the transparency and robustness of our claims. We address each major comment below and commit to revisions that directly respond to the concerns raised.

Point-by-point responses
  1. Referee: [Methods (CoTrace application)] Methods section on CoTrace: The central quantitative claims (11-26% goal-shaping contribution and contrasts with lower-level requirements) rest on manual decomposition of goals into verifiable requirements followed by tracing across turns, yet no inter-annotator agreement statistics, validation against an external gold standard, or sensitivity analysis to alternative decompositions are reported. This directly affects reliability of the attribution percentages.

    Authors: We agree that these details are important for establishing reliability. In the revised manuscript we will add inter-annotator agreement statistics (Cohen's kappa) for both the goal decomposition and tracing steps, along with a sensitivity analysis that applies alternative decompositions and reports the resulting range of contribution percentages. We will also explicitly discuss the absence of an external gold standard, noting the interpretive character of goal decomposition, and describe our annotation protocol and training procedures in greater detail. revision: yes

  2. Referee: [Empirical Analysis / Results] Results on 638 logs: The contribution figures and data exclusion rules are presented without confidence intervals, details on coder training, or robustness checks, making it impossible to rule out post-hoc choices or systematic coder bias that could shift the headline 11-26% range.

    Authors: We will revise the Results section to include 95% confidence intervals for all reported contribution percentages. We will also add a dedicated subsection describing coder training, qualification criteria, and the exclusion rules applied to the 638 logs. Finally, we will present robustness checks that vary the exclusion criteria and re-compute the 11-26% range to demonstrate that the headline findings are not sensitive to these choices. revision: yes

  3. Referee: [User Study] User study: The nearly 2-point shift on the 5-point scale is reported without sample size, exact statistical test, p-value, or pre-registration details, which is load-bearing for the claim of systematic miscalibration in perceived contributions.

    Authors: We will expand the User Study section to report the exact sample size, the statistical test used (paired t-test), the associated p-value, and the effect size. We will also clarify the pre-registration status: the analysis plan was fixed prior to data collection even though formal pre-registration was not completed; we will note this limitation transparently while providing the planned analysis details that support the reported perception shift. revision: yes
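
The statistics promised in responses 2 and 3 can be sketched briefly. The code below is an illustration under stated assumptions, not the authors' analysis: it uses a percentile bootstrap for the confidence interval (the rebuttal does not say which interval method will be used), all function names are hypothetical, and every number is made up for the example. It computes the paired t statistic and effect size but not the p-value, which needs a t-distribution CDF that the Python standard library does not provide.

```python
import math
import random
import statistics
from typing import List, Sequence, Tuple


def bootstrap_ci(flags: Sequence[int], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI for the mean of 0/1 flags
    (e.g. 1 = requirement attributed to the model)."""
    rng = random.Random(seed)
    n = len(flags)
    means: List[float] = sorted(
        sum(rng.choices(flags, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def paired_t(pre: Sequence[float], post: Sequence[float]) -> Tuple[float, int, float]:
    """Paired t statistic, degrees of freedom, and Cohen's d on the differences."""
    if len(pre) != len(post) or len(pre) < 2:
        raise ValueError("need two equal-length samples with at least two pairs")
    diffs = [b - a for a, b in zip(pre, post)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_d == 0:
        raise ValueError("all differences are identical; t is undefined")
    t_stat = mean_d / (sd_d / math.sqrt(len(diffs)))
    return t_stat, len(diffs) - 1, mean_d / sd_d


# Made-up inputs: 20 of 100 requirements attributed to the model,
# and four participants' ratings before and after seeing an analysis.
flags = [1] * 20 + [0] * 80
print(bootstrap_ci(flags))

t_stat, dof, effect = paired_t([1, 2, 3, 4], [2, 4, 3, 5])
print(round(t_stat, 3), dof, round(effect, 3))  # 2.449 3 1.225
```

A fixed seed keeps the bootstrap interval reproducible; in a real analysis the resample count and interval method would be reported alongside the result.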

Circularity Check

0 steps flagged

No circularity: empirical attribution via CoTrace is independent of its inputs

Full rationale

The paper introduces CoTrace as a manual decomposition-and-tracing procedure applied to 638 external collaboration logs. The reported 11-26% goal-shaping figures are direct outputs of that tracing process rather than quantities defined in terms of themselves, fitted parameters renamed as predictions, or results justified solely by self-citation. No equations, ansatzes, or uniqueness theorems appear in the provided text that would collapse the central claims back to the framework's own definitions by construction. The derivation chain therefore remains self-contained against the logs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that goals can be meaningfully decomposed into verifiable requirements and that the 638 logs are representative of real-world collaboration; no free parameters or invented physical entities are introduced.

axioms (1)
  • domain assumption: Explicit goals can be decomposed into verifiable requirements that allow reliable tracing of direct and indirect contributions.
    This decomposition is the foundational step of the CoTrace framework described in the abstract.
invented entities (1)
  • CoTrace framework (no independent evidence)
    purpose: Decompose goals into requirements and trace AI contributions across dialogue turns
    Newly introduced method in this paper with no independent evidence supplied outside the current work.

pith-pipeline@v0.9.0 · 5725 in / 1409 out tokens · 46031 ms · 2026-05-21T04:40:04.206257+00:00 · methodology


