
arxiv: 2605.09421 · v3 · pith:EAYAS546new · submitted 2026-05-10 · 💻 cs.SE

MACAA: Belief-Revision Multi-Agent Reasoning for Code Authorship Verification

Pith reviewed 2026-05-19 16:48 UTC · model grok-4.3

classification 💻 cs.SE
keywords code authorship verification · multi-agent systems · belief revision · large language models · cross-language code analysis · training-free methods · software forensics · plagiarism detection

The pith

A coordinator and four expert agents use belief revision to verify code authorship without any training data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MACAA as a way to determine if two code samples come from the same author even when no prior examples of that author exist. Traditional methods need lots of labeled training data and struggle with code in different languages or from new programmers. By breaking the analysis into layout, word choice, structure, and coding style, then having a coordinator expand, contract, and revise beliefs to keep them consistent, the system produces decisions with traceable reasons. This matters because it opens authorship checks for real-world cases like detecting copied code or investigating software incidents where training sets are unavailable.

Core claim

MACAA replaces direct judgments from large language models with a structured process of hypothesis refinement: the Coordinator collects signals from four Expert Agents on layout, lexical, syntactic, and programming-pattern evidence, then applies expansion to gather more, contraction to discount unreliable parts, and revision to resolve conflicts, resulting in auditable authorship decisions that maintain consistency.
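The three operations named above can be pictured with a small toy model. This is a minimal sketch under stated assumptions, not the paper's Coordinator: the `Signal` and `BeliefState` classes, the halving factor in `contract`, and the 0.40/0.60 conflict thresholds are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Signal:
    dimension: str     # "layout", "lexical", "syntactic", or "pattern"
    similarity: float  # 0..1, higher looks more like the same author
    confidence: float  # 0..1, the expert's trust in its own signal


@dataclass
class BeliefState:
    signals: dict = field(default_factory=dict)  # dimension -> Signal
    weights: dict = field(default_factory=dict)  # dimension -> float

    def expand(self, signal: Signal) -> None:
        """Expansion: add a new expert signal, weighted by its confidence."""
        self.signals[signal.dimension] = signal
        self.weights[signal.dimension] = signal.confidence

    def contract(self, dimension: str, factor: float = 0.5) -> None:
        """Contraction: discount a dimension judged unreliable (assumed factor)."""
        self.weights[dimension] *= factor

    def revise(self, low: float = 0.40, high: float = 0.60):
        """Revision: on conflict, discount the least confident conflicting signal."""
        lows = [s for s in self.signals.values() if s.similarity < low]
        highs = [s for s in self.signals.values() if s.similarity >= high]
        if not (lows and highs):
            return None  # no cross-dimension conflict to resolve
        weaker = min(lows + highs, key=lambda s: s.confidence)
        self.contract(weaker.dimension)
        return weaker.dimension

    def score(self) -> float:
        """Weighted mean similarity over all dimensions held so far."""
        total = sum(self.weights.values())
        return sum(self.weights[d] * s.similarity
                   for d, s in self.signals.items()) / total


# Hypothetical signals: layout disagrees with lexical and is less confident.
state = BeliefState()
state.expand(Signal("layout", 0.30, 0.50))
state.expand(Signal("lexical", 0.70, 0.80))
before = state.score()
discounted = state.revise()
after = state.score()
```

In this toy run the layout signal is the one discounted, so the combined score moves toward the lexical evidence. Every weight change is an explicit method call, which is the property that makes such a process auditable.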

What carries the argument

The belief-revision multi-agent framework with a Coordinator that manages expansion, contraction, and revision of evidence collected by four specialized Expert Agents.

Load-bearing premise

The four expert agents can pull accurate evidence without hallucinating even from mixed-language code, and the coordinator's expansion-contraction-revision steps lead to correct authorship calls.

What would settle it

Run the system on a new set of cross-language code pairs with known authors, then check whether its accuracy falls below that of simpler baseline methods on the pairs where the agents produce conflicting signals.
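One way to set up that check is sketched below. The record layout (`agent_scores`, `same_author`, `system_pred`, `baseline_pred`), the conflict thresholds, and the sample data are assumptions made for illustration; none of them come from the paper.

```python
def f1_score(y_true, y_pred):
    """Binary F1 with same-author as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def has_conflict(agent_scores, low=0.40, high=0.60):
    """Assumed definition: one agent scores low while another scores high."""
    return min(agent_scores) < low and max(agent_scores) >= high


def compare_on_conflicts(records):
    """Score system and baseline only on pairs with conflicting agent signals."""
    subset = [r for r in records if has_conflict(r["agent_scores"])]
    y_true = [r["same_author"] for r in subset]
    return {
        "n": len(subset),
        "system_f1": f1_score(y_true, [r["system_pred"] for r in subset]),
        "baseline_f1": f1_score(y_true, [r["baseline_pred"] for r in subset]),
    }


# Hypothetical records, one per code pair.
records = [
    {"agent_scores": [0.30, 0.70, 0.50, 0.50], "same_author": True,
     "system_pred": True, "baseline_pred": False},
    {"agent_scores": [0.20, 0.80, 0.60, 0.50], "same_author": False,
     "system_pred": False, "baseline_pred": True},
    {"agent_scores": [0.50, 0.50, 0.50, 0.50], "same_author": True,
     "system_pred": True, "baseline_pred": True},
    {"agent_scores": [0.35, 0.65, 0.50, 0.50], "same_author": True,
     "system_pred": True, "baseline_pred": True},
]
report = compare_on_conflicts(records)
```

The third record has no conflicting signals and is excluded, so the comparison covers only the cases where the revision step has something to do.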

Figures

Figures reproduced from arXiv: 2605.09421 by Chenbin Su, Cong Gao, Ge Chu, Jianfei Tang, Jieshuai Yang, Jingwei Ye, Xin Li, Zhi Wang.

Figure 1: From Verification to Attribution in Code Authorship
Figure 2: MACAA overview with Coordinator Agent state-machine flow for expert evidence analysis, belief revision, ...

Abstract

Code authorship attribution (CAA) supports software forensics, plagiarism detection, and intellectual property protection. However, existing supervised CAA approaches suffer from scarce training data and closed-world assumptions: they require sufficient labeled code from fixed candidate-author sets, making training difficult in low-data cases and predictions unreliable for open-world test pairs with unseen samples or heterogeneous code pairs. Large language models remove task-specific training, but direct prompting depends on costly expert-designed prompts, can hallucinate over complex heterogeneous code pairs, and rarely yields auditable evidence traces. We propose MACAA, a belief-revision-based multi-agent framework for training-free code authorship verification. MACAA comprises a Coordinator and four Expert Agents analyzing layout, lexical, syntactic, and programming-pattern evidence. The Coordinator gathers expert signals for expansion, discounts unreliable evidence through contraction, and resolves conflicts through revision to preserve belief consistency, replacing direct LLM judgment with auditable hypothesis refinement. MACAA achieves 89.15% F1 on same-language benchmarks and 80.00% on mixed cross-language pairs, outperforming the baselines overall in both same-language and cross-language evaluations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MACAA, a training-free multi-agent framework for code authorship verification that employs a Coordinator together with four Expert Agents specialized in layout, lexical, syntactic, and programming-pattern evidence. The Coordinator performs expansion, contraction, and revision steps to maintain belief consistency and produce auditable authorship decisions. The central empirical claim is that MACAA attains 89.15% F1 on same-language benchmarks and 80.00% F1 on mixed cross-language pairs while outperforming the chosen baselines in both regimes.

Significance. If the reported gains are shown to arise specifically from the belief-revision loop rather than from the underlying LLM priors or individual agent signals, the work would offer a concrete, auditable alternative to both supervised CAA methods and direct LLM prompting for open-world and heterogeneous code settings.

major comments (2)
  1. [§4 (Experimental Results)] §4 (Experimental Results) and Table 2: the headline F1 scores of 89.15% (same-language) and 80.00% (cross-language) are presented without an ablation that disables the Coordinator’s expansion-contraction-revision loop while retaining identical agent outputs and test pairs; without this comparison it is impossible to attribute the cross-language improvement to the proposed mechanism rather than to the base LLM’s cross-lingual code understanding.
  2. [§3.2 (Expert Agents)] §3.2 (Expert Agents) and §4.1 (Dataset Construction): the claim that the four agents reliably extract non-hallucinated evidence on heterogeneous or cross-language pairs is not supported by any quantitative validation (e.g., inter-agent agreement rates, manual inspection of extracted evidence, or error analysis on mixed-language pairs); this assumption is load-bearing for the cross-language result.
minor comments (2)
  1. [Abstract] The abstract states that MACAA “outperforms the baselines overall” but neither names the baselines nor supplies the corresponding F1 numbers; this information should appear in the abstract or be cross-referenced to a table.
  2. [Figure 3] Figure 3 (coordinator workflow) uses abbreviations (E, C, R) that are defined only in the caption; inline definitions or a legend would improve readability.
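The controlled ablation requested in major comment 1 could be organized as in the following sketch. The cached-output layout and both decision functions are hypothetical. In particular, `confidence_weighted` is only a placeholder for the full Coordinator loop, not the paper's mechanism.

```python
from statistics import mean


def simple_aggregation(outputs, threshold=0.5):
    """Ablated decider: unweighted mean of the per-dimension similarity scores."""
    return mean(o["similarity"] for o in outputs.values()) >= threshold


def confidence_weighted(outputs, threshold=0.5):
    """Placeholder for the full pipeline: confidence-weighted mean similarity."""
    total = sum(o["confidence"] for o in outputs.values())
    weighted = sum(o["similarity"] * o["confidence"] for o in outputs.values())
    return weighted / total >= threshold


def run_ablation(cached_pairs, deciders):
    """Apply each decider to the same cached agent outputs, pair by pair."""
    return {name: [decide(pair["outputs"]) for pair in cached_pairs]
            for name, decide in deciders.items()}


# Hypothetical cached outputs for a single same-author pair.
cached_pairs = [{
    "same_author": True,
    "outputs": {
        "layout": {"similarity": 0.2, "confidence": 0.2},
        "lexical": {"similarity": 0.7, "confidence": 0.9},
        "syntactic": {"similarity": 0.5, "confidence": 0.5},
        "pattern": {"similarity": 0.5, "confidence": 0.5},
    },
}]
preds = run_ablation(cached_pairs, {
    "simple": simple_aggregation,
    "weighted": confidence_weighted,
})
```

Because both deciders read the same cached outputs, any difference in their predictions comes from the aggregation step alone. The prediction lists can then be scored with whichever metric Table 2 reports.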

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions we will make to strengthen the empirical claims.

  1. Referee: [§4 (Experimental Results)] §4 (Experimental Results) and Table 2: the headline F1 scores of 89.15% (same-language) and 80.00% (cross-language) are presented without an ablation that disables the Coordinator’s expansion-contraction-revision loop while retaining identical agent outputs and test pairs; without this comparison it is impossible to attribute the cross-language improvement to the proposed mechanism rather than to the base LLM’s cross-lingual code understanding.

    Authors: We agree that an ablation isolating the Coordinator’s expansion-contraction-revision loop is required to attribute gains specifically to the belief-revision mechanism rather than to the underlying LLM. While our current baselines include direct LLM prompting, they do not hold agent outputs fixed. In the revised manuscript we will add this controlled ablation to §4 and Table 2, using identical agent outputs and test pairs but replacing the belief-revision steps with simple aggregation, and report the resulting performance to quantify the loop’s contribution. revision: yes

  2. Referee: [§3.2 (Expert Agents)] §3.2 (Expert Agents) and §4.1 (Dataset Construction): the claim that the four agents reliably extract non-hallucinated evidence on heterogeneous or cross-language pairs is not supported by any quantitative validation (e.g., inter-agent agreement rates, manual inspection of extracted evidence, or error analysis on mixed-language pairs); this assumption is load-bearing for the cross-language result.

    Authors: We acknowledge that direct quantitative validation of the agents’ evidence extraction on cross-language pairs is currently absent and would strengthen the cross-language claims. Although overall performance and the auditable traces provide supporting evidence, we will add inter-agent agreement rates, a manual inspection of evidence from a sample of mixed-language pairs, and a focused error analysis on heterogeneous cases to the revised §3.2 and §4.1. revision: yes
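The inter-agent agreement rates promised in the second response could be computed along the following lines. This is a sketch only: the verdict labels and the sample data are assumptions, and the figure is raw agreement with no correction for chance.

```python
from itertools import combinations


def pairwise_agreement(verdicts):
    """Fraction of code pairs on which each pair of agents gives the same verdict.

    verdicts: list of dicts, one per code pair, mapping agent name to a label
    such as "same", "different", or "uncertain" (assumed label set).
    """
    agents = sorted(verdicts[0])
    rates = {}
    for a, b in combinations(agents, 2):
        agree = sum(1 for v in verdicts if v[a] == v[b])
        rates[(a, b)] = agree / len(verdicts)
    return rates


# Hypothetical verdicts for two mixed-language code pairs.
verdicts = [
    {"layout": "different", "lexical": "same",
     "syntactic": "same", "pattern": "same"},
    {"layout": "same", "lexical": "same",
     "syntactic": "same", "pattern": "same"},
]
rates = pairwise_agreement(verdicts)
```

Raw agreement is easy to read but can look high when one label dominates; a chance-corrected statistic such as Cohen's kappa would be a natural complement.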

Circularity Check

0 steps flagged

No circularity: framework and results presented as empirical construction without self-referential reduction


The paper introduces MACAA as a novel multi-agent belief-revision framework with a coordinator and four expert agents for layout, lexical, syntactic, and pattern analysis. No equations, fitted parameters, or self-citations are invoked in the provided text to derive the reported F1 scores (89.15% same-language, 80.00% cross-language) from the framework definition itself. Performance claims are positioned as outcomes of the described process evaluated on benchmarks, not tautological renamings or load-bearing self-citations. The derivation chain remains self-contained against external benchmarks and does not reduce any central result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the untested reliability of LLM-based expert agents and the effectiveness of the belief-revision loop; no free parameters, new physical entities, or machine-checked axioms are declared.

axioms (1)
  • Domain assumption: LLM expert agents produce reliable, non-hallucinated signals on layout, lexical, syntactic, and pattern evidence for heterogeneous code.
    Invoked when the coordinator gathers and revises agent outputs; if false, the entire evidence-refinement process collapses.


