MACAA: Belief-Revision Multi-Agent Reasoning for Code Authorship Verification
Pith reviewed 2026-05-19 16:48 UTC · model grok-4.3
The pith
A coordinator and four expert agents use belief revision to verify code authorship without any training data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MACAA replaces direct large-language-model judgments with a structured process of hypothesis refinement. The Coordinator collects signals from four Expert Agents covering layout, lexical, syntactic, and programming-pattern evidence. It then applies expansion to gather further evidence, contraction to discount unreliable evidence, and revision to resolve conflicts, yielding auditable authorship decisions that remain internally consistent.
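To make the three operations concrete, here is a minimal Python sketch. It is not the paper's implementation: the `Signal` record, the 0.5 confidence threshold, the "keep the most confident signal per dimension" revision rule, and the confidence-weighted average in `decide` are all illustrative assumptions. Only the operation names and the four evidence dimensions come from the claim above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    dimension: str     # layout | lexical | syntactic | pattern
    similarity: float  # 0..1, higher means "looks like the same author"
    confidence: float  # 0..1, how stable the expert considers the signal


def expand(beliefs, new_signals):
    """Expansion: add newly gathered expert signals to the belief set."""
    return beliefs + list(new_signals)


def contract(beliefs, min_confidence=0.5):
    """Contraction: drop signals judged unreliable (hypothetical threshold)."""
    return [s for s in beliefs if s.confidence >= min_confidence]


def revise(beliefs):
    """Revision: keep one signal per dimension (the most confident one),
    so the belief set holds no competing entries for the same dimension."""
    best = {}
    for s in beliefs:
        if s.dimension not in best or s.confidence > best[s.dimension].confidence:
            best[s.dimension] = s
    return list(best.values())


def decide(beliefs):
    """Confidence-weighted mean similarity; an assumed aggregation rule."""
    total = sum(s.confidence for s in beliefs)
    if total == 0:
        return "uncertain", 0.5
    score = sum(s.similarity * s.confidence for s in beliefs) / total
    return ("same_author" if score >= 0.5 else "different_author"), score
```

Because every step returns a plain list of signals, the intermediate belief sets can be logged after each operation, which is one simple way to obtain the kind of audit trail the claim describes.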
What carries the argument
The belief-revision multi-agent framework with a Coordinator that manages expansion, contraction, and revision of evidence collected by four specialized Expert Agents.
Load-bearing premise
The four expert agents can pull accurate evidence without hallucinating even from mixed-language code, and the coordinator's expansion-contraction-revision steps lead to correct authorship calls.
What would settle it
Running the system on a new set of cross-language code pairs with known authors and checking if its accuracy falls below that of simpler baseline methods when the agents produce conflicting signals.
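One way to run that check, sketched in Python under stated assumptions: each test pair is a dict with the hypothetical keys `label`, `macaa`, `baseline`, and `agent_scores`, and agents count as conflicting when their similarity scores fall on both sides of 0.5. None of these names or thresholds come from the paper.

```python
def agents_conflict(agent_scores, threshold=0.5):
    """True when some agents lean same-author and others lean different-author."""
    leans = {score >= threshold for score in agent_scores.values()}
    return len(leans) == 2


def accuracy(pairs, system):
    """Fraction of pairs where the named system's prediction matches the label."""
    if not pairs:
        return float("nan")
    return sum(p[system] == p["label"] for p in pairs) / len(pairs)


def conflict_subset_report(pairs):
    """Compare both systems only on pairs where the agents disagree."""
    conflicted = [p for p in pairs if agents_conflict(p["agent_scores"])]
    return {
        "n_conflicted": len(conflicted),
        "macaa": accuracy(conflicted, "macaa"),
        "baseline": accuracy(conflicted, "baseline"),
    }
```

If the `macaa` figure on the conflicted subset is no better than the `baseline` figure, the revision step is not earning its cost on exactly the cases it was designed for.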
Original abstract
Code authorship attribution (CAA) supports software forensics, plagiarism detection, and intellectual property protection. However, existing supervised CAA approaches suffer from scarce training data and closed-world assumptions: they require sufficient labeled code from fixed candidate-author sets, making training difficult in low-data cases and predictions unreliable for open-world test pairs with unseen samples, or heterogeneous code pairs. Large language models remove task-specific training, but direct prompting depends on costly expert-designed prompts, can hallucinate over complex heterogeneous code pairs, and rarely yields auditable evidence traces. We propose MACAA, a belief-revision-based multi-agent framework for training-free code authorship verification. MACAA comprises a Coordinator and four Expert Agents analyzing layout, lexical, syntactic, and programming-pattern evidence. The Coordinator gathers expert signals for expansion, discounts unreliable evidence through contraction, and resolves conflicts through revision to preserve belief consistency, replacing direct LLM judgment with auditable hypothesis refinement. MACAA achieves 89.15% F1 on same-language benchmarks and 80.00% on mixed cross-language pairs, outperforming the baselines overall in both same-language and cross-language evaluations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MACAA, a training-free multi-agent framework for code authorship verification that employs a Coordinator together with four Expert Agents specialized in layout, lexical, syntactic, and programming-pattern evidence. The Coordinator performs expansion, contraction, and revision steps to maintain belief consistency and produce auditable authorship decisions. The central empirical claim is that MACAA attains 89.15% F1 on same-language benchmarks and 80.00% F1 on mixed cross-language pairs while outperforming the chosen baselines in both regimes.
Significance. If the reported gains are shown to arise specifically from the belief-revision loop rather than from the underlying LLM priors or individual agent signals, the work would offer a concrete, auditable alternative to both supervised CAA methods and direct LLM prompting for open-world and heterogeneous code settings.
major comments (2)
- [§4 (Experimental Results)] §4 (Experimental Results) and Table 2: the headline F1 scores of 89.15% (same-language) and 80.00% (cross-language) are presented without an ablation that disables the Coordinator’s expansion-contraction-revision loop while retaining identical agent outputs and test pairs; without this comparison it is impossible to attribute the cross-language improvement to the proposed mechanism rather than to the base LLM’s cross-lingual code understanding.
- [§3.2 (Expert Agents)] §3.2 (Expert Agents) and §4.1 (Dataset Construction): the claim that the four agents reliably extract non-hallucinated evidence on heterogeneous or cross-language pairs is not supported by any quantitative validation (e.g., inter-agent agreement rates, manual inspection of extracted evidence, or error analysis on mixed-language pairs); this assumption is load-bearing for the cross-language result.
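The ablation requested in the first major comment can be sketched as a small harness that scores two decision rules on the same cached agent outputs. This is an illustration only: `run_ablation`, `mean_aggregation`, the `same_author`/`different_author` label strings, and the 0.5 threshold are hypothetical, and `full_system` stands in for whatever callable wraps the Coordinator's loop.

```python
def f1_score(labels, preds, positive="same_author"):
    """Binary F1 for the positive class, computed from raw counts."""
    tp = sum(l == positive and p == positive for l, p in zip(labels, preds))
    fp = sum(l != positive and p == positive for l, p in zip(labels, preds))
    fn = sum(l == positive and p != positive for l, p in zip(labels, preds))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_aggregation(agent_scores, threshold=0.5):
    """Ablated decision rule: unweighted mean of the agent similarity scores."""
    mean = sum(agent_scores.values()) / len(agent_scores)
    return "same_author" if mean >= threshold else "different_author"


def run_ablation(cached_outputs, labels, full_system, ablated=mean_aggregation):
    """Score both decision rules on identical cached agent outputs."""
    full_preds = [full_system(o) for o in cached_outputs]
    ablated_preds = [ablated(o) for o in cached_outputs]
    return {
        "full": f1_score(labels, full_preds),
        "ablated": f1_score(labels, ablated_preds),
    }
```

Caching the agent outputs once and feeding them to both rules is the point of the design: any difference between the two F1 values can then only come from the decision step, not from variation in what the agents extracted.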
minor comments (2)
- [Abstract] The abstract states that MACAA “outperforms the baselines overall” but neither names the baselines nor supplies the corresponding F1 numbers; this information should appear in the abstract or be cross-referenced to a table.
- [Figure 3] Figure 3 (coordinator workflow) uses abbreviations (E, C, R) that are defined only in the caption; inline definitions or a legend would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions we will make to strengthen the empirical claims.
Point-by-point responses
-
Referee: [§4 (Experimental Results)] §4 (Experimental Results) and Table 2: the headline F1 scores of 89.15% (same-language) and 80.00% (cross-language) are presented without an ablation that disables the Coordinator’s expansion-contraction-revision loop while retaining identical agent outputs and test pairs; without this comparison it is impossible to attribute the cross-language improvement to the proposed mechanism rather than to the base LLM’s cross-lingual code understanding.
Authors: We agree that an ablation isolating the Coordinator’s expansion-contraction-revision loop is required to attribute gains specifically to the belief-revision mechanism rather than to the underlying LLM. While our current baselines include direct LLM prompting, they do not hold agent outputs fixed. In the revised manuscript we will add this controlled ablation to §4 and Table 2, using identical agent outputs and test pairs but replacing the belief-revision steps with simple aggregation, and report the resulting performance to quantify the loop’s contribution. revision: yes
-
Referee: [§3.2 (Expert Agents)] §3.2 (Expert Agents) and §4.1 (Dataset Construction): the claim that the four agents reliably extract non-hallucinated evidence on heterogeneous or cross-language pairs is not supported by any quantitative validation (e.g., inter-agent agreement rates, manual inspection of extracted evidence, or error analysis on mixed-language pairs); this assumption is load-bearing for the cross-language result.
Authors: We acknowledge that direct quantitative validation of the agents’ evidence extraction on cross-language pairs is currently absent and would strengthen the cross-language claims. Although overall performance and the auditable traces provide supporting evidence, we will add inter-agent agreement rates, a manual inspection of evidence from a sample of mixed-language pairs, and a focused error analysis on heterogeneous cases to the revised §3.2 and §4.1. revision: yes
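The inter-agent agreement rates promised in the second response could be computed along the following lines. This is a sketch under assumptions: each record is a hypothetical dict mapping an agent name to its similarity score for one code pair, and two agents "agree" when both scores fall on the same side of an assumed 0.5 threshold.

```python
from itertools import combinations


def lean(score, threshold=0.5):
    """Collapse a similarity score into a same/different lean."""
    return "same" if score >= threshold else "different"


def pairwise_agreement(records, threshold=0.5):
    """For each pair of agents, the share of code pairs on which both lean the same way."""
    agents = sorted(records[0])
    rates = {}
    for a, b in combinations(agents, 2):
        agree = sum(lean(r[a], threshold) == lean(r[b], threshold) for r in records)
        rates[(a, b)] = agree / len(records)
    return rates
```

Raw agreement does not correct for agreement expected by chance, so a chance-corrected statistic such as Cohen's kappa would be a natural companion figure in the revised manuscript.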
Circularity Check
No circularity: framework and results presented as empirical construction without self-referential reduction
full rationale
The paper introduces MACAA as a novel multi-agent belief-revision framework with a coordinator and four expert agents for layout, lexical, syntactic, and pattern analysis. No equations, fitted parameters, or self-citations are invoked in the provided text to derive the reported F1 scores (89.15% same-language, 80.00% cross-language) from the framework definition itself. Performance claims are positioned as outcomes of the described process evaluated on benchmarks, not tautological renamings or load-bearing self-citations. The derivation chain remains self-contained against external benchmarks and does not reduce any central result to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLM expert agents produce reliable, non-hallucinated signals on layout, lexical, syntactic, and pattern evidence for heterogeneous code
Reference graph
Works this paper leans on
-
[1]
Scs-gan: learning functionality-agnostic stylometric representations for source code authorship verification. IEEE Transactions on Software Engineering, 49(4):1426–1442. Ruchir Puri, David Kung, Geert Janssen, Wei Zhang, Giacomo Domeniconi, Vladimir Zolotov, Julian Dolby, Jie Chen, Mihir Choudhury, Lindsey Decker, and 1 others. 2021. Codenet: A large...
-
[2]
Talk isn’t always cheap: Understanding failure modes in multi-agent debate. arXiv preprint arXiv:2509.05396. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. React: Synergizing reasoning and acting in language models. In 11th International Conference on Learning Representations, ICLR 2023. A Experimental ...
-
[3]
naming: both codes use short, lowercase-dominant identifiers (avg_len<3), suggesting consistent personal naming compression habit
-
[4]
structure: both adopt flat, single-block scripts without helper functions or abstractions
-
[5]
comment: both are comment-free, aligning with rapid-competition authoring style
-
[6]
confounders: competitive template, language_syntax may mimic author-level consistency. Listing 1: Coordinator: configuration and preliminary review. Expert Evidence. Four Expert Agents analyze complementary dimensions in parallel via ReAct tool loops. [LAYOUT] s=0.32, conf=0.58 -> different Python: space (37/37), avg_indent=7.28, indent_std=3.37, comma_tig...
-
[7]
Lexical (s=0.52, conf=0.68): flag->is_half chain is stable, ecosystem-independent. Source-code verified: flag at time-judge entry, is_half at same logic position
-
[8]
Layout (s=0.32) downweighted: Tab/Space attributed to language ecosystem (Py=space, C++=Tab). Rechecked 2x, no counter found
-
[9]
Syntactic/Pattern uncertain but do not contradict. opponent notes: Tab/Space persistence >85% in literature; but no evidence author systematically switched. process: 4 rounds, 1 debate, 2 rechecks, 35/40 LLM calls. anchored: PRELIMINARY same@0.62. Listing 6: Layout recheck and final decision. The final decision (same_author, 0.79) agrees with the ground t...
-
[10]
Current uncertainty? (which layout dims lack/conflict evidence)
-
[11]
Candidate tool's new info? (expected distinguishing signals)
-
[12]
Why now? (max info gain, complementary, avoid repetition) ReAct Structure per step:
-
[13]
Thought: current uncertainty dimension, expected signals, causal link from previous observation
-
[14]
Action: select tool. Priority: uncovered > complementary > conflict resolution
-
[15]
Observation: convert output to 1-3 signals. Assess template/task influence; downweight if affected
-
[16]
Stop: when coverage met or budget exhausted. Output evidence: summary (one-line style portrait), signals (per dimension), confidence (0-1, stability confidence, not same-author). Output (strict JSON only, no text/markdown/fences): Continue: {"thought":"...", "action":{"type":"tool", "name":"tool_name"}, "stop":false} Stop: {"thought":"...","action": {"typ...
-
[17]
whitespace_profile: avg_indent, tab/space lines, avg_line_length, empty_line_ratio, trailing_space_lines, indent_std
-
[18]
delimiter_layout_profile: control_space_before_paren, control_tight_before_paren, comma_space/tight, same_line_block_opener, next_line_block_opener
-
[19]
comment_layout_profile: comment_line_ratio, inline_comments, standalone_comments, doc_comments
-
[20]
format_stability_profile: indent_switch_rate, line_length_std Key judgment principles: - Indentation and spacing preferences are strong author signals. - Delimiter formatting habits (if(x) vs if (x)) are stable. - Comment style aids judgment but content is task-influenced. - Large code-size differences distort absolute metrics; focus on ratios. - Layout i...
-
[21]
token_frequency_profile: keyword_ratio, identifier_ratio, operator_ratio, punctuation_ratio, token_top
-
[22]
token_ngram_profile: token_bigrams, abstract_token_trigrams, longest_repeated_sequence
-
[23]
char_ngram_profile: char_4gram, char_5gram
-
[24]
identifier_style_profile: identifier_cases, avg_length, unique_ratio, digit_ratio, underscore_ratio
-
[25]
abstract_lexical_profile: abstract distributions + bigrams Key principles: - Naming style = strong author signal (snake_case vs camelCase, identifier length, abbreviation habits). Stable across projects. - Abstract templates > concrete tokens. "if(ID)" vs "if(ID==NUM)". - Same-author/different-problem: trust identifier_style, abstract_lexical, char_ngram ...
-
[26]
ast_node_profile: node type ratios (degraded mode possible)
-
[27]
ast_path_profile: parent_child_pairs, sibling_pairs
-
[28]
tree_shape_profile: max_depth, avg_branching, branching_std, node_count
-
[29]
construct_usage_profile: if/for/while/switch/return ratios
-
[30]
[Optional] Dolos: dolos_similarity, total_overlap, longest_fragment (reference only) Key principles: - AST paths + context = core author signals. - Tree shape = structural thinking (nested vs flat). - Control-structure prefs (for vs while, early return) = stable. - Size differences: compare RATIOS, not absolutes. - Degraded mode: reduce confidence. - Simi...
-
[31]
function_metric_profile: function_count, avg_lines_per_function, return_per_function, avg_line_length
-
[32]
control_strategy_profile: guard_if_ratio, recursive_function_hints, loop_count, if_count
-
[33]
api_idiom_profile: api_families; plugin_flags
-
[34]
semantic_habit_profile: short_temp_ratio, helper_name_ratio, uppercase_constant_ratio, assert_like_count Key principles: - Function size + organization = stable. - Control strategy = core author signal (guard clause, recursion). - Semantic habits = strong signals: temp variable naming (i/j/k vs x/y/z), helper naming, constant style. - Code-size difference...
- [35]
- [36]
- [37]
-
[38]
Statistical Cues 10. Idiosyncratic. Hard Rules: no default to different_author from artifacts; mark confounders; balanced same/different. Output: {overall_first_impression, candidate_style_axes[], suspected_confounders[], dimension_routing{layout/lexical/ syntactic/pattern{priority,why,focus_question}}, global_questions[], do_not_overtrust[]} Listing 15: ...
-
[39]
FINALIZE: evidence sufficient or budget exhausted
-
[40]
RECHECK_DIMENSION: low-confidence/high-impact dimension
-
[41]
START_DEBATE: two dimensions in conflict
-
[42]
ADJUST_WEIGHTS: post-debate/recheck credibility shift. Priority: no conflict+full evidence > FINALIZE; preliminary conflict > RECHECK; two expert dims conflict > DEBATE. Mandatory: dim-divergence check (dim<0.40 + dim>=0.60 => DEBATE/ RECHECK). LLM comparison failure => must RECHECK. Debate participants = real dimensions only (layout|lexical|syntactic|pat...
-
[43]
Evidence Sufficiency: all dimensions covered? concrete signals?
-
[44]
Conflict Resolution: cross-dimension conflicts explained?
-
[45]
Marginal Gain: how much new info could further investigation yield? Low-Similarity Veto Rule: - A dimension with similarity_score < 0.40 is "suspect." - Suspect dims prevent evidence_sufficient unless: (a) after debate/recheck, similarity rises to >= 0.45, OR (b) the gap is confirmed as task/algorithm-driven, not style. - With 2+ suspect dims, prefer CONT...
-
[46]
Ruling: conflict resolved?
-
[47]
Tracing: dimension credibility update?
-
[48]
New Evidence: source-level insights from debate
-
[49]
Corrections: which dimension reports need revision? Evaluation: source consistency, preliminary review alignment, external consistency (one dim contradicts others?), argument strength (who provided more verifiable observations?). Required tasks: determine conflict resolution, assess which side is more persuasive, update dimension credibility, extract new ...
-
[50]
After re-reading code, overall gut tendency?
-
[51]
Which dims support same_author? different_author? uncertain?
-
[52]
Is preliminary confirmed/weakened/overturned?
-
[53]
Did debate bring genuinely new evidence?
-
[54]
Was reflection's advice adopted? Hard Rules: - No numeric anchors; uncertain != different_author. - Overturning preliminary requires explanation. - different_author requires >=2 moderate different dims OR 1 strong structural counter-evidence (confounder_risk=low) + debate. - Mixed evidence (2 same + 2 different) => uncertain. - Cross-lang: syntactic = wea...
-
[55]
author_stable_signals: similarities persisting across problems
-
[56]
different_author_signals: contrasts supporting different authors
-
[57]
neutral_or_confounding_signals: overlaps better explained by templates, tasks, ecosystems, or language defaults. Prioritize: {per-dimension stable_focus items} Different-author cues: {per-dimension different_focus items} Actively discount: {per-dimension confounders} Return exactly one JSON object: {tendency, similarity_score, confidence, summary, author_...
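Two of the Coordinator rules quoted in the excerpts above, the mandatory dim-divergence check and the Low-Similarity Veto Rule, are precise enough to sketch in Python. The thresholds (0.40, 0.45, 0.60) and the action names are taken from those excerpts. Everything else is an assumption made for illustration: the function names, the `task_driven` set standing for gaps confirmed as task- or algorithm-driven, and the choice to start a debate when divergence is present and to recheck otherwise.

```python
SUSPECT_BELOW = 0.40    # quoted: similarity_score < 0.40 marks a dimension "suspect"
CLEARED_AT = 0.45       # quoted: suspect dim is cleared if it rises to >= 0.45
DIVERGENCE_HIGH = 0.60  # quoted: dim < 0.40 together with dim >= 0.60


def has_divergence(scores):
    """Dim-divergence check: one dimension below 0.40 and another at or above 0.60."""
    return (any(s < SUSPECT_BELOW for s in scores.values())
            and any(s >= DIVERGENCE_HIGH for s in scores.values()))


def blocking_dims(initial, current, task_driven=frozenset()):
    """Suspect dimensions that still veto evidence sufficiency."""
    return sorted(
        d for d, s in initial.items()
        if s < SUSPECT_BELOW
        and current.get(d, s) < CLEARED_AT
        and d not in task_driven
    )


def next_action(initial, current, task_driven=frozenset(), budget_left=True):
    """Pick a Coordinator action (assumed mapping from rules to actions)."""
    if not budget_left:
        return "FINALIZE"  # quoted: finalize when the budget is exhausted
    if blocking_dims(initial, current, task_driven):
        return "START_DEBATE" if has_divergence(current) else "RECHECK_DIMENSION"
    return "FINALIZE"
```

With the case-study scores quoted earlier (layout s=0.32, lexical s=0.52), no dimension reaches 0.60, so this sketch selects a recheck of the suspect layout dimension rather than a debate.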