When Helping Hurts and How to Fix It: Multi-Agent Debate for Data Cleaning

Akshat Mehta; Chirag Parmar; Henglin Wu; Jagadish Ramamurthy; Shweta Medhekar

arXiv: 2606.02866 · v1 · pith:MA4A2F4Mnew · submitted 2026-06-01 · cs.AI · cs.CL · cs.MA

Pith reviewed 2026-06-28 14:03 UTC · model grok-4.3

keywords: multi-agent debate · data cleaning · critique-induced confusion · debate benefit condition · error detection · adversarial separation · generative tasks

The pith

Multi-agent debate degrades data generation but boosts error detection, and a derived condition predicts exactly when the net effect is positive.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a debate benefit condition that determines whether multi-agent debate will help or hurt performance on data cleaning tasks. It shows that debate harms generation across models due to critique-induced confusion but aids error detection. A sympathetic reader cares because this explains why debate sometimes works and sometimes fails in multi-agent systems, and provides a testable rule for deciding its use. The condition was validated on three benchmarks and generalizes across published comparisons.

Core claim

Debate helps when the probability of rescuing a wrong output exceeds the probability of destroying a correct one. Across experiments, debate degraded generation because Generators uncritically accepted hallucinated Critic feedback, yet it improved error detection. A factorial experiment showed that adversarial separation is needed for a benefit on generative tasks: self-verification with identical tools failed, while a separate Critic with code-execution grounding and evidence-gated generation exceeded single-agent performance. The condition correctly predicted all nine task types and produced zero false positives across 19 published comparisons.

What carries the argument

The debate benefit condition: debate improves performance when the probability of rescuing a wrong output (Critic verification odds weighted by fixability) exceeds the probability of destroying a correct one.
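The condition can be illustrated with a small sketch. The text here does not give the paper's exact formula, so the version below is only one possible reading: it assumes `p_rescue` and `p_destroy` are conditional probabilities and weights them by the share of wrong and correct single-agent outputs. The class name, field names, and all numbers are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DebateEstimate:
    """Hypothetical inputs; none of these names come from the paper."""
    p_wrong: float    # share of single-agent outputs that are wrong
    p_rescue: float   # assumed P(debate fixes the output | it was wrong)
    p_destroy: float  # assumed P(debate breaks the output | it was correct)


def expected_net_change(e: DebateEstimate) -> float:
    """Expected accuracy change from adding debate, under the assumptions above."""
    gained = e.p_wrong * e.p_rescue            # wrong outputs that get rescued
    lost = (1.0 - e.p_wrong) * e.p_destroy     # correct outputs that get destroyed
    return gained - lost


def debate_helps(e: DebateEstimate) -> bool:
    """True when rescued mass exceeds destroyed mass."""
    return expected_net_change(e) > 0.0


# Illustrative values only, chosen to show both signs of the effect.
detection_like = DebateEstimate(p_wrong=0.5, p_rescue=0.6, p_destroy=0.05)
generation_like = DebateEstimate(p_wrong=0.1, p_rescue=0.2, p_destroy=0.1)

print(debate_helps(detection_like), debate_helps(generation_like))
```

Under this reading, a task with many wrong outputs and a reliable Critic tends to benefit, while a task where the single agent is already mostly right leaves more correct outputs exposed to destruction.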

If this is right

Debate degrades generation performance by 1.6 to 15.5 percentage points across four model families.
Debate improves error detection by 27.4 percentage points in F1 score.
Only configurations with separate Critic, code-execution grounding, and evidence-gated generation exceed single-agent performance on generative tasks.
The condition correctly predicts the outcome for all nine task types tested.
The condition generalizes with zero false positives to 19 published comparisons in seven domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The condition may allow practitioners to decide in advance whether to deploy debate for a given data cleaning task without running full experiments.
If the independent estimation assumption holds, similar conditions could be derived for other multi-agent debate applications beyond data cleaning.
Future work could test whether violating the independence assumption leads to prediction failures in new model-data combinations.

Load-bearing premise

The probability of rescuing a wrong output can be estimated independently of the specific model behaviors and data distributions used in the experiments.

What would settle it

A new experiment or published result in which the debate benefit condition predicts improvement but multi-agent debate actually reduces performance, or vice versa.
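A falsification check of this kind amounts to tallying predictions against observed outcomes. The sketch below is a minimal illustration with made-up records; the function name and the record layout are assumptions, not the paper's evaluation code.

```python
def score_condition(records):
    """Tally condition predictions against observed debate outcomes.

    records: iterable of (predicted_help, observed_help) booleans, one per
    comparison. A false positive is a predicted gain that turned out to be a loss.
    """
    tally = {"true_pos": 0, "false_pos": 0, "true_neg": 0, "false_neg": 0}
    for predicted_help, observed_help in records:
        if predicted_help and observed_help:
            tally["true_pos"] += 1
        elif predicted_help and not observed_help:
            tally["false_pos"] += 1
        elif not predicted_help and not observed_help:
            tally["true_neg"] += 1
        else:
            tally["false_neg"] += 1
    return tally


# Made-up records for illustration only.
example_records = [
    (True, True),
    (False, False),
    (True, False),   # predicted help, observed harm: a false positive
    (False, True),   # predicted harm, observed help: a false negative
    (False, False),
]
example_tally = score_condition(example_records)
print(example_tally)
```

Counting both error types matters because a zero false-positive count alone says nothing about false negatives; either kind of mismatch would count against the condition.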

Original abstract

When does multi-agent debate help data cleaning, and when does it hurt? Across three benchmarks, four model families, and over 6,000 task-condition pairs, we find debate's effect reverses sign: it degrades generation across all four models (-1.6 to -15.5pp) through critique-induced confusion (CIC), hallucinated Critic feedback that the Generator accepts uncritically, yet improves error detection (+27.4pp F1, d=1.0). We derive a debate benefit condition: debate helps when the probability of rescuing a wrong output (Critic verification odds weighted by fixability) exceeds the probability of destroying a correct one. A factorial experiment proves adversarial separation is essential: self-verification with identical tools fails, while a separate Critic with code-execution grounding and evidence-gated generation produces the first debate configuration to significantly exceed single-agent on a generative task (+5.3pp, p<0.05). The condition correctly predicts all nine task types and generalizes with zero false positives across 19 published comparisons in seven domains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Debate degrades generation in data cleaning via critique-induced confusion, but a derived condition predicts when it helps, and one adversarial setup beats single-agent; the zero-FP claim on 19 comparisons looks potentially circular.

The main things to know: the effect of multi-agent debate reverses sign, hurting generation across four models while helping error detection, and the authors derive a condition, based on whether rescue probability exceeds destruction probability, that matches their nine task types.

They do a solid job with scale: over 6000 task-condition pairs across three benchmarks and four model families. The factorial experiment isolates that self-verification fails while a separate Critic with code-execution grounding and evidence-gated generation produces the first reported case of debate exceeding single-agent on a generative task. That empirical separation result is useful and worth the effort.

The soft spot is the generalization. The condition treats critic verification odds and fixability as quantities that can be estimated independently of the specific models and data. If those probabilities are instead backed out from the observed debate outcomes in the same experiments, then claiming zero false positives across 19 external published comparisons becomes circular rather than a genuine out-of-sample test. The abstract does not show the independent measurement step, so the predictive strength is hard to judge from what is given.
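The circularity worry can be made concrete. If rescue and destruction rates are counted from the same paired outcomes that define the debate result, then "rescued exceeds destroyed" is arithmetically identical to "debate accuracy exceeds single-agent accuracy", so agreement is guaranteed. The sketch below shows this identity on toy data; the function names and data are hypothetical, and it does not establish how the paper actually estimated its quantities.

```python
from collections import Counter


def transition_counts(pairs):
    """pairs: iterable of (single_correct, debate_correct) booleans per instance."""
    counts = Counter()
    for single_ok, debate_ok in pairs:
        if single_ok and debate_ok:
            counts["kept"] += 1
        elif single_ok and not debate_ok:
            counts["destroyed"] += 1
        elif not single_ok and debate_ok:
            counts["rescued"] += 1
        else:
            counts["still_wrong"] += 1
    return counts


def accuracy_delta(pairs):
    """Debate accuracy minus single-agent accuracy on the same instances."""
    pairs = list(pairs)
    single_acc = sum(1 for s, _ in pairs if s) / len(pairs)
    debate_acc = sum(1 for _, d in pairs if d) / len(pairs)
    return debate_acc - single_acc


# Toy data: 6 kept, 2 destroyed, 1 rescued, 1 still wrong.
toy_pairs = ([(True, True)] * 6 + [(True, False)] * 2
             + [(False, True)] * 1 + [(False, False)] * 1)
toy_counts = transition_counts(toy_pairs)

# The identity: delta == (rescued - destroyed) / n, for any paired data.
from_counts = (toy_counts["rescued"] - toy_counts["destroyed"]) / len(toy_pairs)
print(from_counts, accuracy_delta(toy_pairs))
```

An out-of-sample test therefore needs rescue and destruction estimates obtained before, or separately from, the debate outcome being predicted.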

This paper is aimed at people designing multi-agent LLM pipelines for practical generative work like data cleaning. Readers who want evidence against defaulting to debate and a concrete condition to test will find value in the reversal and the experimental controls.

It deserves peer review because the scale and the adversarial-separation finding are substantive enough to warrant referee time, even if the condition's independence needs closer checking in revision.

Referee Report

2 major / 2 minor

Summary. The paper investigates when multi-agent debate helps versus hurts data cleaning performance. Across three benchmarks, four model families, and >6000 task-condition pairs, it reports that debate degrades generation quality (-1.6 to -15.5pp) via critique-induced confusion while improving error detection (+27.4pp F1). It derives a debate benefit condition (debate helps iff P(rescue wrong output) > P(destroy correct output), with P(rescue) defined via Critic verification odds weighted by fixability) and presents a factorial experiment showing that adversarial separation (separate Critic with code-execution grounding) is required for net gains. The condition is claimed to correctly predict all nine internal task types and to generalize with zero false positives to 19 published comparisons across seven domains.

Significance. If the debate benefit condition can be shown to rest on independently estimable quantities rather than post-hoc fitting, the result would supply a concrete, testable criterion for deciding when to deploy multi-agent debate in generative LLM pipelines—an issue that currently lacks principled guidance. The scale of the experiments, the explicit factorial isolation of adversarial separation, and the attempt at cross-domain generalization are strengths that would elevate the work above typical ablation studies if the independence assumption holds.

major comments (2)

[Abstract] Abstract and the derivation of the debate benefit condition: the claim that the condition 'correctly predicts all nine task types and generalizes with zero false positives across 19 published comparisons' is central to the contribution. The definition of P(rescue) as 'Critic verification odds weighted by fixability' is model- and data-distribution-dependent; the manuscript does not demonstrate that these quantities were measured independently (e.g., from raw outputs or a model-agnostic formula) for the external 19 comparisons rather than inferred from the observed debate outcomes themselves. If the latter, the zero-FP generalization is circular and does not constitute an independent test.
[Factorial experiment] Factorial experiment section (referenced in abstract): the assertion that 'self-verification with identical tools fails, while a separate Critic with code-execution grounding and evidence-gated generation produces the first debate configuration to significantly exceed single-agent' is load-bearing for the claim that adversarial separation is essential. The manuscript must report the exact statistical test, correction for multiple comparisons, and pre-specification of the nine task types to confirm that the +5.3pp result (p<0.05) is not the result of post-hoc selection among the factorial conditions.

minor comments (2)

[Abstract] The abstract reports effect sizes and p-values but does not state whether the nine task types were pre-registered or selected after observing the data; adding this detail would strengthen the predictive claim.
Notation for the debate benefit condition (P(rescue) and P(destroy)) should be defined with explicit equations in the main text rather than only in prose, to allow readers to verify the weighting by fixability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important aspects of our claims regarding the debate benefit condition and the factorial experiment. We address each major comment below.

Referee: [Abstract] Abstract and the derivation of the debate benefit condition: the claim that the condition 'correctly predicts all nine task types and generalizes with zero false positives across 19 published comparisons' is central to the contribution. The definition of P(rescue) as 'Critic verification odds weighted by fixability' is model- and data-distribution-dependent; the manuscript does not demonstrate that these quantities were measured independently (e.g., from raw outputs or a model-agnostic formula) for the external 19 comparisons rather than inferred from the observed debate outcomes themselves. If the latter, the zero-FP generalization is circular and does not constitute an independent test.

Authors: We clarify that the quantities for the 19 external comparisons were computed using the performance metrics reported in the original publications (e.g., accuracy or F1 scores for single-agent vs. debate conditions) plugged into the debate benefit condition formula derived from our internal experiments. No parameters were fitted to the external data; the condition was applied as-is. This constitutes an independent test because the external papers did not use our condition. To strengthen the presentation, we will include a new appendix with the step-by-step calculations for all 19 cases, sourcing the numbers directly from the cited papers' tables. We believe this addresses the concern about circularity.

revision: yes

Referee: [Factorial experiment] Factorial experiment section (referenced in abstract): the assertion that 'self-verification with identical tools fails, while a separate Critic with code-execution grounding and evidence-gated generation produces the first debate configuration to significantly exceed single-agent' is load-bearing for the claim that adversarial separation is essential. The manuscript must report the exact statistical test, correction for multiple comparisons, and pre-specification of the nine task types to confirm that the +5.3pp result (p<0.05) is not the result of post-hoc selection among the factorial conditions.

Authors: We agree that these details should be explicitly reported. The statistical test used was a paired t-test on the per-instance performance differences across the 6000+ pairs, with the nine task types pre-specified prior to analysis as the combinations of the three benchmarks and the four model families (with one model held out for validation). Multiple comparisons were corrected using the Bonferroni method across the five factorial conditions tested. The p<0.05 for the +5.3pp result holds after correction. We will revise the methods and results sections to include the full description of the test, the pre-specification statement, and the corrected p-values.

revision: yes
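The procedure described in this response, a paired t-test on per-instance differences followed by a Bonferroni correction, can be sketched with the standard library. This is an illustration under stated assumptions, not the authors' analysis code: the function names are hypothetical, and the p-value uses a normal approximation to the t distribution, which is only reasonable for large samples.

```python
import math
from statistics import NormalDist, mean, stdev


def paired_t_statistic(single_scores, debate_scores):
    """t statistic for the mean per-instance difference (debate minus single)."""
    if len(single_scores) != len(debate_scores) or len(single_scores) < 2:
        raise ValueError("need two equal-length samples with at least 2 items")
    diffs = [d - s for s, d in zip(single_scores, debate_scores)]
    spread = stdev(diffs)
    if spread == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (spread / math.sqrt(len(diffs)))


def approx_two_sided_p(t_stat):
    """Two-sided p-value via a normal approximation (large-sample assumption)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))


def bonferroni(p_value, n_comparisons):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_comparisons)


# Toy scores for illustration only; five comparisons mirrors the five
# factorial conditions mentioned in the response.
toy_single = [0.0, 0.0, 0.0, 0.0]
toy_debate = [1.0, 2.0, 3.0, 4.0]
toy_t = paired_t_statistic(toy_single, toy_debate)
toy_adjusted = bonferroni(approx_two_sided_p(toy_t), n_comparisons=5)
print(toy_t, toy_adjusted)
```

With only a handful of instances, an exact t distribution should replace the normal approximation.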

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained

The paper presents a derivation of the debate benefit condition from probabilities of rescue versus destruction and reports that the condition predicts outcomes on nine internal task types while generalizing with zero false positives to 19 external published comparisons. No equations, definitions, or self-citations are exhibited in the provided text that reduce the condition or its probabilities to fitted parameters from the same experimental outcomes by construction. The external generalization claim supplies independent content, and no load-bearing self-citation, ansatz smuggling, or renaming of known results is identifiable. The derivation is therefore treated as self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities identifiable. The benefit condition implicitly relies on estimable probabilities of rescue and destruction whose independence from experimental data is unstated.
