pith. machine review for the scientific record.

arxiv: 2604.27725 · v1 · submitted 2026-04-30 · 💻 cs.HC · cs.AI

Recognition: unknown

AgentEconomist: An End-to-end Agentic System Translating Economic Intuitions into Executable Computational Experiments

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 07:26 UTC · model grok-4.3

classification 💻 cs.HC cs.AI
keywords agentic systems · computational economics · research ideation · human-AI collaboration · LLM agents · idea generation · knowledge base · experimental design

The pith

AgentEconomist converts economic intuitions into executable computational experiments through a multi-stage agentic workflow grounded in over 13,000 papers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Economists frequently possess strong intuitions about markets, behaviors, or policies yet face barriers in converting those intuitions into verifiable, runnable models. AgentEconomist tackles this by deploying a modular system with three linked stages: first generating hypotheses that draw directly from a curated database of more than 13,000 academic papers, then configuring experimental parameters and protocols to match available simulators, and finally executing the experiments while returning structured results. A human researcher remains in the loop to steer direction and supply feedback at each step. Controlled evaluations by both domain experts and large language models show that the ideas produced score higher on literature grounding, novelty, and insight than ideas produced by standard generic LLMs. The design therefore shifts the researcher's effort from literature synthesis and coding toward high-level conceptual work.
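The paper specifies this pipeline at the architectural level only. As a reading aid, here is a minimal Python sketch of how such a three-stage, human-in-the-loop orchestration could be wired; every name in it — the `kb.search` retrieval call, the three stage functions, the `get_feedback` checkpoint — is a hypothetical stand-in, not the authors' published interface.

```python
# Minimal sketch of the three-stage, human-in-the-loop pipeline as
# described. All names and interfaces (kb.search, the stage functions,
# the get_feedback checkpoint) are hypothetical stand-ins, not the
# authors' published API.

from dataclasses import dataclass

@dataclass
class Hypothesis:
    statement: str
    supporting_papers: list[str]      # retrieved from the ~13,000-paper KB

@dataclass
class ExperimentConfig:
    simulator: str                    # e.g., an agent-based market simulator
    parameters: dict[str, float]
    protocol: str

def develop_idea(intuition: str, kb) -> Hypothesis:
    """Stage 1 (Idea Development): retrieve related work and draft a
    literature-grounded hypothesis (LLM call elided)."""
    papers = kb.search(intuition, top_k=10)
    return Hypothesis(statement=f"H: {intuition}", supporting_papers=papers)

def design_experiment(h: Hypothesis) -> ExperimentConfig:
    """Stage 2 (Experimental Design): map the hypothesis onto
    simulator-aligned parameters and a protocol."""
    return ExperimentConfig(simulator="abm",
                            parameters={"n_agents": 1000.0},
                            protocol="baseline vs. treatment")

def execute(cfg: ExperimentConfig) -> dict:
    """Stage 3 (Experimental Execution): run the simulator and return
    structured results (simulation elided)."""
    return {"config": cfg, "outcome": None}

def run(intuition: str, kb, get_feedback) -> dict:
    """Outer loop: a human checkpoint reviews each pass; a rejection
    revises the intuition and reruns the stages instead of restarting."""
    while True:
        h = develop_idea(intuition, kb)
        cfg = design_experiment(h)
        results = execute(cfg)
        verdict = get_feedback(h, cfg, results)   # human in the loop
        if verdict == "accept":
            return results
        intuition = verdict                       # revised intuition text
```

The structural point the sketch makes concrete: the human checkpoint sits outside all three stages, which is what lets feedback refine a run rather than restart it.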

Core claim

AgentEconomist is an end-to-end interactive system that translates abstract economic intuitions into executable computational experiments via a three-stage modular architecture. The Idea Development Stage produces literature-grounded hypotheses from a knowledge base of more than 13,000 high-quality papers; the Experimental Design Stage configures simulator-aligned parameters and protocols; and the Experimental Execution Stage runs the experiments and returns structured analyses. The stages operate in a human-in-the-loop iterative workflow. Evaluations by human experts and LLMs as judges demonstrate that the system yields ideas with stronger literature grounding and higher novelty and insight than those from state-of-the-art generic LLMs.

What carries the argument

Three-stage modular agentic architecture consisting of Idea Development (literature-grounded hypothesis generation from a 13,000-paper knowledge base), Experimental Design (simulator-aligned parameter and protocol configuration), and Experimental Execution (running experiments and producing structured analyses), all coordinated through an iterative human-in-the-loop workflow.

If this is right

  • Economists can focus effort on high-level intuitions while the system automates literature synthesis, parameter setting, and execution.
  • Generated ideas are more likely to build on rather than duplicate existing work because of the explicit grounding stage.
  • The workflow produces structured, reproducible experimental outputs that can be directly inspected or iterated upon.
  • The same staged architecture supplies a template for automating idea-to-experiment translation in other data-rich domains.
  • Iterative human feedback during the loop allows refinement without restarting the entire process from scratch.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending the knowledge base with newer papers or cross-disciplinary sources would likely improve performance on emerging topics such as digital economies or climate policy.
  • The system could be paired with live economic data feeds to turn static experiments into rolling forecasts.
  • Replacing or supplementing LLM judges with outcome-based metrics, such as whether generated ideas later appear in peer-reviewed publications, would provide a stronger external check.
  • Similar agentic pipelines might shorten the time from initial intuition to first testable result in fields like psychology or public health where intuition-to-model translation is also laborious.

Load-bearing premise

The assumption that human experts and LLM judges can assess the true literature grounding, novelty, and insight of generated ideas accurately and without bias, and that the 13,000-paper knowledge base is sufficiently comprehensive across economic topics.

What would settle it

A blinded evaluation in which a new panel of economists rates a matched set of ideas generated by AgentEconomist versus generic LLMs on the same criteria, or an independent test showing that the generated hypotheses systematically fail to predict outcomes when implemented on real economic datasets.
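As a sketch of the first option, a blinded comparison of matched idea sets could be mechanized roughly as follows; the pairing structure and the `rate` callback are illustrative assumptions, not a protocol from the paper.

```python
# Sketch of a blinded rating pass over matched idea sets. Raters see
# only shuffled, unlabeled idea text; system labels are re-attached
# after scoring. The `rate` callback (a human panel or survey form)
# is an assumed interface.

import random
from statistics import mean

def blinded_ratings(pairs, rate):
    """pairs: list of (agenteconomist_idea, generic_llm_idea) strings.
    rate: callback taking anonymized idea text, returning a 1-5 score."""
    items = [(label, text)
             for a, b in pairs
             for label, text in (("AgentEconomist", a), ("generic LLM", b))]
    random.shuffle(items)                        # blind presentation order
    scored = [(label, rate(text)) for label, text in items]
    return {label: mean(s for lab, s in scored if lab == label)
            for label in ("AgentEconomist", "generic LLM")}
```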

Figures

Figures reproduced from arXiv: 2604.27725 by Jiaju Chen, Jinghua Piao, Songwei Li, Tong Xia, Xiangnan He, Xia Xu, Yong Li.

Figure 1. Comparison of economic research workflows.
Figure 2. Overview of the AgentEconomist framework. The system structures the intuition-to-experiment process into three stages.
Figure 3. Detailed front-end component view. The interface combines a research canvas with workflow tabs and a copilot panel.
Figure 4. The interactive research workflow of AgentEconomist, illustrating the end-to-end process from intuition input to experimental execution.
Figure 5. Distribution of hypothesis quality scores across eight evaluation dimensions. Top: LLM-based anonymous referee scores.
Figure 6. Income and consumption dynamics under innovation-support policies.
Figure 7. Template of the user study questionnaire used for quantitative and qualitative evaluation, including scoring scope.
Figure 8. Prompt template for LLM-based hypothesis-quality judging (anonymous economics referee).
Figure 9. Prompt template for LLM-assisted grounded-theory thematic analysis of participants' open-ended feedback.
Original abstract

A long-standing challenge in economics lies not in the lack of intuition, but in the difficulty of translating intuitive insights into verifiable research. To address this challenge, we introduce AgentEconomist, an end-to-end interactive system designed to translate abstract intuitions into executable computational experiments. Grounded in a domain-specific knowledge base covering over 13,000 high-quality academic papers, the system employs a modular multi-stage architecture. Specifically, the Idea Development Stage generates literature-grounded hypotheses, the Experimental Design Stage configures simulator-aligned experimental parameters and protocols, and the Experimental Execution Stage runs experiments and returns structured analyses. Together, these stages form a human-in-the-loop, iterative workflow that translates economic intuitions into executable computational experiments. Through extensive experiments involving human expert evaluation and large language models (LLMs) as judges, we show that the system generates research ideas with stronger literature grounding and higher novelty and insight than state-of-the-art generic LLMs. Overall, AgentEconomist adopts a human-AI collaboration paradigm that enables researchers to focus on high-level intuitions, while delegating the labor-intensive processes of translation and computational execution to agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces AgentEconomist, an end-to-end interactive agentic system that translates abstract economic intuitions into executable computational experiments. It employs a modular three-stage architecture (Idea Development for generating literature-grounded hypotheses, Experimental Design for configuring simulator-aligned parameters, and Experimental Execution for running and analyzing experiments) grounded in a knowledge base of over 13,000 high-quality academic papers. The system supports a human-in-the-loop iterative workflow, and the authors claim through human expert evaluations and LLM-as-judge experiments that it produces research ideas with stronger literature grounding, higher novelty, and greater insight than state-of-the-art generic LLMs.

Significance. If the empirical claims hold under rigorous scrutiny, this work could meaningfully advance human-AI collaboration paradigms in economics and HCI by automating labor-intensive translation steps while keeping researchers in control of high-level intuitions. The domain-specific KB integration and staged agentic design offer a concrete example of how retrieval-augmented agents might reduce barriers to computational verification in social sciences. However, the significance is currently difficult to assess because the evaluation methodology lacks the transparency needed to confirm that observed advantages stem from the proposed architecture rather than uncontrolled factors.

major comments (3)
  1. Abstract and Evaluation sections: The manuscript reports 'extensive experiments' with human experts and LLMs as judges showing superiority in literature grounding, novelty, and insight, but provides no information on experimental protocols, exact metrics, baselines, sample sizes, number of ideas evaluated, or controls for bias in LLM judging. These omissions are load-bearing because they prevent independent assessment of whether the central superiority claim is supported.
  2. Idea Development Stage (as described in the abstract): The system explicitly incorporates retrieval and grounding from the 13,000-paper KB to generate hypotheses. The comparison to generic LLMs does not grant those baselines equivalent access to the same specialized KB. Consequently, any metric of 'literature grounding' (e.g., citation relevance or coverage) will systematically favor AgentEconomist by construction, confounding the contribution of the agentic workflow with the mere presence of domain-specific retrieval.
  3. Evaluation (LLM-as-judge component): The paper relies on LLMs to judge outputs from an LLM-powered system without discussing or mitigating the risk of circularity. Judges may favor outputs that align with their own training distributions or capabilities, yet no inter-judge agreement statistics, human-LLM correlation analysis, or debiasing steps are reported. This directly undermines confidence in the novelty and insight results.
minor comments (2)
  1. The abstract and system description would benefit from explicit definitions or operationalizations of 'novelty' and 'insight' used in the human and LLM evaluations to improve reproducibility.
  2. Figure or diagram clarity: If the manuscript includes an architecture diagram for the three stages, ensure it clearly distinguishes the human-in-the-loop points from fully automated components.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which identifies key areas where our evaluation methodology requires greater transparency and rigor. We address each major comment point by point below and describe the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: Abstract and Evaluation sections: The manuscript reports 'extensive experiments' with human experts and LLMs as judges showing superiority in literature grounding, novelty, and insight, but provides no information on experimental protocols, exact metrics, baselines, sample sizes, number of ideas evaluated, or controls for bias in LLM judging. These omissions are load-bearing because they prevent independent assessment of whether the central superiority claim is supported.

    Authors: We agree that the current manuscript lacks sufficient detail on the evaluation protocols, which limits independent verification. In the revised version, we will substantially expand the Evaluation section to include: the exact number of ideas generated and evaluated (with sample sizes for both human experts and LLM judges), the full experimental protocols (including how ideas were presented and randomized), precise definitions and rubrics for the metrics of literature grounding, novelty, and insight, the specific baseline models used, and any bias controls such as blinded evaluation and standardized prompts. We will also update the abstract to reference these details more precisely. revision: yes

  2. Referee: Idea Development Stage (as described in the abstract): The system explicitly incorporates retrieval and grounding from the 13,000-paper KB to generate hypotheses. The comparison to generic LLMs does not grant those baselines equivalent access to the same specialized KB. Consequently, any metric of 'literature grounding' (e.g., citation relevance or coverage) will systematically favor AgentEconomist by construction, confounding the contribution of the agentic workflow with the mere presence of domain-specific retrieval.

    Authors: We acknowledge that this is a substantive concern about experimental controls. The primary intent of the comparison is to contrast the full AgentEconomist pipeline (including KB integration and staged agentic reasoning) against standard LLM usage without domain-specific augmentation. However, to isolate the agentic workflow's contribution from retrieval alone, we will add a new baseline condition in the revised manuscript: a retrieval-augmented generic LLM using the same 13,000-paper KB (via RAG-style prompting) but without the multi-stage architecture or human-in-the-loop elements (a sketch of this condition follows these responses). We will report comparative results and add discussion clarifying that the system's value lies in the integrated end-to-end design rather than retrieval in isolation. revision: yes

  3. Referee: Evaluation (LLM-as-judge component): The paper relies on LLMs to judge outputs from an LLM-powered system without discussing or mitigating the risk of circularity. Judges may favor outputs that align with their own training distributions or capabilities, yet no inter-judge agreement statistics, human-LLM correlation analysis, or debiasing steps are reported. This directly undermines confidence in the novelty and insight results.

    Authors: We recognize the validity of concerns about potential circularity and bias in LLM-as-judge evaluations. In the revised manuscript, we will add a dedicated subsection on evaluation methodology that: discusses the risk of circularity, reports inter-judge agreement statistics (e.g., using multiple distinct LLM judges and computing agreement metrics such as Cohen's kappa), includes correlation analysis between LLM judgments and human expert ratings on a held-out subset of ideas, and details mitigation steps such as using judge models from different providers, structured rubrics with explicit criteria, and chain-of-thought prompting for evaluations. We will emphasize that human expert evaluations remain the primary source of evidence. revision: yes
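The ablation baseline promised in response 2 is easy to pin down in code. A minimal sketch, assuming hypothetical `kb.search` and `llm.complete` interfaces and an illustrative prompt: same knowledge base, one prompt, no stages, no human loop.

```python
# Sketch of the response-2 baseline: a generic LLM given the same
# 13,000-paper KB via plain RAG prompting, with no multi-stage
# workflow and no human in the loop. kb.search and llm.complete are
# assumed interfaces; the prompt wording is illustrative only.

def rag_baseline_idea(intuition: str, kb, llm, top_k: int = 10) -> str:
    papers = kb.search(intuition, top_k=top_k)   # same KB as the full system
    context = "\n".join(f"- {p}" for p in papers)
    prompt = (
        "You are an economist. Related literature:\n"
        f"{context}\n\n"
        f"Intuition: {intuition}\n"
        "Propose one novel, literature-grounded, testable hypothesis."
    )
    return llm.complete(prompt)                  # single shot: no stages, no loop
```

If AgentEconomist still wins against this condition, the staged workflow carries weight; if the gap closes, the grounding advantage was the retrieval, which is exactly the referee's point.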
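Response 3 commits to agreement statistics such as Cohen's kappa. For two judges labeling the same items, kappa is the observed agreement corrected for the agreement expected from the judges' marginal label frequencies; a self-contained sketch:

```python
# Cohen's kappa for two judges scoring the same items (here, 1-5
# hypothesis ratings treated as categories). Values near 0 indicate
# chance-level agreement; 1.0 is perfect agreement.

from collections import Counter

def cohens_kappa(judge1, judge2):
    n = len(judge1)
    p_o = sum(a == b for a, b in zip(judge1, judge2)) / n   # observed agreement
    c1, c2 = Counter(judge1), Counter(judge2)
    p_e = sum((c1[lab] / n) * (c2[lab] / n)                 # chance agreement
              for lab in set(c1) | set(c2))
    return (p_o - p_e) / (1 - p_e)

# Example: two LLM judges rating six hypotheses.
print(cohens_kappa([4, 5, 3, 4, 2, 5], [4, 4, 3, 5, 2, 5]))  # ~0.54
```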

Circularity Check

1 step flagged

Literature grounding superiority reduces to exclusive KB access by construction

specific steps
  1. self-definitional [Abstract]
    "Grounded in a domain-specific knowledge base covering over 13,000 high-quality academic papers, the system employs a modular multi-stage architecture. Specifically, the Idea Development Stage generates literature-grounded hypotheses... Through extensive experiments involving human expert evaluation and large language models (LLMs) as judges, we show that the system generates research ideas with stronger literature grounding and higher novelty and insight than state-of-the-art generic LLMs."

    The Idea Development Stage uses retrieval from the 13k-paper KB to produce 'literature-grounded hypotheses.' Generic LLMs have no such access. Therefore the reported 'stronger literature grounding' is definitionally equivalent to the presence of the KB rather than an independent outcome of the multi-stage agentic workflow.

full rationale

The paper's central claim rests on experiments showing superior literature grounding versus generic LLMs. However, the system architecture explicitly incorporates retrieval from the 13,000-paper KB during Idea Development, while the baseline generic LLMs lack this mechanism. Any grounding metric therefore favors the system tautologically. This matches the self-definitional pattern: the claimed output (stronger grounding) is equivalent to the distinguishing input (KB access). Novelty and insight metrics are less directly affected, and no equations or self-citation chains are involved, so the circularity is partial rather than total. LLM judges add evaluation risk but are not the load-bearing reduction here.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on assumptions about LLM capabilities for specialized economic reasoning and the representativeness of the curated paper database; no free parameters or new entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption Large language models can reliably generate literature-grounded hypotheses and configure simulator-aligned experiments when given access to a domain-specific knowledge base of economics papers.
    This underpins the Idea Development and Experimental Design stages described in the abstract.

pith-pipeline@v0.9.0 · 5516 in / 1463 out tokens · 82440 ms · 2026-05-07T07:26:07.827434+00:00 · methodology

