When Does Critique Improve AI-Assisted Theoretical Physics? SCALAR: Structured Critic--Actor Loop for Agentic Reasoning
Pith reviewed 2026-05-11 00:55 UTC · model grok-4.3
The pith
Multi-turn critic feedback improves AI solutions to theoretical physics problems, but how much it helps, and through which mechanism, depends on the actor-critic pairing.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using the SCALAR framework on a set of theoretical physics problems, the study finds that multi-turn interactions where a critic provides feedback to an actor produce better solutions than single attempts. The mechanism of improvement and the effectiveness of different feedback strategies vary significantly with the specific actor-critic pairing: constructive feedback helps most when a smaller model is guided by a larger one, while same-family pairings show less sensitivity to strategy and sometimes favor lenient feedback. Increasing model scale within a family improves performance on easier problems but leaves the most challenging cases as bottlenecks.
What carries the argument
The Structured Critic-Actor Loop (SCALAR) pipeline, consisting of an actor model that proposes solutions, a critic model that supplies iterative feedback, and an independent judge that scores the results against reference solutions.
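For concreteness, below is a minimal sketch of what such an Actor-Critic-Judge loop looks like. The function names, prompt wording, and round count are illustrative assumptions, not the paper's implementation; `actor`, `critic`, and `judge` stand in for any text-completion callables.

```python
# Minimal sketch of a Critic-Actor-Judge loop (illustrative; not the paper's code).
# `actor`, `critic`, and `judge` are assumed to be callables mapping a prompt
# string to a text completion, e.g. thin wrappers around an LLM API.

def scalar_loop(problem, reference, actor, critic, judge, n_rounds=3):
    transcript = []

    # Actor's initial single-shot attempt.
    solution = actor(f"Solve the following problem:\n{problem}")
    transcript.append(("actor", solution))

    # Multi-turn refinement: the Critic comments, the Actor revises.
    for _ in range(n_rounds):
        feedback = critic(
            "Give constructive feedback on this attempted solution.\n"
            f"Problem:\n{problem}\n\nAttempt:\n{solution}"
        )
        transcript.append(("critic", feedback))

        solution = actor(
            "Revise your solution in light of the feedback.\n"
            f"Problem:\n{problem}\n\nPrevious attempt:\n{solution}\n\nFeedback:\n{feedback}"
        )
        transcript.append(("actor", solution))

    # Independent Judge scores the whole transcript against the reference solution.
    score = judge(
        "Score the final solution against the reference (0-10); reply with a number only.\n"
        f"Reference:\n{reference}\n\nTranscript:\n{transcript}"
    )
    return transcript, score
```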
Load-bearing premise
The LLM judge can score the correctness and quality of solutions to advanced theoretical physics problems against the reference solutions without systematic bias or error.
What would settle it
A direct comparison in which domain experts manually grade the AI transcripts; the premise fails if the experts' rankings of the different SCALAR variants differ from those produced by the automated judge.
Original abstract
As large language models (LLMs) show increasing promise on research-level physics reasoning tasks and agentic AI becomes more common, a practical question emerges: How does the interaction between researchers and agents affect the results? We study this using SCALAR (Structured Critic--Actor Loop for AI Reasoning), an Actor--Critic--Judge pipeline applied to quantum field theory and string theory problems. The Actor proposes solutions, the Critic provides iterative feedback, and an independent Judge evaluates the transcript against reference solutions. We vary the Actor persona, the Critic feedback strategy, and the Actor model family and scale. Multi-turn dialogue improves over single-shot attempts throughout, but both the mechanism of improvement and the value of different prompting choices depend strongly on the Actor--Critic pairing. Increasing the scale within one model family (e.g. from the 8B-parameter DeepSeek-R1 variant to DeepSeek-R1 70B) improves some easier-problem behavior, but does not remove the hardest bottleneck we observe. Critic feedback strategy matters most clearly in the asymmetric Actor--Critic setting (e.g., a lightweight Haiku Actor guided by a stronger Sonnet Critic), where constructive feedback improves mean-score outcomes. In same-family Actor--Critic settings, strategy effects are weaker: lenient feedback is sometimes favored, while strict and adversarial feedback are not beneficial. Taken together, SCALAR provides a controlled testbed for evaluating which interaction structures help or hinder AI-driven scientific discovery.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SCALAR, a Structured Critic-Actor Loop pipeline consisting of an Actor LLM that proposes solutions to quantum field theory and string theory problems, a Critic LLM that provides iterative feedback, and an independent Judge LLM that scores the resulting transcripts against reference solutions. Through controlled variations in Actor persona, Critic feedback strategy (constructive, lenient, strict, adversarial), model family, and scale, the work claims that multi-turn Actor-Critic dialogue consistently outperforms single-shot attempts, but that both the mechanism of improvement and the utility of specific prompting choices depend strongly on whether the Actor-Critic pair is asymmetric or same-family. Increasing scale within a family helps on easier problems but leaves the hardest bottlenecks intact; constructive feedback is most beneficial in asymmetric settings, while strategy effects weaken in matched-family cases. The study positions SCALAR as a testbed for identifying interaction structures that aid or hinder AI-driven theoretical physics.
Significance. If the core empirical patterns hold under validated evaluation, the work supplies a useful controlled experimental framework for dissecting how critique loops interact with model capabilities in research-level physics reasoning. It supplies concrete evidence that pairing asymmetry and feedback style are first-order variables rather than secondary, which could inform the design of future agentic systems for scientific discovery. The systematic variation across model scales and strategies is a strength, as is the focus on realistic theoretical-physics tasks rather than toy benchmarks.
major comments (2)
- Abstract and Evaluation section: All quantitative claims (mean-score improvements, strategy rankings, and pairing-dependent effects) rest exclusively on scores produced by an independent LLM Judge that compares transcripts to reference solutions. No human-expert calibration, inter-rater agreement statistics, or correlation analysis with domain specialists is described. For research-level QFT and string theory problems, where correctness often hinges on subtle conceptual gaps or non-obvious consistency checks, this introduces a high risk that the Judge systematically favors or penalizes particular error types or stylistic features, rendering the reported dependence on Actor-Critic pairings potentially an artifact of the evaluator rather than a robust finding about agentic reasoning.
- Results section (strategy comparisons): The claim that 'Critic feedback strategy matters most clearly in the asymmetric Actor-Critic setting' is load-bearing for the practical takeaway, yet the manuscript provides no effect-size tables, statistical tests, or confidence intervals showing the magnitude and reliability of the constructive-feedback advantage over lenient/strict/adversarial variants. Without these, it is impossible to judge whether the observed differences are practically meaningful or merely suggestive.
minor comments (2)
- Abstract: The summary states that multi-turn dialogue 'improves over single-shot attempts throughout' but supplies no numerical deltas, number of problems, or variance measures. Adding one or two concrete performance figures would help readers gauge the scale of the reported gains.
- Methods: The description of the Judge prompt and scoring rubric is not reproduced in sufficient detail to allow exact replication of the evaluation protocol. Including the exact Judge template and any temperature or few-shot settings would strengthen reproducibility.
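To make the reproducibility point concrete, the following is a hypothetical example of the kind of Judge configuration such an appendix could specify; every value here (model name, temperature, token limit, prompt wording, scale) is a placeholder assumption, not a setting taken from the paper.

```python
# Hypothetical Judge configuration for a replication appendix; all values are
# placeholders rather than the paper's actual settings.
judge_config = {
    "model": "<judge-model-name>",
    "temperature": 0.0,          # deterministic scoring is a common choice
    "max_tokens": 1024,
    "few_shot_examples": [],     # e.g. (transcript, reference, score) triples, if any
    "score_scale": (0, 10),
    "system_prompt": (
        "You are grading a physics solution transcript against a reference solution. "
        "Judge factual and mathematical correctness, not style, and return a single "
        "numeric score on the stated scale with a brief justification."
    ),
}
```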
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive feedback, which identifies key areas for strengthening the evaluation methodology and statistical presentation in our SCALAR study. We address each major comment below and commit to revisions that improve transparency and rigor without altering the core empirical findings.
Point-by-point responses
Referee: Abstract and Evaluation section: All quantitative claims (mean-score improvements, strategy rankings, and pairing-dependent effects) rest exclusively on scores produced by an independent LLM Judge that compares transcripts to reference solutions. No human-expert calibration, inter-rater agreement statistics, or correlation analysis with domain specialists is described. For research-level QFT and string theory problems, where correctness often hinges on subtle conceptual gaps or non-obvious consistency checks, this introduces a high risk that the Judge systematically favors or penalizes particular error types or stylistic features, rendering the reported dependence on Actor-Critic pairings potentially an artifact of the evaluator rather than a robust finding about agentic reasoning.
Authors: We agree this is a substantive limitation. While the Judge was prompted to evaluate factual alignment with reference solutions rather than stylistic features, and while the fixed Judge ensures relative comparisons across all experimental conditions are internally consistent, the absence of human calibration leaves open the possibility of systematic bias on nuanced physics content. In the revised manuscript we will add an explicit subsection to the Evaluation section that (i) details the Judge prompt and its focus on correctness, (ii) discusses known risks of LLM evaluators on research-level tasks, and (iii) reports a new small-scale human validation: a domain expert will independently score a stratified subset of 30 transcripts (balanced across difficulty, strategy, and pairing type), and we will compute Pearson/Spearman correlations and agreement statistics with the Judge scores (a minimal sketch of such an analysis appears after these responses). These results will be presented alongside the main findings.
Revision: yes
Referee: Results section (strategy comparisons): The claim that 'Critic feedback strategy matters most clearly in the asymmetric Actor-Critic setting' is load-bearing for the practical takeaway, yet the manuscript provides no effect-size tables, statistical tests, or confidence intervals showing the magnitude and reliability of the constructive-feedback advantage over lenient/strict/adversarial variants. Without these, it is impossible to judge whether the observed differences are practically meaningful or merely suggestive.
Authors: We accept that the current reporting of mean scores alone is insufficient to substantiate the differential effect of feedback strategy. The underlying per-transcript scores are available and permit formal analysis. In the revision we will expand the Results section with (i) tables reporting mean scores, standard deviations, 95% confidence intervals, and effect sizes (Cohen's d for pairwise contrasts and partial eta-squared for multi-strategy comparisons), (ii) statistical tests (paired Wilcoxon signed-rank tests within each pairing type, with Bonferroni correction, and a mixed ANOVA testing the Strategy × Pairing interaction), and (iii) explicit discussion of whether the constructive-feedback advantage in asymmetric settings reaches conventional significance thresholds. These additions will be placed immediately after the existing mean-score figures.
Revision: yes
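A minimal sketch of the analyses proposed in the two responses above, assuming per-transcript scores are available as arrays. The numeric values, the constructive-versus-lenient contrast, and the six-way Bonferroni correction are placeholder assumptions for illustration, not results from the study.

```python
"""Illustrative analysis sketch (placeholder data, not the study's results)."""
import numpy as np
from scipy.stats import pearsonr, spearmanr, wilcoxon

# --- Judge vs. human-expert calibration on a validation subset -------------
judge_scores  = np.array([7.0, 4.5, 8.0, 6.0, 3.5, 9.0])   # LLM Judge scores
expert_scores = np.array([6.5, 5.0, 8.5, 5.5, 4.0, 8.5])   # human grades, same transcripts

r, r_p = pearsonr(judge_scores, expert_scores)
rho, rho_p = spearmanr(judge_scores, expert_scores)
print(f"Pearson r = {r:.2f} (p = {r_p:.3f}); Spearman rho = {rho:.2f} (p = {rho_p:.3f})")

# --- Strategy contrast: constructive vs. lenient feedback, paired by problem ---
constructive = np.array([7.5, 6.0, 8.0, 5.5, 7.0, 6.5])
lenient      = np.array([6.4, 5.6, 7.0, 5.3, 6.5, 6.2])

stat, p = wilcoxon(constructive, lenient)       # paired, non-parametric test
n_contrasts = 6                                 # e.g. all pairwise strategy contrasts
p_bonferroni = min(1.0, p * n_contrasts)        # family-wise error correction

diff = constructive - lenient
cohens_d = diff.mean() / diff.std(ddof=1)       # paired-samples effect size
print(f"Wilcoxon p = {p:.3f} (Bonferroni-adjusted {p_bonferroni:.3f}); Cohen's d = {cohens_d:.2f}")
```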
Circularity Check
No circularity: empirical experimental study with direct observations
Full rationale
The paper reports results from controlled experiments running the SCALAR Actor-Critic-Judge pipeline on QFT/string theory problems. All claims rest on measured differences in mean Judge scores across variations in models, personas, and feedback strategies. No mathematical derivations, parameter fitting, self-citations, or ansatzes are present in the load-bearing steps. The Judge compares transcripts to external reference solutions; this is an evaluation protocol, not a self-referential reduction. The setup is self-contained against the observed experimental outcomes and does not reduce any result to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The quality of AI-generated solutions to physics problems can be objectively scored by comparing to reference solutions using an LLM judge.
invented entities (1)
- SCALAR (Structured Critic--Actor Loop for AI Reasoning): no independent evidence