Caesar: Deep Agentic Web Exploration for Creative Answer Synthesis

Elliot Meyerson; Jason Liang; Risto Miikkulainen

arxiv: 2604.20855 · v3 · submitted 2026-02-24 · 💻 cs.IR · cs.MA

Pith reviewed 2026-05-15 19:53 UTC · model grok-4.3

keywords agentic web exploration · dynamic knowledge graph · adversarial refinement · creative synthesis · novelty in answers · deep research agents · information retrieval · autonomous agents

The pith

Caesar builds a dynamic knowledge graph through deep web traversal and uses adversarial refinement to synthesize answers with higher novelty and structural coherence than existing agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Caesar as an agentic architecture that moves beyond flat web retrieval by constructing a dynamic knowledge graph during exploration. This graph guides the agent toward diverse, non-obvious connections across the web's structure, while adversarial refinement during synthesis actively seeks novel perspectives instead of confirming prior knowledge. If correct, the approach would enable autonomous systems to produce original artifacts and answers rather than derivative summaries, with measured gains of 13 to 23 percent over current deep research agents across output formats. A reader would care because it targets the core limitation of convergent search in agentic frameworks, potentially unlocking more creative uses of web-scale information.

Core claim

Caesar performs deep web traversal driven by a context-aware policy to build a dynamic knowledge graph that serves as a navigational scaffold. This graph maximizes information coverage by revealing connections that flat retrieval misses. Synthesis then occurs through adversarial refinement that seeks novel perspectives. The result is the generation of artifacts and answers with high novelty and structural coherence, delivering 13 to 23 percent improvement over state-of-the-art agents in creative synthesis challenges and strong performance across all tested output formats.
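A minimal sketch of the traversal described above, assuming a toy in-memory link map in place of real page fetches and a fixed rule in place of the context-aware policy. `TOY_WEB`, `explore`, and the prefer-unvisited rule are illustrative choices, not taken from the paper; the EXPLORE and BACKTRACK labels follow the action names that appear in the paper's extracted exploration prompts.

```python
from collections import defaultdict

# Hypothetical toy "web": page -> outgoing links. A real agent would fetch pages.
TOY_WEB = {
    "root": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": [],
    "d": ["root"],
}

def explore(web, start, max_steps=20):
    """Traverse `web` from `start`; return (graph, depth) for the visited pages."""
    graph = defaultdict(set)   # knowledge graph: page -> pages first discovered from it
    depth = {start: 0}         # exploration depth at which each page was first reached
    path = [start]             # current trail, used for backtracking
    for _ in range(max_steps):
        current = path[-1]
        # "Think": stand-in policy (an assumption) that prefers any unvisited link.
        unvisited = [p for p in web.get(current, []) if p not in depth]
        if unvisited:
            # "Act": EXPLORE a new page and record the edge in the graph.
            nxt = unvisited[0]
            graph[current].add(nxt)
            depth[nxt] = depth[current] + 1
            path.append(nxt)
        elif len(path) > 1:
            # "Act": BACKTRACK to the previously visited page.
            path.pop()
        else:
            break  # back at the root with nothing left to explore
    return dict(graph), depth

graph, depth = explore(TOY_WEB, "root")
print(sorted(depth.items()))
```

The graph records only first-discovery edges, so it is a tree rooted at the start page; a fuller version would also keep cross-links between already visited pages.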

What carries the argument

Dynamic knowledge graph from context-aware web traversal policy that acts as navigational scaffold, paired with adversarial refinement for synthesis

If this is right

Produces creative answers and artifacts with measurably higher novelty and structural coherence
Delivers 13 to 23 percent gains over state-of-the-art deep research agents on synthesis benchmarks
Maintains dominance across multiple output formats including text, structured data, and hybrid artifacts
Shifts agent behavior from convergent search to associative synthesis of new ideas
Bridges information gathering directly to insight generation without intermediate derivative summaries

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The graph-based scaffold could be tested in domains like hypothesis generation in science, where non-obvious cross-domain links matter most.
Future agents might combine this traversal with memory mechanisms to handle even longer exploration horizons without losing coherence.
The adversarial component suggests a general pattern for countering confirmation bias in any retrieval-augmented generation system.
Scaling the policy to specialized subgraphs, such as academic paper networks, could be a direct next step for domain-specific creativity tasks.

Load-bearing premise

A dynamic knowledge graph built from web traversal will reliably surface diverse non-obvious information missed by flat retrieval, and adversarial refinement will produce genuinely novel insights rather than rephrased outputs.

What would settle it

Blind expert evaluation of novelty and coherence on identical creative prompts, where Caesar-generated answers show no statistically significant improvement over flat-retrieval baselines, would falsify the core performance claim.
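One way such a comparison could be scored is a paired sign-flip permutation test over per-prompt ratings. This is a sketch only: the choice of test, the function name `paired_permutation_test`, and the ratings below are assumptions for illustration, not the paper's procedure or data.

```python
import random

def paired_permutation_test(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired score differences."""
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis each difference is equally likely to have either sign.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed - 1e-12:
            extreme += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_resamples + 1)

# Hypothetical blind ratings for the same eight prompts (not data from the paper).
system_ratings = [8, 7, 9, 6, 8, 7, 9, 8]
baseline_ratings = [6, 7, 7, 5, 6, 6, 8, 6]
p_value = paired_permutation_test(system_ratings, baseline_ratings)
print(f"p = {p_value:.4f}")
```

A large p-value on real blind ratings would be the "no statistically significant improvement" outcome described above.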

Figures

Figures reproduced from arXiv: 2604.20855 by Elliot Meyerson, Jason Liang, Risto Miikkulainen.

**Figure 1.** Visualization of the Caesar architecture. (Left) Phase 1: Deep Web Exploration. A dynamic exploration policy controls a three-stage loop (Perceive, Think, Act) to traverse the web and to build a knowledge graph/database from insights. (Right) Phase 2: Adversarial Artifact Synthesis. Insights are retrieved to synthesize an initial draft. The agent then enters a recursive cycle, critiquing the current draft …

**Figure 2.** The knowledge graphs G created by Caesar during the deep web exploration phase for each of the five challenges. Brighter colors indicate further exploration depth from the root node (red) while cyan nodes indicate sources cited by the final artifact text. These figures show that the semantic content of the challenge has a substantial impact on exploration strategy and the diversity of network topologies ge…

**Figure 3.** Evolution of the knowledge graph G for Challenge 5 over 1000 steps. Brighter colors indicate further exploration depth from the root node (red). The figures show a transition from initial depth-first search to breadth-first branching later. The lower right contains a t-SNE [Maaten and Hinton, 2008] plot for node text embeddings in G that shows the diversity of insights collected during Caesar’s exploration…
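Phase 2 in Figure 1 can be read as a draft, critique, revise loop. A rough sketch under that reading, with plain Python callables standing in for the LLM calls: `refine`, `toy_critic`, `toy_reviser`, and the rule of stopping when the critic returns nothing are all hypothetical, and the caption is cut off before it says what follows the critique step.

```python
from typing import Callable, List, Tuple

def refine(draft: str,
           critique: Callable[[str], str],
           revise: Callable[[str, str], str],
           max_rounds: int = 3) -> Tuple[str, List[str]]:
    """Alternate critique and revision until the critic is silent or rounds run out."""
    history = [draft]
    for _ in range(max_rounds):
        feedback = critique(draft)
        if not feedback:
            break  # assumed stopping rule: no remaining objections
        draft = revise(draft, feedback)
        history.append(draft)
    return draft, history

def toy_critic(text: str) -> str:
    """Stand-in adversary: objects until the draft mentions a counterexample."""
    return "" if "counterexample" in text.lower() else "Address a counterexample."

def toy_reviser(text: str, feedback: str) -> str:
    """Stand-in reviser: appends a fixed sentence regardless of the feedback."""
    return text + " Counterexample considered."

final, history = refine("Initial claim.", toy_critic, toy_reviser)
print(final)
```

Capping the loop with `max_rounds` guards against a critic that never stops objecting.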

Original abstract

To advance from passive retrieval to creative discovery of new ideas, autonomous agents must be capable of deep, associative synthesis. However, current agentic frameworks prioritize convergent search, often resulting in derivative summaries that lack creativity. Caesar is an agentic architecture designed to bridge the gap between information gathering and synthesis of new insights. Unlike existing agents that treat the web as a flat sequence of disconnected documents, Caesar performs a deep web traversal to construct a dynamic knowledge graph. This graph then serves as a navigational scaffold, guiding the agent to diverse, non-obvious information that flat retrieval would never encounter. Caesar thus consists of two components: (1) exploration driven by a dynamic context-aware policy that maximizes information coverage across the web's topological structure, and (2) synthesis through adversarial refinement that actively seeks novel perspectives rather than confirming established priors. Caesar demonstrates the ability to generate artifacts and answers characterized by high novelty and structural coherence, achieving 13% to 23% improvement over state-of-the-art deep research agents in creative synthesis challenges, with strong dominance across all output formats.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Caesar sketches a plausible architecture pairing dynamic knowledge-graph web traversal with adversarial refinement for creative synthesis, but the 13-23% gains rest on zero experimental details.

Caesar's main contribution is an agent that builds a dynamic knowledge graph while traversing the web's topology and then uses adversarial refinement to push outputs toward novelty instead of derivative summaries. The specific combination of a context-aware exploration policy with an adversary that seeks non-obvious perspectives is not reduced to prior work in the abstract, so that pairing counts as new within agentic IR setups.

The framing of the problem (current agents converging too quickly and missing associative connections) is clear and on target. The paper does a decent job explaining why flat retrieval falls short and how a navigational scaffold from the graph could surface diverse material. That part is straightforward and useful as an idea sketch.

The soft spots are substantial and central. No methods, baselines, metrics, controls, or ablation results appear anywhere, so the claimed improvements cannot be checked. It is entirely possible the measured edge comes from longer contexts or prompt tweaks rather than the graph or the adversary. The refinement objective, how the adversary is defined, and stopping rules are left unspecified, which makes it hard to see why the outputs would be structurally coherent new insights rather than varied restatements.

This paper is for people already working on agentic systems for research assistance or idea generation who are looking for architecture ideas. A reader in that narrow group might pull one or two design choices from it, but the work has too little substance for broader use or citation right now. I would not bring it to a reading group, would not cite it, and would not send it to peer review until the experiments are written up with proper controls and evidence.

Referee Report

3 major / 0 minor

Summary. The paper presents Caesar, an agentic architecture for creative answer synthesis. It constructs a dynamic knowledge graph via deep web traversal using a context-aware policy that maximizes information coverage, then applies adversarial refinement to generate novel perspectives. The central claim is that this yields artifacts with high novelty and structural coherence, delivering 13% to 23% improvement over state-of-the-art deep research agents across creative synthesis tasks and output formats.

Significance. If the claimed gains are shown to arise specifically from the dynamic KG traversal and adversarial refinement rather than from longer context or prompt engineering, the work would meaningfully advance agentic systems beyond convergent retrieval toward divergent synthesis. The architecture directly targets a recognized limitation in current frameworks, and reproducible evaluation protocols for novelty would strengthen its contribution.

major comments (3)

[Architecture and Synthesis sections] The architecture description leaves the refinement objective, adversary definition, and stopping criteria unspecified. This is load-bearing for the central claim because the 13-23% improvement is attributed to adversarial refinement producing genuinely novel insights; without these details it remains possible that measured gains arise from increased context length alone.
[Experimental Evaluation] No experimental details, baselines, metrics, controls, or ablation studies are supplied to support the quantitative claim of 13-23% improvement. The evaluation of novelty and structural coherence therefore cannot be assessed, undermining verification of the core empirical result.
[Exploration Component] The assumption that dynamic KG traversal consistently surfaces non-obvious information missed by flat retrieval is stated but not validated with concrete traversal examples, coverage metrics, or comparison to standard retrieval baselines.
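To make the third comment concrete, coverage metrics of the kind requested could be as simple as counting distinct source domains and measuring how evenly visits are spread across them. The sketch below is an assumption about what such metrics might look like; `domain_entropy`, `unique_coverage`, and the example URLs are hypothetical and not from the paper.

```python
import math
from collections import Counter
from urllib.parse import urlparse

def unique_coverage(urls):
    """Number of distinct domains reached during exploration."""
    return len({urlparse(u).netloc for u in urls})

def domain_entropy(urls):
    """Shannon entropy (in bits) of the visit distribution over domains."""
    counts = Counter(urlparse(u).netloc for u in urls)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical visit log: two domains, visited equally often.
visited = [
    "https://example.org/a",
    "https://example.org/b",
    "https://example.com/x",
    "https://example.com/y",
]
print(unique_coverage(visited), domain_entropy(visited))
```

Higher entropy means visits are spread more evenly; a flat-retrieval baseline that keeps returning to one site would score near zero.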

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight areas where additional specification and evidence are needed to substantiate the core claims. We address each point below and commit to revisions that will strengthen the manuscript without altering its central contributions.

Referee: [Architecture and Synthesis sections] The architecture description leaves the refinement objective, adversary definition, and stopping criteria unspecified. This is load-bearing for the central claim because the 13-23% improvement is attributed to adversarial refinement producing genuinely novel insights; without these details it remains possible that measured gains arise from increased context length alone.

Authors: We agree that the current description is insufficiently precise on these elements. In the revised manuscript we will expand the Architecture and Synthesis sections with: (i) the explicit refinement objective (a minimax formulation that penalizes convergence to prior knowledge while rewarding divergence), (ii) the adversary definition (a secondary LLM agent trained to critique and propose counter-perspectives), and (iii) the stopping criteria (a combination of coverage saturation on the knowledge graph and a novelty threshold measured via embedding distance to existing nodes). These additions will include pseudocode and a short derivation showing why the mechanism is not reducible to context length alone. revision: yes
Referee: [Experimental Evaluation] No experimental details, baselines, metrics, controls, or ablation studies are supplied to support the quantitative claim of 13-23% improvement. The evaluation of novelty and structural coherence therefore cannot be assessed, undermining verification of the core empirical result.

Authors: We acknowledge the absence of these details in the submitted draft. The revised version will include a dedicated Experimental Evaluation section specifying: the full set of baselines (including GPT-4o with web browsing, ReAct, and recent deep research agents), the exact metrics (human-rated novelty on a 1-5 scale with inter-annotator agreement, structural coherence via graph-edit distance, and automated proxies), control conditions (fixed context length, no KG, no adversary), ablation studies isolating each component, and statistical tests (paired t-tests with p-values) supporting the reported 13-23% gains. We will also release the evaluation prompts and anonymized outputs. revision: yes
Referee: [Exploration Component] The assumption that dynamic KG traversal consistently surfaces non-obvious information missed by flat retrieval is stated but not validated with concrete traversal examples, coverage metrics, or comparison to standard retrieval baselines.

Authors: We will add a new subsection under Exploration that supplies: (i) two concrete traversal traces with step-by-step node expansions and the non-obvious facts discovered, (ii) quantitative coverage metrics (unique entity coverage, information entropy across the induced graph, and path diversity), and (iii) head-to-head comparisons against flat retrieval baselines (BM25, dense passage retrieval, and web-search-only agents) on the same query set, demonstrating higher recall of peripheral but relevant information. revision: yes
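The stopping rule named in the first response (coverage saturation combined with a novelty threshold on embedding distance to existing nodes) could look roughly like the sketch below. The cosine distance, both threshold values, and the names `novelty` and `should_stop` are assumptions made for illustration; the simulated rebuttal does not specify any of them.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def novelty(candidate, node_embeddings):
    """Distance from `candidate` to its nearest existing node embedding."""
    if not node_embeddings:
        return 1.0  # assumption: anything is novel relative to an empty graph
    return min(cosine_distance(candidate, e) for e in node_embeddings)

def should_stop(recent_novelties, new_node_flags,
                novelty_threshold=0.1, saturation_threshold=0.2):
    """Stop when recent insights are near-duplicates AND few recent steps added nodes."""
    if not recent_novelties or not new_node_flags:
        return False
    mean_novelty = sum(recent_novelties) / len(recent_novelties)
    growth_rate = sum(new_node_flags) / len(new_node_flags)
    return mean_novelty < novelty_threshold and growth_rate < saturation_threshold

# Hypothetical 2-D embeddings for two existing graph nodes.
nodes = [[1.0, 0.0], [0.0, 1.0]]
print(novelty([1.0, 1.0], nodes))
print(should_stop([0.02, 0.05, 0.01], [False] * 9 + [True]))
```

Requiring both conditions avoids stopping early when the graph is still growing but a few recent insights happen to be redundant.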

Circularity Check

0 steps flagged

No circularity in derivation chain

The paper presents an empirical agentic architecture (dynamic knowledge graph traversal + adversarial refinement) and reports measured improvements (13-23%) over baselines. No equations, derivations, fitted parameters, or self-citations appear in the abstract or described text. Central claims rest on experimental outcomes rather than any quantity defined in terms of itself or reduced by construction to prior inputs. The architecture is described at the level of components and policy goals without mathematical formalization that could create self-definitional or fitted-input circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated premise that the described components produce the claimed gains.

pith-pipeline@v0.9.0 · 5484 in / 1078 out tokens · 35231 ms · 2026-05-15T19:53:37.169891+00:00 · methodology

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 1 internal anchor

[1] Language Models are Few-Shot Learners

Cornell University, 5 2020. doi: 10.48550/arxiv.2005.14165.
Markus J. Buehler. Agentic deep graph reasoning yields self-organizing knowledge networks. Journal of Materials Research, 40(15):2204–2242, 7 2025. ISSN 0884-1616. doi: 10.1557/s43578-025-01652-1.
Ruth M. J. Byrne. The Rational Imagination: How People Create Alternatives to Reality. MIT Press, C...

[2] Dedre Gentner

URL https://blog.google/products-and-platforms/products/gemini/gemini-3/.
Dedre Gentner. Structure-mapping: A theoretical framework for analogy. Cognitive Science, 7(2):155–170, 4 1983. ISSN 0364-0213. doi: 10.1207/s15516709cog0702_3.
Joy Paul Guilford. The Nature of Human Intelligence. McGraw-Hill, New York, 1967.
Aric A. Hagberg, Daniel A. Schult, and Pi...

[3] Jieyi Long

doi: 10.18653/v1/2024.emnlp-main.35.
Jieyi Long. Large language model guided tree-of-thought. ArXiv preprint, abs/2305.08291, 5 2023. doi: 10.48550/arxiv.2305.08291.
Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008. ISSN 1532-4435.
Guillermo Macbeth, Eugenia Razumiejczyk, ...

[4] Agentic Large Language Models, a Survey

ISSN 0033-295X.
Aske Plaat, Max van Duijn, Niki van Stein, Mike Preuss, Peter van der Putten, and Kees Joost Batenburg. Agentic large language models, a survey. ArXiv preprint, abs/2503.23037, 12 2025. ISSN 1076-9757. doi: 10.1613/jair.1.18675.
Hongjin Qian and Zheng Liu. Scent of Knowledge: Optimizing search-enhanced reasoning with information foraging. Ar...

[5] Resolving knowledge conflicts in large language models

URL https://github.com/mem0ai/mem0. GitHub repository.
Yike Wang, Shangbin Feng, Heng Wang, Weijia Shi, Vidhisha Balachandran, Tianxing He, and Yulia Tsvetkov. Resolving knowledge conflicts in large language models. ArXiv preprint, abs/2310.00935, 10 2023. doi: 10.48550/arxiv.2310.00935.
Thomas B. Ward. Structured imagination: The role of category structure...
[6] Constrained Synthesis

| Agent | New | Useful | Surp. | Total |
|---|---|---|---|---|
| Caesar | 9.11 | 8.89 | 9.11 | 27.11 |
| Gemini 3 (Deep) | 7.78 | 8.33 | 7.67 | 23.78 |
| Gemini 3 (Shallow) | 6.89 | 6.22 | 6.89 | 20.00 |
| Sonnet 4.5 (Shallow) | 6.67 | 6.44 | 6.11 | 19.22 |
| Sonnet 4.5 (Deep) | 5.78 | 6.11 | 5.00 | 16.89 |
| GPT-5.2 (Shallow) | 5.56 | 5.89 | 5.00 | 16.45 |
| GPT-5.2 (Deep) | 4.11 | 6.44 | 3.33 | 13.88 |

[7] Counterfactual Reasoning

| Agent | New | Useful | Surp. | Total |
|---|---|---|---|---|
| Caesar | 9.44 | 9.11 | 9.44 | 27.99 |
| Gemini 3 (Deep) | 8.56 | 8.11 | 8.44 | 25.11 |
| Sonnet 4.5 (Deep) | 6.89 | 8.33 | 6.56 | 21.78 |
| GPT-5.2 (Deep) | 4.78 | 6.44 | 4.44 | 15.66 |
| Gemini 3 (Shallow) | 5.00 | 5.22 | 5.33 | 15.55 |
| Sonnet 4.5 (Shallow) | 3.89 | 4.89 | 3.56 | 12.34 |
| GPT-5.2 (Shallow) | 3.78 | 5.11 | 3.22 | 12.11 |

[8] Cross-Domain Synthesis

| Agent | New | Useful | Surp. | Total |
|---|---|---|---|---|
| Caesar | 9.56 | 8.56 | 9.44 | 27.56 |
| Gemini 3 (Deep) | 7.00 | 7.89 | 6.78 | 21.67 |
| Sonnet 4.5 (Deep) | 6.22 | 7.67 | 6.44 | 20.33 |
| GPT-5.2 (Deep) | 5.00 | 6.22 | 4.56 | 15.78 |
| Gemini 3 (Shallow) | 3.78 | 4.78 | 3.56 | 12.12 |
| GPT-5.2 (Shallow) | 3.33 | 4.78 | 3.11 | 11.22 |
| Sonnet 4.5 (Shallow) | 2.56 | 3.89 | 2.44 | 8.89 |

[9] Meta-Creativity (Table 5, continued from previous page)

| Agent | New | Useful | Surp. | Total |
|---|---|---|---|---|
| Caesar | 9.22 | 9.22 | 8.89 | 27.33 |
| Gemini 3 (Deep) | 8.78 | 7.22 | 8.89 | 24.89 |
| Sonnet 4.5 (Deep) | 7.44 | 7.33 | 7.22 | 21.99 |
| GPT-5.2 (Deep) | 5.78 | 5.67 | 4.67 | 16.12 |
| Gemini 3 (Shallow) | 4.44 | 4.22 | 4.22 | 12.88 |
| Sonnet 4.5 (Shallow) | 4.33 | 4.44 | 3.89 | 12.66 |
| GPT-5.2 (Shallow) | 4.11 | 4.67 | 3.56 | 12.34 |

[10] Open-Ended Synthesis

| Agent | New | Useful | Surp. | Total |
|---|---|---|---|---|
| Caesar | 8.22 | 8.56 | 8.00 | 24.78 |
| Gemini 3 (Deep) | 8.33 | 6.44 | 8.67 | 23.44 |
| Sonnet 4.5 (Deep) | 7.33 | 8.00 | 6.89 | 22.22 |
| GPT-5.2 (Shallow) | 7.44 | 6.33 | 7.22 | 20.99 |
| Sonnet 4.5 (Shallow) | 8.89 | 2.33 | 9.22 | 20.44 |
| GPT-5.2 (Deep) | 5.67 | 6.78 | 4.78 | 17.23 |
| Gemini 3 (Shallow) | 5.33 | 6.56 | 4.67 | 16.56 |

Table 6: Detailed performance breakdown for Unconstrained ELI5 Answers. Scores repre...
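As a worked check on the Constrained Synthesis rows listed under entry [6] above, the sketch below confirms that each Total equals New + Useful + Surp. to rounding and computes Caesar's relative margin over the best baseline on that one challenge. Reading the margin as (Caesar minus best baseline) divided by best baseline is an assumption; this page does not say how the reported 13% to 23% figure was aggregated.

```python
# Rows copied from entry [6]: (New, Useful, Surp., Total).
rows = {
    "Caesar": (9.11, 8.89, 9.11, 27.11),
    "Gemini 3 (Deep)": (7.78, 8.33, 7.67, 23.78),
    "Gemini 3 (Shallow)": (6.89, 6.22, 6.89, 20.00),
    "Sonnet 4.5 (Shallow)": (6.67, 6.44, 6.11, 19.22),
    "Sonnet 4.5 (Deep)": (5.78, 6.11, 5.00, 16.89),
    "GPT-5.2 (Shallow)": (5.56, 5.89, 5.00, 16.45),
    "GPT-5.2 (Deep)": (4.11, 6.44, 3.33, 13.88),
}

# Sanity check: the Total column is the sum of the three criteria (within rounding).
for agent, (new, useful, surprise, total) in rows.items():
    assert abs(new + useful + surprise - total) < 0.015, agent

caesar_total = rows["Caesar"][3]
best_baseline = max(total for agent, (_, _, _, total) in rows.items() if agent != "Caesar")

# Assumed definition of "improvement": relative margin over the strongest baseline.
relative_gain = (caesar_total - best_baseline) / best_baseline
print(f"{relative_gain:.1%}")
```

On these rows the margin comes out near 14%, which sits inside the reported range, though a single challenge says nothing about how the headline figure was averaged.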
[11] Constrained Synthesis

| Agent | New | Useful | Surp. | Total |
|---|---|---|---|---|
| Caesar | 8.89 | 8.89 | 8.67 | 26.45 |
| Sonnet 4.5 (Deep) | 6.89 | 8.22 | 6.44 | 21.55 |
| Gemini 3 (Shallow) | 6.67 | 5.89 | 6.56 | 19.12 |
| Gemini 3 (Deep) | 4.78 | 6.56 | 5.22 | 16.56 |
| GPT-5.2 (Shallow) | 5.78 | 5.33 | 5.22 | 16.33 |
| GPT-5.2 (Deep) | 4.67 | 7.33 | 4.22 | 16.22 |
| Sonnet 4.5 (Shallow) | 5.44 | 5.33 | 5.22 | 15.99 |

[12] Counterfactual Reasoning

| Agent | New | Useful | Surp. | Total |
|---|---|---|---|---|
| Caesar | 9.22 | 9.00 | 9.22 | 27.44 |
| Sonnet 4.5 (Deep) | 7.22 | 8.00 | 6.67 | 21.89 |
| Gemini 3 (Deep) | 6.67 | 6.33 | 6.89 | 19.89 |
| GPT-5.2 (Deep) | 5.00 | 7.00 | 4.67 | 16.67 |
| Sonnet 4.5 (Shallow) | 4.67 | 5.56 | 4.44 | 14.67 |
| GPT-5.2 (Shallow) | 3.33 | 5.00 | 2.67 | 11.00 |
| Gemini 3 (Shallow) | 2.67 | 3.56 | 2.44 | 8.67 |

[13] Cross-Domain Synthesis (Table 6, continued from previous page)

| Agent | New | Useful | Surp. | Total |
|---|---|---|---|---|
| Caesar | 9.11 | 9.11 | 8.89 | 27.11 |
| Sonnet 4.5 (Deep) | 6.78 | 7.78 | 6.78 | 21.34 |
| GPT-5.2 (Deep) | 5.44 | 6.89 | 5.00 | 17.33 |
| GPT-5.2 (Shallow) | 4.00 | 5.33 | 3.56 | 12.89 |
| Gemini 3 (Deep) | 3.78 | 5.22 | 3.78 | 12.78 |
| Sonnet 4.5 (Shallow) | 3.89 | 4.56 | 3.67 | 12.12 |
| Gemini 3 (Shallow) | 2.33 | 3.11 | 2.00 | 7.44 |

[14] Meta-Creativity

| Agent | New | Useful | Surp. | Total |
|---|---|---|---|---|
| Caesar | 8.78 | 8.33 | 8.67 | 25.78 |
| Sonnet 4.5 (Deep) | 7.56 | 7.44 | 7.22 | 22.22 |
| Gemini 3 (Deep) | 6.78 | 5.56 | 6.89 | 19.23 |
| GPT-5.2 (Deep) | 6.11 | 6.56 | 6.00 | 18.67 |
| Sonnet 4.5 (Shallow) | 4.89 | 5.22 | 4.00 | 14.11 |
| GPT-5.2 (Shallow) | 4.22 | 4.67 | 3.44 | 12.33 |
| Gemini 3 (Shallow) | 3.00 | 3.11 | 2.56 | 8.67 |

[15] Open-Ended Synthesis

| Agent | New | Useful | Surp. | Total |
|---|---|---|---|---|
| Sonnet 4.5 (Deep) | 7.11 | 7.89 | 7.33 | 22.33 |
| Caesar | 7.11 | 8.11 | 6.44 | 21.66 |
| Sonnet 4.5 (Shallow) | 8.89 | 2.89 | 9.00 | 20.78 |
| Gemini 3 (Deep) | 7.44 | 5.22 | 7.33 | 19.99 |
| GPT-5.2 (Shallow) | 6.78 | 5.22 | 6.67 | 18.67 |
| GPT-5.2 (Deep) | 5.67 | 5.78 | 5.11 | 16.56 |
| Gemini 3 (Shallow) | 5.11 | 6.67 | 4.33 | 16.11 |

Table 7: Detailed performance breakdown for ELI5 Answers (450 Word Limit). Scores re...
[16] Constrained Synthesis

| Agent | New | Useful | Surp. | Total |
|---|---|---|---|---|
| Caesar | 8.78 | 8.78 | 9.00 | 26.56 |
| Gemini 3 (Deep) | 6.56 | 7.33 | 6.67 | 20.56 |
| Sonnet 4.5 (Deep) | 6.00 | 7.78 | 5.33 | 19.11 |
| Sonnet 4.5 (Shallow) | 6.22 | 6.11 | 6.00 | 18.33 |
| GPT-5.2 (Shallow) | 5.78 | 6.11 | 5.00 | 16.89 |
| Gemini 3 (Shallow) | 5.44 | 6.00 | 5.33 | 16.77 |
| GPT-5.2 (Deep) | 4.44 | 7.00 | 4.00 | 15.44 |

[17] Counterfactual Reasoning

| Agent | New | Useful | Surp. | Total |
|---|---|---|---|---|
| Caesar | 8.33 | 8.33 | 8.33 | 24.99 |
| Gemini 3 (Deep) | 8.22 | 7.67 | 8.11 | 24.00 |
| Sonnet 4.5 (Deep) | 6.78 | 8.00 | 6.67 | 21.45 |
| Sonnet 4.5 (Shallow) | 6.11 | 6.44 | 6.00 | 18.55 |
| GPT-5.2 (Deep) | 4.33 | 6.00 | 3.44 | 13.77 |
| Gemini 3 (Shallow) | 4.33 | 4.78 | 3.78 | 12.89 |
| GPT-5.2 (Shallow) | 3.11 | 4.67 | 2.56 | 10.34 |

[18] Cross-Domain Synthesis (Table 7, continued from previous page)

| Agent | New | Useful | Surp. | Total |
|---|---|---|---|---|
| Caesar | 8.44 | 9.11 | 8.22 | 25.77 |
| Sonnet 4.5 (Shallow) | 7.11 | 6.89 | 7.11 | 21.11 |
| Sonnet 4.5 (Deep) | 5.78 | 7.44 | 6.00 | 19.22 |
| Gemini 3 (Deep) | 5.22 | 6.22 | 5.33 | 16.77 |
| GPT-5.2 (Deep) | 3.89 | 5.89 | 3.11 | 12.89 |
| GPT-5.2 (Shallow) | 3.89 | 5.56 | 3.44 | 12.89 |
| Gemini 3 (Shallow) | 3.11 | 4.33 | 2.78 | 10.22 |

[19] Meta-Creativity

| Agent | New | Useful | Surp. | Total |
|---|---|---|---|---|
| Caesar | 8.78 | 8.89 | 8.56 | 26.23 |
| Sonnet 4.5 (Deep) | 6.89 | 6.67 | 6.56 | 20.12 |
| Gemini 3 (Deep) | 5.89 | 4.78 | 6.11 | 16.78 |
| GPT-5.2 (Deep) | 5.33 | 6.11 | 4.89 | 16.33 |
| Sonnet 4.5 (Shallow) | 5.44 | 5.11 | 4.89 | 15.44 |
| GPT-5.2 (Shallow) | 4.78 | 5.22 | 4.11 | 14.11 |
| Gemini 3 (Shallow) | 3.56 | 3.78 | 3.33 | 10.67 |

[20] Open-Ended Synthesis

| Agent | New | Useful | Surp. | Total |
|---|---|---|---|---|
| Sonnet 4.5 (Deep) | 7.11 | 8.00 | 7.00 | 22.11 |
| Caesar | 7.33 | 8.33 | 6.44 | 22.10 |
| Sonnet 4.5 (Shallow) | 8.89 | 3.00 | 9.11 | 21.00 |
| Gemini 3 (Deep) | 7.33 | 5.00 | 7.89 | 20.22 |
| GPT-5.2 (Shallow) | 6.89 | 6.00 | 6.89 | 19.78 |
| Gemini 3 (Shallow) | 4.89 | 6.56 | 4.33 | 15.78 |
| GPT-5.2 (Deep) | 4.00 | 5.33 | 3.11 | 12.44 |

D A Qualitative Comparison of Answers

To illustrate the fundamental distinction bet...
[21] Household Continuity Account

Data Ownership: The "Household Continuity Account" is owned by the user via a Data Trust. The Carrier is a fiduciary processor with no ownership rights

[22] Duress Modes

Consent under Duress: Features "Duress Modes" releasing only minimum attestations. Violations trigger automatic sanctions

[23] Silence

Verification: An Independent Reliability Regulator (multi-stakeholder board) audits cryptographically signed service receipts. "Silence" is treated as a risk signal.

[ITERATION 3] Q: With the governance architecture established, what is the smal...

[24] Must-Cover

Geographic "Must-Cover" Contract: Carrier must cover 100% of registered households (no cherry-picking)

[25] Two-Part Tariff: Seasonal Retainer (readiness) + Triggered Usage Payments (surge events)

[26] Monsoon Pilot

Stop-loss Pool: Reinsurance-style fund covers costs above catastrophic thresholds.

[ITERATION 4] Q: To execute the "Monsoon Pilot" model defined above, what is the supply-side operating model? How do you dispatch/pay a heterogeneous network of clinics/vendors in real-time? A: Oper...

[27] Resilience Primitives

"Resilience Primitives": Services converted into standardized modules with strict inputs/outputs (e.g., "Acute PTSD Stabilization")

[28] Tiered Registry: From licensed NGOs (Tier A) to community actors (Tier C, sponsored by anchors)

[29] Work Tokens

Dispatch Engine: Issues "Work Tokens" based on location, availability, and equity constraints

[30] Escrow Payment: Two-key release requires Supplier Proof + Independent Verifier confirmation

[31] SUMMARY:

Permissioning: Suppliers never own the user record; they write outputs to the ledger via temporary consent tokens.

G Detailed Ablation Results

This section expands upon the ablation results provided in the main paper to better understand how they affect Caesar’s drafting processes and ELI5 outputs.

G.1 Ablation Results for E...

[32] **EXPLORE** new un-visited pages to discover novel information or knowledge

[33] **BACKTRACK** to the immediate previously visited page to try alternative paths

[34] **WEB_SEARCH** relevant topics to address current exploration insights

Consider:
- Knowledge gaps vs areas of saturation
- Depth of current exploration branch
- Success patterns from previous decisions
- Risk/reward of new exploration vs consolidation

K.2 Phase 2 Prompts (Adversarial...

[1] [1]

Language Models are Few-Shot Learners

Cornell University, 5 2020. doi: 10.48550/arxiv.2005.14165. Markus J. Buehler. Agentic deep graph reasoning yields self-organizing knowledge networks.Jour- nal of Materials Research, 40(15):2204–2242, 7 2025. ISSN 0884-1616. doi: 10.1557/s43578-0 25-01652-1. Ruth M. J. Byrne.The Rational Imagination: How People Create Alternatives to Reality. MIT Press, C...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2005.14165 2020

[2] [2]

Dedre Gentner

URLhttps://blog.google/products-and-platforms/products/gemini/gemin i-3/. Dedre Gentner. Structure-mapping: A theoretical framework for analogy.Cognitive Science, 7(2): 155–170, 4 1983. ISSN 0364-0213. doi: 10.1207/s15516709cog0702_3. Joy Paul Guilford.The Nature of Human Intelligence. McGraw-Hill, New York, 1967. Aric A. Hagberg, Daniel A. Schult, and Pi...

work page doi:10.1207/s15516709cog0702_3 1983

[3] [3]

Jieyi Long

doi: 10.18653/v1/2024.emnlp-main.35. Jieyi Long. Large language model guided tree-of-thought.ArXiv preprint, abs/2305.08291, 5 2023. doi: 10.48550/arxiv.2305.08291. Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne.Journal of machine learning research, 9(Nov):2579–2605, 2008. ISSN 1532-4435. Guillermo Macbeth, Eugenia Razumiejczyk, ...

work page doi:10.18653/v1/2024.emnlp-main.35 2024

[4] [4]

Agentic Large Language Models, a Survey , volume=

ISSN 0033-295X. Aske Plaat, Max van Duijn, Niki van Stein, Mike Preuss, Peter van der Putten, and Kees Joost Batenburg. Agentic large language models, a survey.ArXiv preprint, abs/2503.23037, 12 2025. ISSN 1076-9757. doi: 10.1613/jair.1.18675. Hongjin Qian and Zheng Liu. Scent of Knowledge: Optimizing search-enhanced reasoning with information foraging.Ar...

work page doi:10.1613/jair.1.18675 2025

[5] [5]

Resolving knowledge conflicts in large language models,

URLhttps://github.com/mem0ai/mem0. GitHub repository. Yike Wang, Shangbin Feng, Heng Wang, Weijia Shi, Vidhisha Balachandran, Tianxing He, and Yulia Tsvetkov. Resolving knowledge conflicts in large language models.ArXiv preprint, abs/2310.00935, 10 2023. doi: 10.48550/arxiv.2310.00935. Thomas B. Ward. Structured imagination: The role of category structure...

work page doi:10.48550/arxiv.2310.00935 2023

[6] [6]

Constrained Synthesis Caesar 9.11 8.89 9.11 27.11 Gemini 3 (Deep) 7.78 8.33 7.67 23.78 Gemini 3 (Shallow) 6.89 6.22 6.89 20.00 Sonnet 4.5 (Shallow) 6.67 6.44 6.11 19.22 Sonnet 4.5 (Deep) 5.78 6.11 5.00 16.89 GPT-5.2 (Shallow) 5.56 5.89 5.00 16.45 GPT-5.2 (Deep) 4.11 6.44 3.33 13.88

work page

[7] [7]

Counterfactual Reasoning Caesar 9.44 9.11 9.44 27.99 Gemini 3 (Deep) 8.56 8.11 8.44 25.11 Sonnet 4.5 (Deep) 6.89 8.33 6.56 21.78 GPT-5.2 (Deep) 4.78 6.44 4.44 15.66 Gemini 3 (Shallow) 5.00 5.22 5.33 15.55 Sonnet 4.5 (Shallow) 3.89 4.89 3.56 12.34 GPT-5.2 (Shallow) 3.78 5.11 3.22 12.11

work page

[8] [8]

Cross-Domain Synthesis Caesar 9.56 8.56 9.44 27.56 Gemini 3 (Deep) 7.00 7.89 6.78 21.67 Sonnet 4.5 (Deep) 6.22 7.67 6.44 20.33 GPT-5.2 (Deep) 5.00 6.22 4.56 15.78 Gemini 3 (Shallow) 3.78 4.78 3.56 12.12 GPT-5.2 (Shallow) 3.33 4.78 3.11 11.22 Sonnet 4.5 (Shallow) 2.56 3.89 2.44 8.89

work page

[9] [9]

Meta-Creativity 20 Table 5 – continued from previous page Agent New Useful Surp. Total Caesar 9.22 9.22 8.89 27.33 Gemini 3 (Deep) 8.78 7.228.8924.89 Sonnet 4.5 (Deep) 7.44 7.33 7.22 21.99 GPT-5.2 (Deep) 5.78 5.67 4.67 16.12 Gemini 3 (Shallow) 4.44 4.22 4.22 12.88 Sonnet 4.5 (Shallow) 4.33 4.44 3.89 12.66 GPT-5.2 (Shallow) 4.11 4.67 3.56 12.34

work page

[10] [10]

Scores represent the mean of nine samples

Open-Ended Synthesis Caesar8.228.568.0024.78 Gemini 3 (Deep) 8.33 6.44 8.67 23.44 Sonnet 4.5 (Deep) 7.33 8.00 6.89 22.22 GPT-5.2 (Shallow) 7.44 6.33 7.22 20.99 Sonnet 4.5 (Shallow)8.892.339.2220.44 GPT-5.2 (Deep) 5.67 6.78 4.78 17.23 Gemini 3 (Shallow) 5.33 6.56 4.67 16.56 Table 6: Detailed performance breakdown forUnconstrained ELI5 Answers. Scores repre...

work page

[11] [11]

Constrained Synthesis Caesar 8.89 8.89 8.67 26.45 Sonnet 4.5 (Deep) 6.89 8.22 6.44 21.55 Gemini 3 (Shallow) 6.67 5.89 6.56 19.12 Gemini 3 (Deep) 4.78 6.56 5.22 16.56 GPT-5.2 (Shallow) 5.78 5.33 5.22 16.33 GPT-5.2 (Deep) 4.67 7.33 4.22 16.22 Sonnet 4.5 (Shallow) 5.44 5.33 5.22 15.99

work page

[12] [12]

Counterfactual Reasoning Caesar 9.22 9.00 9.22 27.44 Sonnet 4.5 (Deep) 7.22 8.00 6.67 21.89 Gemini 3 (Deep) 6.67 6.33 6.89 19.89 GPT-5.2 (Deep) 5.00 7.00 4.67 16.67 Sonnet 4.5 (Shallow) 4.67 5.56 4.44 14.67 GPT-5.2 (Shallow) 3.33 5.00 2.67 11.00 Gemini 3 (Shallow) 2.67 3.56 2.44 8.67

work page

[13] [13]

Total Gemini 3 (Deep) 3.78 5.22 3.78 12.78 Sonnet 4.5 (Shallow) 3.89 4.56 3.67 12.12 Gemini 3 (Shallow) 2.33 3.11 2.00 7.44

Cross-Domain Synthesis Caesar 9.11 9.11 8.89 27.11 Sonnet 4.5 (Deep) 6.78 7.78 6.78 21.34 GPT-5.2 (Deep) 5.44 6.89 5.00 17.33 GPT-5.2 (Shallow) 4.00 5.33 3.56 12.89 21 Table 6 – continued from previous page Agent New Useful Surp. Total Gemini 3 (Deep) 3.78 5.22 3.78 12.78 Sonnet 4.5 (Shallow) 3.89 4.56 3.67 12.12 Gemini 3 (Shallow) 2.33 3.11 2.00 7.44

work page

[14] [14]

Meta-Creativity Caesar 8.78 8.33 8.67 25.78 Sonnet 4.5 (Deep) 7.56 7.44 7.22 22.22 Gemini 3 (Deep) 6.78 5.56 6.89 19.23 GPT-5.2 (Deep) 6.11 6.56 6.00 18.67 Sonnet 4.5 (Shallow) 4.89 5.22 4.00 14.11 GPT-5.2 (Shallow) 4.22 4.67 3.44 12.33 Gemini 3 (Shallow) 3.00 3.11 2.56 8.67

work page

[15] [15]

Scores represent the mean of nine samples

Open-Ended Synthesis Sonnet 4.5 (Deep) 7.11 7.89 7.3322.33 Caesar7.118.116.44 21.66 Sonnet 4.5 (Shallow)8.892.899.0020.78 Gemini 3 (Deep) 7.44 5.22 7.33 19.99 GPT-5.2 (Shallow) 6.78 5.22 6.67 18.67 GPT-5.2 (Deep) 5.67 5.78 5.11 16.56 Gemini 3 (Shallow) 5.11 6.67 4.33 16.11 Table 7: Detailed performance breakdown forELI5 Answers (450 Word Limit). Scores re...

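Constrained Synthesis

| Agent | New | Useful | Surp. | Total |
|---|---|---|---|---|
| Caesar | 8.78 | 8.78 | 9.00 | 26.56 |
| Gemini 3 (Deep) | 6.56 | 7.33 | 6.67 | 20.56 |
| Sonnet 4.5 (Deep) | 6.00 | 7.78 | 5.33 | 19.11 |
| Sonnet 4.5 (Shallow) | 6.22 | 6.11 | 6.00 | 18.33 |
| GPT-5.2 (Shallow) | 5.78 | 6.11 | 5.00 | 16.89 |
| Gemini 3 (Shallow) | 5.44 | 6.00 | 5.33 | 16.77 |
| GPT-5.2 (Deep) | 4.44 | 7.00 | 4.00 | 15.44 |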

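Counterfactual Reasoning

| Agent | New | Useful | Surp. | Total |
|---|---|---|---|---|
| Caesar | 8.33 | 8.33 | 8.33 | 24.99 |
| Gemini 3 (Deep) | 8.22 | 7.67 | 8.11 | 24.00 |
| Sonnet 4.5 (Deep) | 6.78 | 8.00 | 6.67 | 21.45 |
| Sonnet 4.5 (Shallow) | 6.11 | 6.44 | 6.00 | 18.55 |
| GPT-5.2 (Deep) | 4.33 | 6.00 | 3.44 | 13.77 |
| Gemini 3 (Shallow) | 4.33 | 4.78 | 3.78 | 12.89 |
| GPT-5.2 (Shallow) | 3.11 | 4.67 | 2.56 | 10.34 |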

Cross-Domain Synthesis (Table 7, continued)

| Agent | New | Useful | Surp. | Total |
|---|---|---|---|---|
| Caesar | 8.44 | 9.11 | 8.22 | 25.77 |
| Sonnet 4.5 (Shallow) | 7.11 | 6.89 | 7.11 | 21.11 |
| Sonnet 4.5 (Deep) | 5.78 | 7.44 | 6.00 | 19.22 |
| Gemini 3 (Deep) | 5.22 | 6.22 | 5.33 | 16.77 |
| GPT-5.2 (Deep) | 3.89 | 5.89 | 3.11 | 12.89 |
| GPT-5.2 (Shallow) | 3.89 | 5.56 | 3.44 | 12.89 |
| Gemini 3 (Shallow) | 3.11 | 4.33 | 2.78 | 10.22 |

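Meta-Creativity

| Agent | New | Useful | Surp. | Total |
|---|---|---|---|---|
| Caesar | 8.78 | 8.89 | 8.56 | 26.23 |
| Sonnet 4.5 (Deep) | 6.89 | 6.67 | 6.56 | 20.12 |
| Gemini 3 (Deep) | 5.89 | 4.78 | 6.11 | 16.78 |
| GPT-5.2 (Deep) | 5.33 | 6.11 | 4.89 | 16.33 |
| Sonnet 4.5 (Shallow) | 5.44 | 5.11 | 4.89 | 15.44 |
| GPT-5.2 (Shallow) | 4.78 | 5.22 | 4.11 | 14.11 |
| Gemini 3 (Shallow) | 3.56 | 3.78 | 3.33 | 10.67 |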

Open-Ended Synthesis

| Agent | New | Useful | Surp. | Total |
|---|---|---|---|---|
| Sonnet 4.5 (Deep) | 7.11 | 8.00 | 7.00 | 22.11 |
| Caesar | 7.33 | 8.33 | 6.44 | 22.10 |
| Sonnet 4.5 (Shallow) | 8.89 | 3.00 | 9.11 | 21.00 |
| Gemini 3 (Deep) | 7.33 | 5.00 | 7.89 | 20.22 |
| GPT-5.2 (Shallow) | 6.89 | 6.00 | 6.89 | 19.78 |
| Gemini 3 (Shallow) | 4.89 | 6.56 | 4.33 | 15.78 |
| GPT-5.2 (Deep) | 4.00 | 5.33 | 3.11 | 12.44 |

D A Qualitative Comparison of Answers

To illustrate the fundamental distinction bet...

Data Ownership: The "Household Continuity Account" is owned by the user via a Data Trust. The Carrier is a fiduciary processor with no ownership rights

Consent under Duress: Features "Duress Modes" releasing only minimum attestations. Violations trigger automatic sanctions

Verification: An Independent Reliability Regulator (multi-stakeholder board) audits cryptographically signed service receipts. "Silence" is treated as a risk signal.

[ITERATION 3]
Q: With the governance architecture established, what is the smal...

Geographic "Must-Cover" Contract: Carrier must cover 100% of registered households (no cherry-picking)

Two-Part Tariff: Seasonal Retainer (readiness) + Triggered Usage Payments (surge events)

Stop-loss Pool: Reinsurance-style fund covers costs above catastrophic thresholds.

[ITERATION 4]
Q: To execute the "Monsoon Pilot" model defined above, what is the supply-side operating model? How do you dispatch/pay a heterogeneous network of clinics/vendors in real-time?
A: Oper...

"Resilience Primitives": Services converted into standardized modules with strict inputs/outputs (e.g., "Acute PTSD Stabilization")

Tiered Registry: From licensed NGOs (Tier A) to community actors (Tier C, sponsored by anchors)

Dispatch Engine: Issues "Work Tokens" based on location, availability, and equity constraints

Escrow Payment: Two-key release requires Supplier Proof + Independent Verifier confirmation

Permissioning: Suppliers never own the user record; they write outputs to the ledger via temporary consent tokens.

G Detailed Ablation Results

This section expands upon the ablation results provided in the main paper to better understand how they affect Caesar’s drafting processes and ELI5 outputs.

G.1 Ablation Results for E...

**EXPLORE** new un-visited pages to discover novel information or knowledge

**BACKTRACK** to the immediate previously visited page to try alternative paths

**WEB_SEARCH** relevant topics to address current exploration insights

Consider:
- Knowledge gaps vs areas of saturation
- Depth of current exploration branch
- Success patterns from previous decisions
- Risk/reward of new exploration vs consolidation

K.2 Phase 2 Prompts (Adversarial...
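The three navigation actions named in the exploration prompt (EXPLORE, BACKTRACK, WEB_SEARCH) and its "Consider" factors can be illustrated with a small rule-based sketch. This is not Caesar's policy: in the paper the choice is made by a prompted, context-aware model, whereas the state fields, thresholds, and rules below are hypothetical stand-ins chosen only to make the trade-offs concrete.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    """The three actions named in the exploration prompt."""
    EXPLORE = "EXPLORE"
    BACKTRACK = "BACKTRACK"
    WEB_SEARCH = "WEB_SEARCH"


@dataclass
class ExplorationState:
    # All fields are hypothetical proxies for the prompt's "Consider" items.
    unvisited_links: int   # candidate pages not yet visited from the current page
    branch_depth: int      # depth of the current exploration branch
    saturation: float      # 0.0 = many knowledge gaps, 1.0 = fully saturated
    recent_successes: int  # recent steps that yielded new insights


def choose_action(
    state: ExplorationState,
    max_depth: int = 5,              # assumed depth budget per branch
    saturation_cutoff: float = 0.8,  # assumed saturation threshold
) -> Action:
    """Pick an action with simple hand-written rules (illustrative only)."""
    # A saturated area suggests looking elsewhere via a fresh search.
    if state.saturation >= saturation_cutoff:
        return Action.WEB_SEARCH
    # A dead end, or a deep branch that has stopped paying off, suggests backing up.
    dead_end = state.unvisited_links == 0
    stale_deep_branch = state.branch_depth >= max_depth and state.recent_successes == 0
    if dead_end or stale_deep_branch:
        # At the root there is nothing to back up to, so search instead.
        return Action.WEB_SEARCH if state.branch_depth == 0 else Action.BACKTRACK
    # Otherwise keep following unvisited pages.
    return Action.EXPLORE


if __name__ == "__main__":
    print(choose_action(ExplorationState(3, 1, 0.2, 1)).value)
```

A learned or prompted policy would replace `choose_action`; the enum and state container only show one way the decision inputs might be organized.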