AlphaContext: An Evolutionary Tree-based Psychometric Context Generator for Creativity Assessment

Aimin Zhou; Hong Qian; Jiajun Guo; Wenkai Wang; Yifei Ding; Yixuan Wang; Yue Huang; Yunzhao Wei; Zhi Liu; Zhongjing Huang

arxiv: 2604.18398 · v3 · submitted 2026-04-20 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-10 04:27 UTC · model grok-4.3

keywords creativity assessment · psychometric contexts · context generation · evolutionary optimization · Monte Carlo tree search · LLM evaluation · narrative coherence · assessment instruments

The pith

AlphaContext generates psychometric contexts for creativity assessment by evolving tree-structured outlines with Monte Carlo search and niche optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a new generator called AlphaContext to create high-quality contexts for measuring creative thinking. Existing LLM methods often produce contexts that lack strong assessment cues, coherent stories, stylistic variety, and real support for creative responses. AlphaContext first plans a hierarchical outline using a rule-guided hypertree, then fills it with Monte Carlo tree search, evolves the results with niche-based optimization to boost both quality and diversity, and refines weak outputs by simulating varied participant styles. If the approach works, it could make valid creativity tests more available at scale without depending on scarce expert writers. This would matter for tracking and improving creative skills in an era of widespread human-AI collaboration.

Core claim

AlphaContext formalizes expert-designed outlining as a rule-guided hypertree for top-down planning, fills the outline via Monte Carlo tree search to balance global structure with local quality, evolves the contexts using MAP-Elites to jointly raise diversity and quality, and refines them through assessment-guided evolution that simulates virtual participants with diverse styles before recycling weak contexts.
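As a rough illustration of the niche-elite update named in the core claim, the sketch below runs a generic MAP-Elites loop in Python. It is not the paper's implementation: the three behavior descriptors in [0, 1], the bin count `BINS`, and the helpers `toy_evaluate` and `toy_mutate` are hypothetical stand-ins for the LLM-based scoring and rewriting the paper describes.

```python
import random

BINS = 4  # hypothetical resolution per behavior axis


def niche_of(descriptor, bins=BINS):
    """Map a 3-tuple of descriptors in [0, 1] to a cell of a uniform 3D grid."""
    return tuple(min(int(d * bins), bins - 1) for d in descriptor)


def map_elites(seeds, mutate, evaluate, iterations=200, rng=None):
    """Generic MAP-Elites loop; evaluate(x) returns (quality, descriptor).

    Returns a dict mapping each occupied niche to its (quality, candidate) elite.
    """
    rng = rng or random.Random(0)
    archive = {}

    def try_insert(candidate):
        quality, descriptor = evaluate(candidate)
        cell = niche_of(descriptor)
        # Keep the candidate only if its niche is empty or it beats the elite.
        if cell not in archive or quality > archive[cell][0]:
            archive[cell] = (quality, candidate)

    for seed in seeds:
        try_insert(seed)
    for _ in range(iterations):
        # Pick a random elite as the parent and try to place its mutation.
        _, parent = rng.choice(list(archive.values()))
        try_insert(mutate(parent, rng))
    return archive


def toy_evaluate(vec):
    # Stand-in scores: quality is the mean, descriptors are the vector itself.
    return sum(vec) / len(vec), vec


def toy_mutate(vec, rng):
    # Stand-in for an LLM rewrite: small Gaussian perturbation, clamped to [0, 1].
    return tuple(min(1.0, max(0.0, v + rng.gauss(0, 0.2))) for v in vec)


archive = map_elites([(0.5, 0.5, 0.5)], toy_mutate, toy_evaluate)
print(len(archive), "niches filled out of", BINS ** 3)
```

Keeping one elite per cell is what lets quality rise without collapsing diversity: a strong candidate can only displace the elite of its own niche.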

What carries the argument

The evolutionary tree-based generator that combines hypertree outline planning, MCTS-based filling, MAP-Elites niche optimization, and simulated-participant refinement.
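The MCTS-based filling step can be pictured with the standard UCT selection rule, sketched below in Python. This is the textbook formula, not the paper's evaluator, which this page does not reproduce; the `Node` class, the exploration constant `c`, and the reward values are assumptions for illustration.

```python
import math


class Node:
    """A partial context; children are candidate continuations (hypothetical)."""

    def __init__(self, text, parent=None):
        self.text = text
        self.parent = parent
        self.children = []
        self.visits = 0
        self.total_value = 0.0
        if parent is not None:
            parent.children.append(self)


def uct_score(child, parent_visits, c=1.4):
    # Unvisited children are always tried first.
    if child.visits == 0:
        return math.inf
    mean_value = child.total_value / child.visits
    exploration = c * math.sqrt(math.log(parent_visits) / child.visits)
    return mean_value + exploration


def select_child(node, c=1.4):
    """Pick the child that maximizes mean value plus exploration bonus."""
    return max(node.children, key=lambda ch: uct_score(ch, node.visits, c))


def backpropagate(node, value):
    """Add one visit and the observed value to the node and all its ancestors."""
    while node is not None:
        node.visits += 1
        node.total_value += value
        node = node.parent


root = Node("outline slot: Scene Setting")
a = Node("A coherent opening sentence.", parent=root)
b = Node("A weaker opening sentence.", parent=root)
backpropagate(a, 0.9)  # made-up evaluator scores
backpropagate(b, 0.2)
print(select_child(root).text)
```

The exploration term shrinks as a branch accumulates visits, which is how a search of this kind trades off refining a promising continuation against trying neglected ones.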

If this is right

The generated contexts display stronger narrative coherence, stylistic diversity, and assessment cues than prior LLM methods.
The combined planning, search, and evolution steps jointly raise measured quality across the six metrics.
Simulating diverse participant styles allows repeated recycling and improvement of initially weak contexts.
The method reduces dependence on scarce expert-designed contexts for creativity assessment.
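The recycling idea in the list above can be sketched as a simple filter. The Python below is a minimal sketch under stated assumptions: the participant styles, the `respond` scoring function, the mean aggregation, and the `threshold` value are all hypothetical, since this page does not specify how the paper simulates participants or decides which contexts are weak.

```python
from statistics import mean

# Hypothetical participant styles with a made-up per-cue response yield.
PARTICIPANTS = {"cautious": 0.1, "divergent": 0.3}


def respond(style, context):
    """Stand-in for a simulated participant: returns a score in [0, 1]."""
    return min(1.0, context["cues"] * PARTICIPANTS[style])


def split_by_simulated_response(contexts, styles, respond_fn, threshold=0.5):
    """Return (kept, recycled): contexts at or above the threshold are kept,
    the rest are sent back for further evolution."""
    kept, recycled = [], []
    for context in contexts:
        score = mean(respond_fn(style, context) for style in styles)
        (kept if score >= threshold else recycled).append(context)
    return kept, recycled


contexts = [{"id": "A", "cues": 4}, {"id": "B", "cues": 1}]
kept, recycled = split_by_simulated_response(contexts, PARTICIPANTS, respond)
print([c["id"] for c in kept], [c["id"] for c in recycled])
```

Averaging over several styles means a context is judged weak only if it fails to draw responses from participants in general, not from one particular style.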

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same tree-evolution pipeline could be adapted to generate contexts for other psychometric domains such as problem-solving or ethical reasoning tests.
Large-scale automated context creation might enable broader studies of how creativity changes under different human-AI collaboration conditions.
Feeding real test-taker performance data back into the refiner could close the loop between generation and validation more tightly than simulation alone.

Load-bearing premise

The six quality metrics used in experiments accurately reflect the psychometric validity and support for creative thinking required in real assessment instruments.

What would settle it

A head-to-head study in which expert raters or actual test-takers judge AlphaContext outputs as no more effective at eliciting creative responses than outputs from standard LLM generators would falsify the central performance claim.

Figures

Figures reproduced from arXiv: 2604.18398 by Aimin Zhou, Hong Qian, Jiajun Guo, Wenkai Wang, Yifei Ding, Yixuan Wang, Yue Huang, Yunzhao Wei, Zhi Liu, Zhongjing Huang.

**Figure 2.** The procedure of the proposed AlphaContext. (a) Given a context query ...

**Figure 3.** Preference evaluation of AlphaContext vs. ...

**Figure 4.** Preference evaluation of AlphaContext vs. ...

**Figure 6.** Case study on measurement-level alignment ...

**Figure 7.** Example input format of CreaTE. Each entry ...

**Figure 8.** Case comparison under the same input theme: ...

**Figure 9.** Spearman rank-correlation heatmap between ...

**Figure 10.** Extended Measurement-level Alignment Study Across Baselines. F Analysis of MCG Evaluator. This section analyzes the evaluator used in the MCTS-based Context Generator (MCG). MCG formulates long-form context generation as a sentence-level tree search under a planned outline. Each node represents a partial context, and candidate continuations are explored through MCTS. To decide which branches to expand an…

**Figure 11.** Comparison of the overall average scores ...

**Figure 12.** Effect of different MCG evaluator coefficient ...

**Figure 13.** Unified chat prompt template used by baseline LLMs for creativity context generation.

**Figure 14.** Evaluation prompt template for checklist-grounded pairwise judging across subjective metrics.

**Figure 15.** Illustrative example of an assessment-ready creativity context generated by AlphaContext, conditioned ...

Original abstract

Creativity has become a core competence in the era of LLMs and human-AI collaboration, underpinning innovation in real-world problem solving. Crucially, the systematic improvement of creativity necessitates scientifically valid assessment instruments. Psychometric research recognizes context-based assessment as an effective way to measure creative thinking. However, high-quality expert-designed contexts remain scarce. Existing LLM-based generators often struggle with insufficient assessment cues, weak narrative coherence, limited stylistic diversity, and poor support for creative thinking. To address these challenges, we propose AlphaContext, an evolutionary tree-based psychometric context generator for creativity assessment. First, the HyperTree Outline Planner formalizes expert-designed outlining as a rule-guided hypertree and performs top-down hierarchical planning. The MCTS-based Context Generator fills the outline via MCTS to balance global structure and local quality. Then, the Evolutionary Context Optimizer evolves contexts with MAP-Elites by repeatedly updating niche elites to jointly improve diversity and quality. Finally, the Assessment-Guided Evolution Refiner simulates virtual participants with diverse styles and recycles weak contexts for further evolution. Experiments show that AlphaContext yields an average improvement of 8% over competitive methods across 6 quality metrics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AlphaContext puts together hypertree planning, MCTS, MAP-Elites, and guided evolution to generate creativity-assessment contexts, but the 8% gain rests on metrics whose link to real psychometric quality is not shown.

The paper's main contribution is a four-stage pipeline that turns expert-style outlining into a hypertree, fills it with MCTS, optimizes niches with MAP-Elites, and refines via simulated participants. That specific stack for this use case is new. It directly tackles the shortage of high-quality contexts for measuring creative thinking, which matters as LLMs change how we assess human-AI collaboration skills.

Referee Report

2 major / 1 minor

Summary. The paper proposes AlphaContext, an evolutionary tree-based psychometric context generator for creativity assessment. It introduces a HyperTree Outline Planner that formalizes expert outlining as a rule-guided hypertree for top-down planning, an MCTS-based Context Generator to fill outlines while balancing structure and quality, an Evolutionary Context Optimizer using MAP-Elites to evolve contexts for diversity and quality, and an Assessment-Guided Evolution Refiner that simulates virtual participants and recycles weak contexts. The central empirical claim is that AlphaContext achieves an average 8% improvement over competitive methods across 6 quality metrics.

Significance. If the reported improvement is supported by rigorous experiments and the quality metrics are shown to align with established psychometric standards for eliciting creative thinking, the work could meaningfully address the scarcity of high-quality contexts for creativity assessment. The combination of hierarchical planning, Monte Carlo tree search, and MAP-Elites evolution represents a technically coherent approach to generating structured, diverse contexts, and the use of simulated participants for refinement is a promising direction for scalable assessment tools.

major comments (2)

[Abstract] The abstract states that experiments demonstrate an average 8% improvement across 6 quality metrics, yet provides no description of the experimental design, baseline methods, statistical tests, participant simulation protocol, or how the metrics were selected and validated. This is load-bearing for the central claim because the paper's goal is to support valid creativity assessment; without these details it is impossible to determine whether the metrics capture psychometric properties such as divergent thinking or narrative scaffolding.
[Experiments] The quality metrics are described in terms of internal properties (coherence, diversity, stylistic variety) but the manuscript supplies no correlation analysis with established creativity instruments, no inter-rater reliability data with human experts, and no ablation showing that higher metric scores produce better creative outputs from test-takers. This directly affects whether the 8% gain translates to improved assessment validity.

minor comments (1)

[Abstract] The abstract does not name the six quality metrics or the competitive baseline methods, making it difficult to interpret the reported improvement without consulting the full experimental section.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the potential significance of AlphaContext in addressing the scarcity of high-quality contexts for creativity assessment. We address each major comment point by point below, providing clarifications from the full manuscript and outlining targeted revisions to strengthen the presentation of our experimental claims and metric validation.

Referee: [Abstract] The abstract states that experiments demonstrate an average 8% improvement across 6 quality metrics, yet provides no description of the experimental design, baseline methods, statistical tests, participant simulation protocol, or how the metrics were selected and validated. This is load-bearing for the central claim because the paper's goal is to support valid creativity assessment; without these details it is impossible to determine whether the metrics capture psychometric properties such as divergent thinking or narrative scaffolding.

Authors: We agree that the abstract's brevity omits key experimental details, which can make the central claim harder to evaluate at first glance. The full manuscript (Section 4) details the experimental design, including the use of simulated virtual participants with diverse styles, the specific baseline methods (prior LLM-based generators and rule-based approaches), statistical tests for significance, the MCTS-based filling and MAP-Elites evolution protocols, and the rationale for selecting the six metrics (coherence, diversity, stylistic variety, assessment cue strength, narrative scaffolding, and creative thinking support) based on psychometric literature. To resolve this, we will revise the abstract to include a concise summary of the experimental setup, baselines, and metric selection criteria, ensuring the 8% improvement claim is presented with sufficient context while adhering to length limits.

revision: yes

Referee: [Experiments] The quality metrics are described in terms of internal properties (coherence, diversity, stylistic variety) but the manuscript supplies no correlation analysis with established creativity instruments, no inter-rater reliability data with human experts, and no ablation showing that higher metric scores produce better creative outputs from test-takers. This directly affects whether the 8% gain translates to improved assessment validity.

Authors: We acknowledge that our evaluation relies on intrinsic metrics without direct empirical correlation to external instruments such as the Torrance Tests of Creative Thinking or human inter-rater reliability data. The metrics were selected to operationalize established psychometric properties (e.g., coherence for narrative scaffolding and diversity for divergent thinking), as justified in Section 3 and the related work. However, the manuscript does not include correlation analyses, inter-rater studies, or explicit ablations linking metric scores to test-taker creative outputs. We will add a dedicated subsection in the Experiments section to elaborate on the theoretical grounding of the metrics, include any available internal consistency or ablation results from our evolutionary optimization, and explicitly discuss the absence of human validation as a limitation with outlined directions for future work. This will be a partial revision, as conducting full human inter-rater and correlation studies exceeds the scope of the current computational experiments but can be noted for follow-up research.

revision: partial

Circularity Check

0 steps flagged

No circularity: empirical claims rest on external baselines and algorithmic descriptions, not self-referential definitions or fits

The paper describes an algorithmic pipeline (HyperTree Outline Planner, MCTS Context Generator, MAP-Elites Evolutionary Optimizer, Assessment-Guided Refiner) and reports an 8% average improvement over external competitive methods on six quality metrics. No equations appear that define a quantity in terms of itself or rename a fitted parameter as a prediction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The evaluation is framed as comparison against independent baselines rather than quantities derived from the system's own parameters. The derivation chain is therefore self-contained and does not reduce to tautology by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract contains no explicit free parameters, mathematical axioms, or newly postulated entities; the system is described as an application of existing techniques (MCTS, evolutionary algorithms, LLMs) to context generation.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages

[1] Assessment of divergent thinking by means of the subjective top-scoring method: Effects of the number of top-ideas and time-on-task on reliability and validity. Psychology of Aesthetics, Creativity, and the Arts, 7(4):341. A.B. Crabbe. 1989. The future problem solving program. Educational Leadership, 7(1):27–29. Anne Borland Crabbe. 1982. Creating a brigh...

[2] Creativity and the finding and solving of real-world problems. Journal of Psychoeducational Assessment, 9(1):45–53. Kai Ruan, Xuan Wang, Jixiang Hong, and Hao Sun

[3] Liveideabench: Evaluating llms’ scientific creativity and idea generation with minimal context. arXiv e-prints, pages arXiv–2412. M.A. Runco and S. Acar. 2012. Divergent thinking as an indicator of creative potential. Creativity Research Journal, 24(1):66–75. Mark A Runco and Garrett J Jaeger. 2012. The standard definition of creativity. Creativity research...

[4] Random tree model of meaningful memory. Physical Review Letters, 134(23):237402. Weishun Zhong, Doron Sivan, Tankut Can, Mikhail Katkov, and Misha Tsodyks. 2026. Semantic chunking and the entropy of natural language. CoRR, abs/2602.13194. Appendix A CreaTE Dataset AlphaContext takes a title and a theme as input, so evaluation requires inputs that are exp...

[5] [Plan] -> [Anchor][Scene Setting][Characters & Interaction][Conflict & Challenge][Open Task] # The plan can be divided into five core narrative aspects

[6] [Anchor] -> [Future Horizon][Place][Scale][Challenge Seeds 1] # Establishes the fundamental time, space, and scope coordinates

[7] [Scene Setting] -> [Scenario Frame][Constraint Hints][Challenge Seeds 2] # Defines context and constraints

[8] [Characters & Interaction] -> [Interaction Goal][Dispute Focus][Problem Slot][Challenge Seeds 3] # Constructs interpersonal dynamics

[9] [Conflict & Challenge] -> [Challenge Seeds 4][Creativity Triggers] # Introduces complicating factors

[10] [Open Task] -> [Challenge Identification][Solution Exploration] # Defines the student's objective.

# ==================== Part 2: Dynamic Selection Nodes (LLM-Driven) ====================
# Logic A: When this node is selected for expansion, the LLM selects one option from the predefined candidate pool based on theme relevance

[11] [Future Horizon] -> {NearFuture (5-15y) | MidFuture | FarFuture | Speculative}

[12] [Scale] -> {Community | National | International | Space}

[13] [Scenario Frame] -> {Everyday Life | Urban Infrastructure | Virtual-Reality Fusion | ...}

[14] [Interaction Goal] -> {Co-creation Workshop | Negotiation | Emergency Response | ...}

[15] [Dispute Focus] -> {Value Conflict | Resource Conflict | Trust Conflict | ...}

[16] [Creativity Triggers] -> {Uncertainty | Contradiction | Resource Constraints | ...}

# Logic B: When this node is selected for expansion, the LLM selects multiple options from the predefined candidate pool based on theme relevance

[17] [Challenge Seeds 1] -> {{Select 2-3 seeds from Pool}}

[18] [Challenge Seeds 2] -> {{Select 2-3 seeds from Pool}}

[19] [Challenge Seeds 3] -> {{Select 3-4 seeds from Pool}}

[20] [Challenge Seeds 4] -> {{Select 4-5 seeds from Pool}}

[21] [Topic Phrase] -> {{LLM-generated phrase (6-8 words)}} # Summarizes the core conflict based on Title/Theme.

[22] [Constraint Hints] -> {{Select 2-3 from: Policy, Budget, Time Limit, Safety, etc.}} # Limits the solution space.

Listing 1: Formal definitions of static rules and dynamic LLM-driven selection rules in the HyperTree Outline Planner.

ϕ3(C) in a structured JSON format. We then discretize the continuous behavior space uniformly to construct a 3D grid archiv...
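The static rules and single-choice rules listed above can be read as a small grammar that is expanded top-down. The Python below is a minimal sketch of that reading, not the authors' planner: only the two candidate pools that appear in full in the listing are encoded, every other symbol is left unfilled, and the `choose` callback is a hypothetical stand-in for the LLM's theme-based selection.

```python
# Static rules copied from the listing: a symbol expands to a fixed sequence.
STATIC_RULES = {
    "Plan": ["Anchor", "Scene Setting", "Characters & Interaction",
             "Conflict & Challenge", "Open Task"],
    "Anchor": ["Future Horizon", "Place", "Scale", "Challenge Seeds 1"],
    "Scene Setting": ["Scenario Frame", "Constraint Hints", "Challenge Seeds 2"],
    "Characters & Interaction": ["Interaction Goal", "Dispute Focus",
                                 "Problem Slot", "Challenge Seeds 3"],
    "Conflict & Challenge": ["Challenge Seeds 4", "Creativity Triggers"],
    "Open Task": ["Challenge Identification", "Solution Exploration"],
}

# Single-choice pools ("Logic A"); only pools listed without "..." are included.
SELECT_ONE = {
    "Future Horizon": ["NearFuture (5-15y)", "MidFuture", "FarFuture", "Speculative"],
    "Scale": ["Community", "National", "International", "Space"],
}


def expand(symbol, choose):
    """Expand a symbol top-down into a nested outline.

    choose(symbol, options) is a hypothetical selector standing in for the LLM.
    """
    if symbol in STATIC_RULES:
        return {symbol: [expand(child, choose) for child in STATIC_RULES[symbol]]}
    if symbol in SELECT_ONE:
        return {symbol: choose(symbol, SELECT_ONE[symbol])}
    # Pools that are truncated or undefined in the listing stay unfilled.
    return {symbol: None}


def first_option(symbol, options):
    # Deterministic placeholder; a real selector would weigh theme relevance.
    return options[0]


outline = expand("Plan", first_option)
for section in outline["Plan"]:
    print(section)
```

Separating fixed structure from selectable slots keeps every outline on the same five-part skeleton while still letting the selector vary the details per theme.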
