arxiv: 2601.07251 · v5 · submitted 2026-01-12 · 💻 cs.HC

MeepleLM: A Virtual Playtester Simulating Diverse Subjective Experiences

Zizhen Li , Chuanhao Li , Yibin Wang , Yukang Feng , Jianwen Sun , Jiaxin Ai , Fanrui Zhang , Mingzhu Sun

show 2 more authors

Yifei Huang Kaipeng Zhang

This is my paper

Pith reviewed 2026-05-16 15:48 UTC · model grok-4.3

classification 💻 cs.HC

keywords board game designvirtual playtestinglarge language modelsplayer personasMDA frameworksubjective critiquegame rules analysishuman-AI collaboration

0 comments p. Extension

The pith

MeepleLM simulates diverse player experiences to critique board games more accurately than GPT-5.1 or Gemini3-Pro.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces MeepleLM as a virtual playtester that generates constructive critique for board games by simulating how different player types would experience them. The authors assemble a dataset of over 1,700 rulebooks and 150,000 reviews, enrich it with mechanics-dynamics-aesthetics reasoning to link rules to feelings, and distill distinct player personas into the model. A sympathetic reader would care because the resulting system lets designers obtain reliable external perspectives on their work without assembling real play groups, reducing bias in human-AI creative collaboration. Experiments show the model achieves a 70 percent user preference rate over leading commercial models for critique quality and community alignment.

Core claim

MeepleLM is a specialized model that internalizes persona-specific reasoning patterns to accurately simulate the subjective feedback of diverse player archetypes. Trained on a curated collection of structurally corrected rulebooks and quality-filtered reviews augmented with MDA reasoning, it bridges the causal gap between written rules and emergent player experiences without requiring an explicit game engine. User studies confirm it outperforms commercial models such as GPT-5.1 and Gemini3-Pro in community alignment and critique utility, reaching a 70 percent preference rate.

What carries the argument

MeepleLM, a large language model fine-tuned on MDA-augmented rulebook data and distilled player personas that enables direct mapping from game rules to heterogeneous subjective experiences.

If this is right

Designers obtain immediate, persona-specific feedback on new rule sets without physical playtesting sessions.
Human-AI collaboration in game creation becomes more reliable by steering outputs toward audience-aligned experiences.
The same training approach can serve as a virtual playtester for other interactive systems beyond board games.
Creative teams gain a tool to identify biased or unpredictable model behaviors early in the design process.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar persona distillation and MDA-style augmentation could extend to video games or digital interactive media where rules and subjective experience also interact.
Real-world deployment in design software would let creators iterate on prototypes with simulated feedback in minutes rather than days.
The method highlights a general pattern for making language models more useful in domains that require modeling diverse user reactions rather than single correct answers.

Load-bearing premise

That the distilled player personas and MDA-augmented dataset accurately capture and generalize the latent dynamics connecting rules to heterogeneous subjective player experiences without an explicit game engine.

What would settle it

A blind preference study on new board games outside the training set in which participants rate MeepleLM critiques no higher than those from GPT-5.1 or Gemini3-Pro on utility or alignment with actual community reviews.

Figures

Figures reproduced from arXiv: 2601.07251 by Chuanhao Li, Fanrui Zhang, Jianwen Sun, Jiaxin Ai, Kaipeng Zhang, Mingzhu Sun, Yibin Wang, Yifei Huang, Yukang Feng, Zizhen Li.

**Figure 1.** Figure 1: Overview of MeepleLM. Acting as a Virtual Playtester, the model offers a rapid, automated alternative to the resource-intensive Human Play loop. By leveraging MDA-Reasoning to infer latent dynamics from Static Rulebooks, MeepleLM generates PersonaAligned Critiques tailored to diverse player archetypes. captivating a vast global audience (Rodríguez, 2025). Recently, the rapid development of Large Language… view at source ↗

**Figure 2.** Figure 2: The Data Construction Pipeline. We structure 1,727 board game rulebooks and filter 1.8M reviews via multi-dimensional quality assessment, yielding 150K high-quality critiques for persona discovery [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Impact of the Filtering Strategy. Our strategy enhances MDA scores and semantic coverage while preserving the original rating distribution. source text, correcting logical inconsistencies or formatting errors (see the rectification prompt in Appendix B.3). 3.3 Review Filtering We aggregated a raw corpus of 1.8 million ratingcomment pairs from multiple online communities (detailed in Appendix C.1). To ref… view at source ↗

**Figure 4.** Figure 4: Tier-wise Prediction Alignment. MeepleLM shows a sharp diagonal concentration, effectively distinguishing quality tiers [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Rating Density Distributions. MeepleLM demonstrates superior distributional fidelity by accurately recovering the high variance of human consensus. to correctly rank games based on their perceived quality (Kendall, 1938). Beyond Ranking: Capturing Community Diversity. As summarized in [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Distributions of Key Metadata Attributes. for the 1,717 selected games. The dataset covers a wide spectrum of difficulty (a), focuses on decent-to-excellent quality games (b), emphasizes modern board game designs (c), and spans both elite and long-tail rankings (d). (a) Top 20 Mechanics (b) Top 20 Themes (c) Mechanics Word Cloud (d) Themes Word Cloud [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗

**Figure 7.** Figure 7: Analysis of Game Content. (a) and (b) display the top 10 mechanics and themes, demonstrating that the dataset covers the fundamental building blocks of modern board games. (c) and (d) provide a holistic view of the terminological diversity present in the corpus [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗

**Figure 8.** Figure 8: System Prompt for Structuring Raw Rulebooks. It enforces a standard Markdown schema and strict grounding to the source text [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗

**Figure 9.** Figure 9: Example of Structured Rulebook Data. This content spans multiple pages, preserving the full details extracted from the original PDF [PITH_FULL_IMAGE:figures/full_fig_p019_9.png] view at source ↗

**Figure 10.** Figure 10: Rectification Prompt Used by GPT-5.1. It requires the model to cross-reference the generated draft against the source text to ensure numerical accuracy and logical completeness [PITH_FULL_IMAGE:figures/full_fig_p020_10.png] view at source ↗

**Figure 11.** Figure 11: Rating Correlation (Original vs. Filtered) [PITH_FULL_IMAGE:figures/full_fig_p021_11.png] view at source ↗

**Figure 12.** Figure 12: Word Count Statistics by Rating [PITH_FULL_IMAGE:figures/full_fig_p021_12.png] view at source ↗

**Figure 13.** Figure 13: Annotation Prompt for Scoring Review Quality. It mirrors the observation-analysis-iteration loop [PITH_FULL_IMAGE:figures/full_fig_p023_13.png] view at source ↗

**Figure 14.** Figure 14: Visualization of Composite Embeddings. Colors indicate the 15 initial clusters, which were later merged into the 5 final personas by domain experts [PITH_FULL_IMAGE:figures/full_fig_p024_14.png] view at source ↗

**Figure 15.** Figure 15: shows the distribution and average rating for each group [PITH_FULL_IMAGE:figures/full_fig_p024_15.png] view at source ↗

**Figure 16.** Figure 16: Behavioral Profiles of the Five Discovered Personas. Each profile defines core motivations and specific likes/dislikes regarding game mechanisms [PITH_FULL_IMAGE:figures/full_fig_p026_16.png] view at source ↗

**Figure 17.** Figure 17: Profiling Prompt for Interpreting Cluster-Central Samples. This qualitative analysis guided the definition of the final 5 personas. Prompt: Persona Labeling (GPT-5.1) You are an expert User Researcher specialized in board games. Your task is to classify board game players into EXACTLY ONE of the following 5 distinct personas. ### VALID PERSONAS (Choose ONE) {PERSONA_DEFINITIONS} ### TASK Analyze the revie… view at source ↗

**Figure 18.** Figure 18: Labeling Prompt for Annotating the Full Dataset. It maps each review to one of the 5 finalized personas based on the review’s content and sentiment [PITH_FULL_IMAGE:figures/full_fig_p027_18.png] view at source ↗

**Figure 19.** Figure 19: Detailed Mechanism Preferences (Part 1). Analysis of System Purist, Efficiency Essentialist, and Narrative Architect [PITH_FULL_IMAGE:figures/full_fig_p028_19.png] view at source ↗

**Figure 20.** Figure 20: Detailed Mechanism Preferences (Part 2). Analysis of Social Lubricator and Thrill Seeker. The high “Lift” values (e.g., 62.4x for Prisoner’s Dilemma) indicate strong predictive power of these features for persona identification [PITH_FULL_IMAGE:figures/full_fig_p029_20.png] view at source ↗

**Figure 21.** Figure 21: Prompt for the Consistency Verifier. The model audits synthesized reasoning chains to ensure they are factually grounded in the review text and causally aligned with the ground-truth rating [PITH_FULL_IMAGE:figures/full_fig_p031_21.png] view at source ↗

**Figure 22.** Figure 22: Instruction Prompt for Extracting Latent MDA Reasoning. The Teacher Model uses this prompt to extract the latent MDA reasoning chain from raw reviews [PITH_FULL_IMAGE:figures/full_fig_p032_22.png] view at source ↗

**Figure 23.** Figure 23: Sample of a Generated Reasoning Chain. A sample derived from a review of El Grande. The generated reasoning chain correctly identifies the reviewer as an analytical veteran who reveres elegant, deterministic game design and functional purity [PITH_FULL_IMAGE:figures/full_fig_p033_23.png] view at source ↗

**Figure 24.** Figure 24: Full Inference Prompt Structure. The model receives a System Message defining the persona and guidelines, followed by a User Message containing the specific game rules and the final trigger instructions to ensure formatting compliance [PITH_FULL_IMAGE:figures/full_fig_p035_24.png] view at source ↗

**Figure 25.** Figure 25: System Prompt for Factual Verification. The judge strictly compares mechanical claims in the review against the ground-truth rulebook, ignoring subjective opinions [PITH_FULL_IMAGE:figures/full_fig_p036_25.png] view at source ↗

**Figure 26.** Figure 26: System Prompt for Perspective Diversity Scoring. The judge evaluates a batch of reviews to determine if the model exhibits semantic collapse (repeating the same points) or true diversity (shifting focus across MDA layers) [PITH_FULL_IMAGE:figures/full_fig_p037_26.png] view at source ↗

**Figure 27.** Figure 27: Instruction for Ground Truth Mining. In the first stage, the model iteratively processes human reviews to build a deduplicated checklist of distinct opinions (VGT ) [PITH_FULL_IMAGE:figures/full_fig_p038_27.png] view at source ↗

**Figure 28.** Figure 28: Instruction for Semantic Matching. In the second stage, the judge verifies whether the viewpoints in the ground truth checklist are present in the model’s generated reviews [PITH_FULL_IMAGE:figures/full_fig_p039_28.png] view at source ↗

read the original abstract

Recent advancements have expanded the role of Large Language Models in board games from playing agents to creative co-designers. However, a critical gap remains: current systems lack the capacity to offer constructive critique grounded in the emergent user experience. Bridging this gap is fundamental for harmonizing Human-AI collaboration, as it empowers designers to refine their creations via external perspectives while steering models away from biased or unpredictable outcomes. Automating critique for board games presents two challenges: inferring the latent dynamics connecting rules to gameplay without an explicit engine, and modeling the subjective heterogeneity of diverse player groups. To address these, we curate a dataset of 1,727 structurally corrected rulebooks and 150K reviews selected via quality scoring and facet-aware sampling. We augment this data with Mechanics-Dynamics-Aesthetics (MDA) reasoning to explicitly bridge the causal gap between written rules and player experience. We further distill player personas and introduce MeepleLM, a specialized model that internalizes persona-specific reasoning patterns to accurately simulate the subjective feedback of diverse player archetypes. Experiments demonstrate that MeepleLM significantly outperforms latest commercial models (e.g., GPT-5.1, Gemini3-Pro) in community alignment and critique quality, achieving a 70% preference rate in user studies assessing utility. MeepleLM serves as a reliable virtual playtester for general interactive systems, marking a pivotal step towards audience-aligned, experience-aware Human-AI collaboration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces MeepleLM, a specialized LLM for virtual playtesting of board games. It curates a dataset of 1,727 rulebooks and 150K reviews, augments it with MDA reasoning to connect rules to experiences, distills player personas, and claims the resulting model outperforms commercial LLMs (GPT-5.1, Gemini3-Pro) in community alignment and critique quality, achieving a 70% user-study preference rate for utility as a virtual playtester for general interactive systems.

Significance. If the empirical results are substantiated with rigorous controls and generalization tests, the work could meaningfully advance HCI and AI-assisted design by enabling scalable simulation of heterogeneous player experiences, reducing dependence on physical playtests and supporting more audience-aligned iterative design.

major comments (3)

[Abstract] Abstract: the headline claim of significant outperformance and 70% preference rate supplies no information on experimental controls, statistical tests, baseline prompting or fine-tuning details, data splits, participant recruitment, or inter-annotator agreement, rendering the data-to-claim link unevaluable.
[Dataset and Evaluation] Dataset construction and evaluation sections: training and augmentation occur exclusively on a fixed corpus of 1,727 rulebooks; no out-of-distribution evaluation on unseen rule sets is reported, leaving open the risk that the model memorizes review patterns rather than inferring generalizable latent rule-to-experience mappings.
[Method] Method: the central assumption that MDA augmentation plus persona distillation captures emergent subjective heterogeneity without an explicit game engine is not supported by any ablation that isolates the contribution of these components to the reported gains.

minor comments (2)

[Model Training] Clarify the base model, exact fine-tuning hyperparameters, and loss formulation used for MeepleLM.
[Presentation] Add explicit references to all tables and figures in the text and ensure figure captions are self-contained.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. Their comments highlight important aspects of experimental rigor and generalizability that we will address in the revision. We provide point-by-point responses below.

read point-by-point responses

Referee: [Abstract] Abstract: the headline claim of significant outperformance and 70% preference rate supplies no information on experimental controls, statistical tests, baseline prompting or fine-tuning details, data splits, participant recruitment, or inter-annotator agreement, rendering the data-to-claim link unevaluable.

Authors: We agree that the abstract, due to its brevity, does not include these details. In the revised manuscript, we will update the abstract to briefly note the user study methodology, including recruitment from board game communities and use of statistical tests, and reference the detailed experimental setup in the main text for full information on baselines, data splits, and inter-annotator agreement. This will make the claims more evaluable. revision: yes
Referee: [Dataset and Evaluation] Dataset construction and evaluation sections: training and augmentation occur exclusively on a fixed corpus of 1,727 rulebooks; no out-of-distribution evaluation on unseen rule sets is reported, leaving open the risk that the model memorizes review patterns rather than inferring generalizable latent rule-to-experience mappings.

Authors: This is a valid concern regarding generalizability. Although our corpus is diverse, we will add an out-of-distribution evaluation in the revision by testing on rulebooks not included in the original training set and comparing against corresponding community feedback to better demonstrate the model's ability to infer latent mappings. revision: yes
Referee: [Method] Method: the central assumption that MDA augmentation plus persona distillation captures emergent subjective heterogeneity without an explicit game engine is not supported by any ablation that isolates the contribution of these components to the reported gains.

Authors: We will include ablation studies in the revised version to isolate the contributions of MDA augmentation and persona distillation. We will compare the full model against variants without each component, reporting the impact on critique quality and alignment metrics. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on external user studies and model comparisons

full rationale

The paper's chain proceeds from dataset curation (1,727 rulebooks + 150K reviews), MDA augmentation, persona distillation, and training of MeepleLM to external validation via comparisons against GPT-5.1/Gemini3-Pro and 70% preference rates in independent user studies. No equations, fitted parameters, or self-citations are shown to reduce any prediction or uniqueness claim to the inputs by construction. The evaluation metrics are drawn from held-out human assessments rather than internal consistency or self-referential definitions, keeping the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 2 invented entities

The central claim depends on the unverified assumption that curated reviews plus MDA labels suffice to train a model that generalizes subjective experience across unseen rules and player types; the LLM weights themselves are free parameters fitted to this data.

free parameters (1)

LLM training hyperparameters and weights
All model parameters fitted during fine-tuning on the 1,727 rulebooks and 150K reviews.

axioms (2)

domain assumption MDA framework can explicitly bridge written rules to emergent player experience
Invoked to augment the dataset and close the causal gap between rules and subjective feedback.
domain assumption Distilled personas faithfully represent the subjective heterogeneity of real player groups
Core premise for modeling diverse experiences without direct observation of every player type.

invented entities (2)

Player personas no independent evidence
purpose: To encode distinct subjective reasoning patterns for simulation
Distilled from review data to capture heterogeneity; no independent verification outside the training set is described.
MeepleLM no independent evidence
purpose: Specialized model that internalizes persona-specific critique generation
The trained system introduced by the paper; its performance is the central empirical claim.

pith-pipeline@v0.9.0 · 5589 in / 1564 out tokens · 81478 ms · 2026-05-16T15:48:18.588537+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We augment this data with Mechanics-Dynamics-Aesthetics (MDA) reasoning to explicitly bridge the causal gap between written rules and player experience. We further distill player personas and introduce MeepleLM...
IndisputableMonolith/Foundation/DimensionForcing.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We employ an MDA-Guided Reasoning strategy... latent intermediate sequence Z that explicitly traces the causal path from Mechanics to Dynamics, and finally to Aesthetics.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages

[1]

Do NOT add any external knowledge or hallucinate rules

Source Only: You must ONLY use information present in the input text. Do NOT add any external knowledge or hallucinate rules

work page
[2]

Format: Output the result in clean Markdown format using the specific headers defined below

work page
[3]

Not Mentioned

Completeness: If a section is not mentioned in the text, write "Not Mentioned" under that header. Required Output Structure: ## 1. Lore & Objective (Extract the thematic setting and victory conditions) ## 2. Components (List physical assets like cards, boards, tokens) ## 3. Setup (Step-by-step initial configuration) ## 4. Gameplay Flow (Turn structure, ph...

work page
[4]

- Place pawn on top-left entrance space

**Player Setup**: - Each player takes: - 30 Clank! cubes (personal supply) - Matching pawn - 10-card starting deck - Shuffle deck, draw 5 cards. - Place pawn on top-left entrance space

work page
[5]

- 3 players: Shuffle and remove 1 randomly

**Board Setup**: - **Artifacts**: Place 7 face up on numbered spaces. - 3 players: Shuffle and remove 1 randomly. - 2 players: Remove 2 randomly. - **Major Secrets**: Shuffle face down, place 1 on each marked space. Return extras to box. - **Minor Secrets**: Shuffle face down, place 2 stacked on each marked space. - **Market Items**: - 2 Master Keys (stac...

work page
[6]

--- ### 4

**First Player**: Roll die or choose randomly. --- ### 4. Gameplay Flow Each turn follows this sequence:

work page
[7]

**Play All Cards**: Play all 5 cards in hand, in any order

work page
[8]

**Take Actions** (in any order, multiple times): - Acquire Cards (Skill) - Use Devices (Skill) - Fight Monsters (Swords) - Buy Market Items (Gold) - Move (Boots) - Gain Gold / Clank! / Secrets / Artifact

work page
[9]

- Draw 5 new cards (shuffle discard if needed)

**End of Turn Phase**: - Discard all played cards. - Draw 5 new cards (shuffle discard if needed). - Refill Dungeon Row to 6 cards. - **If refill reveals Dragon Attack symbol**: - Dragon attacks once (even if multiple symbols). - Move all Clank! cubes to Dragon Bag. - Draw cubes = ’Dragons Rage Track value (+1 per Danger symbol). - Black cubes: set aside....

work page
[10]

**Dragon Rage Triggers**: - +1 space when: - Artifact is taken - Dragon Egg (Minor Secret) is revealed

work page
[11]

**Exit or Knockout Triggers Countdown**: - First player to exit dungeon or be knocked out moves to Countdown Track. - On their next turn, they move forward and trigger effects: - Spaces –24: Instant Dragon Attack with +1, +2, +3 extra draws - Space 5: All remaining players in dungeon are instantly knocked out

work page
[12]

Core Mechanics #### Resources - **Skill**: Acquire cards from Dungeon Row or Reserve

**Game End**: Triggered when: - Countdown Track reaches Skull - All players have exited or been knocked out --- ### 5. Core Mechanics #### Resources - **Skill**: Acquire cards from Dungeon Row or Reserve. - **Swords**: Fight monsters (Dungeon Row or tunnels). - **Boots**: Move 1 tunnel per Boot. #### Movement Rules - **Normal Tunnel**: 1 Boot - **Footprin...

work page
[13]

Highest-value Artifact

work page
[14]

Draft Rulebook

(Unspecified if still tied) --- ### 7. FAQ or Edge Cases - **Can you drop an Artifact?** No. Once picked up, ’its yours until game end. - **Can you buy multiple Market items per turn?** Yes, any number, even same type. - **Can you take multiple tokens per room?** Only one per entry. Must exit and re-enter. - **Does Artifact in Market cost Gold?** No. Only...

work page
[15]

Source Content: Raw text parsed from the official rulebook PDF

work page
[16]

Rectification Tasks:

Draft Rulebook: A structured Markdown version generated by an automated parser. Rectification Tasks:

work page
[17]

deal 5 cards

Accuracy Check: Compare specific numbers, card counts, and setup instructions. If the Draft says "deal 5 cards" but the Source says "deal 4", CORRECT it

work page
[18]

End Game Triggers

Completeness: Ensure no critical sections (especially "End Game Triggers" or "Tie-Breakers ") are missing from the Draft

work page
[19]

Logical Consistency: Fix any contradictions introduced during the structuring process

work page
[20]

Here is the fixed version

Formatting: Ensure the output strictly follows the standardized Markdown hierarchy. Constraints: - Output ONLY the fully rectified Rulebook in Markdown. - Do NOT add conversational text (e.g., "Here is the fixed version"). - Do NOT invent rules not present in the Source Content. Input Data: [SOURCE CONTENT]: {RAW_SOURCE_TEXT} [DRAFT RULEBOOK]: {QWEN_GENER...

work page
[21]

Fun game

**mechanism_anchoring** (Specificity) - 1 (Vague): "Fun game", "Good strategy". No specific terms. - **2 (Basic):** Mentions basic components like "cards", "board", "points" but no mechanism names. - 3 (Generic): Mentions standard mechanics e.g., "worker placement", "deck building". - **4 (Detailed):** Describes specific game flow or unique twists but mis...

work page
[22]

I hated it

**causal_attribution** (Reasoning - CORE METRIC) - 1 (No Logic): "I hated it." / "Best game ever." (Pure emotion). - **2 (Implied):** "It's too long and boring." (Reason implies cause, but vague). - 3 (Simple Link): "I didn't like it *because* the downtime was too long." (Direct X->Y). - **4 (Strong Logic):** "The downtime is caused by the analysis paraly...

work page
[23]

The combat feels unfair

**constructiveness** (Utility) - 1 (Useless): Empty complaints or blind praise. - **2 (Valid Complaint):** "The combat feels unfair." (Identifies a problem area, but subjective). - 3 (Actionable): "The endgame drags on too long." (Specific pain point). - **4 (Analytical):** "The blue faction is strong because of their starting resource." ( Analyzes *why* ...

work page
[24]

THE SYSTEM PURIST LOVED Games Examples: Guards of Atlantis II, 1817, Pax Renaissance: 2nd Edition, Age of Innovation... Key Mechanics: •Hand Management(Freq: 33%, Lift:1.5x) •End Game Bonuses(Freq: 20%, Lift:4.6x) •Auction / Bidding(Freq: 17%, Lift:3.0x) •Area Movement(Freq: 13%, Lift:2.1x) •Market (Freq: 13%, Lift:13.5x) HATED Games Examples: The Werewol...

work page
[25]

THE EFFICIENCY ESSENTIALIST LOVED Games Examples: Aeon’s End: The New Age, Maria, The Crew: Mission Deep Sea, 7 Wonders (Second Edition)... Key Mechanics: •Hand Management(Freq: 43%, Lift:1.9x) •Area Majority / Influence(Freq: 27%, Lift:2.2x) •Dice Rolling(Freq: 20%, Lift:0.7x) •Campaign / Battle Card Driven(Freq: 17%, Lift:4.6x) •Cooperative Game(Freq: 1...

work page
[26]

THE NARRATIVE ARCHITECT LOVED Games Examples: Aeon Trespass: Odyssey, Skull, BattleCON: Devastation of Indines, Aeon’s End: The New Age... Key Mechanics: •Cooperative Game(Freq: 37%, Lift:2.9x) •Hand Management(Freq: 30%, Lift:1.3x) •Dice Rolling(Freq: 23%, Lift:0.8x) •Auction / Bidding(Freq: 17%, Lift:3.0x) •Area Majority / Influence(Freq: 17%, Lift:1.4x...

work page
[27]

THE SOCIAL LUBRICATOR LOVED Games Examples: American Rails, Napoléon: The Waterloo Campaign, 1815, Egizia, Arydia: The Paths We Dare Tread... Key Mechanics: •Dice Rolling(Freq: 40%, Lift:1.4x) •Area Majority / Influence(Freq: 30%, Lift:2.5x) •End Game Bonuses(Freq: 17%, Lift:3.8x) •Cooperative Game(Freq: 17%, Lift:1.3x) •Area Movement(Freq: 13%, Lift:2.1x...

work page
[28]

Senior Logic Audi- tor,

THE THRILL SEEKER LOVED Games Examples: 20th Century, Advanced Squad Leader, Gaia Project, Empires in Arms... Key Mechanics: •Dice Rolling(Freq: 23%, Lift:0.8x) •End Game Bonuses(Freq: 17%, Lift:3.8x) •Hexagon Grid(Freq: 17%, Lift:2.2x) •Area Movement(Freq: 17%, Lift:2.6x) •Contracts(Freq: 13%, Lift:4.1x) HATED Games Examples: Lanterns: The Harvest Festiv...

work page
[29]

Slow Thinking

We enabled the "Slow Thinking" mecha- nism, which incorporates the generated Chain-of- Thought tokens into the loss calculation, ensuring the model optimizes the reasoning process along- side the final output. Hyperparameter Value Model & Environment Backbone Model Qwen-3-8B Framework LLaMA-Factory Context Window 16,384 tokens Attention Mechanism Flash At...

work page
[30]

What")** - Does

**Grounding Check (The "What")** - Does "Step 1: content_extraction" only contain facts explicitly present in the User Review? - [CRITICAL]: Reject if it cites game mechanics/rules that are NOT mentioned in the user's text (Hallucination)

work page
[31]

How")** - Does

**Causal Logic Check (The "How")** - Does "Step 2: dynamic_interaction" logically follow from the mechanics identified?

work page
[32]

Feel")** - Does

**Sentiment Alignment (The "Feel")** - Does "Step 3: experience_outcome" logically support the **Ground Truth Rating**? - [FAILURE MODE A]: The chain describes the experience as "frustrating" or "broken", but the Rating is High (>7). -> REJECT. - [FAILURE MODE B]: The chain describes "thrilling tension", but the Rating is Low (<4). -> REJECT. ### DECISION...

work page
[33]

The prompt for this step is shown in Figure 27

Ground Truth Mining:First, we employ the LLM as a qualitative analyst to extract a set of distinct, non-redundant viewpoints (VGT ) from the real human reviews in the test set. The prompt for this step is shown in Figure 27

work page
[34]

The prompt is presented in Figure 28

Semantic Matching:Next, we use a seman- tic match evaluator to determine which view- points in VGT are successfully covered by the model’s generated reviews. The prompt is presented in Figure 28. The finalOpinion Recovery Rate (Op-Rec)is calculated as the ratio of unique viewpoints suc- cessfully recalled by the simulation: Op-Rec= |Vmatched| |VGT | ×100%...

work page
[35]

Guilty Pleasures

**Persona is a Bias, Not a Straitjacket:** - This persona represents your *general* gaming preferences, but real players are complex. Do not act like a one-dimensional caricature. - It is possible for a player to have **"Guilty Pleasures"** (e.g., enjoying a game that goes against their usual type) or **"Unexpected Disappointments"** (e.g., disliking a ga...

work page
[36]

{target_persona}

**Embrace Diversity:** - Within the "{target_persona}" group, there is a wide spectrum of opinions. - Some players are **purists** (rejecting anything outside their genre), while others are **omnivorous** (appreciating good design regardless of genre). - You have the freedom to simulate any point on this spectrum

work page
[37]

The voting mechanic caused a hilarious shouting match

**Ground the Review in Dynamics & Authentic Feeling:** - Do not just list mechanics; describe the **interactions** they created at the table (e.g., "The voting mechanic caused a hilarious shouting match" vs "There is a voting mechanic "). - Connect these dynamics to your **emotional response**. Did the game feel tense? Frustrating? Triumphant? - Your rati...

work page
[38]

Guilty Pleasure

**Determine Your Stance:** As **{target_persona}**, how does this specific game land for you? - Is it a **"Guilty Pleasure"**? (e.g., "I usually hate party games, but this mechanic made me laugh.") - Is it a **"Respectful Pass"**? (e.g., "Great design, just not for me. ") - Is it a **"Perfect Match"** or a **"Design Failure"**?

work page
[39]

- Avoid generic stereotypes

**Write the Review:** - Focus on the **dynamics** (interactions at the table) and **emotions** (tension, joy, frustration). - Avoid generic stereotypes. Write like a real person with complex tastes

work page
[40]

persona":

**Output:** Output ONLY the valid JSON object. **Required Output Template:** ```json{ "persona": "{target_persona}", "rating": [Integer], "review": "[Your review text...]"} CONSTRAINTS: Length: Target 150-200 words, but significant variance (20-400 words) is mandatory to reflect real human diversity. Figure 24:Full Inference Prompt Structure.The model rec...

work page
[41]

brutal",

**STRIP AWAY** all adjectives, emotions, and opinions (e.g., ignore "brutal", "fun", " random feel")

work page
[42]

**FOCUS** only on the Nouns (Components) and Verbs (Actions/Rules)

work page
[43]

The dice combat is bad

**VERIFY**: Do these things exist in the Rulebook? - If user says "The dice combat is bad" -> Only check if "Dice Combat" exists. Ignore "bad". - If user says "There are no dice" -> Check if there are truly no dice. Output ONLY the JSON list. Figure 25:System Prompt for Factual Verification.The judge strictly compares mechanical claims in the review again...

work page
[44]

**Mechanics:** Rules, components, math

work page
[45]

**Dynamics:** Run-time behavior, player interaction, pacing

work page
[46]

dice rolling combat

**Aesthetics:** Emotional response, theme, sensory experience. **STRICT SCORING CRITERIA (1-5):** * **1 (Echo Chamber / Mode Collapse):** * The reviews are effectively clones. They cite the exact same rules and express the exact same sentiment. * *Example:* All 5 reviews complain about the "dice rolling combat". * **2 (Surface Rephrasing):** * The core to...

work page
[47]

The insert is garbage

**Be Extensive:** If a review mentions a specific detail (e.g., "The insert is garbage" or "The solo mode is too easy") that isn't in the list, ADD IT

work page
[48]

Bad components

**No Duplicates:** If the current list already says "Bad components", and the new review says "Cards feel cheap", you can refine the existing point or ignore if redundant. Do not list the same thing twice

work page
[49]

Focus on what matters to them

**Specific Persona Lens:** These reviews are from the **{persona}** perspective. Focus on what matters to them

work page
[50]

Good art

**Output Format:** Return ONLY the updated JSON list of strings. **Example Input:** Current: ["Good art"] New Review: "The art is great, but the rulebook is a mess." **Example Output:** ["Good art", "Rulebook is disorganized/confusing"] [USER MESSAGE] **Game ID:** {game_id} **Persona:** {persona} **Current Viewpoints List:** {existing_points_text} **New R...

work page
[51]

Read the **Checklist** of viewpoints (IDs and Text)

work page
[52]

Read the **Player Reviews**

work page
[53]

Cards are flimsy

Determine which IDs from the checklist are **semantically covered** by ANY of the reviews. * *Loose Match:* If the checklist says "Cards are flimsy" and a review says "The card quality is poor", that is a MATCH. * *Topic Match:* If checklist says "Combat is random" and review says "Too much luck in fighting", that is a MATCH

work page
[54]

Familiar

**Output:** A JSON list of the **IDs** that were found. **Output Example:** [0, 5, 12] [USER MESSAGE] **Unmatched Viewpoints Checklist:** {checklist_text} **Reviews Batch:** {reviews_text} **Task:** Which IDs from the checklist are mentioned in these reviews? Return ONLY the JSON list of IDs (e.g., [1, 3]). If none, return []. Figure 28:Instruction for Se...

work page
[55]

Authenticity Check:Which review set feels more like it was written by a real "insider" or a veteran of the community?

work page
[56]

highs" (excitement) or

Emotional Resonance:Which set better captures the specific "highs" (excitement) or "lows" (frustrations) you have personally ex- perienced with this game?

work page
[57]

Which set feels more like a genuine personal take rather than a generic summary?

Opinion Diversity:Real user opinions are of- ten biased or focus on specific points. Which set feels more like a genuine personal take rather than a generic summary?

work page
[58]

You have a limited budget

Shareability:If you were to share a review with a friend to discuss this game, which one would you choose? G.3.2 Scenario B: Unfamiliar Games Context: Imagine you are considering buying this game but have never played it. You have a limited budget

work page
[59]

Reality:Which set feels less like a marketing advertisement and more like honest feedback from a peer?

Marketing vs. Reality:Which set feels less like a marketing advertisement and more like honest feedback from a peer?

work page
[60]

Decision Confidence:After reading, which set helps you make a clearer decision (whether to Buy or Skip)?

work page
[61]

Red Flags

Risk Awareness:Which set more effectively warns you about potential "Red Flags" (e.g., downtime, player count issues, complexity)?

work page
[62]

indicates the participant found both sets equally good or bad. Table 7:Pairwise Comparison Results (Win Rate of Ours vs. GPT-5.1). Category Participant Comments On Authentic- ity

Final Choice:If you could only rely on one source to spend your money, which one would you trust? G.3.3 Open-Ended Feedback Optional: Do you have any specific comments on why you chose one set over the other? (e.g., tone, vocabulary, specific insights) G.4 Full Evaluation Results This section presents the aggregated results of the user study. Table 7 show...

work page 2024