pith. machine review for the scientific record.

arxiv: 2604.04351 · v1 · submitted 2026-04-06 · 💻 cs.HC

Recognition: no theorem link

Cognibit: From Digital Exhaustion to Real-World Connection Through Gamified Territory Control and LLM-Powered Twin Networking

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 20:24 UTC · model grok-4.3

classification 💻 cs.HC
keywords: digital twins · LLM · social discovery · gamification · territory control · compatibility matching · AI companions · deployment

The pith

A deployed platform uses LLM digital twins to simulate compatibility conversations and gamified territories to drive real-world meetings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes Cognibit as a social discovery system that lets users' digital twins hold autonomous multi-turn conversations to estimate interpersonal compatibility. These simulations combine with territory conquest mechanics that reward physical exploration and create natural settings for in-person encounters, plus AI companions that keep shared memory across devices. By moving the approach from simulation-only tests into a live deployment, building on the prior CogniPair architecture validated on the Columbia Speed Dating dataset, the work produces concrete cost and quality measurements along with scaling limits that isolated component tests cannot reveal. The goal is to reduce digital exhaustion by turning simulated matches into actual social activity.

Core claim

The central claim is that an LLM-powered platform integrating autonomous digital-twin conversations for compatibility estimation, gamified territory conquest to prompt real-world exploration, and persistent AI companions forms a complete social discovery environment. When this system is taken from prior simulation-only matching into full deployment, it supplies empirical cost-quality baselines and exposes scaling bottlenecks that component-level testing leaves hidden.

What carries the argument

LLM-powered digital twins that perform autonomous multi-turn conversations to estimate compatibility, operating together with gamified territory conquest mechanics that incentivize real-world movement and encounters.

If this is right

  • The fully deployed system supplies measurable cost-quality baselines for twin-based matching that simulations alone cannot provide.
  • Fundamental scaling bottlenecks in multi-twin conversations and territory mechanics become visible only after real-world operation.
  • Persistent shared memory across AI companions maintains continuity for users across devices and sessions.
  • The combination of simulation matching and physical incentives extends prior work into an environment that can be evaluated end-to-end.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Larger user bases could test whether the observed bottlenecks grow linearly or can be reduced by caching or hybrid human review of twin outputs.
  • Adding finer location signals might strengthen the link between territory conquest and actual meetings beyond the current mechanics.
  • Direct validation of twin predictions against post-meeting user feedback would clarify how much simulation accuracy contributes to real outcomes.
  • The cost baselines could serve as a reference point for comparing this approach against non-LLM social discovery apps on the same metrics.

Load-bearing premise

Autonomous conversations between digital twins can accurately predict real interpersonal compatibility and gamified territory mechanics will reliably produce organic in-person meetings rather than remaining virtual.

What would settle it

A controlled comparison showing whether pairs with high twin-simulation compatibility scores actually meet in person and report positive outcomes at higher rates than low-score pairs, or whether territory-driven encounters exceed baseline rates of unprompted meetings.
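One way to run the comparison proposed above is a two-proportion z-test on in-person meeting rates for high-score versus low-score pairs. This is a sketch of the statistic only; the counts in the usage note are illustrative placeholders, not data from the paper.

```javascript
// Two-proportion z-test: did high-compatibility pairs meet in person at a
// higher rate than low-compatibility pairs? met1/n1 are the high-score group,
// met2/n2 the low-score group.
function twoProportionZ(met1, n1, met2, n2) {
  const p1 = met1 / n1;
  const p2 = met2 / n2;
  const pooled = (met1 + met2) / (n1 + n2); // pooled proportion under H0
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2));
  return (p1 - p2) / se;
}
```

With made-up counts, `twoProportionZ(30, 100, 10, 100)` returns roughly 3.54, which would reject the null at conventional thresholds; the paper supplies no such counts, so the example is purely illustrative.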

Figures

Figures reproduced from arXiv: 2604.04351 by Ang Li, Bowei Tian, Guoheng Sun, Hanzhang Qin, Joshua Liu, Lang Xiong, Meng Feng, Meng Liu, Shwai He, Sihan Chen, Siyuan Peng, Wanghao Ye, Yang Wang, Yanhong Qian, Yexiao He, Yifei Dong, Yilong Dai, Yiting Wang, Yuning Zhang, Zhenle Duan, Zheyu Shen, Ziyao Wang, Ziyi Wang.

Figure 1: Deployed system surfaces: (a) 3D territory exploration with avatar and twin …
Figure 2: Three-stage filtering: 200 → 5 (97.5% reduction). §4.2 LLM Integration: Each twin conversation uses a two-stage pipeline: pragmatic intent analysis (temperature = 0.3) followed by personality-conditioned generation (temperature = 0.8). The system routes across four providers (GPT-4o, GPT-4o-mini, Claude-3-Haiku, DeepSeek) with template-based fallback; the deployed system, offline evaluation (Qwen2.5-72B-Instruct…
Figure 3: Cost-quality tradeoff analysis. (a) Match quality improves monotonically across …
Figure 4: Geolocation-mediated encounter loop: (a) territory map with color-coded cities, …
Figure 5: Deployment outcomes (N=20, 14 days): 4/20 never engaged, 8/20 initiated …
Figure 6: Cost-quality trade-off showing logarithmic relationship between computational …
Figure 7: Cost vs. matching quality across LLM providers (log-scale x-axis). GPT-4o-mini …
Figure 8: Conversation depth vs. matching quality showing diminishing returns. The first 3 …
Figure 9: Detailed integrated user journey across the three platform applications (Social …
Figure 10: Primary user journey from initial discovery to sustained connection …
Figure 11: Three-pillar system architecture showing user interactions and inter-component …
Figure 12: Sequence diagram showing autonomous twin-to-twin networking process …
Figure 13: Gamification scaffolding showing how game mechanics progressively introduce …
Figure 14: State machine diagram showing pendant companion behavioral states and …
Figure 15: Information architecture showing main platform components and navigation …
Figure 16: Emotional journey showing user's emotional state evolution through platform …
Figure 17: Firebase Realtime Database schema and cross-device synchronization dataflow.
Figure 18: GNWT cognitive processing cycle (GlobalWorkspace.js). Five specialist modules process stimuli in parallel, then compete for global workspace access via salience-weighted selection (100 ms). Winners above τ = 0.7 are broadcast to all modules (50 ms), which integrate (150 ms) and update adaptive weights. Items below threshold enter a sub-threshold buffer for proportional blending. The cycle repeats at 10 Hz. …
Figure 19: PAC predictive coding cycle (PACAgent.js). Text input is processed by VADER sentiment analysis to produce an actual affective score. The prediction error (the difference between predicted and actual outcome) triggers a three-tier graduated response. The resulting emotional state update feeds back into the generative model, which produces predictions for the next interaction. Emotional contagion (30% interlocu…
Figure 20: Parameter sensitivity analysis showing engagement score vs. parameter value …
Figure 21: Synchronization timeline for concurrent writes with optimistic updates. Device …
Figure 22: Failure cascade diagram. Root failures (left, with frequency) propagate through …
Figure 23: Performance degradation as agent count increases. (a) Frame rate drops exponen…
Figure 24: Social Hub interface showing Twitter-like feed design with real-time updates …
Figure 25: Twin Networking interface displaying GNWT multi-agent system visualization.
Figure 26: Pendant companion interfaces: (a) Always-available AI companion with persistent …
Figure 27: Agent Builder deployment pipeline. Users drag modules from the library onto …
Figure 28: Agent Builder node-based visual editor. The left sidebar contains the module …
Figure 29: Agent customization interface for the pendant companion. Users adjust personality …
Figure 30: Pendant companion deployed in the 3D game world. The agent built through the …
Figure 31: API usage dashboard showing real-time token consumption, cost breakdown by …
Figure 32: GPS-driven world map interface: (a) Territory ownership visualization with …
Figure 33: Boss fight interface showing real-time combat with AI teammates. The interface …
Figure 34: Integrated 3D game world showing seamless blend of exploration, social interaction, …
Figure 35: Territory capture lifecycle (CityTakeoverSystem.js). Players accumulate capture points at 5/s within a 50 m GPS radius, requiring 100 points for full capture. Ownership confers bonuses. Uncontested territories decay at 0.5 pts/s, with a daily exponential factor of 0.95 preventing stale ownership. Capture pauses when opponents co-occupy the radius.
Figure 36: Presence state machine (NetworkSync.js). Sync rate adapts from 200 ms (5 updates/s) during active movement to 2000 ms (0.5 updates/s) when idle. Disconnect triggers the Firebase onDisconnect handler. Reconnection is detected via .info/connected.
Figure 37: Animation LOD tier system (AnimationLODSystem.js). Four distance-based tiers progressively reduce bone-skinning overhead. The critical optimization is hiding objects beyond 100 units, which prevents WebGL from computing bone matrix transformations entirely.
Figure 38: Improvement factors from each optimization system. Object pooling yields the …
Figure 39: Circuit breaker state machine. The breaker opens after 5 consecutive failures, …
Figure 40: Cognibit system architecture showing the integration of autonomous digital twins, …
Figure 41: GNWT agent internal processing flow showing how emotion, memory, planning, …
Figure 42: Platform system flow diagram illustrating the end-to-end pipeline from user …
Figure 43: Cognitive architecture comparison: traditional chatbot linear processing (left) …
Figure 44: End-to-end twin conversation pipeline. The twin blueprint is processed through …
Figure 45: Efficiency comparison between traditional dating platforms and Cognibit across …
Figure 46: Module salience competition (GlobalWorkspace.js). Five specialist modules compute salience scores using domain-specific formulas. Values are compared against the broadcast threshold (τ = 0.7); the highest-scoring module above threshold wins workspace access and broadcasts its output to all other modules. In this example, the Emotion module wins with s = 0.82. Grounded in: EmotionSpecialist.js. learned fro…
Figure 47: Advanced algorithm subsystem overview. User interactions feed both the …
Original abstract

We present an LLM-powered social discovery platform that uses digital twins to autonomously evaluate interpersonal compatibility through behavioral simulation. The platform unifies three key pillars: (1) digital twins that engage in autonomous multi-turn conversations on behalf of users to estimate compatibility, (2) gamified territory conquest mechanics that incentivize real-world exploration and create organic settings for in-person encounters, and (3) AI companions that preserve persistent shared memory across devices. Built upon CogniPair's cognitive architecture (Ye et al., 2026), validated on the Columbia Speed Dating dataset (551 participants), our system extends prior simulation-only matching into a fully deployed social discovery environment. Through deployment, we derive empirical cost-quality baselines and identify fundamental scaling bottlenecks that remain hidden in component-level testing alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Cognibit, an LLM-powered social discovery platform that uses digital twins to autonomously evaluate interpersonal compatibility through multi-turn behavioral simulations, gamified territory conquest mechanics to incentivize real-world exploration and in-person encounters, and AI companions with persistent shared memory across devices. It builds directly on the CogniPair cognitive architecture (Ye et al., 2026), references validation on the Columbia Speed Dating dataset (551 participants), and claims to extend prior simulation-only work into a fully deployed environment that yields empirical cost-quality baselines and reveals scaling bottlenecks invisible in component-level testing.

Significance. If the deployment claims and twin-to-real compatibility mappings hold with rigorous evidence, the work could advance HCI research on bridging simulated social matching with physical interactions via gamification and persistent AI agents. The emphasis on identifying practical scaling issues from real deployment would be a notable strength for system-building papers in the field.

major comments (2)
  1. [Abstract] Abstract: The central claim that the system 'extends prior simulation-only matching into a fully deployed social discovery environment' and 'derive[s] empirical cost-quality baselines' from deployment is unsupported. The only concrete validation referenced is the Columbia Speed Dating dataset (551 participants), which records brief real-human encounters; no methods, quantitative outcomes, correlation coefficients between twin predictions and post-meeting reports, deployment logs, user-study results, or identified bottleneck metrics are supplied anywhere in the manuscript.
  2. [Abstract] Abstract: The assumption that autonomous multi-turn conversations between digital twins accurately estimate real interpersonal compatibility (and that gamified territory mechanics reliably produce organic in-person encounters) is load-bearing for both the compatibility-evaluation and territory-control pillars, yet no evidence, error analysis, or external benchmarks testing this twin-to-real mapping is provided.
minor comments (1)
  1. [Abstract] Abstract: The description of how the CogniPair architecture is extended (versus reused) should be expanded with explicit component-level details to clarify the incremental contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed and constructive review of our manuscript. We appreciate the referee's focus on the need for stronger substantiation of the deployment claims and the twin-to-real mapping assumptions. We agree that the current version requires revisions to address these points accurately and will update the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that the system 'extends prior simulation-only matching into a fully deployed social discovery environment' and 'derive[s] empirical cost-quality baselines' from deployment is unsupported. The only concrete validation referenced is the Columbia Speed Dating dataset (551 participants), which records brief real-human encounters; no methods, quantitative outcomes, correlation coefficients between twin predictions and post-meeting reports, deployment logs, user-study results, or identified bottleneck metrics are supplied anywhere in the manuscript.

    Authors: We thank the referee for this observation. The Columbia Speed Dating dataset (551 participants) was used to validate the underlying CogniPair cognitive architecture for short-term compatibility predictions, including some correlation analysis between twin simulations and real encounter reports. However, we acknowledge that the manuscript does not supply detailed deployment logs, full user-study results, or specific bottleneck metrics from the live Cognibit system. The abstract's phrasing regarding a 'fully deployed' environment and derived empirical baselines overstates what is currently evidenced. In the revision, we will update the abstract to qualify these claims, clarify the scope of the Columbia validation, and add a new section describing the system implementation, initial deployment setup, and any preliminary observations on cost-quality tradeoffs. revision: yes

  2. Referee: [Abstract] Abstract: The assumption that autonomous multi-turn conversations between digital twins accurately estimate real interpersonal compatibility (and that gamified territory mechanics reliably produce organic in-person encounters) is load-bearing for both the compatibility-evaluation and territory-control pillars, yet no evidence, error analysis, or external benchmarks testing this twin-to-real mapping is provided.

    Authors: We agree that the twin-to-real mapping is a central assumption for both pillars. The Columbia dataset provides initial support for compatibility prediction in brief encounters, but the manuscript lacks dedicated error analysis for multi-turn autonomous simulations or benchmarks demonstrating that gamified territory control leads to organic in-person meetings. This is a genuine limitation in the current version. We will revise the paper to explicitly acknowledge this assumption and its evidential basis, include any available preliminary alignment data from deployment where it exists, and discuss it as a limitation with directions for future validation studies. revision: yes

Circularity Check

1 step flagged

Self-citation to CogniPair underpins extension to deployed empirical baselines

specific steps
  1. self-citation, load-bearing [Abstract]
    "Built upon CogniPair's cognitive architecture (Ye et al., 2026), validated on the Columbia Speed Dating dataset (551 participants), our system extends prior simulation-only matching into a fully deployed social discovery environment. Through deployment, we derive empirical cost-quality baselines and identify fundamental scaling bottlenecks that remain hidden in component-level testing alone."

    The extension claim and derivation of empirical baselines from deployment are justified solely by reference to prior work by overlapping authors, without independent evidence. The cited dataset records brief real-human speed-dating encounters rather than autonomous LLM twin conversations or gamified in-person territory mechanics, so the new deployed results reduce to the self-cited architecture by construction.

full rationale

The paper's core claim of moving beyond simulation-only matching to a fully deployed system that yields new cost-quality baselines and scaling bottlenecks rests on the CogniPair cognitive architecture from overlapping authors (Ye et al., 2026). The only cited validation is the Columbia Speed Dating dataset of real-human encounters, which does not cover multi-turn LLM twin dialogues or gamified territory mechanics. No deployment logs, correlation metrics, or user-study results are supplied to independently ground the new claims, making the load-bearing architecture and extension self-referential.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The platform depends on untested assumptions about LLM simulation fidelity and gamification efficacy, with no independent evidence supplied in the abstract for either.

axioms (1)
  • domain assumption LLM-based digital twins can autonomously simulate user behavior to estimate interpersonal compatibility
    Stated as the core mechanism for the first pillar without supporting derivation or validation details.
invented entities (2)
  • Digital twins: no independent evidence
    purpose: Autonomous multi-turn conversation for compatibility estimation
    Newly instantiated per-user agents whose accuracy is assumed rather than demonstrated.
  • AI companions with persistent shared memory: no independent evidence
    purpose: Maintain conversation history across devices
    Introduced as a system component without external validation.

pith-pipeline@v0.9.0 · 5517 in / 1312 out tokens · 26666 ms · 2026-05-10T20:24:30.643306+00:00 · methodology

discussion (0)

