ALSO: Adversarial Online Strategy Optimization for Social Agents

Liping Yi; Mingze Kong; Min Zhang; Qinghua Hu; Xiang Li; Zhongxiang Dai

arxiv: 2605.15768 · v2 · pith:WLDAYEGVnew · submitted 2026-05-15 · 💻 cs.AI · cs.CY

Pith reviewed 2026-05-20 19:04 UTC · model grok-4.3

classification 💻 cs.AI cs.CY

keywords: social agents · online strategy optimization · adversarial bandits · multi-agent simulation · LLM agents · non-stationary environments · reward prediction · dialogue adaptation

The pith

Social agents adapt strategies online in non-stationary dialogues by treating persona-instruction pairs as adversarial bandit arms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ALSO as a framework for online strategy optimization in multi-agent social simulations using large language models. It tackles non-stationary environments where agents must adjust to evolving contexts and opponents without assuming environmental stability. The approach models multi-turn interactions as an adversarial bandit problem with static personas combined with dynamic strategy instructions as arms, and adds a lightweight neural surrogate that predicts rewards from interaction histories to support sample-efficient exploration. This setup allows continuous adaptation with lower overhead than offline reinforcement learning or external planners. Experiments on the Sotopia benchmark show consistent outperformance over static baselines and prior optimization methods in dynamic settings.

Core claim

ALSO is the first framework for online strategy optimization in multi-agent social simulation. It formulates multi-turn interaction as an adversarial bandit problem, where combinations of static personas and dynamic strategy instructions are treated as arms. A lightweight neural surrogate predicts rewards from interaction histories to enable sample-efficient exploration and continuous online adaptation without relying on stationarity assumptions.

What carries the argument

Adversarial bandit formulation that treats combinations of static personas and dynamic strategy instructions as selectable arms, paired with a neural surrogate model for generalizing sparse reward signals from dialogue histories.
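As a rough illustration of this formulation, the sketch below runs EXP3, a standard adversarial bandit algorithm, over arms built as (persona, strategy instruction) pairs. The text here does not say which bandit algorithm ALSO uses, so EXP3, the persona and strategy names, the `gamma` value, and the toy reward function are all assumptions made for illustration. Whether one round corresponds to a dialogue turn or to a whole episode is also left open.

```python
import math
import random
from itertools import product

# Hypothetical arm space: every (persona, strategy instruction) pair is one arm.
PERSONAS = ["reserved negotiator"]
STRATEGIES = ["face-saving", "firm anchoring", "concession trading"]
ARMS = list(product(PERSONAS, STRATEGIES))


class Exp3:
    """EXP3 for rewards in [0, 1]; it makes no stationarity assumption."""

    def __init__(self, n_arms, gamma=0.1, seed=0):
        self.n_arms = n_arms
        self.gamma = gamma              # exploration rate (assumed value)
        self.weights = [1.0] * n_arms
        self.rng = random.Random(seed)

    def probabilities(self):
        total = sum(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / self.n_arms
                for w in self.weights]

    def select(self):
        probs = self.probabilities()
        arm = self.rng.choices(range(self.n_arms), weights=probs)[0]
        return arm, probs[arm]

    def update(self, arm, prob, reward):
        # Importance-weighted estimate: unbiased for the pulled arm's reward.
        estimate = reward / prob
        self.weights[arm] *= math.exp(self.gamma * estimate / self.n_arms)
        # Rescale to avoid overflow; the probabilities are scale-invariant.
        top = max(self.weights)
        self.weights = [w / top for w in self.weights]


def toy_reward(arm, t):
    """Stand-in for a dialogue evaluator: the best arm switches at t = 200."""
    best = 0 if t < 200 else 2
    return 0.9 if arm == best else 0.2


bandit = Exp3(len(ARMS))
for t in range(400):
    arm, prob = bandit.select()
    bandit.update(arm, prob, toy_reward(arm, t))

final_probs = bandit.probabilities()
```

The `gamma / n_arms` floor in `probabilities` keeps every arm selectable, which is what lets the learner notice when a previously poor arm becomes the better one.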

If this is right

Social agents can update strategies continuously during interactions instead of relying on fixed personas or costly retraining.
The bandit framing provides a way to handle non-stationary opponent strategies and evolving contexts without external planners.
A neural reward predictor makes exploration practical even when feedback arrives only after full dialogues.
The method reduces training overhead compared to offline reinforcement learning approaches in social simulation tasks.
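To make the reward-predictor idea concrete, the sketch below trains a tiny one-hidden-layer network online to map a feature vector summarizing an interaction history to a scalar reward. The surrogate's actual architecture, input encoding, and loss are not described in the text above, so the feature layout, layer size, learning rate, and synthetic data are all hypothetical.

```python
import math
import random


class TinySurrogate:
    """One-hidden-layer MLP trained by SGD on squared error (illustrative only)."""

    def __init__(self, n_in, n_hidden=8, lr=0.05, seed=0):
        rng = random.Random(seed)
        self.lr = lr
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def _hidden(self, x):
        return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(self.w1, self.b1)]

    def predict(self, x):
        h = self._hidden(x)
        return sum(w * hi for w, hi in zip(self.w2, h)) + self.b2

    def update(self, x, reward):
        """One SGD step on 0.5 * (prediction - reward)^2; returns squared error."""
        h = self._hidden(x)
        err = sum(w * hi for w, hi in zip(self.w2, h)) + self.b2 - reward
        for j, hj in enumerate(h):
            grad_h = err * self.w2[j] * (1.0 - hj * hj)  # back through tanh
            self.w2[j] -= self.lr * err * hj
            self.b1[j] -= self.lr * grad_h
            for i, xi in enumerate(x):
                self.w1[j][i] -= self.lr * grad_h * xi
        self.b2 -= self.lr * err
        return err * err


def synthetic_example(rng):
    # Hypothetical features: [cooperativeness of history, fraction of turns used].
    x = [rng.random(), rng.random()]
    reward = 0.3 + 0.5 * x[0] - 0.2 * x[1]   # made-up ground truth
    return x, reward


rng = random.Random(1)
held_out = [synthetic_example(rng) for _ in range(50)]


def mse(model):
    return sum((model.predict(x) - r) ** 2 for x, r in held_out) / len(held_out)


model = TinySurrogate(n_in=2)
mse_before = mse(model)
for _ in range(3000):
    x, r = synthetic_example(rng)
    model.update(x, r)          # one online step per observed outcome
mse_after = mse(model)
```

In a bandit loop, a predictor like this could score arms that have not been pulled recently. How ALSO actually combines surrogate predictions with arm selection is not specified here.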

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same bandit-plus-surrogate structure could apply to other non-stationary sequential decision settings such as negotiation or turn-based games.
Replacing the lightweight neural surrogate with a larger model might improve prediction accuracy at the cost of added computation during online updates.
Combining this online adaptation loop with occasional offline fine-tuning could yield hybrid agents that retain both flexibility and long-term performance.

Load-bearing premise

Multi-turn social interactions can be modeled as an adversarial bandit problem where persona-strategy combinations function as arms to solve non-stationarity without needing a stable environment.

What would settle it

If ALSO shows no performance gain over static baselines when tested on the Sotopia benchmark under changing opponent behaviors and contexts, the advantage of the adversarial bandit formulation for robust adaptation would be disproven.

Figures

Figures reproduced from arXiv: 2605.15768 by Liping Yi, Mingze Kong, Min Zhang, Qinghua Hu, Xiang Li, Zhongxiang Dai.

**Figure 1.** Online social interaction between agents with personas and adaptive strategies, where feedback from multi-turn dialogue drives continuous strategy optimization under evolving behaviors.

**Figure 2.** Overview of ALSO for adaptive social strategy learning in LLM-based multi-agent social simulation. Static persona-driven agents exhibit rigid interactions and fail to achieve social goals (left), while ALSO leverages adversarial online strategy selection with surrogate reward modeling to dynamically adapt strategies and enable successful social outcomes (center–right).

**Figure 3.** Conflict Resolution. Comparison of dialogue trajectories at the critical deadlock phase (Turns 7–9), highlighting turn-level strategy switches and their effect on reward/relationship.

[Plot text extracted with this caption: title "Same-Arm Reward Drift Over Turns"; x-axis "Turn" (1–19); y-axis "Average Reward (P1)"; legend Arm 0 to Arm 12.]

**Figure 4.** Strategy Reward Drift Over Dialogue Turns. Each line represents a different strategy (arm), showing how the average normalized reward varies across turns within episodes.

**Figure 5.** Bilateral optimization improves social interactions. Comparison of P1-only, P2-only, and bilateral approaches on Qwen2.5-72B-Instruct (a–b) and DeepSeek-V3.2 (c–d). Left: overall scores; right: dimension-wise with percentage gains. Significance: p < 0.001 (Qwen), p < 0.01 (DeepSeek).

**Figure 6.** Cross-scenario generalization results. (a) Per-scenario goal and overall scores on unseen test scenarios, comparing online from-scratch learning, zero-shot transfer, and finetuning. (b) Average performance across all 7 unseen test scenarios. Zero-shot transfer outperforms the scratch baseline on both goal score (7.14 vs. 6.79, +5.3%) and overall score (3.60 vs. 3.17, +13.5%).

**Figure 7.** Performance across heterogeneous model pairings, measured by final P1 score only. (a) Baseline performance. (b) Performance with ALSO. Green annotations denote relative improvement over baseline. ALSO yields consistent gains across heterogeneous pairings.

**Figure 8.** OPRO meta-prompt template. The placeholder {instruction score pairs} is filled with previous strategies and their scores in ascending order.

**Figure 9.** EvoPrompt-GA template implementing genetic crossover and mutation.

**Figure 10.** (a–d) Strategy selection trajectories for four representative scenarios, showing how the bandit algorithm converges to scenario-specific optimal strategies over conversation turns. Each colored dot represents the strategy selected at that turn, with the dashed horizontal line indicating the most frequently selected strategy. Different scenarios converge to distinct strategies: Face-Saving for relationship…

**Figure 11.** Empirical learning dynamics of ALSO. Left: average per-turn reward trajectory on Sotopia-Hard. Right: surrogate-predicted versus realized rewards across turns.

Original abstract

Social simulation provides a compelling testbed for studying social intelligence, where agents interact through multi-turn dialogues under evolving contexts and strategically adapting opponents. Such environments are inherently non-stationary, requiring agents to dynamically adjust their strategies over time. However, most Large Language Model (LLM) based social agents rely on static personas, while existing approaches for enhancing social intelligence, such as offline reinforcement learning or external planners, are ill-suited to these settings, typically assuming stationarity and incurring substantial training overhead. To bridge this gap, we propose **ALSO** (**A**dversarial on**L**ine **S**trategy **O**ptimization), the first framework for online strategy optimization in multi-agent social simulation. ALSO advances social adaptation through two key contributions. (1) ALSO formulates multi-turn interaction as an adversarial bandit problem, where combinations of static personas and dynamic strategy instructions are treated as arms, providing a principled solution to non-stationarity without relying on environmental stability assumptions. (2) To predict rewards and generalize sparse feedback in multi-turn dialogues, ALSO introduces a lightweight neural surrogate to predict rewards from interaction histories, enabling sample-efficient exploration and continuous online adaptation. Experiments on the Sotopia benchmark demonstrate that ALSO consistently outperforms static baselines and existing optimization methods in dynamic environments, validating the effectiveness of adversarial online strategy optimization for building robust social agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ALSO casts multi-turn social interactions as an adversarial bandit with persona-strategy arms and a neural surrogate, but the sequential credit assignment problem looks under-addressed.

The main contribution is an online adaptation method for LLM-based social agents: ALSO treats combinations of static personas and dynamic strategy instructions as arms in an adversarial bandit, then uses a lightweight neural surrogate to predict rewards from interaction histories. It is positioned as the first such framework that avoids the stationarity assumptions of offline RL while keeping training overhead low. The Sotopia experiments report consistent gains over static baselines and other optimization methods in changing environments, which is the concrete evidence the paper offers. That part is straightforward and worth noting as a practical step forward for people running multi-agent dialogue simulations.

The formulation itself is the part that needs checking. Standard adversarial bandit analysis allows arbitrary reward sequences, but its regret guarantees are usually stated for an oblivious adversary whose rewards do not depend on the learner's earlier choices. Multi-turn dialogues have history-dependent states, opponent behavior that shifts in response to the agent, and rewards that often arrive only at the end. If the arm is chosen once per episode rather than conditioned on the current dialogue state each turn, the surrogate has to do all the temporal credit assignment. The abstract gives no sign that the surrogate is trained with return decomposition or temporal-difference updates, so the claim that this solves non-stationarity without stability assumptions rests on an assumption that may not hold. If the full paper shows explicit handling of these issues in the surrogate training, that would tighten the argument.

The work is aimed at researchers building adaptive conversational agents and multi-agent social simulations who want lighter online methods instead of heavy offline retraining. A reader working on Sotopia-style benchmarks or similar dynamic dialogue setups would find the reported comparisons useful.

The paper has a clear enough framing and enough benchmark results to go to peer review, mainly so that the surrogate details and the application of the bandit across turns can be examined directly.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces ALSO, a framework for online strategy optimization in multi-agent social simulations using LLMs. It formulates multi-turn interactions as an adversarial bandit problem in which arms consist of combinations of static personas and dynamic strategy instructions, and deploys a lightweight neural surrogate to predict rewards from interaction histories for sample-efficient exploration and adaptation. Experiments on the Sotopia benchmark are reported to show consistent outperformance relative to static baselines and prior optimization methods in dynamic environments.

Significance. If the central claims are substantiated, the work offers a practical route to online adaptation for social agents without stationarity assumptions or heavy offline training, potentially advancing robust LLM agents for non-stationary dialogue settings. The combination of adversarial bandit framing with a surrogate model is a targeted response to sparse feedback and evolving contexts.

major comments (2)

§3.1 (Bandit Formulation): The claim that treating static-persona + dynamic-strategy pairs as arms provides a principled solution to non-stationarity rests on the assumption that arm selection occurs once per episode. Standard adversarial bandit regret bounds tolerate arbitrary reward sequences, but they are usually stated for an oblivious adversary whose rewards do not depend on the learner's earlier choices; multi-turn dialogues instead involve history-dependent states, opponent strategies that evolve in response to the agent, and sparse terminal rewards. The manuscript does not demonstrate that the surrogate is trained with explicit temporal-difference learning or return decomposition to carry the credit-assignment burden.
§4.2 (Surrogate Model): The lightweight neural surrogate is described as predicting rewards from interaction histories, but no ablation or analysis is provided on whether it conditions on per-turn dialogue state or merely aggregates episode-level features. Without this, the non-stationarity claim cannot be internally verified against the sequential nature of the task.
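The concern in the first major comment can be written down in standard bandit notation. The symbols below are not taken from the manuscript; they are introduced here only to make the distinction explicit.

```latex
% Assumed notation: K arms, horizon T, reward r_t(a) \in [0,1], chosen arm A_t.
% External regret against an oblivious adversary (rewards fixed in advance):
R_T \;=\; \max_{a \in [K]} \sum_{t=1}^{T} r_t(a)
      \;-\; \mathbb{E}\Big[\sum_{t=1}^{T} r_t(A_t)\Big],
\qquad \text{EXP3: } R_T = O\big(\sqrt{T K \log K}\big).

% If the reward also depends on earlier choices, r_t(a \mid A_1,\dots,A_{t-1}),
% the comparator must be evaluated on the history it would itself have induced
% (policy regret):
R_T^{\mathrm{pol}} \;=\; \max_{a \in [K]} \sum_{t=1}^{T} r_t(a \mid a,\dots,a)
      \;-\; \mathbb{E}\Big[\sum_{t=1}^{T} r_t(A_t \mid A_1,\dots,A_{t-1})\Big].
```

The first quantity is what standard adversarial bandit guarantees control. The second is the one that matters when opponents react to the agent's past behavior, and keeping it sublinear generally requires extra assumptions, for example a bound on how far back the opponent's memory reaches.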

minor comments (2)

§3.2: Notation for the surrogate input (history encoding) is introduced without an explicit equation or diagram showing how multi-turn context is tokenized or embedded.
Table 1 (Sotopia results): The reported metrics lack error bars or statistical significance tests across runs, making it difficult to assess the consistency of the claimed outperformance.
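The uncertainty estimates requested for Table 1 could be reported along the following lines: a minimal sketch that computes a mean, a standard error, and a percentile bootstrap confidence interval over per-run scores. The scores below are made-up placeholders, not results from the paper, and the function name `summarize_runs` is hypothetical.

```python
import random
import statistics


def summarize_runs(scores, n_boot=2000, alpha=0.05, seed=0):
    """Mean, standard error, and percentile-bootstrap CI of the mean."""
    rng = random.Random(seed)
    mean = statistics.fmean(scores)
    sem = statistics.stdev(scores) / len(scores) ** 0.5
    # Resample runs with replacement and record the mean of each resample.
    boot_means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_boot)
    )
    lo = boot_means[int((alpha / 2) * n_boot)]
    hi = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return mean, sem, (lo, hi)


# Placeholder per-run scores (NOT from the paper).
baseline_runs = [6.6, 6.9, 6.7, 7.0, 6.8]
method_runs = [7.0, 7.3, 7.1, 7.2, 7.4]

for name, runs in [("baseline", baseline_runs), ("method", method_runs)]:
    mean, sem, (lo, hi) = summarize_runs(runs)
    print(f"{name}: {mean:.2f} ± {sem:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

With only a handful of runs the bootstrap interval is coarse, so it is best read alongside the raw per-run values.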

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below with clarifications on the design choices in ALSO and indicate planned revisions to improve the manuscript.

Referee: §3.1 (Bandit Formulation): The claim that treating static-persona + dynamic-strategy pairs as arms provides a principled solution to non-stationarity rests on the assumption that arm selection occurs once per episode. Standard adversarial bandit regret bounds tolerate arbitrary reward sequences, but they are usually stated for an oblivious adversary whose rewards do not depend on the learner's earlier choices; multi-turn dialogues instead involve history-dependent states, opponent strategies that evolve in response to the agent, and sparse terminal rewards. The manuscript does not demonstrate that the surrogate is trained with explicit temporal-difference learning or return decomposition to carry the credit-assignment burden.

Authors: We appreciate this observation. In ALSO, arm selection occurs once per episode: a combination of static persona and dynamic strategy instruction is chosen at the start and then guides the agent's responses throughout the multi-turn interaction. This per-episode selection enables adaptation to non-stationarity, as the bandit policy is updated between episodes using outcomes from prior interactions with evolving opponents. The neural surrogate is trained via supervised learning on full interaction histories to predict terminal rewards, implicitly addressing credit assignment by learning mappings from complete trajectories to outcomes. We do not employ explicit temporal-difference learning or return decomposition, as the surrogate's generalization from observed histories suffices for sample-efficient online updates in this setting. We will revise §3.1 to explicitly describe the per-episode arm selection mechanism and the surrogate's role in approximating episode-level returns.
Revision: partial

Referee: §4.2 (Surrogate Model): The lightweight neural surrogate is described as predicting rewards from interaction histories, but no ablation or analysis is provided on whether it conditions on per-turn dialogue state or merely aggregates episode-level features. Without this, the non-stationarity claim cannot be internally verified against the sequential nature of the task.

Authors: Thank you for this feedback. The surrogate encodes the full sequence of per-turn dialogue states from the interaction history to predict the episode reward, rather than using only aggregated episode-level features; this sequential conditioning supports capturing dynamics in non-stationary environments. We acknowledge that an explicit ablation was not included. We will add an ablation analysis in the revised §4.2, comparing per-turn conditioning against episode-level aggregation in terms of reward prediction accuracy and impact on overall agent performance.
Revision: yes
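The distinction at stake in this exchange can be illustrated with a toy example that contrasts an order-insensitive aggregate of per-turn features with an order-aware encoding. Both encoders, the two-dimensional per-turn features, and the recency weighting are hypothetical stand-ins; the text above does not specify how ALSO's surrogate encodes histories.

```python
def aggregate_encoding(turn_features):
    """Episode-level aggregation: mean of each feature over turns (order is lost)."""
    n = len(turn_features)
    dims = len(turn_features[0])
    return [sum(turn[d] for turn in turn_features) / n for d in range(dims)]


def sequential_encoding(turn_features, decay=0.5):
    """Order-aware encoding: exponentially recency-weighted running state."""
    state = [0.0] * len(turn_features[0])
    for turn in turn_features:
        state = [decay * s + (1 - decay) * x for s, x in zip(state, turn)]
    return state


# Hypothetical per-turn features: [cooperativeness, assertiveness].
escalating = [[0.75, 0.25], [0.5, 0.5], [0.25, 0.75]]
de_escalating = list(reversed(escalating))

# Same turns in opposite order: the aggregate cannot tell them apart.
same_aggregate = aggregate_encoding(escalating) == aggregate_encoding(de_escalating)
# The recency-weighted state differs, so a surrogate could separate the two.
same_sequence = sequential_encoding(escalating) == sequential_encoding(de_escalating)

print(same_aggregate, same_sequence)  # True False
```

An ablation of the kind the authors promise would compare surrogates fed by these two styles of input on reward-prediction error and on downstream agent scores.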

Circularity Check

0 steps flagged

No significant circularity: novel adversarial bandit formulation stands as independent proposal

full rationale

The paper proposes ALSO as a new framework that formulates multi-turn interaction as an adversarial bandit problem treating static-persona + dynamic-strategy pairs as arms, plus a lightweight neural surrogate for reward prediction from histories. This is presented directly as the core contribution without equations or claims that reduce the result to a fitted parameter, prior self-citation chain, or self-definitional loop. Experiments on the Sotopia benchmark are invoked for validation rather than as the source of the formulation itself. No load-bearing self-citations, ansatz smuggling, or renaming of known results appear in the provided text. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; the bandit arm construction and neural surrogate are presented as novel but without internal details on assumptions or fitted quantities.

pith-pipeline@v0.9.0 · 5784 in / 1099 out tokens · 33278 ms · 2026-05-20T19:04:22.845750+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Each entry lists the Lean file, the theorem, and a tag for the relation between the paper passage and the cited Recognition theorem.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "ALSO formulates multi-turn interaction as an adversarial bandit problem, where combinations of static personas and dynamic strategy instructions are treated as arms"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "lightweight neural surrogate to predict rewards from interaction histories"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 1 internal anchor

[1]

Siyuan Guo, Cheng Deng, Ying Wen, Hechang Chen, Yi Chang, and Jun Wang

URL https://openreview.net/forum?id=cRBg1dtj7o. Axelrod, R. and Hamilton, W. D. The evolution of cooperation. Science, 211(4489):1390–1396, 1981. Bisbee, J., Clinton, J. D., Dorff, C., Kenkel, B., and Larson, J. M. Synthetic replacements for human survey data? The perils of large language models. Political Analysis, 32(4):401–416, 2024. Chen, H., Chen, ...

work page doi:10.18653/v1/2024 1981
[2]

acl-long.607/

URL https://aclanthology.org/2025.acl-long.607/. Kong, M., Wang, Z., Shu, Y., and Dai, Z. Meta-prompt optimization for LLM-based sequential decision making. arXiv preprint arXiv:2502.00728, 2025b. Lattimore, T. and Szepesvári, C. Bandit algorithms. Cambridge University Press, 2020. Lee, M., Srivastava, M., Hardy, A., Thickstun, J., Durmus, E., Paranja...

work page arXiv 2025
[3]

doi: 10.18653/v1/2023.emnlp-main.13

Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.13. URL https://aclanthology.org/2023.emnlp-main.13/. Li, K., Wang, Y., Viégas, F., and Wattenberg, M. Dialogue action tokens: Steering language models in goal-directed dialogue with a multi-turn planner. arXiv preprint arXiv:2406.11978, 2024. Lin, X., Dai, Z., Verma, A., Ng, S....

work page doi:10.18653/v1/2023.emnlp-main.13 2023
[4]

emnlp-main.1143/

URL https://aclanthology.org/2024.emnlp-main.1143/. Mou, X., Liang, J., Lin, J., Zhang, X., Liu, X., Yang, S., Ye, R., Chen, L., Kuang, H., Huang, X.-J., et al. Agentsense: Benchmarking social intelligence of language agents through interactive scenarios. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for C...

work page doi:10.18653/v1/2024.emnlp-main 2024
[5]

Social IQa: Commonsense reasoning about social interactions

URL https://aclanthology.org/2024.emnlp-main.525/. Osgood, C. An Alternative to War or Surrender. Illini Books Edition. University of Illinois Press, 1962. ISBN 978-0-598-14243-6. Park, J. S., O’Brien, J., Cai, C. J., Morris, M. R., Liang, P., and Bernstein, M. S. Generative agents: Inter...

work page doi:10.18653/v1/d19-1454 2024
[6]

OpenAI GPT-5 System Card

URL https://aclanthology.org/2022.emnlp-main.248/. Schelling, T. C. The strategy of conflict. Harvard University Press, London, England, 2nd edition, July 1990. Singh, A., Fry, A., Perelman, A., Tart, A., Ganesh, A., El-Kishky, A., McLaughlin, A., Low, A., Ostrow, A., Ananthram, A., et al. OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267, 2025. Spit...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2024.acl-long.698 2022
[7]

Output ONLY a valid JSON object

work page
[8]

DO NOT repeat or copy the schema definition above where {dialogue history} contains the conversation history, and {format instructions} specifies the JSON schema for the seven evaluation dimensions defined in Appendix A. B.2. Example Strategy Instruction Strategy-Instruction-Enhanced Bio Template {original bio} {strategy description} where {original bio} ...

work page 2011
[9]

Generate {n} paraphrased versions of this strategy

work page
[10]

• Use different wording, sentence structures, and examples

Each paraphrase must: • Preserve the core behavioral intent and theoretical grounding. • Use different wording, sentence structures, and examples. • Be directly usable as an agent prompt

work page
[11]

Vary the linguistic style: some formal, some conversational

work page
[12]

original_id

Do NOT change the underlying negotiation tactic. Output Format: { "original_id": "<strategy_id>", "paraphrases": [ {"id": "<strategy_id>_v1", "description": "..."}, {"id": "<strategy_id>_v2", "description": "..."}, ... ] } We use GPT-5 (Singh et al., 2025) to generate the strategy space. B.5. OPRO Meta-Prompt OPRO Meta-Prompt Your task is to generate an a...

work page 2025
[13]

Crossover the following agent bios and generate a new bio: Bio 1: <bio1> Bio 2: <bio2>

work page
[14]

Hi Lily... I do enjoy apples quite a bit, but I also have a soft spot for oranges. Bananas are nice, but they’re not my top choice. How about you?

Mutate the bio generated in Step 1 and generate a final bio bracketed with <BIO> and </BIO>. Figure 9. EvoPrompt-GA template implementing genetic crossover and mutation. C. Experiment Details We provide detailed hyperparameter configurations for all baseline methods and our proposed approach....

work page 2024
