Beyond Scalar Rewards: Dense Feedback for LLM Policy Synthesis in Sequential Social Dilemmas

Víctor Gallego

arXiv: 2603.19453 · v2 · pith:RA636ZMNnew · submitted 2026-03-19 · cs.CL · cs.GT

keywords: feedback, social, metrics, dense, policy, reward, scalar, across

Abstract

We study LLM policy synthesis: using a language model to iteratively generate programmatic agent policies for multi-agent environments. Rather than training neural policies via reinforcement learning, our framework prompts an LLM to produce Python policy functions, evaluates them in self-play, and refines them using performance feedback across iterations. We investigate feedback engineering (the design of what evaluation information is shown to the LLM during refinement), comparing sparse feedback (scalar reward only) against dense feedback (reward plus social metrics: efficiency, equality, sustainability, peace). Across two canonical Sequential Social Dilemmas (Gathering and Cleanup) and two frontier LLMs (Claude Sonnet 4.6, Gemini 3.1 Pro), dense feedback consistently matches or exceeds sparse feedback on all metrics. We explain the asymmetry through feedback aliasing: when scalar reward alone maps distinct failure modes to the same value (e.g., under- vs. over-cleaning), social metrics break the alias and let the LLM diagnose which corrective direction to take. Social metrics thus function as a coordination signal rather than a distraction, yielding strategies such as Voronoi territory partitioning and waste-adaptive cleaner schedules. Code at https://github.com/vicgalle/llm-policies-social-dilemmas.
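The generate, evaluate, refine loop and the sparse/dense distinction described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the code from the linked repository: `EvalResult`, `format_feedback`, `synthesize`, the `propose` and `evaluate` callables, the best-by-reward selection rule, and all metric values are assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class EvalResult:
    """Outcome of one self-play evaluation (hypothetical container)."""
    reward: float
    social: Dict[str, float]  # e.g. efficiency, equality, sustainability, peace


def format_feedback(result: EvalResult, dense: bool) -> str:
    """Build the feedback text shown to the LLM during refinement.

    Sparse feedback is the scalar reward only; dense feedback appends
    the social metrics.
    """
    lines = [f"reward: {result.reward:.2f}"]
    if dense:
        lines += [f"{name}: {value:.2f}"
                  for name, value in sorted(result.social.items())]
    return "\n".join(lines)


def synthesize(
    propose: Callable[[Optional[str], str], str],  # stands in for the LLM call
    evaluate: Callable[[str], EvalResult],         # stands in for self-play
    iterations: int,
    dense: bool,
) -> Optional[str]:
    """Iteratively propose policy source, evaluate it, and feed results back.

    Returning the highest-reward candidate is an assumption of this sketch.
    """
    policy_src: Optional[str] = None
    feedback = ""
    best_src: Optional[str] = None
    best_reward = float("-inf")
    for _ in range(iterations):
        policy_src = propose(policy_src, feedback)
        result = evaluate(policy_src)
        feedback = format_feedback(result, dense)
        if result.reward > best_reward:
            best_src, best_reward = policy_src, result.reward
    return best_src


# Feedback aliasing with made-up numbers: two distinct failure modes
# that happen to earn the same scalar reward.
under_cleaning = EvalResult(1.0, {"efficiency": 0.2, "equality": 0.9,
                                  "sustainability": 0.1, "peace": 1.0})
over_cleaning = EvalResult(1.0, {"efficiency": 0.2, "equality": 0.4,
                                 "sustainability": 0.8, "peace": 1.0})

# Sparse feedback cannot tell the two apart; dense feedback can.
sparse_is_aliased = (format_feedback(under_cleaning, dense=False)
                     == format_feedback(over_cleaning, dense=False))
dense_breaks_alias = (format_feedback(under_cleaning, dense=True)
                      != format_feedback(over_cleaning, dense=True))
```

In this toy setup the two sparse feedback strings are identical, so a refinement step has no basis for choosing a corrective direction, while the dense strings differ in the metrics that separate the two failure modes.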

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Metal-Sci: A Scientific Compute Benchmark for Evolutionary LLM Kernel Search on Apple Silicon
cs.LG 2026-05 unverdicted novelty 6.0

Metal-Sci is a benchmark and harness for LLM evolutionary optimization of Apple Silicon Metal kernels that uses held-out sizes to detect silent regressions missed by in-distribution scores.