Playing 20 Question Game with Policy-Based Reinforcement Learning

Bingfeng Luo; Can Xu; Chongyang Tao; Huang Hu; Wei Wu; Xianchao Wu; Zhan Chen

arxiv: 1808.07645 · v5 · pith:S6JLOEMJnew · submitted 2018-08-23 · 💻 cs.HC · cs.AI · cs.CL

Huang Hu, Xianchao Wu, Bingfeng Luo, Chongyang Tao, Can Xu, Wei Wu, Zhan Chen

classification 💻 cs.HC cs.AI cs.CL

keywords game, question, method, object, questioner, selection, system, answerer

The 20 Questions (Q20) game is a well-known game that encourages deductive reasoning and creativity. In the game, the answerer first thinks of an object, such as a famous person or a kind of animal. The questioner then tries to guess the object by asking up to 20 questions. In a Q20 game system, the user acts as the answerer and the system acts as the questioner, which needs a good question-selection strategy to identify the correct object and win the game. However, the optimal question-selection policy is hard to derive because of the complexity and volatility of the game environment. In this paper, we propose a novel policy-based Reinforcement Learning (RL) method that enables the questioner agent to learn the optimal question-selection policy through continuous interactions with users. To facilitate training, we also propose a reward network that estimates a more informative reward. Compared to previous methods, our RL method is robust to noisy answers and does not rely on a Knowledge Base of objects. Experimental results show that our RL method clearly outperforms an entropy-based engineering system and performs competitively in a noise-free simulation environment.
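
To make the idea of learning question selection from rewards concrete, here is a minimal sketch of a policy-gradient (REINFORCE with a running-mean baseline) questioner. It is not the paper's method: the game is reduced to a single question, the reward network and multi-turn state are omitted, and the `ANSWERS` table, the noise level, the guessing rule, and all function names are hypothetical choices made for illustration. The guesser also reads the toy table directly, which the knowledge-base-free setting described in the abstract would not do.

```python
import math
import random

# Hypothetical toy world: ANSWERS[q][o] is the true yes/no answer of object o to question q.
ANSWERS = [
    [1, 1, 0, 0],  # question 0 splits the objects 2/2
    [1, 0, 0, 0],  # question 1 splits them 1/3
    [1, 1, 1, 1],  # question 2 is uninformative
]
NUM_OBJECTS = 4


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_update(logits, action, advantage, lr=0.1):
    """One REINFORCE step: logits += lr * advantage * grad log pi(action)."""
    probs = softmax(logits)
    return [
        x + lr * advantage * ((1.0 if i == action else 0.0) - p)
        for i, (x, p) in enumerate(zip(logits, probs))
    ]


def play_episode(question, rng, noise=0.1):
    """Ask one question, then guess; reward is 1.0 for a correct guess, else 0.0."""
    target = rng.randrange(NUM_OBJECTS)
    answer = ANSWERS[question][target]
    if rng.random() < noise:  # the simulated answerer sometimes answers wrongly
        answer = 1 - answer
    consistent = [o for o in range(NUM_OBJECTS) if ANSWERS[question][o] == answer]
    # Guess uniformly among consistent objects (or among all, if none are consistent).
    guess = rng.choice(consistent or list(range(NUM_OBJECTS)))
    return 1.0 if guess == target else 0.0


def train(episodes=5000, seed=0):
    """Learn a softmax policy over the three questions from sampled rewards only."""
    rng = random.Random(seed)
    questions = list(range(len(ANSWERS)))
    logits = [0.0] * len(ANSWERS)
    baseline = 0.0  # running mean reward, used to reduce gradient variance
    for _ in range(episodes):
        question = rng.choices(questions, weights=softmax(logits))[0]
        reward = play_episode(question, rng)
        logits = reinforce_update(logits, question, reward - baseline)
        baseline += 0.05 * (reward - baseline)
    return softmax(logits)
```

Calling `train()` returns the learned probabilities over the three toy questions. Under these assumptions the uninformative question earns a lower expected reward than the other two, so the policy should shift probability mass away from it even though some answers are flipped by noise.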

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ECHO: Learning Epistemically Adaptive Language Agents with Turn-Level Credit
cs.MA 2026-06 unverdicted novelty 7.0

ECHO is a clipped policy-gradient method that uses posterior-sensitive rewards to give turn-level epistemic credit in multi-turn information-seeking tasks, outperforming trajectory-level GRPO on a new Clue Selector Ga...