Learning Goal-Oriented Visual Dialog Agents: Imitating and Surpassing Analytic Experts

Wen-Hsiao Peng; Yen-Wei Chang

arxiv: 1907.10500 · v1 · pith:MYGKMJKInew · submitted 2019-07-24 · 💻 cs.AI · cs.LG

Pith reviewed 2026-05-24 16:47 UTC · model grok-4.3

keywords visual dialog · imitation learning · reinforcement learning · question generation · GuessWhat · goal-oriented agents · analytic experts

The pith

Two analytic experts inspired by information theory supply demonstrations that let imitation learning initialize a visual dialog questioner, after which reinforcement learning refines it to state-of-the-art goal performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how to train a questioner agent that must discover an unknown object through yes-no questions in a visual scene. Earlier model-free reinforcement learning approaches start from sparse human dialog data and therefore explore too slowly in the enormous space of possible question sequences. The authors construct two analytic experts that follow an information-theoretic strategy to produce high-coverage demonstrations; imitation learning copies these demonstrations to seed the policy, and reinforcement learning then tunes the policy directly toward the guessing goal. Experiments on the GuessWhat?! benchmark show the resulting agent outperforms prior methods.
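To make the information-theoretic strategy concrete, here is a minimal sketch of greedy question selection by expected information gain. It is not the paper's expert: the four-object candidate set, the three questions, and the answer model `p_yes` are hypothetical, and the answerer is assumed to be perfectly modeled.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def posterior(prior, likelihood):
    """Bayes update, where likelihood[c] = P(answer | question, candidate c)."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    z = sum(joint)
    return [j / z for j in joint] if z > 0 else list(prior)

def information_gain(prior, p_yes):
    """Expected entropy reduction over candidates from one yes/no question.

    p_yes[c] is an assumed answerer model: P(answer = yes | question, candidate c).
    """
    gain = entropy(prior)
    p_no = [1.0 - y for y in p_yes]
    for lik in (p_yes, p_no):
        p_answer = sum(p * l for p, l in zip(prior, lik))
        if p_answer > 0:
            gain -= p_answer * entropy(posterior(prior, lik))
    return gain

def pick_question(prior, questions):
    """Greedy choice: the question with the largest expected information gain."""
    return max(questions, key=lambda q: information_gain(prior, questions[q]))

# Hypothetical scene: four candidate objects, uniform prior over the target.
prior = [0.25, 0.25, 0.25, 0.25]
questions = {
    "Is it a person?": [1.0, 1.0, 0.0, 0.0],      # splits candidates 2 vs 2
    "Is it on the left?": [1.0, 0.0, 0.0, 0.0],   # splits candidates 1 vs 3
    "Is it in the image?": [1.0, 1.0, 1.0, 1.0],  # always yes, uninformative
}
best = pick_question(prior, questions)
```

Under these assumptions the even split is preferred, since it removes one full bit of uncertainty about the target while the always-yes question removes none. A greedy rule like this depends entirely on how accurate `p_yes` is, which is the kind of expert bias the reinforcement-learning stage is meant to correct.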

Core claim

Analytic experts that generate demonstrations according to an information-theoretic criterion allow imitation learning to place the questioner policy in a useful region of a large search space; reinforcement learning subsequently improves the policy toward the explicit goal of identifying the target object, producing state-of-the-art results on GuessWhat?!.

What carries the argument

Two analytic experts that generate high-quality question sequences for imitation learning, followed by reinforcement learning that optimizes the policy for the guessing objective.
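The two-stage idea can be illustrated on a toy problem. The sketch below is an assumption-laden stand-in, not the paper's training code: the policy is a softmax over three hypothetical questions, the expert always picks question 0, and `success_rate` is an invented table of true guessing accuracies in which question 1 is actually better than the expert's choice. Stage 2 uses the exact policy gradient, which is the quantity that sampling-based methods such as REINFORCE estimate.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def clone_expert(logits, expert_action, steps=20, lr=0.5):
    """Stage 1 (imitation): gradient ascent on log pi(expert_action)."""
    logits = list(logits)
    for _ in range(steps):
        pi = softmax(logits)
        for j in range(len(logits)):
            indicator = 1.0 if j == expert_action else 0.0
            logits[j] += lr * (indicator - pi[j])
    return logits

def refine_with_policy_gradient(logits, success_rate, steps=2000, lr=1.0):
    """Stage 2 (RL): gradient ascent on expected reward.

    For a softmax policy, d E[R] / d logit_j = pi_j * (R_j - E[R]).
    """
    logits = list(logits)
    for _ in range(steps):
        pi = softmax(logits)
        value = sum(p * r for p, r in zip(pi, success_rate))
        for j in range(len(logits)):
            logits[j] += lr * pi[j] * (success_rate[j] - value)
    return logits

# Hypothetical numbers: the expert prefers question 0, but question 1
# leads to a correct guess more often.
success_rate = [0.6, 0.9, 0.1]
expert_action = 0

cloned = clone_expert([0.0, 0.0, 0.0], expert_action)
refined = refine_with_policy_gradient(cloned, success_rate)

pi_cloned = softmax(cloned)
pi_refined = softmax(refined)
```

In this toy setting the cloned policy concentrates on the expert's question, and the refinement stage then shifts probability to the question with the higher success rate. The shift is only possible because cloning stops early enough to leave some probability on the alternatives, which mirrors the concern about expert bias raised under "Load-bearing premise".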

If this is right

The hybrid imitation-plus-reinforcement pipeline combines the sample efficiency of imitation with the goal-directed improvement of reinforcement learning.
High-quality synthetic demonstrations can substitute for scarce human data when the policy space is large.
The final agent surpasses both pure imitation baselines and prior reinforcement-learning agents on the same dataset.
The approach is applicable to any goal-oriented dialog task where an information-theoretic expert can be defined.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same expert-construction pattern could be tried in other partially observable decision tasks that suffer from large action spaces.
If the analytic experts encode domain knowledge that is easy to formalize, the method offers a route to reduce reliance on large human dialog corpora.
Extending the experts to handle multi-turn consistency or uncertainty in answers might further improve robustness.

Load-bearing premise

The demonstrations produced by the two analytic experts have enough quality and variety that imitation learning can place the policy where reinforcement learning is still able to reach the true goal optimum without being trapped by the experts' own biases.

What would settle it

Training an otherwise identical reinforcement-learning questioner from scratch, without any imitation-learning initialization from the analytic experts, and finding that it reaches the same or higher accuracy on GuessWhat?! would falsify the claim that the expert demonstrations are necessary for effective exploration.

Original abstract

This paper tackles the problem of learning a questioner in the goal-oriented visual dialog task. Several previous works adopt model-free reinforcement learning. Most pretrain the model from a finite set of human-generated data. We argue that using limited demonstrations to kick-start the questioner is insufficient due to the large policy search space. Inspired by a recently proposed information theoretic approach, we develop two analytic experts to serve as a source of high-quality demonstrations for imitation learning. We then take advantage of reinforcement learning to refine the model towards the goal-oriented objective. Experimental results on the GuessWhat?! dataset show that our method has the combined merits of imitation and reinforcement learning, achieving the state-of-the-art performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper uses analytic experts for imitation learning, then RL, to claim state-of-the-art results on GuessWhat?!, but the abstract offers no evidence that the experts actually solve the coverage problem.

The one thing to know about this paper is that it builds two analytic experts from an information-theoretic approach, uses them to generate demonstrations for imitation learning of a questioner policy in goal-oriented visual dialog, and then applies reinforcement learning to optimize for the goal. It reports state-of-the-art performance on the GuessWhat?! dataset.

What is new here is the construction of those analytic experts to address the insufficient coverage of human-generated data in the large policy search space. The authors argue that pretraining on limited human data is not enough and that analytic experts can provide higher-quality starting points. This is a legitimate step beyond standard imitation learning setups in this area. The paper does a good job of combining imitation and RL in a way that tries to get the benefits of both: good initialization from the experts and goal-directed refinement from RL. The motivation is clear from the abstract.

The main soft spot is the quality and coverage of the demonstrations from the analytic experts. The central claim depends on these experts producing demonstrations that are diverse enough, and free enough of bias, that the RL stage can still reach better performance. The abstract mentions no metrics on this: no analysis of question diversity, overlap with human data, or how well the experts cover different dialog states. Without that, it is difficult to know whether the hybrid method truly surpasses the experts or whether the results are driven by something else. The stress-test concern about insufficient quality and coverage seems on point based on the given text.

This paper is aimed at researchers working on visual dialog agents and goal-oriented dialog systems, particularly those using the GuessWhat?! benchmark. A reader in that niche might find the hybrid recipe useful if the full experiments support the claims.

It deserves a serious referee because it has a specific, testable method and an empirical result that can be scrutinized with the right details. The idea engages honestly with prior work on imitation and RL in dialog. I would recommend sending this to peer review.

Referee Report

2 major / 0 minor

Summary. The paper proposes two analytic experts, inspired by an information-theoretic approach, to generate high-quality demonstrations for imitation learning of a questioner policy in goal-oriented visual dialog. These demonstrations initialize the policy, after which reinforcement learning refines it toward the goal-oriented objective. The central claim is that this hybrid method combines the merits of imitation and RL, achieving state-of-the-art performance on the GuessWhat?! dataset.

Significance. If the performance claims hold with proper validation, the work would demonstrate a practical way to address insufficient coverage from human demonstrations in large policy spaces by leveraging analytic experts for initialization, potentially improving sample efficiency and final performance in goal-oriented dialog agents.

major comments (2)

[Abstract] The assertion that the two analytic experts generate demonstrations of sufficient quality and coverage to initialize the questioner policy (allowing subsequent RL to reach SOTA without being limited by expert biases) is load-bearing for the central claim, yet the text provides no metrics on expert diversity, overlap with human data, or state coverage.
[Abstract] The claim of achieving state-of-the-art performance via the hybrid method is presented without any details on experimental setup, baselines, statistical tests, ablation studies, or quantitative results, rendering the performance claim unverifiable from the given text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the feedback on our manuscript. We address the two major comments on the abstract below, clarifying the role of the abstract as a high-level summary while noting where the full paper provides supporting details and where revisions can strengthen the presentation.

Referee: [Abstract] The assertion that the two analytic experts generate demonstrations of sufficient quality and coverage to initialize the questioner policy (allowing subsequent RL to reach SOTA without being limited by expert biases) is load-bearing for the central claim, yet the text provides no metrics on expert diversity, overlap with human data, or state coverage.

Authors: The abstract is a concise overview and does not include quantitative metrics, which are instead reported in the experimental sections of the full manuscript (including comparisons of expert success rates, diversity measures via entropy or coverage statistics, and overlap analysis with human dialogs). We agree this could be made more explicit at a high level and will revise the abstract to include one or two key quantitative indicators of expert quality and coverage to better support the claim.

revision: yes

Referee: [Abstract] The claim of achieving state-of-the-art performance via the hybrid method is presented without any details on experimental setup, baselines, statistical tests, ablation studies, or quantitative results, rendering the performance claim unverifiable from the given text.

Authors: Abstracts by design summarize contributions at a high level without experimental details, which appear in the main body (including GuessWhat?! results, baselines such as prior RL and imitation methods, ablations on the hybrid components, and reported metrics). The SOTA claim is substantiated there with quantitative results. We will partially revise the abstract to reference the magnitude of improvement for better context while remaining within length constraints.

revision: partial

Circularity Check

0 steps flagged

No circularity; empirical pipeline evaluated on external dataset

The paper presents an empirical method that trains a questioner policy via imitation from two analytic experts (inspired by an external information-theoretic approach) followed by RL refinement, then reports performance on the public GuessWhat?! benchmark. No derivation step reduces by construction to a fitted quantity, self-citation chain, or renamed input; the central claim rests on experimental results against an independent test set rather than any definitional equivalence or load-bearing self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard assumptions of imitation learning and RL plus the domain assumption that the GuessWhat?! dataset adequately represents goal-oriented visual dialog; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption: The GuessWhat?! dataset is representative of goal-oriented visual dialog tasks.
All reported results depend on performance measured on this single benchmark.

pith-pipeline@v0.9.0 · 5641 in / 1178 out tokens · 21459 ms · 2026-05-24T16:47:22.662076+00:00 · methodology

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 1 internal anchor

[1] What is the man doing?
INTRODUCTION Research on goal-oriented visual dialogue [1, 5] has recently attracted lots of attention. Unlike the conventional VQA [8], where the robot answerer has to answer any question related to an input image raised by a human even if the question itself is ambiguous or indefinite, the goal-oriented visual dialogue extends the question-answering ...

[2] Learning Goal-Oriented Visual Dialog Agents: Imitating and Surpassing Analytic Experts (internal anchor · arXiv 1907)
RELATED WORK Goal-Oriented Visual Dialogue. GuessWhat?! [5] is a collaborative 2-player visual grounded object discovery game. The game begins with presenting an image $I$ of a rich visual scene containing $M$ objects $C = \{c_m\}_{m=1}^{M}$ to both players, the questioner and the answerer. The answerer first picks in mind an object $c^* \in C$, which is unknown to the ...

[3] METHOD To overcome the problems faced by these previous works, our proposed method obtains a better questioner by learning from analytic experts, which provide virtually unlimited demonstrations, and by taking advantage of RL to discover a even better policy than the experts’, which suffer inherently from imperfect modeling of the oracle. In this sessio...

[4] Is it in the left side of the image?
EXPERIMENT This session compares the proposed method with AQM, IGE, and few other state-of-the-art baselines on the GuessWhat?! dataset, in terms of prediction accuracy. We follow the settings in [9] to test the robustness of different methods to the oracle approximation error. We conclude with a subjective evaluation. 4.1. Settings Dataset. GuessWhat?!...

[5] We develop two analytic experts, IGE and TPE, for imitation learning on top of the probabilistic framework developed for AQM
CONCLUSION We train a questioner for the GuessWhat?! task based on imitation and reinforcement learning. We develop two analytic experts, IGE and TPE, for imitation learning on top of the probabilistic framework developed for AQM. Because both experts are greedy and have high reliance on an accurate oracle model of the answerer, we further refine our m...

[6] Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M.F. Moura, Devi Parikh, and Dhruv Batra, “Visual Dialog,” in CVPR, 2017

[7] Abhishek Das, Satwik Kottur, José M.F. Moura, Stefan Lee, and Dhruv Batra, “Learning cooperative visual dialog agents with deep reinforcement learning,” in ICCV, 2017

[8] Florian Strub, Harm de Vries, Jérémie Mary, Bilal Piot, Aaron C. Courville, and Olivier Pietquin, “End-to-end optimization of goal-driven and visually grounded dialogue systems,” in IJCAI, 2017

[9] Gregory Kahn, Tianhao Zhang, Sergey Levine, and Pieter Abbeel, “PLATO: policy learning using adaptive trajectory optimization,” in ICRA, 2017

[10] Harm de Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron C. Courville, “Guesswhat?! visual object discovery through multi-modal dialogue,” in CVPR, 2017

[11] Ronald J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine Learning, vol. 8, no. 3-4, 1992

[12] Rui Zhao and Volker Tresp, “Improving goal-oriented visual dialog agents via advanced recurrent nets with tempered policy gradient,” in IJCAI, 2018

[13] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh, “VQA: Visual Question Answering,” in ICCV, 2015

[14] Sang-Woo Lee, Yu-Jung Heo, and Byoung-Tak Zhang, “Answerer in Questioner’s Mind: Information Theoretic Approach to Goal-Oriented Visual Dialog,” in NIPS, 2018

[15] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell, “A reduction of imitation learning and structured prediction to no-regret online learning,” in AISTATS, 2011

[16] Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne, “Deepmimic: Example-guided deep reinforcement learning of physics-based character skills,” ACM Transactions on Graphics (Proc. SIGGRAPH 2018), vol. 37, no. 4, 2018

[17] Xiaoxiao Guo, Satinder Singh, Honglak Lee, Richard L Lewis, and Xiaoshi Wang, “Deep learning for real-time atari game play using offline monte-carlo tree search planning,” in NIPS, 2014
