Reinforcement learning (RL) approaches are commonly used for dialog policy learning. The reward function is an important part of RL algorithms and affects both training and the quality of the learned policy. In recent approaches, handcrafted reward functions have been replaced by machine-learned reward functions, with promising results. Such reward models compare agent actions with human actions; more human-like agent actions receive higher rewards. Reward models so far consider only the latest dialog turn when computing the reward for an agent action. In this paper, we hypothesize that using a sequence of turns to decide the next agent action is more beneficial. To support this claim, we mine for common patterns in human-human task-oriented dialog data. The experimental results suggest that there are clear patterns, i.e., human-human communication in task-oriented dialogs follows some common sequences of actions. Such patterns could be incorporated into reward models to train agents that better imitate human behavior.
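As an illustration of what mining common action sequences might look like, the sketch below counts contiguous n-grams of dialog acts and keeps those that occur in at least a minimum number of dialogs. This is an assumption-laden example, not the method used in the paper: the abstract does not name the mining algorithm, and the function name `frequent_action_ngrams`, the act labels, and the toy dialogs are all hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def frequent_action_ngrams(
    dialogs: Sequence[Sequence[str]], n: int = 2, min_support: int = 2
) -> Dict[Tuple[str, ...], int]:
    """Return contiguous action n-grams found in at least `min_support` dialogs.

    Hypothetical helper for illustration only; the paper's actual mining
    procedure is not described in the abstract.
    """
    support: Counter = Counter()
    for acts in dialogs:
        # Use a set so each pattern is counted at most once per dialog.
        seen = {tuple(acts[i:i + n]) for i in range(len(acts) - n + 1)}
        support.update(seen)
    return {pattern: count for pattern, count in support.items() if count >= min_support}


# Toy data (made up): each dialog is a sequence of dialog-act labels.
dialogs = [
    ["greet", "request", "inform", "book", "bye"],
    ["greet", "request", "inform", "bye"],
    ["request", "inform", "book", "bye"],
]

patterns = frequent_action_ngrams(dialogs, n=2, min_support=2)
for pattern, count in sorted(patterns.items(), key=lambda item: -item[1]):
    print(" -> ".join(pattern), count)
```

In this toy data the pair `request -> inform` appears in all three dialogs, which is the kind of recurring sequence a reward model could, in principle, take into account when scoring the next agent action.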
Nguyen, Anh Duy; Li, Minyi; and Vo, Bao Quoc, "Mining Conversation Data for Reward Estimation in Dialog Policy Learning" (2020). PACIS 2020 Proceedings. 109.