CEILing

Abstract

Learning to solve complex manipulation tasks from visual observations is a dominant challenge for real-world robot learning. Deep reinforcement learning algorithms have recently demonstrated impressive results, although they still require an impractical amount of time-consuming trial-and-error iterations. In this work, we consider the promising alternative paradigm of interactive learning where a human teacher provides feedback to the policy during execution, as opposed to imitation learning where a pre-collected dataset of perfect demonstrations is used. Our proposed CEILing (Corrective and Evaluative Interactive Learning) framework combines both corrective and evaluative feedback from the teacher to train a stochastic policy in an asynchronous manner, and employs a dedicated mechanism to trade off human corrections with the robot’s own experience. We present results obtained with our framework in extensive simulation and real-world experiments that demonstrate that CEILing can effectively solve complex robot manipulation tasks directly from raw images in less than one hour of real-world training.

How Does It Work?

Figure: CEILing allows the human teacher to provide both corrective and evaluative feedback to the robot. When the teacher does not provide any corrections, it is either because the trajectory is already optimal or because the teacher was not able to intervene correctly. Evaluative feedback lets the teacher mark these two different kinds of trajectory segments. The bad segments are discarded, while the good ones are stored in the replay buffer alongside the corrected parts, in order to improve the robustness of the policy.

CEILing needs to operate in a severe low-data regime since the human teacher cannot be expected to supervise the robot training for hours or days. We therefore store all of the available data in a replay buffer. Given the random behavior of an untrained policy, the human teacher would be required to intervene most if not all of the time during the first few episodes, making this phase not much different from collecting actual demonstrations. Hence, we instead collect 10 demonstrations up front and warm-start the policy by regressing on them. After the warm-start phase, the interactive learning phase begins, in which the robot applies the latest version of the policy to generate actions at a frequency of 20 Hz.

At the beginning of each episode, we start with a positive evaluative feedback label q=+1. As long as the trajectory remains appropriate, all subsequent steps are automatically labeled with this value, which helps reinforce the good behavior of the robot. When the teacher believes that the robot trajectory can no longer be considered satisfactory and is not easy to correct, the teacher can toggle the evaluative label to q=0. From then on, all subsequent labels are set to zero until the teacher toggles them back to positive. This way we can label all state-action pairs without requiring too much effort from the teacher. If the trajectory is not on the right path but is easily adjustable, the human teacher can provide corrections to the robot using a remote controller. In this case, all corrected state-action pairs are labeled with q=α. The weight α is defined as the ratio between the number of non-corrected and corrected samples in the replay buffer. It increases the priority of the corrected samples in order to counteract the imbalance of the collected data. In our experiments, we train for 100 episodes, which roughly corresponds to 20 minutes of real-world training, in contrast to the millions of episodes needed by standard RL algorithms.
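The labeling scheme described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the released implementation: the function names `segment_kinds`, `correction_weight`, and `feedback_labels` are hypothetical, a correction is assumed to take precedence over the current evaluative label, the replay buffer is assumed to hold only the good and corrected samples, and α falls back to 1.0 when no corrected sample exists yet.

```python
from typing import List


def segment_kinds(toggles: List[bool], corrected: List[bool]) -> List[str]:
    """Classify each step of one episode as 'good', 'bad', or 'corrected'.

    toggles[t]   -- True if the teacher toggled the evaluative label at step t
    corrected[t] -- True if the teacher overrode the action at step t
    """
    good = True  # every episode starts with a positive label (q = +1)
    kinds = []
    for toggled, is_corrected in zip(toggles, corrected):
        if toggled:
            good = not good  # the label persists until toggled again
        if is_corrected:
            kinds.append("corrected")  # assumption: corrections take precedence
        elif good:
            kinds.append("good")
        else:
            kinds.append("bad")
    return kinds


def correction_weight(buffer_kinds: List[str]) -> float:
    """alpha = (# non-corrected samples) / (# corrected samples) in the buffer."""
    n_corrected = buffer_kinds.count("corrected")
    n_non_corrected = len(buffer_kinds) - n_corrected
    if n_corrected == 0:
        return 1.0  # assumption: neutral weight before any correction exists
    return n_non_corrected / n_corrected


def feedback_labels(kinds: List[str], alpha: float) -> List[float]:
    """Map step kinds to the labels q: +1 (good), 0 (bad), alpha (corrected)."""
    mapping = {"good": 1.0, "bad": 0.0, "corrected": alpha}
    return [mapping[kind] for kind in kinds]


# One short episode: a correction at step 1, label toggled off at step 2
# and back on at step 4.
toggles = [False, False, True, False, True, False]
corrected = [False, True, False, False, False, False]

kinds = segment_kinds(toggles, corrected)
replay_buffer = [k for k in kinds if k != "bad"]  # bad segments are discarded
alpha = correction_weight(replay_buffer)
labels = feedback_labels(kinds, alpha)
```

In this example the buffer ends up with three good samples and one corrected sample, so α = 3 and the single corrected step carries as much weight as the three non-corrected ones combined, which is the rebalancing effect described above.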

Publication

Eugenio Chisari, Tim Welschehold, Joschka Boedecker, Wolfram Burgard, Abhinav Valada
Correct Me if I am Wrong: Interactive Learning for Robotic Manipulation
IEEE Robotics and Automation Letters (RA-L), 2022.
