DrS: Learning Reusable Dense Rewards for Multi-Stage Tasks

ICLR 2024

Tongzhou Mu, Minghua Liu, Hao Su

UC San Diego

[Paper] [Code] [Video] [Slides] [Poster]

Overview Video (3m51s)


Abstract

The success of many RL techniques heavily relies on human-engineered dense rewards, which typically demand substantial domain expertise and extensive trial and error. In our work, we propose DrS (Dense reward learning from Stages), a novel approach for learning reusable dense rewards for multi-stage tasks in a data-driven manner. By leveraging the stage structure of the task, DrS learns a high-quality dense reward from sparse rewards and, if available, demonstrations. The learned rewards can be reused in unseen tasks, thus reducing the human effort of reward engineering. Extensive experiments on three physical robot manipulation task families with 1000+ task variants demonstrate that our learned rewards can be reused in unseen tasks, resulting in improved performance and sample efficiency of RL algorithms. The learned rewards even achieve comparable performance to human-engineered rewards on some tasks.

DrS Works on Diverse Task Variants 

DrS is evaluated on 1,000+ task variants from three task families in ManiSkill (each task variant is associated with a different object). 

DrS Trains RL Agents on Unseen Tasks

The learned rewards outperform the semi-sparse rewards and other baselines on all tasks, and even achieve performance comparable to human-engineered rewards on the “Pick and Place” and “Turn Faucet” tasks.

Method Overview

DrS builds on adversarial imitation learning, but uses the task's sparse reward to split trajectories into successes and failures. Instead of discriminating agent trajectories from demonstrations, the discriminator classifies trajectories as successful or failed. Expert demonstrations can be added as positive data, but they are not required.

Tasks often consist of multiple stages, and our approach leverages this structure to build a stronger dense reward. In multi-stage tasks, we train a separate discriminator for each stage. Trajectories are assigned to different buffers according to the highest stage they reach: the negative data for stage i's discriminator comes from buffers 0 through i, and the positive data from buffers i+1 through n. In this way, we obtain a dense reward for each stage, as illustrated in the sketch below.
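To make the stage-wise data assignment concrete, here is a minimal Python sketch (not the official implementation). It assumes each trajectory is labeled with the highest stage it reached, derived from the task's sparse stage indicators; names such as StageDiscriminator, assign_to_buffer, and dense_reward are illustrative, and the way the stage index and discriminator score are combined into a reward is a simplification rather than the paper's exact formula.

# Minimal sketch of DrS-style stage-wise reward learning (illustrative only).
import torch
import torch.nn as nn

class StageDiscriminator(nn.Module):
    """Binary classifier for one stage: positive = states from trajectories
    that progressed beyond this stage, negative = states that did not."""
    def __init__(self, obs_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs):
        return self.net(obs).squeeze(-1)  # logit

def assign_to_buffer(buffers, trajectory, highest_stage):
    """Store a trajectory in the buffer of the highest stage it reached
    (0 = no stage reached, n = full success)."""
    buffers[highest_stage].append(trajectory)

def discriminator_data(buffers, stage_i, n_stages):
    """Negative data for stage i: buffers 0..i; positive data: buffers i+1..n."""
    negatives = [s for k in range(0, stage_i + 1) for traj in buffers[k] for s in traj]
    positives = [s for k in range(stage_i + 1, n_stages + 1) for traj in buffers[k] for s in traj]
    return negatives, positives

def discriminator_loss(disc, pos_obs, neg_obs):
    """Standard binary cross-entropy: positives labeled 1, negatives 0."""
    bce = nn.functional.binary_cross_entropy_with_logits
    logits_pos, logits_neg = disc(pos_obs), disc(neg_obs)
    return bce(logits_pos, torch.ones_like(logits_pos)) + \
           bce(logits_neg, torch.zeros_like(logits_neg))

def dense_reward(obs, stage_index, discriminators):
    """A simplified combination (not necessarily the paper's exact formula):
    coarse progress from the stage index plus a score in (0, 1) from the
    current stage's discriminator, so reaching a later stage always pays more."""
    with torch.no_grad():
        score = torch.sigmoid(discriminators[stage_index](obs))
    return stage_index + score.item()

In this sketch, the bounded per-stage score keeps the reward for stage i strictly below that of stage i+1, so the learned dense reward never contradicts the coarse stage progress given by the sparse reward.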

Citation

If you find our work useful, please consider citing the paper as follows:


@inproceedings{mu2024drs,
    title={DrS: Learning Reusable Dense Rewards for Multi-Stage Tasks},
    author={Mu, Tongzhou and Liu, Minghua and Su, Hao},
    booktitle={The Twelfth International Conference on Learning Representations},
    year={2024}
}