Westlake News ACADEMICS

New Algorithm Boosts Offline Reinforcement Learning in AI

28, 2022

Email: zhangchi@westlake.edu.cn
Phone: +86-(0)571-86886861
Office of Public Affairs

ICLR 2022, a top conference on artificial intelligence, has accepted "DARA: Dynamics-Aware Reward Augmentation in Offline Reinforcement Learning" – the latest work from Donglin Wang's group at the School of Engineering of Westlake University. Ph.D. student Jinxin Liu and research assistant Hongyin Zhang are the co-first authors, and Prof. Donglin Wang is the corresponding author.

Figure 1: Performance of various offline RL methods.

Offline reinforcement learning (RL), the task of learning from a previously collected dataset, holds the promise of acquiring policies without any of the costly active interaction required in the standard online RL paradigm. However, the authors note that although active trial-and-error (online exploration) is eliminated, the performance of offline RL methods relies heavily on the amount of offline data used for training. As shown in Figure 1, performance deteriorates dramatically as the amount of offline data decreases. A natural question therefore arises: can we reduce the amount of (target) offline data without significantly affecting the final performance on the target task?

Borrowing an idea from transfer learning, they assume access to another (source) offline dataset and hope to leverage it to compensate for the performance degradation caused by the reduced (target) offline dataset. In the offline setting, previous work has characterized the reward (goal) difference between source and target, relying on "conflicting" or multi-goal offline datasets; the authors instead focus on the relatively unexplored difference in transition dynamics between the source dataset and the target environment. They also argue that this dynamics shift is not arbitrary in practice. In healthcare, for example, offline data for a particular patient is often limited, while diagnostic data from other patients with the same condition (same reward/goal) is available; the individual differences between patients correspond to source datasets with different transition dynamics. Treating those individual differences carefully is thus a crucial requirement.

Given source offline data, the main challenge is coping with the difference in transition dynamics: strictly tracking the state-action pairs supported by the source offline data cannot guarantee that the same transitions (state-action-next-state) are achievable in the target environment. Previous offline RL methods do not explicitly characterize such a dynamics shift; they typically attribute the difficulty of learning from offline data to state-action distribution shift alone. Algorithms that only model the support of the state-action distribution induced by the learned policy will therefore inevitably suffer in the transfer setting, where dynamics shift occurs.

Figure 2: Penalize the agent with a dynamics-aware reward modification.

Their approach is motivated by the well-established connection between reward modification and dynamics adaptation, which indicates that, by modifying rewards, one can train a policy in one environment and make the learned policy suitable for another environment (with different dynamics). They therefore propose exploiting the joint distribution of state-action-next-state: besides characterizing the state-action distribution shift, they additionally identify the dynamics shift (i.e., the shift in the conditional distribution of next-state given state-action pair) and penalize the agent with a dynamics-aware reward modification. This modification discourages learning from those offline transitions that are likely in the source but unlikely in the target (as shown in Figure 2). Unlike concurrent work on offline domain generalization, they explicitly focus on offline domain (dynamics) adaptation.
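The reward modification described above can be sketched in a few lines. In this minimal form (the function names and the scalar interface are illustrative assumptions, not the paper's exact implementation), the augmented reward adds the log-ratio of target to source dynamics, so a transition that is much more probable under the source dynamics than under the target dynamics receives a negative correction:

```python
import math

def dynamics_gap(log_p_target: float, log_p_source: float) -> float:
    """Delta r = log p_tar(s'|s,a) - log p_src(s'|s,a).
    Negative when a transition is likely in the source but unlikely in the target."""
    return log_p_target - log_p_source

def augment_reward(reward: float, log_p_target: float, log_p_source: float,
                   eta: float = 1.0) -> float:
    """Dynamics-aware reward: shift the logged reward by eta * dynamics gap."""
    return reward + eta * dynamics_gap(log_p_target, log_p_source)

# A transition that is 10x less likely in the target than in the source
# gets its reward reduced by eta * log(10) ~= 2.30.
r = augment_reward(1.0, log_p_target=math.log(0.01), log_p_source=math.log(0.1))
```

In practice the two log-probabilities are not known and must be estimated from data, e.g. with learned domain classifiers; the sketch only shows how the estimate, once available, enters the reward.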

Table 1: The DARA Framework.

Their principal contribution in this work is the characterization of dynamics shift in offline RL and the derivation of a dynamics-aware reward augmentation (DARA) framework built on prior model-free and model-based formulations (Table 1). DARA is simple and general: it accommodates various offline RL methods and can be implemented in just a few lines of code on top of the data loader at training time.
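As a sketch of what "a few lines of code on top of the data loader" might look like (the class name, dataset layout, and `gap_fn` interface here are hypothetical, not the authors' released code), one can rewrite rewards at sampling time using a per-transition estimate of the dynamics gap:

```python
import numpy as np

class DARADataLoader:
    """Samples batches from a source offline dataset, rewriting rewards with a
    dynamics-aware correction. A hypothetical sketch, not the released code."""

    def __init__(self, dataset, gap_fn, eta=1.0):
        self.dataset = dataset  # dict of arrays: "obs", "act", "next_obs", "rew"
        self.gap_fn = gap_fn    # estimates log p_tar - log p_src per transition
        self.eta = eta

    def sample(self, batch_size, rng=None):
        rng = rng or np.random.default_rng()
        n = len(self.dataset["rew"])
        idx = rng.integers(0, n, size=batch_size)
        batch = {k: v[idx] for k, v in self.dataset.items()}
        gap = self.gap_fn(batch["obs"], batch["act"], batch["next_obs"])
        batch["rew"] = batch["rew"] + self.eta * gap  # dynamics-aware reward
        return batch

# Example with a dummy gap estimator that penalizes every transition equally.
data = {
    "obs": np.zeros((100, 4)),
    "act": np.zeros((100, 2)),
    "next_obs": np.zeros((100, 4)),
    "rew": np.ones(100),
}
loader = DARADataLoader(data, gap_fn=lambda s, a, s2: -0.5 * np.ones(len(s)))
batch = loader.sample(8)  # batch["rew"] == 0.5 everywhere
```

Because only the sampled rewards change, any off-the-shelf offline RL algorithm can consume the batches unmodified, which is what makes the framework easy to bolt onto existing methods.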

Table 2: Comparisons on D4RL datasets.

For their offline dynamics adaptation setting, they also release a dataset, including Gym-MuJoCo tasks with dynamics (mass, joint) shift relative to D4RL, and a 12-DoF quadruped robot in both simulation and the real world. With only a modest amount of target offline data, they show that DARA-based offline methods can acquire an adaptive policy for the target task and outperform the baselines in both simulated and real-world tasks (Table 2).