
Combining Model-based and Model-free approaches in achieving sample efficiency in Reinforcement Learning

Nair, Anjali (2022) Combining Model-based and Model-free approaches in achieving sample efficiency in Reinforcement Learning. Master's Thesis / Essay, Artificial Intelligence.

Master_Thesis__Anjali_Nair-final.pdf (5MB)
toestemming.pdf (100kB) - Restricted to Registered users only

Abstract

Reinforcement Learning is broadly classified into model-free (MF) and model-based (MB) approaches. While MF approaches have repeatedly proved successful in solving a variety of robotic applications, their training often requires a large number of learning samples [1]. In the absence of a simulator, sampling from the real environment can be expensive and lead to hardware wear and tear. Model-based (MB) reinforcement learning approaches, on the other hand, plan trajectories in a learned model and execute only a subset of the transitions in the real environment. However, reliance on the learned model and its inherent modelling errors cause MB approaches to struggle to match the performance of MF methods. In an attempt to get the best of both worlds, we propose a novel architecture that combines the two approaches, MB and MF. The MB component of the architecture learns an approximate model of the real environment and plans trajectories by means of a modified Model Predictive Controller (MPC). Here, planning refers to rolling out trajectories in the learned model without the agent executing them in the real environment. While a traditional MPC plans trajectories by selecting random actions at every timestep, we propose an in-loop policy, trained through an MF approach, to direct the actions. The samples collected through this planning are used to further train the policy for the agent. The resulting policy is then used as the initial policy to warm-start pure MF training. This MF training fine-tunes the policy to compensate for incorrect planning caused by errors in the learned model. While MB and MF approaches have been combined in the past, the main contribution of the proposed architecture is the combination of a traditional planner (MPC) with an MF policy, specifically in the planning stage. A fully connected feed-forward neural network is used to learn the environment model. We choose Proximal Policy Optimisation (PPO) as the model-free algorithm, owing to its relevance and popularity in robotics.
We evaluate our architecture on the Half-Cheetah MuJoCo environment, where the task is to make the half-cheetah run forward. As is standard in reinforcement learning, comparisons are made based on the rewards attained in each case. We compare planning with MPC and an MF policy against planning with MPC alone. To test whether the hybrid MB-MF architecture (MBMF) outperforms its pure MF counterpart (PPO), we compare the rewards obtained in each case. Further, we test the sensitivity of our architecture to different reward functions. We also modify our architecture to replace PPO, an on-policy algorithm, with Soft Actor-Critic (SAC), to test its applicability to an off-policy algorithm. Off-policy algorithms are MF approaches in which trajectories collected under old policies are also used to update the current policy. We find that planning with MPC and PPO together achieves higher scores than planning with MPC alone in all tested scenarios (different reward functions and MF algorithms). We also find that, when trained with the default reward, our architecture reaches the scores that PPO attains in 1e6 timesteps while using 5e5 fewer timesteps. However, our architecture proves to be sensitive to the reward function. Further, since scores alone do not reflect the quality of the defined rewards, we analyse the gaits achieved under each reward based on the torques applied at each joint and the stability of the centre of mass of the half-cheetah. We also note that our architecture works just as well with SAC as with PPO, showing its applicability to both on-policy and off-policy MF algorithms.
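The core idea of the planning stage (an MPC that rolls out candidate trajectories in the learned model, but lets the MF policy propose the actions instead of sampling them at random) can be illustrated with a minimal sketch. The code below is not the thesis implementation: the dimensions, horizon, number of candidates, and the stand-in dynamics, reward, and policy functions are placeholder assumptions chosen so the example runs on its own, assuming a random-shooting-style MPC that scores policy-proposed action sequences by their accumulated model return.

import numpy as np

# Hypothetical dimensions for illustration; the thesis uses the Half-Cheetah
# MuJoCo environment (17-dim observations, 6-dim actions).
OBS_DIM, ACT_DIM = 17, 6
HORIZON = 20          # planning horizon of the MPC (assumed value)
N_CANDIDATES = 64     # candidate trajectories per planning step (assumed value)

def learned_dynamics(state, action):
    """Stand-in for the learned model f(s, a) -> s'. In the thesis this is a
    fully connected feed-forward network trained on real transitions."""
    return state + 0.01 * np.tanh(np.concatenate([state, action]))[:OBS_DIM]

def learned_reward(state, action):
    """Stand-in for the reward used to score planned trajectories."""
    return float(state[0]) - 0.1 * float(action @ action)

def policy(state, noise_scale=0.3):
    """Stand-in for the in-loop MF policy (PPO or SAC actor). A traditional
    MPC would instead sample actions uniformly at random here."""
    mean = np.tanh(state[:ACT_DIM])
    return np.clip(mean + noise_scale * np.random.randn(ACT_DIM), -1.0, 1.0)

def plan_action(state):
    """Policy-guided MPC: roll out candidate trajectories in the learned
    model, score them by accumulated model reward, and return the first
    action of the best candidate for execution in the real environment."""
    best_return, best_first_action = -np.inf, None
    for _ in range(N_CANDIDATES):
        s, total, first_action = state.copy(), 0.0, None
        for t in range(HORIZON):
            a = policy(s)                  # the MF policy proposes the action
            if t == 0:
                first_action = a
            total += learned_reward(s, a)  # evaluated in the model only
            s = learned_dynamics(s, a)     # no real-environment interaction
        if total > best_return:
            best_return, best_first_action = total, first_action
    return best_first_action

if __name__ == "__main__":
    state = np.zeros(OBS_DIM)
    print("first action of best planned trajectory:", plan_action(state))

In the architecture described in the abstract, learned_dynamics would be the trained feed-forward network, policy the PPO (or SAC) actor, and the transitions generated during planning would be stored to further train that policy before it warm-starts pure MF training.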

Item Type: Thesis (Master's Thesis / Essay)
Supervisor name: Carloni, R.
Degree programme: Artificial Intelligence
Thesis type: Master's Thesis / Essay
Language: English
Date Deposited: 18 Nov 2022 13:27
Last Modified: 15 Nov 2023 13:15
URI: https://fse.studenttheses.ub.rug.nl/id/eprint/28968
