
Train Reinforcement Learning Agents

Once you have created an environment and reinforcement learning agent, you can train the agent in the environment using the train function. To configure your training, use an rlTrainingOptions object. For example, create a training option set opt, and train agent agent in environment env.

opt = rlTrainingOptions(...
    MaxEpisodes=1000,...
    MaxStepsPerEpisode=1000,...
    StopTrainingCriteria="AverageReward",...
    StopTrainingValue=480);

trainResults = train(agent,env,opt);

If env is a multi-agent environment created with rlSimulinkEnv, specify the agent argument as an array. The order of the agents in the array must match the agent order used to create env. Multi-agent training is not supported for MATLAB® environments.
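For example, as a hedged sketch for a two-agent Simulink environment, the call could look like the following. The agent names agentA and agentB are placeholders, and the array order must match the agent block order used when creating env.

% Hypothetical two-agent example; agentA and agentB are placeholders and must
% be listed in the same order as the agent blocks passed to rlSimulinkEnv.
trainResults = train([agentA,agentB],env,opt);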

For more information on creating agents, see Reinforcement Learning Agents. For more information on creating environments, see Create MATLAB Reinforcement Learning Environments and Create Simulink Reinforcement Learning Environments.

Note

train updates the agent as training progresses. This is possible because each agent is a handle object. To preserve the original agent parameters for later use, save the agent to a MAT-file:

save("initialAgent.mat","agent")
If you copy the agent into a new variable, the new variable always points to the most recent agent version with updated parameters. For more information about handle objects, see Handle Object Behavior.
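For instance, the following minimal sketch illustrates this handle behavior, assuming agent, env, and opt already exist in the workspace.

% agentCopy refers to the same underlying handle object as agent, so after
% training both names reflect the updated parameters.
agentCopy = agent;
trainResults = train(agent,env,opt);
% To keep an untrained snapshot instead, save the agent to a MAT-file before
% training, as shown above.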

Training terminates automatically when the conditions you specify in the StopTrainingCriteria and StopTrainingValue options of your rlTrainingOptions object are satisfied. You can also terminate training before any termination condition is reached by clicking Stop Training in the Reinforcement Learning Episode Manager.

When training terminates, the training statistics and results are stored in the trainResults object.

Because train updates the agent at the end of each episode, and because trainResults stores the last training results along with data to correctly recreate the training scenario and update the Episode Manager, you can later resume training from the exact point at which it stopped. To do so, at the command line, type:

trainResults = train(agent,env,trainResults);
This starts the training from the last values of the agent parameters and the training results object obtained after the previous train call.

The trainResults object contains, as one of its properties, the rlTrainingOptions object opt specifying the training option set. Therefore, to restart the training with updated training options, first change the training options in trainResults using dot notation. If the maximum number of episodes was already reached in the previous training session, you must increase the maximum number of episodes.

For example, turn off the Episode Manager display, enable the Verbose option to display training progress at the command line, increase the maximum number of episodes to 2000, and then restart the training, returning a new trainResults object as output.

trainResults.TrainingOptions.MaxEpisodes = 2000;
trainResults.TrainingOptions.Plots = "none";
trainResults.TrainingOptions.Verbose = 1;

trainResultsNew = train(agent,env,trainResults);

Note

When training terminates, each agent reflects its state at the end of the final training episode. The rewards obtained by the final agents are not necessarily the highest achieved during the training process, due to continuous exploration. To save agents during training, create an rlTrainingOptions object specifying the SaveAgentCriteria and SaveAgentValue properties and pass it to train as a trainOpts argument.

Training Algorithm

In general, training performs the following steps.

  1. Initialize the agent.

  2. For each episode:

    1. Reset the environment.

    2. Get the initial observation s0 from the environment.

    3. Compute the initial action a0 = μ(s0), where μ(s) is the current policy.

    4. Set the current action to the initial action (a ← a0), and set the current observation to the initial observation (s ← s0).

    5. While the episode is not finished or terminated, perform the following steps.

      1. Apply action a to the environment and obtain the next observation s' and the reward r.

      2. Learn from the experience set (s,a,r,s').

      3. Compute the next action a' = μ(s').

      4. Update the current action with the next action (a ← a') and update the current observation with the next observation (s ← s').

      5. Terminate the episode if the termination conditions defined in the environment are met.

  3. If the training termination condition is met, terminate training. Otherwise, begin the next episode.

The specifics of how the software performs these steps depend on the configuration of the agent and environment. For instance, resetting the environment at the start of each episode can include randomizing initial state values, if you configure your environment to do so. For more information on agents and their training algorithms, see Reinforcement Learning Agents. To use parallel processing and GPUs to speed up training, see Train Agents Using Parallel Computing and GPUs.
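The following MATLAB-style pseudocode is a conceptual sketch of this loop, not the actual implementation of train; in particular, learn is a placeholder for the agent's internal update step, and maxEpisodes stands in for the MaxEpisodes training option.

% Conceptual sketch of the generic training loop (not the toolbox source code).
for episode = 1:maxEpisodes
    obs = reset(env);                               % reset environment, get s0
    action = getAction(agent,obs);                  % a0 = mu(s0)
    isDone = false;
    while ~isDone
        [nextObs,reward,isDone] = step(env,action); % apply a, observe r and s'
        learn(agent,{obs,action,reward,nextObs});   % placeholder: learn from (s,a,r,s')
        action = getAction(agent,nextObs);          % a' = mu(s')
        obs = nextObs;                              % s <- s'
    end
end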

Episode Manager

By default, calling the train function opens the Reinforcement Learning Episode Manager, which lets you visualize the training progress.

Episode manager window showing the completion of the training for a DQN agent on the predefined pendulum environment.

The Episode Manager plot shows the reward for each episode (EpisodeReward) and a running average reward value (AverageReward).

For agents with a critic, Episode Q0 is the estimate of the discounted long-term reward at the start of each episode, given the initial observation of the environment. As training progresses, if the critic is well designed and learns successfully, Episode Q0 approaches on average the true discounted long-term reward, which may be offset from the EpisodeReward value because of discounting. For a well-designed critic using an undiscounted reward (DiscountFactor is equal to 1), on average Episode Q0 approaches the true episode reward, as shown in the preceding figure.

The Episode Manager also displays various episode and training statistics. You can also use the train function to return episode and training information. To turn off the Reinforcement Learning Episode Manager, set the Plots option of rlTrainingOptions to "none".
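For example, the following minimal sketch turns off the Episode Manager and prints the training progress at the command line instead.

% Disable the Episode Manager display and report progress at the command line.
opt = rlTrainingOptions(Plots="none",Verbose=true);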

Save Candidate Agents

During training, you can save candidate agents that meet conditions you specify in the SaveAgentCriteria and SaveAgentValue options of your rlTrainingOptions object. For instance, you can save any agent whose episode reward exceeds a certain value, even if the overall condition for terminating training is not yet satisfied. For example, save agents when the episode reward is greater than 100.

opt = rlTrainingOptions(SaveAgentCriteria="EpisodeReward",SaveAgentValue=100);

train stores saved agents in a MAT-file in the folder you specify using the SaveAgentDirectory option of rlTrainingOptions. Saved agents can be useful, for instance, to test candidate agents generated during a long-running training process. For details about saving criteria and saving location, see rlTrainingOptions.
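As a hedged example, you can later load one of the saved candidates from that folder for testing. The folder and file name below are assumptions; check the folder specified by SaveAgentDirectory for the actual names.

% Hypothetical example: load a candidate agent saved during training into a
% structure for inspection. The path and file name are placeholders.
candidate = load("savedAgents/Agent100.mat");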

After training is complete, you can save the final trained agent from the MATLAB workspace using the save function. For example, save the agent agent to the file finalAgent.mat in the folder specified by opt.SaveAgentDirectory.

save(opt.SaveAgentDirectory + "/finalAgent.mat","agent")

By default, when DDPG and DQN agents are saved, the experience buffer data is not saved. If you plan to further train your saved agent, you can start training with the previous experience buffer as a starting point. In this case, set the SaveExperienceBufferWithAgent option to true. For some agents, such as those with large experience buffers and image-based observations, the memory required for saving the experience buffer is large. In these cases, you must ensure that enough memory is available for the saved agents.
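For example, the following minimal sketch, assuming agent is an existing DQN agent whose options include SaveExperienceBufferWithAgent, enables saving the buffer together with the agent.

% Keep the experience buffer when saving the agent (assumes a DQN or DDPG agent).
agent.AgentOptions.SaveExperienceBufferWithAgent = true;
save("agentWithBuffer.mat","agent")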

Validate Trained Policy

To validate your trained agent, you can simulate the agent within the training environment using the sim function. To configure the simulation, use rlSimulationOptions.
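For instance, the following sketch runs a few validation simulations; the option values are illustrative.

% Simulate the trained agent in the training environment to validate it.
simOpts = rlSimulationOptions(MaxSteps=500,NumSimulations=5);
experience = sim(env,agent,simOpts);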

When validating your agent, consider checking how your agent handles the following:

As with parallel training, if you have Parallel Computing Toolbox™ software, you can run multiple parallel simulations on multicore computers. If you have MATLAB Parallel Server™ software, you can run multiple parallel simulations on computer clusters or cloud resources. For more information on configuring your simulation to use parallel computing, see UseParallel and ParallelizationOptions in rlSimulationOptions.

Environment Visualization

If your training environment implements the plot method, you can visualize the environment behavior during training and simulation. If you call plot(env) before training or simulation, where env is your environment object, then the visualization updates during training to allow you to visualize the progress of each episode or simulation.
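For example, a minimal sketch for an environment that supports plotting is:

% Open the environment visualization before training so that it updates during
% each episode (assumes env implements the plot method).
plot(env)
trainResults = train(agent,env,opt);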

Environment visualization is not supported when training or simulating your agent using parallel computing.

For custom environments, you must implement your own plot method. For more information on creating a custom environment with a plot function, see Create Custom MATLAB Environment from Template.

