SmartEngine 1.6.0
The PPO Trainer is a reinforcement learning trainer composed of two parts: an actor sub-graph and a critic sub-graph. It trains on large batches of data called trajectories rather than on small batches as they arrive. More...
#include <PPOTrainer.h>
Public Member Functions

virtual float GetPolicyLoss () = 0
    Returns the loss in the policy sub-graph.

virtual float GetValueLoss () = 0
    Returns the loss in the value sub-graph.

virtual float GetEntropyLoss () = 0
    Returns the entropy loss - a measure of how random the network is.
Public Member Functions inherited from SmartEngine::IRLTrainer

virtual int GetGenerationCount () const = 0
    Returns how many generations we have trained.

virtual float GetLoss () = 0
    This value means different things for different trainers. See each trainer's description for the value returned.

virtual void Reset () = 0
    Resets the trainer to a fresh state, initializing any internal weights to random values.

virtual void Step () = 0
    Steps training. May not actually result in any training if there is not enough data available yet.
Public Member Functions inherited from SmartEngine::IObject

virtual ObjectId GetId () const = 0
    Returns the ID of this object.

virtual void AddRef () const = 0
    Increments the internal reference count on this object. It is not common to use this method directly.

virtual void Release () const = 0
    Decrements the internal reference count on this object. It is not common to use this method directly.

virtual int GetRefCount () const = 0
    Returns the number of references to this object.

virtual void * QueryInterface (ObjectClassId id) = 0
    Queries the object for an interface and returns a pointer to that interface if found.

void operator= (IObject const &x) = delete
Public Member Functions inherited from SmartEngine::IAgentFactory

virtual ObjectPtr< IAgent > CreateAgent () = 0
    Creates an agent for a particular trainer.
Public Member Functions inherited from SmartEngine::IResource

virtual const char * GetResourceName () const = 0
    Returns the name of this resource passed to the constructor.

virtual SerializationResult GetLastLoadResult () const = 0
    Returns the result of the last call to Load(). Useful for checking the state of loaded data after creation.

virtual SerializationResult Load (const char *appendName=nullptr) = 0
    Load this object from disk.

virtual SerializationResult Save (const char *appendName=nullptr) = 0
    Save this object to disk.
Public Member Functions inherited from SmartEngine::ISerializable

virtual SerializationResult Serialize (IMemoryBuffer *buffer) = 0
    Write the contents of this object to a buffer.

virtual SerializationResult Deserialize (IMemoryBuffer *buffer) = 0
    Fill this object with contents from a buffer.
Additional Inherited Members

Private members inherited from SmartEngine::IObject

IObject ()
IObject (IObject const &) = delete
The PPO Trainer is a reinforcement learning trainer that is composed of two parts: an actor sub-graph and a critic sub-graph. One of the differences between PPO and A2C is that PPO works on large batches of data. Instead of waiting for a small batch before training, PPO waits until there is a much larger batch, called a trajectory. Once that much data has been received, we train over it in normal batch sizes a few times and then wait for the next trajectory.

The actor sub-graph is the set of nodes used to directly manipulate the environment. For instance, this is the part of the graph that will control your AI's movement given its position in the environment.

The output of the actor can produce either discrete or continuous actions. In the case of discrete actions, the output should have a number of neurons equal to the number of different actions (move left, move right, shoot, etc.). The output node should be a RandomChoice node whose input has no activation function (RandomChoice automatically adds a softmax activation). In the case of continuous actions, the output is unrestricted.

The critic sub-graph is a set of nodes that takes in the same input as the actor but produces a single neuron representing the estimated reward we expect to receive from now until the end of the episode. It is trained automatically alongside the policy using the reward data from the agent; all that is required is that you hook up the pieces in the graph. The output should have one neuron and no activation function.
PPOTrainer.Loss returns the combined policy + value + entropy loss.
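
The sketch below shows one way the members documented on this page might be driven from a game loop. It is illustrative only: the IPPOTrainer interface name and the way the trainer and its actor/critic graphs are constructed are assumptions (construction is not covered on this page); the calls used - CreateAgent(), Step(), GetGenerationCount(), GetLoss() and Save() - come from the member lists above.

    #include <cstdio>
    #include <PPOTrainer.h>

    // Illustrative only. How the trainer and its actor/critic graphs are
    // created is not covered on this page; the caller is assumed to pass in
    // a trainer that was constructed elsewhere.
    void RunTraining(SmartEngine::IPPOTrainer* trainer, int frameCount)
    {
        using namespace SmartEngine;

        // The trainer is also an agent factory (IAgentFactory). The agent is
        // what interacts with the environment and feeds experience back.
        ObjectPtr<IAgent> agent = trainer->CreateAgent();

        int lastGeneration = trainer->GetGenerationCount();

        for (int frame = 0; frame < frameCount; ++frame)
        {
            // ... drive the environment with the agent for this frame ...

            // Safe to call every frame: no training happens until a full
            // trajectory of experience has been collected.
            trainer->Step();

            int generation = trainer->GetGenerationCount();
            if (generation > lastGeneration)
            {
                lastGeneration = generation;

                // For the PPO trainer, GetLoss() is the combined
                // policy + value + entropy loss.
                std::printf("generation %d: loss %f\n",
                            generation, trainer->GetLoss());

                // Persist the current weights (Save() is inherited from
                // IResource).
                trainer->Save();
            }
        }
    }

Because Step() only trains once a full trajectory has accumulated, it is intended to be cheap to call on frames where nothing happens.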
Member Function Documentation

GetEntropyLoss()

virtual float GetEntropyLoss () = 0  [pure virtual]

    Returns the entropy loss - a measure of how random the network is.

GetPolicyLoss()

virtual float GetPolicyLoss () = 0  [pure virtual]

    Returns the loss in the policy sub-graph.

GetValueLoss()

virtual float GetValueLoss () = 0  [pure virtual]

    Returns the loss in the value sub-graph.
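
As a small, hedged example of how the three loss accessors above might be used together for monitoring (the trainer pointer is assumed to come from elsewhere, and the IPPOTrainer interface name is an assumption as before):

    #include <cstdio>
    #include <PPOTrainer.h>

    // Logs the loss breakdown. In practice this might run after each training
    // generation; a policy loss that falls while the entropy loss collapses
    // can be a sign the policy is becoming deterministic too early.
    void LogPPOLosses(SmartEngine::IPPOTrainer* trainer)
    {
        std::printf("policy %f  value %f  entropy %f  (combined %f)\n",
                    trainer->GetPolicyLoss(),
                    trainer->GetValueLoss(),
                    trainer->GetEntropyLoss(),
                    trainer->GetLoss());
    }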