If the policy being trained and the policy used to interact with the environment are the same, the method is called on-policy; if they are different, it is off-policy. In other words, if the actor learns from experience it gathers itself, that is on-policy; if it learns from experience gathered by other actors, that is off-policy.
An advantage of off-policy methods is that they can be trained on experience collected by other actors, so fresh data does not have to be collected after every update, which can greatly improve training efficiency.
All the methods discussed so far were on-policy: the actor interacts with the environment to collect a large number of trajectories $\tau$ and updates the parameters according to the following equation:
$$\nabla \bar{R}_\theta = \mathbb{E}_{\tau\sim p_\theta(\tau)}\big[R(\tau)\,\nabla \log p_\theta(\tau)\big]$$
Once $\theta$ is updated, the distribution $p_\theta(\tau)$ changes, so the old trajectories can no longer be used and new ones must be collected.
If instead we let another actor $\theta'$ interact with the environment and use the data it collects to train $\theta$ (the collected data can be stored and reused many times), we can perform multiple gradient updates with the same batch of data.
Suppose $x$ follows the distribution $p$, and we want to approximate the expectation of $f(x)$ by sampling:
$$\mathbb{E}_{x\sim p}[f(x)] \approx \frac{1}{N}\sum_{i=1}^{N} f(x^i), \qquad x^i \sim p(x)$$
However, if we cannot sample from $p$ but can only sample from another distribution $q$, we cannot use the formula above directly and must apply the following correction:
$$\mathbb{E}_{x\sim p}[f(x)] = \mathbb{E}_{x\sim q}\!\left[f(x)\,\frac{p(x)}{q(x)}\right]$$
The expectation of $f(x)$ over $x\sim p$ satisfies:
$$\mathbb{E}_{x\sim p}[f(x)] = \int f(x)\,p(x)\,dx = \int f(x)\,\frac{p(x)}{q(x)}\,q(x)\,dx = \mathbb{E}_{x\sim q}\!\left[f(x)\,\frac{p(x)}{q(x)}\right]$$
which converts it into an expectation over the distribution $q$.
The only difference between $f(x)$ and $f(x)\,\frac{p(x)}{q(x)}$ is the importance weight $\frac{p(x)}{q(x)}$, which corrects for the mismatch between the two distributions.
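As a quick numerical illustration (not from the original text), the following sketch estimates $\mathbb{E}_{x\sim p}[f(x)]$ both directly and by importance sampling; the choices $f(x)=x^2$, $p=\mathcal{N}(0,1)$ and $q=\mathcal{N}(0.5,\,1.2^2)$ are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return x ** 2

def normal_pdf(x, mu, sigma):
    # Density of N(mu, sigma^2), used to form the importance weights p(x)/q(x).
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

N = 100_000

# Direct Monte Carlo estimate: sample x from p = N(0, 1).
x_p = rng.normal(0.0, 1.0, N)
direct = f(x_p).mean()

# Importance-sampling estimate: sample x from q = N(0.5, 1.2^2)
# and reweight each sample by p(x) / q(x).
x_q = rng.normal(0.5, 1.2, N)
weights = normal_pdf(x_q, 0.0, 1.0) / normal_pdf(x_q, 0.5, 1.2)
reweighted = (f(x_q) * weights).mean()

print(direct, reweighted)  # both approach E_{x~p}[x^2] = 1
```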
This makes their expectations equal, but it does not make their variances equal:
$$\operatorname{Var}_{x\sim p}[f(x)] \neq \operatorname{Var}_{x\sim q}\!\left[f(x)\,\frac{p(x)}{q(x)}\right]$$
According to $\operatorname{Var}[X] = \mathbb{E}[X^2] - \big(\mathbb{E}[X]\big)^2$:
$$\operatorname{Var}_{x\sim p}[f(x)] = \mathbb{E}_{x\sim p}\big[f(x)^2\big] - \big(\mathbb{E}_{x\sim p}[f(x)]\big)^2$$
$$\operatorname{Var}_{x\sim q}\!\left[f(x)\,\frac{p(x)}{q(x)}\right] = \mathbb{E}_{x\sim p}\!\left[f(x)^2\,\frac{p(x)}{q(x)}\right] - \big(\mathbb{E}_{x\sim p}[f(x)]\big)^2$$
Compared with $\operatorname{Var}_{x\sim p}[f(x)]$, the first term carries an extra factor $\frac{p(x)}{q(x)}$. If $p$ and $q$ are too far apart, this factor can make the difference between the two variances very large.
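The variance issue can also be seen numerically. In the sketch below (same illustrative $f$ and standard-normal $p$ as above, with $q$ a shifted Gaussian), the spread of repeated importance-sampling estimates grows as $q$ moves away from $p$, even though each estimate remains unbiased.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return x ** 2

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def is_estimate(q_mu, n=1000):
    # One importance-sampling estimate of E_{x~p}[f(x)] with p = N(0, 1)
    # and sampling distribution q = N(q_mu, 1).
    x = rng.normal(q_mu, 1.0, n)
    w = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, q_mu, 1.0)
    return (f(x) * w).mean()

for q_mu in [0.0, 1.0, 2.0]:
    estimates = np.array([is_estimate(q_mu) for _ in range(200)])
    # The spread of the estimates grows quickly as q drifts away from p.
    print(q_mu, round(estimates.mean(), 3), round(estimates.std(), 3))
```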
We let $\theta'$ interact with the environment and use the data it collects to tell $\theta$ how to update.
The trajectories $\tau$ here are sampled from $p_{\theta'}(\tau)$, which is different from the target distribution $p_\theta(\tau)$, so the gradient can be converted using the formula above:
$$\nabla \bar{R}_\theta = \mathbb{E}_{\tau\sim p_{\theta'}(\tau)}\!\left[\frac{p_\theta(\tau)}{p_{\theta'}(\tau)}\,R(\tau)\,\nabla \log p_\theta(\tau)\right]$$
This means the data sampled once from $\theta'$ can be used to update the same $\theta$ several times; once this $\theta$ has been trained enough on that batch, we switch to another $\theta'$ to collect new data and continue training.
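Here is a minimal sketch of this reuse, assuming a toy categorical policy in PyTorch with made-up states and advantages standing in for data collected by $\theta'$; the network size, learning rate, and number of inner updates are illustrative choices, not prescribed by the text.

```python
import torch

torch.manual_seed(0)

# Hypothetical tiny categorical policy; in practice states, actions and
# advantages come from letting theta' interact with the environment.
n_features, n_actions = 8, 4
policy = torch.nn.Linear(n_features, n_actions)      # theta (being updated)
old_policy = torch.nn.Linear(n_features, n_actions)  # theta' (collected the data)
old_policy.load_state_dict(policy.state_dict())
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

# One batch sampled from theta'.
states = torch.randn(64, n_features)
with torch.no_grad():
    old_dist = torch.distributions.Categorical(logits=old_policy(states))
    actions = old_dist.sample()
    old_log_prob = old_dist.log_prob(actions)
advantages = torch.randn(64)  # placeholder for A^{theta'}(s_t, a_t)

# The same batch is reused for several gradient steps on theta.
for _ in range(5):
    dist = torch.distributions.Categorical(logits=policy(states))
    ratio = torch.exp(dist.log_prob(actions) - old_log_prob)  # p_theta / p_theta'
    surrogate = (ratio * advantages).mean()  # importance-weighted objective
    optimizer.zero_grad()
    (-surrogate).backward()                  # gradient ascent on the objective
    optimizer.step()
```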
- Initial policy parameters $\theta^0$.
- In each iteration $k$:
  - Use $\theta^k$ to interact with the environment to collect $\{(s_t, a_t)\}$ and compute the advantage $A^{\theta^k}(s_t, a_t)$.
  - Find $\theta$ optimizing $J_{\text{PPO}}(\theta) = J^{\theta^k}(\theta) - \beta\,\mathrm{KL}(\theta, \theta^k)$ // update the parameters several times
  - If $\mathrm{KL}(\theta, \theta^k) > \mathrm{KL}_{\max}$, increase $\beta$.
  - If $\mathrm{KL}(\theta, \theta^k) < \mathrm{KL}_{\min}$, decrease $\beta$.
The acceptable range $[\mathrm{KL}_{\min}, \mathrm{KL}_{\max}]$ is set in advance. If $\mathrm{KL}(\theta, \theta^k)$ is too small, it means $\theta$ and $\theta^k$ are too close, i.e. the penalty term is constraining the update too strongly, so $\beta$ is decreased; if it is too large, $\beta$ is increased.
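Below is a minimal sketch of the adaptive-KL-penalty loop described above, again with a toy categorical policy and fabricated batches; the initial $\beta$, the thresholds $\mathrm{KL}_{\min}$ and $\mathrm{KL}_{\max}$, and the doubling/halving rule for $\beta$ are assumptions chosen for illustration.

```python
import torch

torch.manual_seed(0)

n_features, n_actions = 8, 4
policy = torch.nn.Linear(n_features, n_actions)      # theta
old_policy = torch.nn.Linear(n_features, n_actions)  # theta^k
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

beta, kl_min, kl_max = 1.0, 0.003, 0.03  # assumed hyperparameters

for iteration in range(3):
    # Collect a batch with theta^k (faked here with random states/advantages).
    old_policy.load_state_dict(policy.state_dict())
    states = torch.randn(64, n_features)
    with torch.no_grad():
        old_dist = torch.distributions.Categorical(logits=old_policy(states))
        actions = old_dist.sample()
        old_log_prob = old_dist.log_prob(actions)
    advantages = torch.randn(64)  # placeholder for A^{theta^k}(s_t, a_t)

    # Optimize J_PPO(theta) = J^{theta^k}(theta) - beta * KL(theta, theta^k),
    # updating the parameters several times on the same batch.
    for _ in range(5):
        dist = torch.distributions.Categorical(logits=policy(states))
        ratio = torch.exp(dist.log_prob(actions) - old_log_prob)
        kl = torch.distributions.kl_divergence(old_dist, dist).mean()
        j_ppo = (ratio * advantages).mean() - beta * kl
        optimizer.zero_grad()
        (-j_ppo).backward()
        optimizer.step()

    # Adapt beta based on how far theta moved from theta^k.
    with torch.no_grad():
        dist = torch.distributions.Categorical(logits=policy(states))
        kl = torch.distributions.kl_divergence(old_dist, dist).mean().item()
    if kl > kl_max:
        beta *= 2.0   # KL too large: strengthen the penalty
    elif kl < kl_min:
        beta *= 0.5   # KL too small: weaken the penalty
```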