If the policy being trained and the policy used to interact with the environment are the same, the method is called on-policy; if they are different, it is off-policy. In other words, if the actor learns from experience it gathers itself, that is on-policy; if it learns from experience gathered by other actors, that is off-policy.
An advantage of off-policy methods is that they can be trained on experience collected by other actors, so fresh data does not have to be collected after every update, which can greatly improve training efficiency.
All the methods discussed so far were on-policy: the actor interacts with the environment to collect a large number of trajectories $\tau$ and updates the parameters according to the following equation:
$$\nabla \bar{R}_\theta = \mathbb{E}_{\tau\sim p_\theta(\tau)}\big[R(\tau)\,\nabla \log p_\theta(\tau)\big]$$
Once $\theta$ is updated, the distribution $p_\theta(\tau)$ changes, so the old trajectories can no longer be used and new ones must be collected.
If instead we let another actor $\theta'$ interact with the environment and use the data it collects to train $\theta$ (the collected data can be stored and reused many times), we can perform multiple gradient updates with the same batch of data.
Suppose $x$ follows the distribution $p$, and we want to approximate the expectation of $f(x)$ by sampling:
$$\mathbb{E}_{x\sim p}[f(x)] \approx \frac{1}{N}\sum_{i=1}^{N} f(x^i), \qquad x^i \sim p(x)$$
However, if we cannot sample from $p$ but can only sample from another distribution $q$, we cannot use the formula above directly and must apply the following correction:
$$\mathbb{E}_{x\sim p}[f(x)] = \mathbb{E}_{x\sim q}\!\left[f(x)\,\frac{p(x)}{q(x)}\right]$$
The expectation of $f(x)$ over $x\sim p$ satisfies:
$$\mathbb{E}_{x\sim p}[f(x)] = \int f(x)\,p(x)\,dx = \int f(x)\,\frac{p(x)}{q(x)}\,q(x)\,dx = \mathbb{E}_{x\sim q}\!\left[f(x)\,\frac{p(x)}{q(x)}\right]$$
which converts it into an expectation over the distribution $q$.
The only difference between $f(x)$ and $f(x)\,\frac{p(x)}{q(x)}$ is the importance weight $\frac{p(x)}{q(x)}$, which corrects for the mismatch between the two distributions.
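As a quick numerical illustration (not from the original text), the following sketch estimates $\mathbb{E}_{x\sim p}[f(x)]$ both directly and by importance sampling; the choices $f(x)=x^2$, $p=\mathcal{N}(0,1)$ and $q=\mathcal{N}(0.5,\,1.2^2)$ are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return x ** 2

def normal_pdf(x, mu, sigma):
    # Density of N(mu, sigma^2), used to form the importance weights p(x)/q(x).
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

N = 100_000

# Direct Monte Carlo estimate: sample x from p = N(0, 1).
x_p = rng.normal(0.0, 1.0, N)
direct = f(x_p).mean()

# Importance-sampling estimate: sample x from q = N(0.5, 1.2^2)
# and reweight each sample by p(x) / q(x).
x_q = rng.normal(0.5, 1.2, N)
weights = normal_pdf(x_q, 0.0, 1.0) / normal_pdf(x_q, 0.5, 1.2)
reweighted = (f(x_q) * weights).mean()

print(direct, reweighted)  # both approach E_{x~p}[x^2] = 1
```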
This makes their expectations equal, but it does not make their variances equal:
$$\operatorname{Var}_{x\sim p}[f(x)] \neq \operatorname{Var}_{x\sim q}\!\left[f(x)\,\frac{p(x)}{q(x)}\right]$$
According to $\operatorname{Var}[X] = \mathbb{E}[X^2] - \big(\mathbb{E}[X]\big)^2$:
$$\operatorname{Var}_{x\sim p}[f(x)] = \mathbb{E}_{x\sim p}\big[f(x)^2\big] - \big(\mathbb{E}_{x\sim p}[f(x)]\big)^2$$
$$\operatorname{Var}_{x\sim q}\!\left[f(x)\,\frac{p(x)}{q(x)}\right] = \mathbb{E}_{x\sim p}\!\left[f(x)^2\,\frac{p(x)}{q(x)}\right] - \big(\mathbb{E}_{x\sim p}[f(x)]\big)^2$$
Compared with $\operatorname{Var}_{x\sim p}[f(x)]$, the first term carries an extra factor $\frac{p(x)}{q(x)}$. If $p$ and $q$ are too far apart, this factor can make the difference between the two variances very large.
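The variance issue can also be seen numerically. In the sketch below (same illustrative $f$ and standard-normal $p$ as above, with $q$ a shifted Gaussian), the spread of repeated importance-sampling estimates grows as $q$ moves away from $p$, even though each estimate remains unbiased.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return x ** 2

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def is_estimate(q_mu, n=1000):
    # One importance-sampling estimate of E_{x~p}[f(x)] with p = N(0, 1)
    # and sampling distribution q = N(q_mu, 1).
    x = rng.normal(q_mu, 1.0, n)
    w = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, q_mu, 1.0)
    return (f(x) * w).mean()

for q_mu in [0.0, 1.0, 2.0]:
    estimates = np.array([is_estimate(q_mu) for _ in range(200)])
    # The spread of the estimates grows quickly as q drifts away from p.
    print(q_mu, round(estimates.mean(), 3), round(estimates.std(), 3))
```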
We let $\theta'$ interact with the environment and use the data it collects to tell $\theta$ how to update.
The trajectories $\tau$ here are sampled from $p_{\theta'}(\tau)$, which is different from the target distribution $p_\theta(\tau)$, so the gradient can be converted using the formula above:
$$\nabla \bar{R}_\theta = \mathbb{E}_{\tau\sim p_{\theta'}(\tau)}\!\left[\frac{p_\theta(\tau)}{p_{\theta'}(\tau)}\,R(\tau)\,\nabla \log p_\theta(\tau)\right]$$
This means the data sampled once from $\theta'$ can be used to update the same $\theta$ several times; once this $\theta$ has been trained enough on that batch, we switch to another $\theta'$ to collect new data and continue training.
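Here is a minimal sketch of this reuse, assuming a toy categorical policy in PyTorch with made-up states and advantages standing in for data collected by $\theta'$; the network size, learning rate, and number of inner updates are illustrative choices, not prescribed by the text.

```python
import torch

torch.manual_seed(0)

# Hypothetical tiny categorical policy; in practice states, actions and
# advantages come from letting theta' interact with the environment.
n_features, n_actions = 8, 4
policy = torch.nn.Linear(n_features, n_actions)      # theta (being updated)
old_policy = torch.nn.Linear(n_features, n_actions)  # theta' (collected the data)
old_policy.load_state_dict(policy.state_dict())
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

# One batch sampled from theta'.
states = torch.randn(64, n_features)
with torch.no_grad():
    old_dist = torch.distributions.Categorical(logits=old_policy(states))
    actions = old_dist.sample()
    old_log_prob = old_dist.log_prob(actions)
advantages = torch.randn(64)  # placeholder for A^{theta'}(s_t, a_t)

# The same batch is reused for several gradient steps on theta.
for _ in range(5):
    dist = torch.distributions.Categorical(logits=policy(states))
    ratio = torch.exp(dist.log_prob(actions) - old_log_prob)  # p_theta / p_theta'
    surrogate = (ratio * advantages).mean()  # importance-weighted objective
    optimizer.zero_grad()
    (-surrogate).backward()                  # gradient ascent on the objective
    optimizer.step()
```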
- Initial policy parameters $\theta^0$.
- In each iteration $k$:
  - Use $\theta^k$ to interact with the environment to collect $\{(s_t, a_t)\}$ and compute the advantage $A^{\theta^k}(s_t, a_t)$.
  - Find $\theta$ optimizing $J_{\text{PPO}}(\theta) = J^{\theta^k}(\theta) - \beta\,\mathrm{KL}(\theta, \theta^k)$ // update the parameters several times
  - If $\mathrm{KL}(\theta, \theta^k) > \mathrm{KL}_{\max}$, increase $\beta$.
  - If $\mathrm{KL}(\theta, \theta^k) < \mathrm{KL}_{\min}$, decrease $\beta$.
The acceptable range $[\mathrm{KL}_{\min}, \mathrm{KL}_{\max}]$ is set in advance. If $\mathrm{KL}(\theta, \theta^k)$ is too small, it means $\theta$ and $\theta^k$ are too close, i.e. the penalty term is constraining the update too strongly, so $\beta$ is decreased; if it is too large, $\beta$ is increased.
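Below is a minimal sketch of the adaptive-KL-penalty loop described above, again with a toy categorical policy and fabricated batches; the initial $\beta$, the thresholds $\mathrm{KL}_{\min}$ and $\mathrm{KL}_{\max}$, and the doubling/halving rule for $\beta$ are assumptions chosen for illustration.

```python
import torch

torch.manual_seed(0)

n_features, n_actions = 8, 4
policy = torch.nn.Linear(n_features, n_actions)      # theta
old_policy = torch.nn.Linear(n_features, n_actions)  # theta^k
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

beta, kl_min, kl_max = 1.0, 0.003, 0.03  # assumed hyperparameters

for iteration in range(3):
    # Collect a batch with theta^k (faked here with random states/advantages).
    old_policy.load_state_dict(policy.state_dict())
    states = torch.randn(64, n_features)
    with torch.no_grad():
        old_dist = torch.distributions.Categorical(logits=old_policy(states))
        actions = old_dist.sample()
        old_log_prob = old_dist.log_prob(actions)
    advantages = torch.randn(64)  # placeholder for A^{theta^k}(s_t, a_t)

    # Optimize J_PPO(theta) = J^{theta^k}(theta) - beta * KL(theta, theta^k),
    # updating the parameters several times on the same batch.
    for _ in range(5):
        dist = torch.distributions.Categorical(logits=policy(states))
        ratio = torch.exp(dist.log_prob(actions) - old_log_prob)
        kl = torch.distributions.kl_divergence(old_dist, dist).mean()
        j_ppo = (ratio * advantages).mean() - beta * kl
        optimizer.zero_grad()
        (-j_ppo).backward()
        optimizer.step()

    # Adapt beta based on how far theta moved from theta^k.
    with torch.no_grad():
        dist = torch.distributions.Categorical(logits=policy(states))
        kl = torch.distributions.kl_divergence(old_dist, dist).mean().item()
    if kl > kl_max:
        beta *= 2.0   # KL too large: strengthen the penalty
    elif kl < kl_min:
        beta *= 0.5   # KL too small: weaken the penalty
```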