RL: Overview and basic policy gradient
Difference from normal supervised learning
Reinforcement learning does not rely on labeled standard answers: it is used for problems where there is no standard answer (or where even humans do not know the best way to solve them).
The basic problems and parameters
The purpose of every machine learning problem is to find a function, and reinforcement learning is no exception.
The function to be found is the Actor (policy): it takes an observation of the environment as input and outputs an action, and the environment returns a reward.
Take learning to play chess as an example: since individual moves receive no reward during the game no matter what is played, it is stipulated that winning a game gives a reward of +1 and losing gives -1.
Three components of reinforcement learning
Function with unknown parameters
Policy Network (Actor): the input comes from the environment (the machine's observation, represented as a vector or matrix) and the output is an action. This is essentially a classification task: each possible action corresponds to one neuron in the output layer.
To keep the behavior random (think of rock-paper-scissors), the network outputs a probability for each action, and the behavior is sampled according to these probabilities rather than directly choosing the action with the highest score.
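As a minimal sketch (the PolicyNetwork class, the layer sizes, and the observation dimension below are illustrative assumptions, not part of the original notes), such a policy network can be written in PyTorch and the action sampled from its output distribution:

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Maps an observation vector to a probability distribution over actions."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Softmax turns the raw scores into action probabilities,
        # just like the output layer of a classifier.
        return torch.softmax(self.net(obs), dim=-1)

# Sample an action instead of always picking the most probable one,
# so the actor keeps some randomness (think rock-paper-scissors).
policy = PolicyNetwork(obs_dim=8, n_actions=3)   # sizes are illustrative
probs = policy(torch.randn(8))
action = torch.distributions.Categorical(probs=probs).sample()
```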
Define loss
The entire process, from the start of a game until the termination condition that ends it is triggered, is called an episode.
At the end of an episode a total reward (the return) can be calculated: $R = \sum_{t=1}^{T} r_t$.
Our target is to maximize $R$.
Optimization
Find a set of network parameters that make the total reward as large as possible.
A complete sequence of states and actions, $\tau = \{s_1, a_1, s_2, a_2, \dots\}$, is called a trajectory.
Calculating the reward: each reward depends on the current state and the action taken, $r_t = r(s_t, a_t)$, so the total reward of a trajectory is written $R(\tau) = \sum_{t=1}^{T} r_t$.
Difficulty
Since the action is generated by sampling from a probability distribution, the network may take different actions when faced with the same environment.
Also, the mechanics of the environment and the reward are not known to us, and we can only treat them as black boxes. To make matters worse, the environment and the reward are themselves random in some cases (for example, in chess the opponent's response is uncertain even in the same position).
Policy Gradient
When controlling the behavior of an actor, you can think of it as a classifier that sees a situation and takes a specific action.
Both cases can be expressed with a cross-entropy term between the actor's output and the target action: to make the actor take $\hat{a}_1$ when it sees $s_1$, minimize $e_1 = CE(a, \hat{a}_1)$; to make it avoid $\hat{a}_2$ when it sees $s_2$, use $e_2 = CE(a, \hat{a}_2)$.
Since $e_2$ is about what not to do, it enters the loss with a negative sign: $L = e_1 - e_2$.
This allows you to write the total loss function by assigning a score $A_n$ that says how good or bad each behavior is: $L = \sum_n A_n e_n$, and we look for the parameters that minimize $L$.
| state-action pair | $A_n$ |
|---|---|
| $\{s_1, \hat{a}_1\}$ | +1.5 |
| $\{s_2, \hat{a}_2\}$ | -0.5 |
| $\{s_3, \hat{a}_3\}$ | +0.5 |
| $\{s_N, \hat{a}_N\}$ | -3 |
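A possible way to compute this loss in PyTorch, assuming the scores $A_n$ are already known (the function name `actor_loss` and the example numbers are illustrative): cross-entropy supplies each $e_n$, and the scores weight them, with a negative $A_n$ pushing the actor away from that action.

```python
import torch
import torch.nn.functional as F

def actor_loss(logits: torch.Tensor, actions: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
    """L = sum_n A_n * e_n, where e_n is the cross-entropy between the
    actor's output for observation n and the recorded action a_n."""
    # reduction='none' keeps one cross-entropy term e_n per (s_n, a_n) pair.
    e = F.cross_entropy(logits, actions, reduction="none")
    return (scores * e).sum()

# Illustrative values: 4 recorded steps, 3 possible actions.
logits = torch.randn(4, 3, requires_grad=True)    # actor outputs for the recorded states
actions = torch.tensor([0, 2, 1, 0])              # the recorded actions
scores = torch.tensor([+1.5, -0.5, +0.5, -3.0])   # the A_n from the table above
loss = actor_loss(logits, actions, scores)
loss.backward()
```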
(Important) How to define the pairs $\{s_n, \hat{a}_n\}$ and the scores $A_n$?
A short-sighted version:
Define $A_n = r_n$: the score of each action is simply the immediate reward obtained right after taking it.
This method is too eager for quick success: a locally optimal choice can be very unfavorable to the overall outcome, and sometimes giving up a small immediate benefit is what ultimately yields the largest total reward.
Version 1
This method adds up all the rewards obtained from the moment an action is selected until the end of the episode (i.e., the eventual outcome of the action), and uses this cumulated reward as the score: $A_n = G_n = \sum_{t=n}^{N} r_t$.
This approach is still not good enough. If the whole process is long, we cannot simply attribute a reward obtained near the end to an action taken at the very beginning.
Version 2
Introducing an attenuation (discount) factor $\gamma < 1$ expresses that the farther a reward is from the current action, the less credit the action receives for it: $A_n = G'_n = \sum_{t=n}^{N} \gamma^{t-n} r_t$. For example, with $\gamma = 0.9$ a reward obtained three steps later is weighted by $0.9^3 \approx 0.73$.
Version 3
Good and bad are relative: if the rewards are all non-negative, then every action receives a positive score, and the actor can only learn that some actions are less good than others, never that an action was actually bad.
Minus a baseline: subtract a value $b$ from all the $G'_n$, so that below-average actions get a negative score: $A_n = G'_n - b$.
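Putting versions 1-3 together, here is a hedged sketch (the helper name `compute_scores`, the value of $\gamma$, and the mean-return baseline are assumptions for illustration) that turns an episode's rewards into scores $A_n$ using discounted reward-to-go minus a baseline:

```python
import torch

def compute_scores(rewards: list, gamma: float = 0.99) -> torch.Tensor:
    """Version 2: discounted sum of the rewards that follow each step.
    Version 3: subtract a baseline so that below-average steps get A_n < 0."""
    returns, g = [], 0.0
    # Walk backwards so that G'_n = r_n + gamma * G'_{n+1}.
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))
    # A simple choice of baseline: the mean return of the episode.
    return returns - returns.mean()

scores = compute_scores([0.0, 0.0, 1.0, -1.0, 2.0])   # one A_n per step
```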
| Pseudocode of policy gradient |
|---|
| initialize actor network parameters $\theta^0$ |
| for training iteration $i = 1$ to $T$: |
| &nbsp;&nbsp;&nbsp;&nbsp;using actor $\theta^{i-1}$ to interact with the environment, obtain data $\{s_1, a_1\}, \dots, \{s_N, a_N\}$ |
| &nbsp;&nbsp;&nbsp;&nbsp;compute $A_1, \dots, A_N$ |
| &nbsp;&nbsp;&nbsp;&nbsp;compute loss $L$ and update $\theta^i \leftarrow \theta^{i-1} - \eta \nabla L$ |
Unlike an ordinary supervised model, the data-collection process sits inside the for loop: the data is collected dynamically as training proceeds.
In this process, each batch of data can update the parameters only once; for the next update, new data must be collected. That is why RL training is slow.
A batch of experience is only suitable for updating the parameters that collected it; using it to update other parameters may even hurt them. As the saying goes: one man's meat is another man's poison.
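The following is a rough sketch of such a training loop in PyTorch; the CartPole-v1 environment, the Gymnasium API, the network shape, and the hyperparameters are all assumptions for illustration, not part of the original notes:

```python
import torch
import torch.nn as nn
import gymnasium as gym

env = gym.make("CartPole-v1")                      # illustrative environment
policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for iteration in range(1000):
    # Data collection happens INSIDE the training loop: each episode is
    # played with the current parameters and used for exactly one update.
    obs, _ = env.reset()
    log_probs, rewards, done = [], [], False
    while not done:
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        obs, reward, terminated, truncated, _ = env.step(action.item())
        done = terminated or truncated
        log_probs.append(dist.log_prob(action))
        rewards.append(float(reward))

    # Discounted reward-to-go minus a mean baseline (versions 2 + 3).
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + 0.99 * g
        returns.append(g)
    scores = torch.tensor(list(reversed(returns)))
    scores = scores - scores.mean()

    # One gradient step, then the data is thrown away and collected afresh.
    loss = -(scores * torch.stack(log_probs)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```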
Exploration
The actor interacting with the environment should be made as random as possible, because this exposes the effects of as many different behaviors as possible. This can be done by adding an entropy term to the objective or by adding noise directly to the parameters, making it easier for the actor to take low-probability actions.
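One common way to encourage exploration is to add an entropy bonus to the objective; the sketch below is an assumption about how that might look in PyTorch (the function name and the coefficient `beta` are illustrative):

```python
import torch

def loss_with_entropy_bonus(log_probs, scores, action_probs, beta: float = 0.01):
    """Policy-gradient loss minus an entropy bonus.
    Higher entropy means a more random policy, so subtracting beta * entropy
    nudges the actor to keep trying low-probability actions."""
    pg_loss = -(scores * log_probs).sum()
    entropy = torch.distributions.Categorical(probs=action_probs).entropy().sum()
    return pg_loss - beta * entropy

# Illustrative values for two recorded steps with three possible actions.
action_probs = torch.tensor([[0.7, 0.2, 0.1], [0.4, 0.3, 0.3]])
log_probs = torch.log(torch.tensor([0.7, 0.3]))    # log-probs of the chosen actions
scores = torch.tensor([1.0, -0.5])
loss = loss_with_entropy_bonus(log_probs, scores, action_probs)
```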
Math: how to calculate the gradient?
We know that the expected total reward is
$$\bar{R}_\theta = \sum_\tau R(\tau)\, p_\theta(\tau) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[R(\tau)\big],$$
and that the probability of a trajectory is
$$p_\theta(\tau) = p(s_1) \prod_{t=1}^{T} p_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t).$$
Calculating the gradient of $\bar{R}_\theta$ with respect to the parameters $\theta$ is what gradient ascent needs, so how can we find $\nabla \bar{R}_\theta$?
Since the gradient of $R(\tau)$ is not required ($R$ can even be a non-differentiable black box) and only $p_\theta(\tau)$ depends on $\theta$, therefore
$$\nabla \bar{R}_\theta = \sum_\tau R(\tau)\, \nabla p_\theta(\tau).$$
Using the formula $\nabla f(x) = f(x)\, \nabla \log f(x)$:
$$\nabla \bar{R}_\theta = \sum_\tau R(\tau)\, p_\theta(\tau)\, \nabla \log p_\theta(\tau) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[R(\tau)\, \nabla \log p_\theta(\tau)\big].$$
This expectation cannot be calculated directly, so it is necessary to sample $N$ trajectories $\tau^1, \dots, \tau^N$ with the current actor and approximate it by the average; since the environment terms in $p_\theta(\tau)$ do not depend on $\theta$, $\nabla \log p_\theta(\tau) = \sum_t \nabla \log p_\theta(a_t \mid s_t)$, which gives
$$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^n)\, \nabla \log p_\theta(\tau^n) = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n)\, \nabla \log p_\theta(a_t^n \mid s_t^n).$$
Using PyTorch or TensorFlow, we can compute this conveniently:
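For example (a minimal PyTorch sketch with illustrative tensor values): we only need to define the scalar $-\frac{1}{N}\sum_{n,t} R(\tau^n)\log p_\theta(a_t^n \mid s_t^n)$ and let autograd compute the $\nabla \log p_\theta$ terms.

```python
import torch

# logits:  actor outputs for the sampled states, shape (N, n_actions)
# actions: the actions actually taken, shape (N,)
# returns: R(tau^n) of the episode each step belongs to, shape (N,)
logits = torch.randn(5, 3, requires_grad=True)    # illustrative values
actions = torch.randint(0, 3, (5,))
returns = torch.randn(5)

log_probs = torch.log_softmax(logits, dim=-1)[torch.arange(5), actions]
# Minimizing -R * log p_theta(a_t | s_t) gives exactly the gradient
# R * grad(log p_theta(a_t | s_t)) derived above.
loss = -(returns * log_probs).mean()
loss.backward()    # autograd computes grad(log p) for us
```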
If we add a baseline $b$, the gradient becomes:
$$\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n}\big(R(\tau^n) - b\big)\,\nabla \log p_\theta(a_t^n \mid s_t^n).$$
Imagine a situation where the first step was a very good decision but the second step was a mistake, yet the final score is still good only because the first step was so well played; the model would then mistakenly believe that the second step was also good.
Applying the idea of version 1, we should not use the overall return $R(\tau^n)$ to judge the correctness of a particular step, but rather the sum of the rewards obtained after that step, which avoids the problem raised above:
$$\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n}\Big(\sum_{t'=t}^{T_n} r_{t'}^n - b\Big)\,\nabla \log p_\theta(a_t^n \mid s_t^n).$$
Going further and combining the attenuation factor of version 2 with the baseline of version 3, we obtain the final gradient expression:
$$\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n}\Big(\sum_{t'=t}^{T_n} \gamma^{t'-t}\, r_{t'}^n - b\Big)\,\nabla \log p_\theta(a_t^n \mid s_t^n).$$
On-policy vs. Off-policy
If the actor being trained and the actor interacting with the environment are the same, we call it on-policy. In other words, if the actor gains experience by doing the training itself, it is on-policy; if it gains experience by watching other actors interact, it is off-policy.
An advantage of off-policy is that the actor can be trained with the experience of other actors, so there is no need to constantly collect new data, which can greatly improve training efficiency.