Meta Learning

This study session is no thin broth: theory today, code tomorrow, with three chefs cooking at full tilt.

Concept

Meta-learning means "learning to learn": where ordinary deep learning learns a function f, meta-learning learns how to obtain the function f.

	target	input	function	output	workflow
Machine learning	Find the mapping f between x and y from the training data	x	f	y	1. initialize f; 2. load data <x, y>; 3. compute the loss and optimize f; 4. predict y = f(x)
Meta learning	Find a function F that produces the mapping f by training over tasks and their data	Training tasks and their data	F	f	1. initialize F; 2. load training tasks T and their data D, optimize F; 3. obtain f from the optimized F*; 4. on a new task: y = f(x)

In machine learning there is a single level of training, whose training unit is the data point. In meta-learning the training unit of the first level is the task and that of the second level is the data. In fact, the second level needs only a small amount of training: what was learned at the first level can be adapted to a new task quickly, which is also known as few-shot learning.

Bi-level optimization

Meta-training can be formalized as bi-level optimization:

\begin{matrix} \text{(outer loop)} & ω^{*} = \underset{ω}{\arg\min} \sum_{i = 1}^{M} L^{\mathrm{meta}} (θ^{* (i)} (ω), D_{\mathrm{train}}^{\mathrm{query} (i)}) \end{matrix}

\begin{matrix} \text{(inner loop)} & \text{s.t.} \quad θ^{* (i)} (ω) = \underset{θ}{\arg\min} L^{\mathrm{task}} (θ, ω, D_{\mathrm{train}}^{\mathrm{support} (i)}) \end{matrix}

Meta-testing for new task $j$ can be formalized as:

θ^{*} = \underset{θ}{\arg\min} L^{\mathrm{task}} (θ, ω^{*}, D_{\mathrm{test}}^{\mathrm{support} (j)})

First, the data of each meta-training task is divided into a support set and a query set. $ω$ can be viewed as the learning algorithm; $θ$ can be viewed as the model parameters.

In the inner loop, on the support set, the algorithm $ω$ is used to optimize the parameters $θ$ according to the task loss. The inner optimization returns the value $θ^{*}$ that minimizes $L^{\mathrm{task}}$ .

In the outer loop, on the query set, $L^{\mathrm{meta}}$ is evaluated at the $θ^{*}$ produced by the inner loop. The losses are summed over all tasks, and $ω$ is adjusted to minimize this total, so that the algorithm $ω$ performs as well as possible across all tasks.

The inner level is an ordinary machine learning problem. The outer level takes the results of the inner level as its data and learns the parameters of the meta-model, which play the role of the "hyperparameters" of the ordinary machine learning problem.
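The two levels can be sketched on a deliberately tiny problem. Everything in this example is an illustrative assumption rather than part of any paper: each task is a noisy 1-D regression, $θ$ is a single slope, and $ω$ is an L2 penalty strength shared by all tasks. The inner problem has a closed-form solution, and the outer problem is solved by a plain grid search instead of gradient descent.

```python
import random

def inner_solve(omega, support):
    """Inner loop: argmin over theta of sum((theta*x - y)^2) + omega*theta^2."""
    sxy = sum(x * y for x, y in support)
    sxx = sum(x * x for x, _ in support)
    return sxy / (sxx + omega)

def query_loss(theta, query):
    """Meta loss: mean squared error of the adapted theta on the query set."""
    return sum((theta * x - y) ** 2 for x, y in query) / len(query)

def make_task(rng, n_support=3, n_query=20, noise=1.0):
    """Hypothetical task family: y = a*x + noise, with the slope a drawn near 0.5."""
    a = rng.gauss(0.5, 0.2)

    def draw(n):
        points = []
        for _ in range(n):
            x = rng.uniform(-1.0, 1.0)
            points.append((x, a * x + rng.gauss(0.0, noise)))
        return points

    return draw(n_support), draw(n_query)

def outer_search(tasks, omegas):
    """Outer loop: pick the omega whose inner solutions give the lowest total query loss."""
    def total(omega):
        return sum(query_loss(inner_solve(omega, s), q) for s, q in tasks)
    return min(omegas, key=total)

rng = random.Random(0)
tasks = [make_task(rng) for _ in range(200)]
omegas = [0.0, 0.1, 0.3, 1.0, 3.0, 10.0]
best_omega = outer_search(tasks, omegas)
```

The outer loop never touches a support set directly: it only sees how well each inner solution does on the matching query set, which is the sense in which the inner results act as its data.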

Close reading of the paper

Source: https://arxiv.org/abs/1703.03400

MAML

Problem Set-Up

The goal of few-shot meta-learning is to train a model that can quickly adapt to a new task using only a few datapoints and training iterations.

Consider a model, denoted $f$ , that maps observations $x$ to outputs $a$ .

A generic notion of a learning task:

Each task

T = \{ L (x_{1}, a_{1}, \dots, x_{H}, a_{H}), q (x_{1}), q (x_{t + 1} | x_{t}, a_{t}), H \}

consists of a loss function $L$ , a distribution over initial observations $q (x_{1})$ , a transition distribution $q (x_{t + 1} | x_{t}, a_{t})$ , and an episode length $H$ .

In supervised learning problems, the length $H = 1$ . The loss $L (x_{1}, a_{1}, \dots, x_{H}, a_{H}) \to \mathbb{R}$ provides task-specific feedback, which might take the form of a misclassification loss or a cost function in a Markov decision process.

During meta-training, a task $T_{i}$ is sampled from $p (T)$ , the model is trained with $K$ samples and feedback from the corresponding loss ${L_{T}}_{i}$ of $T_{i}$ , and then tested on new samples from $T_{i}$ . The test error on the sampled tasks $T_{i}$ serves as the training error of the meta-learning process. That is, the "errors" of the inner training are themselves used as meta-training data, so the meta-learner finds out what the consequences are when the task parameters are set in a given way, and gradually improves. At the end of meta-training, new tasks are sampled from $p (T)$ , and performance is measured after the model has learned from $K$ samples of each new task.
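To make the episode structure concrete, here is a minimal sketch of sampling one task and drawing the $K$ samples to learn from plus fresh samples to test on. The sine-wave task family follows the regression experiment in the paper (amplitude in [0.1, 5.0], phase in [0, π], inputs in [-5, 5]). The function names and the exact form `amplitude * sin(x - phase)` are assumptions made for illustration.

```python
import math
import random

def sample_task(rng):
    """Draw one task T_i ~ p(T): a sine wave with random amplitude and phase."""
    amplitude = rng.uniform(0.1, 5.0)
    phase = rng.uniform(0.0, math.pi)
    return lambda x: amplitude * math.sin(x - phase)

def sample_points(task, n, rng):
    """Draw n (x, y) pairs from a task, with x uniform on [-5, 5]."""
    xs = [rng.uniform(-5.0, 5.0) for _ in range(n)]
    return [(x, task(x)) for x in xs]

def sample_episode(rng, k=10, n_test=10):
    """One meta-training episode: K samples to learn from, new samples to test on."""
    task = sample_task(rng)
    train = sample_points(task, k, rng)       # used for the task-level update
    test = sample_points(task, n_test, rng)   # its error drives the meta-update
    return train, test

rng = random.Random(0)
train, test = sample_episode(rng, k=5, n_test=15)
```

Both sets come from the same task, so the test error measures how well the model adapted to that particular sine wave rather than to sine waves in general.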

MAML

The intuition behind this algorithm is that some internal representations are more transferable than others, and the goal is to find them.

MAML is based on gradient descent, so the aim is to find parameters that are sensitive to changes in the task: a small change in these parameters should produce a large improvement in the loss of any task drawn from $p (T)$ .

When faced with a particular task, the parameters $θ$ move along that task's gradient, and after adapting to task $T_{i}$ they become the task-specific parameters $θ_{i}^{'}$ , which is the effect we want.

Consider a model represented by a parametrized function $f_{θ}$ with parameters $θ$ . When adapted to task $T_{i}$ , $θ$ will become $θ_{i}^{'}$ .

We compute $θ_{i}^{'}$ with one (or more) steps of classical gradient descent on task $T_{i}$ :

θ_{i}^{'} = θ - α \nabla_{θ} {L_{T}}_{i} (f_{θ})

${L_{T}}_{i} (f_{θ})$ is the loss of the model on $T_{i}$ , $\nabla_{θ} {L_{T}}_{i} (f_{θ})$ is the gradient of this loss with respect to $θ$ , and $α$ is the inner step size.

Our task (the meta-objective) now becomes finding the $θ$ that minimizes the sum of the losses of $f_{θ_{i}^{'}}$ over all tasks:

\underset{θ}{\min} \sum_{T_{i} \sim p (T)} {L_{T}}_{i} (f_{θ_{i}^{'}}) = \sum_{T_{i} \sim p (T)} {L_{T}}_{i} (f_{θ - α \nabla_{θ} {L_{T}}_{i} (f_{θ})})

The meta-optimization is performed over $θ$ , while the meta-objective is computed using the updated parameters $θ^{'}$ . In other words, we want an initialization $θ$ from which one or a few gradient steps on a new task already give good parameters.

The meta-optimization across tasks is performed via SGD. The model parameters $θ$ are updated as follows:

θ \leftarrow θ - β \nabla_{θ} \sum_{T_{i} \sim p (T)} {L_{T}}_{i} (f_{θ_{i}^{'}})

where $β$ is the meta step size.

Pseudocode of MAML
randomly initialize $θ$
while not done do
    sample batch of tasks $T_{i} \sim p (T)$
    for all $T_{i}$ do
        evaluate $\nabla_{θ} {L_{T}}_{i} (f_{θ})$ with respect to $K$ examples
        compute adapted parameters with gradient descent: $θ_{i}^{'} = θ - α \nabla_{θ} {L_{T}}_{i} (f_{θ})$
    end for
    update $θ \leftarrow θ - β \nabla_{θ} \sum_{T_{i} \sim p (T)} {L_{T}}_{i} (f_{θ_{i}^{'}})$
end while
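The loop above can be traced on a toy problem small enough to differentiate by hand. This is a sketch under stated assumptions, not the paper's implementation: the model is $f_{θ} (x) = θ x$ with a scalar $θ$ , each task is a slope `a` with targets `y = a * x`, and the meta-update uses fresh points from the same task. For this model the second derivative of the loss is constant in $θ$ , so the derivative of the inner update is simply `1 - alpha * hess`.

```python
import random

def loss(theta, data):
    """Mean squared error of f_theta(x) = theta * x on one dataset."""
    return sum((theta * x - y) ** 2 for x, y in data) / len(data)

def grad(theta, data):
    """First derivative of that loss with respect to theta."""
    return 2.0 * sum(x * (theta * x - y) for x, y in data) / len(data)

def hess(data):
    """Second derivative of that loss; constant in theta for this model."""
    return 2.0 * sum(x * x for x, _ in data) / len(data)

def sample_data(slope, k, rng):
    """K noise-free points from the hypothetical task y = slope * x."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(k)]
    return [(x, slope * x) for x in xs]

def maml_step(theta, slopes, alpha, beta, k, rng):
    """One outer iteration over a batch of tasks (each task is a slope)."""
    meta_grad = 0.0
    for slope in slopes:
        d_train = sample_data(slope, k, rng)
        d_meta = sample_data(slope, k, rng)
        theta_i = theta - alpha * grad(theta, d_train)  # inner update
        # Chain rule through the inner update: d(theta_i)/d(theta) = 1 - alpha * hess
        meta_grad += grad(theta_i, d_meta) * (1.0 - alpha * hess(d_train))
    return theta - beta * meta_grad  # meta-update

rng = random.Random(0)
theta = 0.0
for _ in range(500):
    slopes = [rng.uniform(2.0, 4.0) for _ in range(5)]
    theta = maml_step(theta, slopes, alpha=0.5, beta=0.02, k=5, rng=rng)

# Adapting to a new task takes a single gradient step from the learned theta.
new_task = sample_data(3.5, 5, rng)
adapted = theta - 0.5 * grad(theta, new_task)
```

With slopes drawn from [2, 4], the learned `theta` settles near the middle of that range, a starting point from which one step moves toward any task in the family.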

Since the meta-gradient passes through two nested updates (the outer gradient differentiates through the inner gradient step), an additional backward pass through $f$ is required to compute Hessian-vector products.
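To see where the second derivative comes from, one can apply the chain rule to a single inner step. The following is a sketch for one task and one inner gradient step, where $I$ denotes the identity matrix and the transpose is dropped because the Hessian is symmetric:

```latex
\nabla_{θ} {L_{T}}_{i} (f_{θ_{i}^{'}})
  = \left( \frac{\partial θ_{i}^{'}}{\partial θ} \right)^{\top} \nabla_{θ_{i}^{'}} {L_{T}}_{i} (f_{θ_{i}^{'}})
  = \left( I - α \nabla_{θ}^{2} {L_{T}}_{i} (f_{θ}) \right) \nabla_{θ_{i}^{'}} {L_{T}}_{i} (f_{θ_{i}^{'}})
```

The last factor is an ordinary gradient evaluated at $θ_{i}^{'}$ . The Hessian of the inner loss only ever appears multiplied by that gradient vector, so the full matrix is never needed on its own.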

About Hessian-vector product

A Hessian matrix is the matrix of second partial derivatives of a scalar-valued function whose argument is a vector.

Consider a function:

f (x_{1}, x_{2}, . . ., x_{n})

If all the second partial derivatives of $f$ exist, then the $(i, j)$ -th entry of the Hessian matrix of $f$ is:

H (f)_{i j} (x) = D_{i} D_{j} f (x)

where $x = (x_{1}, x_{2}, \dots, x_{n})$ . Written out in full:

H (f) = {[\begin{matrix} \frac{\partial^{2} f}{\partial x_{1}^{2}} & \frac{\partial^{2} f}{\partial x_{1} \partial x_{2}} & \dots & \frac{\partial^{2} f}{\partial x_{1} \partial x_{n}} \\ \frac{\partial^{2} f}{\partial x_{2} \partial x_{1}} & \frac{\partial^{2} f}{\partial x_{2}^{2}} & \dots & \frac{\partial^{2} f}{\partial x_{2} \partial x_{n}} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ \frac{\partial^{2} f}{\partial x_{n} \partial x_{1}} & \frac{\partial^{2} f}{\partial x_{n} \partial x_{2}} & \dots & \frac{\partial^{2} f}{\partial x_{n}^{2}} \end{matrix}]}_{n \times n}

In our setting the variables $x_{1}, x_{2}, \dots, x_{n}$ play the role of the model parameters, and $f (x_{1}, x_{2}, \dots, x_{n})$ is the function being differentiated (for MAML, the task loss).

In meta-learning, the outer optimization involves the result of the inner optimization, so the gradient of the outer loss depends not only on the current model parameters but also on how the inner gradient step changes them. The gradient of the outer optimization therefore contains second-order derivative information, which is where the Hessian matrix comes in. However, the full Hessian of a model with $n$ parameters has $n^{2}$ entries, so storing and computing it costs $O (n^{2})$ , which is prohibitive for neural networks. A Hessian-vector product avoids this: the product of the Hessian with a given vector can be computed with one additional backward pass, without ever constructing the full matrix. I omit here the details of how the Hessian-vector product is computed.
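As a rough illustration of computing a Hessian-vector product without ever building the Hessian, the sketch below uses a central finite difference of the gradient, $H v \approx (\nabla f (x + ε v) - \nabla f (x - ε v)) / 2 ε$ . This is a numerical stand-in chosen so the example needs no autodiff library. It is not how deep learning frameworks do it (they obtain the product by differentiating through the gradient computation). The test function and all names are hypothetical.

```python
import math

def grad_f(x):
    """Gradient of the example function f(x0, x1) = x0**2 * x1 + sin(x1)."""
    x0, x1 = x
    return [2.0 * x0 * x1, x0 * x0 + math.cos(x1)]

def hvp(grad_fn, x, v, eps=1e-5):
    """Approximate H(x) @ v from two gradient calls, without building H."""
    x_plus = [xi + eps * vi for xi, vi in zip(x, v)]
    x_minus = [xi - eps * vi for xi, vi in zip(x, v)]
    g_plus, g_minus = grad_fn(x_plus), grad_fn(x_minus)
    return [(gp - gm) / (2.0 * eps) for gp, gm in zip(g_plus, g_minus)]

def exact_hvp(x, v):
    """Reference: multiply v by the hand-derived 2x2 Hessian of f."""
    x0, x1 = x
    return [2.0 * x1 * v[0] + 2.0 * x0 * v[1],
            2.0 * x0 * v[0] - math.sin(x1) * v[1]]

x, v = [1.5, -0.7], [0.3, 2.0]
approx, exact = hvp(grad_f, x, v), exact_hvp(x, v)
```

The cost is two gradient evaluations regardless of $n$ , whereas forming the matrix first would require all $n^{2}$ second derivatives.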

Variants of MAML (applications)

Supervised Regression and Classification

Few-shot learning makes it possible to perform a task such as classification from only a small amount of data, for example recognizing cats after seeing just a few examples, by drawing on the many other kinds of objects the model has seen before.

Since the model accepts a single input and produces a single output, rather than a sequence of inputs and outputs, we can define the horizon (episode length) $H = 1$ and drop the timestep subscript on $x_{t}$ . The task generates $K$ observations $x$ from $q_{i}$ , and the task loss is the error between the model's output for $x$ and the corresponding target value $y$ for that observation and task.

Two common loss functions used for supervised classification and regression are cross-entropy and mean-squared error (MSE):

\begin{matrix} \text{(MSE loss)} & {L_{T}}_{i} (f_{ϕ}) = \sum_{x^{(j)}, y^{(j)} \sim T_{i}} \| f_{ϕ} (x^{(j)}) - y^{(j)} \|_{2}^{2} \end{matrix}

where $x^{(j)}$ , $y^{(j)}$ is an input/output pair sampled from task $T_{i}$ . In $K$ -shot regression tasks, $K$ input/output pairs are provided for learning each task.

For discrete classification tasks with a cross-entropy loss (written with a leading minus sign so that minimizing it improves the fit):

\begin{matrix} \text{(cross-entropy loss)} & {L_{T}}_{i} (f_{ϕ}) = - \sum_{x^{(j)}, y^{(j)} \sim T_{i}} [ y^{(j)} \log f_{ϕ} (x^{(j)}) + (1 - y^{(j)}) \log (1 - f_{ϕ} (x^{(j)})) ] \end{matrix}

K-shot classification tasks use $K$ input/output pairs from each class, for a total of $N K$ datapoints for $N$ -way classification.
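The two losses and the $N K$ count can be written out directly. This is a minimal sketch for scalar outputs and binary labels. The function names are illustrative, and the cross-entropy is given the sign for which lower is better.

```python
import math

def mse_loss(preds, targets):
    """Sum of squared errors over the sampled (x, y) pairs of one task."""
    return sum((p - y) ** 2 for p, y in zip(preds, targets))

def cross_entropy_loss(probs, labels, eps=1e-12):
    """Binary cross-entropy; probs are model outputs in (0, 1), labels are 0 or 1."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def num_datapoints(n_way, k_shot):
    """N-way K-shot classification uses K examples for each of N classes."""
    return n_way * k_shot

regression_error = mse_loss([1.0, 2.0], [0.0, 4.0])
classification_error = cross_entropy_loss([0.9, 0.2], [1, 0])
support_size = num_datapoints(n_way=5, k_shot=1)
```

For example, a 5-way 1-shot task gives the learner only five labelled examples in total.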

Pseudocode of MAML for Few-Shot Supervised Learning
randomly initialize $θ$
while not done do
    sample batch of tasks $T_{i} \sim p (T)$
    for all $T_{i}$ do
        sample $K$ datapoints $D = \{x^{(j)}, y^{(j)}\}$ from $T_{i}$
        evaluate $\nabla_{θ} {L_{T}}_{i} (f_{θ})$ using $D$ and ${L_{T}}_{i}$
        compute adapted parameters with gradient descent: $θ_{i}^{'} = θ - α \nabla_{θ} {L_{T}}_{i} (f_{θ})$
        sample datapoints $D_{i}^{'} = \{x^{(j)}, y^{(j)}\}$ from $T_{i}$ for the meta-update
    end for
    update $θ \leftarrow θ - β \nabla_{θ} \sum_{T_{i} \sim p (T)} {L_{T}}_{i} (f_{θ_{i}^{'}})$ using each $D_{i}^{'}$ and ${L_{T}}_{i}$
end while

Reinforcement Learning (MDP)

Reinforcement learning is a learning mechanism in which an agent learns a mapping from states to actions that maximizes the reward it obtains. The agent has to keep experimenting in the environment and refine this state-to-action mapping using the feedback (reward) the environment gives.

Each RL task $T_{i}$ contains an initial state distribution $q_{i} (x_{1})$ and a transition distribution $q_{i} (x_{t + 1} | x_{t}, a_{t})$ , and the loss ${L_{T}}_{i}$ corresponds to the (negative) reward function $R$ . The entire task is a Markov decision process (MDP) with horizon $H$ . The model being learned, $f_{θ}$ , is a policy that maps states $x_{t}$ (the current environment) to a distribution over actions $a_{t}$ (the actions available in the current state) at each timestep $t \in \{1, 2, \dots, H\}$ . The loss for task $T_{i}$ and model $f_{ϕ}$ takes the form

{L_{T}}_{i} (f_{ϕ}) = - E_{x_{t}, a_{t} \sim f_{ϕ}, q_{T_{i}}} [\sum_{t = 1}^{H} R_{i} (x_{t}, a_{t})]

Notice that there is no loss function to use directly here. Instead we construct one by first computing the expected total reward and then negating it, so that minimizing the loss maximizes the reward.

In $K$ -shot reinforcement learning, $K$ rollouts from $f_{θ}$ and task $T_{i}$ , $(x_{1}, a_{1}, \dots, x_{H})$ , and the corresponding rewards $R (x_{t}, a_{t})$ , may be used for adaptation on a new task $T_{i}$ .
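In practice the expectation in the loss is estimated from the sampled rollouts. A minimal sketch, assuming each rollout is given simply as the list of rewards collected over its $H$ steps (the function name and data layout are illustrative):

```python
def rl_loss(rollout_rewards):
    """Monte Carlo estimate of -E[sum_t R(x_t, a_t)] from K rollouts."""
    returns = [sum(rewards) for rewards in rollout_rewards]
    return -sum(returns) / len(returns)

# Two rollouts (K = 2) with horizon H = 3; their returns are 3.0 and 1.0.
estimated_loss = rl_loss([[1.0, 0.0, 2.0], [0.0, 0.0, 1.0]])
```

A policy that collects more reward gets a lower (more negative) loss, which is what lets the same gradient-descent machinery be reused.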

The expected reward is generally not differentiable, because the environment dynamics are unknown, so the loss cannot be differentiated directly either. We use policy gradient methods to estimate the gradient, both for the model gradient update(s) and for the meta-optimization. In a policy gradient algorithm, the policy function takes the state $s$ and the action $a$ as input and outputs a probability between 0 and 1. A later article will explain how the policy gradient is implemented.
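To give a flavour of how a gradient can be estimated from samples alone, here is a minimal sketch of the score-function (REINFORCE) estimator, one of the simplest policy gradient methods. The setup is an assumption made for brevity: a stateless two-armed bandit with horizon $H = 1$ , so the policy is just a softmax over two logits. This is not necessarily the estimator used in the paper's experiments.

```python
import math
import random

def softmax(logits):
    """Turn logits into action probabilities, each between 0 and 1."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_grad(theta, reward_fn, k, rng):
    """Score-function estimate of the gradient of the loss -E[R] from k samples."""
    probs = softmax(theta)
    g = [0.0] * len(theta)
    for _ in range(k):
        a = rng.choices(range(len(theta)), weights=probs)[0]
        r = reward_fn(a)
        for j in range(len(theta)):
            # d/dtheta_j log pi(a) = 1[j == a] - pi_j for a softmax policy
            g[j] -= r * ((1.0 if j == a else 0.0) - probs[j]) / k
    return g

def reward(a):
    """Hypothetical bandit: only action 1 pays off."""
    return 1.0 if a == 1 else 0.0

rng = random.Random(0)
theta = [0.0, 0.0]
for _ in range(200):
    g = reinforce_grad(theta, reward, k=20, rng=rng)
    theta = [t - 0.5 * gi for t, gi in zip(theta, g)]
probs = softmax(theta)
```

The reward function is only ever called, never differentiated: the gradient comes entirely from the log-probability of the actions that were sampled.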

Pseudocode of MAML for Reinforcement Learning
randomly initialize $θ$
while not done do
    sample batch of tasks $T_{i} \sim p (T)$
    for all $T_{i}$ do
        sample $K$ trajectories $D = \{(x_{1}, a_{1}, \dots, x_{H})\}$ using $f_{θ}$ in $T_{i}$
        evaluate $\nabla_{θ} {L_{T}}_{i} (f_{θ})$ using $D$ and ${L_{T}}_{i}$
        compute adapted parameters with gradient descent: $θ_{i}^{'} = θ - α \nabla_{θ} {L_{T}}_{i} (f_{θ})$
        sample trajectories $D_{i}^{'} = \{(x_{1}, a_{1}, \dots, x_{H})\}$ using $f_{θ_{i}^{'}}$ in $T_{i}$
    end for
    update $θ \leftarrow θ - β \nabla_{θ} \sum_{T_{i} \sim p (T)} {L_{T}}_{i} (f_{θ_{i}^{'}})$ using each $D_{i}^{'}$ and ${L_{T}}_{i}$
end while

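This algorithm has the same structure as the few-shot supervised learning algorithm (Algorithm 2 in the paper), with the principal difference being that the two sampling steps (steps 5 and 8 in the paper's numbering) require sampling trajectories from the environment corresponding to task $T_{i}$ .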

Experimental Evaluation

I will not explain this part in detail. The original paper gives a set of plots showing that a model trained with MAML has captured the characteristics of sine waves.
