# Neural Processes Explained

In this post, we will look at the Neural Process (NP), a model that borrows concepts from both the Gaussian Process (GP) and the Neural Network (NN). The Neural Process was proposed in the paper *Neural Processes*. To explain how it works, I will first give a simplified introduction to the Gaussian Process, then introduce the NP concepts one by one until we arrive at the final design.

# What is a Gaussian Process

Here I give my understanding of the GP from the machine learning (ML) perspective. The statements below may not be fully rigorous and may differ from the textbook definition, but I will try to make them easy to understand in ML language.

A Gaussian Process (GP) is a stochastic process over a sequence of pairs of random variables *{(Xᵢ, Yᵢ) | i ∈ [1, n]}*. Let’s denote *(xᵢ, yᵢ)* as a sample drawn from *(Xᵢ, Yᵢ)*, ***Y*** as *[Y₁, Y₂, …, Yₙ]*, and ***X*** as *[X₁, X₂, …, Xₙ]*. We define the prior distribution of ***Y*** given ***X*** as:

$$
p(Y \mid X) = \mathcal{N}\big(m(X),\ \kappa(X, X)\big)
$$

where *m* is the mean function; it is common to use *m(x) = 0* since the input data are usually normalized. *κ* is the kernel function that measures the similarity between samples; a common choice is the radial basis function (RBF) kernel.

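Since the Gaussian distribution is a conjugate prior, we can derive the posterior distribution for an unseen *yₙ₊₁* at a new input *xₙ₊₁* as a Gaussian distribution. With *m(x) = 0* and noise-free observations, it reads:

$$
\begin{aligned}
p(y_{n+1} \mid x_{n+1}, X, Y) &= \mathcal{N}\big(\mu_{n+1},\ \sigma_{n+1}^2\big) \\
\mu_{n+1} &= \kappa(x_{n+1}, X)\,\kappa(X, X)^{-1}\,Y \\
\sigma_{n+1}^2 &= \kappa(x_{n+1}, x_{n+1}) - \kappa(x_{n+1}, X)\,\kappa(X, X)^{-1}\,\kappa(X, x_{n+1})
\end{aligned}
$$

whose mean is the prediction for *yₙ₊₁* and whose variance is the uncertainty of that prediction (I skip the proof, but it should be straightforward to prove with Bayes' rule).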
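To make the posterior concrete, here is a minimal NumPy sketch of GP prediction. It assumes a zero mean function, an RBF kernel with unit length scale, 1-D inputs, and noise-free observations; the names `rbf_kernel`, `gp_posterior`, and the `jitter` term (added only for numerical stability) are my own and not taken from the paper.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """RBF kernel matrix between 1-D input arrays a (n,) and b (m,)."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * sq_dist / length_scale**2)

def gp_posterior(x_ctx, y_ctx, x_new, jitter=1e-8):
    """Posterior mean and variance at x_new for a zero-mean GP (assumed setup)."""
    k_cc = rbf_kernel(x_ctx, x_ctx) + jitter * np.eye(len(x_ctx))
    k_nc = rbf_kernel(x_new, x_ctx)
    k_nn = rbf_kernel(x_new, x_new)
    # Solve linear systems instead of forming the explicit inverse.
    mean = k_nc @ np.linalg.solve(k_cc, y_ctx)
    cov = k_nn - k_nc @ np.linalg.solve(k_cc, k_nc.T)
    return mean, np.diag(cov)

# Three observed points from sin(x); predict at an observed and a far-away input.
x_ctx = np.array([-1.0, 0.0, 1.0])
y_ctx = np.sin(x_ctx)
mean, var = gp_posterior(x_ctx, y_ctx, np.array([0.0, 3.0]))
print(mean, var)
```

The variance should be close to zero at the observed input and noticeably larger at the far-away one, which is the uncertainty estimate discussed above. The linear solve against the *n × n* kernel matrix is also where the cost of a GP comes from as *n* grows.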

We can see that GP is a non-parametric method, which can learn fast and incrementally from online samples by updating the posterior distribution and provide uncertainty estimation through variance. However, the drawbacks of GP are also apparent; some cases are hard for GP to deal with:

- High-dimensional ***X*** or ***Y***
- A large number of training samples (since the cost of exact inference grows cubically with *n*)
- Temporal correlation between *(Xᵢ, Yᵢ)* (since the GP is finitely exchangeable, which means the order of the *(Xᵢ, Yᵢ)* does not affect the distribution)

# Make a Neural Process

Now let us look at the Neural Process (NP), which incorporates NNs into the GP’s concepts to address the first two drawbacks mentioned above. The final model can adapt from online samples, estimate uncertainty, deal with high-dimensional data such as images, and run inference in constant time. Let us see how it achieves that, step by step.

# 1. Simplify the Prior by Using a Latent Gaussian Variable

NP uses a different prior distribution from GP. Instead of letting *Yₙ₊₁* (**target output**) directly depend on ***X*** and ***Y*** (**context inputs**), NP lets *Yₙ₊₁* depend on a newly introduced latent Gaussian variable *z*, whose *μ* and *σ* depend on ***X*** and ***Y***.

# 2. Learn Contextual Features to Scale Up

Following the new prior, NP achieves constant inference time by learning an encoder + aggregator structure that encodes the information from the context inputs ***X*** and ***Y*** into a constant-size contextual feature: the mean and variance of *z*.

During inference, NP only needs to sample *z* and feed *Xₙ₊₁* together with that sample to a decoder to get the final prediction of *Yₙ₊₁*.

Online adaptation works by updating the contextual feature with new pairs of samples. The variance of *z* and the stability of *Yₙ₊₁* across different samples of *z* serve as the estimates of uncertainty.
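The encoder, aggregator, and decoder described above can be sketched in a few lines of NumPy. This is only an illustration of the data flow under my own assumptions: the weights are random and untrained, the aggregator is a plain mean, and all layer sizes and helper names (`init_mlp`, `mlp`, `encode_context`, `predict`) are hypothetical rather than taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
X_DIM, Y_DIM, R_DIM, Z_DIM, HIDDEN = 1, 1, 8, 4, 16  # assumed sizes

def init_mlp(sizes):
    """Random (untrained) weights for a small MLP; hypothetical helper."""
    return [(rng.normal(0.0, 0.5, (i, o)), np.zeros(o))
            for i, o in zip(sizes[:-1], sizes[1:])]

def mlp(params, h):
    """Apply tanh hidden layers followed by a linear output layer."""
    for w, b in params[:-1]:
        h = np.tanh(h @ w + b)
    w, b = params[-1]
    return h @ w + b

encoder = init_mlp([X_DIM + Y_DIM, HIDDEN, R_DIM])
latent_head = init_mlp([R_DIM, HIDDEN, 2 * Z_DIM])
decoder = init_mlp([X_DIM + Z_DIM, HIDDEN, Y_DIM])

def encode_context(x_ctx, y_ctx):
    """Encode each (x, y) pair, then average: the output size does not depend on n."""
    r_i = mlp(encoder, np.concatenate([x_ctx, y_ctx], axis=-1))  # (n, R_DIM)
    r = r_i.mean(axis=0)                                         # (R_DIM,)
    mu, log_sigma = np.split(mlp(latent_head, r), 2)
    return mu, np.exp(log_sigma)                                 # sigma > 0

def predict(x_new, mu, sigma, n_samples=20):
    """Sample z and decode; the spread across samples reflects uncertainty."""
    z = mu + sigma * rng.normal(size=(n_samples, Z_DIM))         # (S, Z_DIM)
    x_rep = np.broadcast_to(x_new, (n_samples, X_DIM))
    return mlp(decoder, np.concatenate([x_rep, z], axis=-1))     # (S, Y_DIM)

x_ctx = rng.uniform(-2.0, 2.0, size=(5, 1))
y_ctx = np.sin(x_ctx)
mu, sigma = encode_context(x_ctx, y_ctx)
samples = predict(np.array([0.5]), mu, sigma)
print(samples.mean(), samples.std())
```

Because the aggregator is a mean, the contextual feature has the same size for 5 or 500 context pairs, and it can be updated incrementally when new pairs arrive. The mean also ignores the order of the pairs, mirroring the exchangeability of the GP.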

# 3. Maximize the Posterior Likelihood to Train

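Like other latent variable models, NP is trained by maximizing a lower bound on the conditional log-likelihood of the target given the contexts. The paper uses the term evidence lower bound (ELBO). In this post's notation, it has the form:

$$
\log p(y_{n+1} \mid x_{n+1}, X, Y) \ \ge\ \mathbb{E}_{q(z \mid X, Y, x_{n+1}, y_{n+1})}\big[\log p(y_{n+1} \mid z, x_{n+1})\big] - \mathrm{KL}\big(q(z \mid X, Y, x_{n+1}, y_{n+1}) \,\|\, q(z \mid X, Y)\big)
$$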

The design intention and the derivation procedure for the equation above can be found on page 16 of [1]. To implement the loss, we approximate the expectation with a classical stochastic mini-batch training scheme, replacing the first term with a prediction loss for *Yₙ₊₁*. The second term can be interpreted as a regularizer that keeps the distribution of *z* stable across different context inputs.
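The two terms of the loss can be sketched as follows. I assume here that the decoder outputs a Gaussian mean and standard deviation, that the distribution over *z* is a diagonal Gaussian, and that the expectation is estimated with a single sample of *z*; the function names are hypothetical.

```python
import numpy as np

def gaussian_nll(y, mean, sigma):
    """Negative log-likelihood of y under N(mean, sigma^2), summed over elements."""
    return np.sum(0.5 * np.log(2.0 * np.pi * sigma**2)
                  + 0.5 * ((y - mean) / sigma) ** 2)

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) between diagonal Gaussians, summed over latent dimensions."""
    return np.sum(np.log(sigma_p / sigma_q)
                  + (sigma_q**2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p**2)
                  - 0.5)

def np_loss(y_tgt, pred_mean, pred_sigma, mu_post, sigma_post, mu_prior, sigma_prior):
    """Negative ELBO estimate: prediction loss plus KL regularizer on z."""
    prediction_loss = gaussian_nll(y_tgt, pred_mean, pred_sigma)
    regularizer = kl_diag_gaussians(mu_post, sigma_post, mu_prior, sigma_prior)
    return prediction_loss + regularizer

# Toy numbers: z from context + target (posterior) vs. z from context only (prior).
loss = np_loss(
    y_tgt=np.array([0.3]), pred_mean=np.array([0.1]), pred_sigma=np.array([0.5]),
    mu_post=np.array([0.2, -0.1]), sigma_post=np.array([0.8, 0.9]),
    mu_prior=np.array([0.0, 0.0]), sigma_prior=np.array([1.0, 1.0]),
)
print(loss)
```

The KL term is zero when the two distributions over *z* coincide, so minimizing it pushes the feature computed from the context alone toward the one computed with the target included.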

We train the model on a large number of small task episodes, where each episode consists of some observed *(x, y)* pairs as contexts and some unobserved pairs as targets. This training setting is similar to meta-learning, and the paper also compares NP with MAML [3] in the experiments section.
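An episode of this kind can be sketched as below. The task family (sine curves with random amplitude and phase), the split sizes, and the name `sample_episode` are my own choices for illustration and are not the exact setup used in the paper.

```python
import numpy as np

def sample_episode(rng, n_points=20, n_context=5):
    """Draw one toy regression task and split it into context and target pairs."""
    amplitude = rng.uniform(0.5, 2.0)
    phase = rng.uniform(0.0, np.pi)
    x = rng.uniform(-3.0, 3.0, size=(n_points, 1))
    y = amplitude * np.sin(x + phase)
    # Random split: the first n_context indices are observed, the rest are targets.
    idx = rng.permutation(n_points)
    ctx, tgt = idx[:n_context], idx[n_context:]
    return (x[ctx], y[ctx]), (x[tgt], y[tgt])

rng = np.random.default_rng(0)
(x_ctx, y_ctx), (x_tgt, y_tgt) = sample_episode(rng)
print(x_ctx.shape, x_tgt.shape)
```

Each training step would draw a fresh episode, so the model sees many different functions and learns how to adapt to a new one from a handful of context pairs.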

# Experiments

The paper demonstrates the NP’s flexibility with experiments such as function regression, bandit problems, and image completion. Here I show some of the more exciting result figures. I recommend going back to the paper for more details.

# End

In this post, I introduced a machine learning model called the Neural Process. It is always fascinating to see progress made by combining Bayesian methods with neural networks. I hope you find this post useful.

Thanks for reading!

# References

[1]: Tutorial on Variational Autoencoders https://arxiv.org/abs/1606.05908

[2]: Gaussian processes http://krasserm.github.io/2018/03/19/gaussian-processes/

[3]: Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks https://arxiv.org/abs/1703.03400