Gaussian Process#
Multivariate Gaussian distributions#
For a given set of training points, there are potentially infinitely many functions that fit the data. Gaussian processes offer an elegant solution to this problem by assigning a probability to each of these functions.
The mean of this probability distribution then represents the most probable characterization of the data.
Furthermore, using a probabilistic approach allows us to incorporate the confidence of the prediction into the regression result.
Gaussian distributions have the nice algebraic property of being closed under conditioning and marginalization.
Being closed under conditioning and marginalization means that the resulting distributions from these operations are also Gaussian, which makes many problems in statistics and machine learning tractable.
Marginalization

\[ p_X(x) = \int_y p_{X,Y}(x, y)\,dy = \int_y p_{X|Y}(x|y)\,p_Y(y)\,dy \]

Conditioning
\[\begin{split} X \mid Y &\sim \mathcal{N}\left(\mu_X + \Sigma_{XY}\Sigma_{YY}^{-1}(Y - \mu_Y),\; \Sigma_{XX} - \Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}\right)\\ Y \mid X &\sim \mathcal{N}\left(\mu_Y + \Sigma_{YX}\Sigma_{XX}^{-1}(X - \mu_X),\; \Sigma_{YY} - \Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}\right) \end{split}\]

In conditioning, the new mean depends on the observed value of the conditioning variable, while the new covariance does not: it is constant with respect to that value (the Schur complement of the joint covariance).
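A minimal numpy sketch of these two operations, using made-up values for the joint mean and covariance blocks:

```python
import numpy as np

# Made-up joint Gaussian over (X, Y): block means and covariances.
mu_X, mu_Y = np.array([0.0]), np.array([1.0])
S_XX, S_XY = np.array([[2.0]]), np.array([[0.8]])
S_YX, S_YY = S_XY.T, np.array([[1.0]])

# Marginalization: just drop the other block, X ~ N(mu_X, S_XX).

# Conditioning on an observed Y, following the formulas above.
y_obs = np.array([2.0])
gain = S_XY @ np.linalg.inv(S_YY)
mu_cond = mu_X + gain @ (y_obs - mu_Y)   # depends on the observed value
S_cond = S_XX - gain @ S_YX              # does not depend on the observed value

print(mu_cond, S_cond)  # [0.8] [[1.36]]
```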
Gaussian Processes#
Stochastic processes, such as Gaussian processes, are essentially collections of random variables, where each random variable has a corresponding index \(i\).
In Gaussian processes we treat each test point as a random variable. A multivariate Gaussian distribution has the same number of dimensions as the number of random variables. Since we want to predict the function values at \(|X| = N\) test points, the corresponding multivariate Gaussian distribution is also \(N\)-dimensional. Making a prediction using a Gaussian process ultimately boils down to drawing samples from this distribution. We then interpret the \(i\)-th component of the resulting vector as the function value corresponding to the \(i\)-th test point.
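A small sketch of this sampling view, with a hand-written mean and covariance for three test points (in a real GP these come from a kernel, introduced in the next section):

```python
import numpy as np

# Three test points -> a 3-dimensional Gaussian distribution.
mu = np.zeros(3)
Sigma = np.array([[1.0, 0.7, 0.2],
                  [0.7, 1.0, 0.7],
                  [0.2, 0.7, 1.0]])

# One draw from the distribution is one candidate function:
# sample[i] is the function value at the i-th test point.
sample = np.random.multivariate_normal(mu, Sigma)
```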
Kernels#
The clever step of Gaussian processes is how we set up the covariance matrix \(\Sigma\). The covariance matrix not only describes the shape of our distribution, but ultimately determines the characteristics of the function that we want to predict.
We generate the covariance matrix by evaluating the kernel \(k\), often also called the covariance function, pairwise on all the points.
The kernel receives two points \(t, t' \in \mathbb{R}^n\) as input and returns a similarity measure between those points in the form of a scalar:

\[ k : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}, \quad \Sigma = \text{Cov}(X, X') = k(t, t') \]
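A sketch of this pairwise construction for scalar inputs, assuming some kernel function `k` is given:

```python
import numpy as np

def cov_matrix(k, ts):
    # Evaluate the kernel pairwise on all points to build Sigma.
    n = len(ts)
    Sigma = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            Sigma[i, j] = k(ts[i], ts[j])
    return Sigma

# Placeholder kernel just for illustration; see the examples below.
k = lambda t, tp: np.exp(-0.5 * (t - tp) ** 2)
Sigma = cov_matrix(k, np.linspace(0.0, 1.0, 5))
```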
Since the kernel describes the similarity between the values of our function, it controls the possible shape that a fitted function can adopt.
Kernels allow similarity measures that go far beyond the standard Euclidean distance (L2 distance). Many of these kernels conceptually embed the input points into a higher-dimensional space in which they then measure the similarity.
Examples of kernels:
RBF: \(\sigma^2\exp\left(-\frac{\lVert t-t'\rVert^2}{2l^2}\right)\)
Periodic: \(\sigma^2\exp\left(-\frac{2\sin^2(\pi|t-t'|/p)}{l^2}\right)\)
Linear: \(\sigma_b^2 + \sigma^2(t-c)(t'-c)\)
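Possible one-to-one translations of these formulas for scalar inputs (parameter names \(\sigma\), \(l\), \(p\), \(c\) match the definitions above; the default values are arbitrary):

```python
import numpy as np

def rbf(t, tp, sigma=1.0, l=1.0):
    # Similarity decays with the squared distance between the points.
    return sigma**2 * np.exp(-(t - tp)**2 / (2 * l**2))

def periodic(t, tp, sigma=1.0, l=1.0, p=1.0):
    # Similarity repeats with period p.
    return sigma**2 * np.exp(-2 * np.sin(np.pi * np.abs(t - tp) / p)**2 / l**2)

def linear(t, tp, sigma=1.0, sigma_b=0.0, c=0.0):
    # Depends on the absolute positions of the points relative to the offset c.
    return sigma_b**2 + sigma**2 * (t - c) * (tp - c)
```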
Kernels can be separated into stationary and non-stationary kernels.
Stationary kernels, such as the RBF kernel or the periodic kernel, are invariant to translations: the covariance of two points depends only on their relative position.
Non-stationary kernels, such as the linear kernel, do not have this constraint and depend on the absolute location of the points.
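This is easy to check with the kernel sketches above: shifting both inputs by the same amount leaves a stationary kernel unchanged but changes a non-stationary one.

```python
shift = 3.0
print(rbf(1.0, 2.0), rbf(1.0 + shift, 2.0 + shift))        # equal: stationary
print(linear(1.0, 2.0), linear(1.0 + shift, 2.0 + shift))  # different: non-stationary
```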
A big benefit that kernels provide is that they can be combined together, resulting in a more specialized kernel.
The decision of which kernel to use is highly dependent on prior knowledge about the data, e.g. whether certain characteristics are expected.
The most common kernel combinations are addition and multiplication, but there are further possibilities such as concatenation or composition with a function.
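As a sketch, summing or multiplying the example kernels from above directly yields new valid kernels:

```python
def rbf_plus_periodic(t, tp):
    # Addition: smooth local trend plus repeating structure.
    return rbf(t, tp, l=2.0) + periodic(t, tp, p=0.5)

def rbf_times_periodic(t, tp):
    # Multiplication: periodic pattern whose correlation decays with distance.
    return rbf(t, tp, l=2.0) * periodic(t, tp, p=0.5)
```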
Qs:
Didn't understand what a GP is in an intuitive way and how to interpret it.
What do functions from xx distributions that are created using different kernels mean?
Is each data sample/point a variable in a GP? Is the dimension the number of data points?