How are hyperparameters determined in Bayesian optimization?

SUMMARY

This discussion focuses on the determination of hyperparameters in Bayesian optimization using Gaussian Processes and the Expected Improvement (EI) acquisition function. The objective is to identify the optimal parameters, denoted as ##\theta##, for an unknown parametric function ##f(x, \theta)##. The process involves applying Bayes' theorem to approximate the posterior distribution and iteratively calculating the expected improvement based on the Gaussian process prior. The key takeaway is that the acquisition function guides the sampling of points to maximize the posterior, ultimately leading to the identification of hyperparameters through iterative refinement.

PREREQUISITES
  • Understanding of Bayesian optimization principles
  • Familiarity with Gaussian Processes and their application
  • Knowledge of the Expected Improvement (EI) acquisition function
  • Basic grasp of Bayes' theorem and posterior distributions
NEXT STEPS
  • Study Gaussian Process regression techniques in detail
  • Explore the mathematical foundations of Bayes' theorem in optimization
  • Learn about different acquisition functions beyond Expected Improvement
  • Investigate hyperparameter tuning methods in machine learning frameworks
USEFUL FOR

Data scientists, machine learning engineers, and researchers interested in optimizing complex functions through Bayesian methods will benefit from this discussion.

BRN
Hello,
I am studying in more depth the theory behind Bayesian optimization with a Gaussian process and the Expected Improvement (EI) acquisition function.
I would like to lay out what I think I understand and ask you to correct me if I'm wrong.

The aim is to find the best parameters ##\theta## for a parametric function ##f(x, \theta)## (the objective function) whose analytical form is not known.
Bayes' theorem is used to approximate ##f## by the posterior, and the best parameter set is then the one that maximizes the posterior.
A normal distribution is used as the likelihood and a Gaussian process as the prior:

$$\pi = \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}}\exp\left[-\frac{1}{2}(\mathbf{Y}-\mathbf{\mu})^T\mathbf{\Sigma}^{-1}(\mathbf{Y}-\mathbf{\mu})\right] $$
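
For concreteness, here is a minimal sketch (my own illustration, not from the thread) of evaluating this multivariate normal density with NumPy, assuming ##\mathbf{Y}##, ##\boldsymbol{\mu}## and ##\boldsymbol{\Sigma}## are given as arrays:

```python
import numpy as np

def mvn_density(Y, mu, Sigma):
    """Multivariate normal density, as written in the formula above."""
    k = Y.shape[0]
    diff = Y - mu
    # 1 / ((2*pi)^(k/2) * |Sigma|^(1/2))
    norm_const = 1.0 / ((2.0 * np.pi) ** (k / 2) * np.sqrt(np.linalg.det(Sigma)))
    # exp(-1/2 (Y - mu)^T Sigma^{-1} (Y - mu)), computed without forming Sigma^{-1}
    return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

# Toy 2-dimensional example
Y = np.array([0.5, -0.2])
mu = np.zeros(2)
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
print(mvn_density(Y, mu, Sigma))
```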

Everything happens iteratively, point by point. The points at which the posterior is evaluated are given by the acquisition function, which samples points into a dataset ##\mathcal{D}_t##. Then, the improvement is defined as

$$
I = \begin{cases}
0 & \text{for}\; f > h \\
h_{t+1}(x) - f(x^+) & \text{for}\; f < h
\end{cases}
$$

where ##h_{t+1}(x)## is the posterior function evaluated at step ##t+1## and ##f(x^+)## is the maximum value reached so far.

One can then determine the expected improvement

$$
\alpha_{\rm EI}(x^*|\mathcal{D}_t) = \mathbb{E}[I(h)] = \int I(h) \pi {\rm d}h
$$

That is, the expected improvement depends on the Gaussian process (the prior).
Therefore, at each step, the posterior is calculated at the point ##x_{max}## defined as

$$x_{max} = {\rm argmax}_x \alpha_{\rm EI}(x|\mathcal{D}_{t-1})$$
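
For what it's worth, here is a minimal sketch (my own, using the standard closed-form EI for maximization rather than your exact notation) of how this acquisition step typically looks in code. Here `mu` and `sigma` are the Gaussian-process posterior mean and standard deviation at candidate points, and `f_best` plays the role of ##f(x^+)##:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    """Closed-form EI(x) = E[max(f(x) - f_best, 0)] when f(x) ~ N(mu, sigma^2)."""
    sigma = np.maximum(sigma, 1e-12)              # guard against zero variance
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)

# Toy numbers standing in for a GP posterior evaluated at three candidate points
mu = np.array([0.2, 0.5, 0.9])      # posterior means
sigma = np.array([0.3, 0.1, 0.4])   # posterior standard deviations
ei = expected_improvement(mu, sigma, f_best=0.6)
print(ei, np.argmax(ei))            # the argmax picks x_max among the candidates
```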

I don't know if what I wrote is correct. I'm a little confused ...

If I got something wrong, could someone explain it to me better? Could you tell me where to find a complete explanation of this topic? Online I find only sketchy explanations.

Thanks!
 
BRN said:
I don't know if what I wrote is correct. I'm a little confused ... Could you tell me where to find a complete explanation of this topic?

Maybe this is a good resource?

http://www.gaussianprocess.org/
 
OK, I'm understanding a little more, but I still have some doubts ...

To summarize schematically:
  1. Start by randomly sampling a dataset ##\mathcal{D}_t## from all the available data;
  2. with ##\mathcal{D}_t##, evaluate the black-box function for the first time;
  3. on the black-box function evaluations a surrogate model is created: Bayes' theorem is applied, $$P(f|\mathcal{D}_t, \theta )\propto P(\mathcal{D}_t|f, \theta)P(f)$$ where the posterior is the function that approximates the black-box function, the likelihood is a normal distribution, and the prior is a Gaussian process whose covariance depends on the data and on the hyperparameters;
  4. the EI acquisition function, which depends on the posterior, intelligently picks a new sample, among those not yet used, to be added to the dataset ##\mathcal{D}_t##;
  5. steps 3 and 4 are repeated until convergence or until a given number of iterations is completed.
What I don't understand is: what searches for the better hyperparameters?
Are they found by maximum likelihood at step 3, or by the acquisition function?
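
Just to illustrate step 3 with a minimal sketch, assuming scikit-learn's GaussianProcessRegressor as the surrogate (it may not be the library you are using): in that implementation the kernel hyperparameters (length scale, signal variance) are tuned inside `.fit()` by maximizing the log marginal likelihood, while the acquisition function is only used to choose the next point to evaluate:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

# Toy observed data standing in for the black-box evaluations in D_t
X = np.linspace(0.0, 1.0, 10).reshape(-1, 1)
y = np.sin(6.0 * X).ravel()

# Kernel with free hyperparameters: signal variance and length scale
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)

# fit() maximizes the log marginal likelihood over the kernel hyperparameters
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5)
gp.fit(X, y)

print(gp.kernel_)                          # fitted hyperparameters
print(gp.log_marginal_likelihood_value_)   # the objective that was maximized
```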
 
