How are hyperparameters determined in Bayesian optimization?

In summary, Bayesian optimization with a Gaussian Process (GP) and the expected improvement (EI) acquisition function seeks the best parameters of an objective function whose analytical form is unknown. Bayes' theorem is used to update a GP posterior over the objective, and at each iteration the acquisition function decides which point to evaluate next and add to the dataset. The GP hyperparameters are found by the method of maximum likelihood each time the model is refit. These steps are repeated until convergence or until a fixed number of iterations is reached.
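As a concrete illustration of that maximum-likelihood step, here is a minimal sketch using scikit-learn's GP regressor, which maximizes the log marginal likelihood over the kernel hyperparameters inside fit(); the kernel choice and toy data are assumptions for illustration only.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# Toy observations of an unknown objective (illustrative data only).
X = np.array([[0.1], [0.4], [0.7], [0.9]])
y = np.sin(3 * X).ravel()

# Kernel hyperparameters (amplitude, length scale) start at arbitrary values.
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)

# fit() maximizes the log marginal likelihood with respect to the kernel
# hyperparameters (restarts help avoid poor local optima).
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5)
gp.fit(X, y)

print("optimized kernel:", gp.kernel_)
print("log marginal likelihood:", gp.log_marginal_likelihood_value_)
```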
  • #1
BRN
Hello,
I am studying in more depth the theory underlying Bayesian optimization with a Gaussian Process and the EI acquisition function.
I would like to lay out what I think I understand and ask you to correct me if I'm wrong.

The aim is to find the best parameters ##\theta## for a parametric function ##f(x, \theta)## (the objective function) whose analytical form is not known.
Bayes' theorem is used to approximate ##f## by the posterior, and the best parameter set is then the one that maximizes the posterior.
A normal distribution is used as the likelihood and a Gaussian Process as the prior:

$$\pi = \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}}\exp\left[-\frac{1}{2}(\mathbf{Y}-\mathbf{\mu})^T\mathbf{\Sigma}^{-1}(\mathbf{Y}-\mathbf{\mu})\right] $$
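(Numerically, I would evaluate this density with something like the sketch below; the observation vector, mean, and covariance are placeholder values only.)

```python
import numpy as np
from scipy.stats import multivariate_normal

# Placeholder values: k = 2 observed points, arbitrary mean and covariance.
Y = np.array([0.3, -0.1])
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])

# The multivariate normal density written above.
pi_val = multivariate_normal.pdf(Y, mean=mu, cov=Sigma)
print(pi_val)
```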

Everything happens iteratively, point by point. The points at which the posterior is evaluated are chosen by the acquisition function, which samples points to add to a dataset ##D_t##. The improvement is then defined as

$$
I(x) = \begin{cases}
0 & \text{for}\; h_{t+1}(x) \le f(x^+) \\
h_{t+1}(x)-f(x^+) & \text{for}\; h_{t+1}(x) > f(x^+)
\end{cases}
$$

where ##h_{t+1}(x)## is the posterior function evaluated at step ##t+1## and ##f(x^+)## is the best value reached so far.

From this, one can determine the expected improvement:

$$
\alpha_{\rm EI}(x^*|\mathcal{D}_t) = \mathbb{E}[I(h)] = \int I(h) \pi {\rm d}h
$$

That is, the expected improvement depends on the Gaussian process (the prior).
Therefore, at each step, the posterior is calculated at the point ##x_{max}## defined as

$$x_{max} = {\rm argmax}_x \alpha_{\rm EI}(x|\mathcal{D}_{t-1})$$
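Just to make this concrete for myself: under a GP posterior with mean ##\mu(x)## and standard deviation ##\sigma(x)##, this integral should have a closed form, and the argmax can be taken over a candidate grid. A minimal sketch (the toy data, kernel, and grid are placeholders, not anything prescribed):

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def expected_improvement(mu, sigma, f_best):
    """Closed-form EI for a Gaussian posterior (maximization convention)."""
    sigma = np.maximum(sigma, 1e-12)           # avoid division by zero
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)

# Points already in D_t (placeholder data).
X_obs = np.array([[0.1], [0.5], [0.9]])
y_obs = np.sin(5 * X_obs).ravel()

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2)).fit(X_obs, y_obs)

# Candidate grid over which the acquisition is maximized.
X_cand = np.linspace(0.0, 1.0, 200).reshape(-1, 1)
mu, sigma = gp.predict(X_cand, return_std=True)

ei = expected_improvement(mu, sigma, y_obs.max())
x_max = X_cand[np.argmax(ei)]                  # next point to evaluate
print("x_max =", x_max)
```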

I don't know if what I wrote is correct. I'm a little confused ...

If I'm wrong, could someone explain it to me better? Could you tell me where to find a complete explanation of this topic? Online I find only sketchy explanations.

Thanks!
 
  • #2

Maybe this is a good resource?

http://www.gaussianprocess.org/
 
  • #3
Ok, I'm understanding a little more, but I still have some doubts ...

To summarize schematically:
  1. Start with a random sample of a dataset ##D_t## drawn from all the available data;
  2. with ##D_t##, evaluate the black-box function for the first time;
  3. a surrogate model is built on the black-box function evaluations by applying Bayes' theorem, $$P(f|D_t, \theta )\propto P(D_t|f, \theta)P(f)$$ where the posterior is the function that approximates the black-box function, the likelihood is a normal distribution, and the prior is a Gaussian process whose covariance depends on the data and on the hyperparameters;
  4. the acquisition function EI, which depends on the posterior, intelligently selects a new sample point, among those not yet used, to be added to the dataset ##D_t##;
  5. steps 3 and 4 are repeated until convergence or until a certain number of iterations have been completed (a minimal sketch of this loop follows the list).
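To check my understanding, here is a minimal sketch of this loop (1-D toy objective, RBF kernel, grid-based EI maximization; all of these choices are placeholders, not a definitive implementation):

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def black_box(x):                               # unknown objective (toy stand-in)
    return -(x - 0.6) ** 2 + 0.1 * np.sin(20 * x)

rng = np.random.default_rng(0)
X_pool = np.linspace(0, 1, 500).reshape(-1, 1)  # all available points

# Steps 1-2: random initial dataset D_t and first black-box evaluations.
idx = rng.choice(len(X_pool), size=5, replace=False)
X, y = X_pool[idx], black_box(X_pool[idx]).ravel()

for t in range(20):
    # Step 3: refit the GP surrogate; the kernel hyperparameters are
    # re-optimized by maximizing the log marginal likelihood inside fit().
    gp = GaussianProcessRegressor(kernel=RBF(0.1), n_restarts_optimizer=3).fit(X, y)

    # Step 4: expected improvement over the pool, pick the maximizer.
    mu, sigma = gp.predict(X_pool, return_std=True)
    sigma = np.maximum(sigma, 1e-12)
    z = (mu - y.max()) / sigma
    ei = (mu - y.max()) * norm.cdf(z) + sigma * norm.pdf(z)
    x_next = X_pool[[np.argmax(ei)]]

    # Step 5 (loop): evaluate the black box at the new point and grow D_t.
    X = np.vstack([X, x_next])
    y = np.append(y, black_box(x_next).ravel())

print("best x found:", X[np.argmax(y)], "best value:", y.max())
```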
What I don't understand is: what actually searches for the better hyperparameters?
Are they found by maximum likelihood at step 3, or by the acquisition function?
 

1. What is Bayesian Optimization Theory?

Bayesian Optimization Theory is a mathematical framework for optimizing black-box functions with a limited number of evaluations. It uses Bayesian inference to model the unknown function and make informed decisions about where to sample next.

2. How does Bayesian Optimization work?

Bayesian Optimization works by building a probabilistic model of the unknown function, incorporating prior knowledge and information from previous evaluations. It then uses this model to select the next point to evaluate, balancing exploration and exploitation to efficiently find the optimal solution.
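For example, the exploration/exploitation balance is explicit in an acquisition function such as the upper confidence bound (a common alternative to EI); the sketch below is purely illustrative, and the trade-off parameter kappa and the numbers are arbitrary.

```python
import numpy as np

def upper_confidence_bound(mu, sigma, kappa=2.0):
    # mu exploits what the surrogate already predicts to be good;
    # kappa * sigma rewards exploring regions where the model is uncertain.
    return mu + kappa * sigma

# Posterior mean/std at three candidate points (arbitrary numbers).
mu = np.array([0.9, 0.5, 0.2])
sigma = np.array([0.01, 0.1, 0.6])

print(upper_confidence_bound(mu, sigma))               # balanced choice
print(upper_confidence_bound(mu, sigma, kappa=10.0))   # exploration-heavy
```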

3. What are the advantages of using Bayesian Optimization?

Bayesian Optimization is advantageous because it can optimize complex, non-linear functions with a limited number of evaluations. It also incorporates prior knowledge and can handle noisy or expensive-to-evaluate functions. Additionally, it can handle multiple objectives and constraints in a single optimization problem.

4. How is Bayesian Optimization different from other optimization methods?

Unlike traditional optimization methods, Bayesian Optimization does not require knowledge of the gradient or derivatives of the function. It also does not rely on a fixed set of parameter values, making it more adaptable to different problems. Additionally, it can handle stochastic and non-deterministic functions, making it more robust.

5. What are some real-world applications of Bayesian Optimization?

Bayesian Optimization has been successfully applied in various fields, such as engineering, machine learning, and finance. Some examples include optimizing the parameters of machine learning models, tuning the hyperparameters of deep neural networks, and designing efficient energy systems. It has also been used in drug discovery, robotics, and aerospace engineering.
