How are hyperparameters determined in Bayesian optimization?

AI Thread Summary
The discussion centers on understanding Bayesian optimization using Gaussian Processes (GP) and the Expected Improvement (EI) acquisition function. The primary goal is to identify the optimal parameters (θ) for an unknown objective function f(x, θ). The process begins by applying Bayes' theorem to approximate the posterior distribution of f, using a Gaussian Process as the prior and a normal function as the likelihood. The iterative nature of the optimization involves calculating the posterior at sampled points, guided by the acquisition function, which determines where to sample next based on potential improvements over the current maximum. The expected improvement is computed to evaluate the potential benefit of sampling at various points, with the next sample point chosen to maximize this expected improvement. Key points of confusion include the role of hyperparameter optimization within this framework. It is clarified that hyperparameters are typically optimized using maximum likelihood methods during the modeling process, rather than through the acquisition function. The discussion also highlights the need for clearer resources on the topic, as existing explanations are often insufficient.
BRN
Hello,
I am studying in more depth the theory behind Bayesian optimization with a Gaussian process and the EI acquisition function.
I would like to lay out what I think I understand and ask you to correct me if I'm wrong.

The aim is to find the best parameters ##\theta## for a parametric function ##f(x, \theta)## (the objective function) whose analytical form is not known.
Bayes' theorem is used to approximate ##f## by a posterior, and the best parameter set is then the one that maximizes the posterior.
A normal function is used as the likelihood and a Gaussian process as the prior:

$$\pi = \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}}\exp\left[-\frac{1}{2}(\mathbf{Y}-\mathbf{\mu})^T\mathbf{\Sigma}^{-1}(\mathbf{Y}-\mathbf{\mu})\right] $$
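
For reference, here is a sketch of what conditioning such a Gaussian prior on data gives. The notation is introduced here for illustration and is not from the thread: a zero prior mean, a kernel ##k##, observations ##\mathbf{y}## at points ##x_1,\dots,x_t##, and Gaussian noise variance ##\sigma_n^2##. Under those assumptions the posterior at a new point ##x## is again Gaussian, with mean and variance

$$
\mu_t(x) = \mathbf{k}(x)^T\left(\mathbf{K} + \sigma_n^2 \mathbf{I}\right)^{-1}\mathbf{y}, \qquad
\sigma_t^2(x) = k(x,x) - \mathbf{k}(x)^T\left(\mathbf{K} + \sigma_n^2 \mathbf{I}\right)^{-1}\mathbf{k}(x)
$$

where ##\mathbf{K}_{ij} = k(x_i, x_j)## and ##\mathbf{k}(x)_i = k(x_i, x)##. The kernel hyperparameters enter only through ##k##.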

Everything happens iteratively, point by point. The points at which the posterior is calculated are chosen by the acquisition function, which adds sample points to a dataset ##D_t##. Then, the improvement is defined as

$$
I = \left\{ \begin{matrix}
0 & \text{for}\;h_{t+1}(x)\le f(x^+) \\
h_{t+1}(x)-f(x^+) &\text{for}\;h_{t+1}(x)>f(x^+)
\end{matrix}\right.
$$

where ##h_{t+1}(x)## is the posterior function evaluated at step ##t+1## and ##f(x^+)## is the maximum value reached so far.

One can then determine the expected improvement

$$
\alpha_{\rm EI}(x^*|\mathcal{D}_t) = \mathbb{E}[I(h)] = \int I(h) \pi {\rm d}h
$$
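
For reference, this integral has a well-known closed form when the posterior at ##x## is Gaussian. The symbols ##\mu_t(x)## and ##\sigma_t(x)## (posterior mean and standard deviation) are assumed here for illustration, and ##\Phi## and ##\phi## are the standard normal CDF and PDF. For ##\sigma_t(x) > 0##:

$$
\alpha_{\rm EI}(x|\mathcal{D}_t) = \left(\mu_t(x) - f(x^+)\right)\Phi(z) + \sigma_t(x)\,\phi(z), \qquad z = \frac{\mu_t(x) - f(x^+)}{\sigma_t(x)}
$$

When ##\sigma_t(x) = 0## the improvement is deterministic and the expression reduces to ##\max(\mu_t(x) - f(x^+), 0)##. The first term rewards points whose predicted value is high, the second rewards points where the model is uncertain.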

That is, the expected improvement depends on the Gaussian process (the prior).
Therefore, at each step, the posterior is calculated at the point ##x_{max}## defined as

$$x_{max} = {\rm argmax}_x \alpha_{\rm EI}(x|\mathcal{D}_{t-1})$$
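
The selection rule above can be sketched in Python. This is a minimal illustration, assuming the posterior at each candidate point is Gaussian with mean `mu` and standard deviation `sigma`, in which case EI has the standard closed form used below. The function names and the candidate values are hypothetical and do not come from any library.

```python
import math


def expected_improvement(mu, sigma, best):
    """Closed-form EI for maximization, given a Gaussian posterior N(mu, sigma^2)."""
    if sigma <= 0.0:
        # No posterior uncertainty: the improvement is deterministic.
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - best) * cdf + sigma * pdf


def next_sample_point(candidates, best):
    """Return the candidate x with the largest EI.

    candidates: iterable of (x, mu, sigma) tuples, i.e. hypothetical
    posterior summaries at each candidate point.
    """
    return max(candidates, key=lambda c: expected_improvement(c[1], c[2], best))[0]


# Hypothetical posterior values at three candidate points, for illustration only.
candidates = [(0.0, 1.0, 0.1), (0.5, 0.9, 0.8), (1.0, 0.2, 0.3)]
best_so_far = 1.0  # plays the role of f(x^+)
x_next = next_sample_point(candidates, best_so_far)
```

With these made-up numbers `x_next` is 0.5: the candidate with the largest uncertainty wins even though its posterior mean is slightly below the current best, which is the exploration behavior EI is meant to provide.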

I don't know if what I wrote is correct. I'm a little confused ...

If I am wrong, can someone explain it to me better? Could you tell me where to find a complete explanation of this topic? Online I can find only sketchy explanations.

Thanks!
 
Maybe this is a good resource?

http://www.gaussianprocess.org/
 
OK, I understand a little more, but I still have some doubts ...

To summarize schematically:
  1. Start with a random sample, the dataset ##D_t##, drawn from all the available data;
  2. with ##D_t##, evaluate the black-box function for the first time;
  3. from the black-box function values, build a surrogate model by applying Bayes' theorem $$P(f|D_t, \theta )\propto P(D_t|f, \theta)P(f)$$ where the posterior is the function that approximates the black-box function, the likelihood is a normal function, and the prior is a Gaussian process whose covariance depends on the data and the hyperparameters;
  4. the EI acquisition function, which depends on the posterior, intelligently selects a new sample from those not yet used and adds it to the dataset ##D_t##;
  5. steps 3 and 4 are repeated until convergence or until a set number of iterations is completed.
What I don't understand is: what searches for better hyperparameters?
Are they found by maximum likelihood at step 3, or by the acquisition function?
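
One common answer is that the hyperparameters are fit at step 3, by maximizing the GP marginal likelihood of the data each time the surrogate is rebuilt; the acquisition function then only uses the fitted model. Below is a minimal, dependency-free sketch of that fit. The assumptions are not from the thread: scalar inputs, a zero-mean GP, a squared-exponential kernel with a single length-scale hyperparameter, a fixed small noise variance, and a grid search in place of a gradient-based optimizer. All function names are hypothetical.

```python
import math


def rbf_kernel(a, b, length_scale):
    """Squared-exponential kernel for scalar inputs."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def log_marginal_likelihood(xs, ys, length_scale, noise_var=1e-4):
    """log p(y | x, length_scale) for a zero-mean GP with Gaussian noise."""
    n = len(xs)
    cov = [[rbf_kernel(xs[i], xs[j], length_scale) + (noise_var if i == j else 0.0)
            for j in range(n)] for i in range(n)]
    lower = cholesky(cov)
    # Forward substitution: solve lower @ a = ys, so that a.a = y^T cov^{-1} y.
    a = []
    for i in range(n):
        a.append((ys[i] - sum(lower[i][k] * a[k] for k in range(i))) / lower[i][i])
    quad = sum(v * v for v in a)
    log_det = 2.0 * sum(math.log(lower[i][i]) for i in range(n))
    return -0.5 * quad - 0.5 * log_det - 0.5 * n * math.log(2.0 * math.pi)


def fit_length_scale(xs, ys, grid):
    """Pick the length scale on the grid with the highest marginal likelihood."""
    return max(grid, key=lambda ls: log_marginal_likelihood(xs, ys, ls))


# Toy data, for illustration only: a smooth function observed at six points.
xs = [0.0, 0.4, 0.8, 1.2, 1.6, 2.0]
ys = [math.sin(3.0 * x) for x in xs]
grid = [0.05, 0.1, 0.2, 0.4, 0.8, 1.6]
best_length_scale = fit_length_scale(xs, ys, grid)
```

In a full loop, `fit_length_scale` would be called again whenever a new point is added to the dataset, before the acquisition function is maximized. Practical libraries usually replace the grid with a gradient-based optimizer over several hyperparameters at once.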
 