Restricted Boltzmann machine understanding

SUMMARY

Restricted Boltzmann Machines (RBMs) are utilized to infer probability distributions, specifically in scenarios with binary outcomes, such as 0 and 1 with probabilities p and 1-p. The process involves initializing a visible neuron for input representation and multiple hidden neurons, followed by a forward pass using the sigmoid function to compute hidden neuron values. A backward pass reconstructs the visible neuron, and Kullback-Leibler (K-L) divergence is calculated to adjust weights and biases via gradient descent. The discussion raises critical questions about the necessity of the forward and backward passes and the practical utility of RBMs in estimating probability distributions when the true distribution is already known.

PREREQUISITES
  • Understanding of Restricted Boltzmann Machines (RBMs)
  • Familiarity with Kullback-Leibler (K-L) divergence
  • Knowledge of gradient descent optimization techniques
  • Proficiency in using the sigmoid activation function
NEXT STEPS
  • Explore the mathematical foundations of Restricted Boltzmann Machines
  • Study the implementation of Kullback-Leibler divergence in machine learning
  • Learn about the role of hidden layers in neural networks
  • Investigate practical applications of RBMs in unsupervised learning
USEFUL FOR

Data scientists, machine learning practitioners, and researchers interested in probabilistic models and neural network architectures.

Jufa
Homework Statement: This post is for helping me with my Final Degree Project
Relevant Equations: No relevant equations
Suppose you have an experiment with two possible outcomes, 0 and 1, with probabilities p and 1-p respectively. I've been told at university that Restricted Boltzmann machines (RBMs) can be used to infer probability distributions, so I guess that one could build an RBM in order to infer p in the example just mentioned. I wonder if someone can resolve my doubts regarding this simple example. I would proceed as follows:

1-I would use one visible neuron to represent every input as one of the two possible results of the measurement (0 or 1), and an arbitrary number of hidden neurons (I suppose the more the better). I would also randomly initialize the weights and the biases.
2-Now I would perform the so-called forward pass. That is, I would compute the values of all the hidden neurons using the well-known stochastic procedure involving the value of the single visible neuron and the sigmoid function.
3-Then I would perform the backward pass in order to reconstruct the visible neuron, using the values of the hidden neurons and again the sigmoid function.
4-Now is where I find it difficult to go on. In some articles I've seen that it would now be time to compute the Kullback-Leibler (K-L) divergence in order to measure the distance between the actual probability distribution (p, 1-p) and the estimated one, ##\frac{1}{Z}\sum_h e^{-E(v,h)}##, where ##Z## is the partition function and ##E## is the energy associated with a given configuration of the machine. After computing the K-L divergence, the weights and biases are readjusted via gradient descent in order to minimize the distance between the two distributions (a short numerical sketch of these steps follows below).
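
For concreteness, here is a minimal sketch, assuming NumPy, of how steps 1-4 might look for a single binary visible unit when trained with the standard contrastive-divergence (CD-1) update, which uses the reconstruction to approximate the gradient of the K-L divergence. Variable names and hyperparameters are illustrative, not prescriptive.

```python
# Minimal CD-1 sketch: one binary visible unit, n_hidden binary hidden units.
import numpy as np

rng = np.random.default_rng(0)
n_hidden = 8
W = rng.normal(0.0, 0.1, size=n_hidden)    # weights: single visible unit -> hidden vector
b_v = 0.0                                  # visible bias
b_h = np.zeros(n_hidden)                   # hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(v):
    """Step 2: sample the hidden units given the visible unit."""
    p_h = sigmoid(b_h + W * v)
    return p_h, (rng.random(n_hidden) < p_h).astype(float)

def backward(h):
    """Step 3: reconstruct the visible unit from the hidden sample."""
    p_v = sigmoid(b_v + W @ h)
    return p_v, float(rng.random() < p_v)

# Training data: outcome 0 with probability p, outcome 1 with probability 1-p.
p = 0.7
data = (rng.random(1000) >= p).astype(float)

lr = 0.05
for v0 in data:                            # one CD-1 update per sample
    p_h0, h0 = forward(v0)                 # positive phase
    _, v1 = backward(h0)                   # reconstruction
    p_h1, _ = forward(v1)                  # negative phase
    W   += lr * (v0 * p_h0 - v1 * p_h1)    # approximate K-L gradient steps
    b_v += lr * (v0 - v1)
    b_h += lr * (p_h0 - p_h1)
```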
**My question is very simple: why are the forward and backward passes needed? The K-L divergence can be computed without having reconstructed the visible layer, and the readjustment of the weights and biases does not depend on the reconstruction either.
**Another question that comes to my mind, and that I haven't been able to answer yet, is: what is the point of building an RBM to estimate a probability distribution? I mean, in order to perform gradient descent you already need to know the probability distribution (p, 1-p) of the data. So what is the RBM actually helping us with?
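
To make the first question concrete: with a single visible unit the hidden units can be summed out analytically and the partition function runs over only two visible states, so the model distribution, and hence the K-L divergence to (p, 1-p), can indeed be evaluated exactly without any reconstruction. A minimal sketch, assuming NumPy and the same illustrative parameter shapes as above:

```python
# Exact model distribution and K-L divergence for a one-visible-unit RBM.
import numpy as np

rng = np.random.default_rng(0)
n_hidden = 8
W, b_v, b_h = rng.normal(0.0, 0.1, size=n_hidden), 0.0, np.zeros(n_hidden)  # example parameters

def model_distribution(W, b_v, b_h):
    """Return (q0, q1), the RBM probabilities of v=0 and v=1, hidden units summed out."""
    unnorm = [np.exp(b_v * v) * np.prod(1.0 + np.exp(b_h + W * v)) for v in (0.0, 1.0)]
    Z = sum(unnorm)                       # partition function: only two visible states here
    return unnorm[0] / Z, unnorm[1] / Z

def kl_divergence(p, q0, q1):
    """K-L divergence between the data distribution (p, 1-p) and the model (q0, q1), 0 < p < 1."""
    return p * np.log(p / q0) + (1 - p) * np.log((1 - p) / q1)

q0, q1 = model_distribution(W, b_v, b_h)
print(kl_divergence(0.7, q0, q1))
```

For visible layers with many units, however, ##Z## is no longer tractable, which is why sampling-based approximations are typically used instead of the exact K-L gradient.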

Thanks in advance.
 
Jufa said:
to perform gradient descent you already need to know the probability distribution (p, 1-p) of the data.

It isn't clear whether you are talking about an observed frequency ##p## as opposed to a probability ##p##. For example, if you generate training data from 100 independent trials with probability ##p## of success on each trial, it is unlikely that the fraction of successes you generate will be exactly equal to ##p##.

Different sets of training data generated from the same probability distribution will have different observed fractions of successes.
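
A quick numerical illustration of this point, assuming NumPy: repeated sets of 100 independent trials with the same success probability give different observed frequencies, and rarely exactly ##p##.

```python
# Observed frequency vs. probability: three independent data sets with the same p.
import numpy as np

rng = np.random.default_rng()
p = 0.7
for run in range(3):
    trials = rng.random(100) < p                     # 100 independent Bernoulli(p) trials
    print(f"run {run}: observed frequency = {trials.mean():.2f}")
```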
 
