# Restricted Boltzmann machine understanding

Jufa
Homework Statement:
This post is for helping me with my Final Degree Project
Relevant Equations:
No relevant equations
Suppose you have an experiment with two possible outcomes, 0 and 1, with probabilities ##p## and ##1-p## respectively. I've been told at university that restricted Boltzmann machines (RBMs) can be used to infer probability distributions, so I guess one could build an RBM to infer ##p## in the example just mentioned. I wonder if someone can resolve my doubts regarding this simple example. I would proceed as follows:
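As a concrete starting point, the toy dataset described above can be generated like this (a minimal sketch; the value of ##p## and the sample size are arbitrary assumptions for illustration):

```python
import numpy as np

# Hypothetical toy data for the two-outcome experiment:
# outcome 0 occurs with probability p, outcome 1 with probability 1 - p.
rng = np.random.default_rng(0)
p = 0.7       # assumed value, for illustration only
n = 1000
data = (rng.random(n) >= p).astype(int)  # 0 with prob p, 1 with prob 1 - p
print(data.mean())  # empirical frequency of outcome 1, near 1 - p
```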

1-I would use one visible neuron to encode every input as one of the two possible results of the measurement (0 or 1), and an arbitrary number of hidden neurons (I suppose the more the better). I would also randomly initialize the weights and biases.
2-Now I would perform the so-called forward pass. That is, I would compute the values of all the hidden neurons using the well-known stochastic procedure involving the value of the single visible neuron and the sigmoid function.
3-Then I would perform the backward pass in order to reconstruct the visible neuron, using the values of the hidden neurons and again the sigmoid function.
4-Now is where I find it difficult to go on. In some articles I've seen that it would now be time to compute the Kullback-Leibler (K-L) divergence in order to measure the distance between the actual probability distribution ##(p, 1-p)## and the estimated one, ##\frac{1}{Z}\sum_h e^{-E(v,h)}##, where ##Z## is the partition function and ##E## is the energy associated with a given configuration of the machine. After computing the K-L divergence, the weights and biases are readjusted via gradient descent in order to minimize the distance between the distributions.
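Steps 1-4 above can be sketched as a contrastive-divergence (CD-1) training loop. This is only an illustrative sketch, not a reference implementation: the number of hidden units, learning rate, epoch count, and the choice of ##p## are all assumptions, and the parameter updates use the standard CD-1 approximation rather than the exact K-L gradient.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Step 1: one visible unit, a few hidden units, random initialization.
n_hidden = 4                       # assumed, for illustration
W = rng.normal(0.0, 0.1, n_hidden) # weights from the single visible unit
a = 0.0                            # visible bias
b = np.zeros(n_hidden)             # hidden biases
lr = 0.1                           # assumed learning rate

p_true = 0.7                       # P(v = 0); outcome 1 has probability 1 - p_true
data = (rng.random(2000) >= p_true).astype(float)

for epoch in range(300):
    # Step 2: forward pass -- stochastic hidden activations given the data.
    ph0 = sigmoid(b + data[:, None] * W)              # shape (N, n_hidden)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Step 3: backward pass -- reconstruct the visible unit from the hiddens.
    pv1 = sigmoid(a + h0 @ W)
    v1 = (rng.random(len(data)) < pv1).astype(float)
    ph1 = sigmoid(b + v1[:, None] * W)
    # Step 4: CD-1 parameter update (an approximation to the K-L gradient).
    W += lr * ((data[:, None] * ph0).mean(axis=0) - (v1[:, None] * ph1).mean(axis=0))
    a += lr * (data.mean() - v1.mean())
    b += lr * (ph0.mean(axis=0) - ph1.mean(axis=0))

# With a single visible unit the hiddens can be summed out analytically:
# P(v) is proportional to e^{a v} * prod_j (1 + e^{b_j + v W_j}).
def unnorm_marginal(v):
    return np.exp(a * v) * np.prod(1.0 + np.exp(b + v * W))

p_model_1 = unnorm_marginal(1.0) / (unnorm_marginal(0.0) + unnorm_marginal(1.0))
print(p_model_1)  # typically moves close to 1 - p_true
```

Note that the update in step 4 uses the reconstruction statistics `(v1, ph1)` in place of samples from the model's own distribution; that substitution is exactly what CD-1 is.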
**My question is very simple:** why are the forward and backward passes needed? The K-L divergence can be computed without having reconstructed the visible layer, and the readjustment of the weights and biases does not depend on the reconstruction either.
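For what it's worth, in this toy case the premise of the question checks out numerically: with one visible unit and a handful of hidden units the partition function ##Z## is a small finite sum, so the K-L divergence between ##(p, 1-p)## and the model's marginal can be evaluated exactly, with no forward/backward reconstruction at all. (The parameter values below are arbitrary assumptions; reconstruction only becomes necessary as an approximation when ##Z## is intractable.)

```python
import numpy as np

n_hidden = 3                       # small enough that Z is an exact sum
rng = np.random.default_rng(2)
W = rng.normal(0.0, 0.1, n_hidden) # arbitrary illustrative parameters
a, b = 0.0, np.zeros(n_hidden)
p = 0.7                            # data distribution: P(v=0) = p, P(v=1) = 1 - p

def unnorm_marginal(v):
    # sum_h e^{-E(v,h)} with E(v,h) = -a v - b.h - v W.h, summed analytically
    return np.exp(a * v) * np.prod(1.0 + np.exp(b + v * W))

Z = unnorm_marginal(0) + unnorm_marginal(1)
q = np.array([unnorm_marginal(0) / Z, unnorm_marginal(1) / Z])  # model marginal
pdata = np.array([p, 1 - p])
kl = np.sum(pdata * np.log(pdata / q))
print(q, kl)  # exact model marginal and exact KL(data || model)
```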
**Another question that comes to mind, which I haven't been able to answer yet:** what is the point of building an RBM to estimate a probability distribution? I mean, in order to perform gradient descent you already need to know the probability distribution ##(p, 1-p)## of the data. So what is the RBM actually helping us with?

> to perform gradient descent you already need to know the probability distribution (p, 1-p) of the data.

It isn't clear whether you are talking about an observed frequency ##p## or a probability ##p##. For example, if you generate training data by 100 independent trials with probability ##p## of success on each trial, it is unlikely that the fraction of successes you generate will be exactly equal to ##p##.

Different sets of training data generated from the same probability distribution will have different observed fractions of successes.
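A quick numerical illustration of this point (the value of ##p##, the trial count, and the number of training sets are arbitrary): several training sets drawn from the same Bernoulli distribution give different observed fractions of successes, and rarely exactly ##p##.

```python
import numpy as np

rng = np.random.default_rng(3)
p = 0.3          # probability of success, chosen for illustration
# Five training sets of 100 independent trials each; record each set's
# observed fraction of successes.
fractions = [rng.binomial(100, p) / 100 for _ in range(5)]
print(fractions)  # values scattered around p, generally not exactly p
```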