## Metropolis Algorithm - Why does it work?

I'm doing a project based on the Metropolis algorithm for exploring parameter spaces. The computational aspect I can understand and implement, and I understand that a Markov chain is constructed in the parameter space. However, I don't understand the theory behind why it works so well at exploring it. Being a physics student, most of the statistical literature I have read has not helped me. Can anyone explain it to me, or point me to basic references?

Thanks.

 Quote by Pi-Bond I understand that a Markov chain is constructed in the parameter space. However, I don't understand the theory behind why it works so well in exploring it.
Do you have an understanding of the "stationary distribution" of a Markov Chain? That would be the place to start.
 If I'm not mistaken, the stationary distribution X with transition matrix P is given by $$XP=X$$ Is that correct?
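For what it's worth, here is a quick numerical illustration of that equation in Python, using a made-up 2-state chain and the row-vector convention $XP=X$ from the post above. Iterating $X \leftarrow XP$ from any starting distribution converges to the stationary one:

```python
# Hypothetical 2-state chain: row-stochastic transition matrix P,
# row vector X, convention X P = X.
P = [[0.9, 0.1],
     [0.5, 0.5]]

# Iterate X <- X P from an arbitrary starting distribution.
X = [1.0, 0.0]
for _ in range(200):
    X = [X[0] * P[0][0] + X[1] * P[1][0],
         X[0] * P[0][1] + X[1] * P[1][1]]

print(X)  # converges to the stationary distribution
# Solving X P = X with X[0] + X[1] = 1 by hand gives X = (5/6, 1/6).
```

The iterate converges because the second eigenvalue of this $P$ has modulus less than 1; the stationary distribution is the eigenvector of eigenvalue 1.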

Correct. I'll let Stephen take you further down this road.

You also implied that you would like some physical analog to help you understand this. A good physical analog lies in the annealing of ferrous metals. Simulated annealing emulates this. The Metropolis-Hastings algorithm is at the heart of simulated annealing. Simulated annealing is essentially Markov Chain Monte Carlo by another name.

 Quote by Pi-Bond If I'm not mistaken, the stationary distribution X with transition matrix P is given by $$XP=X$$ Is that correct?
Yes. (My preference is to multiply a column vector on the left by the matrix, so I would say $TX = X$ where $T$ is the transition matrix.)

It may be a stretch to apply this metaphor to the Metropolis algorithm, but let's try. In that situation, there is some unknown probability distribution of frogs on pads. This doesn't involve letting them jump; perhaps someone has created the distribution by placing them on the pads. Furthermore, the pond is too large for you to see all the lily pads at once. Your job is to define a set of jump probabilities for each pad so that when the frogs are allowed to jump according to that Markov process, their jumping will approximately preserve the original distribution of the frogs (i.e. the original distribution of the frogs will be the stationary distribution of the process). You can count the number of frogs on a given pad $P_a$, and you can count the number of frogs on the surrounding pads $P_b, P_c, \ldots$ to which the frogs may jump. But you don't know the total number of frogs in the whole pond. How will you determine a set of jump probabilities involving pad $P_a$?

It may or may not be intuitive that you can determine the probabilities by computations using ratios of frogs, such as (number of frogs on pad $P_b$) / (number of frogs on pad $P_a$), without knowing the original distribution of frogs (e.g. not knowing (number of frogs on pad $P_b$) / (total number of frogs in the pond)). If this is intuitive to you then, since you're doing physics, it may be a sufficient explanation! If not, we must do some algebra.
 Your explanation makes some sense to me, but I think I need more time to let it sink in. At the same time, I'm looking to understand it more from a mathematical point of view. I'm not sure where the proposal pdf enters the discussion here. Are you saying that the equilibrium distribution of the frogs is equivalent to the distribution we wanted to explore?

 Quote by Pi-Bond Are you saying that the equilibrium distribution of the frogs is equivalent to the distribution we wanted to explore?
 I'm having some trouble thinking about this analogy in terms of a 1D problem. At each step we are just jumping to another point based on a ratio. In this analogy, is there just one frog?
 I prefer to think of a horde of frogs, each acting independently. You can think of the process as being executed by one frog, but then you have to get empirical evidence about the probability of jumping from pad $P_a$ to pad $P_b$ by sitting and waiting for him to land on pad $P_a$ and then observing what fraction of his total jumps from pad $P_a$ are to pad $P_b$. I don't like to wait.
 Ok, this is starting to make more sense now. Can you tell me how the factor to accept or reject a jump $$\alpha=\min \left( \frac{P^{(c)}}{P^{(o)}},1 \right)$$ enters the analogy? ($P^{(c)}$ is the probability of the candidate point being considered and $P^{(o)}$ is the probability of the previous point in the chain.)

 Quote by Pi-Bond Ok, this is starting to make more sense now. Can you tell me how the factor to accept or reject a jump $$\alpha=\min \left( \frac{P^{(c)}}{P^{(o)}},1 \right)$$ enters the analogy? ($P^{(c)}$ is the probability of the candidate point being considered and $P^{(o)}$ is the probability of the previous point in the chain.)
First let's emphasize that we can know $\frac{P^{(c)}}{P^{(o)}}$ without knowing $P^{(c)}$ and $P^{(o)}$ individually. (After all, if we knew the distribution $P$, we might not need a fancy algorithm to sample from it.) In the frogs-on-lily-pads scenario, we are given a target distribution for the frogs, but we can only look at it locally and see the ratio of frogs on a pad to frogs on pads accessible from it. The number of frogs on a pad doesn't translate to a known probability because we don't know the total number of frogs in the pond. In actual uses of the Metropolis algorithm, we know some function that is proportional to the probability density, but integrating it over all possible values is intractable, so we can't figure out what factor to divide by to normalize the function into a probability density.

I don't know if we can make a "perfect" explanation of the acceptance ratio purely from intuition. Let's do the best we can. Let's take it for granted that we are on pad $p_o$ and we must make a yes-or-no decision about whether to jump to pad $p_c$. All we know about the target distribution is a ratio of probabilities.

If the probability of a frog being on pad $p_c$ is, for example, 5 times as great as the probability of a frog being on pad $p_o$, then it makes sense for us (as a frog) to jump to pad $p_c$. The decision to jump doesn't directly implement the factor of 5, but given that we must make a yes-or-no decision about jumping, it's the only good choice.

Suppose we know that the probability of a frog being on pad $p_c$ is only 1/3 the probability of a frog being on pad $p_o$. We can implement the ratio of 1/3 by setting the probability of jumping from $p_o$ to $p_c$ to be 1/3 and then making a random draw to determine whether to jump. (We didn't have such a choice in the previous situation, when the ratio was 5, since we aren't allowing ourselves to transform from one frog on pad $p_o$ into 5 frogs on pad $p_c$.)
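The whole scheme can be sketched in a few lines of Python (the pad counts, the neighbour-hopping proposal, and the chain length here are all made up for illustration). The point is that the sampler only ever uses ratios of counts, never the total, yet the long-run visit frequencies reproduce the normalized distribution:

```python
import random

random.seed(0)

# Hypothetical frog counts on 4 pads -- an *unnormalized* target distribution.
counts = [10, 30, 5, 15]  # the sampler never needs sum(counts)

def propose(i):
    # Symmetric proposal: hop to a uniformly chosen neighbouring pad (wrapping around).
    return (i + random.choice([-1, 1])) % len(counts)

# Metropolis: accept the candidate with probability min(counts[c]/counts[o], 1).
visits = [0] * len(counts)
pad = 0
for _ in range(200_000):
    cand = propose(pad)
    if random.random() < min(counts[cand] / counts[pad], 1.0):
        pad = cand
    visits[pad] += 1

freq = [v / sum(visits) for v in visits]
print(freq)  # close to [10, 30, 5, 15] / 60, i.e. roughly [0.167, 0.5, 0.083, 0.25]
```

Staying put when a jump is rejected still counts as a visit to the current pad; that "hold" is what lets the yes-or-no decisions implement the fractional ratio.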
 The accept-reject factor is chosen to satisfy "local balance" so that the Markov chain is reversible, i.e. the number of frogs jumping from pad 1 to pad 2 matches the number of frogs jumping from pad 2 to pad 1 (which is a slightly stronger condition than the existence of an equilibrium). The off-diagonal terms of the transition matrix look like a(1,2)*q(1,2), where a is the (as yet unknown) acceptance factor and q is the jump probability, so the local balance requirement says p(1)*a(1,2)*q(1,2) = p(2)*a(2,1)*q(2,1). Metropolis assumes q is symmetric, so if p(1) < p(2) we can take a(1,2) = 1 and a(2,1) = p(1)/p(2); in general, a(1,2) = min(p(2)/p(1), 1), which is exactly the factor $\alpha$ above.
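As a small numerical sanity check (with made-up weights and a uniform symmetric proposal, not anything from the thread), the Metropolis acceptance factor makes local balance hold exactly for every pair of states:

```python
# Verify local (detailed) balance p(i)*T(i,j) == p(j)*T(j,i) for the
# Metropolis rule, using a made-up unnormalized target p and a symmetric q.
p = [2.0, 6.0, 1.0]   # unnormalized target weights
q = 1.0 / 3.0          # symmetric proposal: each state proposed with prob 1/3

def T(i, j):
    # Off-diagonal transition probability: proposal times acceptance.
    if i == j:
        raise ValueError("off-diagonal entries only")
    return q * min(p[j] / p[i], 1.0)

for i in range(3):
    for j in range(3):
        if i != j:
            # p(i)*q*min(p(j)/p(i), 1) = q*min(p(j), p(i)) is symmetric in i, j.
            assert abs(p[i] * T(i, j) - p[j] * T(j, i)) < 1e-12
print("local balance holds for every pair")
```

Summing the balance identity over $i$ then shows $\sum_i p(i)\,T(i,j) = p(j)$, i.e. $p$ is stationary, which connects back to $XP=X$.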
 Thanks for the detailed explanation both of you. However, from what I am gathering it seems that the algorithm should give us data for how much more likely it is to be at a point as compared to a "reference" point. But using the algorithm computationally on a toy problem reproduces the probability distribution correctly (i.e. the data points lie along the curve of the distribution). How is that the case?

 Quote by Pi-Bond Thanks for the detailed explanation both of you. However, from what I am gathering it seems that the algorithm should give us data for how much more likely it is to be at a point as compared to a "reference" point. But using the algorithm computationally on a toy problem reproduces the probability distribution correctly (i.e. the data points lie along the curve of the distribution). How is that the case?
No, the algorithm simply allows us to draw from distributions where we don't know the normalising constant. Post #12 kind of explains why - the key is in understanding local balance (aka detailed balance) and how it relates to the stationary distribution of a Markov chain.

A neat example to try is sampling points uniformly distributed along the arc of an ellipse, using a spreadsheet and never needing to calculate an elliptic integral.
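In case a concrete version of that example helps, here is one way to do it in Python rather than a spreadsheet (the semi-axes and step size are made up). For the parametrization $(a\cos t, b\sin t)$, arc-length-uniform points have angle density proportional to the local speed $\sqrt{a^2\sin^2 t + b^2\cos^2 t}$; its normalizing constant is the perimeter (an elliptic integral), but the Metropolis ratio never requires it:

```python
import math
import random

random.seed(1)

a, b = 3.0, 1.0  # made-up semi-axes

def speed(t):
    # |d/dt (a cos t, b sin t)| -- the *unnormalized* density for
    # arc-length-uniform sampling of the angle t.
    return math.hypot(a * math.sin(t), b * math.cos(t))

# Metropolis random walk in the angle t. Normalizing speed(t) would need an
# elliptic integral (the perimeter), but the ratio speed(cand)/speed(t) doesn't.
t = 0.0
samples = []
for _ in range(50_000):
    cand = (t + random.uniform(-0.5, 0.5)) % (2 * math.pi)
    if random.random() < min(speed(cand) / speed(t), 1.0):
        t = cand
    samples.append((a * math.cos(t), b * math.sin(t)))

print(samples[:3])  # (x, y) points approximately uniform along the arc
```

Plotting the samples should show points spread evenly along the ellipse, rather than bunched near the flat ends the way a uniform draw of $t$ would produce.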
 Can you tell me why the balance equation being satisfied implies a stationary distribution exists for a chain?