# Metropolis Algorithm - Why does it work?

1. Nov 26, 2012

### Pi-Bond

I'm doing a project based around the Metropolis algorithm to explore parameter spaces.The computational aspect I can understand and implement. I understand that a Markov Chain is constructed in the parameter space. However, I don't understand the theory behind why it works so well in exploring it. Being a Physics student, most of the statistical literature I have read has not helped me. Can anyone explain it to me, or point to basic references?

Thanks.

2. Nov 28, 2012

### Stephen Tashi

Do you have an understanding of the "stationary distribution" of a Markov Chain? That would be the place to start.

3. Nov 28, 2012

### Pi-Bond

If I'm not mistaken, the stationary distribution X with transition matrix P is given by

$$XP=X$$

Is that correct?

4. Nov 28, 2012

### D H

Staff Emeritus
Correct. I'll let Stephen take you further down this road.

You also implied that you would like some physical analog to help you understand this. A good physical analog lies in the annealing of ferrous metals. Simulated annealing emulates this. The Metropolis-Hastings algorithm is at the heart of simulated annealing. Simulated annealing is essentially Markov Chain Monte Carlo by another name.

5. Nov 28, 2012

### Stephen Tashi

Yes. ( My preference is to multiply a column vector on the left by the matrix, so I would say $TX = X$ where $T$ is the transistion matrix.

It may be stretch to apply this metaphor to the Metropolis algorithm, but let's try. In that situation, there is some unknown probability distribution for frogs on pads. This doesn't involve letting them jump. Perhaps someone has created this distribution by placing them on the pads. Furthermore the pond is too large for you to see all the lily pads at once. Your job is to define a set of jump probabilities for each pad so that when the frogs are allowed to jump according to that Markov process, their distribution will approximately preserve the original distribution of the frogs. (i.e. the original distribution of the frogs will be the stationary distribution of the process.) You can count the number of frogs on a given pad P_a and you can count the number of frogs on the pads P_b,P_c,...around it, to which the frogs may jump. But you don't know the total number of frogs in the whole pond. How will you determine a set of jump probabilities involving pad P_a?

It may or may be intutitive that you can determine the probabilities by computations using ratios of frogs , such as (number of frogs on pad P_b)/ (number of frogs on pad P_a) without knowing the original distribution of frogs (e.g. not knowing (number of frogs on pad P_b)/ (total number of frogs in the pond) ). If this is intuitive to you then, since you're doing physics, it may be sufficient explanation! If not, we must do some algebra.

6. Nov 28, 2012

### Pi-Bond

Your explanation makes some sense to me, but I think I need more time for it to let it sink. But at the same time I'm looking to understand it more from a mathematical point of view. I'm not sure where the proposal pdf enters the discussion here. Are you saying that the equilibrium distribution of the frogs is equivalent to the distribution we were wanting to explore?

7. Nov 28, 2012

### Stephen Tashi

8. Nov 29, 2012

### Pi-Bond

I'm having some trouble thinking to this analogy in terms of a 1D problem. At each step we are just jumping to another point based on a ratio. In this analogy is there just one frog?

9. Nov 29, 2012

### Stephen Tashi

I prefer to think of a hoard of frogs, each acting independently. You can think of the process being executed by one frog. If you think of one frog then you have to think getting empirical evidence about probabilities of jumping from pad P_a to pad P_b by sitting and waiting for him to land on pad P_a and then observing what fraction of his total jumps from pad P_a are to pad P_b. I don't like to wait.

10. Nov 29, 2012

### Pi-Bond

Ok, this is starting to make more sense now. Can you tell me how the factor to accept or reject a jump

$$\alpha=min \left( \frac{P^{(c)}}{P^{(o)}},1 \right)$$

enters the analogy? (Pc is the probability of the candidate point being considered and P0 is the previous point in the chain)

11. Nov 29, 2012

### Stephen Tashi

First let's emphasize that we can know $\frac {P^(c)}{P^(o)}$ without knowing $P^{(c)}$ and $P^{(o)}$ individually. (After all, if we knew the distribution P, we might not need to use a fancy algorithm to sample from it.) In the frogs-on-lilly-pads scenario, we are given a target distribution for the frogs, but we can only look at it locally and see the ratio of frogs on a pad to frogs on pads accessible from it. The number of frogs on a pad doesn't translate to a know probablity because we don't know the total number of frogs in the pond. In the actual uses of the Metropolis algorithm, we know know some function that is proportional to the probability density, but doing the integration over all its possible values is intractable. So we can't figure out what factor to divide things by to normalize the function to a probability density.

I don't know if we can make a "perfect" explanation of the acceptance ratio purely from intuition. Let's do the best we can. Let's take it for granted that we are on pad $p_o$ and we must make a yes-or-no decision about whether to jump to pad $p_c$. All we know about the target distribution is a ratio of probabilities.

If the probability of a frog being on pad $p_c$ is, for example, 5 times as great as the probability of a frog being on pad $p_o$ then it makes sense for us (as a frog) to jump to pad $p_c$. This decision to jump doesn't directly implement the factor of 5. But given we must make a yes-or-no decision about jumping, its the only good choice.

Suppose we know that the probability of a frog being on pad $p_c$ on only 1/3 the probability of a frog being on pad $p_c$. We can attempt to implement the ratio of 1/3 by setting the probability of jumping from $p_o$ to pad $p_c$ to be 1/3 and then make a random draw to determine whether to jump. (We didn't have such a choice in the previous situation when the ratio was 5 since we aren't allowing ourselves to transform from one frog on pad $p_o$ into 5 frogs on pad $p_c$.)

12. Dec 1, 2012

### bpet

The accept-reject factor would have been chosen to satisfy "local balance" so that the markov chain is reversible, i.e. that the number of frogs jumping from pad 1 to pad 2 matches the number of frogs jumping from pad 2 to pad 1 (which is a slightly stronger condition than existence of an equilibrium).

The off-diagonal terms of the transition matrix will look like a(1,2)*q(1,2) where a is the (as yet unknown) acceptance factor and q is the jump probability, so the local balance requirement says p(1)*a(1,2)*q(1,2) = p(2)*a(2,1)*q(2,1).

Metropolis assumes q is symmetric, so if p(1)<p(2) we can set a(1,2)=1 and a(2,1)=p(2)/p(1).

HTH

13. Dec 1, 2012

### Pi-Bond

Thanks for the detailed explanation both of you. However, from what I am gathering it seems that the algorithm should give us data for how much more likely it is to be at a point as compared to a "reference" point. But using the algorithm computationally on a toy problem reproduces the probability distribution correctly (i.e. the data points lie along the curve of the distribution). How is that the case?

14. Dec 1, 2012

### bpet

No, the algorithm simply allows us to draw from distributions where we don't know the normalising constant. Post #12 kind of explains why (modulo error: p(2)/p(1) should be p(1)/p(2)) - the key is in understanding local balance (aka detailed balance) and how it relates to the stationary distribution of a Markov chain.

A neat example to try is points uniformly distributed on the arc of an ellipse, using a spreadsheet and never needing to calculate an elliptic integral.

15. Dec 4, 2012

### Pi-Bond

Can you tell me why the balance equation being satisfied implies a stationary distribution exists for a chain?

16. Dec 4, 2012

### Stephen Tashi

I don't know how far the frogs-on-lily-pads metaphor should be pushed, but it suits my level of thinking better than respectable arguments - hence:

Imagine a hoard of independently acting frogs on the lilly pads. Suppose the current fraction of frogs on pad A is given by the function X(A). We let all the frogs execute the markov process. If the function X has "detailed balance" then , for each pair of pads A,B, the number of frogs that jumps from pad A to B will be about the same as the number of frogs that jump from pad B to pad A, so we will end up with roughly the same distribution of frogs.

Shamelessly confusing the ideas of frequencies and probabilities, let N = the total number of frogs. The number that jump from pad A to pad B is about N X(A) pr(A->B) The number that jump from pad B to pad A is about N X(B) pr(B->A). So we must have X(A) pr(A->B) = X(B) Pr(B->A).

17. Dec 5, 2012

### Pi-Bond

Great, that makes sense!

So is it the accept reject factor I mentioned earlier than ensures that there is a stationary distribution for our chain?

18. Dec 5, 2012

### Stephen Tashi

Yes, it assures there is a particlar stationary distribution.

X(A) Pr(A->B) = X(B) Pr(B->A)

Suppose X(A) < X(B)

The Metropolis algorithm defines Pr(A->B) = 1 and Pr(B->A) = X(A)/X(B)

Those probabilities change above equation into an identity:

X(A) (1) = X(B) ( X(A)/X(B))

19. Dec 6, 2012

### Pi-Bond

Ok, I understand this better now.

However I'm having trouble with the concept of "state" here. For a conventional (discrete time) Markov Chain, we have a row vector which tells us the probability of being in a state. So the size of the row vector is the number of states.

I'm wondering what is the analogous vector for the Markov Chain we construct in the Metropolis Algorithm. Is it an infinite row vector, as there are infinitely many points in the parameter space? Then the transition matrix would also be infinite then.

20. Dec 7, 2012

### bpet

In a way, yes, though infinite vectors can be tricky to compute with and there's a few technical considerations in going to infinite number of states (especially if there is a continuum of lilypads) :)

If you have the time, some courses on measure theory and stochastic processes would certainly be worth your while, but the gist of it is that the same arguments for Metropolis-Hastings apply as long as the jump "density" is defined and non-zero between every pair of distinct lilypads.

HTH