## Metropolis Algorithm - Why does it work?

I'm doing a project based on the Metropolis algorithm for exploring parameter spaces. The computational aspect I can understand and implement, and I understand that a Markov chain is constructed in the parameter space. However, I don't understand the theory behind why it works so well at exploring it. Being a physics student, most of the statistical literature I have read has not helped me. Can anyone explain it to me, or point me to basic references?

Thanks.

 Quote by Pi-Bond I understand that a Markov chain is constructed in the parameter space. However, I don't understand the theory behind why it works so well in exploring it.
Do you have an understanding of the "stationary distribution" of a Markov Chain? That would be the place to start.
 If I'm not mistaken, the stationary distribution X with transition matrix P is given by $$XP=X$$ Is that correct?
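For what it's worth, here is a quick numerical illustration of that equation in Python, using a made-up 2-state chain and the row-vector convention $XP=X$ from the post above. Iterating $X \leftarrow XP$ from any starting distribution converges to the stationary one:

```python
# Hypothetical 2-state chain: row-stochastic transition matrix P,
# row vector X, convention X P = X.
P = [[0.9, 0.1],
     [0.5, 0.5]]

# Iterate X <- X P from an arbitrary starting distribution.
X = [1.0, 0.0]
for _ in range(200):
    X = [X[0] * P[0][0] + X[1] * P[1][0],
         X[0] * P[0][1] + X[1] * P[1][1]]

print(X)  # converges to the stationary distribution
# Solving X P = X with X[0] + X[1] = 1 by hand gives X = (5/6, 1/6).
```

The iterate converges because the second eigenvalue of this $P$ has modulus less than 1; the stationary distribution is the eigenvector of eigenvalue 1.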

Correct. I'll let Stephen take you further down this road.

You also implied that you would like some physical analog to help you understand this. A good physical analog lies in the annealing of ferrous metals. Simulated annealing emulates this. The Metropolis-Hastings algorithm is at the heart of simulated annealing. Simulated annealing is essentially Markov Chain Monte Carlo by another name.

 Quote by Pi-Bond If I'm not mistaken, the stationary distribution X with transition matrix P is given by $$XP=X$$ Is that correct?
Yes. (My preference is to multiply a column vector on the left by the matrix, so I would say $TX = X$ where $T$ is the transition matrix.)

It may be a stretch to apply this metaphor to the Metropolis algorithm, but let's try. In that situation, there is some unknown probability distribution of frogs on pads. This doesn't involve letting them jump; perhaps someone has created the distribution by placing them on the pads. Furthermore, the pond is too large for you to see all the lily pads at once. Your job is to define a set of jump probabilities for each pad so that when the frogs are allowed to jump according to that Markov process, their jumping will approximately preserve the original distribution of the frogs (i.e. the original distribution of the frogs will be the stationary distribution of the process). You can count the number of frogs on a given pad $P_a$, and you can count the number of frogs on the surrounding pads $P_b, P_c, \ldots$ to which the frogs may jump. But you don't know the total number of frogs in the whole pond. How will you determine a set of jump probabilities involving pad $P_a$?

It may or may not be intuitive that you can determine the probabilities by computations using ratios of frogs, such as (number of frogs on pad $P_b$) / (number of frogs on pad $P_a$), without knowing the original distribution of frogs (e.g. not knowing (number of frogs on pad $P_b$) / (total number of frogs in the pond)). If this is intuitive to you then, since you're doing physics, it may be a sufficient explanation! If not, we must do some algebra.
 Your explanation makes some sense to me, but I think I need more time to let it sink in. At the same time, I'm looking to understand it more from a mathematical point of view. I'm not sure where the proposal pdf enters the discussion here. Are you saying that the equilibrium distribution of the frogs is equivalent to the distribution we wanted to explore?

 Quote by Pi-Bond Are you saying that the equilibrium distribution of the frogs is equivalent to the distribution we wanted to explore?
 I'm having some trouble thinking about this analogy in terms of a 1D problem. At each step we are just jumping to another point based on a ratio. In this analogy, is there just one frog?
 I prefer to think of a horde of frogs, each acting independently. You can think of the process as being executed by one frog, but then you have to get empirical evidence about the probability of jumping from pad $P_a$ to pad $P_b$ by sitting and waiting for him to land on pad $P_a$ and then observing what fraction of his total jumps from pad $P_a$ are to pad $P_b$. I don't like to wait.
 Ok, this is starting to make more sense now. Can you tell me how the factor to accept or reject a jump $$\alpha=\min \left( \frac{P^{(c)}}{P^{(o)}},1 \right)$$ enters the analogy? ($P^{(c)}$ is the probability of the candidate point being considered and $P^{(o)}$ is the probability of the previous point in the chain.)

 Quote by Pi-Bond Ok, this is starting to make more sense now. Can you tell me how the factor to accept or reject a jump $$\alpha=\min \left( \frac{P^{(c)}}{P^{(o)}},1 \right)$$ enters the analogy? ($P^{(c)}$ is the probability of the candidate point being considered and $P^{(o)}$ is the probability of the previous point in the chain.)
First let's emphasize that we can know $\frac{P^{(c)}}{P^{(o)}}$ without knowing $P^{(c)}$ and $P^{(o)}$ individually. (After all, if we knew the distribution $P$, we might not need a fancy algorithm to sample from it.) In the frogs-on-lily-pads scenario, we are given a target distribution for the frogs, but we can only look at it locally and see the ratio of frogs on a pad to frogs on pads accessible from it. The number of frogs on a pad doesn't translate to a known probability because we don't know the total number of frogs in the pond. In actual uses of the Metropolis algorithm, we know some function that is proportional to the probability density, but integrating it over all possible values is intractable, so we can't figure out what factor to divide by to normalize the function into a probability density.

I don't know if we can make a "perfect" explanation of the acceptance ratio purely from intuition. Let's do the best we can. Let's take it for granted that we are on pad $p_o$ and we must make a yes-or-no decision about whether to jump to pad $p_c$. All we know about the target distribution is a ratio of probabilities.

If the probability of a frog being on pad $p_c$ is, for example, 5 times as great as the probability of a frog being on pad $p_o$, then it makes sense for us (as a frog) to jump to pad $p_c$. The decision to jump doesn't directly implement the factor of 5, but given that we must make a yes-or-no decision about jumping, it's the only good choice.

Suppose we know that the probability of a frog being on pad $p_c$ is only 1/3 the probability of a frog being on pad $p_o$. We can implement the ratio of 1/3 by setting the probability of jumping from $p_o$ to $p_c$ to be 1/3 and then making a random draw to determine whether to jump. (We didn't have such a choice in the previous situation, when the ratio was 5, since we aren't allowing ourselves to transform from one frog on pad $p_o$ into 5 frogs on pad $p_c$.)
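The whole scheme can be sketched in a few lines of Python (the pad counts, the neighbour-hopping proposal, and the chain length here are all made up for illustration). The point is that the sampler only ever uses ratios of counts, never the total, yet the long-run visit frequencies reproduce the normalized distribution:

```python
import random

random.seed(0)

# Hypothetical frog counts on 4 pads -- an *unnormalized* target distribution.
counts = [10, 30, 5, 15]  # the sampler never needs sum(counts)

def propose(i):
    # Symmetric proposal: hop to a uniformly chosen neighbouring pad (wrapping around).
    return (i + random.choice([-1, 1])) % len(counts)

# Metropolis: accept the candidate with probability min(counts[c]/counts[o], 1).
visits = [0] * len(counts)
pad = 0
for _ in range(200_000):
    cand = propose(pad)
    if random.random() < min(counts[cand] / counts[pad], 1.0):
        pad = cand
    visits[pad] += 1

freq = [v / sum(visits) for v in visits]
print(freq)  # close to [10, 30, 5, 15] / 60, i.e. roughly [0.167, 0.5, 0.083, 0.25]
```

Staying put when a jump is rejected still counts as a visit to the current pad; that "hold" is what lets the yes-or-no decisions implement the fractional ratio.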
 The accept-reject factor is chosen to satisfy "local balance" so that the Markov chain is reversible, i.e. the number of frogs jumping from pad 1 to pad 2 matches the number of frogs jumping from pad 2 to pad 1 (which is a slightly stronger condition than the existence of an equilibrium). The off-diagonal terms of the transition matrix look like a(1,2)*q(1,2), where a is the (as yet unknown) acceptance factor and q is the jump probability, so the local balance requirement says p(1)*a(1,2)*q(1,2) = p(2)*a(2,1)*q(2,1). Metropolis assumes q is symmetric, so if p(1) < p(2) we can take a(1,2) = 1 and a(2,1) = p(1)/p(2); in general, a(1,2) = min(p(2)/p(1), 1), which is exactly the factor $\alpha$ above.
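As a small numerical sanity check (with made-up weights and a uniform symmetric proposal, not anything from the thread), the Metropolis acceptance factor makes local balance hold exactly for every pair of states:

```python
# Verify local (detailed) balance p(i)*T(i,j) == p(j)*T(j,i) for the
# Metropolis rule, using a made-up unnormalized target p and a symmetric q.
p = [2.0, 6.0, 1.0]   # unnormalized target weights
q = 1.0 / 3.0          # symmetric proposal: each state proposed with prob 1/3

def T(i, j):
    # Off-diagonal transition probability: proposal times acceptance.
    if i == j:
        raise ValueError("off-diagonal entries only")
    return q * min(p[j] / p[i], 1.0)

for i in range(3):
    for j in range(3):
        if i != j:
            # p(i)*q*min(p(j)/p(i), 1) = q*min(p(j), p(i)) is symmetric in i, j.
            assert abs(p[i] * T(i, j) - p[j] * T(j, i)) < 1e-12
print("local balance holds for every pair")
```

Summing the balance identity over $i$ then shows $\sum_i p(i)\,T(i,j) = p(j)$, i.e. $p$ is stationary, which connects back to $XP=X$.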
 Thanks for the detailed explanation both of you. However, from what I am gathering it seems that the algorithm should give us data for how much more likely it is to be at a point as compared to a "reference" point. But using the algorithm computationally on a toy problem reproduces the probability distribution correctly (i.e. the data points lie along the curve of the distribution). How is that the case?

 Quote by Pi-Bond Thanks for the detailed explanation both of you. However, from what I am gathering it seems that the algorithm should give us data for how much more likely it is to be at a point as compared to a "reference" point. But using the algorithm computationally on a toy problem reproduces the probability distribution correctly (i.e. the data points lie along the curve of the distribution). How is that the case?
No, the algorithm simply allows us to draw from distributions where we don't know the normalising constant. Post #12 kind of explains why - the key is in understanding local balance (aka detailed balance) and how it relates to the stationary distribution of a Markov chain.

A neat example to try is sampling points uniformly distributed along the arc of an ellipse, using a spreadsheet and never needing to calculate an elliptic integral.
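In case a concrete version of that example helps, here is one way to do it in Python rather than a spreadsheet (the semi-axes and step size are made up). For the parametrization $(a\cos t, b\sin t)$, arc-length-uniform points have angle density proportional to the local speed $\sqrt{a^2\sin^2 t + b^2\cos^2 t}$; its normalizing constant is the perimeter (an elliptic integral), but the Metropolis ratio never requires it:

```python
import math
import random

random.seed(1)

a, b = 3.0, 1.0  # made-up semi-axes

def speed(t):
    # |d/dt (a cos t, b sin t)| -- the *unnormalized* density for
    # arc-length-uniform sampling of the angle t.
    return math.hypot(a * math.sin(t), b * math.cos(t))

# Metropolis random walk in the angle t. Normalizing speed(t) would need an
# elliptic integral (the perimeter), but the ratio speed(cand)/speed(t) doesn't.
t = 0.0
samples = []
for _ in range(50_000):
    cand = (t + random.uniform(-0.5, 0.5)) % (2 * math.pi)
    if random.random() < min(speed(cand) / speed(t), 1.0):
        t = cand
    samples.append((a * math.cos(t), b * math.sin(t)))

print(samples[:3])  # (x, y) points approximately uniform along the arc
```

Plotting the samples should show points spread evenly along the ellipse, rather than bunched near the flat ends the way a uniform draw of $t$ would produce.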
 Can you tell me why the balance equation being satisfied implies a stationary distribution exists for a chain?