User:IssaRice/Metropolis–Hastings algorithm

Revision as of 22:14, 14 June 2021

without exception, every single explanation i have seen so far of this absolutely sucks. like, not just "most really suck, and some suck a little". literally everything just sucks really bad. this might be my best guess for the most horribly-explained thing ever.

in my opinion, the things a good explanation must cover are:

  • what the heck is sampling, even? once we have a fair coin, use it to generate samples from:
    • arbitrary biased coin
    • a discrete uniform distribution over 1,...,n
    • a continuous uniform(0,1) distribution
    • use a continuous uniform to sample from an arbitrary distribution using inverse transform sampling (see the first sketch after this list)
    • bonus: go from a biased coin (with unknown bias) to a fair coin (von neumann's trick; also in that sketch)
  • why doesn't inverse transform sampling work in situations where we have to use metropolis-hastings? another phrasing: why do we need metropolis-hastings if there is inverse transform sampling?
    • like, what the heck does it mean to "only" have access to some constant multiple of the pdf? why would we ever get into such a situation, and why can't we just normalize and then numerically approximate the cdf, and then get the inverse to do inverse transform sampling??? literally NONE of the explanations even RAISE this question. why????
  • an actually convincing example of MCMC. the stuff i've seen so far is so boring i just don't even care whether we can sample from it.
    • a toy example i like is sampling uniformly from 1,...,6 using coin flips. it's easy to see in this case that you can do a thing where you move left on heads and move right on tails (with wrap-around, of course), and that after a while you're spending 1/6 of your time on each number.
    • the next level up is the example from these slides where 1,...,5 have probability mass 1/10 each, and 6 has mass 1/2. now you can't just flip and move to adjacent numbers; you somehow have to distinguish the 6 as being more likely than the other numbers. how do you do it? (see the second sketch after this list)
  • where the heck does the accept/reject rule come from? why this division thing to get the threshold?
  • why do we need a transition/proposal matrix, can this matrix be literally anything, and why do we care if it's symmetric?
  • how did anyone ever come up with the idea of using a markov chain, whose stationary distribution is the distribution we're trying to sample from? like, where did the markov chain idea even come from?
  • how could anyone have come up with detailed balance as a sufficient condition for existence of a stationary distribution?
  • how did anyone even think of the proposal-acceptance decomposition?
  • given the ratio condition for the acceptance function (see https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm#Formal_derivation where A(x',x)/A(x,x') is written as a fraction), why did people choose the particular form for A(x',x)? clearly lots of acceptance functions could have worked, and the particular choice seems unmotivated (although it is particularly simple-looking!). (see the note after this list)
  • big idea: https://stackoverflow.com/a/16826118/3422337 -- instead of sampling from 1,...,n by flipping k coins to pick one of 2^k outcomes (in which case we might need to start over unless n is a power of 2), can we somehow "make every coin flip count"? randomly move to your neighbor with probability 1/2 by using the coin flip. it seems intuitive that "in the long run" you'll spend about an equal amount of time in each state.
    • this case is simplified in multiple ways: the proposal matrix is symmetric, the acceptance probabilities are all equal to 1 (i.e. always accept), and the distribution we're trying to sample from is uniform. so everything nicely cancels out.
  • if a move is rejected, does it matter whether you re-count the current state as a sampled state?
    • it seems like it does matter that you re-count it, but why does it matter? (the die sketch after this list does re-count rejected moves; those repeats are what put the extra mass on 6)
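
a minimal sketch of the fair-coin-to-everything chain above (the function names are mine, not from any of these sources): k fair flips give a uniform over 0,...,2^k-1, dividing by 2^k gives an approximately uniform(0,1) value, and inverse transform sampling pushes that through the inverse cdf (exponential here, since its cdf inverts in closed form). the last function is von neumann's trick for the bonus item.

    import math
    import random

    def fair_coin():
        """stand-in primitive: one fair coin flip."""
        return random.random() < 0.5

    def uniform_01(k=32):
        """k fair flips -> uniform over {0,...,2^k-1} -> approx uniform(0,1)."""
        n = 0
        for _ in range(k):
            n = 2 * n + fair_coin()
        return n / 2 ** k

    def biased_coin(p):
        """biased coin from the fair one: heads iff the uniform lands below p."""
        return uniform_01() < p

    def uniform_1_to_n(n):
        """discrete uniform over 1..n by rejection: flip enough coins for a
        number in 0..2^k-1, and start over if it falls outside 0..n-1."""
        k = max(1, (n - 1).bit_length())
        while True:
            x = 0
            for _ in range(k):
                x = 2 * x + fair_coin()
            if x < n:
                return x + 1

    def sample_exponential(lam=1.0):
        """inverse transform sampling: X = F^{-1}(U) where F is the
        exponential(lam) cdf, so F^{-1}(u) = -ln(1-u)/lam."""
        return -math.log(1.0 - uniform_01()) / lam

    def fair_from_biased(flip):
        """von neumann's trick (the bonus item): flip twice; HT -> heads,
        TH -> tails, HH/TT -> try again. works for any unknown bias,
        e.g. fair_from_biased(lambda: biased_coin(0.9))."""
        while True:
            a, b = flip(), flip()
            if a != b:
                return a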
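
and a sketch of the weighted-die example (my code; the target is the slides' 1/10 on each of 1,...,5 and 1/2 on 6). the proposal is the symmetric flip-a-coin-and-move-to-a-neighbor walk, so the acceptance ratio reduces to pi(y)/pi(x). with a uniform target the ratio is always 1 and the accept step disappears, leaving exactly the move-left-on-heads walk from the stackoverflow answer above.

    import random
    from collections import Counter

    # target: faces 1..5 have mass 1/10 each, face 6 has mass 1/2.
    # only ratios of these numbers are ever used, so an unnormalized
    # version (everything scaled by any constant) would work just as well.
    target = {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}

    def mh_die(steps=200_000):
        counts = Counter()
        x = 1
        for _ in range(steps):
            # symmetric proposal: step to a neighbor on the cycle 1..6
            y = (x % 6) + 1 if random.random() < 0.5 else ((x - 2) % 6) + 1
            # metropolis accept/reject: always accept moves to a more
            # probable face; accept the rest with probability pi(y)/pi(x)
            if random.random() < min(1.0, target[y] / target[x]):
                x = y
            # the current state is re-counted on rejection; those repeats
            # are exactly what puts the extra mass on 6
            counts[x] += 1
        return {i: counts[i] / steps for i in range(1, 7)}

    print(mh_die())  # roughly {1: 0.1, ..., 5: 0.1, 6: 0.5}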
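
on the acceptance-function question above, one partial answer (not from any of the sources here): the ratio condition pins down only the quotient A(x',x)/A(x,x'), so other choices do work, e.g. barker's 1965 rule

    A(x', x) = \frac{\pi(x')\, q(x \mid x')}{\pi(x)\, q(x' \mid x) + \pi(x')\, q(x \mid x')}

but peskun (1973) showed that among the valid acceptance functions, the min{1, ...} form gives estimators with the smallest asymptotic variance (it accepts as often as detailed balance allows). so the choice is less arbitrary than it looks.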


"""Our previous methods tend not to work in complex situations:Inverse CDF may not be available.Conditionals needed for ancestral/Gibbs sampling may be hard to compute.Rejection sampling tends to reject almost all samples.Importance sampling tends gives almost zero weight to all samples.""" https://www.cs.ubc.ca/~schmidtm/Courses/540-W17/L22.pdf -- why wouldn't inverse cdf not be available?

i like this: http://phylo.bio.ku.edu/slides/BayesianMCMC-2013.pdf "We’d like an MCMC simulation that converges quickly so we should set the transition probabilities as high as possible. So, using m0,1 = 1 and m1,0 = 1/3 sounds best." -- omg!!!! somehow no other resource has ever explained this idea.
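
a quick numerical check of that idea (assuming the target in those slides is pi(0) = 1/4, pi(1) = 3/4, which is what detailed balance gives for m0,1/m1,0 = 3): any pair of transition probabilities in a 3:1 ratio has the same stationary distribution, but scaling both up shrinks the second eigenvalue of the transition matrix, which controls how fast the chain converges.

    import numpy as np

    def two_state_chain(m01, m10):
        """transition matrix with the given probabilities of
        leaving state 0 and leaving state 1."""
        return np.array([[1 - m01, m01],
                         [m10, 1 - m10]])

    for m01, m10 in [(1.0, 1/3), (0.3, 0.1), (0.03, 0.01)]:
        P = two_state_chain(m01, m10)
        # stationary distribution: left eigenvector for eigenvalue 1
        evals, evecs = np.linalg.eig(P.T)
        pi = np.real(evecs[:, np.argmax(np.real(evals))])
        pi /= pi.sum()
        # the second-largest |eigenvalue| governs the convergence rate
        slem = sorted(np.abs(evals))[-2]
        print(f"m01={m01:.2f} m10={m10:.2f}  pi={np.round(pi, 3)}  |lambda_2|={slem:.2f}")
    # all three rows show pi = [0.25, 0.75]; only |lambda_2| changes
    # (1/3, 0.60, 0.96), so the most aggressive chain mixes fastest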


some intuitions from https://www.cise.ufl.edu/class/cap6617fa17/Readings/ChibGreenberg.pdf

  • usually in markov chain settings, we are trying to go from the transition probabilities to the stationary distribution: e.g. we know that if it is sunny then it will be rainy with probability 0.1 and sunny with probability 0.9, and if it is rainy then it will be sunny with probability 0.3 and rainy with probability 0.7, or whatever. and the goal is to find out the stationary distribution, that is, if we randomly wake up one day (without knowing any previous day's weather), what is the probability that it is raining? But in the metropolis-hastings setting, we are doing the reverse: we roughly (up to multiplicative constant) know what the stationary distribution is that we are trying to sample from. The goal is instead to pick the transition probabilities so that they give rise to this stationary distribution.
  • we set the acceptance probability to 1 in one of the cases because we aren't visiting that state enough (since detailed balance is violated), so we want to visit it as much as possible, and so the highest value we can choose is 1.
  • the min{1, ...} crap is just a shorthand to avoid having to split into cases depending on whether we visit x or y more.
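
writing that case split out (in chib-greenberg's notation, with q the proposal and alpha the acceptance probability), detailed balance asks for

    \pi(x)\, q(x, y)\, \alpha(x, y) = \pi(y)\, q(y, x)\, \alpha(y, x)

if \pi(x) q(x, y) > \pi(y) q(y, x), the chain moves from x to y too often, so set \alpha(y, x) to its maximum of 1 and solve for the other:

    \alpha(x, y) = \frac{\pi(y)\, q(y, x)}{\pi(x)\, q(x, y)}, \qquad \alpha(y, x) = 1

in the opposite case the roles swap, and the min covers both cases in one formula:

    \alpha(x, y) = \min\left\{ 1, \frac{\pi(y)\, q(y, x)}{\pi(x)\, q(x, y)} \right\}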


read this page later: https://www.statlect.com/fundamentals-of-statistics/Metropolis-Hastings-algorithm

http://www.stats.ox.ac.uk/~nicholls/MScMCMC15/L5MScMCMC15.pdf

read this later: https://joa.sh/posts/2016-08-21-metropolis.html