User:IssaRice/Metropolis–Hastings algorithm
without exception, every single explanation i have seen so far of this absolutely sucks. like, not just "most really suck, and some suck a little". literally everything just sucks really bad. this might be my best guess for the most horribly-explained thing ever.
in my opinion, the things a good explanation must cover are:
- what the heck is sampling, even? once we have a fair coin, use that to generate samples for the following (a sketch of all of these appears right after this list):
- arbitrary biased coin
- a discrete uniform distribution over 1,...,n
- a continuous uniform(0,1) distribution
- use a continuous uniform to sample from an arbitrary distribution using inverse transform sampling
- bonus: go from a biased coin (with unknown bias) to a fair coin
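here's a minimal sketch of all of the above, in python (my own code; the function names are made up, and python's random module stands in for the source of fair coin flips):

```python
import math
import random

def fair_coin():
    """One fair coin flip, returning 0 or 1. Everything below is built from this."""
    return random.randrange(2)

def uniform01(bits=53):
    """Approximate a continuous Uniform(0,1) draw from fair coin flips (binary expansion)."""
    return sum(fair_coin() / 2 ** (i + 1) for i in range(bits))

def biased_coin(p):
    """A coin with P(heads) = p: compare a Uniform(0,1) draw to p."""
    return 1 if uniform01() < p else 0

def discrete_uniform(n):
    """Uniform on {1,...,n}: flip ceil(log2 n) coins to get a number in {0,...,2^k - 1},
    and start over if it falls outside the range."""
    k = max(1, math.ceil(math.log2(n)))
    while True:
        x = 0
        for _ in range(k):
            x = 2 * x + fair_coin()
        if x < n:
            return x + 1

def inverse_transform(inverse_cdf):
    """Sample from an arbitrary distribution given its inverse CDF."""
    return inverse_cdf(uniform01())

def von_neumann_fair_coin(biased):
    """Bonus: a fair coin from a biased coin with unknown bias.
    Flip twice; (1,0) -> 1, (0,1) -> 0, anything else -> try again."""
    while True:
        a, b = biased(), biased()
        if a != b:
            return a

# example: Exponential(1) via inverse transform, since F^{-1}(u) = -log(1 - u)
x = inverse_transform(lambda u: -math.log(1 - u))
```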
- why doesn't inverse transform sampling work in situations where we have to use metropolis-hastings? another phrasing: why do we need metropolis-hastings if there is inverse transform sampling?
- like, what the heck does it mean to "only" have access to some constant multiple of the pdf? why would we ever get into such a situation, and why can't we just normalize and then numerically approximate the cdf, and then get the inverse to do inverse transform sampling??? literally NONE of the explanations even RAISE this question. why????
- this article http://www.stat.cmu.edu/~brian/463-663/week10/Chapter%2004.pdf says "Although the inversion method [aka inverse transform sampling / universality of the uniform] is very efficient and easy to implement, two key limitations reduce its usability as a general method for drawing samples from posterior densities. First, if the inverse function is impossible to derive analytically, obviously the method cannot be used." but why can't we just approximate it numerically if we can't solve for it analytically? (a sketch of what that would look like in one dimension is below.)
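for what it's worth, here's what "just normalize numerically and invert the cdf on a grid" would look like for a one-dimensional unnormalized density (a sketch using numpy; the density and function names are made up). in one dimension this seems perfectly workable, so presumably the real objection is about posteriors over many parameters, where tabulating the cdf on a grid blows up:

```python
import numpy as np

def grid_inverse_transform(unnormalized_pdf, lo, hi, n_grid=10_000):
    """Make a sampler for a 1-D density known only up to a constant, by
    normalizing and inverting its CDF numerically on a grid."""
    xs = np.linspace(lo, hi, n_grid)
    weights = unnormalized_pdf(xs)
    cdf = np.cumsum(weights)
    cdf = cdf / cdf[-1]  # normalize: the unknown constant cancels here
    def sample(size):
        u = np.random.rand(size)
        return np.interp(u, cdf, xs)  # invert the CDF by linear interpolation
    return sample

# made-up unnormalized density with no closed-form CDF: exp(-x^4 + x^2)
sampler = grid_inverse_transform(lambda x: np.exp(-x**4 + x**2), lo=-3.0, hi=3.0)
draws = sampler(100_000)
```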
- an actually convincing example of MCMC. the stuff i've seen so far is so boring i just don't even care if we can sample from it. Since I'm interested in this mainly from an AI/inference point of view, i'd like to see a good example of this using bayes nets and sampling to compute some query.
- a toy example i like is sampling uniformly from 1,...,6 using coin flips. it's easy to see in this case that you can do a thing where you move left on heads and move right on tails (with wrap-around, of course), and that after a while you're spending 1/6 of your time on each number.
- the next level up is the example from these slides where 1,...,5 have probability mass 1/10 each, and 6 has mass 1/2. now you can't just flip and move to adjacent numbers; you somehow have to distinguish the 6 as being more likely than the other numbers. how do you do it?
- so in this case, you propose by rolling the die uniformly, which makes the proposal symmetric, so you just get the metropolis algorithm. going through the reasoning in the chib-greenberg paper, i do get 1/5 for the acceptance probability from 6 to 1. (a sketch of both this and the previous example is below.)
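here's a sketch of both levels (my own code, not from the slides): the pure coin-flip walk for the uniform target, and the metropolis version for the loaded target with a fair-die proposal. the acceptance probability for a proposed move from 6 to 1 is min(1, (1/10)/(1/2)) = 1/5, matching the number above:

```python
import random
from collections import Counter

def walk_uniform(n_steps=100_000):
    """Level 1: uniform target on {1,...,6}. Move left on heads, right on tails
    (with wrap-around), always accept; count how much time is spent in each state."""
    state, counts = 1, Counter()
    for _ in range(n_steps):
        state = (state - 1 + random.choice([-1, +1])) % 6 + 1
        counts[state] += 1
    return counts

def metropolis_loaded_die(n_steps=100_000):
    """Level 2: target pi = (1/10, ..., 1/10, 1/2). Proposal: roll a fair die
    (symmetric); accept with probability min(1, pi(proposal)/pi(current)).
    From 6 to 1 that acceptance probability is (1/10)/(1/2) = 1/5."""
    pi = {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}
    state, counts = 1, Counter()
    for _ in range(n_steps):
        proposal = random.randrange(1, 7)
        if random.random() < min(1.0, pi[proposal] / pi[state]):
            state = proposal
        counts[state] += 1  # the current state is (re-)counted even on rejection
    return counts

print(walk_uniform())           # roughly 1/6 of the time in each state
print(metropolis_loaded_die())  # roughly 1/10 each for 1..5 and 1/2 for 6
```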
- where the heck does the accept/reject rule come from? why this division thing to get the threshold?
- why do we need a transition/proposal matrix? can this matrix be literally anything, and why do we care if it's symmetric?
- how did anyone ever come up with the idea of using a markov chain, whose stationary distribution is the distribution we're trying to sample from? like, where did the markov chain idea even come from?
- in the original metropolis paper, there was a physically intuitive notion of "state", like the arrangement of molecules or whatever, and there was also an intuitive notion of proposal: pick one molecule and make it move a little bit in some random direction (i don't remember the details). so there was some physical intuition to draw on. but in modern-day bayesian stats the "markov chain" is just the different states the parameter can be in, and the transitions also don't make sense, since you're not really trying to change the state! (you're trying to infer it). (see mark holder's double heads example for what i mean). so there is no longer this physical intuition to accompany the markov stuff.
- how could anyone have come up with detailed balance as a sufficient condition for existence of a stationary distribution?
- my guess for how it happened historically: people were already interested in markov chains for other reasons, and had built up a large body of theory about markov chains. so then when it came time to work on MCMC-type problems, people were already aware that detailed balance was a thing.
- how did anyone even think of the proposal-acceptance decomposition?
- mark holder's notes give one intuition: if you just try to pick transition probabilities that satisfy detailed balance directly, the outgoing probabilities from some state might sum to more than 1, which violates the requirement that the transitions out of each state form a probability distribution.
- here's a question: why can't we just sum up the probabilities going out of a node and then renormalize for each node? I am guessing this won't work, but I don't understand why yet.
- the chib-greenberg paper contains some insight on this: if we just started with some given transition probabilities, we probably wouldn't satisfy detailed balance, i.e. we would have an inequality instead of an equality for some pairs of states. then the thought is: how can we reduce/increase the flow from some states to other states? and that's how you're supposed to come up with acceptance probabilities. (my attempt to write this down as a derivation is below.)
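my attempt at writing that argument down as a derivation (latex sketch; π is the target, q the proposal, α the acceptance probability):

```latex
Suppose that for some pair of states $x, y$ the raw proposal flow is unbalanced, say
\[
  \pi(x)\, q(x, y) \;>\; \pi(y)\, q(y, x),
\]
i.e.\ moves $x \to y$ happen too often relative to detailed balance. Introduce acceptance
probabilities $\alpha(x, y), \alpha(y, x) \le 1$ and demand detailed balance for the
thinned moves:
\[
  \pi(x)\, q(x, y)\, \alpha(x, y) \;=\; \pi(y)\, q(y, x)\, \alpha(y, x).
\]
Since $y \to x$ is the under-used direction, set $\alpha(y, x) = 1$ (as large as possible,
for fast mixing) and solve for the other one:
\[
  \alpha(x, y) \;=\; \frac{\pi(y)\, q(y, x)}{\pi(x)\, q(x, y)} \;<\; 1.
\]
Covering both cases (whichever direction is over-used) in one formula gives
\[
  \alpha(x, y) \;=\; \min\!\left\{ 1,\ \frac{\pi(y)\, q(y, x)}{\pi(x)\, q(x, y)} \right\},
\]
which is the usual metropolis--hastings acceptance probability; the $\min\{1,\dots\}$ is
just the case-split shorthand mentioned further down.
```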
- given the ratio condition for the acceptance function (see https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm#Formal_derivation where A(x',x)/A(x,x') is written as a fraction), why did people choose the particular form for A(x',x)? clearly lots of acceptance functions could have worked, and the particular choice seems unmotivated (although it is particularly simple-looking!).
- big idea: https://stackoverflow.com/a/16826118/3422337 -- instead of sampling from 1,...,n by flipping k coins to pick one of 2^k outcomes (in which case we might need to start over unless n is a power of 2), can we somehow "make every coin flip count"? use each coin flip to move to the left or right neighbor with probability 1/2 each. it seems intuitive that "in the long run" you'll spend about an equal amount of time in each state.
- this case is simplified in multiple ways: the proposal matrix is symmetric, the acceptance probabilities are all equal to 1 (i.e. always accept), and the distribution we're trying to sample from is uniform, so everything nicely cancels out. (a generic metropolis-hastings loop that this specializes is sketched below.)
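to make the "everything nicely cancels out" point concrete, here's a generic metropolis-hastings loop (a sketch, my own code; unnormalized_pi, propose, and proposal_density are placeholder names). with a symmetric proposal the q-ratio cancels, and with a uniform target the π-ratio cancels too, so the acceptance probability is identically 1 and you're back to the plain coin-flip walk:

```python
import random

def metropolis_hastings(unnormalized_pi, propose, proposal_density, x0, n_steps):
    """Generic Metropolis-Hastings loop (sketch).
    unnormalized_pi(x): target density up to a constant (the constant cancels in the ratio).
    propose(x): draw a proposed state y given the current state x.
    proposal_density(x, y): q(x, y), the density of proposing y from x."""
    samples, x = [], x0
    for _ in range(n_steps):
        y = propose(x)
        ratio = (unnormalized_pi(y) * proposal_density(y, x)) / \
                (unnormalized_pi(x) * proposal_density(x, y))
        if random.random() < min(1.0, ratio):
            x = y
        samples.append(x)  # on rejection the current state gets counted again
    return samples

# the simplified case from the bullet above: symmetric proposal (q cancels) and a
# uniform target on {1,...,n} (pi cancels), so every proposal is accepted.
n = 6
samples = metropolis_hastings(
    unnormalized_pi=lambda x: 1.0,                                # uniform, up to a constant
    propose=lambda x: (x - 1 + random.choice([-1, 1])) % n + 1,   # coin-flip neighbor move
    proposal_density=lambda x, y: 0.5,                            # symmetric
    x0=1,
    n_steps=100_000,
)
```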
- if a move is rejected, does it matter whether you re-count the current state as a sampled state?
- it seems like it does matter that you re-count it, but why does it matter?
- how do we know that the samples produced by metropolis-hastings are actually representative of the target distribution? the whole motivation is to sample in cases where we can't sample using other methods, i.e. where other methods give "bad" samples (despite theory/proofs), so how do we know the metropolis-hastings samples don't run into similar problems? i.e. how do we get a ground truth to test against? because i don't think the theoretical guarantees/theorems for M-H are any stronger than for other methods?
- for instance, M-H might be used to sample from a uniform probability distribution where the space of possible configurations is very large and unknown (see blitzstein and hwang for the claim that M-H is useful in this setting). but if that's the case, how do we know that M-H provides a good sample? how do we know that the samples we get aren't just tied to the spot where we happened to start?
"""Our previous methods tend not to work in complex situations:Inverse CDF may not be available.Conditionals needed for ancestral/Gibbs sampling may be hard to compute.Rejection sampling tends to reject almost all samples.Importance sampling tends gives almost zero weight to all samples.""" https://www.cs.ubc.ca/~schmidtm/Courses/540-W17/L22.pdf -- why wouldn't inverse cdf not be available?
i like this: http://phylo.bio.ku.edu/slides/BayesianMCMC-2013.pdf "We’d like an MCMC simulation that converges quickly so we should set the transition probabilities as high as possible. So, using m_{0,1} = 1 and m_{1,0} = 1/3 sounds best." -- omg!!!! somehow no other resource has ever explained this idea. (daphne koller also makes this point in the coursera course on graphical models)
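my reconstruction of why those particular numbers come out (assuming, as the quoted numbers suggest, that the target in those slides is π(0) = 1/4, π(1) = 3/4):

```latex
Detailed balance for a two-state chain with target $\pi$ only pins down the ratio of the
transition probabilities:
\[
  \pi(0)\, m_{0,1} \;=\; \pi(1)\, m_{1,0}
  \quad\Longleftrightarrow\quad
  \frac{m_{0,1}}{m_{1,0}} \;=\; \frac{\pi(1)}{\pi(0)}.
\]
If the target is $\pi(0) = 1/4$, $\pi(1) = 3/4$, then any pair $m_{0,1} = 3c$, $m_{1,0} = c$
with $0 < c \le 1/3$ works; all such chains have the same stationary distribution but mix at
different speeds. Scaling $c$ up as far as the constraint $m_{0,1} \le 1$ allows gives
$m_{0,1} = 1$, $m_{1,0} = 1/3$, the fastest-mixing choice.
```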
some intuitions from https://www.cise.ufl.edu/class/cap6617fa17/Readings/ChibGreenberg.pdf
- usually in markov chain settings, we are trying to go from the transition probabilities to the stationary distribution: e.g. we know that if it is sunny today then tomorrow it will be rainy with probability 0.1 and sunny with probability 0.9, and if it is rainy then tomorrow it will be sunny with probability 0.3 and rainy with probability 0.7, or whatever, and the goal is to find the stationary distribution, i.e. if we randomly wake up one day (without knowing any previous day's weather), what is the probability that it is raining? but in the metropolis-hastings setting we are doing the reverse: we roughly know (up to a multiplicative constant) the stationary distribution that we are trying to sample from, and the goal is instead to pick transition probabilities that give rise to this stationary distribution. (a worked version of this weather example appears after this list.)
- we set the acceptance probability to 1 in one of the cases because we aren't visiting that state enough (since detailed balance is violated), so we want to visit it as much as possible, and so the highest value we can choose is 1.
- the min{1, ...} crap is just a shorthand to avoid having to split into cases depending on whether we visit x or y more.
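a worked version of the weather example from the first bullet above (python sketch; the 0.9/0.1/0.3/0.7 numbers are the ones from that bullet), going in the usual "forward" direction from transition probabilities to stationary distribution:

```python
import numpy as np

# forward direction: transition probabilities -> stationary distribution.
# rows/columns: 0 = sunny, 1 = rainy (numbers from the bullet above).
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# the stationary distribution satisfies pi = pi @ P; iterate until it settles.
pi = np.array([0.5, 0.5])
for _ in range(1000):
    pi = pi @ P
print(pi)  # -> approximately [0.75, 0.25]

# metropolis-hastings runs this in reverse: pi (up to a constant) is given, and the
# transition probabilities are constructed so that this pi is the stationary distribution.
```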
read this page later: https://www.statlect.com/fundamentals-of-statistics/Metropolis-Hastings-algorithm
http://www.stats.ox.ac.uk/~nicholls/MScMCMC15/L5MScMCMC15.pdf
read this later: https://joa.sh/posts/2016-08-21-metropolis.html
this one might be pretty good http://www.stat.cmu.edu/~brian/463-663/week10/Chapter%2004.pdf (i haven't read it in detail yet)