User:IssaRice/Metropolis–Hastings algorithm

From Machinelearning
Revision as of 08:37, 10 February 2020 by IssaRice

without exception, every single explanation i have seen so far of this absolutely sucks. like, not just "most really suck, and some suck a little". literally everything just sucks really bad. this might be my best guess for the most horribly-explained thing ever.

in my opinion, the things a good explanation must cover are:

  • what the heck is sampling, even? once we have a fair coin, use that to generate samples from:
    • arbitrary biased coin
    • a discrete uniform distribution over 1,...,n
    • a continuous uniform(0,1) distribution
    • use a continuous uniform to sample from an arbitrary distribution using inverse transform sampling
    • bonus: go from a biased coin (with unknown bias) to a fair coin
  • why doesn't inverse transform sampling work in situations where we have to use metropolis-hastings?
    • like, what the heck does it mean to "only" have access to some constant multiple of the pdf? why would we ever get into such a situation? and why can't we just normalize, numerically approximate the cdf, and then invert it to do inverse transform sampling??? literally NONE of the explanations even RAISE this question. why????
  • an actually convincing example of MCMC. the stuff i've seen so far is so boring i just don't even care whether we can sample from it.
  • where the heck does the accept/reject rule come from? why this division thing to get the threshold?
  • why do we need a transition/proposal matrix, can this matrix be literally anything, and why do we care if it's symmetric?
  • how did anyone ever come up with the idea of using a markov chain, whose stationary distribution is the distribution we're trying to sample from? like, where did the markov chain idea even come from?
  • how could anyone have come up with detailed balance as a sufficient condition for existence of a stationary distribution?
  • how did anyone even think of the proposal-acceptance decomposition?
  • given the ratio condition for the acceptance function (see https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm#Formal_derivation where A(x',x)/A(x,x') is written as a fraction), why did people choose the particular form for A(x',x)? clearly lots of acceptance functions could have worked, and the particular choice seems unmotivated (although it is particularly simple-looking!).
  • big idea: https://stackoverflow.com/a/16826118/3422337 -- instead of sampling from 1,...,n by generating k coin flips (giving 2^k equally likely outcomes, so we might need to start over unless n is a power of 2), can we somehow "make every coin flip count"? randomly move to your neighbor with probability 1/2 by using the coin flip. it seems intuitive that "in the long run" you'll spend about an equal amount of time in each state.
    • this case is simplified in multiple ways: the proposal matrix is symmetric, the acceptance probabilities are all equal to 1 (i.e. always accept), and the distribution we're trying to sample from is uniform, so everything nicely cancels out
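the coin-flip random walk from the last bullet can be sketched concretely (my own toy sketch in python; putting the n states on a ring so every state has two neighbors is an assumption for illustration):

```python
import random
from collections import Counter

def coin_walk_uniform(n, steps):
    # random walk on states 1..n arranged in a ring: flip a fair coin,
    # move to the left/right neighbor. every flip moves the chain, so
    # no flips are "wasted" the way they are in reject-and-restart sampling.
    state = 1
    counts = Counter()
    for _ in range(steps):
        if random.getrandbits(1):
            state = state % n + 1          # step right (wraps n -> 1)
        else:
            state = (state - 2) % n + 1    # step left (wraps 1 -> n)
        counts[state] += 1
    return counts
```

running this for many steps, the fraction of time spent in each state approaches 1/n -- the "equal time in each state in the long run" intuition, i.e. the uniform distribution is the stationary distribution of this chain.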
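the fair-coin-to-everything ladder from the first bullet, as a sketch (my own python; the exponential distribution at the end is just a stand-in target whose inverse cdf has a closed form):

```python
import math
import random

def fair_coin():
    # one fair bit; stands in for a physical fair coin
    return random.getrandbits(1)

def biased_coin(p, bits=32):
    # build a uniform-ish number in [0,1) from fair bits, compare against p
    u = sum(fair_coin() / 2**i for i in range(1, bits + 1))
    return u < p

def discrete_uniform(n):
    # rejection: draw k = ceil(log2 n) bits, retry if the result is out of range
    k = max(1, (n - 1).bit_length())
    while True:
        x = 0
        for _ in range(k):
            x = (x << 1) | fair_coin()
        if x < n:
            return x + 1  # uniform over 1..n

def uniform01(bits=53):
    # binary expansion 0.b1 b2 b3 ... built from fair bits
    return sum(fair_coin() / 2**i for i in range(1, bits + 1))

def inverse_transform_exponential(rate):
    # inverse transform sampling: Exp(rate) has cdf F(x) = 1 - exp(-rate*x),
    # so F^{-1}(u) = -ln(1-u)/rate turns a uniform draw into an Exp draw
    u = uniform01()
    return -math.log(1.0 - u) / rate
```

the point of the ladder: each rung only uses the previous one, and the last rung (inverse transform) is exactly the method that breaks down in the metropolis-hastings setting below.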
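the bonus item (biased coin with unknown bias -> fair coin) is the classic von neumann trick; a sketch (the 0.7 bias here is an arbitrary stand-in that the trick itself never looks at):

```python
import random

def noisy_coin():
    # a biased coin with unknown bias (0.7 here, but the trick never uses that)
    return random.random() < 0.7

def von_neumann_fair(coin):
    # flip twice; HT -> heads, TH -> tails, HH/TT -> try again.
    # P(HT) = p(1-p) = P(TH), so the output is exactly fair for any 0 < p < 1.
    while True:
        a, b = coin(), coin()
        if a != b:
            return a
```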
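and the accept/reject rule itself, for a target known only up to a constant: a minimal metropolis sketch (symmetric gaussian random-walk proposal, so the proposal ratio drops out of the acceptance probability; the quartic target is my own stand-in, chosen because its normalizing constant isn't obvious):

```python
import math
import random

def unnormalized_target(x):
    # a density known only up to a constant: true pdf is exp(-x^4/4) / Z
    # for some Z we never compute
    return math.exp(-x**4 / 4.0)

def metropolis(p_tilde, n_steps, step=1.0, x0=0.0):
    # with a symmetric proposal, the acceptance probability
    # min(1, p(x')/p(x)) only needs the RATIO of densities,
    # so the unknown normalizing constant Z cancels.
    x = x0
    samples = []
    for _ in range(n_steps):
        x_prop = x + random.gauss(0.0, step)
        accept_prob = min(1.0, p_tilde(x_prop) / p_tilde(x))
        if random.random() < accept_prob:
            x = x_prop  # accept: move to the proposal
        # on reject, stay put; either way the current state is a sample
        samples.append(x)
    return samples
```

note how Z never appears anywhere: only the ratio p_tilde(x')/p_tilde(x) is used, which is exactly what "only having access to a constant multiple of the pdf" buys you.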


"""Our previous methods tend not to work in complex situations:Inverse CDF may not be available.Conditionals needed for ancestral/Gibbs sampling may be hard to compute.Rejection sampling tends to reject almost all samples.Importance sampling tends gives almost zero weight to all samples.""" https://www.cs.ubc.ca/~schmidtm/Courses/540-W17/L22.pdf -- why wouldn't inverse cdf not be available?

i like this: http://phylo.bio.ku.edu/slides/BayesianMCMC-2013.pdf