User:IssaRice/Metropolis–Hastings algorithm (machinelearning.subwiki.org, user contributions by IssaRice, 2021-06-17)
<hr />
<div>without exception, every single explanation i have seen so far of this absolutely sucks. like, not just "most really suck, and some suck a little". literally everything just sucks really bad. this might be my best guess for the most horribly-explained thing ''ever''.<br />
<br />
in my opinion, the things a good explanation ''must'' cover are:<br />
<br />
* what the heck is sampling, even? once we have a fair coin, use that to generate samples for:<br />
** arbitrary biased coin<br />
** a discrete uniform distribution over 1,...,n<br />
** a continuous uniform(0,1) distribution<br />
** use a continuous uniform to sample from an arbitrary distribution using inverse transform sampling<br />
** bonus: go from a biased coin (with unknown bias) to a fair coin<br />
* '''why doesn't inverse transform sampling work in situations where we have to use metropolis-hastings? another phrasing: why do we need metropolis-hastings if there is inverse transform sampling?'''<br />
** '''like, what the heck does it mean to "only" have access to some constant multiple of the pdf? why would we ever get into such a situation, and why can't we just normalize and then numerically approximate the cdf, and then get the inverse to do inverse transform sampling??? literally NONE of the explanations even RAISE this question. why????'''<br />
* '''an actually convincing ''example'' of MCMC. the stuff i've seen so far is so boring i just don't even care if we can sample from it.''' Since I'm interested in this mainly from an AI/inference point of view, i'd like to see a good example of this using bayes nets and sampling to compute some query.<br />
** a toy example i like is sampling uniformly from 1,...,6 using coin flips. it's easy to see in this case that you can do a thing where you move left on heads and move right on tails (with wrap-around, of course), and that after a while you're spending 1/6 of your time on each number.<br />
** the next level up is the example from [https://www.cs.ubc.ca/~schmidtm/Courses/540-W19/L35.pdf#page=9 these slides] where 1,...,5 have probability mass 1/10 each, and 6 has mass 1/2. now you can't just flip and move to adjacent numbers; you somehow have to distinguish the 6 as being more likely than the other numbers. how do you do it?<br />
*** so in this case, you roll the die uniformly so the proposal is symmetric, so you just get the metropolis algorithm. now going through the reasoning in the chib-greenberg paper, i do get 1/5 for the acceptance probability from 6 to 1.<br />
* '''where the heck does the accept/reject rule come from? why this division thing to get the threshold?'''<br />
* '''why do we need a transition/proposal matrix, can this matrix be literally anything, and why do we care if it's symmetric?'''<br />
* '''how did anyone ever come up with the idea of using a markov chain, whose stationary distribution is the distribution we're trying to sample from? like, where did the markov chain idea even come from?'''<br />
** in the original metropolis paper, there was a physically intuitive notion of "state" as like the arrangement of molecules or whatever. and there was also an intuitive notion of proposal, to pick one molecule and make it move a little bit in some random direction (i don't remember the details). so there was some physical intuition to draw on. but in modern-day bayesian stats stuff the "markov chain" is just the different states the parameter can be in, and the transitions also don't make sense, since you're not really trying to change the state! (you're trying to infer it). (see mark holder's double heads example for what i mean). so there is no longer this physical intuition to accompany the markov stuff.<br />
* '''how could anyone have come up with detailed balance as a sufficient condition for existence of a stationary distribution?'''<br />
** my guess for how it happened historically: people were already interested in markov chains for other reasons, and had built up a large body of theory about markov chains. so then when it came time to work on MCMC-type problems, people were already aware that detailed balance was a thing.<br />
* '''how did anyone even think of the proposal-acceptance decomposition?'''<br />
** [http://phylo.bio.ku.edu/slides/BayesianMCMC-2013.pdf#page=10 mark holder's notes] give one intuition: if you just try to get the transition probabilities to satisfy detailed balance, then the probabilities for some transitions might sum to more than 1, which violates the fact that these transitions are supposed to be probability distributions.<br />
** here's a question: why can't we just sum up the probabilities going out of a node and then renormalize for each node? I am guessing this won't work, but I don't understand why yet.<br />
** the chib-greenberg paper contains some insight on this: if we just started with some given transition probabilities, we probably wouldn't satisfy detailed balance, i.e. we would have an inequality instead of an equality for some pairs of states. then the thought is, how can we reduce/increase the flow from some states to other states? and then you're supposed to come up with acceptance probabilities.<br />
* '''given the ratio condition for the acceptance function (see https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm#Formal_derivation where A(x',x)/A(x,x') is written as a fraction), why did people choose the particular form for A(x',x)? clearly lots of acceptance functions could have worked, and the particular choice seems unmotivated (although it is particularly simple-looking!).'''<br />
* big idea: https://stackoverflow.com/a/16826118/3422337 -- instead of sampling from 1,...,n by generating k coin flips (giving 2^k equally likely outcomes, in which case we might need to start over unless n is a power of 2), can we somehow "make every coin flip count"? randomly move to your neighbor with probability 1/2 by using the coin flip. it seems intuitive that "in the long run" you'll spend about an equal amount of time in each state.<br />
** this case is simplified in multiple ways: the proposal matrix is symmetric, and the acceptance probabilities are all equal to 1 (i.e. always accept), and the distribution we're trying to sample from is uniform. so everything nicely cancels out<br />
* if a move is rejected, does it matter whether you re-count the current state as a sampled state?<br />
** it seems like it does matter that you re-count it, but '''why does it matter?'''<br />
* '''how do we know that the samples produced by metropolis-hastings are actually representative of the target distribution? the whole motivation is to sample in cases where we can't sample using other methods, i.e. where other methods give "bad" samples (despite theory/proofs), so how do we know the metropolis-hastings samples don't run into similar problems?''' i.e. how do we get a ground truth to test against? because i don't think the theoretical guarantees/theorems for M-H are any stronger than for other methods?<br />
** for instance, M-H might be used to sample from a uniform probability distribution where the space of possible configurations is very large and unknown (see blitzstein and hwang for the claim that M-H is useful in this setting). but if that's the case, how do we know that M-H provides a good sample? how do we know that the samples we get aren't just tied to the spot where we happened to start?<br />
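the fair-coin-to-arbitrary-distribution ladder in the list above can be sketched in a few lines. this is my own toy code, not from any of the linked sources: a fair coin builds a Uniform(0,1) draw bit by bit, and an inverse CDF (the exponential's, which has a closed form) turns that into a sample from another distribution.

```python
import math
import random

def fair_coin():
    # stand-in for a physical fair coin
    return random.getrandbits(1)

def uniform_from_coin(bits=53):
    # build a Uniform(0,1) draw from fair coin flips via its binary expansion
    u = 0.0
    for i in range(1, bits + 1):
        u += fair_coin() * 2.0 ** -i
    return u

def exponential_from_uniform(rate=1.0):
    # inverse transform sampling: F(x) = 1 - exp(-rate*x),
    # so F^{-1}(u) = -log(1 - u) / rate
    u = uniform_from_coin()
    return -math.log(1.0 - u) / rate

random.seed(0)
samples = [exponential_from_uniform() for _ in range(100_000)]
print(sum(samples) / len(samples))  # should be close to 1/rate = 1
```

this only works because the exponential's inverse CDF is available in closed form, which is exactly what the bold question above is poking at: when it isn't, the ladder breaks at the last rung.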
<br />
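the weighted-die example above (states 1,...,5 with mass 1/10, state 6 with mass 1/2, uniform die roll as the symmetric proposal) can be checked directly. this is my own sketch of the resulting metropolis algorithm; states are 0-indexed, so `pi[5]` is the "6":

```python
import random

random.seed(1)
pi = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]  # target distribution over "1..6"

def metropolis_step(x):
    y = random.randrange(6)            # symmetric proposal: roll a fair die
    accept = min(1.0, pi[y] / pi[x])   # metropolis acceptance probability
    return y if random.random() < accept else x  # re-count x on rejection

# the acceptance probability for the move "6 -> 1" discussed above:
print(min(1.0, pi[0] / pi[5]))  # 0.2, i.e. the 1/5 from the chib-greenberg reasoning

x, counts = 0, [0] * 6
for _ in range(200_000):
    x = metropolis_step(x)
    counts[x] += 1
freqs = [c / 200_000 for c in counts]
print(freqs)  # should be close to pi
```

note the rejected-move handling: the current state is counted again, which is what the "does it matter whether you re-count the current state" bullet is about; without it the chain would over-sample the low-mass states.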
<br />
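to make the "only a constant multiple of the pdf" question concrete, here's a minimal sketch (again my own toy, not from the sources): the target is a standard normal known only through an unnormalized density, and the normalizing constant cancels inside the acceptance ratio, so the sampler never needs it.

```python
import math
import random

random.seed(2)

def unnorm(x):
    # target known only up to a constant: exp(-x^2/2);
    # we pretend not to know the normalizer sqrt(2*pi)
    return math.exp(-x * x / 2.0)

def mh_step(x, step=1.0):
    y = x + random.gauss(0.0, step)          # symmetric random-walk proposal
    a = min(1.0, unnorm(y) / unnorm(x))      # normalizer cancels in the ratio
    return y if random.random() < a else x

x, samples = 0.0, []
for i in range(200_000):
    x = mh_step(x)
    if i >= 10_000:        # discard burn-in
        samples.append(x)

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)  # should be close to 0 and 1 (standard normal)
```

this doesn't answer why we can't just numerically normalize the cdf instead, but it does show the mechanical point: the division in the accept/reject rule is exactly what lets the unknown constant drop out.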
"""Our previous methods tend not to work in complex situations: Inverse CDF may not be available. Conditionals needed for ancestral/Gibbs sampling may be hard to compute. Rejection sampling tends to reject almost all samples. Importance sampling tends to give almost zero weight to all samples.""" https://www.cs.ubc.ca/~schmidtm/Courses/540-W17/L22.pdf -- why wouldn't the inverse cdf be available?<br />
<br />
i like this: http://phylo.bio.ku.edu/slides/BayesianMCMC-2013.pdf<br />
"We’d like an MCMC simulation that converges quickly so we should set the transition probabilities as high as possible. So, using m0,1 = 1 and m1,0 = 1/3 sounds best." -- omg!!!! somehow no other resource has ever explained this idea. (daphne koller also makes this point in the coursera course on graphical models)<br />
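the point in that quote can be verified with a two-state chain. my own sketch, assuming (as the slide's numbers imply) a target of pi = (1/4, 3/4): any (m01, m10) with m01/m10 = 3 satisfies detailed balance, but for a two-state chain the non-unit eigenvalue is 1 - m01 - m10, and its magnitude controls how fast the chain forgets its starting point, so cranking the transition probabilities up to m01 = 1, m10 = 1/3 mixes fastest.

```python
def stationary(m01, m10):
    # detailed balance for two states: pi0 * m01 = pi1 * m10
    pi0 = m10 / (m01 + m10)
    return (pi0, 1.0 - pi0)

def second_eigenvalue(m01, m10):
    # eigenvalues of the 2x2 transition matrix are 1 and (1 - m01 - m10)
    return 1.0 - m01 - m10

# both choices target pi = (1/4, 3/4), since m01/m10 = 3 in each case:
print(stationary(1.0, 1 / 3))   # approximately (0.25, 0.75)
print(stationary(0.3, 0.1))     # approximately (0.25, 0.75)

# but the mixing speeds differ:
print(abs(second_eigenvalue(1.0, 1 / 3)))  # about 1/3: fast forgetting
print(abs(second_eigenvalue(0.3, 0.1)))    # about 0.6: slow forgetting
```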
<br />
<br />
some intuitions from https://www.cise.ufl.edu/class/cap6617fa17/Readings/ChibGreenberg.pdf<br />
<br />
* usually in markov chain settings, we are trying to go from the transition probabilities to the stationary distribution: e.g. we know that if it is sunny then it will be rainy with probability 0.1 and sunny with probability 0.9, and if it is rainy then it will be sunny with probability 0.3 and rainy with probability 0.7, or whatever. and the goal is to find out the stationary distribution, that is, if we randomly wake up one day (without knowing any previous day's weather), what is the probability that it is raining? But in the metropolis-hastings setting, we are doing the reverse: we roughly (up to multiplicative constant) know what the stationary distribution is that we are trying to sample from. The goal is instead to pick the transition probabilities so that they give rise to this stationary distribution.<br />
* we set the acceptance probability to 1 in one of the cases because we aren't visiting that state enough (since detailed balance is violated), so we want to visit it as much as possible, and so the highest value we can choose is 1.<br />
* the min{1, ...} crap is just a shorthand to avoid having to split into cases depending on whether we visit x or y more.<br />
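the "forward direction" in the first bullet above (known transitions, find the stationary distribution) is easy to compute by just iterating the chain; a quick sketch with the sunny/rainy numbers from that bullet:

```python
# known transition probabilities, as in the weather example above
P = {("sunny", "sunny"): 0.9, ("sunny", "rainy"): 0.1,
     ("rainy", "sunny"): 0.3, ("rainy", "rainy"): 0.7}

states = ("sunny", "rainy")
dist = {"sunny": 1.0, "rainy": 0.0}   # start from an arbitrary known day
for _ in range(200):                   # iterate the chain to convergence
    dist = {s: sum(dist[r] * P[(r, s)] for r in states) for s in states}

print(dist)  # approaches {'sunny': 0.75, 'rainy': 0.25}
```

metropolis-hastings runs this problem in reverse: the stationary distribution (up to a constant) is the given, and the transition probabilities are the unknowns we get to design.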
<br />
<br />
read this page later: https://www.statlect.com/fundamentals-of-statistics/Metropolis-Hastings-algorithm<br />
<br />
http://www.stats.ox.ac.uk/~nicholls/MScMCMC15/L5MScMCMC15.pdf<br />
<br />
read this later: https://joa.sh/posts/2016-08-21-metropolis.html</div>
<hr />
<div>without exception, every single explanation i have seen so far of this absolutely sucks. like, not just "most really suck, and some suck a little". literally everything just sucks really bad. this might be my best guess for the most horribly-explained thing ''ever''.<br />
<br />
in my opinion, the things a good explanation ''must'' cover are:<br />
<br />
* what the heck is sampling, even? once we have a fair coin, use that to generate samples for:<br />
** arbitrary biased coin<br />
** a discrete uniform distribution over 1,...,n<br />
** a continuous uniform(0,1) distribution<br />
** use a continuous uniform to sample from an arbitrary distribution using inverse transform sampling<br />
** bonus: go from a biased coin (with unknown bias) to a fair coin<br />
* '''why doesn't inverse transform sampling work in situations where we have to use metropolis-hastings? another phrasing: why do we need metropolis-hastings if there is inverse transform sampling?'''<br />
** '''like, what the heck does it mean to "only" have access to some constant multiple of the pdf? why would we ever get into such a situation, and why can't we just normalize and then numerically approximate the cdf, and then get the inverse to do inverse transform sampling??? literally NONE of the explanations even RAISE this question. why????'''<br />
* '''an actually convincing ''example'' of MCMC. the stuff i've seen so far are so boring i just don't even care if we can sample from it.''' Since I'm interested in this mainly from an AI/inference point of view, i'd like to see a good example of this using bayes nets and sampling to compute some query.<br />
** a toy example i like is sampling uniformly from 1,...,6 using coin flips. it's easy to see in this case that you can do a thing where you move left on heads and move right on tails (with wrap-around, of course), and that after a while you're spending 1/6 of your time on each number.<br />
** the next level up is the example from [https://www.cs.ubc.ca/~schmidtm/Courses/540-W19/L35.pdf#page=9 these slides] where 1,...5, have probability mass 1/10 each, and 6 has mass 1/2. now you can't just flip and move to adjacent numbers; you somehow have to distinguish the 6 as being more likely than the other numbers. how do you do it?<br />
*** so in this case, you roll the die uniformly so the proposal is symmetric, so you just get the metropolis algorithm. now going through the reasoning in the chib-greenberg paper, i do get 1/5 for the acceptance probability from 6 to 1.<br />
* '''where the heck does the accept/reject rule come from? why this division thing to get the threshold?'''<br />
* '''why do we need a transition/proposal matrix, can this matrix be literally anything, and why do we care if it's symmetric?'''<br />
* '''how did anyone ever come up with the idea of using a markov chain, whose stationary distribution is the distribution we're trying to sample from? like, where did the markov chain idea even come from?'''<br />
** in the original metropolis paper, there was an physically intuitive notion of "state" as like the arrangement of molecules or whatever. and there was also an intuitive notion of proposal, to pick one molecule and make it move a little bit in some random direction (i don't remember the details). so there was some physical intuition to draw on. but in modern day bayesian stats stuff the "markov chain" is just the different states the parameter can be in, and the transitions also don't make sense, since you're not really trying to change the state! (you're trying to infer it). (see mark holder's double heads example for what i mean). so there is no longer this physical intuition to accompany the markov stuff.<br />
* '''how could anyone have come up with detailed balance as a sufficient condition for existence of a stationary distribution?'''<br />
** my guess for how it happened historically: people were already interested in markov chains for other reasons, and had built up a large body of theory about markov chains. so then when it came time to work on MCMC-type problems, people were already aware that detailed balance was a thing.<br />
* '''how did anyone even think of the proposal-acceptance decomposition?'''<br />
** [http://phylo.bio.ku.edu/slides/BayesianMCMC-2013.pdf#page=10 mark holder's notes] give one intuition: if you just try to get the transition probabilities to satisfy detailed balance, then the probabilities for some transitions might sum to more than 1, which violates the fact that these transitions are supposed to be probability distributions.<br />
** here's a question: why can't we just sum up the probabilities going out of a node and then renormalize for each node? I am guessing this won't work, but I don't understand why yet.<br />
** chib-greenberg paper contains some insight on this, by saying that if we just started with some given transition probabilities, we probably wouldn't satisfy detailed balance, i.e. we would have an inequality instead of equality for some pairs of states. then the thought is, how can we reduce/increase the flow from some states to other states? and then you're supposed to come up with acceptance probabilities.<br />
* '''given the ratio condition for the acceptance function (see https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm#Formal_derivation where A(x',x)/A(x,x') is written as a fraction), why did people choose the particular form for A(x',x)? clearly lots of acceptance functions could have worked, and the particular choice seems unmotivated (although it is particularly simple-looking!).'''<br />
* big idea: https://stackoverflow.com/a/16826118/3422337 -- instead of sampling from 1,...,n by generating 2^k coin flips (in which case we might need to start over unless n is a power of 2), can we somehow "make every coin flip count"? randomly move to your neighbor with probability 1/2 by using the coin flip. it seems intuitive that "in the long run" you'll spend about an equal amount of time in each state.<br />
** this case is simplified in multiple ways: the proposal matrix is symmetric, and the acceptance probabilities are all equal to 1 (i.e. always accept), and the distribution we're trying to sample from is uniform. so everything nicely cancels out<br />
* if a move is rejected, does it matter whether you re-count the current state as a sampled state?<br />
** it seems like it does matter that you re-count it, but '''why does it matter?'''<br />
* '''how do we know that the samples produced by metropolis-hastings are actually representative of the target distribution? the whole motivation is to sample in cases where we can't sample using other methods, i.e. where other methods give "bad" samples (despite theory/proofs), so how do we know the metropolis-hastings samples don't run into similar problems?''' i.e. how do we get a ground truth to test against? because i don't think the theoretical guarantees/theorems for M-H are any stronger than for other methods?<br />
<br />
<br />
"""Our previous methods tend not to work in complex situations:Inverse CDF may not be available.Conditionals needed for ancestral/Gibbs sampling may be hard to compute.Rejection sampling tends to reject almost all samples.Importance sampling tends gives almost zero weight to all samples.""" https://www.cs.ubc.ca/~schmidtm/Courses/540-W17/L22.pdf -- why wouldn't inverse cdf not be available?<br />
<br />
i like this: http://phylo.bio.ku.edu/slides/BayesianMCMC-2013.pdf<br />
"We’d like an MCMC simulation that converges quickly so we should set the transition probabilities as high as possible. So, using m0,1 = 1 and m1,0 = 1/3 sounds best." -- omg!!!! somehow no other resource has ever explained this idea. (daphne koller also makes this point in the coursera course on graphical models)<br />
<br />
<br />
some intuitions from https://www.cise.ufl.edu/class/cap6617fa17/Readings/ChibGreenberg.pdf<br />
<br />
* usually in markov chain settings, we are trying to go from the transition probabilities to the stationary distribution: e.g. we know that if it is sunny then it will be rainy with probability 0.1 and sunny with probability 0.9, and if it is rainy then it will be sunny with probability 0.3 and rainy with probability 0.7, or whatever. and the goal is to find out the stationary distribution, that is, if we randomly wake up one day (without knowing any previous day's weather), what is the probability that it is raining? But in the metropolis-hastings setting, we are doing the reverse: we roughly (up to multiplicative constant) know what the stationary distribution is that we are trying to sample from. The goal is instead to pick the transition probabilities so that they give rise to this stationary distribution.<br />
* we set the acceptance probability to 1 in one of the cases because we aren't visiting that state enough (since detailed balance is violated), so we want to visit it as much as possible, and so the highest value we can choose is 1.<br />
* the min{1, ...} crap is just a shorthand to avoid having to split into cases depending on whether we visit x or y more.<br />
<br />
<br />
read this page later: https://www.statlect.com/fundamentals-of-statistics/Metropolis-Hastings-algorithm<br />
<br />
http://www.stats.ox.ac.uk/~nicholls/MScMCMC15/L5MScMCMC15.pdf<br />
<br />
read this later: https://joa.sh/posts/2016-08-21-metropolis.html</div>IssaRicehttps://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Metropolis%E2%80%93Hastings_algorithm&diff=3316User:IssaRice/Metropolis–Hastings algorithm2021-06-15T07:53:52Z<p>IssaRice: </p>
<hr />
<div>without exception, every single explanation i have seen so far of this absolutely sucks. like, not just "most really suck, and some suck a little". literally everything just sucks really bad. this might be my best guess for the most horribly-explained thing ''ever''.<br />
<br />
in my opinion, the things a good explanation ''must'' cover are:<br />
<br />
* what the heck is sampling, even? once we have a fair coin, use that to generate samples for:<br />
** arbitrary biased coin<br />
** a discrete uniform distribution over 1,...,n<br />
** a continuous uniform(0,1) distribution<br />
** use a continuous uniform to sample from an arbitrary distribution using inverse transform sampling<br />
** bonus: go from a biased coin (with unknown bias) to a fair coin<br />
* '''why doesn't inverse transform sampling work in situations where we have to use metropolis-hastings? another phrasing: why do we need metropolis-hastings if there is inverse transform sampling?'''<br />
** '''like, what the heck does it mean to "only" have access to some constant multiple of the pdf? why would we ever get into such a situation, and why can't we just normalize and then numerically approximate the cdf, and then get the inverse to do inverse transform sampling??? literally NONE of the explanations even RAISE this question. why????'''<br />
* '''an actually convincing ''example'' of MCMC. the stuff i've seen so far are so boring i just don't even care if we can sample from it.''' Since I'm interested in this mainly from an AI/inference point of view, i'd like to see a good example of this using bayes nets and sampling to compute some query.<br />
** a toy example i like is sampling uniformly from 1,...,6 using coin flips. it's easy to see in this case that you can do a thing where you move left on heads and move right on tails (with wrap-around, of course), and that after a while you're spending 1/6 of your time on each number.<br />
** the next level up is the example from [https://www.cs.ubc.ca/~schmidtm/Courses/540-W19/L35.pdf#page=9 these slides] where 1,...5, have probability mass 1/10 each, and 6 has mass 1/2. now you can't just flip and move to adjacent numbers; you somehow have to distinguish the 6 as being more likely than the other numbers. how do you do it?<br />
*** so in this case, you roll the die uniformly so the proposal is symmetric, so you just get the metropolis algorithm. now going through the reasoning in the chib-greenberg paper, i do get 1/5 for the acceptance probability from 6 to 1.<br />
* '''where the heck does the accept/reject rule come from? why this division thing to get the threshold?'''<br />
* '''why do we need a transition/proposal matrix, can this matrix be literally anything, and why do we care if it's symmetric?'''<br />
* '''how did anyone ever come up with the idea of using a markov chain, whose stationary distribution is the distribution we're trying to sample from? like, where did the markov chain idea even come from?'''<br />
** in the original metropolis paper, there was an physically intuitive notion of "state" as like the arrangement of molecules or whatever. and there was also an intuitive notion of proposal, to pick one molecule and make it move a little bit in some random direction (i don't remember the details). so there was some physical intuition to draw on. but in modern day bayesian stats stuff the "markov chain" is just the different states the parameter can be in, and the transitions also don't make sense, since you're not really trying to change the state! (you're trying to infer it). (see mark holder's double heads example for what i mean). so there is no longer this physical intuition to accompany the markov stuff.<br />
* '''how could anyone have come up with detailed balance as a sufficient condition for existence of a stationary distribution?'''<br />
** my guess for how it happened historically: people were already interested in markov chains for other reasons, and had built up a large body of theory about markov chains. so then when it came time to work on MCMC-type problems, people were already aware that detailed balance was a thing.<br />
* '''how did anyone even think of the proposal-acceptance decomposition?'''<br />
** [http://phylo.bio.ku.edu/slides/BayesianMCMC-2013.pdf#page=10 mark holder's notes] give one intuition: if you just try to get the transition probabilities to satisfy detailed balance, then the probabilities for some transitions might sum to more than 1, which violates the fact that these transitions are supposed to be probability distributions.<br />
** here's a question: why can't we just sum up the probabilities going out of a node and then renormalize for each node? I am guessing this won't work, but I don't understand why yet.<br />
* '''given the ratio condition for the acceptance function (see https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm#Formal_derivation where A(x',x)/A(x,x') is written as a fraction), why did people choose the particular form for A(x',x)? clearly lots of acceptance functions could have worked, and the particular choice seems unmotivated (although it is particularly simple-looking!).'''<br />
* big idea: https://stackoverflow.com/a/16826118/3422337 -- instead of sampling from 1,...,n by generating 2^k coin flips (in which case we might need to start over unless n is a power of 2), can we somehow "make every coin flip count"? randomly move to your neighbor with probability 1/2 by using the coin flip. it seems intuitive that "in the long run" you'll spend about an equal amount of time in each state.<br />
** this case is simplified in multiple ways: the proposal matrix is symmetric, and the acceptance probabilities are all equal to 1 (i.e. always accept), and the distribution we're trying to sample from is uniform. so everything nicely cancels out<br />
* if a move is rejected, does it matter whether you re-count the current state as a sampled state?<br />
** it seems like it does matter that you re-count it, but '''why does it matter?'''<br />
* '''how do we know that the samples produced by metropolis-hastings are actually representative of the target distribution? the whole motivation is to sample in cases where we can't sample using other methods, i.e. where other methods give "bad" samples (despite theory/proofs), so how do we know the metropolis-hastings samples don't run into similar problems?''' i.e. how do we get a ground truth to test against? because i don't think the theoretical guarantees/theorems for M-H are any stronger than for other methods?<br />
<br />
<br />
"""Our previous methods tend not to work in complex situations:Inverse CDF may not be available.Conditionals needed for ancestral/Gibbs sampling may be hard to compute.Rejection sampling tends to reject almost all samples.Importance sampling tends gives almost zero weight to all samples.""" https://www.cs.ubc.ca/~schmidtm/Courses/540-W17/L22.pdf -- why wouldn't inverse cdf not be available?<br />
<br />
i like this: http://phylo.bio.ku.edu/slides/BayesianMCMC-2013.pdf<br />
"We’d like an MCMC simulation that converges quickly so we should set the transition probabilities as high as possible. So, using m0,1 = 1 and m1,0 = 1/3 sounds best." -- omg!!!! somehow no other resource has ever explained this idea. (daphne koller also makes this point in the coursera course on graphical models)<br />
<br />
<br />
some intuitions from https://www.cise.ufl.edu/class/cap6617fa17/Readings/ChibGreenberg.pdf<br />
<br />
* usually in markov chain settings, we are trying to go from the transition probabilities to the stationary distribution: e.g. we know that if it is sunny then it will be rainy with probability 0.1 and sunny with probability 0.9, and if it is rainy then it will be sunny with probability 0.3 and rainy with probability 0.7, or whatever. and the goal is to find out the stationary distribution, that is, if we randomly wake up one day (without knowing any previous day's weather), what is the probability that it is raining? But in the metropolis-hastings setting, we are doing the reverse: we roughly (up to multiplicative constant) know what the stationary distribution is that we are trying to sample from. The goal is instead to pick the transition probabilities so that they give rise to this stationary distribution.<br />
* we set the acceptance probability to 1 in one of the cases because we aren't visiting that state enough (since detailed balance is violated), so we want to visit it as much as possible, and so the highest value we can choose is 1.<br />
* the min{1, ...} crap is just a shorthand to avoid having to split into cases depending on whether we visit x or y more.<br />
<br />
<br />
read this page later: https://www.statlect.com/fundamentals-of-statistics/Metropolis-Hastings-algorithm<br />
<br />
http://www.stats.ox.ac.uk/~nicholls/MScMCMC15/L5MScMCMC15.pdf<br />
<br />
read this later: https://joa.sh/posts/2016-08-21-metropolis.html</div>IssaRicehttps://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Metropolis%E2%80%93Hastings_algorithm&diff=3315User:IssaRice/Metropolis–Hastings algorithm2021-06-15T07:44:20Z<p>IssaRice: </p>
<hr />
<div>without exception, every single explanation i have seen so far of this absolutely sucks. like, not just "most really suck, and some suck a little". literally everything just sucks really bad. this might be my best guess for the most horribly-explained thing ''ever''.<br />
<br />
in my opinion, the things a good explanation ''must'' cover are:<br />
<br />
* what the heck is sampling, even? once we have a fair coin, use that to generate samples for:<br />
** arbitrary biased coin<br />
** a discrete uniform distribution over 1,...,n<br />
** a continuous uniform(0,1) distribution<br />
** use a continuous uniform to sample from an arbitrary distribution using inverse transform sampling<br />
** bonus: go from a biased coin (with unknown bias) to a fair coin<br />
* '''why doesn't inverse transform sampling work in situations where we have to use metropolis-hastings? another phrasing: why do we need metropolis-hastings if there is inverse transform sampling?'''<br />
** '''like, what the heck does it mean to "only" have access to some constant multiple of the pdf? why would we ever get into such a situation, and why can't we just normalize and then numerically approximate the cdf, and then get the inverse to do inverse transform sampling??? literally NONE of the explanations even RAISE this question. why????'''<br />
* '''an actually convincing ''example'' of MCMC. the stuff i've seen so far are so boring i just don't even care if we can sample from it.''' Since I'm interested in this mainly from an AI/inference point of view, i'd like to see a good example of this using bayes nets and sampling to compute some query.<br />
** a toy example i like is sampling uniformly from 1,...,6 using coin flips. it's easy to see in this case that you can do a thing where you move left on heads and move right on tails (with wrap-around, of course), and that after a while you're spending 1/6 of your time on each number.<br />
** the next level up is the example from [https://www.cs.ubc.ca/~schmidtm/Courses/540-W19/L35.pdf#page=9 these slides] where 1,...5, have probability mass 1/10 each, and 6 has mass 1/2. now you can't just flip and move to adjacent numbers; you somehow have to distinguish the 6 as being more likely than the other numbers. how do you do it?<br />
* '''where the heck does the accept/reject rule come from? why this division thing to get the threshold?'''<br />
* '''why do we need a transition/proposal matrix, can this matrix be literally anything, and why do we care if it's symmetric?'''<br />
* '''how did anyone ever come up with the idea of using a markov chain, whose stationary distribution is the distribution we're trying to sample from? like, where did the markov chain idea even come from?'''<br />
** in the original metropolis paper, there was a physically intuitive notion of "state" as the arrangement of molecules or whatever, and also an intuitive notion of proposal: pick one molecule and move it a little bit in some random direction (i don't remember the details). so there was some physical intuition to draw on. but in modern-day bayesian stats the "markov chain" is just the different states the parameter can be in, and the transitions also don't make sense, since you're not really trying to change the state! (you're trying to infer it). (see mark holder's double heads example for what i mean). so there is no longer this physical intuition to accompany the markov stuff.<br />
* '''how could anyone have come up with detailed balance as a sufficient condition for existence of a stationary distribution?'''<br />
** my guess for how it happened historically: people were already interested in markov chains for other reasons, and had built up a large body of theory about markov chains. so then when it came time to work on MCMC-type problems, people were already aware that detailed balance was a thing.<br />
* '''how did anyone even think of the proposal-acceptance decomposition?'''<br />
** [http://phylo.bio.ku.edu/slides/BayesianMCMC-2013.pdf#page=10 mark holder's notes] give one intuition: if you just try to get the transition probabilities to satisfy detailed balance, then the probabilities for some transitions might sum to more than 1, which violates the fact that these transitions are supposed to be probability distributions.<br />
** here's a question: why can't we just sum up the probabilities going out of a node and then renormalize for each node? I am guessing this won't work, but I don't understand why yet.<br />
* '''given the ratio condition for the acceptance function (see https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm#Formal_derivation where A(x',x)/A(x,x') is written as a fraction), why did people choose the particular form for A(x',x)? clearly lots of acceptance functions could have worked, and the particular choice seems unmotivated (although it is particularly simple-looking!).'''<br />
* big idea: https://stackoverflow.com/a/16826118/3422337 -- instead of sampling from 1,...,n by flipping k coins to get one of 2^k outcomes (in which case we might need to start over unless n is a power of 2), can we somehow "make every coin flip count"? randomly move to your neighbor with probability 1/2 by using the coin flip. it seems intuitive that "in the long run" you'll spend about an equal amount of time in each state.<br />
** this case is simplified in multiple ways: the proposal matrix is symmetric, and the acceptance probabilities are all equal to 1 (i.e. always accept), and the distribution we're trying to sample from is uniform. so everything nicely cancels out<br />
* if a move is rejected, does it matter whether you re-count the current state as a sampled state?<br />
** it seems like it does matter that you re-count it, but '''why does it matter?'''<br />
* '''how do we know that the samples produced by metropolis-hastings are actually representative of the target distribution? the whole motivation is to sample in cases where we can't sample using other methods, i.e. where other methods give "bad" samples (despite theory/proofs), so how do we know the metropolis-hastings samples don't run into similar problems?''' i.e. how do we get a ground truth to test against? because i don't think the theoretical guarantees/theorems for M-H are any stronger than for other methods?<br />
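to make the accept/reject rule concrete, here's a minimal sketch (in python; the step count and seed are my own choices) of metropolis on the 6-state example above: propose a move to an adjacent number with a fair coin, then accept with probability min(1, pi(y)/pi(x)):

```python
import random
from collections import Counter

# Target over states 1..6: states 1..5 each get mass 1/10, state 6 gets 1/2.
# Only ratios of these weights are ever used, so an unnormalized version
# of the target would work just as well.
weights = {s: 0.1 for s in range(1, 6)}
weights[6] = 0.5

def metropolis(n_steps, seed=0):
    rng = random.Random(seed)
    x = 1
    counts = Counter()
    for _ in range(n_steps):
        # Symmetric proposal: step to an adjacent state, with wrap-around.
        y = ((x - 1 + rng.choice([-1, 1])) % 6) + 1
        # Accept with probability min(1, pi(y)/pi(x)); otherwise stay put.
        if rng.random() < min(1.0, weights[y] / weights[x]):
            x = y
        counts[x] += 1  # the current state is re-counted on rejection
    return counts

counts = metropolis(200_000)
for s in range(1, 7):
    print(s, counts[s] / 200_000)  # empirical frequencies approach the target
```

note that rejections re-count the current state (one of the questions above): that extra weight on sticky states like 6 is exactly what makes the empirical frequencies match the target.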
<br />
<br />
"""Our previous methods tend not to work in complex situations:Inverse CDF may not be available.Conditionals needed for ancestral/Gibbs sampling may be hard to compute.Rejection sampling tends to reject almost all samples.Importance sampling tends gives almost zero weight to all samples.""" https://www.cs.ubc.ca/~schmidtm/Courses/540-W17/L22.pdf -- why wouldn't inverse cdf not be available?<br />
<br />
i like this: http://phylo.bio.ku.edu/slides/BayesianMCMC-2013.pdf<br />
"We’d like an MCMC simulation that converges quickly so we should set the transition probabilities as high as possible. So, using m0,1 = 1 and m1,0 = 1/3 sounds best." -- omg!!!! somehow no other resource has ever explained this idea. (daphne koller also makes this point in the coursera course on graphical models)<br />
<br />
<br />
some intuitions from https://www.cise.ufl.edu/class/cap6617fa17/Readings/ChibGreenberg.pdf<br />
<br />
* usually in markov chain settings, we are trying to go from the transition probabilities to the stationary distribution: e.g. we know that if it is sunny then it will be rainy with probability 0.1 and sunny with probability 0.9, and if it is rainy then it will be sunny with probability 0.3 and rainy with probability 0.7, or whatever. and the goal is to find out the stationary distribution, that is, if we randomly wake up one day (without knowing any previous day's weather), what is the probability that it is raining? But in the metropolis-hastings setting, we are doing the reverse: we roughly (up to multiplicative constant) know what the stationary distribution is that we are trying to sample from. The goal is instead to pick the transition probabilities so that they give rise to this stationary distribution.<br />
* we set the acceptance probability to 1 in one of the cases because we aren't visiting that state enough (since detailed balance is violated), so we want to visit it as much as possible, and so the highest value we can choose is 1.<br />
* the min{1, ...} crap is just a shorthand to avoid having to split into cases depending on whether we visit x or y more.<br />
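the min{1, ...} choice can also be sanity-checked directly: with A(x,y) = min(1, pi(y)q(y,x)/(pi(x)q(x,y))), the flow pi(x)P(x,y) collapses to min(pi(x)q(x,y), pi(y)q(y,x)), which is symmetric in x and y, so detailed balance holds for every pair. a small numeric check (the pi and q values below are arbitrary, just for illustration):

```python
# Verify detailed balance pi(x) P(x,y) = pi(y) P(y,x) for the
# Metropolis-Hastings kernel P(x,y) = q(x,y) * min(1, pi(y)q(y,x)/(pi(x)q(x,y)))
# on a made-up 3-state chain.
pi = [0.2, 0.3, 0.5]            # target distribution
q = [[0.0, 0.5, 0.5],           # proposal matrix (rows sum to 1)
     [0.4, 0.0, 0.6],
     [0.7, 0.3, 0.0]]

def P(x, y):
    if x == y:
        return 0.0              # off-diagonal part only; the diagonal absorbs rejections
    a = min(1.0, (pi[y] * q[y][x]) / (pi[x] * q[x][y]))
    return q[x][y] * a

for x in range(3):
    for y in range(3):
        assert abs(pi[x] * P(x, y) - pi[y] * P(y, x)) < 1e-12
print("detailed balance holds for every pair")
```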
<br />
<br />
read this page later: https://www.statlect.com/fundamentals-of-statistics/Metropolis-Hastings-algorithm<br />
<br />
http://www.stats.ox.ac.uk/~nicholls/MScMCMC15/L5MScMCMC15.pdf<br />
<br />
read this later: https://joa.sh/posts/2016-08-21-metropolis.html</div>IssaRicehttps://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Metropolis%E2%80%93Hastings_algorithm&diff=3314User:IssaRice/Metropolis–Hastings algorithm2021-06-15T07:40:46Z<p>IssaRice: </p>
<hr />
<div>without exception, every single explanation i have seen so far of this absolutely sucks. like, not just "most really suck, and some suck a little". literally everything just sucks really bad. this might be my best guess for the most horribly-explained thing ''ever''.<br />
<br />
in my opinion, the things a good explanation ''must'' cover are:<br />
<br />
* what the heck is sampling, even? once we have a fair coin, use that to generate samples for:<br />
** arbitrary biased coin<br />
** a discrete uniform distribution over 1,...,n<br />
** a continuous uniform(0,1) distribution<br />
** use a continuous uniform to sample from an arbitrary distribution using inverse transform sampling<br />
** bonus: go from a biased coin (with unknown bias) to a fair coin<br />
* '''why doesn't inverse transform sampling work in situations where we have to use metropolis-hastings? another phrasing: why do we need metropolis-hastings if there is inverse transform sampling?'''<br />
** '''like, what the heck does it mean to "only" have access to some constant multiple of the pdf? why would we ever get into such a situation, and why can't we just normalize and then numerically approximate the cdf, and then get the inverse to do inverse transform sampling??? literally NONE of the explanations even RAISE this question. why????'''<br />
* '''an actually convincing ''example'' of MCMC. the stuff i've seen so far are so boring i just don't even care if we can sample from it.''' Since I'm interested in this mainly from an AI/inference point of view, i'd like to see a good example of this using bayes nets and sampling to compute some query.<br />
** a toy example i like is sampling uniformly from 1,...,6 using coin flips. it's easy to see in this case that you can do a thing where you move left on heads and move right on tails (with wrap-around, of course), and that after a while you're spending 1/6 of your time on each number.<br />
** the next level up is the example from [https://www.cs.ubc.ca/~schmidtm/Courses/540-W19/L35.pdf#page=9 these slides] where 1,...5, have probability mass 1/10 each, and 6 has mass 1/2. now you can't just flip and move to adjacent numbers; you somehow have to distinguish the 6 as being more likely than the other numbers. how do you do it?<br />
* '''where the heck does the accept/reject rule come from? why this division thing to get the threshold?'''<br />
* '''why do we need a transition/proposal matrix, can this matrix be literally anything, and why do we care if it's symmetric?'''<br />
* '''how did anyone ever come up with the idea of using a markov chain, whose stationary distribution is the distribution we're trying to sample from? like, where did the markov chain idea even come from?'''<br />
* '''how could anyone have come up with detailed balance as a sufficient condition for existence of a stationary distribution?'''<br />
** my guess for how it happened historically: people were already interested in markov chains for other reasons, and had built up a large body of theory about markov chains. so then when it came time to work on MCMC-type problems, people were already aware that detailed balance was a thing.<br />
* '''how did anyone even think of the proposal-acceptance decomposition?'''<br />
** [http://phylo.bio.ku.edu/slides/BayesianMCMC-2013.pdf#page=10 mark holder's notes] give one intuition: if you just try to get the transition probabilities to satisfy detailed balance, then the probabilities for some transitions might sum to more than 1, which violates the fact that these transitions are supposed to be probability distributions.<br />
** here's a question: why can't we just sum up the probabilities going out of a node and then renormalize for each node? I am guessing this won't work, but I don't understand why yet.<br />
* '''given the ratio condition for the acceptance function (see https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm#Formal_derivation where A(x',x)/A(x,x') is written as a fraction), why did people choose the particular form for A(x',x)? clearly lots of acceptance functions could have worked, and the particular choice seems unmotivated (although it is particularly simple-looking!).'''<br />
* big idea: https://stackoverflow.com/a/16826118/3422337 -- instead of sampling from 1,...,n by generating 2^k coin flips (in which case we might need to start over unless n is a power of 2), can we somehow "make every coin flip count"? randomly move to your neighbor with probability 1/2 by using the coin flip. it seems intuitive that "in the long run" you'll spend about an equal amount of time in each state.<br />
** this case is simplified in multiple ways: the proposal matrix is symmetric, and the acceptance probabilities are all equal to 1 (i.e. always accept), and the distribution we're trying to sample from is uniform. so everything nicely cancels out<br />
* if a move is rejected, does it matter whether you re-count the current state as a sampled state?<br />
** it seems like it does matter that you re-count it, but '''why does it matter?'''<br />
* '''how do we know that the samples produced by metropolis-hastings are actually representative of the target distribution? the whole motivation is to sample in cases where we can't sample using other methods, i.e. where other methods give "bad" samples (despite theory/proofs), so how do we know the metropolis-hastings samples don't run into similar problems?''' i.e. how do we get a ground truth to test against? because i don't think the theoretical guarantees/theorems for M-H are any stronger than for other methods?<br />
<br />
<br />
"""Our previous methods tend not to work in complex situations:Inverse CDF may not be available.Conditionals needed for ancestral/Gibbs sampling may be hard to compute.Rejection sampling tends to reject almost all samples.Importance sampling tends gives almost zero weight to all samples.""" https://www.cs.ubc.ca/~schmidtm/Courses/540-W17/L22.pdf -- why wouldn't inverse cdf not be available?<br />
<br />
i like this: http://phylo.bio.ku.edu/slides/BayesianMCMC-2013.pdf<br />
"We’d like an MCMC simulation that converges quickly so we should set the transition probabilities as high as possible. So, using m0,1 = 1 and m1,0 = 1/3 sounds best." -- omg!!!! somehow no other resource has ever explained this idea. (daphne koller also makes this point in the coursera course on graphical models)<br />
<br />
<br />
some intuitions from https://www.cise.ufl.edu/class/cap6617fa17/Readings/ChibGreenberg.pdf<br />
<br />
* usually in markov chain settings, we are trying to go from the transition probabilities to the stationary distribution: e.g. we know that if it is sunny then it will be rainy with probability 0.1 and sunny with probability 0.9, and if it is rainy then it will be sunny with probability 0.3 and rainy with probability 0.7, or whatever. and the goal is to find out the stationary distribution, that is, if we randomly wake up one day (without knowing any previous day's weather), what is the probability that it is raining? But in the metropolis-hastings setting, we are doing the reverse: we roughly (up to multiplicative constant) know what the stationary distribution is that we are trying to sample from. The goal is instead to pick the transition probabilities so that they give rise to this stationary distribution.<br />
* we set the acceptance probability to 1 in one of the cases because we aren't visiting that state enough (since detailed balance is violated), so we want to visit it as much as possible, and so the highest value we can choose is 1.<br />
* the min{1, ...} crap is just a shorthand to avoid having to split into cases depending on whether we visit x or y more.<br />
<br />
<br />
read this page later: https://www.statlect.com/fundamentals-of-statistics/Metropolis-Hastings-algorithm<br />
<br />
http://www.stats.ox.ac.uk/~nicholls/MScMCMC15/L5MScMCMC15.pdf<br />
<br />
read this later: https://joa.sh/posts/2016-08-21-metropolis.html</div>IssaRicehttps://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Metropolis%E2%80%93Hastings_algorithm&diff=3313User:IssaRice/Metropolis–Hastings algorithm2021-06-15T07:35:26Z<p>IssaRice: </p>
<hr />
<div>without exception, every single explanation i have seen so far of this absolutely sucks. like, not just "most really suck, and some suck a little". literally everything just sucks really bad. this might be my best guess for the most horribly-explained thing ''ever''.<br />
<br />
in my opinion, the things a good explanation ''must'' cover are:<br />
<br />
* what the heck is sampling, even? once we have a fair coin, use that to generate samples for:<br />
** arbitrary biased coin<br />
** a discrete uniform distribution over 1,...,n<br />
** a continuous uniform(0,1) distribution<br />
** use a continuous uniform to sample from an arbitrary distribution using inverse transform sampling<br />
** bonus: go from a biased coin (with unknown bias) to a fair coin<br />
* '''why doesn't inverse transform sampling work in situations where we have to use metropolis-hastings? another phrasing: why do we need metropolis-hastings if there is inverse transform sampling?'''<br />
** '''like, what the heck does it mean to "only" have access to some constant multiple of the pdf? why would we ever get into such a situation, and why can't we just normalize and then numerically approximate the cdf, and then get the inverse to do inverse transform sampling??? literally NONE of the explanations even RAISE this question. why????'''<br />
* '''an actually convincing ''example'' of MCMC. the stuff i've seen so far are so boring i just don't even care if we can sample from it.''' Since I'm interested in this mainly from an AI/inference point of view, i'd like to see a good example of this using bayes nets and sampling to compute some query.<br />
** a toy example i like is sampling uniformly from 1,...,6 using coin flips. it's easy to see in this case that you can do a thing where you move left on heads and move right on tails (with wrap-around, of course), and that after a while you're spending 1/6 of your time on each number.<br />
** the next level up is the example from [https://www.cs.ubc.ca/~schmidtm/Courses/540-W19/L35.pdf#page=9 these slides] where 1,...5, have probability mass 1/10 each, and 6 has mass 1/2. now you can't just flip and move to adjacent numbers; you somehow have to distinguish the 6 as being more likely than the other numbers. how do you do it?<br />
* '''where the heck does the accept/reject rule come from? why this division thing to get the threshold?'''<br />
* '''why do we need a transition/proposal matrix, can this matrix be literally anything, and why do we care if it's symmetric?'''<br />
* '''how did anyone ever come up with the idea of using a markov chain, whose stationary distribution is the distribution we're trying to sample from? like, where did the markov chain idea even come from?'''<br />
* '''how could anyone have come up with detailed balance as a sufficient condition for existence of a stationary distribution?'''<br />
* '''how did anyone even think of the proposal-acceptance decomposition?'''<br />
** [http://phylo.bio.ku.edu/slides/BayesianMCMC-2013.pdf#page=10 mark holder's notes] give one intuition: if you just try to get the transition probabilities to satisfy detailed balance, then the probabilities for some transitions might sum to more than 1, which violates the fact that these transitions are supposed to be probability distributions.<br />
** here's a question: why can't we just sum up the probabilities going out of a node and then renormalize for each node? I am guessing this won't work, but I don't understand why yet.<br />
* '''given the ratio condition for the acceptance function (see https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm#Formal_derivation where A(x',x)/A(x,x') is written as a fraction), why did people choose the particular form for A(x',x)? clearly lots of acceptance functions could have worked, and the particular choice seems unmotivated (although it is particularly simple-looking!).'''<br />
* big idea: https://stackoverflow.com/a/16826118/3422337 -- instead of sampling from 1,...,n by generating 2^k coin flips (in which case we might need to start over unless n is a power of 2), can we somehow "make every coin flip count"? randomly move to your neighbor with probability 1/2 by using the coin flip. it seems intuitive that "in the long run" you'll spend about an equal amount of time in each state.<br />
** this case is simplified in multiple ways: the proposal matrix is symmetric, and the acceptance probabilities are all equal to 1 (i.e. always accept), and the distribution we're trying to sample from is uniform. so everything nicely cancels out<br />
* if a move is rejected, does it matter whether you re-count the current state as a sampled state?<br />
** it seems like it does matter that you re-count it, but '''why does it matter?'''<br />
* '''how do we know that the samples produced by metropolis-hastings are actually representative of the target distribution? the whole motivation is to sample in cases where we can't sample using other methods, i.e. where other methods give "bad" samples (despite theory/proofs), so how do we know the metropolis-hastings samples don't run into similar problems?''' i.e. how do we get a ground truth to test against? because i don't think the theoretical guarantees/theorems for M-H are any stronger than for other methods?<br />
<br />
<br />
"""Our previous methods tend not to work in complex situations:Inverse CDF may not be available.Conditionals needed for ancestral/Gibbs sampling may be hard to compute.Rejection sampling tends to reject almost all samples.Importance sampling tends gives almost zero weight to all samples.""" https://www.cs.ubc.ca/~schmidtm/Courses/540-W17/L22.pdf -- why wouldn't inverse cdf not be available?<br />
<br />
i like this: http://phylo.bio.ku.edu/slides/BayesianMCMC-2013.pdf<br />
"We’d like an MCMC simulation that converges quickly so we should set the transition probabilities as high as possible. So, using m0,1 = 1 and m1,0 = 1/3 sounds best." -- omg!!!! somehow no other resource has ever explained this idea. (daphne koller also makes this point in the coursera course on graphical models)<br />
<br />
<br />
some intuitions from https://www.cise.ufl.edu/class/cap6617fa17/Readings/ChibGreenberg.pdf<br />
<br />
* usually in markov chain settings, we are trying to go from the transition probabilities to the stationary distribution: e.g. we know that if it is sunny then it will be rainy with probability 0.1 and sunny with probability 0.9, and if it is rainy then it will be sunny with probability 0.3 and rainy with probability 0.7, or whatever. and the goal is to find out the stationary distribution, that is, if we randomly wake up one day (without knowing any previous day's weather), what is the probability that it is raining? But in the metropolis-hastings setting, we are doing the reverse: we roughly (up to multiplicative constant) know what the stationary distribution is that we are trying to sample from. The goal is instead to pick the transition probabilities so that they give rise to this stationary distribution.<br />
* we set the acceptance probability to 1 in one of the cases because we aren't visiting that state enough (since detailed balance is violated), so we want to visit it as much as possible, and so the highest value we can choose is 1.<br />
* the min{1, ...} crap is just a shorthand to avoid having to split into cases depending on whether we visit x or y more.<br />
<br />
<br />
read this page later: https://www.statlect.com/fundamentals-of-statistics/Metropolis-Hastings-algorithm<br />
<br />
http://www.stats.ox.ac.uk/~nicholls/MScMCMC15/L5MScMCMC15.pdf<br />
<br />
read this later: https://joa.sh/posts/2016-08-21-metropolis.html</div>IssaRicehttps://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Metropolis%E2%80%93Hastings_algorithm&diff=3312User:IssaRice/Metropolis–Hastings algorithm2021-06-15T07:32:41Z<p>IssaRice: </p>
<hr />
<div>without exception, every single explanation i have seen so far of this absolutely sucks. like, not just "most really suck, and some suck a little". literally everything just sucks really bad. this might be my best guess for the most horribly-explained thing ''ever''.<br />
<br />
in my opinion, the things a good explanation ''must'' cover are:<br />
<br />
* what the heck is sampling, even? once we have a fair coin, use that to generate samples for:<br />
** arbitrary biased coin<br />
** a discrete uniform distribution over 1,...,n<br />
** a continuous uniform(0,1) distribution<br />
** use a continuous uniform to sample from an arbitrary distribution using inverse transform sampling<br />
** bonus: go from a biased coin (with unknown bias) to a fair coin<br />
* '''why doesn't inverse transform sampling work in situations where we have to use metropolis-hastings? another phrasing: why do we need metropolis-hastings if there is inverse transform sampling?'''<br />
** '''like, what the heck does it mean to "only" have access to some constant multiple of the pdf? why would we ever get into such a situation, and why can't we just normalize and then numerically approximate the cdf, and then get the inverse to do inverse transform sampling??? literally NONE of the explanations even RAISE this question. why????'''<br />
* '''an actually convincing ''example'' of MCMC. the stuff i've seen so far are so boring i just don't even care if we can sample from it.'''<br />
** a toy example i like is sampling uniformly from 1,...,6 using coin flips. it's easy to see in this case that you can do a thing where you move left on heads and move right on tails (with wrap-around, of course), and that after a while you're spending 1/6 of your time on each number.<br />
** the next level up is the example from [https://www.cs.ubc.ca/~schmidtm/Courses/540-W19/L35.pdf#page=9 these slides] where 1,...5, have probability mass 1/10 each, and 6 has mass 1/2. now you can't just flip and move to adjacent numbers; you somehow have to distinguish the 6 as being more likely than the other numbers. how do you do it?<br />
* '''where the heck does the accept/reject rule come from? why this division thing to get the threshold?'''<br />
* '''why do we need a transition/proposal matrix, can this matrix be literally anything, and why do we care if it's symmetric?'''<br />
* '''how did anyone ever come up with the idea of using a markov chain, whose stationary distribution is the distribution we're trying to sample from? like, where did the markov chain idea even come from?'''<br />
* '''how could anyone have come up with detailed balance as a sufficient condition for existence of a stationary distribution?'''<br />
* '''how did anyone even think of the proposal-acceptance decomposition?'''<br />
** [http://phylo.bio.ku.edu/slides/BayesianMCMC-2013.pdf#page=10 mark holder's notes] give one intuition: if you just try to get the transition probabilities to satisfy detailed balance, then the probabilities for some transitions might sum to more than 1, which violates the fact that these transitions are supposed to be probability distributions.<br />
** here's a question: why can't we just sum up the probabilities going out of a node and then renormalize for each node? I am guessing this won't work, but I don't understand why yet.<br />
* '''given the ratio condition for the acceptance function (see https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm#Formal_derivation where A(x',x)/A(x,x') is written as a fraction), why did people choose the particular form for A(x',x)? clearly lots of acceptance functions could have worked, and the particular choice seems unmotivated (although it is particularly simple-looking!).'''<br />
* big idea: https://stackoverflow.com/a/16826118/3422337 -- instead of sampling from 1,...,n by generating 2^k coin flips (in which case we might need to start over unless n is a power of 2), can we somehow "make every coin flip count"? randomly move to your neighbor with probability 1/2 by using the coin flip. it seems intuitive that "in the long run" you'll spend about an equal amount of time in each state.<br />
** this case is simplified in multiple ways: the proposal matrix is symmetric, and the acceptance probabilities are all equal to 1 (i.e. always accept), and the distribution we're trying to sample from is uniform. so everything nicely cancels out<br />
* if a move is rejected, does it matter whether you re-count the current state as a sampled state?<br />
** it seems like it does matter that you re-count it, but '''why does it matter?'''<br />
* '''how do we know that the samples produced by metropolis-hastings are actually representative of the target distribution? the whole motivation is to sample in cases where we can't sample using other methods, i.e. where other methods give "bad" samples (despite theory/proofs), so how do we know the metropolis-hastings samples don't run into similar problems?''' i.e. how do we get a ground truth to test against? because i don't think the theoretical guarantees/theorems for M-H are any stronger than for other methods?<br />
<br />
<br />
"""Our previous methods tend not to work in complex situations:Inverse CDF may not be available.Conditionals needed for ancestral/Gibbs sampling may be hard to compute.Rejection sampling tends to reject almost all samples.Importance sampling tends gives almost zero weight to all samples.""" https://www.cs.ubc.ca/~schmidtm/Courses/540-W17/L22.pdf -- why wouldn't inverse cdf not be available?<br />
<br />
i like this: http://phylo.bio.ku.edu/slides/BayesianMCMC-2013.pdf<br />
"We’d like an MCMC simulation that converges quickly so we should set the transition probabilities as high as possible. So, using m0,1 = 1 and m1,0 = 1/3 sounds best." -- omg!!!! somehow no other resource has ever explained this idea. (daphne koller also makes this point in the coursera course on graphical models)<br />
<br />
<br />
some intuitions from https://www.cise.ufl.edu/class/cap6617fa17/Readings/ChibGreenberg.pdf<br />
<br />
* usually in markov chain settings, we are trying to go from the transition probabilities to the stationary distribution: e.g. we know that if it is sunny then it will be rainy with probability 0.1 and sunny with probability 0.9, and if it is rainy then it will be sunny with probability 0.3 and rainy with probability 0.7, or whatever. and the goal is to find out the stationary distribution, that is, if we randomly wake up one day (without knowing any previous day's weather), what is the probability that it is raining? But in the metropolis-hastings setting, we are doing the reverse: we roughly (up to multiplicative constant) know what the stationary distribution is that we are trying to sample from. The goal is instead to pick the transition probabilities so that they give rise to this stationary distribution.<br />
* we set the acceptance probability to 1 in one of the cases because we aren't visiting that state enough (since detailed balance is violated), so we want to visit it as much as possible, and so the highest value we can choose is 1.<br />
* the min{1, ...} crap is just a shorthand to avoid having to split into cases depending on whether we visit x or y more.<br />
<br />
<br />
read this page later: https://www.statlect.com/fundamentals-of-statistics/Metropolis-Hastings-algorithm<br />
<br />
http://www.stats.ox.ac.uk/~nicholls/MScMCMC15/L5MScMCMC15.pdf<br />
<br />
read this later: https://joa.sh/posts/2016-08-21-metropolis.html</div>IssaRicehttps://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Metropolis%E2%80%93Hastings_algorithm&diff=3311User:IssaRice/Metropolis–Hastings algorithm2021-06-15T07:29:15Z<p>IssaRice: </p>
<hr />
<div>without exception, every single explanation i have seen so far of this absolutely sucks. like, not just "most really suck, and some suck a little". literally everything just sucks really bad. this might be my best guess for the most horribly-explained thing ''ever''.<br />
<br />
in my opinion, the things a good explanation ''must'' cover are:<br />
<br />
* what the heck is sampling, even? once we have a fair coin, use that to generate samples for:<br />
** arbitrary biased coin<br />
** a discrete uniform distribution over 1,...,n<br />
** a continuous uniform(0,1) distribution<br />
** use a continuous uniform to sample from an arbitrary distribution using inverse transform sampling<br />
** bonus: go from a biased coin (with unknown bias) to a fair coin<br />
* '''why doesn't inverse transform sampling work in situations where we have to use metropolis-hastings? another phrasing: why do we need metropolis-hastings if there is inverse transform sampling?'''<br />
** '''like, what the heck does it mean to "only" have access to some constant multiple of the pdf? why would we ever get into such a situation, and why can't we just normalize and then numerically approximate the cdf, and then get the inverse to do inverse transform sampling??? literally NONE of the explanations even RAISE this question. why????'''<br />
* '''an actually convincing ''example'' of MCMC. the stuff i've seen so far are so boring i just don't even care if we can sample from it.'''<br />
** a toy example i like is sampling uniformly from 1,...,6 using coin flips. it's easy to see in this case that you can do a thing where you move left on heads and move right on tails (with wrap-around, of course), and that after a while you're spending 1/6 of your time on each number.<br />
** the next level up is the example from [https://www.cs.ubc.ca/~schmidtm/Courses/540-W19/L35.pdf#page=9 these slides] where 1,...5, have probability mass 1/10 each, and 6 has mass 1/2. now you can't just flip and move to adjacent numbers; you somehow have to distinguish the 6 as being more likely than the other numbers. how do you do it?<br />
* '''where the heck does the accept/reject rule come from? why this division thing to get the threshold?'''<br />
* '''why do we need a transition/proposal matrix, can this matrix be literally anything, and why do we care if it's symmetric?'''<br />
* '''how did anyone ever come up with the idea of using a markov chain, whose stationary distribution is the distribution we're trying to sample from? like, where did the markov chain idea even come from?'''<br />
* '''how could anyone have come up with detailed balance as a sufficient condition for existence of a stationary distribution?'''<br />
* '''how did anyone even think of the proposal-acceptance decomposition?'''<br />
** [http://phylo.bio.ku.edu/slides/BayesianMCMC-2013.pdf#page=10 mark holder's notes] give one intuition: if you just try to choose the transition probabilities directly to satisfy detailed balance, then the outgoing probabilities from some state might sum to more than 1, which violates the requirement that each row of the transition matrix be a probability distribution.<br />
** here's a question: why can't we just sum up the probabilities going out of a node and then renormalize for each node? I am guessing this won't work, but I don't understand why yet.<br />
* '''given the ratio condition for the acceptance function (see https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm#Formal_derivation where A(x',x)/A(x,x') is written as a fraction), why did people choose the particular form for A(x',x)? clearly lots of acceptance functions could have worked, and the particular choice seems unmotivated (although it is particularly simple-looking!).'''<br />
* big idea: https://stackoverflow.com/a/16826118/3422337 -- instead of sampling from 1,...,n by generating k coin flips (which gives 2^k equally likely outcomes, so we might need to start over unless n is a power of 2), can we somehow "make every coin flip count"? randomly move to your neighbor with probability 1/2 by using the coin flip. it seems intuitive that "in the long run" you'll spend about an equal amount of time in each state.<br />
** this case is simplified in multiple ways: the proposal matrix is symmetric, and the acceptance probabilities are all equal to 1 (i.e. always accept), and the distribution we're trying to sample from is uniform. so everything nicely cancels out<br />
* if a move is rejected, does it matter whether you re-count the current state as a sampled state?<br />
** it seems like it does matter that you re-count it, but '''why does it matter?'''<br />
* '''how do we know that the samples produced by metropolis-hastings are actually representative of the target distribution? the whole motivation is to sample in cases where we can't sample using other methods, i.e. where other methods give "bad" samples (despite theory/proofs), so how do we know the metropolis-hastings samples don't run into similar problems?''' i.e. how do we get a ground truth to test against? because i don't think the theoretical guarantees/theorems for M-H are any stronger than for other methods?<br />
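to make the first bullet concrete, here's a minimal sketch (mine, not from any of the linked resources) of climbing the ladder from a fair coin to everything else; the exponential at the end is just an arbitrary example of inverse transform sampling:<br />

```python
import math
import random

def fair_coin():
    # Assume this is the only primitive we have: 0 or 1, each w.p. 1/2.
    return random.randrange(2)

def uniform01(bits=32):
    # Approximate Uniform(0,1): read coin flips as binary digits.
    return sum(fair_coin() * 2.0 ** -(i + 1) for i in range(bits))

def biased_coin(p):
    # Biased coin with P(1) = p: compare a Uniform(0,1) draw against p.
    return 1 if uniform01() < p else 0

def discrete_uniform(n):
    # Uniform over 1..n: flip enough bits for 2^k >= n outcomes and
    # start over whenever the result lands outside 1..n.
    k = max(n - 1, 1).bit_length()
    while True:
        x = sum(fair_coin() << i for i in range(k))
        if x < n:
            return x + 1

def inverse_transform(inv_cdf):
    # Sample any distribution whose inverse CDF we can evaluate.
    return inv_cdf(uniform01())

# e.g. Exponential(1), whose inverse CDF is -log(1 - u):
x = inverse_transform(lambda u: -math.log(1 - u))
```

note the restart loop in discrete_uniform: this is exactly the "start over unless n is a power of 2" annoyance from the stack overflow bullet below.<br />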
<br />
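the 1/10,...,1/10, 1/2 example above can be run as a tiny Metropolis chain. this sketch is my own illustration, assuming a symmetric left/right proposal on the ring; it also shows where the "only up to a constant" business bites, since only unnormalized weights ever appear:<br />

```python
import random
from collections import Counter

# Unnormalized target over states 1..6: the first five states get
# weight 1 and state 6 gets weight 5 (i.e. masses 1/10, ..., 1/10, 1/2).
weight = {s: (5 if s == 6 else 1) for s in range(1, 7)}

def propose(s):
    # Symmetric random-walk proposal on the ring 1..6: one step left or
    # right with probability 1/2 each (with wrap-around).
    step = 1 if random.random() < 0.5 else -1
    return (s - 1 + step) % 6 + 1

def metropolis(n_steps, seed=0):
    random.seed(seed)
    s = 1
    counts = Counter()
    for _ in range(n_steps):
        t = propose(s)
        # Metropolis rule: accept with probability min(1, w(t)/w(s)).
        # The symmetric proposal cancels out of the ratio, and the
        # normalizing constant (here 10) is never needed at all.
        if random.random() < min(1.0, weight[t] / weight[s]):
            s = t
        # On rejection we fall through and re-count the current state;
        # skipping this re-count would under-weight state 6.
        counts[s] += 1
    return counts

counts = metropolis(200_000)
fractions = {s: counts[s] / 200_000 for s in weight}
```

running this, state 6 collects roughly half the visits and each other state roughly a tenth, matching the target masses.<br />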
<br />
"""Our previous methods tend not to work in complex situations:Inverse CDF may not be available.Conditionals needed for ancestral/Gibbs sampling may be hard to compute.Rejection sampling tends to reject almost all samples.Importance sampling tends gives almost zero weight to all samples.""" https://www.cs.ubc.ca/~schmidtm/Courses/540-W17/L22.pdf -- why wouldn't inverse cdf not be available?<br />
<br />
i like this: http://phylo.bio.ku.edu/slides/BayesianMCMC-2013.pdf<br />
"We’d like an MCMC simulation that converges quickly so we should set the transition probabilities as high as possible. So, using m0,1 = 1 and m1,0 = 1/3 sounds best." -- omg!!!! somehow no other resource has ever explained this idea. (daphne koller also makes this point in the coursera course on graphical models)<br />
<br />
<br />
some intuitions from https://www.cise.ufl.edu/class/cap6617fa17/Readings/ChibGreenberg.pdf<br />
<br />
* usually in markov chain settings, we are trying to go from the transition probabilities to the stationary distribution: e.g. we know that if it is sunny then it will be rainy with probability 0.1 and sunny with probability 0.9, and if it is rainy then it will be sunny with probability 0.3 and rainy with probability 0.7, or whatever. and the goal is to find out the stationary distribution, that is, if we randomly wake up one day (without knowing any previous day's weather), what is the probability that it is raining? But in the metropolis-hastings setting, we are doing the reverse: we roughly (up to multiplicative constant) know what the stationary distribution is that we are trying to sample from. The goal is instead to pick the transition probabilities so that they give rise to this stationary distribution.<br />
* we set the acceptance probability to 1 in one of the cases because we aren't visiting that state enough (since detailed balance is violated), so we want to visit it as much as possible, and so the highest value we can choose is 1.<br />
* the min{1, ...} crap is just a shorthand to avoid having to split into cases depending on whether we visit x or y more.<br />
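the "usual direction" in the first intuition is easy to compute directly; this sketch just iterates the weather transition probabilities from the bullet until the distribution stops moving:<br />

```python
# Forward problem: given the transition probabilities, find the
# stationary distribution by repeatedly applying them.
P = {
    ("sunny", "sunny"): 0.9, ("sunny", "rainy"): 0.1,
    ("rainy", "sunny"): 0.3, ("rainy", "rainy"): 0.7,
}

def evolve(dist, steps=100):
    for _ in range(steps):
        dist = {s: sum(dist[r] * P[(r, s)] for r in dist) for s in dist}
    return dist

stationary = evolve({"sunny": 0.5, "rainy": 0.5})
# Waking up on a random day: P(sunny) = 3/4, P(rainy) = 1/4.
# Metropolis-Hastings runs this in reverse: fix the stationary
# distribution (up to a constant) and solve for usable transitions.
```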
<br />
<br />
read this page later: https://www.statlect.com/fundamentals-of-statistics/Metropolis-Hastings-algorithm<br />
<br />
http://www.stats.ox.ac.uk/~nicholls/MScMCMC15/L5MScMCMC15.pdf<br />
<br />
read this later: https://joa.sh/posts/2016-08-21-metropolis.html</div>IssaRicehttps://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Metropolis%E2%80%93Hastings_algorithm&diff=3310User:IssaRice/Metropolis–Hastings algorithm2021-06-15T07:28:39Z<p>IssaRice: </p>
<hr />
<div>without exception, every single explanation i have seen so far of this absolutely sucks. like, not just "most really suck, and some suck a little". literally everything just sucks really bad. this might be my best guess for the most horribly-explained thing ''ever''.<br />
<br />
in my opinion, the things a good explanation ''must'' cover are:<br />
<br />
* what the heck is sampling, even? once we have a fair coin, use that to generate samples for:<br />
** arbitrary biased coin<br />
** a discrete uniform distribution over 1,...,n<br />
** a continuous uniform(0,1) distribution<br />
** use a continuous uniform to sample from an arbitrary distribution using inverse transform sampling<br />
** bonus: go from a biased coin (with unknown bias) to a fair coin<br />
* '''why doesn't inverse transform sampling work in situations where we have to use metropolis-hastings? another phrasing: why do we need metropolis-hastings if there is inverse transform sampling?'''<br />
** '''like, what the heck does it mean to "only" have access to some constant multiple of the pdf? why would we ever get into such a situation, and why can't we just normalize and then numerically approximate the cdf, and then get the inverse to do inverse transform sampling??? literally NONE of the explanations even RAISE this question. why????'''<br />
* '''an actually convincing ''example'' of MCMC. the stuff i've seen so far are so boring i just don't even care if we can sample from it.'''<br />
** a toy example i like is sampling uniformly from 1,...,6 using coin flips. it's easy to see in this case that you can do a thing where you move left on heads and move right on tails (with wrap-around, of course), and that after a while you're spending 1/6 of your time on each number.<br />
** the next level up is the example from [https://www.cs.ubc.ca/~schmidtm/Courses/540-W19/L35.pdf#page=9 these slides] where 1,...5, have probability mass 1/10 each, and 6 has mass 1/2. now you can't just flip and move to adjacent numbers; you somehow have to distinguish the 6 as being more likely than the other numbers. how do you do it?<br />
* '''where the heck does the accept/reject rule come from? why this division thing to get the threshold?'''<br />
* '''why do we need a transition/proposal matrix, can this matrix be literally anything, and why do we care if it's symmetric?'''<br />
* '''how did anyone ever come up with the idea of using a markov chain, whose stationary distribution is the distribution we're trying to sample from? like, where did the markov chain idea even come from?'''<br />
* '''how could anyone have come up with detailed balance as a sufficient condition for existence of a stationary distribution?'''<br />
* '''how did anyone even think of the proposal-acceptance decomposition?'''<br />
** [http://phylo.bio.ku.edu/slides/BayesianMCMC-2013.pdf#page=10 mark holder's notes] give one intuition: if you just try to do the transition probabilities to satisfy detailed balance, then the probabilities for some transitions might sum to more than 1, which violates the fact that these transitions are supposed to be probability distributions.<br />
** here's a question: why can't we just sum up the probabilities going out of a node and then renormalize for each node? I am guessing this won't work, but I don't understand why yet.<br />
* '''given the ratio condition for the acceptance function (see https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm#Formal_derivation where A(x',x)/A(x,x') is written as a fraction), why did people choose the particular form for A(x',x)? clearly lots of acceptance functions could have worked, and the particular choice seems unmotivated (although it is particularly simple-looking!).'''<br />
* big idea: https://stackoverflow.com/a/16826118/3422337 -- instead of sampling from 1,...,n by generating 2^k coin flips (in which case we might need to start over unless n is a power of 2), can we somehow "make every coin flip count"? randomly move to your neighbor with probability 1/2 by using the coin flip. it seems intuitive that "in the long run" you'll spend about an equal amount of time in each state.<br />
** this case is simplified in multiple ways: the proposal matrix is symmetric, and the acceptance probabilities are all equal to 1 (i.e. always accept), and the distribution we're trying to sample from is uniform. so everything nicely cancels out<br />
* if a move is rejected, does it matter whether you re-count the current state as a sampled state?<br />
** it seems like it does matter that you re-count it, but '''why does it matter?'''<br />
* '''how do we know that the samples produced by metropolis-hastings are actually representative of the target distribution? the whole motivation is to sample in cases where we can't sample using other methods, i.e. where other methods give "bad" samples (despite theory/proofs), so how do we know the metropolis-hastings samples don't run into similar problems?''' i.e. how do we get a ground truth to test against?<br />
<br />
<br />
"""Our previous methods tend not to work in complex situations:Inverse CDF may not be available.Conditionals needed for ancestral/Gibbs sampling may be hard to compute.Rejection sampling tends to reject almost all samples.Importance sampling tends gives almost zero weight to all samples.""" https://www.cs.ubc.ca/~schmidtm/Courses/540-W17/L22.pdf -- why wouldn't inverse cdf not be available?<br />
<br />
i like this: http://phylo.bio.ku.edu/slides/BayesianMCMC-2013.pdf<br />
"We’d like an MCMC simulation that converges quickly so we should set the transition probabilities as high as possible. So, using m0,1 = 1 and m1,0 = 1/3 sounds best." -- omg!!!! somehow no other resource has ever explained this idea. (daphne koller also makes this point in the coursera course on graphical models)<br />
<br />
<br />
some intuitions from https://www.cise.ufl.edu/class/cap6617fa17/Readings/ChibGreenberg.pdf<br />
<br />
* usually in markov chain settings, we are trying to go from the transition probabilities to the stationary distribution: e.g. we know that if it is sunny then it will be rainy with probability 0.1 and sunny with probability 0.9, and if it is rainy then it will be sunny with probability 0.3 and rainy with probability 0.7, or whatever. and the goal is to find out the stationary distribution, that is, if we randomly wake up one day (without knowing any previous day's weather), what is the probability that it is raining? But in the metropolis-hastings setting, we are doing the reverse: we roughly (up to multiplicative constant) know what the stationary distribution is that we are trying to sample from. The goal is instead to pick the transition probabilities so that they give rise to this stationary distribution.<br />
* we set the acceptance probability to 1 in one of the cases because we aren't visiting that state enough (since detailed balance is violated), so we want to visit it as much as possible, and so the highest value we can choose is 1.<br />
* the min{1, ...} crap is just a shorthand to avoid having to split into cases depending on whether we visit x or y more.<br />
<br />
<br />
read this page later: https://www.statlect.com/fundamentals-of-statistics/Metropolis-Hastings-algorithm<br />
<br />
http://www.stats.ox.ac.uk/~nicholls/MScMCMC15/L5MScMCMC15.pdf<br />
<br />
read this later: https://joa.sh/posts/2016-08-21-metropolis.html</div>IssaRicehttps://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Metropolis%E2%80%93Hastings_algorithm&diff=3309User:IssaRice/Metropolis–Hastings algorithm2021-06-15T07:18:17Z<p>IssaRice: </p>
<hr />
<div>without exception, every single explanation i have seen so far of this absolutely sucks. like, not just "most really suck, and some suck a little". literally everything just sucks really bad. this might be my best guess for the most horribly-explained thing ''ever''.<br />
<br />
in my opinion, the things a good explanation ''must'' cover are:<br />
<br />
* what the heck is sampling, even? once we have a fair coin, use that to generate samples for:<br />
** arbitrary biased coin<br />
** a discrete uniform distribution over 1,...,n<br />
** a continuous uniform(0,1) distribution<br />
** use a continuous uniform to sample from an arbitrary distribution using inverse transform sampling<br />
** bonus: go from a biased coin (with unknown bias) to a fair coin<br />
* '''why doesn't inverse transform sampling work in situations where we have to use metropolis-hastings? another phrasing: why do we need metropolis-hastings if there is inverse transform sampling?'''<br />
** '''like, what the heck does it mean to "only" have access to some constant multiple of the pdf? why would we ever get into such a situation, and why can't we just normalize and then numerically approximate the cdf, and then get the inverse to do inverse transform sampling??? literally NONE of the explanations even RAISE this question. why????'''<br />
* '''an actually convincing ''example'' of MCMC. the stuff i've seen so far are so boring i just don't even care if we can sample from it.'''<br />
** a toy example i like is sampling uniformly from 1,...,6 using coin flips. it's easy to see in this case that you can do a thing where you move left on heads and move right on tails (with wrap-around, of course), and that after a while you're spending 1/6 of your time on each number.<br />
** the next level up is the example from [https://www.cs.ubc.ca/~schmidtm/Courses/540-W19/L35.pdf#page=9 these slides] where 1,...5, have probability mass 1/10 each, and 6 has mass 1/2. now you can't just flip and move to adjacent numbers; you somehow have to distinguish the 6 as being more likely than the other numbers. how do you do it?<br />
* '''where the heck does the accept/reject rule come from? why this division thing to get the threshold?'''<br />
* '''why do we need a transition/proposal matrix, can this matrix be literally anything, and why do we care if it's symmetric?'''<br />
* '''how did anyone ever come up with the idea of using a markov chain, whose stationary distribution is the distribution we're trying to sample from? like, where did the markov chain idea even come from?'''<br />
* '''how could anyone have come up with detailed balance as a sufficient condition for existence of a stationary distribution?'''<br />
* '''how did anyone even think of the proposal-acceptance decomposition?'''<br />
** [http://phylo.bio.ku.edu/slides/BayesianMCMC-2013.pdf#page=10 mark holder's notes] give one intuition: if you just try to do the transition probabilities to satisfy detailed balance, then the probabilities for some transitions might sum to more than 1, which violates the fact that these transitions are supposed to be probability distributions.<br />
** here's a question: why can't we just sum up the probabilities going out of a node and then renormalize for each node? I am guessing this won't work, but I don't understand why yet.<br />
* '''given the ratio condition for the acceptance function (see https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm#Formal_derivation where A(x',x)/A(x,x') is written as a fraction), why did people choose the particular form for A(x',x)? clearly lots of acceptance functions could have worked, and the particular choice seems unmotivated (although it is particularly simple-looking!).'''<br />
* big idea: https://stackoverflow.com/a/16826118/3422337 -- instead of sampling from 1,...,n by generating 2^k coin flips (in which case we might need to start over unless n is a power of 2), can we somehow "make every coin flip count"? randomly move to your neighbor with probability 1/2 by using the coin flip. it seems intuitive that "in the long run" you'll spend about an equal amount of time in each state.<br />
** this case is simplified in multiple ways: the proposal matrix is symmetric, and the acceptance probabilities are all equal to 1 (i.e. always accept), and the distribution we're trying to sample from is uniform. so everything nicely cancels out<br />
* if a move is rejected, does it matter whether you re-count the current state as a sampled state?<br />
** it seems like it does matter that you re-count it, but '''why does it matter?'''<br />
* '''how do we know that the samples produced by metropolis-hastings are actually representative of the target distribution? the whole motivation is to sample in cases where we can't sample using other methods, i.e. where other methods give "bad" samples (despite theory/proofs), so how do we know the metropolis-hastings samples don't run into similar problems?'''<br />
<br />
<br />
"""Our previous methods tend not to work in complex situations:Inverse CDF may not be available.Conditionals needed for ancestral/Gibbs sampling may be hard to compute.Rejection sampling tends to reject almost all samples.Importance sampling tends gives almost zero weight to all samples.""" https://www.cs.ubc.ca/~schmidtm/Courses/540-W17/L22.pdf -- why wouldn't inverse cdf not be available?<br />
<br />
i like this: http://phylo.bio.ku.edu/slides/BayesianMCMC-2013.pdf<br />
"We’d like an MCMC simulation that converges quickly so we should set the transition probabilities as high as possible. So, using m0,1 = 1 and m1,0 = 1/3 sounds best." -- omg!!!! somehow no other resource has ever explained this idea. (daphne koller also makes this point in the coursera course on graphical models)<br />
<br />
<br />
some intuitions from https://www.cise.ufl.edu/class/cap6617fa17/Readings/ChibGreenberg.pdf<br />
<br />
* usually in markov chain settings, we are trying to go from the transition probabilities to the stationary distribution: e.g. we know that if it is sunny then it will be rainy with probability 0.1 and sunny with probability 0.9, and if it is rainy then it will be sunny with probability 0.3 and rainy with probability 0.7, or whatever. and the goal is to find out the stationary distribution, that is, if we randomly wake up one day (without knowing any previous day's weather), what is the probability that it is raining? But in the metropolis-hastings setting, we are doing the reverse: we roughly (up to multiplicative constant) know what the stationary distribution is that we are trying to sample from. The goal is instead to pick the transition probabilities so that they give rise to this stationary distribution.<br />
* we set the acceptance probability to 1 in one of the cases because we aren't visiting that state enough (since detailed balance is violated), so we want to visit it as much as possible, and so the highest value we can choose is 1.<br />
* the min{1, ...} crap is just a shorthand to avoid having to split into cases depending on whether we visit x or y more.<br />
<br />
<br />
read this page later: https://www.statlect.com/fundamentals-of-statistics/Metropolis-Hastings-algorithm<br />
<br />
http://www.stats.ox.ac.uk/~nicholls/MScMCMC15/L5MScMCMC15.pdf<br />
<br />
read this later: https://joa.sh/posts/2016-08-21-metropolis.html</div>IssaRicehttps://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Metropolis%E2%80%93Hastings_algorithm&diff=3308User:IssaRice/Metropolis–Hastings algorithm2021-06-14T23:43:30Z<p>IssaRice: </p>
<hr />
<div>without exception, every single explanation i have seen so far of this absolutely sucks. like, not just "most really suck, and some suck a little". literally everything just sucks really bad. this might be my best guess for the most horribly-explained thing ''ever''.<br />
<br />
in my opinion, the things a good explanation ''must'' cover are:<br />
<br />
* what the heck is sampling, even? once we have a fair coin, use that to generate samples for:<br />
** arbitrary biased coin<br />
** a discrete uniform distribution over 1,...,n<br />
** a continuous uniform(0,1) distribution<br />
** use a continuous uniform to sample from an arbitrary distribution using inverse transform sampling<br />
** bonus: go from a biased coin (with unknown bias) to a fair coin<br />
* '''why doesn't inverse transform sampling work in situations where we have to use metropolis-hastings? another phrasing: why do we need metropolis-hastings if there is inverse transform sampling?'''<br />
** '''like, what the heck does it mean to "only" have access to some constant multiple of the pdf? why would we ever get into such a situation, and why can't we just normalize and then numerically approximate the cdf, and then get the inverse to do inverse transform sampling??? literally NONE of the explanations even RAISE this question. why????'''<br />
* '''an actually convincing ''example'' of MCMC. the stuff i've seen so far are so boring i just don't even care if we can sample from it.'''<br />
** a toy example i like is sampling uniformly from 1,...,6 using coin flips. it's easy to see in this case that you can do a thing where you move left on heads and move right on tails (with wrap-around, of course), and that after a while you're spending 1/6 of your time on each number.<br />
** the next level up is the example from [https://www.cs.ubc.ca/~schmidtm/Courses/540-W19/L35.pdf#page=9 these slides] where 1,...5, have probability mass 1/10 each, and 6 has mass 1/2. now you can't just flip and move to adjacent numbers; you somehow have to distinguish the 6 as being more likely than the other numbers. how do you do it?<br />
* '''where the heck does the accept/reject rule come from? why this division thing to get the threshold?'''<br />
* '''why do we need a transition/proposal matrix, can this matrix be literally anything, and why do we care if it's symmetric?'''<br />
* '''how did anyone ever come up with the idea of using a markov chain, whose stationary distribution is the distribution we're trying to sample from? like, where did the markov chain idea even come from?'''<br />
* '''how could anyone have come up with detailed balance as a sufficient condition for existence of a stationary distribution?'''<br />
* '''how did anyone even think of the proposal-acceptance decomposition?'''<br />
** [http://phylo.bio.ku.edu/slides/BayesianMCMC-2013.pdf#page=10 mark holder's notes] give one intuition: if you just try to do the transition probabilities to satisfy detailed balance, then the probabilities for some transitions might sum to more than 1, which violates the fact that these transitions are supposed to be probability distributions.<br />
** here's a question: why can't we just sum up the probabilities going out of a node and then renormalize for each node? I am guessing this won't work, but I don't understand why yet.<br />
* '''given the ratio condition for the acceptance function (see https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm#Formal_derivation where A(x',x)/A(x,x') is written as a fraction), why did people choose the particular form for A(x',x)? clearly lots of acceptance functions could have worked, and the particular choice seems unmotivated (although it is particularly simple-looking!).'''<br />
* big idea: https://stackoverflow.com/a/16826118/3422337 -- instead of sampling from 1,...,n by generating 2^k coin flips (in which case we might need to start over unless n is a power of 2), can we somehow "make every coin flip count"? randomly move to your neighbor with probability 1/2 by using the coin flip. it seems intuitive that "in the long run" you'll spend about an equal amount of time in each state.<br />
** this case is simplified in multiple ways: the proposal matrix is symmetric, and the acceptance probabilities are all equal to 1 (i.e. always accept), and the distribution we're trying to sample from is uniform. so everything nicely cancels out<br />
* if a move is rejected, does it matter whether you re-count the current state as a sampled state?<br />
** it seems like it does matter that you re-count it, but '''why does it matter?'''<br />
* '''how do we know that the samples produced by metropolis-hastings are actually representative of the target distribution? the whole motivation is to sample in cases where we can't sample using other methods, i.e. where other methods give "bad" samples (despite theory/proofs), so how do we know the metropolis-hastings samples don't run into similar problems?'''<br />
<br />
<br />
"""Our previous methods tend not to work in complex situations:Inverse CDF may not be available.Conditionals needed for ancestral/Gibbs sampling may be hard to compute.Rejection sampling tends to reject almost all samples.Importance sampling tends gives almost zero weight to all samples.""" https://www.cs.ubc.ca/~schmidtm/Courses/540-W17/L22.pdf -- why wouldn't inverse cdf not be available?<br />
<br />
i like this: http://phylo.bio.ku.edu/slides/BayesianMCMC-2013.pdf<br />
"We’d like an MCMC simulation that converges quickly so we should set the transition probabilities as high as possible. So, using m0,1 = 1 and m1,0 = 1/3 sounds best." -- omg!!!! somehow no other resource has ever explained this idea.<br />
<br />
<br />
some intuitions from https://www.cise.ufl.edu/class/cap6617fa17/Readings/ChibGreenberg.pdf<br />
<br />
* usually in markov chain settings, we are trying to go from the transition probabilities to the stationary distribution: e.g. we know that if it is sunny then it will be rainy with probability 0.1 and sunny with probability 0.9, and if it is rainy then it will be sunny with probability 0.3 and rainy with probability 0.7, or whatever. and the goal is to find out the stationary distribution, that is, if we randomly wake up one day (without knowing any previous day's weather), what is the probability that it is raining? But in the metropolis-hastings setting, we are doing the reverse: we roughly (up to multiplicative constant) know what the stationary distribution is that we are trying to sample from. The goal is instead to pick the transition probabilities so that they give rise to this stationary distribution.<br />
* we set the acceptance probability to 1 in one of the cases because we aren't visiting that state enough (since detailed balance is violated), so we want to visit it as much as possible, and so the highest value we can choose is 1.<br />
* the min{1, ...} crap is just a shorthand to avoid having to split into cases depending on whether we visit x or y more.<br />
<br />
<br />
read this page later: https://www.statlect.com/fundamentals-of-statistics/Metropolis-Hastings-algorithm<br />
<br />
http://www.stats.ox.ac.uk/~nicholls/MScMCMC15/L5MScMCMC15.pdf<br />
<br />
read this later: https://joa.sh/posts/2016-08-21-metropolis.html</div>IssaRicehttps://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Metropolis%E2%80%93Hastings_algorithm&diff=3307User:IssaRice/Metropolis–Hastings algorithm2021-06-14T23:00:31Z<p>IssaRice: </p>
<hr />
<div>without exception, every single explanation i have seen so far of this absolutely sucks. like, not just "most really suck, and some suck a little". literally everything just sucks really bad. this might be my best guess for the most horribly-explained thing ''ever''.<br />
<br />
in my opinion, the things a good explanation ''must'' cover are:<br />
<br />
* what the heck is sampling, even? once we have a fair coin, use that to generate samples for:<br />
** arbitrary biased coin<br />
** a discrete uniform distribution over 1,...,n<br />
** a continuous uniform(0,1) distribution<br />
** use a continuous uniform to sample from an arbitrary distribution using inverse transform sampling<br />
** bonus: go from a biased coin (with unknown bias) to a fair coin<br />
* '''why doesn't inverse transform sampling work in situations where we have to use metropolis-hastings? another phrasing: why do we need metropolis-hastings if there is inverse transform sampling?'''<br />
** '''like, what the heck does it mean to "only" have access to some constant multiple of the pdf? why would we ever get into such a situation, and why can't we just normalize and then numerically approximate the cdf, and then get the inverse to do inverse transform sampling??? literally NONE of the explanations even RAISE this question. why????'''<br />
* '''an actually convincing ''example'' of MCMC. the stuff i've seen so far are so boring i just don't even care if we can sample from it.'''<br />
** a toy example i like is sampling uniformly from 1,...,6 using coin flips. it's easy to see in this case that you can do a thing where you move left on heads and move right on tails (with wrap-around, of course), and that after a while you're spending 1/6 of your time on each number.<br />
** the next level up is the example from [https://www.cs.ubc.ca/~schmidtm/Courses/540-W19/L35.pdf#page=9 these slides] where 1,...5, have probability mass 1/10 each, and 6 has mass 1/2. now you can't just flip and move to adjacent numbers; you somehow have to distinguish the 6 as being more likely than the other numbers. how do you do it?<br />
* '''where the heck does the accept/reject rule come from? why this division thing to get the threshold?'''<br />
* '''why do we need a transition/proposal matrix, can this matrix be literally anything, and why do we care if it's symmetric?'''<br />
* '''how did anyone ever come up with the idea of using a markov chain, whose stationary distribution is the distribution we're trying to sample from? like, where did the markov chain idea even come from?'''<br />
* '''how could anyone have come up with detailed balance as a sufficient condition for existence of a stationary distribution?'''<br />
* '''how did anyone even think of the proposal-acceptance decomposition?'''<br />
** [http://phylo.bio.ku.edu/slides/BayesianMCMC-2013.pdf#page=10 mark holder's notes] give one intuition: if you just try to do the transition probabilities to satisfy detailed balance, then the probabilities for some transitions might sum to more than 1, which violates the fact that these transitions are supposed to be probability distributions.<br />
** here's a question: why can't we just sum up the probabilities going out of a node and then renormalize for each node? I am guessing this won't work, but I don't understand why yet.<br />
* '''given the ratio condition for the acceptance function (see https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm#Formal_derivation where A(x',x)/A(x,x') is written as a fraction), why did people choose the particular form for A(x',x)? clearly lots of acceptance functions could have worked, and the particular choice seems unmotivated (although it is particularly simple-looking!).'''<br />
* big idea: https://stackoverflow.com/a/16826118/3422337 -- instead of sampling from 1,...,n by generating k coin flips to pick one of 2^k outcomes (in which case we might need to reject and start over unless n is a power of 2), can we somehow "make every coin flip count"? randomly move to your neighbor with probability 1/2 by using the coin flip. it seems intuitive that "in the long run" you'll spend about an equal amount of time in each state.<br />
** this case is simplified in multiple ways: the proposal matrix is symmetric, and the acceptance probabilities are all equal to 1 (i.e. always accept), and the distribution we're trying to sample from is uniform. so everything nicely cancels out<br />
* if a move is rejected, does it matter whether you re-count the current state as a sampled state?<br />
** it seems like it does matter that you re-count it, but '''why does it matter?'''<br />
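the two toy examples above fit in a few lines of code. this is a sketch (the function name and constants are mine) of a metropolis walk on 1,...,6 targeting the slides' distribution; note that if the target were uniform, target(y)/target(x) would always be 1, every move would be accepted, and you'd recover the plain coin-flip walk from the first example.

```python
import random

def metropolis_six(n_steps, seed=0):
    # target: the example from the UBC 540 slides -- states 1..5 have mass
    # 1/10 each, state 6 has mass 1/2
    target = {s: (0.5 if s == 6 else 0.1) for s in range(1, 7)}
    rng = random.Random(seed)
    x = 1
    counts = {s: 0 for s in range(1, 7)}
    for _ in range(n_steps):
        # symmetric proposal: a fair coin flip moves left or right, with wrap-around
        step = -1 if rng.random() < 0.5 else 1
        y = (x - 1 + step) % 6 + 1
        # metropolis acceptance: min(1, target(y)/target(x));
        # the proposal ratio cancels because the proposal is symmetric
        if rng.random() < min(1.0, target[y] / target[x]):
            x = y
        # note: a rejected move re-counts the current state as a sample
        counts[x] += 1
    return {s: counts[s] / n_steps for s in counts}

freqs = metropolis_six(200_000)
# freqs[6] comes out near 1/2 and each of freqs[1..5] near 1/10
```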
<br />
<br />
"""Our previous methods tend not to work in complex situations:Inverse CDF may not be available.Conditionals needed for ancestral/Gibbs sampling may be hard to compute.Rejection sampling tends to reject almost all samples.Importance sampling tends gives almost zero weight to all samples.""" https://www.cs.ubc.ca/~schmidtm/Courses/540-W17/L22.pdf -- why wouldn't inverse cdf not be available?<br />
<br />
i like this: http://phylo.bio.ku.edu/slides/BayesianMCMC-2013.pdf<br />
"We’d like an MCMC simulation that converges quickly so we should set the transition probabilities as high as possible. So, using m0,1 = 1 and m1,0 = 1/3 sounds best." -- omg!!!! somehow no other resource has ever explained this idea.<br />
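a quick sanity check of holder's point: assuming the target there is pi = (1/4, 3/4) (which is what detailed balance with those rates implies, since pi0 * m01 = pi1 * m10), the rates m01 = 1 and m10 = 1/3 are the largest pair consistent with it, and the chain really does settle on (1/4, 3/4):

```python
# two-state chain with holder's transition probabilities m01 = 1, m10 = 1/3;
# rows must sum to 1, so the chain stays at state 1 with probability 2/3
P = [[0.0, 1.0],
     [1/3, 2/3]]

def stationary(P, iters=200):
    # power iteration: push an arbitrary starting distribution through P
    # repeatedly until it stops changing
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

pi = stationary(P)   # converges to [0.25, 0.75]
```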
<br />
<br />
some intuitions from https://www.cise.ufl.edu/class/cap6617fa17/Readings/ChibGreenberg.pdf<br />
<br />
* usually in markov chain settings, we are trying to go from the transition probabilities to the stationary distribution: e.g. we know that if it is sunny then it will be rainy with probability 0.1 and sunny with probability 0.9, and if it is rainy then it will be sunny with probability 0.3 and rainy with probability 0.7, or whatever. and the goal is to find out the stationary distribution, that is, if we randomly wake up one day (without knowing any previous day's weather), what is the probability that it is raining? But in the metropolis-hastings setting, we are doing the reverse: we roughly (up to a multiplicative constant) know the stationary distribution we are trying to sample from. The goal is instead to pick the transition probabilities so that they give rise to this stationary distribution.<br />
* we set the acceptance probability to 1 in one of the cases because we aren't visiting that state enough (since detailed balance is violated), so we want to visit it as much as possible, and so the highest value we can choose is 1.<br />
* the min{1, ...} crap is just a shorthand to avoid having to split into cases depending on whether we visit x or y more.<br />
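the intuitions above can be checked deterministically: take the 6-heavy toy target, a deliberately non-symmetric proposal (my own choice, purely illustrative), build the transition matrix with the min{1, ...} acceptance rule, and verify that detailed balance holds for every pair of states:

```python
# target: the 6-heavy toy distribution; proposal: go to the right neighbour
# with prob 0.7, left neighbour with prob 0.3 (wrap-around), so q is NOT
# symmetric and the hastings correction q(y,x)/q(x,y) actually matters
pi = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]
n = len(pi)
q = [[0.0] * n for _ in range(n)]
for x in range(n):
    q[x][(x + 1) % n] = 0.7
    q[x][(x - 1) % n] = 0.3

# the actual transition matrix: propose y, then accept with
# A(x,y) = min(1, pi(y) q(y,x) / (pi(x) q(x,y)))
P = [[0.0] * n for _ in range(n)]
for x in range(n):
    for y in range(n):
        if x != y and q[x][y] > 0:
            accept = min(1.0, (pi[y] * q[y][x]) / (pi[x] * q[x][y]))
            P[x][y] = q[x][y] * accept
    P[x][x] = 1.0 - sum(P[x])   # rejected moves stay put

# detailed balance holds pairwise, which implies pi is stationary
balanced = all(abs(pi[x] * P[x][y] - pi[y] * P[y][x]) < 1e-9
               for x in range(n) for y in range(n))
```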
<br />
<br />
read this page later: https://www.statlect.com/fundamentals-of-statistics/Metropolis-Hastings-algorithm<br />
<br />
http://www.stats.ox.ac.uk/~nicholls/MScMCMC15/L5MScMCMC15.pdf<br />
<br />
read this later: https://joa.sh/posts/2016-08-21-metropolis.html</div>
<hr />
<div>without exception, every single explanation i have seen so far of this absolutely sucks. like, not just "most really suck, and some suck a little". literally everything just sucks really bad. this might be my best guess for the most horribly-explained thing ''ever''.<br />
<br />
in my opinion, the things a good explanation ''must'' cover are:<br />
<br />
* what the heck is sampling, even? once we have a fair coin, use that to generate samples for:<br />
** arbitrary biased coin<br />
** a discrete uniform distribution over 1,...,n<br />
** a continuous uniform(0,1) distribution<br />
** use a continuous uniform to sample from an arbitrary distribution using inverse transform sampling<br />
** bonus: go from a biased coin (with unknown bias) to a fair coin<br />
* '''why doesn't inverse transform sampling work in situations where we have to use metropolis-hastings? another phrasing: why do we need metropolis-hastings if there is inverse transform sampling?'''<br />
** '''like, what the heck does it mean to "only" have access to some constant multiple of the pdf? why would we ever get into such a situation, and why can't we just normalize and then numerically approximate the cdf, and then get the inverse to do inverse transform sampling??? literally NONE of the explanations even RAISE this question. why????'''<br />
* '''an actually convincing ''example'' of MCMC. the stuff i've seen so far are so boring i just don't even care if we can sample from it.'''<br />
** a toy example i like is sampling uniformly from 1,...,6 using coin flips. it's easy to see in this case that you can do a thing where you move left on heads and move right on tails (with wrap-around, of course), and that after a while you're spending 1/6 of your time on each number.<br />
** the next level up is the example from [https://www.cs.ubc.ca/~schmidtm/Courses/540-W19/L35.pdf#page=9 these slides] where 1,...5, have probability mass 1/10 each, and 6 has mass 1/2. now you can't just flip and move to adjacent numbers; you somehow have to distinguish the 6 as being more likely than the other numbers. how do you do it?<br />
* '''where the heck does the accept/reject rule come from? why this division thing to get the threshold?'''<br />
* '''why do we need a transition/proposal matrix, can this matrix be literally anything, and why do we care if it's symmetric?'''<br />
* '''how did anyone ever come up with the idea of using a markov chain, whose stationary distribution is the distribution we're trying to sample from? like, where did the markov chain idea even come from?'''<br />
* '''how could anyone have come up with detailed balance as a sufficient condition for existence of a stationary distribution?'''<br />
* '''how did anyone even think of the proposal-acceptance decomposition?'''<br />
** [http://phylo.bio.ku.edu/slides/BayesianMCMC-2013.pdf#page=10 mark holder's notes] give one intuition: if you just try to do the transition probabilities to satisfy detailed balance, then the probabilities for some transitions might sum to more than 1, which violates the fact that these transitions are supposed to be probability distributions.<br />
* '''given the ratio condition for the acceptance function (see https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm#Formal_derivation where A(x',x)/A(x,x') is written as a fraction), why did people choose the particular form for A(x',x)? clearly lots of acceptance functions could have worked, and the particular choice seems unmotivated (although it is particularly simple-looking!).'''<br />
* big idea: https://stackoverflow.com/a/16826118/3422337 -- instead of sampling from 1,...,n by generating 2^k coin flips (in which case we might need to start over unless n is a power of 2), can we somehow "make every coin flip count"? randomly move to your neighbor with probability 1/2 by using the coin flip. it seems intuitive that "in the long run" you'll spend about an equal amount of time in each state.<br />
** this case is simplified in multiple ways: the proposal matrix is symmetric, and the acceptance probabilities are all equal to 1 (i.e. always accept), and the distribution we're trying to sample from is uniform. so everything nicely cancels out<br />
* if a move is rejected, does it matter whether you re-count the current state as a sampled state?<br />
** it seems like it does matter that you re-count it, but '''why does it matter?'''<br />
<br />
<br />
"""Our previous methods tend not to work in complex situations:Inverse CDF may not be available.Conditionals needed for ancestral/Gibbs sampling may be hard to compute.Rejection sampling tends to reject almost all samples.Importance sampling tends gives almost zero weight to all samples.""" https://www.cs.ubc.ca/~schmidtm/Courses/540-W17/L22.pdf -- why wouldn't inverse cdf not be available?<br />
<br />
i like this: http://phylo.bio.ku.edu/slides/BayesianMCMC-2013.pdf<br />
"We’d like an MCMC simulation that converges quickly so we should set the transition probabilities as high as possible. So, using m0,1 = 1 and m1,0 = 1/3 sounds best." -- omg!!!! somehow no other resource has ever explained this idea.<br />
<br />
<br />
some intuitions from https://www.cise.ufl.edu/class/cap6617fa17/Readings/ChibGreenberg.pdf<br />
<br />
* usually in markov chain settings, we are trying to go from the transition probabilities to the stationary distribution: e.g. we know that if it is sunny then it will be rainy with probability 0.1 and sunny with probability 0.9, and if it is rainy then it will be sunny with probability 0.3 and rainy with probability 0.7, or whatever. and the goal is to find out the stationary distribution, that is, if we randomly wake up one day (without knowing any previous day's weather), what is the probability that it is raining? But in the metropolis-hastings setting, we are doing the reverse: we roughly (up to multiplicative constant) know what the stationary distribution is that we are trying to sample from. The goal is instead to pick the transition probabilities so that they give rise to this stationary distribution.<br />
* we set the acceptance probability to 1 in one of the cases because we aren't visiting that state enough (since detailed balance is violated), so we want to visit it as much as possible, and so the highest value we can choose is 1.<br />
* the min{1, ...} crap is just a shorthand to avoid having to split into cases depending on whether we visit x or y more.<br />
<br />
<br />
read this page later: https://www.statlect.com/fundamentals-of-statistics/Metropolis-Hastings-algorithm<br />
<br />
http://www.stats.ox.ac.uk/~nicholls/MScMCMC15/L5MScMCMC15.pdf<br />
<br />
read this later: https://joa.sh/posts/2016-08-21-metropolis.html</div>IssaRicehttps://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Metropolis%E2%80%93Hastings_algorithm&diff=3305User:IssaRice/Metropolis–Hastings algorithm2021-06-14T22:14:21Z<p>IssaRice: </p>
<hr />
<div>without exception, every single explanation i have seen so far of this absolutely sucks. like, not just "most really suck, and some suck a little". literally everything just sucks really bad. this might be my best guess for the most horribly-explained thing ''ever''.<br />
<br />
in my opinion, the things a good explanation ''must'' cover are:<br />
<br />
* what the heck is sampling, even? once we have a fair coin, use that to generate samples for:<br />
** arbitrary biased coin<br />
** a discrete uniform distribution over 1,...,n<br />
** a continuous uniform(0,1) distribution<br />
** use a continuous uniform to sample from an arbitrary distribution using inverse transform sampling<br />
** bonus: go from a biased coin (with unknown bias) to a fair coin<br />
* '''why doesn't inverse transform sampling work in situations where we have to use metropolis-hastings? another phrasing: why do we need metropolis-hastings if there is inverse transform sampling?'''<br />
** '''like, what the heck does it mean to "only" have access to some constant multiple of the pdf? why would we ever get into such a situation, and why can't we just normalize and then numerically approximate the cdf, and then get the inverse to do inverse transform sampling??? literally NONE of the explanations even RAISE this question. why????'''<br />
* '''an actually convincing ''example'' of MCMC. the stuff i've seen so far are so boring i just don't even care if we can sample from it.'''<br />
** a toy example i like is sampling uniformly from 1,...,6 using coin flips. it's easy to see in this case that you can do a thing where you move left on heads and move right on tails (with wrap-around, of course), and that after a while you're spending 1/6 of your time on each number.<br />
** the next level up is the example from [https://www.cs.ubc.ca/~schmidtm/Courses/540-W19/L35.pdf#page=9 these slides] where 1,...5, have probability mass 1/10 each, and 6 has mass 1/2. now you can't just flip and move to adjacent numbers; you somehow have to distinguish the 6 as being more likely than the other numbers. how do you do it?<br />
* '''where the heck does the accept/reject rule come from? why this division thing to get the threshold?'''<br />
* '''why do we need a transition/proposal matrix, can this matrix be literally anything, and why do we care if it's symmetric?'''<br />
* '''how did anyone ever come up with the idea of using a markov chain, whose stationary distribution is the distribution we're trying to sample from? like, where did the markov chain idea even come from?'''<br />
* '''how could anyone have come up with detailed balance as a sufficient condition for existence of a stationary distribution?'''<br />
* '''how did anyone even think of the proposal-acceptance decomposition?'''<br />
* '''given the ratio condition for the acceptance function (see https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm#Formal_derivation where A(x',x)/A(x,x') is written as a fraction), why did people choose the particular form for A(x',x)? clearly lots of acceptance functions could have worked, and the particular choice seems unmotivated (although it is particularly simple-looking!).'''<br />
* big idea: https://stackoverflow.com/a/16826118/3422337 -- instead of sampling from 1,...,n by generating 2^k coin flips (in which case we might need to start over unless n is a power of 2), can we somehow "make every coin flip count"? randomly move to your neighbor with probability 1/2 by using the coin flip. it seems intuitive that "in the long run" you'll spend about an equal amount of time in each state.<br />
** this case is simplified in multiple ways: the proposal matrix is symmetric, and the acceptance probabilities are all equal to 1 (i.e. always accept), and the distribution we're trying to sample from is uniform. so everything nicely cancels out<br />
* if a move is rejected, does it matter whether you re-count the current state as a sampled state?<br />
** it seems like it does matter that you re-count it, but '''why does it matter?'''<br />
<br />
<br />
"""Our previous methods tend not to work in complex situations:Inverse CDF may not be available.Conditionals needed for ancestral/Gibbs sampling may be hard to compute.Rejection sampling tends to reject almost all samples.Importance sampling tends gives almost zero weight to all samples.""" https://www.cs.ubc.ca/~schmidtm/Courses/540-W17/L22.pdf -- why wouldn't inverse cdf not be available?<br />
<br />
i like this: http://phylo.bio.ku.edu/slides/BayesianMCMC-2013.pdf<br />
"We’d like an MCMC simulation that converges quickly so we should set the transition probabilities as high as possible. So, using m0,1 = 1 and m1,0 = 1/3 sounds best." -- omg!!!! somehow no other resource has ever explained this idea.<br />
<br />
<br />
some intuitions from https://www.cise.ufl.edu/class/cap6617fa17/Readings/ChibGreenberg.pdf<br />
<br />
* usually in markov chain settings, we are trying to go from the transition probabilities to the stationary distribution: e.g. we know that if it is sunny then it will be rainy with probability 0.1 and sunny with probability 0.9, and if it is rainy then it will be sunny with probability 0.3 and rainy with probability 0.7, or whatever. and the goal is to find out the stationary distribution, that is, if we randomly wake up one day (without knowing any previous day's weather), what is the probability that it is raining? But in the metropolis-hastings setting, we are doing the reverse: we roughly (up to multiplicative constant) know what the stationary distribution is that we are trying to sample from. The goal is instead to pick the transition probabilities so that they give rise to this stationary distribution.<br />
* we set the acceptance probability to 1 in one of the cases because we aren't visiting that state enough (since detailed balance is violated), so we want to visit it as much as possible, and so the highest value we can choose is 1.<br />
* the min{1, ...} crap is just a shorthand to avoid having to split into cases depending on whether we visit x or y more.<br />
<br />
<br />
read this page later: https://www.statlect.com/fundamentals-of-statistics/Metropolis-Hastings-algorithm<br />
<br />
http://www.stats.ox.ac.uk/~nicholls/MScMCMC15/L5MScMCMC15.pdf<br />
<br />
read this later: https://joa.sh/posts/2016-08-21-metropolis.html</div>IssaRicehttps://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Metropolis%E2%80%93Hastings_algorithm&diff=3304User:IssaRice/Metropolis–Hastings algorithm2021-06-14T20:42:39Z<p>IssaRice: </p>
<hr />
<div>without exception, every single explanation i have seen so far of this absolutely sucks. like, not just "most really suck, and some suck a little". literally everything just sucks really bad. this might be my best guess for the most horribly-explained thing ''ever''.<br />
<br />
in my opinion, the things a good explanation ''must'' cover are:<br />
<br />
* what the heck is sampling, even? once we have a fair coin, use that to generate samples for:<br />
** arbitrary biased coin<br />
** a discrete uniform distribution over 1,...,n<br />
** a continuous uniform(0,1) distribution<br />
** use a continuous uniform to sample from an arbitrary distribution using inverse transform sampling<br />
** bonus: go from a biased coin (with unknown bias) to a fair coin<br />
* '''why doesn't inverse transform sampling work in situations where we have to use metropolis-hastings? another phrasing: why do we need metropolis-hastings if there is inverse transform sampling?'''<br />
** '''like, what the heck does it mean to "only" have access to some constant multiple of the pdf? why would we ever get into such a situation, and why can't we just normalize and then numerically approximate the cdf, and then get the inverse to do inverse transform sampling??? literally NONE of the explanations even RAISE this question. why????'''<br />
* '''an actually convincing ''example'' of MCMC. the stuff i've seen so far are so boring i just don't even care if we can sample from it.'''<br />
** a toy example i like is sampling uniformly from 1,...,6 using coin flips. it's easy to see in this case that you can do a thing where you move left on heads and move right on tails (with wrap-around, of course), and that after a while you're spending 1/6 of your time on each number.<br />
** the next level up is the example from [https://www.cs.ubc.ca/~schmidtm/Courses/540-W19/L35.pdf#page=9 these slides] where 1,...5, have probability mass 1/10 each, and 6 has mass 1/2. now you can't just flip and move to adjacent numbers; you somehow have to distinguish the 6 as being more likely than the other numbers. how do you do it?<br />
* '''where the heck does the accept/reject rule come from? why this division thing to get the threshold?'''<br />
* '''why do we need a transition/proposal matrix, can this matrix be literally anything, and why do we care if it's symmetric?'''<br />
* '''how did anyone ever come up with the idea of using a markov chain, whose stationary distribution is the distribution we're trying to sample from? like, where did the markov chain idea even come from?'''<br />
* '''how could anyone have come up with detailed balance as a sufficient condition for existence of a stationary distribution?'''<br />
* '''how did anyone even think of the proposal-acceptance decomposition?'''<br />
* '''given the ratio condition for the acceptance function (see https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm#Formal_derivation where A(x',x)/A(x,x') is written as a fraction), why did people choose the particular form for A(x',x)? clearly lots of acceptance functions could have worked, and the particular choice seems unmotivated (although it is particularly simple-looking!).'''<br />
* big idea: https://stackoverflow.com/a/16826118/3422337 -- instead of sampling from 1,...,n by generating 2^k coin flips (in which case we might need to start over unless n is a power of 2), can we somehow "make every coin flip count"? randomly move to your neighbor with probability 1/2 by using the coin flip. it seems intuitive that "in the long run" you'll spend about an equal amount of time in each state.<br />
** this case is simplified in multiple ways: the proposal matrix is symmetric, and the acceptance probabilities are all equal to 1 (i.e. always accept), and the distribution we're trying to sample from is uniform. so everything nicely cancels out<br />
* if a move is rejected, does it matter whether you re-count the current state as a sampled state?<br />
** it seems like it does matter that you re-count it, but '''why does it matter?'''<br />
<br />
<br />
"""Our previous methods tend not to work in complex situations:Inverse CDF may not be available.Conditionals needed for ancestral/Gibbs sampling may be hard to compute.Rejection sampling tends to reject almost all samples.Importance sampling tends gives almost zero weight to all samples.""" https://www.cs.ubc.ca/~schmidtm/Courses/540-W17/L22.pdf -- why wouldn't inverse cdf not be available?<br />
<br />
i like this: http://phylo.bio.ku.edu/slides/BayesianMCMC-2013.pdf<br />
"We’d like an MCMC simulation that converges quickly so we should set the transition probabilities as high as possible. So, using m0,1 = 1 and m1,0 = 1/3 sounds best." -- omg!!!! somehow no other resource has ever explained this idea.<br />
<br />
read this page later: https://www.statlect.com/fundamentals-of-statistics/Metropolis-Hastings-algorithm<br />
<br />
http://www.stats.ox.ac.uk/~nicholls/MScMCMC15/L5MScMCMC15.pdf<br />
<br />
read this later: https://joa.sh/posts/2016-08-21-metropolis.html</div>IssaRicehttps://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Metropolis%E2%80%93Hastings_algorithm&diff=3303User:IssaRice/Metropolis–Hastings algorithm2021-06-14T20:17:54Z<p>IssaRice: </p>
<hr />
<div>without exception, every single explanation i have seen so far of this absolutely sucks. like, not just "most really suck, and some suck a little". literally everything just sucks really bad. this might be my best guess for the most horribly-explained thing ''ever''.<br />
<br />
in my opinion, the things a good explanation ''must'' cover are:<br />
<br />
* what the heck is sampling, even? once we have a fair coin, use that to generate samples for:<br />
** arbitrary biased coin<br />
** a discrete uniform distribution over 1,...,n<br />
** a continuous uniform(0,1) distribution<br />
** use a continuous uniform to sample from an arbitrary distribution using inverse transform sampling<br />
** bonus: go from a biased coin (with unknown bias) to a fair coin<br />
* '''why doesn't inverse transform sampling work in situations where we have to use metropolis-hastings? another phrasing: why do we need metropolis-hastings if there is inverse transform sampling?'''<br />
** '''like, what the heck does it mean to "only" have access to some constant multiple of the pdf? why would we ever get into such a situation, and why can't we just normalize and then numerically approximate the cdf, and then get the inverse to do inverse transform sampling??? literally NONE of the explanations even RAISE this question. why????'''<br />
* '''an actually convincing ''example'' of MCMC. the stuff i've seen so far are so boring i just don't even care if we can sample from it.'''<br />
** a toy example i like is sampling uniformly from 1,...,6 using coin flips. it's easy to see in this case that you can do a thing where you move left on heads and move right on tails (with wrap-around, of course), and that after a while you're spending 1/6 of your time on each number.<br />
** the next level up is the example from [https://www.cs.ubc.ca/~schmidtm/Courses/540-W19/L35.pdf#page=9 these slides] where 1,...5, have probability mass 1/10 each, and 6 has mass 1/2. now you can't just flip and move to adjacent numbers; you somehow have to distinguish the 6 as being more likely than the other numbers. how do you do it?<br />
* '''where the heck does the accept/reject rule come from? why this division thing to get the threshold?'''<br />
* '''why do we need a transition/proposal matrix, can this matrix be literally anything, and why do we care if it's symmetric?'''<br />
* '''how did anyone ever come up with the idea of using a markov chain, whose stationary distribution is the distribution we're trying to sample from? like, where did the markov chain idea even come from?'''<br />
* '''how could anyone have come up with detailed balance as a sufficient condition for existence of a stationary distribution?'''<br />
* '''how did anyone even think of the proposal-acceptance decomposition?'''<br />
* '''given the ratio condition for the acceptance function (see https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm#Formal_derivation where A(x',x)/A(x,x') is written as a fraction), why did people choose the particular form for A(x',x)? clearly lots of acceptance functions could have worked, and the particular choice seems unmotivated (although it is particularly simple-looking!).'''<br />
* big idea: https://stackoverflow.com/a/16826118/3422337 -- instead of sampling from 1,...,n by generating 2^k coin flips (in which case we might need to start over unless n is a power of 2), can we somehow "make every coin flip count"? randomly move to your neighbor with probability 1/2 by using the coin flip. it seems intuitive that "in the long run" you'll spend about an equal amount of time in each state.<br />
** this case is simplified in multiple ways: the proposal matrix is symmetric, and the acceptance probabilities are all equal to 1 (i.e. always accept), and the distribution we're trying to sample from is uniform. so everything nicely cancels out<br />
* if a move is rejected, does it matter whether you re-count the current state as a sampled state?<br />
<br />
<br />
"""Our previous methods tend not to work in complex situations:Inverse CDF may not be available.Conditionals needed for ancestral/Gibbs sampling may be hard to compute.Rejection sampling tends to reject almost all samples.Importance sampling tends gives almost zero weight to all samples.""" https://www.cs.ubc.ca/~schmidtm/Courses/540-W17/L22.pdf -- why wouldn't inverse cdf not be available?<br />
<br />
i like this: http://phylo.bio.ku.edu/slides/BayesianMCMC-2013.pdf<br />
"We’d like an MCMC simulation that converges quickly so we should set the transition probabilities as high as possible. So, using m0,1 = 1 and m1,0 = 1/3 sounds best." -- omg!!!! somehow no other resource has ever explained this idea.<br />
<br />
read this page later: https://www.statlect.com/fundamentals-of-statistics/Metropolis-Hastings-algorithm<br />
<br />
http://www.stats.ox.ac.uk/~nicholls/MScMCMC15/L5MScMCMC15.pdf<br />
<br />
read this later: https://joa.sh/posts/2016-08-21-metropolis.html</div>IssaRicehttps://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Metropolis%E2%80%93Hastings_algorithm&diff=3293User:IssaRice/Metropolis–Hastings algorithm2021-06-09T06:38:20Z<p>IssaRice: </p>
<hr />
<div>without exception, every single explanation i have seen so far of this absolutely sucks. like, not just "most really suck, and some suck a little". literally everything just sucks really bad. this might be my best guess for the most horribly-explained thing ''ever''.<br />
<br />
in my opinion, the things a good explanation ''must'' cover are:<br />
<br />
* what the heck is sampling, even? once we have a fair coin, use that to generate samples for:<br />
** arbitrary biased coin<br />
** a discrete uniform distribution over 1,...,n<br />
** a continuous uniform(0,1) distribution<br />
** use a continuous uniform to sample from an arbitrary distribution using inverse transform sampling<br />
** bonus: go from a biased coin (with unknown bias) to a fair coin<br />
* '''why doesn't inverse transform sampling work in situations where we have to use metropolis-hastings? another phrasing: why do we need metropolis-hastings if there is inverse transform sampling?'''<br />
** '''like, what the heck does it mean to "only" have access to some constant multiple of the pdf? why would we ever get into such a situation, and why can't we just normalize and then numerically approximate the cdf, and then get the inverse to do inverse transform sampling??? literally NONE of the explanations even RAISE this question. why????'''<br />
* '''an actually convincing ''example'' of MCMC. the stuff i've seen so far are so boring i just don't even care if we can sample from it.'''<br />
** a toy example i like is sampling uniformly from 1,...,6 using coin flips. it's easy to see in this case that you can do a thing where you move left on heads and move right on tails (with wrap-around, of course), and that after a while you're spending 1/6 of your time on each number.<br />
** the next level up is the example from [https://www.cs.ubc.ca/~schmidtm/Courses/540-W19/L35.pdf#page=9 these slides] where 1,...5, have probability mass 1/10 each, and 6 has mass 1/2. now you can't just flip and move to adjacent numbers; you somehow have to distinguish the 6 as being more likely than the other numbers. how do you do it?<br />
* '''where the heck does the accept/reject rule come from? why this division thing to get the threshold?'''<br />
* '''why do we need a transition/proposal matrix, can this matrix be literally anything, and why do we care if it's symmetric?'''<br />
* '''how did anyone ever come up with the idea of using a markov chain, whose stationary distribution is the distribution we're trying to sample from? like, where did the markov chain idea even come from?'''<br />
* '''how could anyone have come up with detailed balance as a sufficient condition for existence of a stationary distribution?'''<br />
* '''how did anyone even think of the proposal-acceptance decomposition?'''<br />
* '''given the ratio condition for the acceptance function (see https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm#Formal_derivation where A(x',x)/A(x,x') is written as a fraction), why did people choose the particular form for A(x',x)? clearly lots of acceptance functions could have worked, and the particular choice seems unmotivated (although it is particularly simple-looking!).'''<br />
* big idea: https://stackoverflow.com/a/16826118/3422337 -- instead of sampling from 1,...,n by generating 2^k coin flips (in which case we might need to start over unless n is a power of 2), can we somehow "make every coin flip count"? randomly move to your neighbor with probability 1/2 by using the coin flip. it seems intuitive that "in the long run" you'll spend about an equal amount of time in each state.<br />
** this case is simplified in multiple ways: the proposal matrix is symmetric, and the acceptance probabilities are all equal to 1 (i.e. always accept), and the distribution we're trying to sample from is uniform. so everything nicely cancels out<br />
<br />
<br />
"""Our previous methods tend not to work in complex situations:Inverse CDF may not be available.Conditionals needed for ancestral/Gibbs sampling may be hard to compute.Rejection sampling tends to reject almost all samples.Importance sampling tends gives almost zero weight to all samples.""" https://www.cs.ubc.ca/~schmidtm/Courses/540-W17/L22.pdf -- why wouldn't inverse cdf not be available?<br />
<br />
i like this: http://phylo.bio.ku.edu/slides/BayesianMCMC-2013.pdf<br />
"We’d like an MCMC simulation that converges quickly so we should set the transition probabilities as high as possible. So, using m0,1 = 1 and m1,0 = 1/3 sounds best." -- omg!!!! somehow no other resource has ever explained this idea.<br />
<br />
read this page later: https://www.statlect.com/fundamentals-of-statistics/Metropolis-Hastings-algorithm<br />
<br />
http://www.stats.ox.ac.uk/~nicholls/MScMCMC15/L5MScMCMC15.pdf<br />
<br />
read this later: https://joa.sh/posts/2016-08-21-metropolis.html</div>IssaRicehttps://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Computability_and_logic/Diagonalization_lemma&diff=3290User:IssaRice/Computability and logic/Diagonalization lemma2021-06-07T19:35:11Z<p>IssaRice: /* Use of extra quantified variable to make a substitution */</p>
<hr />
<div>The '''diagonalization lemma''', also called the '''Gödel–Carnap fixed point theorem''', is a fixed point theorem in logic.<br />
<br />
A verbal version of this result, modified from GEB (p. 449), runs as follows: Take the predicate cannot-be-proved-when-diagonalized(x). It takes a predicate as its input x, and claims that the sentence obtained by diagonalizing that predicate (i.e. inserting it into itself) cannot be proved. For instance, cannot-be-proved-when-diagonalized("has-length-less-than-one-thousand(x)") claims that has-length-less-than-one-thousand("has-length-less-than-one-thousand(x)") cannot be proved. In this case, let's say it's false, since we can see that the string "has-length-less-than-one-thousand(x)" has length less than 1000, and let's assume our proof system is strong enough to prove this. Now, to diagonalize cannot-be-proved-when-diagonalized(x) is to form the sentence cannot-be-proved-when-diagonalized("cannot-be-proved-when-diagonalized(x)"). So can this sentence be proved or not? If it can be proved, then the sentence itself claims that it cannot be proved, a contradiction. So it must not be provable.<br />
<br />
Basically, unlike in English, a formal sentence can't refer to itself using phrases like "this sentence itself", so there is no straightforward way to make claims like "This sentence cannot be proved". To get around this restriction, we must use diagonalization: substituting a sentence's own encoding (i.e. string representation, aka "Gödel number") into itself. This lets a predicate talk about its own string representation. If that predicate happens to claim unprovability, we get Gödel's first incompleteness theorem.<br />
<br />
The diagonalization lemma generalizes to talk about any predicate P(x), not just not-provable(x). We want to find a sentence G such that G is true if and only if P("G") is (this is a little sloppy -- it's not actually the string "G", but rather if we made whatever G happens to be into a string...). Let G be has-property-P-when-diagonalized("has-property-P-when-diagonalized(x)"). If G is true, then has-property-P-when-diagonalized(x) must have property P when diagonalized, i.e. P("has-property-P-when-diagonalized('has-property-P-when-diagonalized(x)')") which means P("G") is true. If G is false, then has-property-P-when-diagonalized(x) must not have property P when diagonalized, i.e. P("has-property-P-when-diagonalized('has-property-P-when-diagonalized(x)')") is false, which means P("G") is false. Thus G is indeed true if and only if P("G") is true, so we have successfully "diagonalized" the predicate P(x).<br />
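<br />
The verbal construction above can be mimicked at the level of Python strings (an informal sketch; the names diag, B, G mirror the proof below, and "A" is just an uninterpreted label, not a real predicate):<br />

```python
def diag(template):
    """Substitute a template's own quotation into its free slot {}."""
    return template.format(repr(template))

# B plays the role of A(diag(x)): the predicate A applied to the
# diagonalization of whatever gets plugged into the free slot.
B = "A(diag({}))"

# G is B diagonalized, i.e. B applied to its own quotation.
G = diag(B)
print(G)  # A(diag('A(diag({}))'))

# The string quoted inside G is exactly B, so diagonalizing it reproduces
# G itself: G "says" A(<quotation of G>).
assert diag("A(diag({}))") == G
```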
<br />
==Rogers's fixed point theorem==<br />
<br />
Let <math>f</math> be a total computable function. Then there exists an index <math>e</math> such that <math>\varphi_e \simeq \varphi_{f(e)}</math>.<br />
<br />
(simplified)<br />
<br />
Define <math>d(x) = \varphi_x(x)</math> (this is actually slightly wrong, but it brings out the analogy better).<br />
<br />
Consider the function <math>f\circ d</math>. This is partial recursive, so <math>f\circ d \simeq \varphi_i</math> for some index <math>i</math>.<br />
<br />
Now <math>\varphi_{f(d(i))} \simeq \varphi_{\varphi_i(i)}</math> since <math>f\circ d \simeq \varphi_i</math>. This is equivalent to <math>\varphi_{d(i)}</math> by definition of <math>d</math>. Thus, we may take <math>e = d(i)</math> to complete the proof.<br />
<br />
It looks like we have <math>f(d(i)) = \varphi_i(i) = d(i)</math>, i.e. <math>f(e) = e</math>. Is this right? Only when <math>\varphi_i(i)</math> actually converges: for <math>f(x) = x+1</math> the equation <math>f(d(i)) = d(i)</math> is impossible, so <math>\varphi_i(i)</math> must diverge in that case. This is exactly why the careful proof below replaces <math>d</math> with a total function <math>s</math>, after which we only get <math>\varphi_{f(e)} \simeq \varphi_e</math>, not <math>f(e) = e</math>.<br />
<br />
<br />
Repeatedly using the facts that (1) <math>i</math> is an index for <math>f\circ d</math>, and (2) <math>d(i) = \varphi_i(i)</math>, allows us to create an iteration effect:<br />
<br />
<math>\varphi_i(i) \simeq f(d(i)) \simeq f(\varphi_i(i)) \simeq f(f(d(i))) \simeq f(f(\varphi_i(i))) \simeq \cdots \simeq f\circ \cdots \circ f \circ d(i)</math><br />
<br />
(I'm wondering if there's some deeper meaning to this. So far it's just an interesting connection between diagonalization-based fixed points and iteration-based fixed points. I think there might be a connection between this and the [https://medium.com/@cdsmithus/fixpoints-in-haskell-294096a9fc10 fix function in Haskell].)<br />
<br />
<br />
In the more rigorous/careful version of the proof, we use the [[s-m-n theorem]] to get an index of a function, <math>s</math>, which is basically like <math>d</math>. The difference is that <math>\varphi_x(x)</math> might not be defined for all <math>x</math> (actually it isn't, since some partial functions are always undefined) so <math>d</math> is not total. On the other hand, <math>s</math> is obtained via the s-m-n theorem so is total. When <math>\varphi_x(x)</math> is undefined, <math>s(x)</math> gives an index of the always-undefined partial function. So <math>s</math> says "this is undefined" in a defined way. Thanks to this property, the expression <math>\varphi_{s(x)}</math> always makes sense, whereas <math>\varphi_{\varphi_x(x)}</math> sometimes doesn't make sense.<br />
<br />
<br />
See also https://machinelearning.subwiki.org/wiki/User:IssaRice/Computability_and_logic/Rogers_fixed_point_theorem_using_Sipser%27s_notation<br />
<br />
==Diagonalization lemma==<br />
<br />
(semantic version)<br />
<br />
Let <math>A</math> be a formula with one free variable. Then there exists a sentence <math>G</math> such that <math>G</math> iff <math>A(\ulcorner G\urcorner)</math>.<br />
<br />
Define <math>\mathrm{diag}(x)</math> to be <math>\ulcorner C(\ulcorner C\urcorner)\urcorner</math> where <math>x = \ulcorner C\urcorner</math>. In other words, given a number <math>x</math>, the function <math>\mathrm{diag}</math> finds the formula with that Gödel number, then diagonalizes it (i.e. substitutes the Gödel number of the formula into the formula itself), then returns the Gödel number of the resulting sentence.<br />
<br />
Let <math>B</math> be <math>A(\mathrm{diag}(x))</math>, and let <math>G</math> be <math>B(\ulcorner B\urcorner)</math>.<br />
<br />
Then <math>G</math> is <math>A(\mathrm{diag}(\ulcorner B\urcorner))</math>, by substituting <math>x = \ulcorner B\urcorner</math> in the definition of <math>B</math>.<br />
<br />
We also have <math>\mathrm{diag}(\ulcorner B\urcorner) = \ulcorner B(\ulcorner B\urcorner)\urcorner</math> by definition of <math>\mathrm{diag}</math>. By definition of <math>G</math>, this is <math>\ulcorner G\urcorner</math>, so we have <math>\mathrm{diag}(\ulcorner B\urcorner) = \ulcorner G\urcorner</math>.<br />
<br />
To complete the proof, apply <math>A</math> to both sides of the final equality to obtain <math>A(\mathrm{diag}(\ulcorner B\urcorner))</math> iff <math>A(\ulcorner G\urcorner)</math>; this simplifies to <math>G</math> iff <math>A(\ulcorner G\urcorner)</math>.<br />
<br />
<ref name="gaifman">Haim Gaifman. [https://web.archive.org/web/20180205090617/http://www.columbia.edu/~hg17/naming-diag.pdf "Naming and Diagonalization, from Cantor to Gödel to Kleene"].</ref><br />
<br />
<ref>https://mathoverflow.net/questions/30874/arithmetic-fixed-point-theorem</ref><br />
<br />
===Use of extra quantified variable to make a substitution===<br />
<br />
(see p. 448 of GEB)<br />
<br />
outside the formal system, if we have some function f, a constant a, and some one-place relation R, we can substitute f(a) into R like: R(f(a)). but many systems of formal logic don't have a way to directly talk about outputs of functions like f(a). instead, they have a relation like F(a,y) to mean f(a)=y. [why on earth would they do this? i think the basic reason is that if functions are just relations, we have fewer cases to prove in those annoying structural induction proofs. The simpler our formal system is, the shorter our proofs, but also the more annoying the system is to use in practice. Peter Smith mentions a similar trade-off between axiomatic/Hilbertian systems and natural deduction systems.] so how do we express an idea like R(f(a))? we can use an extra variable to hold the output, e.g. <math>F(a,y)\wedge R(y)</math>. but this leaves y free, so actually we want <math>\exists y (F(a,y)\wedge R(y))</math>. alternatively, we can say <math>F(a,y) \to R(y)</math>, in which case we want <math>\forall y (F(a,y) \to R(y))</math>. it's easy to prove that, provided F really does define a function (for each a there is exactly one y with F(a,y)), both of these clumsy ways of writing are logically equivalent to R(f(a)).<br />
<br />
something similar happens when we want to diagonalize formulas. given a formula <math>A(x)</math> that has just x free, it's easy enough to diagonalize it: <math>A(\ulcorner A(x)\urcorner)</math>. but what about a sentence like <math>B</math>? how do we "substitute" in something into something that has no free variable? [why on earth would you want to try that? i think it simplifies the proof a little if we assume diagonalization is defined for any sentence. but i forgot where exactly the simplification occurs.] the idea is again to make use of a separate variable: <math>\exists x (x = \ulcorner A(x)\urcorner \wedge A(x))</math>. again, i think we could also do <math>\forall x (x = \ulcorner A(x)\urcorner \to A(x))</math>. we're basically considering a function that finds the godel number of a sentence. except unlike a relation, a single-free-variable formula fixes some specific variable that it leaves free (a relation doesn't know whether it's x or y that's free -- it just expresses some idea), so we need to fix some single variable to use throughout.<br />
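<br />
a brute-force sanity check of the equivalence from the first paragraph, over a toy finite domain (f, R, and the domain are arbitrary choices of mine; note the check relies on F being the graph of a function):<br />

```python
# Check that the two "clumsy" encodings agree with R(f(a)) when F(a,y)
# is the graph of a function f, by exhaustive evaluation.
domain = range(5)
f = lambda a: (a * a) % 5          # toy function
R = lambda y: y % 2 == 0           # toy one-place relation
F = lambda a, y: f(a) == y         # F(a,y) encodes f(a) = y

for a in domain:
    direct = R(f(a))                                            # R(f(a))
    exists_form = any(F(a, y) and R(y) for y in domain)         # ∃y(F∧R)
    forall_form = all((not F(a, y)) or R(y) for y in domain)    # ∀y(F→R)
    assert direct == exists_form == forall_form
print("all three encodings agree")
```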
<br />
==Trying to discover the lemma==<br />
<br />
===approach 1===<br />
<br />
https://mathoverflow.net/a/31374<br />
<br />
===approach 2===<br />
<br />
see Owings paper.<br />
<br />
In the framework of this paper, we have a matrix where each entry is of a certain type. Then we apply the function <math>\alpha</math> to the diagonal. If the diagonal turns into one of the rows, <math>\alpha</math> has a fixed point.<br />
<br />
So now the trick is to figure out what our <math>\alpha</math> should be, and also what our matrix should look like.<br />
<br />
Picking the <math>\alpha</math> doesn't seem hard: we want a fixed point for the operation <math>\varphi_{f(-)}</math>, so we can pick <math>\alpha(\varphi_e) = \varphi_{f(e)}</math>. One problem is that this might not be well-defined, but we can just go with this for now (it ends up not mattering, for reasons I don't really understand, but the Owings paper has another workaround, which is to use relations; I find that more confusing).<br />
<br />
The matrix that works turns out to have entries <math>\varphi_{\varphi_j(k)}</math>. I'm not sure how one would have figured this out. One might also think <math>\varphi_j(k)</math> would work, but notice that then we fail the type checking with <math>\alpha</math> (which takes a function, not a natural number).<br />
<br />
So now we take the diagonal, which has entries <math>\varphi_{\varphi_k(k)}</math>, for <math>k = 0, 1, 2, \ldots</math>, and apply <math>\alpha</math>. We get <math>\varphi_{f(\varphi_k(k))}</math>. But <math>d</math> defined by <math>d(x) = \varphi_x(x)</math> is a recursive function, so the diagonal has turned into <math>\varphi_{f(d(k))} = \varphi_{f\circ d(k)}</math>. Since a composition of recursive functions is itself recursive, <math>f\circ d</math> is recursive. So we have some index <math>e</math> for it, i.e. <math>f\circ d \simeq \varphi_e</math>. So <math>\alpha</math> applied to the diagonal results in <math>\varphi_{\varphi_e(k)}</math>, which is one of the rows (the <math>e</math>th row). This means <math>\alpha</math> has a fixed point, in the <math>e</math>th entry, i.e. at <math>\varphi_{\varphi_e(e)}</math>. So we expect <math>\alpha(\varphi_{\varphi_e(e)})=\varphi_{\varphi_e(e)}</math>. Since <math>\alpha(\varphi_{\varphi_e(e)}) = \varphi_{f(\varphi_e(e))}</math>, the "real" fixed point for the operator will be at <math>\varphi_e(e)</math>. Indeed, <math>\varphi_{f(\varphi_e(e))} \simeq \varphi_{f\circ d(e)} \simeq \varphi_{\varphi_e(e)}</math>.<br />
<br />
Now we have to verify that the argument goes through even though <math>\alpha</math> might not be well-defined.<br />
<br />
===approach 3===<br />
<br />
Take Cantor's theorem, generalize it to mention fixed points, then take the contrapositive. See the Yanofsky paper for details.<br />
<br />
This version still has some mystery for me, e.g. replacing "the set has at least two elements" with "there is a function from the set to itself without a fixed point". The logical equivalence is easy to see, but getting the idea for rephrasing this condition to mention fixed points is not obvious at all.<br />
<br />
The use of the s-m-n theorem also isn't obvious to me. Why use it at all? Why use it on <math>g</math>? Why do we care about the index of <math>s</math>?<br />
<br />
It's also not clear to me why we use <math>T = \mathbf N</math> and <math>Y = \mathcal F</math>. In some sense it does make sense, like the natural numbers are all the algorithms, and the set of computable functions are the "properties" (a.k.a. "the objects being named").<br />
<br />
===approach 4===<br />
<br />
http://www.andrew.cmu.edu/user/kk3n/complearn/chapter8.pdf -- see section 8.1<br />
<br />
also see Moore and Mertens's section on lambda calculus<br />
<br />
in the untyped lambda calculus, there is only one type of entity, namely functions, which can operate on other functions. This makes it easy for functions to operate on themselves, which creates self-reference.<br />
<br />
but when working with partial recursive functions, we don't have this. instead, we have numbers and then partial functions that operate on numbers. to get self-reference, we need some kind of encoding. this is why we numbered the partial recursive functions.<br />
<br />
but now, one of the familiar facts about the lambda calculus is the existence of the fixed point combinator (aka y combinator). (note: this passes the buck to wondering how one would have come up with the lambda calculus, or how one would come up with the fixed point combinator in that setting; but this seems easier to answer.) since this theorem works in one setting in which we have self-reference, we might wonder if we can "port over" the theorem to the case where we have self-reference in a different setting.<br />
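<br />
for concreteness, the strict-language variant of the fixed point combinator (the Z combinator, an eta-expanded Y) can be written directly in Python; factorial is the standard toy example, not tied to anything above:<br />

```python
# Z(F) returns a function g with g == F(g) (extensionally), built purely
# by self-application -- the self-reference trick described above.
Z = lambda F: (lambda x: F(lambda v: x(x)(v)))(lambda x: F(lambda v: x(x)(v)))

# factorial as the fixed point of a non-recursive "step" functional
step = lambda rec: lambda n: 1 if n == 0 else n * rec(n - 1)
fact = Z(step)
print(fact(5))  # 120
```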
<br />
==Comparison table==<br />
<br />
Some things to notice:<br />
<br />
* The two theorems are essentially identical, with identical proofs, as seen by the matching rows. The analogy breaks down slightly at the very end, where we apply <math>\varphi_{f(\cdot)}</math> vs <math>A(\cdot)</math> (the latter corresponds to <math>f</math> until the very end).<br />
* In the partial recursive functions world, it's easy to go from the index (e.g. <math>e</math>) to the partial function (<math>\varphi_e</math>). In the formulas world it's the reverse: it's easy to go from a formula (e.g. <math>A</math>) to its Gödel number (<math>\ulcorner A\urcorner</math>). I wonder if there is something essential here, or if it is simply some sort of historical accident in notation.<br />
* For the diagonalization lemma, here we have done the semantic version (? I think...), but usually the manipulations are done inside a formal system with reference to some theory <math>T</math> to derive a syntactic result (i.e. we have some theory that is strong enough to do all these manipulations within the object-level language). For partial recursive functions, as far as I know, there is no analogous distinction between semantics vs syntax.<br />
* The diagonalization part is not completely correct/as strong as possible for both proofs. For the partial recursive functions side, we want to make sure that <math>\varphi_{\varphi_x(x)}</math> is actually defined in each case. For the logic side, I think often the diagonalization is defined as <math>\exists x(x = \ulcorner A\urcorner \wedge A)</math> so that it is defined for all formulas, not just ones with one free variable. But the essential ideas are all present below, and since this makes the comparison easier, the presentation is simplified.<br />
<br />
{| class="wikitable"<br />
|-<br />
! Step !! Rogers's fixed point theorem !! Diagonalization lemma<br />
|-<br />
| Theorem statement (note: quantifiers are part of the metalanguage) || <math>(\forall f \exists e)\ \varphi_e \simeq \varphi_{f(e)}</math> || <math>(\forall A \exists G)\ G \leftrightarrow A(\ulcorner G\urcorner)</math><br />
|-<br />
| Given mapping || <math>f</math> || <math>A</math><br />
|-<br />
| Definition of diagonal function || <math>d(x) = \varphi_x(x)</math> || <math>\mathrm{diag}(\ulcorner C\urcorner) = \ulcorner C(\ulcorner C\urcorner)\urcorner</math><br />
|-<br />
| Composition of given mapping with diagonal function (<math>\mathrm{given} \circ \mathrm{diagonal}</math>) || <math>f(d(x))</math> || <math>A(\mathrm{diag}(x))</math><br />
|-<br />
| Naming the <math>\mathrm{given} \circ \mathrm{diagonal}</math> composition || <math>f\circ d</math> (name not given because compositions are easy to express outside a formal language) || <math>B</math><br />
|-<br />
| Index of <math>\mathrm{given} \circ \mathrm{diagonal}</math> composition || <math>i</math> || <math>\ulcorner B\urcorner</math><br />
|-<br />
| Expanding using definition of diagonal || <math>d(i) = \varphi_i(i)</math> || <math>\mathrm{diag}(\ulcorner B\urcorner) = \ulcorner B(\ulcorner B\urcorner) \urcorner</math><br />
|-<br />
| The <math>\mathrm{given} \circ \mathrm{diagonal}</math> composition applied to own index (i.e. diagonalization of the composition) || <math>f\circ d(i)</math> || <math>B(\ulcorner B\urcorner)</math><br />
|-<br />
| G defined || <math>\varphi_i(i)</math> (no equivalent definition) || <math>G</math> is <math>B(\ulcorner B\urcorner)</math><br />
|-<br />
| Composition evaluated at its own index || <math>f(d(i)) = \varphi_i(i)</math> || <math>A(\mathrm{diag}(\ulcorner B\urcorner)) \leftrightarrow B(\ulcorner B\urcorner)</math><br />
|-<br />
| Renaming index || <math>e = d(i)</math> || <math>\ulcorner G\urcorner = \mathrm{diag}(\ulcorner B\urcorner)</math><br />
|-<br />
| Leibniz law to previous row || Apply <math>\varphi_{f(\cdot)}</math> to obtain <math>\varphi_{f(e)} = \varphi_{f(d(i))}</math> || Apply <math>A(\cdot)</math> to obtain <math>A(\ulcorner G\urcorner) \leftrightarrow A(\mathrm{diag}(\ulcorner B\urcorner))</math><br />
|-<br />
| Use definition of G || <math>\varphi_{f(e)} = \varphi_{\varphi_i(i)} = \varphi_e</math> || <math>A(\ulcorner G\urcorner) \leftrightarrow B(\ulcorner B\urcorner) \leftrightarrow G</math><br />
|-<br />
| (Definition of G)? || <math>\varphi_i(i)</math> is <math>f(d(i))</math> || <math>G</math> is <math>A(\mathrm{diag}(\ulcorner B\urcorner))</math><br />
|}<br />
<br />
==Quotes==<br />
<br />
"All of these theorems tend to strain one's intuition; in fact, many people find them almost paradoxical. The most popular proofs of these theorems only serve to aggravate the situation because they are completely unmotivated, seem to depend upon a low combinatorial trick, and are so barbarically short as to be nearly incapable of rational analysis."<ref>James C. Owings, Jr. "Diagonalization and the Recursion Theorem". 1973.</ref><br />
<br />
"This is just a lovely result, insightful in its concept and far reaching in its consequences. We’d love to say that the proof was also lovely and enlightening, but to be honest, we don’t have an enlightening sort of proof to show you. Sometimes the best way to describe a proof is that the argument sort of picks you up and shakes you until you agree that it does, in fact, establish what it is supposed to establish. That’s what you get here."<ref>Christopher C. Leary; Lars Kristiansen. ''A Friendly Introduction to Mathematical Logic'' (2nd ed). p. 172.</ref><br />
<br />
"The brevity of the proof does not make for transparency; it has the aura of a magician’s trick. How did Gödel ever come up with the idea? As a matter of fact, Gödel did not come up with that idea."<ref name="gaifman"/><br />
<br />
==Questions/things to explain==<br />
<br />
* In Peter Smith's book, he defines Gdl(m,n) as Prf(m, diag(n)). What is the analogue of Gld for the Rogers fixed point theorem?<br />
* I like the <math>D(\ulcorner \varphi \urcorner) \iff \varphi(\ulcorner \varphi \urcorner)</math> that begins [https://mathoverflow.net/a/31374 this answer], but what is the analogue for partial functions? It seems like it is <math>d(x) = \varphi_x(x)</math>, which ''does'' exist (because we are allowed to have undefined values). So the motivation that works for the logic version doesn't work for the partial functions version, which bugs me.<br />
<br />
==References==<br />
<br />
<references/></div>IssaRicehttps://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Computability_and_logic/Diagonalization_lemma&diff=3289User:IssaRice/Computability and logic/Diagonalization lemma2021-06-07T19:34:24Z<p>IssaRice: /* Use of extra quantified variable to make a substitution */</p>
<hr />
<div>The '''diagonalization lemma''', also called the '''Gödel–Carnap fixed point theorem''', is a fixed point theorem in logic.<br />
<br />
A verbal version of this result, modified from GEB (p. 449), runs as follows: Take the predicate cannot-be-proved-when-diagonalized(x). This takes a predicate in the x input, and says whether the sentence can be proved when diagonalized (i.e. inserted into itself). For instance, cannot-be-proved-when-diagonalized("has-length-less-than-one-thousand(x)") claims that has-length-less-than-one-thousand("has-length-less-than-one-thousand(x)") cannot be proved. In this case, let's say it's false, since we can see that the string "has-length-less-than-one-thousand(x)" has length less than 1000, and let's assume our proof system is strong enough to prove this. Now, to diagonalize cannot-be-proved-when-diagonalized(x) is to form the sentence cannot-be-proved-when-diagonalized("cannot-be-proved-when-diagonalized(x)"). So can this sentence be proved or not? If it can be proved, then the sentence itself claims that it cannot be proved, a contradiction. So it must not be provable.<br />
<br />
Basically, unlike in English, a sentence in a formal language can't refer to itself using a phrase like "this sentence itself", so there is no straightforward way to make claims like "This sentence cannot be proved". To get around this restriction, we use diagonalization: substituting a sentence's own encoding (i.e. string representation, a.k.a. "Gödel number") into itself. This allows a predicate to talk about its own string representation. If that predicate happens to claim unprovability, we get Gödel's first incompleteness theorem.<br />
<br />
The diagonalization lemma generalizes this to an arbitrary predicate P(x), not just not-provable(x). We want to find a sentence G such that G is true if and only if P("G") is true (this is a little sloppy: "G" here stands for the string representation of whatever sentence G turns out to be). Let G be has-property-P-when-diagonalized("has-property-P-when-diagonalized(x)"). If G is true, then has-property-P-when-diagonalized(x) has property P when diagonalized, i.e. P("has-property-P-when-diagonalized('has-property-P-when-diagonalized(x)')") is true, which means P("G") is true. If G is false, then has-property-P-when-diagonalized(x) does not have property P when diagonalized, i.e. P("has-property-P-when-diagonalized('has-property-P-when-diagonalized(x)')") is false, which means P("G") is false. Thus G is indeed true if and only if P("G") is true, so we have successfully "diagonalized" the predicate P(x).<br />
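The quotation-and-insertion step can be sketched at the level of raw strings; the predicate name and the treatment of the placeholder (x) below are made up purely for illustration:<br />

```python
# A toy, purely string-level illustration of diagonalizing a predicate.
# (Hypothetical mini-language: a "predicate" is a string ending in "(x)".)

def diagonalize(pred: str) -> str:
    """Substitute the predicate's own quotation for its free variable x."""
    return pred.replace("(x)", '("' + pred + '")')

p = 'has-property-P-when-diagonalized(x)'
g = diagonalize(p)
print(g)
# has-property-P-when-diagonalized("has-property-P-when-diagonalized(x)")
```

The output sentence g is exactly the G of the argument above: the predicate applied to its own quotation.<br />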
<br />
==Rogers's fixed point theorem==<br />
<br />
Let <math>f</math> be a total computable function. Then there exists an index <math>e</math> such that <math>\varphi_e \simeq \varphi_{f(e)}</math>.<br />
<br />
(simplified)<br />
<br />
Define <math>d(x) = \varphi_x(x)</math> (this is actually slightly wrong, since <math>d</math> so defined is not total, as discussed below, but it brings out the analogy better).<br />
<br />
Consider the function <math>f\circ d</math>. This is partial recursive, so <math>f\circ d \simeq \varphi_i</math> for some index <math>i</math>.<br />
<br />
Now <math>\varphi_{f(d(i))} \simeq \varphi_{\varphi_i(i)}</math> since <math>f\circ d \simeq \varphi_i</math>. This is equivalent to <math>\varphi_{d(i)}</math> by definition of <math>d</math>. Thus, we may take <math>e = d(i)</math> to complete the proof.<br />
<br />
It looks like we have <math>f(d(i)) = \varphi_i(i) = d(i)</math>, i.e. <math>f(e) = e</math>. Is this right?<br />
<br />
<br />
Repeatedly using the facts that (1) <math>i</math> is an index for <math>f\circ d</math>, and (2) <math>d(i) = \varphi_i(i)</math>, allows us to create an iteration effect:<br />
<br />
<math>\varphi_i(i) \simeq f(d(i)) \simeq f(\varphi_i(i)) \simeq f(f(d(i))) \simeq f(f(\varphi_i(i))) \simeq \cdots \simeq f\circ \cdots \circ f \circ d(i)</math><br />
<br />
(I'm wondering if there's some deeper meaning to this. So far it's just an interesting connection between diagonalization-based fixed points and iteration-based fixed points. I think there might be a connection between this and the [https://medium.com/@cdsmithus/fixpoints-in-haskell-294096a9fc10 fix function in Haskell].)<br />
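For comparison, the Haskell fix function mentioned above can be imitated in Python; since Python is strict, this sketch uses the eta-expanded (Z-combinator) form rather than Haskell's lazy definition:<br />

```python
def fix(f):
    # call-by-value fixed-point combinator (the Z combinator):
    # fix(f) behaves like f(fix(f)), unfolded one step at a time
    return (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))

# factorial obtained as the fixed point of its one-step unfolding
fact = fix(lambda rec: lambda n: 1 if n == 0 else n * rec(n - 1))
print(fact(5))  # 120
```

Each call to rec unfolds f one more step, which is the same iteration effect as in the chain above.<br />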
<br />
<br />
In the more rigorous/careful version of the proof, we use the [[s-m-n theorem]] to get an index of a function, <math>s</math>, which is basically like <math>d</math>. The difference is that <math>\varphi_x(x)</math> might not be defined for all <math>x</math> (actually it isn't, since some partial functions are always undefined) so <math>d</math> is not total. On the other hand, <math>s</math> is obtained via the s-m-n theorem so is total. When <math>\varphi_x(x)</math> is undefined, <math>s(x)</math> gives an index of the always-undefined partial function. So <math>s</math> says "this is undefined" in a defined way. Thanks to this property, the expression <math>\varphi_{s(x)}</math> always makes sense, whereas <math>\varphi_{\varphi_x(x)}</math> sometimes doesn't make sense.<br />
<br />
<br />
See also https://machinelearning.subwiki.org/wiki/User:IssaRice/Computability_and_logic/Rogers_fixed_point_theorem_using_Sipser%27s_notation<br />
<br />
==Diagonalization lemma==<br />
<br />
(semantic version)<br />
<br />
Let <math>A</math> be a formula with one free variable. Then there exists a sentence <math>G</math> such that <math>G</math> iff <math>A(\ulcorner G\urcorner)</math>.<br />
<br />
Define <math>\mathrm{diag}(x)</math> to be <math>\ulcorner C(\ulcorner C\urcorner)\urcorner</math> where <math>x = \ulcorner C\urcorner</math>. In other words, given a number <math>x</math>, the function <math>\mathrm{diag}</math> finds the formula with that Gödel number, then diagonalizes it (i.e. substitutes the Gödel number of the formula into the formula itself), then returns the Gödel number of the resulting sentence.<br />
<br />
Let <math>B</math> be <math>A(\mathrm{diag}(x))</math>, and let <math>G</math> be <math>B(\ulcorner B\urcorner)</math>.<br />
<br />
Then <math>G</math> is <math>A(\mathrm{diag}(\ulcorner B\urcorner))</math>, by substituting <math>x = \ulcorner B\urcorner</math> in the definition of <math>B</math>.<br />
<br />
We also have <math>\mathrm{diag}(\ulcorner B\urcorner) = \ulcorner B(\ulcorner B\urcorner)\urcorner</math> by definition of <math>\mathrm{diag}</math>. By definition of <math>G</math>, this is <math>\ulcorner G\urcorner</math>, so we have <math>\mathrm{diag}(\ulcorner B\urcorner) = \ulcorner G\urcorner</math>.<br />
<br />
To complete the proof, apply <math>A</math> to both sides of the final equality to obtain <math>A(\mathrm{diag}(\ulcorner B\urcorner))</math> iff <math>A(\ulcorner G\urcorner)</math>; this simplifies to <math>G</math> iff <math>A(\ulcorner G\urcorner)</math>.<br />
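The whole construction can be mimicked concretely if we cheat and let a formula's source text serve as its own Gödel number, with Python's eval standing in for the semantics; diag and the predicate A below are illustrative stand-ins, not arithmetized formulas:<br />

```python
def diag(code: str) -> str:
    # diag(code of C) = code of C(code of C):
    # substitute the quoted code for the first free occurrence of x
    return code.replace("x", repr(code), 1)

received = []
def A(n):
    # the given "predicate": here it merely records the code it is handed
    received.append(n)
    return True

b = "A(diag(x))"   # B is A(diag(x))
g = diag(b)        # code of G = diag(code of B) = code of B(code of B)
eval(g)            # evaluating G applies A to ...
assert received[0] == g   # ... exactly G's own code: G asserts A of itself
```

The assertion passing is the fixed-point property: evaluating G hands A precisely the code of G.<br />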
<br />
<ref name="gaifman">Haim Gaifman. [https://web.archive.org/web/20180205090617/http://www.columbia.edu/~hg17/naming-diag.pdf "Naming and Diagonalization, from Cantor to Gödel to Kleene"].</ref><br />
<br />
<ref>https://mathoverflow.net/questions/30874/arithmetic-fixed-point-theorem</ref><br />
<br />
===Use of extra quantified variable to make a substitution===<br />
<br />
(see p. 448 of GEB)<br />
<br />
outside the formal system, if we have some function f, a constant a, and some one-place relation R, we can substitute f(a) into R like so: R(f(a)). but many systems of formal logic don't have a way to directly talk about outputs of functions like f(a). instead, they have a relation like F(a,y) to mean f(a)=y. [why on earth would they do this? i think the basic reason is that if functions are just relations, there are fewer cases to prove in those annoying structural induction proofs. The simpler our formal system is, the shorter our proofs, but also the more annoying the system is to use in practice.] so how do we express an idea like R(f(a))? we can use an extra variable to hold the output, e.g. <math>F(a,y)\wedge R(y)</math>. but this leaves y free, so actually we want <math>\exists y (F(a,y)\wedge R(y))</math>. alternatively, we can say <math>F(a,y) \to R(y)</math>; in this case we want <math>\forall y (F(a,y) \to R(y))</math>. it's easy to prove that, as long as F is the graph of a total function f, both of these clumsy ways of writing are logically equivalent to R(f(a)).<br />
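as a sanity check, the claimed equivalence can be verified by brute force over a small finite domain (the particular f, F, and R below are arbitrary made-up examples):<br />

```python
# brute-force check that the existential and universal encodings
# both agree with R(f(a)) whenever F is the graph of a total function f
dom = range(10)
f = lambda a: (a * a + 3) % 10      # an arbitrary total function on dom
F = lambda a, y: y == f(a)          # its graph relation: F(a,y) iff f(a)=y
R = lambda y: y % 2 == 0            # an arbitrary one-place predicate

for a in dom:
    exists_form = any(F(a, y) and R(y) for y in dom)   # ∃y(F(a,y) ∧ R(y))
    forall_form = all((not F(a, y)) or R(y) for y in dom)  # ∀y(F(a,y) → R(y))
    assert exists_form == forall_form == R(f(a))
print("equivalence verified on the finite domain")
```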
<br />
something similar happens when we want to diagonalize formulas. given a formula <math>A(x)</math> that has just x free, it's easy enough to diagonalize it: <math>A(\ulcorner A(x)\urcorner)</math>. but what about a sentence like <math>B</math>? how do we "substitute" something into a sentence that has no free variable? [why on earth would you want to try that? i think it simplifies the proof a little if we assume diagonalization is defined for any sentence. but i forgot where exactly the simplification occurs.] the idea is again to use a separate variable: <math>\exists x (x = \ulcorner A(x)\urcorner \wedge A(x))</math>. again, i think we could also do <math>\forall x (x = \ulcorner A(x)\urcorner \to A(x))</math>. we're basically considering a function that finds the Gödel number of a sentence. except unlike a relation, a single-free-variable formula fixes some specific variable that it leaves free (a relation doesn't know whether it's x or y that's free -- it just expresses some idea), so we need to fix a single variable to use throughout.<br />
<br />
==Trying to discover the lemma==<br />
<br />
===approach 1===<br />
<br />
https://mathoverflow.net/a/31374<br />
<br />
===approach 2===<br />
<br />
see Owings paper.<br />
<br />
In the framework of this paper, we have a matrix where each entry is of a certain type. Then we apply the function <math>\alpha</math> to the diagonal. If the diagonal turns into one of the rows, <math>\alpha</math> has a fixed point.<br />
<br />
So now the trick is to figure out what our <math>\alpha</math> should be, and also what our matrix should look like.<br />
<br />
Picking the <math>\alpha</math> doesn't seem hard: we want a fixed point for the operation <math>\varphi_{f(-)}</math>, so we can pick <math>\alpha(\varphi_e) = \varphi_{f(e)}</math>. One problem is that this might not be well-defined, but we can just go with this for now (it ends up not mattering, for reasons I don't really understand, but the Owings paper has another workaround, which is to use relations; I find that more confusing).<br />
<br />
The matrix that works turns out to have entries <math>\varphi_{\varphi_j(k)}</math>. I'm not sure how one would have figured this out. One might also think <math>\varphi_j(k)</math> would work, but notice that then we fail the type checking with <math>\alpha</math> (which takes a function, not a natural number).<br />
<br />
So now we take the diagonal, which has entries <math>\varphi_{\varphi_k(k)}</math>, for <math>k = 0, 1, 2, \ldots</math>, and apply <math>\alpha</math>. We get <math>\varphi_{f(\varphi_k(k))}</math>. But <math>d</math> defined by <math>d(x) = \varphi_x(x)</math> is a recursive function, so the diagonal has turned into <math>\varphi_{f(d(k))} = \varphi_{f\circ d(k)}</math>. Since a composition of recursive functions is itself recursive, <math>f\circ d</math> is recursive. So we have some index <math>e</math> for it, i.e. <math>f\circ d \simeq \varphi_e</math>. So <math>\alpha</math> applied to the diagonal results in <math>\varphi_{\varphi_e(k)}</math>, which is one of the rows (the <math>e</math>th row). This means <math>\alpha</math> has a fixed point, in the <math>e</math>th entry, i.e. at <math>\varphi_{\varphi_e(e)}</math>. So we expect <math>\alpha(\varphi_{\varphi_e(e)})=\varphi_{\varphi_e(e)}</math>. Since <math>\alpha(\varphi_{\varphi_e(e)}) = \varphi_{f(\varphi_e(e))}</math>, the "real" fixed point for the operator will be at <math>\varphi_e(e)</math>. Indeed, <math>\varphi_{f(\varphi_e(e))} \simeq \varphi_{f\circ d(e)} \simeq \varphi_{\varphi_e(e)}</math>.<br />
<br />
It remains to verify that the argument goes through even though <math>\alpha</math> need not be well-defined.<br />
<br />
===approach 3===<br />
<br />
Take Cantor's theorem, generalize it to mention fixed points, then take the contrapositive. See the Yanofsky paper for details.<br />
<br />
This version still has some mystery for me, e.g. replacing "the set has at least two elements" with "there is a function from the set to itself without a fixed point". The logical equivalence is easy to see, but getting the idea for rephrasing this condition to mention fixed points is not obvious at all.<br />
<br />
The use of the s-m-n theorem also isn't obvious to me. Why use it at all? Why use it on <math>g</math>? Why do we care about the index of <math>s</math>?<br />
<br />
It's also not clear to me why we use <math>T = \mathbf N</math> and <math>Y = \mathcal F</math>. In some sense it is natural: the natural numbers index all the algorithms, and the set of computable functions are the "properties" (a.k.a. "the objects being named").<br />
<br />
===approach 4===<br />
<br />
http://www.andrew.cmu.edu/user/kk3n/complearn/chapter8.pdf -- see section 8.1<br />
<br />
also see Moore and Mertens's section on lambda calculus<br />
<br />
in the untyped lambda calculus, there is only one type of entity, namely functions, which can operate on other functions. This makes it easy for functions to operate on themselves, which creates self-reference.<br />
<br />
but when working with partial recursive functions, we don't have this. instead, we have numbers and then partial functions that operate on numbers. to get self-reference, we need some kind of encoding. this is why we numbered the partial recursive functions.<br />
<br />
but now, one of the familiar facts about the lambda calculus is the existence of the fixed point combinator (aka y combinator). (note: this passes the buck to wondering how one would have come up with the lambda calculus, or how one would come up with the fixed point combinator in that setting; but this seems easier to answer.) since this theorem works in one setting in which we have self-reference, we might wonder if we can "port over" the theorem to the case where we have self-reference in a different setting.<br />
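this "porting over" of self-application into the encoded setting is essentially what a quine does: the program acts on a string encoding of itself. a minimal Python sketch:<br />

```python
# a classic quine: its output is its own source text, i.e. a fixed point
# of the transformation "run the program and collect what it prints"
s = 's = %r\nprint(s %% s)'
print(s % s)
```

here s plays the role of an index/encoding, and s % s is the analogue of diagonalization: substituting the encoding into itself.<br />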
<br />
==Comparison table==<br />
<br />
Some things to notice:<br />
<br />
* The two theorems are essentially identical, with identical proofs, as seen by the matching rows. The analogy breaks down slightly at the very end, where we apply <math>\varphi_{f(\cdot)}</math> vs <math>A(\cdot)</math> (the latter corresponds to <math>f</math> until the very end).<br />
* In the partial recursive functions world, it's easy to go from the index (e.g. <math>e</math>) to the partial function (<math>\varphi_e</math>). In the formulas world it's the reverse: it's easy to go from a formula (e.g. <math>A</math>) to its Gödel number <math>\ulcorner A\urcorner</math>. I wonder if there is something essential here, or if it is simply some sort of historical accident in notation.<br />
* For the diagonalization lemma, here we have done the semantic version (? I think...), but usually the manipulations are done inside a formal system with reference to some theory <math>T</math> to derive a syntactic result (i.e. we have some theory that is strong enough to do all these manipulations within the object-level language). For partial recursive functions, as far as I know, there is no analogous distinction between semantics vs syntax.<br />
* The diagonalization part is not completely correct/as strong as possible for both proofs. For the partial recursive functions side, we want to make sure that <math>\varphi_{\varphi_x(x)}</math> is actually defined in each case. For the logic side, I think often the diagonalization is defined as <math>\exists x(x = \ulcorner A\urcorner \wedge A)</math> so that it is defined for all formulas, not just ones with one free variable. But the essential ideas are all present below, and since this makes the comparison easier, the presentation is simplified.<br />
<br />
{| class="wikitable"<br />
|-<br />
! Step !! Rogers's fixed point theorem !! Diagonalization lemma<br />
|-<br />
| Theorem statement (note: quantifiers are part of the metalanguage) || <math>(\forall f \exists e)\ \varphi_e \simeq \varphi_{f(e)}</math> || <math>(\forall A \exists G)\ G \leftrightarrow A(\ulcorner G\urcorner)</math><br />
|-<br />
| Given mapping || <math>f</math> || <math>A</math><br />
|-<br />
| Definition of diagonal function || <math>d(x) = \varphi_x(x)</math> || <math>\mathrm{diag}(\ulcorner C\urcorner) = \ulcorner C(\ulcorner C\urcorner)\urcorner</math><br />
|-<br />
| Composition of given mapping with diagonal function (<math>\mathrm{given} \circ \mathrm{diagonal}</math>) || <math>f(d(x))</math> || <math>A(\mathrm{diag}(x))</math><br />
|-<br />
| Naming the <math>\mathrm{given} \circ \mathrm{diagonal}</math> composition || <math>f\circ d</math> (name not given because compositions are easy to express outside a formal language) || <math>B</math><br />
|-<br />
| Index of <math>\mathrm{given} \circ \mathrm{diagonal}</math> composition || <math>i</math> || <math>\ulcorner B\urcorner</math><br />
|-<br />
| Expanding using definition of diagonal || <math>d(i) = \varphi_i(i)</math> || <math>\mathrm{diag}(\ulcorner B\urcorner) = \ulcorner B(\ulcorner B\urcorner) \urcorner</math><br />
|-<br />
| The <math>\mathrm{given} \circ \mathrm{diagonal}</math> composition applied to own index (i.e. diagonalization of the composition) || <math>f\circ d(i)</math> || <math>B(\ulcorner B\urcorner)</math><br />
|-<br />
| G defined || <math>\varphi_i(i)</math> (no equivalent definition) || <math>G</math> is <math>B(\ulcorner B\urcorner)</math><br />
|-<br />
| Substituting the index into the definition of the composition || <math>f(d(i)) = \varphi_i(i)</math> || <math>A(\mathrm{diag}(\ulcorner B\urcorner)) \leftrightarrow B(\ulcorner B\urcorner)</math><br />
|-<br />
| Renaming index || <math>e = d(i)</math> || <math>\ulcorner G\urcorner = \mathrm{diag}(\ulcorner B\urcorner)</math><br />
|-<br />
| Leibniz law to previous row || Apply <math>\varphi_{f(\cdot)}</math> to obtain <math>\varphi_{f(e)} = \varphi_{f(d(i))}</math> || Apply <math>A(\cdot)</math> to obtain <math>A(\ulcorner G\urcorner) \leftrightarrow A(\mathrm{diag}(\ulcorner B\urcorner))</math><br />
|-<br />
| Use definition of G || <math>\varphi_{f(e)} = \varphi_{\varphi_i(i)} = \varphi_e</math> || <math>A(\ulcorner G\urcorner) \leftrightarrow B(\ulcorner B\urcorner) \leftrightarrow G</math><br />
|-<br />
| (Definition of G)? || <math>\varphi_i(i)</math> is <math>f(d(i))</math> || <math>G</math> is <math>A(\mathrm{diag}(\ulcorner B\urcorner))</math><br />
|}<br />
<br />
==Quotes==<br />
<br />
"All of these theorems tend to strain one's intuition; in fact, many people find them almost paradoxical. The most popular proofs of these theorems only serve to aggravate the situation because they are completely unmotivated, seem to depend upon a low combinatorial trick, and are so barbarically short as to be nearly incapable of rational analysis."<ref>James C. Owings, Jr. "Diagonalization and the Recursion Theorem". 1973.</ref><br />
<br />
"This is just a lovely result, insightful in its concept and far reaching in its consequences. We’d love to say that the proof was also lovely and enlightening, but to be honest, we don’t have an enlightening sort of proof to show you. Sometimes the best way to describe a proof is that the argument sort of picks you up and shakes you until you agree that it does, in fact, establish what it is supposed to establish. That’s what you get here."<ref>Christopher C. Leary; Lars Kristiansen. ''A Friendly Introduction to Mathematical Logic'' (2nd ed). p. 172.</ref><br />
<br />
"The brevity of the proof does not make for transparency; it has the aura of a magician’s trick. How did Gödel ever come up with the idea? As a matter of fact, Gödel did not come up with that idea."<ref name="gaifman"/><br />
<br />
==Questions/things to explain==<br />
<br />
* In Peter Smith's book, he defines Gdl(m,n) as Prf(m, diag(n)). What is the analogue of Gld for the Rogers fixed point theorem?<br />
* I like the <math>D(\ulcorner \varphi \urcorner) \iff \varphi(\ulcorner \varphi \urcorner)</math> that begins [https://mathoverflow.net/a/31374 this answer], but what is the analogue for partial functions? It seems like it is <math>d(x) = \varphi_x(x)</math>, which ''does'' exist (because we are allowed to have undefined values). So the motivation that works for the logic version doesn't work for the partial functions version, which bugs me.<br />
<br />
==References==<br />
<br />
<references/></div>IssaRicehttps://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Computability_and_logic/Diagonalization_lemma&diff=3288User:IssaRice/Computability and logic/Diagonalization lemma2021-06-07T19:33:06Z<p>IssaRice: /* Diagonalization lemma */</p>
<hr />
<div>The '''diagonalization lemma''', also called the '''Gödel–Carnap fixed point theorem''', is a fixed point theorem in logic.<br />
<br />
A verbal version of this result, modified from GEB (p. 449), runs as follows: Take the predicate cannot-be-proved-when-diagonalized(x). This takes a predicate as its x input, and asserts that the sentence obtained by diagonalizing that predicate (i.e. inserting the predicate's own quotation into itself) cannot be proved. For instance, cannot-be-proved-when-diagonalized("has-length-less-than-one-thousand(x)") claims that has-length-less-than-one-thousand("has-length-less-than-one-thousand(x)") cannot be proved. In this case the claim is false: the string "has-length-less-than-one-thousand(x)" does have length less than 1000, and, assuming our proof system is strong enough, this fact can be proved. Now, to diagonalize cannot-be-proved-when-diagonalized(x) itself is to form the sentence cannot-be-proved-when-diagonalized("cannot-be-proved-when-diagonalized(x)"). Can this sentence be proved or not? If it can be proved, then, since the sentence claims precisely that it cannot be proved, our system proves a falsehood; assuming the system proves only truths, this is a contradiction. So it must not be provable.<br />
<br />
Basically, unlike in English, a sentence in a formal language can't refer to itself using a phrase like "this sentence itself", so there is no straightforward way to make claims like "This sentence cannot be proved". To get around this restriction, we use diagonalization: substituting a sentence's own encoding (i.e. string representation, a.k.a. "Gödel number") into itself. This allows a predicate to talk about its own string representation. If that predicate happens to claim unprovability, we get Gödel's first incompleteness theorem.<br />
<br />
The diagonalization lemma generalizes this to an arbitrary predicate P(x), not just not-provable(x). We want to find a sentence G such that G is true if and only if P("G") is true (this is a little sloppy: "G" here stands for the string representation of whatever sentence G turns out to be). Let G be has-property-P-when-diagonalized("has-property-P-when-diagonalized(x)"). If G is true, then has-property-P-when-diagonalized(x) has property P when diagonalized, i.e. P("has-property-P-when-diagonalized('has-property-P-when-diagonalized(x)')") is true, which means P("G") is true. If G is false, then has-property-P-when-diagonalized(x) does not have property P when diagonalized, i.e. P("has-property-P-when-diagonalized('has-property-P-when-diagonalized(x)')") is false, which means P("G") is false. Thus G is indeed true if and only if P("G") is true, so we have successfully "diagonalized" the predicate P(x).<br />
<br />
==Rogers's fixed point theorem==<br />
<br />
Let <math>f</math> be a total computable function. Then there exists an index <math>e</math> such that <math>\varphi_e \simeq \varphi_{f(e)}</math>.<br />
<br />
(simplified)<br />
<br />
Define <math>d(x) = \varphi_x(x)</math> (this is actually slightly wrong, since <math>d</math> so defined is not total, as discussed below, but it brings out the analogy better).<br />
<br />
Consider the function <math>f\circ d</math>. This is partial recursive, so <math>f\circ d \simeq \varphi_i</math> for some index <math>i</math>.<br />
<br />
Now <math>\varphi_{f(d(i))} \simeq \varphi_{\varphi_i(i)}</math> since <math>f\circ d \simeq \varphi_i</math>. This is equivalent to <math>\varphi_{d(i)}</math> by definition of <math>d</math>. Thus, we may take <math>e = d(i)</math> to complete the proof.<br />
<br />
It looks like we have <math>f(d(i)) = \varphi_i(i) = d(i)</math>, i.e. <math>f(e) = e</math>. Is this right?<br />
<br />
<br />
Repeatedly using the facts that (1) <math>i</math> is an index for <math>f\circ d</math>, and (2) <math>d(i) = \varphi_i(i)</math>, allows us to create an iteration effect:<br />
<br />
<math>\varphi_i(i) \simeq f(d(i)) \simeq f(\varphi_i(i)) \simeq f(f(d(i))) \simeq f(f(\varphi_i(i))) \simeq \cdots \simeq f\circ \cdots \circ f \circ d(i)</math><br />
<br />
(I'm wondering if there's some deeper meaning to this. So far it's just an interesting connection between diagonalization-based fixed points and iteration-based fixed points. I think there might be a connection between this and the [https://medium.com/@cdsmithus/fixpoints-in-haskell-294096a9fc10 fix function in Haskell].)<br />
<br />
<br />
In the more rigorous/careful version of the proof, we use the [[s-m-n theorem]] to get an index of a function, <math>s</math>, which is basically like <math>d</math>. The difference is that <math>\varphi_x(x)</math> might not be defined for all <math>x</math> (actually it isn't, since some partial functions are always undefined) so <math>d</math> is not total. On the other hand, <math>s</math> is obtained via the s-m-n theorem so is total. When <math>\varphi_x(x)</math> is undefined, <math>s(x)</math> gives an index of the always-undefined partial function. So <math>s</math> says "this is undefined" in a defined way. Thanks to this property, the expression <math>\varphi_{s(x)}</math> always makes sense, whereas <math>\varphi_{\varphi_x(x)}</math> sometimes doesn't make sense.<br />
<br />
<br />
See also https://machinelearning.subwiki.org/wiki/User:IssaRice/Computability_and_logic/Rogers_fixed_point_theorem_using_Sipser%27s_notation<br />
<br />
==Diagonalization lemma==<br />
<br />
(semantic version)<br />
<br />
Let <math>A</math> be a formula with one free variable. Then there exists a sentence <math>G</math> such that <math>G</math> iff <math>A(\ulcorner G\urcorner)</math>.<br />
<br />
Define <math>\mathrm{diag}(x)</math> to be <math>\ulcorner C(\ulcorner C\urcorner)\urcorner</math> where <math>x = \ulcorner C\urcorner</math>. In other words, given a number <math>x</math>, the function <math>\mathrm{diag}</math> finds the formula with that Gödel number, then diagonalizes it (i.e. substitutes the Gödel number of the formula into the formula itself), then returns the Gödel number of the resulting sentence.<br />
<br />
Let <math>B</math> be <math>A(\mathrm{diag}(x))</math>, and let <math>G</math> be <math>B(\ulcorner B\urcorner)</math>.<br />
<br />
Then <math>G</math> is <math>A(\mathrm{diag}(\ulcorner B\urcorner))</math>, by substituting <math>x = \ulcorner B\urcorner</math> in the definition of <math>B</math>.<br />
<br />
We also have <math>\mathrm{diag}(\ulcorner B\urcorner) = \ulcorner B(\ulcorner B\urcorner)\urcorner</math> by definition of <math>\mathrm{diag}</math>. By definition of <math>G</math>, this is <math>\ulcorner G\urcorner</math>, so we have <math>\mathrm{diag}(\ulcorner B\urcorner) = \ulcorner G\urcorner</math>.<br />
<br />
To complete the proof, apply <math>A</math> to both sides of the final equality to obtain <math>A(\mathrm{diag}(\ulcorner B\urcorner))</math> iff <math>A(\ulcorner G\urcorner)</math>; this simplifies to <math>G</math> iff <math>A(\ulcorner G\urcorner)</math>.<br />
<br />
<ref name="gaifman">Haim Gaifman. [https://web.archive.org/web/20180205090617/http://www.columbia.edu/~hg17/naming-diag.pdf "Naming and Diagonalization, from Cantor to Gödel to Kleene"].</ref><br />
<br />
<ref>https://mathoverflow.net/questions/30874/arithmetic-fixed-point-theorem</ref><br />
<br />
===Use of extra quantified variable to make a substitution===<br />
<br />
(see p. 448 of GEB)<br />
<br />
outside the formal system, if we have some function f, a constant a, and some one-place relation R, we can substitute f(a) into R like so: R(f(a)). but many systems of formal logic don't have a way to directly talk about outputs of functions like f(a). instead, they have a relation like F(a,y) to mean f(a)=y. [why on earth would they do this? i think the basic reason is that if functions are just relations, there are fewer cases to prove in those annoying structural induction proofs.] so how do we express an idea like R(f(a))? we can use an extra variable to hold the output, e.g. <math>F(a,y)\wedge R(y)</math>. but this leaves y free, so actually we want <math>\exists y (F(a,y)\wedge R(y))</math>. alternatively, we can say <math>F(a,y) \to R(y)</math>; in this case we want <math>\forall y (F(a,y) \to R(y))</math>. it's easy to prove that, as long as F is the graph of a total function f, both of these clumsy ways of writing are logically equivalent to R(f(a)).<br />
<br />
something similar happens when we want to diagonalize formulas. given a formula <math>A(x)</math> that has just x free, it's easy enough to diagonalize it: <math>A(\ulcorner A(x)\urcorner)</math>. but what about a sentence like <math>B</math>? how do we "substitute" something into a sentence that has no free variable? [why on earth would you want to try that? i think it simplifies the proof a little if we assume diagonalization is defined for any sentence. but i forgot where exactly the simplification occurs.] the idea is again to use a separate variable: <math>\exists x (x = \ulcorner A(x)\urcorner \wedge A(x))</math>. again, i think we could also do <math>\forall x (x = \ulcorner A(x)\urcorner \to A(x))</math>. we're basically considering a function that finds the Gödel number of a sentence. except unlike a relation, a single-free-variable formula fixes some specific variable that it leaves free (a relation doesn't know whether it's x or y that's free -- it just expresses some idea), so we need to fix a single variable to use throughout.<br />
<br />
==Trying to discover the lemma==<br />
<br />
===approach 1===<br />
<br />
https://mathoverflow.net/a/31374<br />
<br />
===approach 2===<br />
<br />
see Owings paper.<br />
<br />
In the framework of this paper, we have a matrix where each entry is of a certain type. Then we apply the function <math>\alpha</math> to the diagonal. If the diagonal turns into one of the rows, <math>\alpha</math> has a fixed point.<br />
<br />
So now the trick is to figure out what our <math>\alpha</math> should be, and also what our matrix should look like.<br />
<br />
Picking the <math>\alpha</math> doesn't seem hard: we want a fixed point for the operation <math>\varphi_{f(-)}</math>, so we can pick <math>\alpha(\varphi_e) = \varphi_{f(e)}</math>. One problem is that this might not be well-defined, but we can just go with this for now (it ends up not mattering, for reasons I don't really understand, but the Owings paper has another workaround, which is to use relations; I find that more confusing).<br />
<br />
The matrix that works turns out to have entries <math>\varphi_{\varphi_j(k)}</math>. I'm not sure how one would have figured this out. One might also think <math>\varphi_j(k)</math> would work, but notice that then we fail the type checking with <math>\alpha</math> (which takes a function, not a natural number).<br />
<br />
So now we take the diagonal, which has entries <math>\varphi_{\varphi_k(k)}</math>, for <math>k = 0, 1, 2, \ldots</math>, and apply <math>\alpha</math>. We get <math>\varphi_{f(\varphi_k(k))}</math>. But <math>d</math> defined by <math>d(x) = \varphi_x(x)</math> is a recursive function, so the diagonal has turned into <math>\varphi_{f(d(k))} = \varphi_{f\circ d(k)}</math>. Since a composition of recursive functions is itself recursive, <math>f\circ d</math> is recursive. So we have some index <math>e</math> for it, i.e. <math>f\circ d \simeq \varphi_e</math>. So <math>\alpha</math> applied to the diagonal results in <math>\varphi_{\varphi_e(k)}</math>, which is one of the rows (the <math>e</math>th row). This means <math>\alpha</math> has a fixed point, in the <math>e</math>th entry, i.e. at <math>\varphi_{\varphi_e(e)}</math>. So we expect <math>\alpha(\varphi_{\varphi_e(e)})=\varphi_{\varphi_e(e)}</math>. Since <math>\alpha(\varphi_{\varphi_e(e)}) = \varphi_{f(\varphi_e(e))}</math>, the "real" fixed point for the operator will be at <math>\varphi_e(e)</math>. Indeed, <math>\varphi_{f(\varphi_e(e))} \simeq \varphi_{f\circ d(e)} \simeq \varphi_{\varphi_e(e)}</math>.<br />
<br />
It remains to verify that the argument goes through even though <math>\alpha</math> need not be well-defined.<br />
<br />
===approach 3===<br />
<br />
Take Cantor's theorem, generalize it to mention fixed points, then take the contrapositive. See the Yanofsky paper for details.<br />
<br />
This version still has some mystery for me, e.g. replacing "the set has at least two elements" with "there is a function from the set to itself without a fixed point". The logical equivalence is easy to see, but getting the idea for rephrasing this condition to mention fixed points is not obvious at all.<br />
<br />
The use of the s-m-n theorem also isn't obvious to me. Why use it at all? Why use it on <math>g</math>? Why do we care about the index of <math>s</math>?<br />
<br />
It's also not clear to me why we use <math>T = \mathbf N</math> and <math>Y = \mathcal F</math>. In some sense it does make sense, like the natural numbers are all the algorithms, and the set of computable functions are the "properties" (a.k.a. "the objects being named").<br />
<br />
===approach 4===<br />
<br />
http://www.andrew.cmu.edu/user/kk3n/complearn/chapter8.pdf -- see section 8.1<br />
<br />
also see Moore and Mertens's section on lambda calculus<br />
<br />
in the untyped lambda calculus, there is only one type of entity, namely functions, which can operate on other functions. This makes it easy for functions to operate on themselves, which creates self-reference.<br />
<br />
but when working with partial recursive functions, we don't have this. instead, we have numbers and then partial functions that operate on numbers. to get self-reference, we need some kind of encoding. this is why we numbered the partial recursive functions.<br />
<br />
but now, one of the familiar facts about the lambda calculus is the existence of the fixed point combinator (aka y combinator). (note: this passes the buck to wondering how one would have come up with the lambda calculus, or how one would come up with the fixed point combinator in that setting; but this seems easier to answer.) since this theorem works in one setting in which we have self-reference, we might wonder if we can "port over" the theorem to the case where we have self-reference in a different setting.<br />
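The fixed point combinator can be written down in any language with first-class functions. Here is a sketch in Python (my own illustration, not from the sources above): the call-by-name Y combinator would loop forever under Python's eager evaluation, so this uses its call-by-value variant, the Z combinator.<br />

```python
# Z combinator: the call-by-value variant of the Y fixed point combinator.
# Z(F) returns a fixed point of F, i.e. a function g that behaves like F(g),
# built purely from self-application -- no named recursion is used.
Z = lambda F: (lambda x: F(lambda v: x(x)(v)))(lambda x: F(lambda v: x(x)(v)))

# F describes one "step" of factorial in terms of a recursive call `rec`.
fact_step = lambda rec: lambda n: 1 if n == 0 else n * rec(n - 1)

fact = Z(fact_step)
print(fact(5))  # -> 120
```

Nothing in the definition of fact_step refers to itself by name; the self-reference comes entirely from the self-application x(x), which is the same trick the recursion theorem ports into the world of indices.<br />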
<br />
==Comparison table==<br />
<br />
Some things to notice:<br />
<br />
* The two theorems are essentially identical, with identical proofs, as seen by the matching rows. The analogy breaks down slightly at the very end, where we apply <math>\varphi_{f(\cdot)}</math> vs <math>A(\cdot)</math> (the latter corresponds to <math>f</math> until the very end).<br />
* In the partial recursive functions world, it's easy to go from the index (e.g. <math>e</math>) to the partial function (<math>\varphi_e</math>). In the formulas world it's the reverse: it's easy to go from a formula (e.g. <math>A</math>) to its Gödel number <math>\ulcorner A\urcorner</math>. I wonder if there is something essential here, or if it is simply some sort of historical accident in notation.<br />
* For the diagonalization lemma, here we have done the semantic version (? I think...), but usually the manipulations are done inside a formal system with reference to some theory <math>T</math> to derive a syntactic result (i.e. we have some theory that is strong enough to do all these manipulations within the object-level language). For partial recursive functions, as far as I know, there is no analogous distinction between semantics vs syntax.<br />
* The diagonalization part is not completely correct/as strong as possible for both proofs. For the partial recursive functions side, we want to make sure that <math>\varphi_{\varphi_x(x)}</math> is actually defined in each case. For the logic side, I think often the diagonalization is defined as <math>\exists x(x = \ulcorner A\urcorner \wedge A)</math> so that it is defined for all formulas, not just ones with one free variable. But the essential ideas are all present below, and since this makes the comparison easier, the presentation is simplified.<br />
<br />
{| class="wikitable"<br />
|-<br />
! Step !! Rogers's fixed point theorem !! Diagonalization lemma<br />
|-<br />
| Theorem statement (note: quantifiers are part of the metalanguage) || <math>(\forall f \exists e)\ \varphi_e \simeq \varphi_{f(e)}</math> || <math>(\forall A \exists G)\ G \leftrightarrow A(\ulcorner G\urcorner)</math><br />
|-<br />
| Given mapping || <math>f</math> || <math>A</math><br />
|-<br />
| Definition of diagonal function || <math>d(x) = \varphi_x(x)</math> || <math>\mathrm{diag}(\ulcorner C\urcorner) = \ulcorner C(\ulcorner C\urcorner)\urcorner</math><br />
|-<br />
| Composition of given mapping with diagonal function (<math>\mathrm{given} \circ \mathrm{diagonal}</math>) || <math>f(d(x))</math> || <math>A(\mathrm{diag}(x))</math><br />
|-<br />
| Naming the <math>\mathrm{given} \circ \mathrm{diagonal}</math> composition || <math>f\circ d</math> (name not given because compositions are easy to express outside a formal language) || <math>B</math><br />
|-<br />
| Index of <math>\mathrm{given} \circ \mathrm{diagonal}</math> composition || <math>i</math> || <math>\ulcorner B\urcorner</math><br />
|-<br />
| Expanding using definition of diagonal || <math>d(i) = \varphi_i(i)</math> || <math>\mathrm{diag}(\ulcorner B\urcorner) = \ulcorner B(\ulcorner B\urcorner) \urcorner</math><br />
|-<br />
| The <math>\mathrm{given} \circ \mathrm{diagonal}</math> composition applied to own index (i.e. diagonalization of the composition) || <math>f\circ d(i)</math> || <math>B(\ulcorner B\urcorner)</math><br />
|-<br />
| G defined || <math>\varphi_i(i)</math> (no equivalent definition) || <math>G</math> is <math>B(\ulcorner B\urcorner)</math><br />
|-<br />
| Unfolding the composition at its own index || <math>f(d(i)) = \varphi_i(i)</math> || <math>A(\mathrm{diag}(\ulcorner B\urcorner)) \leftrightarrow B(\ulcorner B\urcorner)</math><br />
|-<br />
| Renaming index || <math>e = d(i)</math> || <math>\ulcorner G\urcorner = \mathrm{diag}(\ulcorner B\urcorner)</math><br />
|-<br />
| Leibniz law to previous row || Apply <math>\varphi_{f(\cdot)}</math> to obtain <math>\varphi_{f(e)} = \varphi_{f(d(i))}</math> || Apply <math>A(\cdot)</math> to obtain <math>A(\ulcorner G\urcorner) \leftrightarrow A(\mathrm{diag}(\ulcorner B\urcorner))</math><br />
|-<br />
| Use definition of G || <math>\varphi_{f(e)} = \varphi_{\varphi_i(i)} = \varphi_e</math> || <math>A(\ulcorner G\urcorner) \leftrightarrow B(\ulcorner B\urcorner) \leftrightarrow G</math><br />
|-<br />
| (Definition of G)? || <math>\varphi_i(i)</math> is <math>f(d(i))</math> || <math>G</math> is <math>A(\mathrm{diag}(\ulcorner B\urcorner))</math><br />
|}<br />
<br />
==Quotes==<br />
<br />
"All of these theorems tend to strain one's intuition; in fact, many people find them almost paradoxical. The most popular proofs of these theorems only serve to aggravate the situation because they are completely unmotivated, seem to depend upon a low combinatorial trick, and are so barbarically short as to be nearly incapable of rational analysis."<ref>James C. Owings, Jr. "Diagonalization and the Recursion Theorem". 1973.</ref><br />
<br />
"This is just a lovely result, insightful in its concept and far reaching in its consequences. We’d love to say that the proof was also lovely and enlightening, but to be honest, we don’t have an enlightening sort of proof to show you. Sometimes the best way to describe a proof is that the argument sort of picks you up and shakes you until you agree that it does, in fact, establish what it is supposed to establish. That’s what you get here."<ref>Christopher C. Leary; Lars Kristiansen. ''A Friendly Introduction to Mathematical Logic'' (2nd ed). p. 172.</ref><br />
<br />
"The brevity of the proof does not make for transparency; it has the aura of a magician’s trick. How did Gödel ever come up with the idea? As a matter of fact, Gödel did not come up with that idea."<ref name="gaifman"/><br />
<br />
==Questions/things to explain==<br />
<br />
* In Peter Smith's book, he defines Gdl(m,n) as Prf(m, diag(n)). What is the analogue of Gdl for the Rogers fixed point theorem?<br />
* I like the <math>D(\ulcorner \varphi \urcorner) \iff \varphi(\ulcorner \varphi \urcorner)</math> that begins [https://mathoverflow.net/a/31374 this answer], but what is the analogue for partial functions? It seems like it is <math>d(x) = \varphi_x(x)</math>, which ''does'' exist (because we are allowed to have undefined values). So the motivation that works for the logic version doesn't work for the partial functions version, which bugs me.<br />
<br />
==References==<br />
<br />
<references/></div>IssaRicehttps://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Computability_and_logic/Diagonalization_lemma&diff=3287User:IssaRice/Computability and logic/Diagonalization lemma2021-06-07T19:29:56Z<p>IssaRice: </p>
<hr />
<div>The '''diagonalization lemma''', also called the '''Gödel–Carnap fixed point theorem''', is a fixed point theorem in logic.<br />
<br />
A verbal version of this result, modified from GEB (p. 449), runs as follows: Take the predicate cannot-be-proved-when-diagonalized(x). This takes a predicate as its input x, and claims that the sentence obtained by diagonalizing that predicate (i.e. inserting it into itself) cannot be proved. For instance, cannot-be-proved-when-diagonalized("has-length-less-than-one-thousand(x)") claims that has-length-less-than-one-thousand("has-length-less-than-one-thousand(x)") cannot be proved. In this case, let's say it's false, since we can see that the string "has-length-less-than-one-thousand(x)" has length less than 1000, and let's assume our proof system is strong enough to prove this. Now, to diagonalize cannot-be-proved-when-diagonalized(x) is to form the sentence cannot-be-proved-when-diagonalized("cannot-be-proved-when-diagonalized(x)"). So can this sentence be proved or not? If it can be proved, then, assuming our proof system proves only true sentences, the sentence is true; but it claims precisely that it cannot be proved, a contradiction. So it must not be provable.<br />
<br />
Basically, unlike in English, a sentence of a formal language can't refer to itself using a phrase like "this sentence itself", so there is no straightforward way to make claims like "This sentence cannot be proved". To get around this restriction, we must use diagonalization: substituting a sentence's own encoding (i.e. string representation, a.k.a. "Gödel number") into itself. This allows a predicate to talk about its own string representation. If that predicate happens to claim unprovability, we get Gödel's first incompleteness theorem.<br />
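The same substitution trick is what makes a quine (a program that prints its own source) possible. A minimal Python sketch (my own example): the string s plays the role of a predicate with a free slot, and s % s substitutes a quoted copy of s into that slot.<br />

```python
# Diagonalization at the level of program text: substituting a quoted
# copy of a string into itself yields a program that prints its own source.
s = 's = %r\nprint(s %% s)'
print(s % s)
```

The two-line program printed here, when run on its own, prints exactly those same two lines.<br />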
<br />
The diagonalization lemma generalizes to talk about any predicate P(x), not just not-provable(x). We want to find a sentence G such that G is true if and only if P("G") is (this is a little sloppy -- "G" here is not literally the one-letter string, but rather the string representation of whatever sentence G turns out to be). Let G be has-property-P-when-diagonalized("has-property-P-when-diagonalized(x)"). If G is true, then has-property-P-when-diagonalized(x) must have property P when diagonalized, i.e. P("has-property-P-when-diagonalized('has-property-P-when-diagonalized(x)')") is true, which means P("G") is true. If G is false, then has-property-P-when-diagonalized(x) must not have property P when diagonalized, i.e. P("has-property-P-when-diagonalized('has-property-P-when-diagonalized(x)')") is false, which means P("G") is false. Thus G is indeed true if and only if P("G") is true, so we have successfully "diagonalized" the predicate P(x).<br />
<br />
==Rogers's fixed point theorem==<br />
<br />
Let <math>f</math> be a total computable function. Then there exists an index <math>e</math> such that <math>\varphi_e \simeq \varphi_{f(e)}</math>.<br />
<br />
(simplified)<br />
<br />
Define <math>d(x) = \varphi_x(x)</math> (this is actually slightly wrong, but it brings out the analogy better).<br />
<br />
Consider the function <math>f\circ d</math>. This is partial recursive, so <math>f\circ d \simeq \varphi_i</math> for some index <math>i</math>.<br />
<br />
Now <math>\varphi_{f(d(i))} \simeq \varphi_{\varphi_i(i)}</math> since <math>f\circ d \simeq \varphi_i</math>. This is equivalent to <math>\varphi_{d(i)}</math> by definition of <math>d</math>. Thus, we may take <math>e = d(i)</math> to complete the proof.<br />
<br />
It looks like we have <math>f(d(i)) = \varphi_i(i) = d(i)</math>, i.e. <math>f(e) = e</math>. Is this right? Only when <math>\varphi_i(i)</math> is actually defined: for an <math>f</math> with no numeric fixed point, such as <math>f(x) = x + 1</math>, the value <math>\varphi_i(i)</math> must be undefined. This is one reason the rigorous proof replaces <math>d</math> with a total function <math>s</math>; there one only gets <math>\varphi_e \simeq \varphi_{f(e)}</math>, not <math>f(e) = e</math>.<br />
<br />
<br />
Repeatedly using the facts that (1) <math>i</math> is an index for <math>f\circ d</math>, and (2) <math>d(i) = \varphi_i(i)</math>, allows us to create an iteration effect:<br />
<br />
<math>\varphi_i(i) \simeq f(d(i)) \simeq f(\varphi_i(i)) \simeq f(f(d(i))) \simeq f(f(\varphi_i(i))) \simeq \cdots \simeq f\circ \cdots \circ f \circ d(i)</math><br />
<br />
(I'm wondering if there's some deeper meaning to this. So far it's just an interesting connection between diagonalization-based fixed points and iteration-based fixed points. I think there might be a connection between this and the [https://medium.com/@cdsmithus/fixpoints-in-haskell-294096a9fc10 fix function in Haskell].)<br />
<br />
<br />
In the more rigorous/careful version of the proof, we use the [[s-m-n theorem]] to get an index of a function, <math>s</math>, which is basically like <math>d</math>. The difference is that <math>\varphi_x(x)</math> might not be defined for all <math>x</math> (actually it isn't, since some partial functions are always undefined) so <math>d</math> is not total. On the other hand, <math>s</math> is obtained via the s-m-n theorem so is total. When <math>\varphi_x(x)</math> is undefined, <math>s(x)</math> gives an index of the always-undefined partial function. So <math>s</math> says "this is undefined" in a defined way. Thanks to this property, the expression <math>\varphi_{s(x)}</math> always makes sense, whereas <math>\varphi_{\varphi_x(x)}</math> sometimes doesn't make sense.<br />
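To make the difference between <math>d</math> and <math>s</math> concrete, here is a toy Python model (my own, with a hand-picked three-element "numbering", so it is only an illustration). Undefinedness is modeled as None. Note that this toy <math>s</math> cheats: it tests for undefinedness directly, which is not computable in general. The real <math>s</math> from the s-m-n theorem avoids this by packaging "compute <math>\varphi_x(x)</math>, then run the result" into a program without running anything, so <math>\varphi_{s(x)}</math> is automatically the always-undefined function whenever <math>\varphi_x(x)</math> diverges.<br />

```python
# Toy "numbering": phi[e] is the e-th partial function; None means undefined.
phi = [
    lambda n: n + 2,       # phi_0
    lambda n: None,        # phi_1: the always-undefined partial function
    lambda n: 7,           # phi_2
]
ALWAYS_UNDEFINED = 1       # an index of the always-undefined function

def d(x):
    # d(x) = phi_x(x): partial, may be "undefined" (None)
    return phi[x](x)

def s(x):
    # total: when phi_x(x) is undefined, fall back to an index of the
    # always-undefined function, so phi_{s(x)} always makes sense.
    # (Cheats by inspecting the result; see the caveat above.)
    v = phi[x](x)
    return ALWAYS_UNDEFINED if v is None else v

print(d(1))            # None: phi_1(1) is undefined
print(s(1))            # 1: a valid index, phi_{s(1)} is the undefined function
print(d(0), s(0))      # 2 2: where d is defined, s agrees with it
```
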
<br />
<br />
See also https://machinelearning.subwiki.org/wiki/User:IssaRice/Computability_and_logic/Rogers_fixed_point_theorem_using_Sipser%27s_notation<br />
<br />
==Diagonalization lemma==<br />
<br />
(semantic version)<br />
<br />
Let <math>A</math> be a formula with one free variable. Then there exists a sentence <math>G</math> such that <math>G</math> iff <math>A(\ulcorner G\urcorner)</math>.<br />
<br />
Define <math>\mathrm{diag}(x)</math> to be <math>\ulcorner C(\ulcorner C\urcorner)\urcorner</math> where <math>x = \ulcorner C\urcorner</math>. In other words, given a number <math>x</math>, the function <math>\mathrm{diag}</math> finds the formula with that Gödel number, then diagonalizes it (i.e. substitutes the Gödel number of the formula into the formula itself), then returns the Gödel number of the resulting sentence.<br />
<br />
Let <math>B</math> be <math>A(\mathrm{diag}(x))</math>, and let <math>G</math> be <math>B(\ulcorner B\urcorner)</math>.<br />
<br />
Then <math>G</math> is <math>A(\mathrm{diag}(\ulcorner B\urcorner))</math>, by substituting <math>x = \ulcorner B\urcorner</math> in the definition of <math>B</math>.<br />
<br />
We also have <math>\mathrm{diag}(\ulcorner B\urcorner) = \ulcorner B(\ulcorner B\urcorner)\urcorner</math> by definition of <math>\mathrm{diag}</math>. By definition of <math>G</math>, this is <math>\ulcorner G\urcorner</math>, so we have <math>\mathrm{diag}(\ulcorner B\urcorner) = \ulcorner G\urcorner</math>.<br />
<br />
To complete the proof, apply <math>A</math> to both sides of the final equality to obtain <math>A(\mathrm{diag}(\ulcorner B\urcorner))</math> iff <math>A(\ulcorner G\urcorner)</math>; this simplifies to <math>G</math> iff <math>A(\ulcorner G\urcorner)</math>.<br />
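The bookkeeping in this proof can be mimicked with plain strings. In this Python sketch (my own; formulas are strings with a single free slot x, quoting via repr stands in for Gödel numbering, and P is an uninterpreted predicate symbol), diag substitutes a quoted copy of a formula into its own slot:<br />

```python
def diag(code):
    # diag(⌜C⌝) = ⌜C(⌜C⌝)⌝: substitute a quoted copy of the formula
    # into its (single) free slot x.
    return code.replace("x", repr(code), 1)

# B(x) is A(diag(x)), with the given formula A written as P(...).
B = "P(diag(x))"

# G is B(⌜B⌝), the diagonalization of B.
G = diag(B)
print(G)  # P(diag('P(diag(x))'))

# The term inside G is diag applied to ⌜B⌝; evaluating it yields ⌜G⌝
# itself, so G "says" P(⌜G⌝).
inner_term = G[len("P("):-1]
print(eval(inner_term) == G)  # True
```

This mirrors the proof step by step: diag(⌜B⌝) = ⌜B(⌜B⌝)⌝ = ⌜G⌝, and G is A(diag(⌜B⌝)).<br />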
<br />
<ref name="gaifman">Haim Gaifman. [https://web.archive.org/web/20180205090617/http://www.columbia.edu/~hg17/naming-diag.pdf "Naming and Diagonalization, from Cantor to Gödel to Kleene"].</ref><br />
<br />
<ref>https://mathoverflow.net/questions/30874/arithmetic-fixed-point-theorem</ref><br />
<br />
===use of extra quantified variable to make a substitution===<br />
<br />
(see p. 448 of GEB)<br />
<br />
outside the formal system, if we have some function f, a constant a, and some one-place relation R, we can substitute f(a) into R like: R(f(a)). but many systems of formal logic don't have a way to directly talk about outputs of functions like f(a). instead, they have a relation like F(a,y) to mean f(a)=y. [why on earth would they do this? i think the basic reason is that if functions are just relations, we have fewer cases to have to prove in those annoying structural induction proofs.] so how do we express an idea like R(f(a))? we can make use of an extra variable to hold the output, e.g. <math>F(a,y)\wedge R(y)</math>. but this leaves y free, so actually we want <math>\exists y (F(a,y)\wedge R(y))</math>. alternatively, we can also say <math>F(a,y) \to R(y)</math>. in this case we want <math>\forall y (F(a,y) \to R(y))</math>. it's easy to prove that both of these clumsy ways of writing are logically equivalent to R(f(a)).<br />
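For a concrete instance (my own example, not from GEB): take <math>f(a) = a+1</math>, so the relation is <math>F(a,y) \equiv (y = a+1)</math>, and let <math>R</math> be "is even". Then the three ways of writing "f(a) is even" are:<br />

```latex
% R(f(a)), and its two function-symbol-free paraphrases:
\mathrm{Even}(a+1)
\;\equiv\; \exists y\,\bigl(y = a+1 \wedge \mathrm{Even}(y)\bigr)
\;\equiv\; \forall y\,\bigl(y = a+1 \to \mathrm{Even}(y)\bigr)
```

The equivalence of the <math>\exists</math> and <math>\forall</math> forms uses the fact that exactly one <math>y</math> satisfies <math>y = a+1</math>.<br />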
<br />
something similar happens when we want to diagonalize formulas. given a formula <math>A(x)</math> that has just x free, it's easy enough to diagonalize it: <math>A(\ulcorner A(x)\urcorner)</math>. but what about a sentence like <math>B</math>? how do we "substitute" in something into something that has no free variable? [why on earth would you want to try that? i think it simplifies the proof a little if we assume diagonalization is defined for any sentence. but i forgot where exactly the simplification occurs.] the idea is again to make use of a separate variable: <math>\exists x (x = \ulcorner A(x)\urcorner \wedge A(x))</math>. again, i think we could also do <math>\forall x (x = \ulcorner A(x)\urcorner \to A(x))</math>. we're basically considering a function that finds the godel number of a sentence. except unlike a relation, a single-free-variable formula fixes some specific variable that it leaves free (a relation doesn't know whether it's x or y that's free -- it just expresses some idea), so we need to fix some single variable to use throughout.<br />
<br />
==Trying to discover the lemma==<br />
<br />
===approach 1===<br />
<br />
https://mathoverflow.net/a/31374<br />
<br />
===approach 2===<br />
<br />
see Owings paper.<br />
<br />
In the framework of this paper, we have a matrix where each entry is of a certain type. Then we apply the function <math>\alpha</math> to the diagonal. If the diagonal turns into one of the rows, <math>\alpha</math> has a fixed point.<br />
<br />
So now the trick is to figure out what our <math>\alpha</math> should be, and also what our matrix should look like.<br />
<br />
Picking the <math>\alpha</math> doesn't seem hard: we want a fixed point for the operation <math>\varphi_{f(-)}</math>, so we can pick <math>\alpha(\varphi_e) = \varphi_{f(e)}</math>. One problem is that this might not be well-defined, but we can just go with this for now (it ends up not mattering, for reasons I don't really understand, but the Owings paper has another workaround, which is to use relations; I find that more confusing).<br />
<br />
The matrix that works turns out to have entries <math>\varphi_{\varphi_j(k)}</math>. I'm not sure how one would have figured this out. One might also think <math>\varphi_j(k)</math> would work, but notice that it then fails to type-check with <math>\alpha</math> (which takes a function, not a natural number).<br />
<br />
So now we take the diagonal, which has entries <math>\varphi_{\varphi_k(k)}</math>, for <math>k = 0, 1, 2, \ldots</math>, and apply <math>\alpha</math>. We get <math>\varphi_{f(\varphi_k(k))}</math>. But <math>d</math> defined by <math>d(x) = \varphi_x(x)</math> is a recursive function, so the diagonal has turned into <math>\varphi_{f(d(k))} = \varphi_{f\circ d(k)}</math>. Since a composition of recursive functions is itself recursive, <math>f\circ d</math> is recursive. So we have some index <math>e</math> for it, i.e. <math>f\circ d \simeq \varphi_e</math>. So <math>\alpha</math> applied to the diagonal results in <math>\varphi_{\varphi_e(k)}</math>, which is one of the rows (the <math>e</math>th row). This means <math>\alpha</math> has a fixed point, in the <math>e</math>th entry, i.e. at <math>\varphi_{\varphi_e(e)}</math>. So we expect <math>\alpha(\varphi_{\varphi_e(e)})=\varphi_{\varphi_e(e)}</math>. Since <math>\alpha(\varphi_{\varphi_e(e)}) = \varphi_{f(\varphi_e(e))}</math>, the "real" fixed point for the operator will be at <math>\varphi_e(e)</math>. Indeed, <math>\varphi_{f(\varphi_e(e))} \simeq \varphi_{f\circ d(e)} \simeq \varphi_{\varphi_e(e)}</math>.<br />
<br />
It remains to verify that the argument goes through even though <math>\alpha</math> might not be well-defined.<br />
<br />
==References==<br />
<br />
<references/></div>IssaRicehttps://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Computability_and_logic/Diagonalization_lemma&diff=3286User:IssaRice/Computability and logic/Diagonalization lemma2021-06-07T19:25:00Z<p>IssaRice: </p>
<hr />
<div>The '''diagonalization lemma''', also called the '''Gödel–Carnap fixed point theorem''', is a fixed point theorem in logic.<br />
<br />
A verbal version of this result, modified from GEB (p. 449), runs as follows: Take the predicate cannot-be-proved-when-diagonalized(x). This takes a predicate in the x input, and says whether the sentence can be proved when diagonalized (i.e. inserted into itself). For instance, cannot-be-proved-when-diagonalized("has-length-less-than-one-thousand(x)") claims that has-length-less-than-one-thousand("has-length-less-than-one-thousand(x)") cannot be proved. In this case, let's say it's false, since we can see that the string "has-length-less-than-one-thousand(x)" has length less than 1000, and let's assume our proof system is strong enough to prove this. Now, to diagonalize cannot-be-proved-when-diagonalized(x) is to form the sentence cannot-be-proved-when-diagonalized("cannot-be-proved-when-diagonalized(x)"). So can this sentence be proved or not? If it can be proved, then the sentence itself claims that it cannot be proved, a contradiction. So it must not be provable.<br />
<br />
Basically, unlike the English language, a sentence can't refer to itself using phrases like "this sentence itself", so there is no straightforward way to make claims like "This sentence cannot be proved". To get around this restriction, we must use diagonalization – substituting a sentence's own encoding (i.e. string representation, aka "Gödel number") into itself. This allows a predicate to talk of its own string representation. So now if that predicate happens to claim unprovability – then we get Gödel's first incompleteness theorem.<br />
<br />
The diagonalization lemma generalizes to talk about any predicate P(x), not just not-provable(x). We want to find a sentence G such that G is true if and only if P("G") is (this is a little sloppy -- it's not actually the string "G", but rather if we made whatever G happens to be into a string...). Let G be has-property-P-when-diagonalized("has-property-P-when-diagonalized(x)"). If G is true, then has-property-P-when-diagonalized(x) must have property P when diagonalized, i.e. P("has-property-P-when-diagonalized('has-property-P-when-diagonalized(x)')") which means P("G") is true. If G is false, then has-property-P-when-diagonalized(x) must not have property P when diagonalized, i.e. P("has-property-P-when-diagonalized('has-property-P-when-diagonalized(x)')") is false.<br />
<br />
==Rogers's fixed point theorem==<br />
<br />
Let <math>f</math> be a total computable function. Then there exists an index <math>e</math> such that <math>\varphi_e \simeq \varphi_{f(e)}</math>.<br />
<br />
(simplified)<br />
<br />
Define <math>d(x) = \varphi_x(x)</math> (this is actually slightly wrong, but it brings out the analogy better).<br />
<br />
Consider the function <math>f\circ d</math>. This is partial recursive, so <math>f\circ d \simeq \varphi_i</math> for some index <math>i</math>.<br />
<br />
Now <math>\varphi_{f(d(i))} \simeq \varphi_{\varphi_i(i)}</math> since <math>f\circ d \simeq \varphi_i</math>. This is equivalent to <math>\varphi_{d(i)}</math> by definition of <math>d</math>. Thus, we may take <math>e = d(i)</math> to complete the proof.<br />
<br />
It looks like we have <math>f(d(i)) = \varphi_i(i) = d(i)</math>, i.e. <math>f(e) = e</math>. Is this right? In the simplified version, yes: <math>f(d(i)) = \varphi_i(i)</math> because <math>i</math> indexes <math>f\circ d</math>, and <math>\varphi_i(i) = d(i)</math> by definition of <math>d</math>. But this is too strong -- a total computable <math>f</math> like <math>f(x) = x + 1</math> has no numeric fixed point -- which is another sign that the simplified <math>d</math> is "slightly wrong": in the rigorous version below, <math>d</math> is replaced by the total function <math>s</math>, and we only get <math>\varphi_e \simeq \varphi_{f(e)}</math>, not <math>f(e) = e</math>.<br />
<br />
<br />
Repeatedly using the facts that (1) <math>i</math> is an index for <math>f\circ d</math>, and (2) <math>d(i) = \varphi_i(i)</math>, allows us to create an iteration effect:<br />
<br />
<math>\varphi_i(i) \simeq f(d(i)) \simeq f(\varphi_i(i)) \simeq f(f(d(i))) \simeq f(f(\varphi_i(i))) \simeq \cdots \simeq f\circ \cdots \circ f \circ d(i)</math><br />
<br />
(I'm wondering if there's some deeper meaning to this. So far it's just an interesting connection between diagonalization-based fixed points and iteration-based fixed points. I think there might be a connection between this and the [https://medium.com/@cdsmithus/fixpoints-in-haskell-294096a9fc10 fix function in Haskell].)<br />
<br />
<br />
In the more rigorous/careful version of the proof, we use the [[s-m-n theorem]] to get an index of a function, <math>s</math>, which is basically like <math>d</math>. The difference is that <math>\varphi_x(x)</math> might not be defined for all <math>x</math> (actually it isn't, since some partial functions are always undefined) so <math>d</math> is not total. On the other hand, <math>s</math> is obtained via the s-m-n theorem so is total. When <math>\varphi_x(x)</math> is undefined, <math>s(x)</math> gives an index of the always-undefined partial function. So <math>s</math> says "this is undefined" in a defined way. Thanks to this property, the expression <math>\varphi_{s(x)}</math> always makes sense, whereas <math>\varphi_{\varphi_x(x)}</math> sometimes doesn't make sense.<br />
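One way to see what <math>s</math> buys us is a toy Python model (my own sketch, not from any source: "indices" are source strings of lambda expressions, and <math>\varphi_e</math> means evaluating the string <math>e</math> in a fixed environment). The point is that <code>s</code> produces an index for <math>\varphi_{\varphi_x(x)}</math> purely textually, without ever running <math>\varphi_x(x)</math>:<br />

```python
def phi(e):
    # phi_e: the function coded by the source string e (a lambda expression)
    return eval(e, ENV)

def s(x):
    # s-m-n-style total function: an index for phi_{phi_x(x)}, built as text
    # (nothing is run here, so s(x) is defined even when phi_x(x) is not)
    return "lambda n: phi(phi(%r)(%r))(n)" % (x, x)

def rogers_fixed_point(f):
    # given a total computable f on indices, return e with phi_e = phi_{f(e)}
    ENV["f"] = f
    i = "lambda x: f(s(x))"   # an index for f composed with s
    return s(i)               # e = s(i); phi_e = phi_{phi_i(i)} = phi_{f(e)}

ENV = {"phi": phi, "s": s}

# sanity check: f sends every index to an index of the successor function
e = rogers_fixed_point(lambda _: "lambda n: n + 1")
assert phi(e)(41) == 42   # phi_e agrees with phi_{f(e)}, the successor
```

Here <code>phi(e)</code> and <code>phi(f(e))</code> compute the same function even though the strings <code>e</code> and <code>f(e)</code> differ, matching the conclusion <math>\varphi_e \simeq \varphi_{f(e)}</math> rather than <math>f(e) = e</math>.<br />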
<br />
<br />
See also https://machinelearning.subwiki.org/wiki/User:IssaRice/Computability_and_logic/Rogers_fixed_point_theorem_using_Sipser%27s_notation<br />
<br />
==Diagonalization lemma==<br />
<br />
(semantic version)<br />
<br />
Let <math>A</math> be a formula with one free variable. Then there exists a sentence <math>G</math> such that <math>G</math> iff <math>A(\ulcorner G\urcorner)</math>.<br />
<br />
Define <math>\mathrm{diag}(x)</math> to be <math>\ulcorner C(\ulcorner C\urcorner)\urcorner</math> where <math>x = \ulcorner C\urcorner</math>. In other words, given a number <math>x</math>, the function <math>\mathrm{diag}</math> finds the formula with that Godel number, then diagonalizes it (i.e. substitutes the Godel number of the formula into the formula itself), then returns the Godel number of the resulting sentence.<br />
<br />
Let <math>B</math> be <math>A(\mathrm{diag}(x))</math>, and let <math>G</math> be <math>B(\ulcorner B\urcorner)</math>.<br />
<br />
Then <math>G</math> is <math>A(\mathrm{diag}(\ulcorner B\urcorner))</math>, by substituting <math>x = \ulcorner B\urcorner</math> in the definition of <math>B</math>.<br />
<br />
We also have <math>\mathrm{diag}(\ulcorner B\urcorner) = \ulcorner B(\ulcorner B\urcorner)\urcorner</math> by definition of <math>\mathrm{diag}</math>. By definition of <math>G</math>, this is <math>\ulcorner G\urcorner</math>, so we have <math>\mathrm{diag}(\ulcorner B\urcorner) = \ulcorner G\urcorner</math>.<br />
<br />
To complete the proof, apply <math>A</math> to both sides of the final equality to obtain <math>A(\mathrm{diag}(\ulcorner B\urcorner))</math> iff <math>A(\ulcorner G\urcorner)</math>; this simplifies to <math>G</math> iff <math>A(\ulcorner G\urcorner)</math>.<br />
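The proof above can be run end-to-end in a toy Python model (my own illustration: "formulas" are Python expression strings with free variable x, a formula's Godel number is the string itself, quotation is <code>repr</code>, and truth is <code>eval</code>):<br />

```python
def diag(c):
    # substitute the quotation of c for the free variable x in c itself
    return c.replace("x", repr(c))

def holds(sentence, A):
    # "semantics": evaluate a closed expression string
    return eval(sentence, {"A": A, "diag": diag})

A = lambda code: len(code) > 10    # an arbitrary predicate on Godel numbers
B = "A(diag(x))"
G = diag(B)                        # G is the sentence A(diag('A(diag(x))'))
assert G == "A(diag('A(diag(x))'))"
assert holds(G, A) == A(G)         # G holds iff A applied to G's code: fixed point
```

Evaluating G computes diag('A(diag(x))'), which is the string G itself, so both sides of the last assertion are literally A(G); the equivalence holds for any choice of A.<br />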
<br />
<ref name="gaifman">Haim Gaifman. [https://web.archive.org/web/20180205090617/http://www.columbia.edu/~hg17/naming-diag.pdf "Naming and Diagonalization, from Cantor to Gödel to Kleene"].</ref><br />
<br />
<ref>https://mathoverflow.net/questions/30874/arithmetic-fixed-point-theorem</ref><br />
<br />
===use of extra quantified variable to make a substitution===<br />
<br />
(see p. 448 of GEB)<br />
<br />
outside the formal system, if we have some function f, a constant a, and some one-place relation R, we can substitute f(a) into R like: R(f(a)). but many systems of formal logic don't have a way to directly talk about outputs of functions like f(a). instead, they have a relation like F(a,y) to mean f(a)=y. [why on earth would they do this? i think the basic reason is that if functions are just relations, we have fewer cases to have to prove in those annoying structural induction proofs.] so how do we express an idea like R(f(a))? we can make use of an extra variable to hold the output, e.g. <math>F(a,y)\wedge R(y)</math>. but this leaves y free, so actually we want <math>\exists y (F(a,y)\wedge R(y))</math>. alternatively, we can also say <math>F(a,y) \to R(y)</math>. in this case we want <math>\forall y (F(a,y) \to R(y))</math>. it's easy to prove that both of these clumsy ways of writing are logically equivalent to R(f(a)).<br />
<br />
something similar happens when we want to diagonalize formulas. given a formula <math>A(x)</math> that has just x free, it's easy enough to diagonalize it: <math>A(\ulcorner A(x)\urcorner)</math>. but what about a sentence like <math>B</math>? how do we "substitute" something into a formula that has no free variable? [why on earth would you want to try that? i think it simplifies the proof a little if we assume diagonalization is defined for any sentence. but i forgot where exactly the simplification occurs.] the idea is again to make use of a separate variable: <math>\exists x (x = \ulcorner A(x)\urcorner \wedge A(x))</math>. again, i think we could also do <math>\forall x (x = \ulcorner A(x)\urcorner \to A(x))</math>. we're basically considering a function that finds the godel number of a sentence. except unlike a relation, a single-free-variable formula fixes some specific variable that it leaves free (a relation doesn't know whether it's x or y that's free -- it just expresses some idea), so we need to fix some single variable to use throughout.<br />
<br />
==Trying to discover the lemma==<br />
<br />
===approach 1===<br />
<br />
https://mathoverflow.net/a/31374<br />
<br />
===approach 2===<br />
<br />
see Owings paper.<br />
<br />
In the framework of this paper, we have a matrix where each entry is of a certain type. Then we apply the function <math>\alpha</math> to the diagonal. If the diagonal turns into one of the rows, <math>\alpha</math> has a fixed point.<br />
<br />
So now the trick is to figure out what our <math>\alpha</math> should be, and also what our matrix should look like.<br />
<br />
Picking the <math>\alpha</math> doesn't seem hard: we want a fixed point for the operation <math>\varphi_{f(-)}</math>, so we can pick <math>\alpha(\varphi_e) = \varphi_{f(e)}</math>. One problem is that this might not be well-defined, but we can just go with this for now (it ends up not mattering, for reasons I don't really understand, but the Owings paper has another workaround, which is to use relations; I find that more confusing).<br />
<br />
The matrix that works turns out to have entries <math>\varphi_{\varphi_j(k)}</math>. I'm not sure how one would have figured this out. One might also think <math>\varphi_j(k)</math> would work, but notice that then we fail the type checking with <math>\alpha</math> (which takes a function, not a natural number).<br />
<br />
So now we take the diagonal, which has entries <math>\varphi_{\varphi_k(k)}</math>, for <math>k = 0, 1, 2, \ldots</math>, and apply <math>\alpha</math>. We get <math>\varphi_{f(\varphi_k(k))}</math>. But <math>d</math> defined by <math>d(x) = \varphi_x(x)</math> is a recursive function, so the diagonal has turned into <math>\varphi_{f(d(k))} = \varphi_{f\circ d(k)}</math>. Since a composition of recursive functions is itself recursive, <math>f\circ d</math> is recursive. So we have some index <math>e</math> for it, i.e. <math>f\circ d \simeq \varphi_e</math>. So <math>\alpha</math> applied to the diagonal results in <math>\varphi_{\varphi_e(k)}</math>, which is one of the rows (the <math>e</math>th row). This means <math>\alpha</math> has a fixed point, in the <math>e</math>th entry, i.e. at <math>\varphi_{\varphi_e(e)}</math>. So we expect <math>\alpha(\varphi_{\varphi_e(e)})=\varphi_{\varphi_e(e)}</math>. Since <math>\alpha(\varphi_{\varphi_e(e)}) = \varphi_{f(\varphi_e(e))}</math>, the "real" fixed point for the operator will be at <math>\varphi_e(e)</math>. Indeed, <math>\varphi_{f(\varphi_e(e))} \simeq \varphi_{f\circ d(e)} \simeq \varphi_{\varphi_e(e)}</math>.<br />
<br />
It remains to verify that the argument goes through even though <math>\alpha</math> might not be well-defined.<br />
<br />
===approach 3===<br />
<br />
Take Cantor's theorem, generalize it to mention fixed points, then take the contrapositive. See the Yanofsky paper for details.<br />
<br />
This version still has some mystery for me, e.g. replacing "the set has at least two elements" with "there is a function from the set to itself without a fixed point". The logical equivalence is easy to see, but getting the idea for rephrasing this condition to mention fixed points is not obvious at all.<br />
<br />
The use of the s-m-n theorem also isn't obvious to me. Why use it at all? Why use it on <math>g</math>? Why do we care about the index of <math>s</math>?<br />
<br />
It's also not clear to me why we use <math>T = \mathbf N</math> and <math>Y = \mathcal F</math>. In some sense it does make sense, like the natural numbers are all the algorithms, and the set of computable functions are the "properties" (a.k.a. "the objects being named").<br />
<br />
===approach 4===<br />
<br />
http://www.andrew.cmu.edu/user/kk3n/complearn/chapter8.pdf -- see section 8.1<br />
<br />
also see Moore and Mertens's section on lambda calculus<br />
<br />
in the untyped lambda calculus, there is only one type of entity, namely functions, which can operate on other functions. This makes it easy for functions to operate on themselves, which creates self-reference.<br />
<br />
but when working with partial recursive functions, we don't have this. instead, we have numbers and then partial functions that operate on numbers. to get self-reference, we need some kind of encoding. this is why we numbered the partial recursive functions.<br />
<br />
but now, one of the familiar facts about the lambda calculus is the existence of the fixed point combinator (aka y combinator). (note: this passes the buck to wondering how one would have come up with the lambda calculus, or how one would come up with the fixed point combinator in that setting; but this seems easier to answer.) since this theorem works in one setting in which we have self-reference, we might wonder if we can "port over" the theorem to the case where we have self-reference in a different setting.<br />
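for concreteness, here is the fixed point combinator sketched in Python. since Python is eagerly evaluated, this is the call-by-value variant (often called the Z combinator); the classic Y combinator would loop forever under eager evaluation:<br />

```python
def Z(f):
    # call-by-value fixed-point combinator: Z(f) behaves like f(Z(f))
    return (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))

# tie the recursive knot without any explicit recursion or self-reference
fact = Z(lambda rec: lambda n: 1 if n == 0 else n * rec(n - 1))
assert fact(5) == 120
```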
<br />
==Comparison table==<br />
<br />
Some things to notice:<br />
<br />
* The two theorems are essentially identical, with identical proofs, as seen by the matching rows. The analogy breaks down slightly at the very end, where we apply <math>\varphi_{f(\cdot)}</math> vs <math>A(\cdot)</math> (the latter corresponds to <math>f</math> until the very end).<br />
* In the partial recursive functions world, it's easy to go from the index (e.g. <math>e</math>) to the partial function (<math>\varphi_e</math>). In the formulas world it's the reverse: it's easy to go from a formula (e.g. <math>A</math>) to its Godel number (<math>\ulcorner A\urcorner</math>). I wonder if there is something essential here, or if it is simply some sort of historical accident in notation.<br />
* For the diagonalization lemma, here we have done the semantic version (? I think...), but usually the manipulations are done inside a formal system with reference to some theory <math>T</math> to derive a syntactic result (i.e. we have some theory that is strong enough to do all these manipulations within the object-level language). For partial recursive functions, as far as I know, there is no analogous distinction between semantics vs syntax.<br />
* The diagonalization part is not completely correct/as strong as possible for both proofs. For the partial recursive functions side, we want to make sure that <math>\varphi_{\varphi_x(x)}</math> is actually defined in each case. For the logic side, I think often the diagonalization is defined as <math>\exists x(x = \ulcorner A\urcorner \wedge A)</math> so that it is defined for all formulas, not just ones with one free variable. But the essential ideas are all present below, and since this makes the comparison easier, the presentation is simplified.<br />
<br />
{| class="wikitable"<br />
|-<br />
! Step !! Rogers's fixed point theorem !! Diagonalization lemma<br />
|-<br />
| Theorem statement (note: quantifiers are part of the metalanguage) || <math>(\forall f \exists e)\ \varphi_e \simeq \varphi_{f(e)}</math> || <math>(\forall A \exists G)\ G \leftrightarrow A(\ulcorner G\urcorner)</math><br />
|-<br />
| Given mapping || <math>f</math> || <math>A</math><br />
|-<br />
| Definition of diagonal function || <math>d(x) = \varphi_x(x)</math> || <math>\mathrm{diag}(\ulcorner C\urcorner) = \ulcorner C(\ulcorner C\urcorner)\urcorner</math><br />
|-<br />
| Composition of given mapping with diagonal function (<math>\mathrm{given} \circ \mathrm{diagonal}</math>) || <math>f(d(x))</math> || <math>A(\mathrm{diag}(x))</math><br />
|-<br />
| Naming the <math>\mathrm{given} \circ \mathrm{diagonal}</math> composition || <math>f\circ d</math> (name not given because compositions are easy to express outside a formal language) || <math>B</math><br />
|-<br />
| Index of <math>\mathrm{given} \circ \mathrm{diagonal}</math> composition || <math>i</math> || <math>\ulcorner B\urcorner</math><br />
|-<br />
| Expanding using definition of diagonal || <math>d(i) = \varphi_i(i)</math> || <math>\mathrm{diag}(\ulcorner B\urcorner) = \ulcorner B(\ulcorner B\urcorner) \urcorner</math><br />
|-<br />
| The <math>\mathrm{given} \circ \mathrm{diagonal}</math> composition applied to own index (i.e. diagonalization of the composition) || <math>f\circ d(i)</math> || <math>B(\ulcorner B\urcorner)</math><br />
|-<br />
| G defined || <math>\varphi_i(i)</math> (no equivalent definition) || <math>G</math> is <math>B(\ulcorner B\urcorner)</math><br />
|-<br />
| Unfolding the composition at its own index || <math>f(d(i)) = \varphi_i(i)</math> || <math>A(\mathrm{diag}(\ulcorner B\urcorner)) \leftrightarrow B(\ulcorner B\urcorner)</math><br />
|-<br />
| Renaming index || <math>e = d(i)</math> || <math>\ulcorner G\urcorner = \mathrm{diag}(\ulcorner B\urcorner)</math><br />
|-<br />
| Leibniz law to previous row || Apply <math>\varphi_{f(\cdot)}</math> to obtain <math>\varphi_{f(e)} = \varphi_{f(d(i))}</math> || Apply <math>A(\cdot)</math> to obtain <math>A(\ulcorner G\urcorner) \leftrightarrow A(\mathrm{diag}(\ulcorner B\urcorner))</math><br />
|-<br />
| Use definition of G || <math>\varphi_{f(e)} = \varphi_{\varphi_i(i)} = \varphi_e</math> || <math>A(\ulcorner G\urcorner) \leftrightarrow B(\ulcorner B\urcorner) \leftrightarrow G</math><br />
|-<br />
| (Definition of G)? || <math>\varphi_i(i)</math> is <math>f(d(i))</math> || <math>G</math> is <math>A(\mathrm{diag}(\ulcorner B\urcorner))</math><br />
|}<br />
<br />
==Quotes==<br />
<br />
"All of these theorems tend to strain one's intuition; in fact, many people find them almost paradoxical. The most popular proofs of these theorems only serve to aggravate the situation because they are completely unmotivated, seem to depend upon a low combinatorial trick, and are so barbarically short as to be nearly incapable of rational analysis."<ref>James C. Owings, Jr. "Diagonalization and the Recursion Theorem". 1973.</ref><br />
<br />
"This is just a lovely result, insightful in its concept and far reaching in its consequences. We’d love to say that the proof was also lovely and enlightening, but to be honest, we don’t have an enlightening sort of proof to show you. Sometimes the best way to describe a proof is that the argument sort of picks you up and shakes you until you agree that it does, in fact, establish what it is supposed to establish. That’s what you get here."<ref>Christopher C. Leary; Lars Kristiansen. ''A Friendly Introduction to Mathematical Logic'' (2nd ed). p. 172.</ref><br />
<br />
"The brevity of the proof does not make for transparency; it has the aura of a magician’s trick. How did Gödel ever come up with the idea? As a matter of fact, Gödel did not come up with that idea."<ref name="gaifman"/><br />
<br />
==Questions/things to explain==<br />
<br />
* In Peter Smith's book, he defines Gdl(m,n) as Prf(m, diag(n)). What is the analogue of Gdl for the Rogers fixed point theorem?<br />
* I like the <math>D(\ulcorner \varphi \urcorner) \iff \varphi(\ulcorner \varphi \urcorner)</math> that begins [https://mathoverflow.net/a/31374 this answer], but what is the analogue for partial functions? It seems like it is <math>d(x) = \varphi_x(x)</math>, which ''does'' exist (because we are allowed to have undefined values). So the motivation that works for the logic version doesn't work for the partial functions version, which bugs me.<br />
<br />
==References==<br />
<br />
<references/></div>
|-<br />
| Index of <math>\mathrm{given} \circ \mathrm{diagonal}</math> composition || <math>i</math> || <math>\ulcorner B\urcorner</math><br />
|-<br />
| Expanding using definition of diagonal || <math>d(i) = \varphi_i(i)</math> || <math>\mathrm{diag}(\ulcorner B\urcorner) = \ulcorner B(\ulcorner B\urcorner) \urcorner</math><br />
|-<br />
| The <math>\mathrm{given} \circ \mathrm{diagonal}</math> composition applied to own index (i.e. diagonalization of the composition) || <math>f\circ d(i)</math> || <math>B(\ulcorner B\urcorner)</math><br />
|-<br />
| G defined || <math>\varphi_i(i)</math> (no equivalent definition) || <math>G</math> is <math>B(\ulcorner B\urcorner)</math><br />
|-<br />
| || <math>f(d(i)) = \varphi_i(i)</math> || <math>A(\mathrm{diag}(\ulcorner B\urcorner)) \leftrightarrow B(\ulcorner B\urcorner)</math><br />
|-<br />
| Renaming index || <math>e = d(i)</math> || <math>\ulcorner G\urcorner = \mathrm{diag}(\ulcorner B\urcorner)</math><br />
|-<br />
| Leibniz law to previous row || Apply <math>\varphi_{f(\cdot)}</math> to obtain <math>\varphi_{f(e)} = \varphi_{f(d(i))}</math> || Apply <math>A(\cdot)</math> to obtain <math>A(\ulcorner G\urcorner) \leftrightarrow A(\mathrm{diag}(\ulcorner B\urcorner))</math><br />
|-<br />
| Use definition of G || <math>\varphi_{f(e)} = \varphi_{\varphi_i(i)} = \varphi_e</math> || <math>A(\ulcorner G\urcorner) \leftrightarrow B(\ulcorner B\urcorner) \leftrightarrow G</math><br />
|-<br />
| (Definition of G)? || <math>\varphi_i(i)</math> is <math>f(d(i))</math> || <math>G</math> is <math>A(\mathrm{diag}(\ulcorner B\urcorner))</math><br />
|}<br />
<br />
==Quotes==<br />
<br />
"All of these theorems tend to strain one's intuition; in fact, many people find them almost paradoxical. The most popular proofs of these theorems only serve to aggravate the situation because they are completely unmotivated, seem to depend upon a low combinatorial trick, and are so barbarically short as to be nearly incapable of rational analysis."<ref>James C. Owings, Jr. "Diagonalization and the Recursion Theorem". 1973.</ref><br />
<br />
"This is just a lovely result, insightful in its concept and far reaching in its consequences. We’d love to say that the proof was also lovely and enlightening, but to be honest, we don’t have an enlightening sort of proof to show you. Sometimes the best way to describe a proof is that the argument sort of picks you up and shakes you until you agree that it does, in fact, establish what it is supposed to establish. That’s what you get here."<ref>Christopher C. Leary; Lars Kristiansen. ''A Friendly Introduction to Mathematical Logic'' (2nd ed). p. 172.</ref><br />
<br />
"The brevity of the proof does not make for transparency; it has the aura of a magician’s trick. How did Gödel ever come up with the idea? As a matter of fact, Gödel did not come up with that idea."<ref name="gaifman"/><br />
<br />
==Questions/things to explain==<br />
<br />
* In Peter Smith's book, he defines Gdl(m,n) as Prf(m, diag(n)). What is the analogue of Gld for the Rogers fixed point theorem?<br />
* I like the <math>D(\ulcorner \varphi \urcorner) \iff \varphi(\ulcorner \varphi \urcorner)</math> that begins [https://mathoverflow.net/a/31374 this answer], but what is the analogue for partial functions? It seems like it is <math>d(x) = \varphi_x(x)</math>, which ''does'' exist (because we are allowed to have undefined values). So the motivation that works for the logic version doesn't work for the partial functions version, which bugs me.<br />
<br />
==References==<br />
<br />
<references/></div>IssaRicehttps://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Computability_and_logic/Diagonalization_lemma&diff=3284User:IssaRice/Computability and logic/Diagonalization lemma2021-06-07T19:21:58Z<p>IssaRice: </p>
<hr />
<div>The '''diagonalization lemma''', also called the Godel-Carnap fixed point theorem, is a fixed point theorem in logic.<br />
<br />
<div>a verbal version of this result, modified from GEB (p. 449): take the predicate cannot-be-proved-when-diagonalized(x). It takes a predicate as its x input, and claims that the sentence obtained by diagonalizing that predicate (i.e. inserting it into itself) cannot be proved. For instance, cannot-be-proved-when-diagonalized("has-length-less-than-one-thousand(x)") claims that has-length-less-than-one-thousand("has-length-less-than-one-thousand(x)") cannot be proved. In this case, it's false, since we can see that the string "has-length-less-than-one-thousand(x)" has length less than 1000, and let's assume our proof system is strong enough to prove this. Now, to diagonalize cannot-be-proved-when-diagonalized(x) is to form the sentence cannot-be-proved-when-diagonalized("cannot-be-proved-when-diagonalized(x)"). So can this sentence be proved or not? If it can be proved, then the sentence itself claims that it cannot be proved, a contradiction. So it must not be provable.<br />
<br />
Basically, unlike in English, a sentence in a formal language can't refer to itself using phrases like "this sentence itself", so there is no straightforward way to make claims like "This sentence cannot be proved". To get around this restriction, we must use diagonalization -- substituting a sentence's own encoding (i.e. string representation, aka "godel number") into itself. This allows a predicate to talk of its own string representation. So now if that predicate happens to claim unprovability, we get godel's first incompleteness theorem.<br />
<br />
The diagonalization lemma generalizes to talk about any predicate P(x), not just not-provable(x). We want to find a sentence G such that G is true if and only if P("G") is (this is a little sloppy -- it's not actually the string "G", but rather if we made whatever G happens to be into a string...). Let G be has-property-P-when-diagonalized("has-property-P-when-diagonalized(x)"). If G is true, then has-property-P-when-diagonalized(x) must have property P when diagonalized, i.e. P("has-property-P-when-diagonalized('has-property-P-when-diagonalized(x)')") which means P("G") is true. If G is false, then has-property-P-when-diagonalized(x) must not have property P when diagonalized, i.e. P("has-property-P-when-diagonalized('has-property-P-when-diagonalized(x)')") is false.<br />
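the same construction can be sketched with plain strings (a toy Python model, not the lemma itself: formulas are strings with a placeholder X, and the names diag, B, G are chosen to match the proof of the lemma later on):<br />
<br />
```python
def diag(formula):
    # substitute the formula's own quotation for the placeholder X
    return formula.replace("X", repr(formula))

B = "P(diag(X))"   # says: "X, when diagonalized, has property P"
G = diag(B)        # G is the string  P(diag('P(diag(X))'))

# the argument that G hands to P evaluates to G's own quotation:
inner = G[len("P("):-1]   # the expression  diag('P(diag(X))')
assert eval(inner, {"diag": diag}) == G
```
so G refers to its own quotation indirectly, via diag, rather than containing a literal copy of itself.<br />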
<br />
==Rogers's fixed point theorem==<br />
<br />
Let <math>f</math> be a total computable function. Then there exists an index <math>e</math> such that <math>\varphi_e \simeq \varphi_{f(e)}</math>.<br />
<br />
(simplified)<br />
<br />
Define <math>d(x) = \varphi_x(x)</math> (this is actually slightly wrong, but it brings out the analogy better).<br />
<br />
Consider the function <math>f\circ d</math>. This is partial recursive, so <math>f\circ d \simeq \varphi_i</math> for some index <math>i</math>.<br />
<br />
Now <math>\varphi_{f(d(i))} \simeq \varphi_{\varphi_i(i)}</math> since <math>f\circ d \simeq \varphi_i</math>. This is equivalent to <math>\varphi_{d(i)}</math> by definition of <math>d</math>. Thus, we may take <math>e = d(i)</math> to complete the proof.<br />
<br />
It looks like we have <math>f(d(i)) = \varphi_i(i) = d(i)</math>, i.e. <math>f(e) = e</math>. Is this right? In this simplified version, yes, since <math>f(d(i)) = (f\circ d)(i) = \varphi_i(i) = d(i)</math>; but this is an artifact of the sloppy definition of <math>d</math>. In the rigorous version below, where <math>d</math> is replaced by the total function <math>s</math>, the number <math>s(i)</math> is merely ''an index for'' the same partial function as <math>\varphi_{\varphi_i(i)}</math>, not the number <math>\varphi_i(i)</math> itself, so we only get <math>\varphi_{f(e)} \simeq \varphi_e</math> and not <math>f(e) = e</math> (which would be false in general, e.g. for <math>f(x) = x+1</math>).<br />
<br />
<br />
Repeatedly using the facts that (1) <math>i</math> is an index for <math>f\circ d</math>, and (2) <math>d(i) = \varphi_i(i)</math>, allows us to create an iteration effect:<br />
<br />
<math>\varphi_i(i) \simeq f(d(i)) \simeq f(\varphi_i(i)) \simeq f(f(d(i))) \simeq f(f(\varphi_i(i))) \simeq \cdots \simeq f\circ \cdots \circ f \circ d(i)</math><br />
<br />
(I'm wondering if there's some deeper meaning to this. So far it's just an interesting connection between diagonalization-based fixed points and iteration-based fixed points. I think there might be a connection between this and the [https://medium.com/@cdsmithus/fixpoints-in-haskell-294096a9fc10 fix function in Haskell].)<br />
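for comparison, here is the self-application-based fixed point combinator from the lambda calculus (Curry's Y combinator, in its call-by-value variant Z) written as a Python sketch:<br />
<br />
```python
# Z combinator: call-by-value variant of the Y combinator.
# Z(f) is a fixed point of f, i.e. Z(f) behaves like f(Z(f)).
Z = lambda f: (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))

# factorial obtained without any explicit recursion or self-naming
fact = Z(lambda rec: lambda n: 1 if n == 0 else n * rec(n - 1))
fact(5)  # 120
```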
<br />
<br />
In the more rigorous/careful version of the proof, we use the [[s-m-n theorem]] to get an index of a function, <math>s</math>, which is basically like <math>d</math>. The difference is that <math>\varphi_x(x)</math> might not be defined for all <math>x</math> (actually it isn't, since some partial functions are always undefined) so <math>d</math> is not total. On the other hand, <math>s</math> is obtained via the s-m-n theorem so is total. When <math>\varphi_x(x)</math> is undefined, <math>s(x)</math> gives an index of the always-undefined partial function. So <math>s</math> says "this is undefined" in a defined way. Thanks to this property, the expression <math>\varphi_{s(x)}</math> always makes sense, whereas <math>\varphi_{\varphi_x(x)}</math> sometimes doesn't make sense.<br />
<br />
<br />
See also https://machinelearning.subwiki.org/wiki/User:IssaRice/Computability_and_logic/Rogers_fixed_point_theorem_using_Sipser%27s_notation<br />
<br />
==Diagonalization lemma==<br />
<br />
(semantic version)<br />
<br />
Let <math>A</math> be a formula with one free variable. Then there exists a sentence <math>G</math> such that <math>G</math> iff <math>A(\ulcorner G\urcorner)</math>.<br />
<br />
Define <math>\mathrm{diag}(x)</math> to be <math>\ulcorner C(\ulcorner C\urcorner)\urcorner</math> where <math>x = \ulcorner C\urcorner</math>. In other words, given a number <math>x</math>, the function <math>\mathrm{diag}</math> finds the formula with that Godel number, then diagonalizes it (i.e. substitutes the Godel number of the formula into the formula itself), then returns the Godel number of the resulting sentence.<br />
<br />
Let <math>B</math> be <math>A(\mathrm{diag}(x))</math>, and let <math>G</math> be <math>B(\ulcorner B\urcorner)</math>.<br />
<br />
Then <math>G</math> is <math>A(\mathrm{diag}(\ulcorner B\urcorner))</math>, by substituting <math>x = \ulcorner B\urcorner</math> in the definition of <math>B</math>.<br />
<br />
We also have <math>\mathrm{diag}(\ulcorner B\urcorner) = \ulcorner B(\ulcorner B\urcorner)\urcorner</math> by definition of <math>\mathrm{diag}</math>. By definition of <math>G</math>, this is <math>\ulcorner G\urcorner</math>, so we have <math>\mathrm{diag}(\ulcorner B\urcorner) = \ulcorner G\urcorner</math>.<br />
<br />
To complete the proof, apply <math>A</math> to both sides of the final equality to obtain <math>A(\mathrm{diag}(\ulcorner B\urcorner))</math> iff <math>A(\ulcorner G\urcorner)</math>; this simplifies to <math>G</math> iff <math>A(\ulcorner G\urcorner)</math>.<br />
<br />
<ref name="gaifman">Haim Gaifman. [https://web.archive.org/web/20180205090617/http://www.columbia.edu/~hg17/naming-diag.pdf "Naming and Diagonalization, from Cantor to Gödel to Kleene"].</ref><br />
<br />
<ref>https://mathoverflow.net/questions/30874/arithmetic-fixed-point-theorem</ref><br />
<br />
===use of extra quantified variable to make a substitution===<br />
<br />
(see p. 448 of GEB)<br />
<br />
outside the formal system, if we have some function f, a constant a, and some one-place relation R, we can substitute f(a) into R like: R(f(a)). but many systems of formal logic don't have a way to directly talk about outputs of functions like f(a). instead, they have a relation like F(a,y) to mean f(a)=y. [why on earth would they do this? i think the basic reason is that if functions are just relations, we have fewer cases to have to prove in those annoying structural induction proofs.] so how do we express an idea like R(f(a))? we can make use of an extra variable to hold the output, e.g. <math>F(a,y)\wedge R(y)</math>. but this leaves y free, so actually we want <math>\exists y (F(a,y)\wedge R(y))</math>. alternatively, we can also say <math>F(a,y) \to R(y)</math>. in this case we want <math>\forall y (F(a,y) \to R(y))</math>. assuming F really is functional (for each a there is exactly one y with F(a,y)), it's easy to prove that both of these clumsy ways of writing are logically equivalent to R(f(a)).<br />
<br />
something similar happens when we want to diagonalize formulas. given a formula <math>A(x)</math> that has just x free, it's easy enough to diagonalize it: <math>A(\ulcorner A(x)\urcorner)</math>. but what about a sentence like <math>B</math>? how do we "substitute" in something into something that has no free variable? [why on earth would you want to try that? i think it simplifies the proof a little if we assume diagonalization is defined for any sentence. but i forgot where exactly the simplification occurs.] the idea is again to make use of a separate variable: <math>\exists x (x = \ulcorner A(x)\urcorner \wedge A(x))</math>. again, i think we could also do <math>\forall x (x = \ulcorner A(x)\urcorner \to A(x))</math>. we're basically considering a function that finds the godel number of a sentence. except unlike a relation, a single-free-variable formula fixes some specific variable that it leaves free (a relation doesn't know whether it's x or y that's free -- it just expresses some idea), so we need to fix some single variable to use throughout.<br />
<br />
==Trying to discover the lemma==<br />
<br />
===approach 1===<br />
<br />
https://mathoverflow.net/a/31374<br />
<br />
===approach 2===<br />
<br />
see Owings paper.<br />
<br />
In the framework of this paper, we have a matrix where each entry is of a certain type. Then we apply the function <math>\alpha</math> to the diagonal. If the diagonal turns into one of the rows, <math>\alpha</math> has a fixed point.<br />
<br />
So now the trick is to figure out what our <math>\alpha</math> should be, and also what our matrix should look like.<br />
<br />
Picking the <math>\alpha</math> doesn't seem hard: we want a fixed point for the operation <math>\varphi_{f(-)}</math>, so we can pick <math>\alpha(\varphi_e) = \varphi_{f(e)}</math>. One problem is that this might not be well-defined, but we can just go with this for now (it ends up not mattering, for reasons I don't really understand, but the Owings paper has another workaround, which is to use relations; I find that more confusing).<br />
<br />
The matrix that works turns out to have entries <math>\varphi_{\varphi_j(k)}</math>. I'm not sure how one would have figured this out. One might also think <math>\varphi_j(k)</math> would work, but notice that then we fail the type checking with <math>\alpha</math> (which takes a function, not a natural number).<br />
<br />
So now we take the diagonal, which has entries <math>\varphi_{\varphi_k(k)}</math>, for <math>k = 0, 1, 2, \ldots</math>, and apply <math>\alpha</math>. We get <math>\varphi_{f(\varphi_k(k))}</math>. But <math>d</math> defined by <math>d(x) = \varphi_x(x)</math> is a recursive function, so the diagonal has turned into <math>\varphi_{f(d(k))} = \varphi_{f\circ d(k)}</math>. Since a composition of recursive functions is itself recursive, <math>f\circ d</math> is recursive. So we have some index <math>e</math> for it, i.e. <math>f\circ d \simeq \varphi_e</math>. So <math>\alpha</math> applied to the diagonal results in <math>\varphi_{\varphi_e(k)}</math>, which is one of the rows (the <math>e</math>th row). This means <math>\alpha</math> has a fixed point, in the <math>e</math>th entry, i.e. at <math>\varphi_{\varphi_e(e)}</math>. So we expect <math>\alpha(\varphi_{\varphi_e(e)})=\varphi_{\varphi_e(e)}</math>. Since <math>\alpha(\varphi_{\varphi_e(e)}) = \varphi_{f(\varphi_e(e))}</math>, the "real" fixed point for the operator will be at <math>\varphi_e(e)</math>. Indeed, <math>\varphi_{f(\varphi_e(e))} \simeq \varphi_{f\circ d(e)} \simeq \varphi_{\varphi_e(e)}</math>.<br />
<br />
Now we have to verify that <math>\alpha</math> doesn't need to be well-defined.<br />
<br />
===approach 3===<br />
<br />
Take Cantor's theorem, generalize it to mention fixed points, then take the contrapositive. See the Yanofsky paper for details.<br />
<br />
This version still has some mystery for me, e.g. replacing "the set has at least two elements" with "there is a function from the set to itself without a fixed point". The logical equivalence is easy to see, but getting the idea for rephrasing this condition to mention fixed points is not obvious at all.<br />
<br />
The use of the s-m-n theorem also isn't obvious to me. Why use it at all? Why use it on <math>g</math>? Why do we care about the index of <math>s</math>?<br />
<br />
It's also not clear to me why we use <math>T = \mathbf N</math> and <math>Y = \mathcal F</math>. In some sense it does make sense, like the natural numbers are all the algorithms, and the set of computable functions are the "properties" (a.k.a. "the objects being named").<br />
<br />
===approach 4===<br />
<br />
http://www.andrew.cmu.edu/user/kk3n/complearn/chapter8.pdf -- see section 8.1<br />
<br />
also see Moore and Mertens's section on lambda calculus<br />
<br />
in the untyped lambda calculus, there is only one type of entity, namely functions, which can operate on other functions. This makes it easy for functions to operate on themselves, which creates self-reference.<br />
<br />
but when working with partial recursive functions, we don't have this. instead, we have numbers and then partial functions that operate on numbers. to get self-reference, we need some kind of encoding. this is why we numbered the partial recursive functions.<br />
<br />
but now, one of the familiar facts about the lambda calculus is the existence of the fixed point combinator (aka y combinator). (note: this passes the buck to wondering how one would have come up with the lambda calculus, or how one would come up with the fixed point combinator in that setting; but this seems easier to answer.) since this theorem works in one setting in which we have self-reference, we might wonder if we can "port over" the theorem to the case where we have self-reference in a different setting.<br />
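one concrete way to see the "port over": in a language whose programs can manipulate their own source as strings, the encoding trick becomes a quine. A standard Python quine, included as an illustration (with %r playing the role of the godel numbering):<br />
<br />
```python
# a program that prints its own source: self-reference via encoding
s = 's = %r\nprint(s %% s)'
print(s % s)
```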
<br />
==Comparison table==<br />
<br />
Some things to notice:<br />
<br />
* The two theorems are essentially identical, with identical proofs, as seen by the matching rows. The analogy breaks down slightly at the very end, where we apply <math>\varphi_{f(\cdot)}</math> vs <math>A(\cdot)</math> (the latter corresponds to <math>f</math> until the very end).<br />
* In the partial recursive functions world, it's easy to go from the index (e.g. <math>e</math>) to the partial function (<math>\varphi_e</math>). In the formulas world it's the reverse, where it's easy to go from a formula (e.g. <math>A</math>) to its Godel number <math>\ulcorner A\urcorner</math>). I wonder if there is something essential here, or if it is simply some sort of historical accident in notation.<br />
* For the diagonalization lemma, here we have done the semantic version (? I think...), but usually the manipulations are done inside a formal system with reference to some theory <math>T</math> to derive a syntactic result (i.e. we have some theory that is strong enough to do all these manipulations within the object-level language). For partial recursive functions, as far as I know, there is no analogous distinction between semantics vs syntax.<br />
* The diagonalization part is not completely correct/as strong as possible for both proofs. For the partial recursive functions side, we want to make sure that <math>\varphi_{\varphi_x(x)}</math> is actually defined in each case. For the logic side, I think often the diagonalization is defined as <math>\exists x(x = \ulcorner A\urcorner \wedge A)</math> so that it is defined for all formulas, not just ones with one free variable. But the essential ideas are all present below, and since this makes the comparison easier, the presentation is simplified.<br />
<br />
{| class="wikitable"<br />
|-<br />
! Step !! Rogers's fixed point theorem !! Diagonalization lemma<br />
|-<br />
| Theorem statement (note: quantifiers are part of the metalanguage) || <math>(\forall f \exists e)\ \varphi_e \simeq \varphi_{f(e)}</math> || <math>(\forall A \exists G)\ G \leftrightarrow A(\ulcorner G\urcorner)</math><br />
|-<br />
| Given mapping || <math>f</math> || <math>A</math><br />
|-<br />
| Definition of diagonal function || <math>d(x) = \varphi_x(x)</math> || <math>\mathrm{diag}(\ulcorner C\urcorner) = \ulcorner C(\ulcorner C\urcorner)\urcorner</math><br />
|-<br />
| Composition of given mapping with diagonal function (<math>\mathrm{given} \circ \mathrm{diagonal}</math>) || <math>f(d(x))</math> || <math>A(\mathrm{diag}(x))</math><br />
|-<br />
| Naming the <math>\mathrm{given} \circ \mathrm{diagonal}</math> composition || <math>f\circ d</math> (name not given because compositions are easy to express outside a formal language) || <math>B</math><br />
|-<br />
| Index of <math>\mathrm{given} \circ \mathrm{diagonal}</math> composition || <math>i</math> || <math>\ulcorner B\urcorner</math><br />
|-<br />
| Expanding using definition of diagonal || <math>d(i) = \varphi_i(i)</math> || <math>\mathrm{diag}(\ulcorner B\urcorner) = \ulcorner B(\ulcorner B\urcorner) \urcorner</math><br />
|-<br />
| The <math>\mathrm{given} \circ \mathrm{diagonal}</math> composition applied to own index (i.e. diagonalization of the composition) || <math>f\circ d(i)</math> || <math>B(\ulcorner B\urcorner)</math><br />
|-<br />
| G defined || <math>\varphi_i(i)</math> (no equivalent definition) || <math>G</math> is <math>B(\ulcorner B\urcorner)</math><br />
|-<br />
| || <math>f(d(i)) = \varphi_i(i)</math> || <math>A(\mathrm{diag}(\ulcorner B\urcorner)) \leftrightarrow B(\ulcorner B\urcorner)</math><br />
|-<br />
| Renaming index || <math>e = d(i)</math> || <math>\ulcorner G\urcorner = \mathrm{diag}(\ulcorner B\urcorner)</math><br />
|-<br />
| Leibniz law to previous row || Apply <math>\varphi_{f(\cdot)}</math> to obtain <math>\varphi_{f(e)} = \varphi_{f(d(i))}</math> || Apply <math>A(\cdot)</math> to obtain <math>A(\ulcorner G\urcorner) \leftrightarrow A(\mathrm{diag}(\ulcorner B\urcorner))</math><br />
|-<br />
| Use definition of G || <math>\varphi_{f(e)} = \varphi_{\varphi_i(i)} = \varphi_e</math> || <math>A(\ulcorner G\urcorner) \leftrightarrow B(\ulcorner B\urcorner) \leftrightarrow G</math><br />
|-<br />
| (Definition of G)? || <math>\varphi_i(i)</math> is <math>f(d(i))</math> || <math>G</math> is <math>A(\mathrm{diag}(\ulcorner B\urcorner))</math><br />
|}<br />
<br />
==Quotes==<br />
<br />
"All of these theorems tend to strain one's intuition; in fact, many people find them almost paradoxical. The most popular proofs of these theorems only serve to aggravate the situation because they are completely unmotivated, seem to depend upon a low combinatorial trick, and are so barbarically short as to be nearly incapable of rational analysis."<ref>James C. Owings, Jr. "Diagonalization and the Recursion Theorem". 1973.</ref><br />
<br />
"This is just a lovely result, insightful in its concept and far reaching in its consequences. We’d love to say that the proof was also lovely and enlightening, but to be honest, we don’t have an enlightening sort of proof to show you. Sometimes the best way to describe a proof is that the argument sort of picks you up and shakes you until you agree that it does, in fact, establish what it is supposed to establish. That’s what you get here."<ref>Christopher C. Leary; Lars Kristiansen. ''A Friendly Introduction to Mathematical Logic'' (2nd ed). p. 172.</ref><br />
<br />
"The brevity of the proof does not make for transparency; it has the aura of a magician’s trick. How did Gödel ever come up with the idea? As a matter of fact, Gödel did not come up with that idea."<ref name="gaifman"/><br />
<br />
==Questions/things to explain==<br />
<br />
* In Peter Smith's book, he defines Gdl(m,n) as Prf(m, diag(n)). What is the analogue of Gdl for the Rogers fixed point theorem?<br />
* I like the <math>D(\ulcorner \varphi \urcorner) \iff \varphi(\ulcorner \varphi \urcorner)</math> that begins [https://mathoverflow.net/a/31374 this answer], but what is the analogue for partial functions? It seems like it is <math>d(x) = \varphi_x(x)</math>, which ''does'' exist (because we are allowed to have undefined values). So the motivation that works for the logic version doesn't work for the partial functions version, which bugs me.<br />
<br />
==References==<br />
<br />
<references/></div>IssaRicehttps://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Subfield_of_math_that_is_best_for_introducing_proofs&diff=3283User:IssaRice/Subfield of math that is best for introducing proofs2021-05-31T07:48:37Z<p>IssaRice: /* Set theory */</p>
<hr />
<div>This page compares subfields of math by how good they are as an introduction to proofs and rigor. In other words, the question to ask is "What is the first proof-based math course one should take?"<br />
<br />
Everything here should be viewed from the lens of ''good pedagogy''. This isn't directly about what the most beautiful branch is, or what's most interesting. It's also not about the elegance of the finished theory; for pedagogy, it's often a good idea to do things in a roundabout way.<br />
<br />
I think I personally got my start at doing proofs by three routes: intro to proofs, discrete math, and real analysis. i think i pretty haphazardly jumped between various books, following my curiosity.<br />
<br />
==Intro to proofs==<br />
<br />
By "intro to proofs", i mean the kind of stuff that is in velleman's ''How to Prove It''. so like, basic logic, proof strategies, quantifiers, sets, relations, functions, mathematical induction. maybe countable vs uncountable infinity.<br />
<br />
Pros:<br />
<br />
* you're learning one thing at a time. With something like real analysis, you have to learn two things at a time: how to write a proof, and the actual content of real analysis. But with intro to proofs, you just learn how to write a proof. Most of the problems are artificial; they are designed specifically to get you to write good proofs, to show you common pitfalls.<br />
* i do like how some books here (i think rosen's discrete math book?) show how to prove things in first order logic notation. it's not literally doing derivations in FOL, but it's more robotic than how you would write a proof in prose. i think it's valuable to see this, and "intro to proofs" contexts are the main ones where i have seen this done. I can't think of why it can't be done in other contexts though (other than lack of time). An example of what i mean by this is something like writing the definition of limit like <math>\forall \varepsilon > 0 \ \exists N \ \forall n \geq N \ |a_n - L| < \varepsilon</math>, and then negating this statement by reversing all the quantifiers in a robotic way.<br />
* since this part is so "easy" in some sense (you don't need to build up lots of complicated objects), it is possible to do a lot of this stuff in Lean. that means you get a more "gamified" feeling of doing this kind of math.<br />
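as a worked instance of that robotic negation: starting from <math>\forall \varepsilon > 0 \ \exists N \ \forall n \geq N \ |a_n - L| < \varepsilon</math>, flip each quantifier in turn and negate the innermost statement to get<br />
<br />
<math>\exists \varepsilon > 0 \ \forall N \ \exists n \geq N \ |a_n - L| \geq \varepsilon</math>,<br />
<br />
which says exactly that the sequence does not converge to <math>L</math>.<br />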
<br />
Cons:<br />
<br />
* there's not much "meat" in here? personally i really enjoyed working through basic logic, but it does feel pretty empty at the end of the day. (there is stuff like pursuing what material implication means that i spent a lot of time on as a teenager, but i doubt most people care about it.)<br />
<br />
==Real analysis==<br />
<br />
Pros:<br />
<br />
* students are already familiar with the basic objects of study (assuming they have taken calculus), including the real numbers, continuous functions, etc. so less motivation is needed for explaining why these objects are interesting to study<br />
* lots of crazy things happen in real analysis (see Abbott's book or chapter 1 of Tao's book) that motivate the need for rigor<br />
<br />
Cons:<br />
<br />
* lots of subtle stuff happens, which might make it particularly challenging as an introduction to proofs<br />
* Bloch's ''The Real Numbers and Real Analysis'' (p. xxiv) mentions nested quantifiers as the differentiating factor that makes analysis harder than linear or abstract algebra for beginner students.<br />
* Bloch also mentions the need to distinguish between scratch work and actual proof as another unique difficulty.<br />
<br />
there is a [https://www.greaterwrong.com/posts/Ym5k6bbwAFPqb2syt/the-value-of-learning-mathematical-proof/comment/YsePZH9s9BMcs5PtZ lesswrong thread] about this. Some books, like Tao's, avoid to some extent the problems mentioned there by starting with the natural numbers, integers, rationals, and basic set theory. It's only like half way through the book that the real numbers are even introduced. This gives the student some time to get used to writing proofs in a discrete domain.<br />
<br />
==Complex analysis==<br />
<br />
everybody says complex analysis is so beautiful and stuff, but i don't think i've ever seen it used as a first course in doing proofs. maybe there is a good reason?<br />
<br />
==Linear algebra==<br />
<br />
Pros:<br />
<br />
* objects of study (linear maps) are simple<br />
* there aren't a lot of paradoxical/unintuitive things<br />
* knowledge of linear algebra helps in many places<br />
<br />
Cons:<br />
<br />
* boring? i think SVD might be the only interesting theorem.<br />
* every finite-dimensional vector space is isomorphic to R^n, so vector spaces are not a good example of abstraction (Tim Gowers makes this point somewhere)<br />
* almost every book sucks?<br />
<br />
==Abstract algebra==<br />
<br />
Pros:<br />
<br />
* unsolvability of the quintic might be a good target to work toward<br />
* although algebraic structures are "abstract", it's possible to give many concrete finite examples that build intuition. when examples are finite, you can "see everything"/specify things completely in a way you can't e.g. specify a continuous function completely (by giving a list of where inputs map).<br />
<br />
Cons:<br />
<br />
* boring? i think many of the introductory stuff feels like "what's the point of this?" why care about groups, subgroups, normal subgroups?<br />
* something that's kinda funky about intro group theory: many of the non-trivial proofs are actually about number theory. like, because things like lagrange's theorem are phrased in terms of multiples of numbers, the practice problems are also phrased that way. same thing with stuff like x^n=e implies n is a multiple of ord(x), and talking about groups of prime order. also of course, cyclic groups <=> modular arithmetic connection. so yeah. it's like, you thought you were in here to learn algebra, but in fact, you're just being forced to work through a bunch of basic number theory. and since the algebra book isn't a number theory book, it's not exactly gentle/pedantic/rigorous/self-contained about the number theory it teaches. So you get an "intro to number theory" aspect but it's actually sort of a mickey mouse/dumbed down/not-the-real-thing experience. My current guess in light of this is that it's best to cover basic number theory in very rigorous detail ''before'' you start doing abstract algebra. That way, you can focus on the algebra parts without getting sidetracked into number theory.<br />
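that said, because the groups involved are finite, these number-theoretic facts are easy to check exhaustively -- e.g. a toy Python check (for the group of units mod 15, as an illustration) that ord(x) divides the group order, hence x^|G| = e:<br />
<br />
```python
from math import gcd

n = 15
units = [a for a in range(1, n) if gcd(a, n) == 1]  # group of units mod 15

def order(a):
    # smallest k >= 1 with a^k = 1 (mod n)
    k, x = 1, a
    while x != 1:
        x, k = (x * a) % n, k + 1
    return k

# consequence of lagrange's theorem: ord(a) divides |G|
assert all(len(units) % order(a) == 0 for a in units)
# hence a^|G| = 1 for every a in the group
assert all(pow(a, len(units), n) == 1 for a in units)
```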
<br />
==Computability and logic==<br />
<br />
Pros:<br />
<br />
* proofs in analysis and linear algebra (and probably other places too) often make use of "algorithmic" ideas, e.g. the bisection proof of bolzano-weierstrass theorem. there is a sense in which we like our proofs to be computable, but without learning computability it's hard to express what we even mean by this. I think Stillwell's ''Reverse Mathematics'' talks about this issue?<br />
* several interesting theorems, including equivalence of semi-decidable and recursively enumerable sets, the existence of non-r.e. sets, various examples of diagonalization.<br />
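the diagonalization examples, at least, can be made very concrete. a toy sketch (my own, a finite version of the idea): given any finite list of functions on the naturals, build one that differs from the i-th function at input i:<br />

```python
# Cantor-style diagonalization: the new function disagrees with
# funcs[i] at input i, so it cannot equal any function in the list.
# Run over an infinite enumeration, the same trick gives
# non-computable functions and non-r.e. sets.

def diagonalize(funcs):
    # only meaningful for inputs n < len(funcs) in this finite toy
    return lambda n: funcs[n](n) + 1

funcs = [lambda n: n, lambda n: 2 * n, lambda n: 7]
g = diagonalize(funcs)
for i, f in enumerate(funcs):
    assert g(i) != f(i)  # g differs from every listed function

print([g(i) for i in range(3)])  # -> [1, 3, 8]
```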
<br />
Cons:<br />
<br />
* this isn't usually taken to be an intro-to-proofs subject, so the textbooks might assume more mathematical maturity than a beginner has. In other words, the teaching material in other subfields might be better suited to a first proofs course.<br />
* some of the material seems like unhelpful pedantry, like how interpretations are defined, and the proof of the soundness theorem. it takes a rare kind of mind (or substantial experience) to even realize that the soundness theorem is important.<br />
* might be too meta as an introduction to proofs, e.g. always distinguishing between object level and meta level<br />
<br />
==Discrete math==<br />
<br />
Pros:<br />
<br />
* many topics to pick and choose from<br />
* many interesting topics that can be covered that have wide applicability<br />
* topics tend to be concrete, so you can easily play with them (e.g. automata, finite graphs)<br />
** this also means many of the topics are amenable to programmatic treatment -- you can write toy programs to test your theories and so on.<br />
* no philosophical issues (?)<br />
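as an example of the toy-programs point (my own illustration, nothing canonical): checking the handshake lemma -- the degree sum of a finite graph equals twice the number of edges -- on random graphs:<br />

```python
import random

# Handshake lemma: sum of vertex degrees = 2 * number of edges.
# Brute-force checking on random graphs is the kind of experiment
# that's easy in discrete math and unavailable for, say, an
# arbitrary continuous function.

def random_graph(n, p):
    """Edge set of a random graph on vertices 0..n-1, edge probability p."""
    return {(u, v) for u in range(n) for v in range(u + 1, n)
            if random.random() < p}

for trial in range(100):
    edges = random_graph(8, 0.5)
    degree = {v: 0 for v in range(8)}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    assert sum(degree.values()) == 2 * len(edges)
```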
<br />
Cons:<br />
<br />
* there's something about proofs in discrete math/abstract algebra where, when you can see the whole thing (because it's finite), it becomes really tempting to say it's "obvious" and handwave through a proof.<br />
* are there any really interesting results that are also easy enough to understand? like a "crowning jewel" type theorem?<br />
<br />
==Number theory==<br />
<br />
I wish there were a book called something like "extremely basic number theory in extremely rigorous detail" that covered the very most basic results in number theory that appear over and over again in other areas of math. so things like gcd/lcm properties, the infinitude of primes, uniqueness of prime factorization, and stuff like that (anything else?). The reason for this is that a few other fields (like abstract algebra) make use of a lot of these results, but the books on algebra are unwilling to treat number theory at the level of rigorous detail that i think is good for someone starting out with proofs.<br />
<br />
Pros:<br />
<br />
* students are already intimately familiar with the integers.<br />
* the problems in number theory are easy to state.<br />
* good concepts like gcd that "feel obvious" but require careful treatment. (e.g. what is gcd(0,0)?)<br />
* non-trivial results like bezout's lemma and euclid's algorithm for finding the gcd.<br />
* number theory can be used to branch out into many different topics/fields: things like gcd can be used to cover the notion of algorithm, some stuff around fermat's little theorem can be used to talk about group theory<br />
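those gcd points are easy to poke at in code. a sketch (my own) of euclid's algorithm, extended to also produce the bezout coefficients, with the gcd(0,0) = 0 convention falling out for free:<br />

```python
# Extended Euclidean algorithm: returns (g, x, y) with g = gcd(a, b)
# and a*x + b*y = g (bezout's lemma). With the usual convention
# gcd(0, 0) = 0, which the loop handles without a special case.

def extended_gcd(a, b):
    old_r, r = a, b
    old_x, x = 1, 0
    old_y, y = 0, 1
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_x, x = x, old_x - q * x
        old_y, y = y, old_y - q * y
    return old_r, old_x, old_y

g, x, y = extended_gcd(240, 46)
assert g == 2 and 240 * x + 46 * y == g
assert extended_gcd(0, 0)[0] == 0  # the gcd(0,0) edge case
```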
<br />
Cons:<br />
<br />
* i'm not sure what a good book for this is.<br />
<br />
==Topology==<br />
<br />
i think i should split this up into metric spaces vs general topology, because i have different opinions about them.<br />
<br />
Pros:<br />
<br />
* i find metric spaces fun. it's a good example of abstraction. and you can often draw pictures to help you think.<br />
* lots of counterintuitive results, which makes it fun and also teaches you to rewire your intuitions.<br />
* non-vacuous examples of vacuous conditions, like clopen sets :)<br />
<br />
Cons:<br />
<br />
* i think general topology only makes much sense after going through metric spaces/point-set topology<br />
* even metric spaces might be too hard unless you've done some real analysis in R or R^n first. i'm not sure about this. (i first went through parts of spivak and tao which do things in R, so i have no idea how i would have done if i had started with metric spaces first. I didn't like how folland's advanced calculus does things in R^n though, or maybe i just don't like folland's style. my thinking was something like "if you're gonna do the R->R^n abstraction, why not go the full way to metric spaces?")<br />
* i think one reason topology is harder to understand than metric spaces is that the usual examples of topological spaces (except the indiscrete/trivial topology) are all metrizable, so the theory of topological spaces just reduces to the theory of metric spaces. (?)<br />
<br />
==Euclidean geometry==<br />
<br />
this is the original proof-based branch of math! i think a lot of kids (including me) got exposure to this in middle school/high school. it doesn't seem to be treated at the undergraduate level though, and i've never understood why.<br />
<br />
there are things like 'euclid: the game' now that make this more fun.<br />
<br />
==Set theory==<br />
<br />
i'm thinking of things like halmos's ''Naive Set Theory'', and chapters 3 and 8 in tao's ''Analysis I'' as examples, ''not'' the kind of set theory you find in books on mathematical logic.<br />
<br />
pros:<br />
<br />
* used everywhere<br />
* close connection to propositional/predicate logic, such that learning how to work with sets at the same time can highlight those connections<br />
* some good theorems like schroeder-bernstein theorem<br />
<br />
cons:<br />
<br />
* boring/much of it is devoid of content in a sense<br />
* tricky philosophical issues like axiom of choice, which are difficult for a beginner</div>IssaRicehttps://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Subfield_of_math_that_is_best_for_introducing_proofs&diff=3280User:IssaRice/Subfield of math that is best for introducing proofs2021-05-31T07:42:57Z<p>IssaRice: /* Number theory */</p>
<hr />
<div>This page compares subfields of math by how good they are as an introduction to proofs and rigor. In other words, the question to ask is "What is the first proof-based math course one should take?"<br />
<br />
Everything here should be viewed from the lens of ''good pedagogy''. This isn't directly about what the most beautiful branch is, or what's most interesting. It's also not about the elegance of the finished theory; for pedagogy, it's often a good idea to do things in a roundabout way.<br />
<br />
I think I personally got my start at doing proofs by three routes: intro to proofs, discrete math, and real analysis. i think i pretty haphazardly jumped between various books, following my curiosity.<br />
<br />
==Intro to proofs==<br />
<br />
By "intro to proofs", i mean the kind of stuff that is in velleman's ''How to Prove It''. so like, basic logic, proof strategies, quantifiers, sets, relations, functions, mathematical induction. maybe countable vs uncountable infinity.<br />
<br />
Pros:<br />
<br />
* you're learning one thing at a time. With something like real analysis, you have to learn two things at a time: how to write a proof, and the actual content of real analysis. But with intro to proofs, you just learn how to write a proof. Most of the problems are artificial; they are designed specifically to get you to write good proofs, to show you common pitfalls.<br />
* i do like how some books here (i think rosen's discrete math book?) show how to prove things in first order logic notation. it's not literally doing derivations in FOL, but it's more robotic than how you would write a proof in prose. i think it's valuable to see this, and "intro to proofs" contexts are the main ones where i have seen this done. I can't think of why it can't be done in other contexts though (other than lack of time). An example of what i mean by this is something like writing the definition of limit like <math>\forall \varepsilon > 0 \ \exists N \ \forall n \geq N \ |a_n - L| < \varepsilon</math>, and then negating this statement by reversing all the quantifiers in a robotic way.<br />
* Since this part is so "easy" in some sense (you don't need to build up lots of complicated objects), it is possible to do a lot of this stuff in Lean. That means you get a more "gamified" feeling of doing this kind of math.<br />
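To make the Lean point concrete, here is a small sketch in Lean 4 syntax of the kind of exercise that formalizes easily (these particular exercises are my own illustrative picks, not from any specific book):

```lean
-- A standard "intro to proofs" exercise: transitivity of implication.
example (P Q R : Prop) (hpq : P → Q) (hqr : Q → R) : P → R :=
  fun hp => hqr (hpq hp)

-- One direction of De Morgan, provable constructively:
-- from ¬(P ∨ Q), conclude ¬P ∧ ¬Q.
example (P Q : Prop) (h : ¬(P ∨ Q)) : ¬P ∧ ¬Q :=
  ⟨fun hp => h (Or.inl hp), fun hq => h (Or.inr hq)⟩
```

Exercises like these are pure logic manipulation, which is exactly why they work as a first step: the proof assistant checks every move, with no analysis-style subtlety in the way.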
<br />
Cons:<br />
<br />
* There's not much "meat" in here? Personally I really enjoyed working through basic logic, but it does feel pretty empty at the end of the day. (There is stuff like pursuing what material implication means, which I spent a lot of time on as a teenager, but I doubt most people care about it.)<br />
<br />
==Real analysis==<br />
<br />
Pros:<br />
<br />
* students are already familiar with the basic objects of study (assuming they have taken calculus), including the real numbers, continuous functions, etc. so less motivation is needed for explaining why these objects are interesting to study<br />
* lots of crazy things happen in real analysis (see Abbott's book or chapter 1 of Tao's book) that motivate the need for rigor<br />
<br />
Cons:<br />
<br />
* Lots of subtle stuff happens, which might make it particularly challenging as an introduction to proofs<br />
* Bloch's ''The Real Numbers and Real Analysis'' (p. xxiv) mentions nested quantifiers as the differentiating factor that makes analysis harder than linear or abstract algebra for beginner students.<br />
* Bloch also mentions the need to distinguish between scratch work and actual proof as another unique difficulty.<br />
<br />
There is a [https://www.greaterwrong.com/posts/Ym5k6bbwAFPqb2syt/the-value-of-learning-mathematical-proof/comment/YsePZH9s9BMcs5PtZ lesswrong thread] about this. Some books, like Tao's, avoid to some extent the problems mentioned there by starting with the natural numbers, integers, rationals, and basic set theory. It's only about halfway through the book that the real numbers are even introduced. This gives the student some time to get used to writing proofs in a discrete domain.<br />
<br />
==Complex analysis==<br />
<br />
Everybody says complex analysis is so beautiful, but I don't think I've ever seen it used as a first course in doing proofs. Maybe there is a good reason?<br />
<br />
==Linear algebra==<br />
<br />
Pros:<br />
<br />
* objects of study (linear maps) are simple<br />
* there aren't a lot of paradoxical/unintuitive things<br />
* knowledge of linear algebra helps in many places<br />
<br />
Cons:<br />
<br />
* Boring? I think the SVD might be the only interesting theorem.<br />
* Every finite-dimensional real vector space is isomorphic to R^n, so vector spaces are not a good example of abstraction (Tim Gowers makes this point somewhere)<br />
* Almost every book sucks?<br />
<br />
==Abstract algebra==<br />
<br />
Pros:<br />
<br />
* unsolvability of the quintic might be a good target to work toward<br />
* Although algebraic structures are "abstract", it's possible to give many concrete finite examples that build intuition. When examples are finite, you can "see everything"/specify things completely (by listing where every input maps) in a way you can't for, e.g., a continuous function.<br />
<br />
Cons:<br />
<br />
* Boring? A lot of the introductory material feels like "what's the point of this?" Why care about groups, subgroups, normal subgroups?<br />
* Something that's kinda funky about intro group theory: many of the non-trivial proofs are actually about number theory. Because results like Lagrange's theorem are phrased in terms of divisibility, the practice problems are phrased that way too. Same with results like x^n=e implying that n is a multiple of ord(x), and with groups of prime order. And of course there's the cyclic groups <=> modular arithmetic connection. So it's like: you thought you were in here to learn algebra, but in fact you're being forced to work through a bunch of basic number theory. And since the algebra book isn't a number theory book, it's not exactly gentle/pedantic/rigorous/self-contained about the number theory it teaches, so you get an "intro to number theory" aspect that's actually a mickey mouse/dumbed down/not-the-real-thing experience. My current guess in light of this is that it's best to cover basic number theory in very rigorous detail ''before'' starting abstract algebra. That way, you can focus on the algebra parts without getting sidetracked into number theory.<br />
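The overlap is easy to see computationally. As a toy illustration (my own hypothetical example): in the multiplicative group <math>(\mathbf Z/p\mathbf Z)^\times</math>, the order of every element divides the group's order p-1 by Lagrange's theorem, which is Fermat's little theorem wearing group-theory clothes.

```python
# In (Z/7Z)* = {1, ..., 6} under multiplication mod 7, compute each
# element's order and observe that it divides |group| = p - 1 = 6.
p = 7
group = range(1, p)

def order(x, p):
    """Smallest n >= 1 with x^n = 1 (mod p)."""
    n, acc = 1, x % p
    while acc != 1:
        acc = (acc * x) % p
        n += 1
    return n

orders = {x: order(x, p) for x in group}
```

Every value in `orders` divides 6, so in particular <math>x^6 \equiv 1 \pmod 7</math> for every x in the group: Fermat's little theorem.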
<br />
==Computability and logic==<br />
<br />
Pros:<br />
<br />
* Proofs in analysis and linear algebra (and probably other places too) often make use of "algorithmic" ideas, e.g. the bisection proof of the Bolzano–Weierstrass theorem. There is a sense in which we like our proofs to be computable, but without learning computability it's hard to express what we even mean by this. I think Stillwell's ''Reverse Mathematics'' talks about this issue?<br />
* several interesting theorems, including equivalence of semi-decidable and recursively enumerable sets, the existence of non-r.e. sets, various examples of diagonalization.<br />
<br />
Cons:<br />
<br />
* This isn't usually taught as an intro-to-proofs subject, so the textbooks tend to assume prior proof experience rather than writing for complete beginners. In other words, the teaching material might be better in other subfields.<br />
* some of the material seems like unhelpful pedantry, like how interpretations are defined, and the proof of the soundness theorem. it takes a rare kind of mind (or substantial experience) to even realize that the soundness theorem is important.<br />
* might be too meta as an introduction to proofs, e.g. always distinguishing between object level and meta level<br />
<br />
==Discrete math==<br />
<br />
Pros:<br />
<br />
* many topics to pick and choose from<br />
* many interesting topics that can be covered that have wide applicability<br />
* topics tend to be concrete, so you can easily play with them (e.g. automata, finite graphs)<br />
** this also means many of the topics are amenable to programmatic treatment -- you can write toy programs to test your theories and so on.<br />
* no philosophical issues (?)<br />
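As an example of the "write toy programs to test your theories" point, here is a minimal sketch (my own illustrative experiment, not from any particular book): brute-force-check the handshaking lemma — the sum of vertex degrees in a finite graph equals twice the number of edges — on random graphs before trying to prove it.

```python
import itertools
import random

def handshake_holds(n, p, rng):
    """Sample a random graph on n vertices (each possible edge kept with
    probability p) and check: sum of degrees == 2 * number of edges."""
    edges = [e for e in itertools.combinations(range(n), 2) if rng.random() < p]
    degrees = [sum(v in e for e in edges) for v in range(n)]
    return sum(degrees) == 2 * len(edges)

rng = random.Random(0)  # fixed seed so the experiment is reproducible
results = [handshake_holds(rng.randint(1, 8), 0.5, rng) for _ in range(100)]
```

A hundred random trials all passing is of course not a proof, but this kind of quick experiment is exactly what makes discrete math friendly to play with.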
<br />
Cons:<br />
<br />
* There's something about proofs in discrete math/abstract algebra: when you can see the whole thing (because it's finite), it becomes really tempting to call it "obvious" and handwave through the proof.<br />
* Are there any really interesting results that are also easy enough to understand? Like a "crowning jewel" type theorem?<br />
<br />
==Number theory==<br />
<br />
I wish there were a book called something like "extremely basic number theory in extremely rigorous detail" that covered the most basic results in number theory that appear over and over again in other areas of math: gcd/lcm properties, the infinitude of primes, uniqueness of prime factorization, and stuff like that (anything else?). The reason is that a few other fields (like abstract algebra) make heavy use of these results, but the books on algebra are unwilling to treat number theory at the level of rigorous detail that I think is good for someone starting out with proofs.<br />
<br />
Pros:<br />
<br />
* students are already intimately familiar with the integers.<br />
* the problems in number theory are easy to state.<br />
* good concepts like gcd that "feel obvious" but require careful treatment. (e.g. what is gcd(0,0)?)<br />
* non-trivial results like Bézout's lemma and Euclid's algorithm for finding the gcd.<br />
* number theory can be used to branch out into many different topics/fields: gcd can be used to introduce the notion of algorithm, and the material around Fermat's little theorem can be used to talk about group theory<br />
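The gcd points above are easy to make concrete. A minimal sketch of Euclid's algorithm and Bézout's lemma (under the common convention gcd(0, 0) = 0, which is one reasonable answer to the question raised above):

```python
def gcd(a, b):
    """Euclid's algorithm: repeatedly replace (a, b) by (b, a mod b).
    Under this convention gcd(0, 0) = 0."""
    while b != 0:
        a, b = b, a % b
    return abs(a)

def bezout(a, b):
    """Extended Euclid: return (g, x, y) with a*x + b*y == g == gcd(a, b),
    witnessing Bezout's lemma for nonnegative a, b."""
    old_r, r = a, b
    old_x, x = 1, 0
    old_y, y = 0, 1
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_x, x = x, old_x - q * x
        old_y, y = y, old_y - q * y
    return old_r, old_x, old_y
```

The algorithm doubles as a first brush with the "notion of algorithm" point: termination (b strictly decreases) and correctness (gcd(a, b) = gcd(b, a mod b)) are both small, honest proofs.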
<br />
Cons:<br />
<br />
* i'm not sure what a good book for this is.<br />
<br />
==Topology==<br />
<br />
I think I should split this up into metric spaces vs general topology, because I have different opinions about them.<br />
<br />
Pros:<br />
<br />
* I find metric spaces fun. They are a good example of abstraction, and you can often draw pictures to help you think.<br />
* lots of counterintuitive results, which makes it fun and also teaches you to rewire your intuitions.<br />
<br />
Cons:<br />
<br />
* I think general topology only makes much sense after going through metric spaces/point-set topology.<br />
* Even metric spaces might be too hard unless you've done some real analysis in R or R^n first. I'm not sure about this. (I first went through parts of Spivak and Tao, which do things in R, so I have no idea how I would have done had I started with metric spaces. I didn't like how Folland's advanced calculus does things in R^n, though, or maybe I just don't like Folland's style. My thinking was something like: "if you're going to do the R -> R^n abstraction, why not go all the way to metric spaces?")<br />
* I think one reason general topology is harder to understand than metric spaces is that the usual introductory examples of topological spaces (except the indiscrete/trivial topology) are all metrizable, so the theory of topological spaces seems to just reduce to the theory of metric spaces. (?)<br />
<br />
==Euclidean geometry==<br />
<br />
This is the original proof-based branch of math! I think a lot of kids (including me) got exposure to it in middle school/high school. It doesn't seem to be treated at the undergraduate level though, and I've never understood why.<br />
<br />
there are things like 'euclid: the game' now that make this more fun.</div>IssaRicehttps://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Linear_algebra/A_matrix_is_only_similar_to_itself_if_and_only_if_it_is_a_scalar_multiple_of_the_identity_matrix&diff=3279User:IssaRice/Linear algebra/A matrix is only similar to itself if and only if it is a scalar multiple of the identity matrix2021-04-01T23:26:10Z<p>IssaRice: </p>
<hr />
<div>("matrix is only similar to itself" means that the linear map it represents has the same matrix in every single basis; equivalently, <math>QAQ^{-1} = A</math> for every invertible <math>Q</math>)<br />
<br />
The identity matrix has the remarkable property that it is only similar to itself: if A is similar to I, then A=I. Why? We have <math>A = QIQ^{-1}</math> for some invertible Q, by definition of matrix similarity, and the right-hand side simplifies to I.<br />
<br />
Are there any other matrices with this property? If <math>\lambda \in \mathbf R</math>, then for <math>\lambda I</math> we have <math>Q(\lambda I)Q^{-1} = \lambda (Q I Q^{-1}) = \lambda I</math> so any scalar multiple of the identity matrix also has this property.<br />
<br />
Are there any others? It turns out there aren't! We want to show that if a matrix A does not have the form <math>\lambda I</math>, then there is a distinct matrix B that it is similar to. More precisely, suppose <math>A \ne \lambda I</math> for any <math>\lambda \in \mathbf R</math>. Then there exists a matrix <math>B \ne A</math> and an invertible matrix Q such that <math>QAQ^{-1} = B</math>.<br />
<br />
We split the proof into two cases. (These are exhaustive: if A were diagonal with all diagonal entries equal, it would have the form <math>\lambda I</math>.)<br />
<br />
# Suppose A is not a diagonal matrix. Then there is an off-diagonal entry that is non-zero, say <math>a_{jk}</math> (row j, column k, j!=k). Let E be the elementary matrix that multiplies row j by 2. Then <math>E^{-1}</math>, when applied from the right, divides column j by 2. Now if we consider <math>EAE^{-1}</math>, its entry j,k will be <math>2a_{jk} \ne a_{jk}</math>. So <math>EAE^{-1} \ne A</math>, even though the two matrices are similar.<br />
# Suppose A is a diagonal matrix where not all entries on the diagonal are equal. Pick two indices j and k such that <math>a_{jj} \ne a_{kk}</math>. Let E be the elementary matrix that transposes row j and row k. Then <math>E^{-1}</math>, when applied from the right, transposes column j with column k. Thus <math>EAE^{-1}</math> will be the matrix with <math>a_{jj}</math> and <math>a_{kk}</math> swapped, but all other entries staying the same. So we see that <math>EAE^{-1} \ne A</math>, even though the two matrices are similar.</div>
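Case 1 can be checked numerically on a concrete instance. A small sketch (my own hypothetical 2x2 example, using plain Python lists so nothing beyond the standard library is needed):

```python
def matmul(X, Y):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

# A is not diagonal: the (1,2) entry 3 is a nonzero off-diagonal entry.
A = [[1, 3],
     [0, 1]]
# E = elementary matrix "multiply row 1 by 2"; its inverse divides row 1 by 2.
E     = [[2, 0], [0, 1]]
E_inv = [[0.5, 0], [0, 1]]

B = matmul(matmul(E, A), E_inv)   # B is similar to A by construction
# Conjugation doubled the off-diagonal entry (3 -> 6) while leaving the
# diagonal untouched, so B differs from A even though they are similar.
```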
<hr />
<div>This page compares subfields of math by how good they are as an introduction to proofs and rigor. In other words, the question to ask is "What is the first proof-based math course one should take?"<br />
<br />
Everything here should be viewed from the lens of ''good pedagogy''. This isn't directly about what the most beautiful branch is, or what's most interesting. It's also not about the elegance of the finished theory; for pedagogy, it's often a good idea to do things in a roundabout way.<br />
<br />
I think I personally got my start at doing proofs by three routes: intro to proofs, discrete math, and real analysis. i think i pretty haphazardly jumped between various books, following my curiosity.<br />
<br />
==Intro to proofs==<br />
<br />
By "intro to proofs", i mean the kind of stuff that is in velleman's ''How to Prove It''. so like, basic logic, proof strategies, quantifiers, sets, relations, functions, mathematical induction. maybe countable vs uncountable infinity.<br />
<br />
Pros:<br />
<br />
* you're learning one thing at a time. With something like real analysis, you have to learn two things at a time: how to write a proof, and the actual content of real analysis. But with intro to proofs, you just learn how to write a proof. Most of the problems are artificial; they are designed specifically to get you to write good proofs, to show you common pitfalls.<br />
* i do like how some books here (i think rosen's discrete math book?) show how to prove things in first order logic notation. it's not literally doing derivations in FOL, but it's more robotic than how you would write a proof in prose. i think it's valuable to see this, and "intro to proofs" contexts are the main ones where i have seen this done. I can't think of why it can't be done in other contexts though (other than lack of time). An example of what i mean by this is something like writing the definition of limit like <math>\forall \varepsilon > 0 \ \exists N \ \forall n \geq N \ |a_n - L| < \varepsilon</math>, and then negating this statement by reversing all the quantifiers in a robotic way.<br />
* since this part is so "easy" in some sense (you don't need to build up lots of complicated objects), it is possible to do a lot of this stuff in Lean. that means you get a more "gamified" feeling of doing this kind of math.<br />
<br />
Cons:<br />
<br />
* there's not much "meat" in here? personally i really enjoyed working through basic logic, but it does feel pretty empty at the end of the day. (there is stuff like pursuing what material implication means that i spent a lot of time on as a teenager, but i doubt most people care about it.)<br />
<br />
==Real analysis==<br />
<br />
Pros:<br />
<br />
* students are already familiar with the basic objects of study (assuming they have taken calculus), including the real numbers, continuous functions, etc. so less motivation is needed for explaining why these objects are interesting to study<br />
* lots of crazy things happen in real analysis (see Abbott's book or chapter 1 of Tao's book) that motivate the need for rigor<br />
<br />
Cons:<br />
<br />
* lots of subtle stuff happens, which might make it particularly challenging as an introduction to proofs<br />
* Bloch's ''The Real Numbers and Real Analysis'' (p. xxiv) mentions nested quantifiers as the differentiating factor that makes analysis harder than linear or abstract algebra for beginner students.<br />
* Bloch also mentions the need to distinguish between scratch work and actual proof as another unique difficulty.<br />
<br />
there is a [https://www.greaterwrong.com/posts/Ym5k6bbwAFPqb2syt/the-value-of-learning-mathematical-proof/comment/YsePZH9s9BMcs5PtZ lesswrong thread] about this. Some books, like Tao's, avoid to some extent the problems mentioned there by starting with the natural numbers, integers, rationals, and basic set theory. It's only about halfway through the book that the real numbers are even introduced. This gives the student some time to get used to writing proofs in a discrete domain.<br />
<br />
==Complex analysis==<br />
<br />
everybody says complex analysis is so beautiful and stuff, but i don't think i've ever seen it used as a first course in doing proofs. maybe there is a good reason?<br />
<br />
==Linear algebra==<br />
<br />
Pros:<br />
<br />
* objects of study (linear maps) are simple<br />
* there aren't a lot of paradoxical/unintuitive things<br />
* knowledge of linear algebra helps in many places<br />
<br />
Cons:<br />
<br />
* boring? i think SVD might be the only interesting theorem.<br />
* everything is isomorphic to R^n so vector spaces are not a good example of abstraction (Tim Gowers makes this point somewhere)<br />
* almost every book sucks?<br />
<br />
==Abstract algebra==<br />
<br />
Pros:<br />
<br />
* unsolvability of the quintic might be a good target to work toward<br />
* although algebraic structures are "abstract", it's possible to give many concrete finite examples that build intuition. when examples are finite, you can "see everything"/specify things completely in a way you can't e.g. specify a continuous function completely (by giving a list of where inputs map).<br />
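as a tiny illustration of the "see everything" point, here is a toy sketch (my own pick of example, not from any particular book) that exhaustively checks, for the additive group <math>\mathbb{Z}/12\mathbb{Z}</math>, that every element's order divides the group's order -- a corollary of lagrange's theorem:

```python
# For the additive group Z/12Z, check exhaustively that ord(x) divides |G| = 12.
n = 12

def order(x):
    """Smallest k >= 1 with k*x == 0 (mod n)."""
    k, y = 1, x % n
    while y != 0:
        y = (y + x) % n
        k += 1
    return k

for x in range(n):
    # corollary of Lagrange's theorem, verified by brute force
    assert n % order(x) == 0
```

this kind of exhaustive check is exactly what you ''can't'' do for, say, an arbitrary continuous function.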
<br />
Cons:<br />
<br />
* boring? i think a lot of the introductory stuff feels like "what's the point of this?" why care about groups, subgroups, normal subgroups?<br />
* something that's kinda funky about intro group theory: many of the non-trivial proofs are actually about number theory. like, because things like lagrange's theorem are phrased in terms of multiples of numbers, the practice problems are also phrased that way. same thing with stuff like x^n=e implies n is a multiple of ord(x), and talking about groups of prime order. also of course, cyclic groups <=> modular arithmetic connection. so yeah. it's like, you thought you were in here to learn algebra, but in fact, you're just being forced to work through a bunch of basic number theory. and since the algebra book isn't a number theory book, it's not exactly gentle/pedantic/rigorous/self-contained about the number theory it teaches. So you get an "intro to number theory" aspect but it's actually sort of a mickey mouse/dumbed down/not-the-real-thing experience. My current guess in light of this is that it's best to cover basic number theory in very rigorous detail ''before'' you start doing abstract algebra. That way, you can focus on the algebra parts without getting sidetracked into number theory.<br />
<br />
==Computability and logic==<br />
<br />
Pros:<br />
<br />
* proofs in analysis and linear algebra (and probably other places too) often make use of "algorithmic" ideas, e.g. the bisection proof of bolzano-weierstrass theorem. there is a sense in which we like our proofs to be computable, but without learning computability it's hard to express what we even mean by this. I think Stillwell's ''Reverse Mathematics'' talks about this issue?<br />
* several interesting theorems, including equivalence of semi-decidable and recursively enumerable sets, the existence of non-r.e. sets, various examples of diagonalization.<br />
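a minimal sketch of the bisection idea, in its simplest root-finding (intermediate value theorem) form rather than the bolzano-weierstrass form -- the condition "this half contains infinitely many terms" isn't directly checkable by a program, which is itself the kind of computability issue meant above:

```python
# Bisection in its simplest, fully computable form: root-finding for a
# continuous f with f(lo) <= 0 <= f(hi) (intermediate value theorem).
def bisect(f, lo, hi, tol=1e-12):
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) <= 0:
            lo = mid  # keep the half where the sign change must live
        else:
            hi = mid
    return (lo + hi) / 2

root = bisect(lambda x: x * x - 2, 0.0, 2.0)
assert abs(root - 2 ** 0.5) < 1e-9
```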
<br />
Cons:<br />
<br />
* this isn't usually taken to be an intro-to-proofs subject, so the textbooks tend to assume the reader already has some experience with proofs. In other words, the teaching material in other subfields might be better suited to beginners.<br />
* some of the material seems like unhelpful pedantry, like how interpretations are defined, and the proof of the soundness theorem. it takes a rare kind of mind (or substantial experience) to even realize that the soundness theorem is important.<br />
* might be too meta as an introduction to proofs, e.g. always distinguishing between object level and meta level<br />
<br />
==Discrete math==<br />
<br />
Pros:<br />
<br />
* many topics to pick and choose from<br />
* many interesting topics that can be covered that have wide applicability<br />
* topics tend to be concrete, so you can easily play with them (e.g. automata, finite graphs)<br />
** this also means many of the topics are amenable to programmatic treatment -- you can write toy programs to test your theories and so on.<br />
* no philosophical issues (?)<br />
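to make the "toy programs to test your theories" point concrete, here is one possible sketch (the handshake lemma is an arbitrary pick on my part): check on random graphs that the degrees sum to twice the edge count:

```python
import random

def random_graph(n, p):
    """Random undirected simple graph on vertices 0..n-1; edges as frozensets."""
    return {frozenset((u, v))
            for u in range(n)
            for v in range(u + 1, n)
            if random.random() < p}

def degree_sum(n, edges):
    deg = [0] * n
    for e in edges:
        for v in e:
            deg[v] += 1
    return sum(deg)

# Handshake lemma: the sum of degrees equals twice the number of edges.
for _ in range(100):
    n = random.randint(1, 20)
    edges = random_graph(n, 0.3)
    assert degree_sum(n, edges) == 2 * len(edges)
```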
<br />
Cons:<br />
<br />
* there's something about proofs in discrete math/abstract algebra where when you can see the whole thing (because it's finite), it becomes really tempting to say that it's "obvious" and handwave through a proof.<br />
* are there any really interesting results that are also easy enough to understand? like a "crowning jewel" type theorem?<br />
<br />
==Number theory==<br />
<br />
I wish there were a book called something like "extremely basic number theory in extremely rigorous detail" that covered the very most basic results in number theory that appear over and over again in other areas of math. so things like gcd/lcm properties, the infinitude of primes, uniqueness of prime factorization, and stuff like that (anything else?). The reason for this is that a few other fields (like abstract algebra) make use of a lot of these results, but the books on algebra are unwilling to treat number theory at the level of rigorous detail that i think is good for someone starting out with proofs.<br />
<br />
Pros:<br />
<br />
* students are already intimately familiar with the integers.<br />
* the problems in number theory are easy to state.<br />
* good concepts like gcd that "feel obvious" but require careful treatment. (e.g. what is gcd(0,0)?)<br />
* non-trivial results like bezout's lemma, euclid's algorithm for finding gcd.<br />
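a sketch of the kind of careful treatment meant here (my own toy version): euclid's algorithm plus the extended version producing bezout coefficients, with the gcd(0,0) = 0 convention handled explicitly:

```python
# Euclid's algorithm; note the conventional edge case gcd(0, 0) == 0.
def gcd(a, b):
    while b != 0:
        a, b = b, a % b
    return abs(a)

# Extended Euclid for nonnegative a, b: returns (g, x, y) with
# a*x + b*y == g == gcd(a, b)  (Bezout's lemma).
def extended_gcd(a, b):
    if b == 0:
        return (a, 1, 0)
    g, x, y = extended_gcd(b, a % b)
    return (g, y, x - (a // b) * y)

assert gcd(0, 0) == 0
g, x, y = extended_gcd(240, 46)
assert g == gcd(240, 46) == 2 and 240 * x + 46 * y == 2
```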
<br />
Cons:<br />
<br />
* i'm not sure what a good book for this is.<br />
<br />
==Topology==<br />
<br />
i think i should split this up into metric spaces vs general topology, because i have different opinions about them.<br />
<br />
Pros:<br />
<br />
* i find metric spaces fun. it's a good example of abstraction. you can often draw pictures to help you think.<br />
* lots of counterintuitive results, which makes it fun and also teaches you to rewire your intuitions.<br />
<br />
Cons:<br />
<br />
* i think general topology only makes much sense after going through metric spaces/point-set topology<br />
* even metric spaces might be too hard unless you've done some real analysis in R or R^n first. i'm not sure about this. (i first went through parts of spivak and tao which do things in R, so i have no idea how i would have done if i had started with metric spaces first. I didn't like how folland's advanced calculus does things in R^n though, or maybe i just don't like folland's style. my thinking was something like "if you're gonna do the R->R^n abstraction, why not go the full way to metric spaces?")<br />
* i think one reason topology is harder to understand than metric spaces is that the usual examples of topological spaces (except the indiscrete/trivial topology) are all metrizable, so the theory of topological spaces just reduces to the theory of metric spaces. (?)<br />
<br />
==Euclidean geometry==<br />
<br />
this is the original proof-based branch of math! i think a lot of kids (including me) got exposure to this in middle school/high school. it doesn't seem to be treated at the undergraduate level though, and i've never understood why.<br />
<br />
there are things like 'euclid: the game' now that make this more fun.</div>
<hr />
<div>This page compares subfields of math by how good they are as an introduction to proofs and rigor. In other words, the question to ask is "What is the first proof-based math course one should take?"<br />
<br />
Everything here should be viewed from the lens of ''good pedagogy''. This isn't directly about what the most beautiful branch is, or what's most interesting. It's also not about the elegance of the finished theory; for pedagogy, it's often a good idea to do things in a roundabout way.<br />
<br />
I think I personally got my start at doing proofs by three routes: intro to proofs, discrete math, and real analysis. i think i pretty haphazardly jumped between various books, following my curiosity.<br />
<br />
==Intro to proofs==<br />
<br />
By "intro to proofs", i mean the kind of stuff that is in velleman's ''How to Prove It''. so like, basic logic, proof strategies, quantifiers, sets, relations, functions, mathematical induction. maybe countable vs uncountable infinity.<br />
<br />
Pros:<br />
<br />
* you're learning one thing at a time. With something like real analysis, you have to learn two things at a time: how to write a proof, and the actual content of real analysis. But with intro to proofs, you just learn how to write a proof. Most of the problems are artificial; they are designed specifically to get you to write good proofs, to show you common pitfalls.<br />
* i do like how some books here (i think rosen's discrete math book?) show how to prove things in first order logic notation. it's not literally doing derivations in FOL, but it's more robotic than how you would write a proof in prose. i think it's valuable to see this, and "intro to proofs" contexts are the main ones where i have seen this done. I can't think of why it can't be done in other contexts though (other than lack of time). An example of what i mean by this is something like writing the definition of limit like <math>\forall \varepsilon > 0 \ \exists N \ \forall n \geq N \ |a_n - L| < \varepsilon</math>, and then negating this statement by reversing all the quantifiers in a robotic way.<br />
* since this part is so "easy" in some sense (you don't need to build up lots of complicated objects), it is possible to do a lot of this stuff in Lean. that means you get a more "gamified" feeling of doing this kind of math.<br />
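<br />
an example of the "robotic" negation mentioned above (my own worked example, not from any particular book): to negate the limit definition, flip each quantifier and negate the innermost predicate, mechanically:<br />
<br />
<math>\neg\big(\forall \varepsilon > 0 \ \exists N \ \forall n \geq N \ |a_n - L| < \varepsilon\big) \iff \exists \varepsilon > 0 \ \forall N \ \exists n \geq N \ |a_n - L| \geq \varepsilon</math><br />
<br />
the right-hand side is exactly the statement "<math>(a_n)</math> does not converge to <math>L</math>", and no creativity was needed to produce it.<br />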
<br />
Cons:<br />
<br />
* there's not much "meat" in here? personally i really enjoyed working through basic logic, but it does feel pretty empty at the end of the day. (there is stuff like pursuing what material implication means that i spent a lot of time on as a teenager, but i doubt most people care about it.)<br />
<br />
==Real analysis==<br />
<br />
Pros:<br />
<br />
* students are already familiar with the basic objects of study (assuming they have taken calculus), including the real numbers, continuous functions, etc. so less motivation is needed for explaining why these objects are interesting to study<br />
* lots of crazy things happen in real analysis (see Abbott's book or chapter 1 of Tao's book) that motivate the need for rigor<br />
<br />
Cons:<br />
<br />
* lots of subtle stuff happens, which might make it particularly challenging as an introduction to proofs<br />
* Bloch's ''The Real Numbers and Real Analysis'' (p. xxiv) mentions nested quantifiers as the differentiating factor that makes analysis harder than linear or abstract algebra for beginner students.<br />
* Bloch also mentions the need to distinguish between scratch work and actual proof as another unique difficulty.<br />
<br />
there is a [https://www.greaterwrong.com/posts/Ym5k6bbwAFPqb2syt/the-value-of-learning-mathematical-proof/comment/YsePZH9s9BMcs5PtZ lesswrong thread] about this. Some books, like Tao's, avoid to some extent the problems mentioned there by starting with the natural numbers, integers, rationals, and basic set theory. It's only about halfway through the book that the real numbers are even introduced. This gives the student some time to get used to writing proofs in a discrete domain.<br />
<br />
==Linear algebra==<br />
<br />
Pros:<br />
<br />
* objects of study (linear maps) are simple<br />
* there aren't a lot of paradoxical/unintuitive things<br />
* knowledge of linear algebra helps in many places<br />
<br />
Cons:<br />
<br />
* boring? i think SVD might be the only interesting theorem.<br />
* every finite-dimensional vector space is isomorphic to R^n, so vector spaces are not a good example of abstraction (Tim Gowers makes this point somewhere)<br />
* almost every book sucks?<br />
<br />
==Abstract algebra==<br />
<br />
Pros:<br />
<br />
* unsolvability of the quintic might be a good target to work toward<br />
* although algebraic structures are "abstract", it's possible to give many concrete finite examples that build intuition. when examples are finite, you can "see everything"/specify things completely (by listing where every input maps), in a way you can't for, e.g., a continuous function.<br />
<br />
Cons:<br />
<br />
* boring? i think much of the introductory stuff feels like "what's the point of this?" why care about groups, subgroups, normal subgroups?<br />
* something that's kinda funky about intro group theory: many of the non-trivial proofs are actually about number theory. like, because things like lagrange's theorem are phrased in terms of multiples of numbers, the practice problems are also phrased that way. same thing with stuff like x^n=e implies n is a multiple of ord(x), and talking about groups of prime order. also of course, cyclic groups <=> modular arithmetic connection. so yeah. it's like, you thought you were in here to learn algebra, but in fact, you're just being forced to work through a bunch of basic number theory. and since the algebra book isn't a number theory book, it's not exactly gentle/pedantic/rigorous/self-contained about the number theory it teaches. So you get an "intro to number theory" aspect but it's actually sort of a mickey mouse/dumbed down/not-the-real-thing experience. My current guess in light of this is that it's best to cover basic number theory in very rigorous detail ''before'' you start doing abstract algebra. That way, you can focus on the algebra parts without getting sidetracked into number theory.<br />
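<br />
to make the lagrange-flavored point above concrete, here is a small python sketch (my own illustration, not from any algebra book) checking that the order of every element of the multiplicative group mod 13 divides the group order 12:<br />
<br />
```python
# toy check of a lagrange-style fact: for every x in a finite group G,
# ord(x) divides |G|. here G is the multiplicative group mod a prime p.
def element_order(x, p):
    """smallest n >= 1 with x^n = 1 (mod p)."""
    n, acc = 1, x % p
    while acc != 1:
        acc = (acc * x) % p
        n += 1
    return n

p = 13
group = range(1, p)  # the units mod 13: {1, ..., 12}
orders = {x: element_order(x, p) for x in group}
assert all((p - 1) % n == 0 for n in orders.values())  # ord(x) divides 12
```
this is exactly the kind of "you're secretly doing number theory" computation the bullet above complains about.<br />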
<br />
==Computability and logic==<br />
<br />
Pros:<br />
<br />
* proofs in analysis and linear algebra (and probably other places too) often make use of "algorithmic" ideas, e.g. the bisection proof of the bolzano-weierstrass theorem. there is a sense in which we like our proofs to be computable, but without learning computability it's hard to express what we even mean by this. I think Stillwell's ''Reverse Mathematics'' talks about this issue?<br />
* several interesting theorems, including equivalence of semi-decidable and recursively enumerable sets, the existence of non-r.e. sets, various examples of diagonalization.<br />
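<br />
a miniature version of diagonalization (my own illustration): given any finite list of 0/1 sequences, build a sequence that disagrees with the i-th one at position i, so it equals none of them:<br />
<br />
```python
# diagonalization in miniature: sequences are given as functions index -> bit.
# the diagonal sequence flips the i-th sequence's i-th bit.
def diagonal(seqs):
    return [1 - seqs[i](i) for i in range(len(seqs))]

seqs = [lambda n: 0, lambda n: 1, lambda n: n % 2]
d = diagonal(seqs)
# d differs from seqs[i] at position i, for every i
assert all(d[i] != seqs[i](i) for i in range(len(seqs)))
```
the same one-line idea drives cantor's theorem and the existence of non-r.e. sets.<br />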
<br />
Cons:<br />
<br />
* this isn't usually taken to be an intro-to-proofs subject, so the textbooks tend not to be written for complete beginners. In other words, the teaching materials for other subfields might be better suited to a first proofs course.<br />
* some of the material seems like unhelpful pedantry, like how interpretations are defined, and the proof of the soundness theorem. it takes a rare kind of mind (or substantial experience) to even realize that the soundness theorem is important.<br />
* might be too meta as an introduction to proofs, e.g. always distinguishing between object level and meta level<br />
<br />
==Discrete math==<br />
<br />
Pros:<br />
<br />
* many topics to pick and choose from<br />
* many interesting topics that can be covered that have wide applicability<br />
* topics tend to be concrete, so you can easily play with them (e.g. automata, finite graphs)<br />
** this also means many of the topics are amenable to programmatic treatment -- you can write toy programs to test your theories and so on.<br />
* no philosophical issues (?)<br />
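<br />
as a sketch of the "toy programs" idea (mine, not from the page): empirically checking the handshake lemma, that the degree sum of any graph is twice its edge count:<br />
<br />
```python
import itertools
import random

# toy experiment for the handshake lemma: sum of degrees = 2 * (number of edges).
def random_graph(n, p, rng):
    """edge set of a random graph on vertices 0..n-1 (each edge kept with prob p)."""
    return {e for e in itertools.combinations(range(n), 2) if rng.random() < p}

rng = random.Random(0)
for _ in range(100):
    edges = random_graph(8, 0.4, rng)
    degree = {v: sum(v in e for e in edges) for v in range(8)}
    assert sum(degree.values()) == 2 * len(edges)  # handshake lemma holds
```
of course a hundred random checks prove nothing, but a failed assert would instantly falsify a wrong conjecture, which is the point.<br />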
<br />
Cons:<br />
<br />
* there's something about proofs in discrete math/abstract algebra: when you can see the whole thing (because it's finite), it becomes really tempting to say that it's "obvious" and handwave through a proof.<br />
* are there any really interesting results that are also easy enough to understand? like a "crowning jewel" type theorem?<br />
<br />
==Number theory==<br />
<br />
I wish there were a book called something like "extremely basic number theory in extremely rigorous detail" that covered the very most basic results in number theory that appear over and over again in other areas of math. so things like gcd/lcm properties, the infinitude of primes, uniqueness of prime factorization, and stuff like that (anything else?). The reason for this is that a few other fields (like abstract algebra) make use of a lot of these results, but the books on algebra are unwilling to treat number theory in the level of rigorous detail that i think is good for someone starting out with proofs.<br />
<br />
Pros:<br />
<br />
* students are already intimately familiar with the integers.<br />
* the problems in number theory are easy to state.<br />
* good concepts like gcd that "feel obvious" but require careful treatment. (e.g. what is gcd(0,0)?)<br />
* non-trivial results like bezout's lemma, euclid's algorithm for finding gcd.<br />
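<br />
a python sketch (mine, not from any particular book, assuming nonnegative inputs) of euclid's algorithm extended to produce bezout coefficients; note that the usual convention gcd(0, 0) = 0 falls out of the base case:<br />
<br />
```python
# extended euclid: returns (g, x, y) with g = gcd(a, b) and a*x + b*y = g.
# the recursion bottoms out at b == 0, which also settles gcd(0, 0) = 0.
def extended_gcd(a, b):
    if b == 0:
        return (a, 1, 0)
    g, x, y = extended_gcd(b, a % b)
    return (g, y, x - (a // b) * y)

g, x, y = extended_gcd(240, 46)
assert (g, 240 * x + 46 * y) == (2, 2)  # bezout: 240x + 46y = gcd(240, 46) = 2
assert extended_gcd(0, 0)[0] == 0       # gcd(0, 0) = 0 by convention
```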
<br />
Cons:<br />
<br />
* i'm not sure what a good book for this is.<br />
<br />
==Topology==<br />
<br />
i think i should split this up into metric spaces vs general topology, because i have different opinions about them.<br />
<br />
Pros:<br />
<br />
* i find metric spaces fun. it's a good example of abstraction. you can draw stuff a lot of times, to help you think.<br />
* lots of counterintuitive results, which makes it fun and also teaches you to rewire your intuitions.<br />
<br />
Cons:<br />
<br />
* i think general topology only makes much sense after going through metric spaces/point-set topology<br />
* even metric spaces might be too hard unless you've done some real analysis in R or R^n first. i'm not sure about this. (i first went through parts of spivak and tao which do things in R, so i have no idea how i would have done if i had started with metric spaces first. I didn't like how folland's advanced calculus does things in R^n though, or maybe i just don't like folland's style. my thinking was something like "if you're gonna do the R->R^n abstraction, why not go the full way to metric spaces?")<br />
* i think one reason topology is harder to understand than metric spaces is that the usual examples of topological spaces (except the indiscrete/trivial topology) are all metrizable, so the theory of topological spaces just reduces to the theory of metric spaces. (?)</div>
<hr />
<div>This page compares subfields of math by how good they are as an introduction to proofs and rigor. In other words, the question to ask is "What is the first proof-based math course one should take?"<br />
<br />
Everything here should be viewed from the lens of ''good pedagogy''. This isn't directly about what the most beautiful branch is, or what's most interesting. It's also not about the elegance of the finished theory; for pedagogy, it's often a good idea to do things in a roundabout way.<br />
<br />
==Intro to proofs==<br />
<br />
By "intro to proofs", i mean the kind of stuff that is in velleman's ''How to Prove It''. so like, basic logic, proof strategies, quantifiers, sets, relations, functions, mathematical induction. maybe countable vs uncountable infinity.<br />
<br />
Pros:<br />
<br />
* you're learning one thing at a time. With something like real analysis, you have to learn two things at once: how to write a proof, and the actual content of real analysis. But with intro to proofs, you just learn how to write a proof. Most of the problems are artificial; they are designed specifically to get you to write good proofs and to show you common pitfalls.<br />
* i do like how some books here (i think rosen's discrete math book?) show how to prove things in first order logic notation. it's not literally doing derivations in FOL, but it's more robotic than how you would write a proof in prose. i think it's valuable to see this, and "intro to proofs" contexts are the main ones where i have seen this done. I can't think of why it can't be done in other contexts though (other than lack of time). An example of what i mean by this is something like writing the definition of limit like <math>\forall \varepsilon > 0 \ \exists N \ \forall n \geq N \ |a_n - L| < \varepsilon</math>, and then negating this statement by reversing all the quantifiers in a robotic way.<br />
* since this part is so "easy" in some sense (you don't need to build up lots of complicated objects), it is possible to do a lot of this stuff in Lean. that means you get a more "gamified" feeling of doing this kind of math.<br />
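<br />
To make the "robotic negation" point concrete, here is that example worked out (my own illustration, not taken from any particular book). Starting from <math>\forall \varepsilon > 0 \ \exists N \ \forall n \geq N \ |a_n - L| < \varepsilon</math>, mechanically flipping each quantifier and negating the innermost statement gives <math>\exists \varepsilon > 0 \ \forall N \ \exists n \geq N \ |a_n - L| \geq \varepsilon</math>, i.e. "the sequence does not converge to <math>L</math>", with no creativity needed at any step.<br />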
<br />
Cons:<br />
<br />
* there's not much "meat" in here? personally i really enjoyed working through basic logic, but it does feel pretty empty at the end of the day. (there is stuff like pursuing what material implication means, which i spent a lot of time on as a teenager, but i doubt most people care about it.)<br />
<br />
==Real analysis==<br />
<br />
Pros:<br />
<br />
* students are already familiar with the basic objects of study (assuming they have taken calculus), including the real numbers, continuous functions, etc. so less motivation is needed for explaining why these objects are interesting to study<br />
* lots of crazy things happen in real analysis (see Abbott's book or chapter 1 of Tao's book) that motivate the need for rigor<br />
<br />
Cons:<br />
<br />
* lots of subtle stuff happens, which might make it particularly challenging as an introduction to proofs<br />
* Bloch's ''The Real Numbers and Real Analysis'' (p. xxiv) mentions nested quantifiers as the differentiating factor that makes analysis harder than linear or abstract algebra for beginner students.<br />
* Bloch also mentions the need to distinguish between scratch work and actual proof as another unique difficulty.<br />
<br />
there is a [https://www.greaterwrong.com/posts/Ym5k6bbwAFPqb2syt/the-value-of-learning-mathematical-proof/comment/YsePZH9s9BMcs5PtZ lesswrong thread] about this. Some books, like Tao's, avoid some of the problems mentioned there by starting with the natural numbers, integers, rationals, and basic set theory. It's only about halfway through the book that the real numbers are even introduced. This gives the student some time to get used to writing proofs in a discrete domain.<br />
<br />
==Linear algebra==<br />
<br />
Pros:<br />
<br />
* objects of study (linear maps) are simple<br />
* there aren't a lot of paradoxical/unintuitive things<br />
* knowledge of linear algebra helps in many places<br />
<br />
Cons:<br />
<br />
* boring? i think SVD might be the only interesting theorem.<br />
* everything is isomorphic to R^n so vector spaces are not a good example of abstraction (Tim Gowers makes this point somewhere)<br />
* almost every book sucks?<br />
<br />
==Abstract algebra==<br />
<br />
Pros:<br />
<br />
* unsolvability of the quintic might be a good target to work toward<br />
* although algebraic structures are "abstract", it's possible to give many concrete finite examples that build intuition. when examples are finite, you can "see everything"/specify things completely (by listing where each input maps) in a way you can't, e.g., specify a continuous function completely.<br />
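<br />
As a minimal sketch of the "see everything" point (my own example, not from any particular book): the group Z_4 under addition mod 4 is small enough that a program can write out its entire Cayley table and brute-force check every group axiom.<br />

```python
# Z_4 under addition mod 4: a group small enough to "see everything"
elements = [0, 1, 2, 3]

def op(a, b):
    return (a + b) % 4

# the complete Cayley table -- this list *is* the group, fully specified
table = [[op(a, b) for b in elements] for a in elements]
print(table)

# brute-force verification of the group axioms, possible only because
# the carrier set is finite
assert all(op(op(a, b), c) == op(a, op(b, c))
           for a in elements for b in elements for c in elements)  # associativity
assert all(op(0, a) == a == op(a, 0) for a in elements)            # identity
assert all(any(op(a, b) == 0 for b in elements) for a in elements) # inverses
```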
<br />
Cons:<br />
<br />
* boring? i think much of the introductory stuff feels like "what's the point of this?" why care about groups, subgroups, normal subgroups?<br />
* something that's kinda funky about intro group theory: many of the non-trivial proofs are actually about number theory. like, because things like lagrange's theorem are phrased in terms of multiples of numbers, the practice problems are also phrased that way. same thing with stuff like x^n=e implies n is a multiple of ord(x), and talking about groups of prime order. also of course, cyclic groups <=> modular arithmetic connection. so yeah. it's like, you thought you were in here to learn algebra, but in fact, you're just being forced to work through a bunch of basic number theory. and since the algebra book isn't a number theory book, it's not exactly gentle/pedantic/rigorous/self-contained about the number theory it teaches. So you get an "intro to number theory" aspect but it's actually sort of a mickey mouse/dumbed down/not-the-real-thing experience. My current guess in light of this is that it's best to cover basic number theory in very rigorous detail ''before'' you start doing abstract algebra. That way, you can focus on the algebra parts without getting sidetracked into number theory.<br />
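<br />
A small sketch of the "secretly number theory" flavor described above (my own example): computing the order of every unit mod 12 and checking the lagrange-style fact that each element's order divides the group's order.<br />

```python
# the group of units mod 12 under multiplication -- the kind of
# "secretly number theory" example intro algebra books lean on
from math import gcd

n = 12
units = [a for a in range(1, n) if gcd(a, n) == 1]  # [1, 5, 7, 11]

def order(a):
    """Smallest k >= 1 with a^k = 1 (mod n)."""
    k, x = 1, a
    while x != 1:
        x = (x * a) % n
        k += 1
    return k

orders = {a: order(a) for a in units}
print(orders)  # {1: 1, 5: 2, 7: 2, 11: 2}

# lagrange's theorem: each element's order divides the group's order
assert all(len(units) % d == 0 for d in orders.values())
```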
<br />
==Computability and logic==<br />
<br />
Pros:<br />
<br />
* proofs in analysis and linear algebra (and probably other places too) often make use of "algorithmic" ideas, e.g. the bisection proof of bolzano-weierstrass theorem. there is a sense in which we like our proofs to be computable, but without learning computability it's hard to express what we even mean by this. I think Stillwell's ''Reverse Mathematics'' talks about this issue?<br />
* several interesting theorems, including equivalence of semi-decidable and recursively enumerable sets, the existence of non-r.e. sets, various examples of diagonalization.<br />
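<br />
To illustrate the "algorithmic" flavor of the bisection proof of bolzano-weierstrass, here is a rough sketch (my own illustration, with the obvious approximations: a finite sample stands in for the infinite sequence, and "the half with infinitely many terms" becomes "the half with the majority of sampled terms"):<br />

```python
# bisection argument from bolzano-weierstrass as an algorithm, run on a
# finite sample of the bounded sequence a_n = (-1)^n * n/(n+1)
terms = [(-1) ** n * n / (n + 1) for n in range(1, 2001)]

lo, hi = -1.0, 1.0
for _ in range(30):
    mid = (lo + hi) / 2
    left = [x for x in terms if x <= mid]
    if len(left) >= len(terms) - len(left):
        terms, hi = left, mid        # keep the left half
    else:
        terms = [x for x in terms if x > mid]
        lo = mid                     # keep the right half

approx = (lo + hi) / 2
print(approx)  # lands near the accumulation point -1
```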
<br />
Cons:<br />
<br />
* this isn't usually taken to be an intro-to-proofs subject, so the textbooks might assume more mathematical maturity than a true beginner has. In other words, the teaching material might be better for other subfields.<br />
* some of the material seems like unhelpful pedantry, like how interpretations are defined, and the proof of the soundness theorem. it takes a rare kind of mind (or substantial experience) to even realize that the soundness theorem is important.<br />
* might be too meta as an introduction to proofs, e.g. always distinguishing between object level and meta level<br />
<br />
==Discrete math==<br />
<br />
Pros:<br />
<br />
* many topics to pick and choose from<br />
* many interesting topics that can be covered that have wide applicability<br />
* topics tend to be concrete, so you can easily play with them (e.g. automata, finite graphs)<br />
* no philosophical issues (?)<br />
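<br />
As an example of how concrete and playable these objects are (my own sketch): a complete DFA, here one accepting binary strings with an even number of 1s, fits in a few lines.<br />

```python
# a DFA accepting binary strings with an even number of 1s; the whole
# machine is just this transition table, so it's easy to play with
delta = {('even', '0'): 'even', ('even', '1'): 'odd',
         ('odd', '0'): 'odd', ('odd', '1'): 'even'}

def accepts(s, state='even'):
    for ch in s:
        state = delta[(state, ch)]
    return state == 'even'

print(accepts('1001'))  # True  (two 1s)
print(accepts('1011'))  # False (three 1s)
```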
<br />
Cons:<br />
<br />
* there's something about proofs in discrete math/abstract algebra where, when you can see the whole thing (because it's finite), it becomes really tempting to say that it's "obvious" and handwave through a proof.<br />
<br />
==Number theory==<br />
<br />
I wish there were a book called something like "extremely basic number theory in extremely rigorous detail" that covered the very most basic results in number theory that appear over and over again in other areas of math. so things like gcd/lcm properties, the infinitude of primes, uniqueness of prime factorization, and stuff like that (anything else?). The reason for this is that a few other fields (like abstract algebra) make use of a lot of these results, but books on algebra are unwilling to treat number theory in the level of rigorous detail that i think is good for someone starting out with proofs.<br />
<br />
Pros:<br />
<br />
* students are already intimately familiar with the integers.<br />
* the problems in number theory are easy to state.<br />
* good concepts like gcd that "feel obvious" but require careful treatment. (e.g. what is gcd(0,0)?)<br />
* non-trivial results like bezout's lemma, euclid's algorithm for finding gcd.<br />
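<br />
As a sketch of the algorithmic side of this (my own example, assuming nonnegative inputs): euclid's algorithm extended to produce the bezout coefficients, with the gcd(0,0) = 0 convention falling out of the base case.<br />

```python
# euclid's algorithm, extended to return the bezout coefficients;
# nonnegative inputs assumed
def extended_gcd(a, b):
    """Return (g, x, y) with g = gcd(a, b) and a*x + b*y = g."""
    if b == 0:
        return (a, 1, 0)  # gcd(a, 0) = a, so in particular gcd(0, 0) = 0
    g, x, y = extended_gcd(b, a % b)
    return (g, y, x - (a // b) * y)

print(extended_gcd(240, 46))  # (2, -9, 47): 240*(-9) + 46*47 = 2
```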
<br />
Cons:<br />
<br />
* i'm not sure what a good book for this is.<br />
<br />
==Topology==<br />
<br />
i think i should split this up into metric spaces vs general topology, because i have different opinions about them.<br />
<br />
Pros:<br />
<br />
* i find metric spaces fun. it's a good example of abstraction. you can draw stuff a lot of times, to help you think.<br />
* lots of counterintuitive results, which makes it fun and also teaches you to rewire your intuitions.<br />
<br />
Cons:<br />
<br />
* i think general topology only makes much sense after going through metric spaces/point-set topology<br />
* even metric spaces might be too hard unless you've done some real analysis in R or R^n first. i'm not sure about this. (i first went through parts of spivak and tao which do things in R, so i have no idea how i would have done if i had started with metric spaces first. I didn't like how folland's advanced calculus does things in R^n though, or maybe i just don't like folland's style. my thinking was something like "if you're gonna do the R->R^n abstraction, why not go the full way to metric spaces?")<br />
* i think one reason topology is harder to understand than metric spaces is that the usual examples of topological spaces (except the indiscrete/trivial topology) are all metrizable, so the theory of topological spaces just reduces to the theory of metric spaces. (?)</div>IssaRice