User:IssaRice/Computability and logic/Diagonalization lemma (Yanofsky method): Difference between revisions

From Machinelearning
This page is just a rewritten version of the [https://arxiv.org/pdf/math/0305282.pdf#section.5 proof of the diagonalization lemma from Yanofsky's paper]. There are some parts I found difficult to understand/gaps that I thought should be filled, so this version will hopefully be much more complete.
 
==Theorem statement and proof==
 
'''Theorem (Diagonalization lemma).''' Let <math>T</math> be some nice theory (I don't remember the exact conditions, but Peano Arithmetic or Robinson Arithmetic would work). For any single-place formula <math>\mathcal E(x)</math> with <math>x</math> as its only free variable, there exists a sentence (closed formula) <math>\mathcal C</math> such that <math>T \vdash \mathcal C \leftrightarrow \mathcal E(\ulcorner \mathcal C \urcorner)</math>.
 
''Proof.'' We will use what's called the [[wikipedia:Lindenbaum–Tarski algebra]] in our proof, which the Yanofsky paper just calls Lindenbaum classes. The idea is to define an equivalence relation on formulas, so that <math>A \sim B</math> iff our theory <math>T</math> can prove that <math>A</math> and <math>B</math> are logically equivalent. In other words, we define <math>A \sim B</math> iff <math>T \vdash A \leftrightarrow B</math>. Thus below if we say <math>[A]_\sim = [B]_\sim</math> this means <math>T \vdash A \leftrightarrow B</math>, and similarly if we say <math>A \in [B]_\sim</math> this also means <math>T \vdash A \leftrightarrow B</math>. This is easily verified to be an equivalence relation. One twist is that it doesn't make sense to compare two formulas with differing numbers of free variables. For instance, what does it even mean to say "2+2=5" and "x+5=7" are or are not provably equivalent? For this reason, we make a separate equivalence relation for each number of free variables, and so we end up with separate quotients. For <math>i=0,1,2,\ldots</math>, we let <math>\mathit{Lind}^i</math> be the set of Lindenbaum classes of formulas with <math>i</math> free variables.
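For a concrete (and hypothetical) instance of the relation: both <math>0=0</math> and <math>\forall x\,(x=x)</math> are theorems of any such <math>T</math>, so <math>T</math> proves their biconditional and the two sentences land in the same Lindenbaum class. In LaTeX notation:

```latex
% Two provable sentences share a Lindenbaum class:
T \vdash (0{=}0) \leftrightarrow \forall x\,(x{=}x)
\quad\Longrightarrow\quad
[0{=}0]_\sim \;=\; [\forall x\,(x{=}x)]_\sim
```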
 
That means that e.g. each element of <math>\mathit{Lind}^0</math> is not a sentence, but rather a set of sentences that are all provably equivalent. But every sentence is either true or false, so doesn't this mean <math>\mathit{Lind}^0</math> has just two elements, <math>[\top]_{\sim}</math> and <math>[\bot]_{\sim}</math> (i.e. the class of true sentences and the class of false sentences)? No! That's because even if two sentences are both true, our theory may be unable to prove that this is so! That's the whole point of Gödel's first incompleteness theorem, which shows us that there can be true sentences (such as the famous "I am not provable" sentence <math>G</math>) which are not provable (so that even though <math>G</math> is true, <math>G \notin [2+2=4]_\sim</math>).
 
In the paper, Yanofsky toggles fluidly between thinking of a formula like <math>\mathcal B(x)</math> as a formula vs a set of formulas whose representative is <math>\mathcal B(x)</math> (i.e. the class <math>[\mathcal B(x)]_\sim</math>). To avoid confusion, we will take care to always denote sets of formulas using the equivalence class notation <math>[-]_\sim</math>.
 
Now following the paper, we define a function <math>f : \mathit{Lind}^1 \times \mathit{Lind}^1 \to \mathit{Lind}^0</math> as follows:
 
:<math>f([\mathcal B(x)]_\sim, [\mathcal H(y)]_\sim) := [\mathcal H(\ulcorner \mathcal B(x) \urcorner)]_\sim</math>
 
This is just the act of substituting the Gödel number of the first formula into the second formula. Notice a problem here! Since we have defined the function on the Lindenbaum classes rather than the bare formulas themselves, we must check that the result does not depend on the choice of representatives. For <math>\mathcal H(y)</math> we are good, because if we picked some provably equivalent <math>\mathcal H'(y)</math> instead, the results <math>\mathcal H(\ulcorner \mathcal B(x) \urcorner)</math> and <math>\mathcal H'(\ulcorner \mathcal B(x) \urcorner)</math> are provably equivalent, so the two classes will be equal. But what about the choice of <math>\mathcal B(x)</math>? Since <math>\mathcal H(y)</math> could be literally any one-place formula at all, and different choices of representatives <math>\mathcal B(x)</math> will produce different Gödel numbers, substituting in these different Gödel numbers could produce totally different sentences. Imagine <math>\mathcal B(x)</math> has Gödel number 2 and <math>\mathcal B'(x)</math> is provably equivalent to <math>\mathcal B(x)</math> and has Gödel number 11. If <math>\mathcal H(y)</math> is a formula like "y+5=7", then this will give two sentences "2+5=7" and "11+5=7", which are not provably equivalent!
 
So the function <math>f</math> given above is not well-defined. To get around this problem, we can say that whenever we need to make a choice of formula from an equivalence class, we always pick the one with the smallest Gödel number. Now no matter which representative <math>\mathcal B(x)</math> we choose, there will be a further step of standardizing on which Gödel number to use, so the function ends up well-defined. Picking the smallest Gödel number is well-defined by the [[wikipedia:well-ordering principle]]. So our new definition of <math>f</math> looks like this:
 
:<math>f([\mathcal B(x)]_\sim, [\mathcal H(y)]_\sim) := [\mathcal H(\min \{\ulcorner \varphi \urcorner : \varphi \in [\mathcal B(x)]_{\sim} \})]_\sim</math>
 
But actually, this notation is quite unwieldy, especially as we start to nest more and more expressions, so let's define a min operator for sets of formulas. Formally, we define <math>\min [\mathcal B(x)]_\sim := \min \{\ulcorner \varphi \urcorner : \varphi \in [\mathcal B(x)]_{\sim} \}</math>. This is perhaps a little confusing: we are not selecting the formula with the least Gödel number, but rather taking the Gödel number of the formula with the least Gödel number.
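As a sanity check on what the min operator returns, here is a toy sketch in Python. Everything in it is made up for illustration: a finite list of (Gödel number, formula) pairs stands in for a Lindenbaum class, and the numbers and formulas are hypothetical.

```python
# Toy stand-in for a Lindenbaum class: a list of (Gödel number, formula)
# pairs.  Real Gödel numberings assign enormous numbers; these are fake.

def lind_min(cls):
    """min [B(x)]~ := min { code(phi) : phi in [B(x)]~ }.

    Returns the least Gödel number in the class -- a number, not a formula.
    """
    return min(code for code, _formula in cls)

# a mock class of (hypothetically) provably equivalent formulas
mock_class = [(11, "x+5=7"), (2, "7=x+5"), (29, "x+5=7 and 0=0")]

print(lind_min(mock_class))  # → 2
```

The point of the helper is exactly the distinction made above: it returns the number <code>2</code>, not the formula <code>"7=x+5"</code> that carries it.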
 
Now each single-place formula <math>\mathcal E(x)</math> induces a mapping <math>\Phi_{\mathcal E} : \mathit{Lind}^0 \to \mathit{Lind}^0</math> by <math>[\mathcal P]_\sim \mapsto [\mathcal E(\min \{\ulcorner \varphi \urcorner : \varphi \in [\mathcal P]_\sim\})]_\sim</math>. We again have to standardize by taking the smallest Gödel number in the class in order to ensure the mapping is well-defined.
 
Drawing on Theorem 3 (diagonal theorem) earlier in the paper, we want to perform the familiar construction to get the function <math>g</math>.
 
We define <math>g([\mathcal B(x)]_\sim) := \Phi_{\mathcal E}(f([\mathcal B(x)]_\sim, [\mathcal B(x)]_\sim))</math> as in the paper. Since <math>f([\mathcal B(x)]_\sim, [\mathcal B(x)]_\sim) = [\mathcal B(\min \{\ulcorner \varphi \urcorner : \varphi \in [\mathcal B(x)]_{\sim} \})]_\sim</math> this means we have <math>\Phi_{\mathcal E}([\mathcal B(\min \{\ulcorner \varphi \urcorner : \varphi \in [\mathcal B(x)]_{\sim} \})]_\sim)</math> by replacing inside <math>\Phi_{\mathcal E}(-)</math>. And now expanding the definition of <math>\Phi_{\mathcal E}</math> we get <math>[\mathcal E(\min \{\ulcorner \varphi \urcorner : \varphi \in [\mathcal B(\min \{\ulcorner \varphi \urcorner : \varphi \in [\mathcal B(x)]_{\sim} \})]_\sim\})]_\sim</math>. That is definitely way more complicated-looking than the <math>\mathcal E(\ulcorner\mathcal B(\ulcorner \mathcal B(x)\urcorner) \urcorner)</math> that Yanofsky gets at this point in the paper! But it also ensures everything is well-defined.
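Using the <math>\min</math> shorthand defined earlier, the substitution chain above can be laid out step by step as an aligned derivation (same symbols as in the text):

```latex
\begin{align*}
g([\mathcal B(x)]_\sim)
  &= \Phi_{\mathcal E}\bigl(f([\mathcal B(x)]_\sim,\, [\mathcal B(x)]_\sim)\bigr) \\
  &= \Phi_{\mathcal E}\bigl(\,[\mathcal B(\min\,[\mathcal B(x)]_\sim)]_\sim\,\bigr) \\
  &= \bigl[\mathcal E\bigl(\min\,[\mathcal B(\min\,[\mathcal B(x)]_\sim)]_\sim\bigr)\bigr]_\sim
\end{align*}
```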
 
To get the proof to work, we must show <math>g</math> is representable. Recall that this means we must find some <math>[G(x)]_\sim \in \mathit{Lind}^1</math> such that <math>g(-) = f(-, [G(x)]_\sim)</math>. How do we figure out which <math>G(x)</math> works? It's a lot easier to work informally by eliding the distinction between formulas and equivalence classes of formulas, so let's be a bit loose and stay in "exploration mode" here; we will come back later and phrase things formally.
 
Well, we want to choose <math>G(x)</math> such that <math>g(B(x)) = f(B(x), G(x))</math> for every <math>B(x)</math>. Let's expand the LHS and RHS separately, and see where we get:

:<math>g(B(x)) = \Phi_E(f(B(x), B(x))) = E(B(B(x)))</math>

and

:<math>f(B(x), G(x)) = G(B(x))</math>

so this means we need <math>E(B(B(x))) = G(B(x))</math>. So <math>G</math> just needs to diagonalize its input and then wrap <math>E</math> around the result: <math>G(x) = E(D(x))</math>, where <math>D</math> represents the diagonalization function (the function sending <math>\ulcorner \varphi \urcorner</math> to <math>\ulcorner \varphi(\ulcorner \varphi \urcorner) \urcorner</math>).
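The informal computation above can be spot-checked in a toy string model. To be clear about the assumptions: this is not the arithmetized construction from the paper. The "Gödel number" of a formula is just the quoted string, one-place formulas like <math>E</math> and <math>G</math> are modeled as Python functions on codes, and <math>D</math> is evaluated directly rather than represented inside the theory.

```python
# Toy string model of the fixed-point computation g(B) = f(B, G).

def quote(s):
    """Toy Gödel numbering: the 'code' of s is just s wrapped in quotes."""
    return '"' + s + '"'

def subst(formula, value):
    """Substitute value for the free variable x (formulas are strings)."""
    return formula.replace("x", value)

def D(code):
    """Toy diagonalization: D(code(phi)) = code(phi(code(phi)))."""
    phi = code[1:-1]                 # strip the quotes to recover phi
    return quote(subst(phi, code))

def E(code):
    """The given one-place formula E, applied to a code."""
    return "E(" + code + ")"

def f(B, H):
    """f(B, H): substitute the code of B into the one-place formula H."""
    return H(quote(B))

def g(B):
    """g(B) = Phi_E(f(B, B)) = E applied to the code of B(code(B))."""
    return E(quote(subst(B, quote(B))))

G = lambda code: E(D(code))          # the candidate G(x) = E(D(x))

B = "P(x)"
assert g(B) == f(B, G)               # the fixed-point equation checks out
```

Running the assertion for any formula string with free variable <code>x</code> confirms that wrapping <math>E</math> around <math>D</math> makes the two sides agree, which is exactly the informal argument.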
 
Formally, then, we define <math>G(x) := E(D(x))</math>.
 
Now, to actually show that <math>g</math> is representable, we must show that <math>g([\mathcal B(x)]_\sim) = f([\mathcal B(x)]_\sim, [G(x)]_\sim)</math>.
 
We have <math>g([\mathcal B(x)]_\sim) = [\mathcal E(\min \{\ulcorner \varphi \urcorner : \varphi \in [\mathcal B(\min \{\ulcorner \varphi \urcorner : \varphi \in [\mathcal B(x)]_{\sim} \})]_\sim\})]_\sim</math>
 
and <math>f([\mathcal B(x)]_\sim, [G(x)]_\sim) = [G(\min \{\ulcorner \varphi \urcorner : \varphi \in [\mathcal B(x)]_{\sim} \})]_\sim</math>.
 
Expanding the definition of <math>G</math>, we get <math>[\mathcal E(D(\min \{\ulcorner \varphi \urcorner : \varphi \in [\mathcal B(x)]_{\sim} \}))]_\sim</math>.
 
To do the next step, it is convenient to put <math>n := \min \{\ulcorner \varphi \urcorner : \varphi \in [\mathcal B(x)]_{\sim} \}</math>, and let <math>\varphi_n</math> be the formula such that <math>\ulcorner \varphi_n \urcorner = n</math>. Note that <math>T \vdash \varphi_n \leftrightarrow \mathcal B(x)</math>.
 
So we want to compute <math>D(n)</math>. Using our new notation, this is <math>D(n) = \ulcorner \varphi_n(\ulcorner \varphi_n \urcorner) \urcorner</math>.
 
So in the end we get <math>[\mathcal E(\ulcorner \varphi_n(\ulcorner \varphi_n \urcorner) \urcorner)]_\sim</math>.

Now by definition of <math>D</math>, we get <math>[\mathcal E(\mathcal B( \min \{\ulcorner \varphi \urcorner : \varphi \in [\mathcal B(x)]_{\sim} \} ))]_\sim</math>.
 
[TODO] So actually, to get the proof to work, we need <math>D(\ulcorner \varphi \urcorner) = \min [\varphi(\ulcorner \varphi \urcorner)]_\sim</math>. But doesn't this make <math>D</math> uncomputable?

[TODO] Here we need to explain what <math>D(x)</math> means and why it's valid.

[TODO] Another thing we ought to discuss in the proof: since we are looking for a fixed point for an arbitrary map <math>\mathit{Lind}^0 \to \mathit{Lind}^0</math>, that fixes what the set <math>Y</math> could be. But it doesn't decide what the set <math>T</math> could be. How do we know that <math>T = \mathit{Lind}^1</math>?
 
==Intuitions==
 
==Acknowledgments==
 
Thanks to Rupert McCallum for helping me work through the proof.
 
==See also==
 
* [[User:IssaRice/Computability_and_logic/Diagonalization_lemma]]

Latest revision as of 03:32, 29 July 2021
