User:IssaRice/Computability and logic/Diagonalization lemma

The diagonalization lemma, also called the Gödel–Carnap fixed point theorem, is a fixed point theorem in logic.

A verbal version of this result, modified from GEB (p. 449), runs as follows: Take the predicate cannot-be-proved-when-diagonalized(x). This takes a predicate as its x input, and asserts that the sentence resulting from diagonalizing that predicate (i.e. inserting it into itself) cannot be proved. For instance, cannot-be-proved-when-diagonalized("has-length-less-than-one-thousand(x)") claims that has-length-less-than-one-thousand("has-length-less-than-one-thousand(x)") cannot be proved. In this case, the claim is false: the string "has-length-less-than-one-thousand(x)" visibly has length less than 1000, and assuming our proof system is strong enough to prove this, the diagonalized sentence can be proved. Now, to diagonalize cannot-be-proved-when-diagonalized(x) is to form the sentence cannot-be-proved-when-diagonalized("cannot-be-proved-when-diagonalized(x)"). So can this sentence be proved or not? If it can be proved, then the sentence itself claims that it cannot be proved, a contradiction. So it must not be provable.
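To make the quoting mechanics concrete, here is a toy Haskell sketch (my own illustration; the helper names quote and diagonalize are hypothetical, not from any source): diagonalizing a predicate string just means substituting the predicate's own quotation for the placeholder x.

```haskell
-- Toy model of diagonalization on predicate strings.
quote :: String -> String
quote s = "\"" ++ s ++ "\""

-- Substitute the predicate's own quotation for the placeholder x.
diagonalize :: String -> String
diagonalize p = concatMap (\c -> if c == 'x' then quote p else [c]) p

main :: IO ()
main = do
  let p = "has-length-less-than-one-thousand(x)"
  -- prints has-length-less-than-one-thousand("has-length-less-than-one-thousand(x)")
  putStrLn (diagonalize p)
  -- the inner quoted string really is far shorter than 1000 characters:
  print (length p < 1000)  -- True
```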

Basically, unlike in English, a sentence in a formal language can't refer to itself using phrases like "this sentence itself", so there is no straightforward way to make claims like "This sentence cannot be proved". To get around this restriction, we must use diagonalization – substituting a sentence's own encoding (i.e. string representation, aka "Gödel number") into itself. This allows a predicate to talk about its own string representation. If the predicate in question happens to claim unprovability, we get Gödel's first incompleteness theorem.

The diagonalization lemma generalizes this to any predicate P(x), not just not-provable(x). We want to find a sentence G such that G is true if and only if P("G") is (this is a little sloppy -- "G" here stands for the string representation of whatever sentence G turns out to be). Let G be has-property-P-when-diagonalized("has-property-P-when-diagonalized(x)"). If G is true, then has-property-P-when-diagonalized(x) must have property P when diagonalized, i.e. P("has-property-P-when-diagonalized('has-property-P-when-diagonalized(x)')") is true, which means P("G") is true. If G is false, then has-property-P-when-diagonalized(x) must not have property P when diagonalized, i.e. P("has-property-P-when-diagonalized('has-property-P-when-diagonalized(x)')") is false, which means P("G") is false. Either way, G is true if and only if P("G") is.

Rogers's fixed point theorem

Let $f$ be a total computable function. Then there exists an index $e$ such that $\varphi_e \simeq \varphi_{f(e)}$.

(simplified)

Define $d(x) = \varphi_x(x)$ (this is actually slightly wrong, but it brings out the analogy better).

Consider the function $f \circ d$. This is partial recursive, so $f \circ d \simeq \varphi_i$ for some index $i$.

Now $\varphi_{f(d(i))} \simeq \varphi_{\varphi_i(i)}$ since $f \circ d \simeq \varphi_i$. This is equivalent to $\varphi_{d(i)}$ by definition of $d$. Thus, we may take $e = d(i)$ to complete the proof.
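Spelling out the conclusion as a single chain (just unwinding the definitions above):

$$\varphi_{f(e)} = \varphi_{f(d(i))} \simeq \varphi_{\varphi_i(i)} = \varphi_{d(i)} = \varphi_e.$$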

It looks like we have $f(d(i)) = \varphi_i(i) = d(i)$, i.e. $f(e) = e$. Is this right?


Repeatedly using the facts that (1) $i$ is an index for $f \circ d$, and (2) $d(i) = \varphi_i(i)$, allows us to create an iteration effect:

$\varphi_i(i) \simeq f(d(i)) \simeq f(\varphi_i(i)) \simeq f(f(d(i))) \simeq f(f(\varphi_i(i))) \simeq \cdots \simeq (f \circ \cdots \circ f \circ d)(i) \simeq \cdots$

(I'm wondering if there's some deeper meaning to this. So far it's just an interesting connection between diagonalization-based fixed points and iteration-based fixed points. I think there might be a connection between this and the fix function in Haskell.)
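For reference, fix does exist in Haskell's Data.Function, defined so that fix f = f (fix f); here is a minimal illustration (the factorial example is mine, not from the page):

```haskell
import Data.Function (fix)

-- fix f = f (fix f): unfolding it yields exactly the kind of chain
-- f (f (f ...)) displayed above.
factorial :: Integer -> Integer
factorial = fix (\rec n -> if n <= 1 then 1 else n * rec (n - 1))

main :: IO ()
main = print (factorial 5)  -- 120
```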


In the more rigorous/careful version of the proof, we use the s-m-n theorem to get a total computable function $s$ that plays the role of $d$. The difference is that $\varphi_x(x)$ might not be defined for all $x$ (actually it isn't, since some partial functions are always undefined), so $d$ is not total. On the other hand, $s$ is obtained via the s-m-n theorem, so it is total. When $\varphi_x(x)$ is undefined, $s(x)$ gives an index of the always-undefined partial function. So $s$ says "this is undefined" in a defined way. Thanks to this property, the expression $\varphi_{s(x)}$ always makes sense, whereas $\varphi_{\varphi_x(x)}$ sometimes doesn't make sense.
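In symbols (my paraphrase of the standard construction): the s-m-n theorem gives a total computable $s$ with

$$\varphi_{s(x)}(y) \simeq \varphi_{\varphi_x(x)}(y) \quad \text{for all } x, y,$$

so when $\varphi_x(x)$ is undefined, $\varphi_{s(x)}$ is simply the everywhere-undefined function, while $s(x)$ itself is still a perfectly good index.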


See also https://machinelearning.subwiki.org/wiki/User:IssaRice/Computability_and_logic/Rogers_fixed_point_theorem_using_Sipser%27s_notation

Diagonalization lemma

(semantic version)

Let $A$ be a formula with one free variable. Then there exists a sentence $G$ such that $G \leftrightarrow A(\ulcorner G \urcorner)$.

Define $\mathrm{diag}(x)$ to be $\ulcorner C(\ulcorner C \urcorner) \urcorner$ where $x = \ulcorner C \urcorner$. In other words, given a number $x$, the function diag finds the formula with that Gödel number, then diagonalizes it (i.e. substitutes the Gödel number of the formula into the formula itself), then returns the Gödel number of the resulting sentence.

Let $B$ be $A(\mathrm{diag}(x))$, and let $G$ be $B(\ulcorner B \urcorner)$.

Then $G$ is $A(\mathrm{diag}(\ulcorner B \urcorner))$, by substituting $x = \ulcorner B \urcorner$ in the definition of $B$.

We also have $\mathrm{diag}(\ulcorner B \urcorner) = \ulcorner B(\ulcorner B \urcorner) \urcorner$ by definition of diag. By definition of $G$, this is $\ulcorner G \urcorner$, so we have $\mathrm{diag}(\ulcorner B \urcorner) = \ulcorner G \urcorner$.

To complete the proof, apply $A$ to both sides of the final equality to obtain $A(\mathrm{diag}(\ulcorner B \urcorner)) \leftrightarrow A(\ulcorner G \urcorner)$; since $G$ is $A(\mathrm{diag}(\ulcorner B \urcorner))$, the left side is just $G$, and this simplifies to $G \leftrightarrow A(\ulcorner G \urcorner)$.
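Since the bookkeeping here is pure string substitution, it can be checked mechanically. Below is a toy Haskell sketch (my own encoding, not from any source): a formula's "Gödel number" is just the string itself, quotation plays the role of the numeral, and x is the placeholder.

```haskell
-- Toy model: formulas are strings, and diag substitutes a formula's
-- own quotation for the placeholder x.
quote :: String -> String
quote s = "\"" ++ s ++ "\""

substX :: String -> String -> String
substX sub = concatMap (\c -> if c == 'x' then sub else [c])

-- diag takes (the string of) C and returns (the string of) C("C").
diag :: String -> String
diag c = substX (quote c) c

b :: String
b = "A(diag(x))"        -- B is A(diag(x))

g :: String
g = substX (quote b) b  -- G is B("B")

main :: IO ()
main = do
  putStrLn g            -- A(diag("A(diag(x))"))
  print (diag b == g)   -- diag("B") is "G": True
```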

[1]

[2]

use of extra quantified variable to make a substitution

(see p. 448 of GEB)

outside the formal system, if we have some function f, a constant a, and some one-place relation R, we can substitute f(a) into R like: R(f(a)). but many systems of formal logic don't have a way to directly talk about outputs of functions like f(a). instead, they have a relation like F(a,y) to mean f(a)=y. [why on earth would they do this? i think the basic reason is that if functions are just relations, we have fewer cases to have to prove in those annoying structural induction proofs.] so how do we express an idea like R(f(a))? we can make use of an extra variable to hold the output, e.g. $F(a,y) \wedge R(y)$. but this leaves $y$ free, so actually we want $\exists y\,(F(a,y) \wedge R(y))$. alternatively, we can also say $F(a,y) \to R(y)$. in this case we want $\forall y\,(F(a,y) \to R(y))$. it's easy to prove that both of these clumsy ways of writing are logically equivalent to R(f(a)).
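in symbols, assuming $F$ really does define the function $f$ (i.e. $\forall a\, \exists! y\, F(a,y)$, with $f(a)$ denoting that unique $y$), the equivalence is:

$$R(f(a)) \;\leftrightarrow\; \exists y\,(F(a,y) \wedge R(y)) \;\leftrightarrow\; \forall y\,(F(a,y) \to R(y)).$$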

something similar happens when we want to diagonalize formulas. given a formula A(x) that has just x free, it's easy enough to diagonalize it: $A(\ulcorner A(x) \urcorner)$. but what about a sentence like B? how do we "substitute" something into a sentence that has no free variable? [why on earth would you want to try that? i think it simplifies the proof a little if we assume diagonalization is defined for any sentence. but i forgot where exactly the simplification occurs.] the idea is again to make use of a separate variable: $\exists x\,(x = \ulcorner A(x) \urcorner \wedge A(x))$. again, i think we could also do $\forall x\,(x = \ulcorner A(x) \urcorner \to A(x))$. we're basically considering a function that finds the Gödel number of a sentence. except unlike a relation, a single-free-variable formula fixes some specific variable that it leaves free (a relation doesn't know whether it's x or y that's free -- it just expresses some idea), so we need to fix some single variable to use throughout.
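for a formula A(x) with one free variable, both quantifier forms collapse back to ordinary substitution, which is the sanity check that the trick does what we want:

$$\exists x\,(x = \ulcorner A(x) \urcorner \wedge A(x)) \;\leftrightarrow\; A(\ulcorner A(x) \urcorner) \;\leftrightarrow\; \forall x\,(x = \ulcorner A(x) \urcorner \to A(x)).$$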

Trying to discover the lemma

approach 1

https://mathoverflow.net/a/31374

approach 2

see Owings paper.

In the framework of this paper, we have a matrix where each entry is of a certain type. Then we apply the function α to the diagonal. If the diagonal turns into one of the rows, α has a fixed point.

So now the trick is to figure out what our α should be, and also what our matrix should look like.

Picking the α doesn't seem hard: we want a fixed point for the operation $\varphi_{f(\cdot)}$, so we can pick $\alpha(\varphi_e) = \varphi_{f(e)}$. One problem is that this might not be well-defined, but we can just go with this for now (it ends up not mattering, for reasons I don't really understand, but the Owings paper has another workaround, which is to use relations; I find that more confusing).

The matrix that works turns out to have entries $\varphi_{\varphi_j(k)}$ (row $j$, column $k$). I'm not sure how one would have figured this out. One might also think $\varphi_j(k)$ would work, but notice that then we fail the type checking with α (which takes a function, not a natural number).

So now we take the diagonal, which has entries $\varphi_{\varphi_k(k)}$ for $k = 0, 1, 2, \ldots$, and apply α. We get $\varphi_{f(\varphi_k(k))}$. But $d$ defined by $d(x) = \varphi_x(x)$ is a recursive function, so the diagonal has turned into $\varphi_{f(d(k))} = \varphi_{(f \circ d)(k)}$. Since a composition of recursive functions is itself recursive, $f \circ d$ is recursive. So we have some index $e$ for it, i.e. $f \circ d \simeq \varphi_e$. So α applied to the diagonal results in $\varphi_{\varphi_e(k)}$, which is one of the rows (the $e$th row). This means α has a fixed point, in the $e$th entry, i.e. at $\varphi_{\varphi_e(e)}$. So we expect $\alpha(\varphi_{\varphi_e(e)}) = \varphi_{\varphi_e(e)}$. Since $\alpha(\varphi_{\varphi_e(e)}) = \varphi_{f(\varphi_e(e))}$, the "real" fixed point for the operator will be at $\varphi_e(e)$. Indeed, $\varphi_{f(\varphi_e(e))} \simeq \varphi_{(f \circ d)(e)} \simeq \varphi_{\varphi_e(e)}$.
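Pictorially (my rendering of the construction just described):

$$M = \begin{pmatrix} \varphi_{\varphi_0(0)} & \varphi_{\varphi_0(1)} & \varphi_{\varphi_0(2)} & \cdots \\ \varphi_{\varphi_1(0)} & \varphi_{\varphi_1(1)} & \varphi_{\varphi_1(2)} & \cdots \\ \varphi_{\varphi_2(0)} & \varphi_{\varphi_2(1)} & \varphi_{\varphi_2(2)} & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix}$$

Applying α to the diagonal $(\varphi_{\varphi_k(k)})_k$ gives $(\varphi_{f(\varphi_k(k))})_k = (\varphi_{(f \circ d)(k)})_k = (\varphi_{\varphi_e(k)})_k$, which is row $e$.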

Now we have to verify that α doesn't need to be well-defined.

approach 3

Take Cantor's theorem, generalize it to mention fixed points, then take the contrapositive. See the Yanofsky paper for details.

This version still has some mystery for me, e.g. replacing "the set has at least two elements" with "there is a function from the set to itself without a fixed point". The logical equivalence is easy to see, but getting the idea for rephrasing this condition to mention fixed points is not obvious at all.

The use of the s-m-n theorem also isn't obvious to me. Why use it at all? Why use it on g? Why do we care about the index of s?

It's also not clear to me why we use T=N and Y=F. In some sense it does make sense, like the natural numbers are all the algorithms, and the set of computable functions are the "properties" (a.k.a. "the objects being named").

approach 4

http://www.andrew.cmu.edu/user/kk3n/complearn/chapter8.pdf -- see section 8.1

also see Moore and Mertens's section on lambda calculus

in the untyped lambda calculus, there is only one type of entity, namely functions, which can operate on other functions. This makes it easy for functions to operate on themselves, which creates self-reference.

but when working with partial recursive functions, we don't have this. instead, we have numbers and then partial functions that operate on numbers. to get self-reference, we need some kind of encoding. this is why we numbered the partial recursive functions.

but now, one of the familiar facts about the lambda calculus is the existence of the fixed point combinator (aka y combinator). (note: this passes the buck to wondering how one would have come up with the lambda calculus, or how one would come up with the fixed point combinator in that setting; but this seems easier to answer.) since this theorem works in one setting in which we have self-reference, we might wonder if we can "port over" the theorem to the case where we have self-reference in a different setting.
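as a sketch of how the "port" looks on the lambda calculus side: in Haskell, raw self-application x x doesn't typecheck, so one standard trick wraps it in a recursive newtype, which plays a role loosely analogous to the encoding just described (this example is my own sketch, not from the page):

```haskell
-- A Y-combinator sketch. The newtype makes self-application typecheck,
-- playing a role loosely analogous to "encoding" in the partial
-- recursive functions setting.
newtype Rec a = Rec { unRec :: Rec a -> a }

y :: (a -> a) -> a
y f = g (Rec g)
  where g x = f (unRec x x)

factorial :: Integer -> Integer
factorial = y (\rec n -> if n <= 1 then 1 else n * rec (n - 1))

main :: IO ()
main = print (factorial 5)  -- 120
```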

Comparison table

Some things to notice:

  • The two theorems are essentially identical, with identical proofs, as seen by the matching rows. The analogy breaks down slightly at the very end, where we apply $\varphi_{f(\cdot)}$ vs $A(\cdot)$ (the latter corresponds to $f$ until the very end).
  • In the partial recursive functions world, it's easy to go from the index (e.g. $e$) to the partial function ($\varphi_e$). In the formulas world it's the reverse, where it's easy to go from a formula (e.g. $A$) to its Gödel number ($\ulcorner A \urcorner$). I wonder if there is something essential here, or if it is simply some sort of historical accident in notation.
  • For the diagonalization lemma, here we have done the semantic version (? I think...), but usually the manipulations are done inside a formal system with reference to some theory T to derive a syntactic result (i.e. we have some theory that is strong enough to do all these manipulations within the object-level language). For partial recursive functions, as far as I know, there is no analogous distinction between semantics and syntax.
  • The diagonalization part is not completely correct/as strong as possible in either proof. On the partial recursive functions side, we want to make sure that $\varphi_{\varphi_x(x)}$ is actually defined in each case. On the logic side, I think often the diagonalization is defined as $\exists x(x = \ulcorner A \urcorner \wedge A)$ so that it is defined for all formulas, not just ones with one free variable. But the essential ideas are all present below, and since this makes the comparison easier, the presentation is simplified.
Step | Rogers's fixed point theorem | Diagonalization lemma
Theorem statement (note: quantifiers are part of the metalanguage) | $(\forall f)(\exists e)\ \varphi_e \simeq \varphi_{f(e)}$ | $(\forall A)(\exists G)\ G \leftrightarrow A(\ulcorner G \urcorner)$
Given mapping | $f$ | $A$
Definition of diagonal function | $d(x) = \varphi_x(x)$ | $\mathrm{diag}(\ulcorner C \urcorner) = \ulcorner C(\ulcorner C \urcorner) \urcorner$
Composition of given mapping with diagonal function (given∘diagonal) | $f(d(x))$ | $A(\mathrm{diag}(x))$
Naming the given∘diagonal composition | $f \circ d$ (name not given because compositions are easy to express outside a formal language) | $B$
Index of given∘diagonal composition | $i$ | $\ulcorner B \urcorner$
Expanding using definition of diagonal | $d(i) = \varphi_i(i)$ | $\mathrm{diag}(\ulcorner B \urcorner) = \ulcorner B(\ulcorner B \urcorner) \urcorner$
The given∘diagonal composition applied to its own index (i.e. diagonalization of the composition) | $(f \circ d)(i)$ | $B(\ulcorner B \urcorner)$
$G$ defined | $\varphi_i(i)$ (no equivalent definition) | $G$ is $B(\ulcorner B \urcorner)$
 | $f(d(i)) = \varphi_i(i)$ | $A(\mathrm{diag}(\ulcorner B \urcorner))$ is $B(\ulcorner B \urcorner)$
Renaming index | $e = d(i)$ | $\ulcorner G \urcorner = \mathrm{diag}(\ulcorner B \urcorner)$
Leibniz law applied to previous row | Apply $\varphi_{f(\cdot)}$ to obtain $\varphi_{f(e)} = \varphi_{f(d(i))}$ | Apply $A(\cdot)$ to obtain $A(\ulcorner G \urcorner) \leftrightarrow A(\mathrm{diag}(\ulcorner B \urcorner))$
Use definition of $G$ | $\varphi_{f(e)} \simeq \varphi_{\varphi_i(i)} = \varphi_e$ | $A(\ulcorner G \urcorner) \leftrightarrow B(\ulcorner B \urcorner) \leftrightarrow G$
(Definition of $G$)? | $\varphi_i(i)$ is $f(d(i))$ | $G$ is $A(\mathrm{diag}(\ulcorner B \urcorner))$

Quotes

"All of these theorems tend to strain one's intuition; in fact, many people find them almost paradoxical. The most popular proofs of these theorems only serve to aggravate the situation because they are completely unmotivated, seem to depend upon a low combinatorial trick, and are so barbarically short as to be nearly incapable of rational analysis."[3]

"This is just a lovely result, insightful in its concept and far reaching in its consequences. We’d love to say that the proof was also lovely and enlightening, but to be honest, we don’t have an enlightening sort of proof to show you. Sometimes the best way to describe a proof is that the argument sort of picks you up and shakes you until you agree that it does, in fact, establish what it is supposed to establish. That’s what you get here."[4]

"The brevity of the proof does not make for transparency; it has the aura of a magician’s trick. How did Gödel ever come up with the idea? As a matter of fact, Gödel did not come up with that idea."[1]

Questions/things to explain

  • In Peter Smith's book, he defines Gdl(m,n) as Prf(m, diag(n)). What is the analogue of Gdl for the Rogers fixed point theorem?
  • I like the $D(\varphi) = \varphi(\ulcorner \varphi \urcorner)$ that begins this answer, but what is the analogue for partial functions? It seems like it is $d(x) = \varphi_x(x)$, which does exist (because we are allowed to have undefined values). So the motivation that works for the logic version doesn't work for the partial functions version, which bugs me.

References

  1. Haim Gaifman. "Naming and Diagonalization, from Cantor to Gödel to Kleene".
  2. https://mathoverflow.net/questions/30874/arithmetic-fixed-point-theorem
  3. James C. Owings, Jr. "Diagonalization and the Recursion Theorem". 1973.
  4. Christopher C. Leary; Lars Kristiansen. A Friendly Introduction to Mathematical Logic (2nd ed). p. 172.