Solomonoff induction

From Machinelearning

Revision as of 06:16, 26 March 2019

Solomonoff induction is an idealized method of sequence prediction.

Model selection vs sequence prediction

Variants

There are many variants of Solomonoff induction that seem to all turn out essentially the same, but most resources on Solomonoff induction only talk about one or two of the variants, which makes it pretty confusing for someone trying to learn about it. It's also not obvious that all of these variants are essentially the same, so showing that would be nice.

Comparison dimensions:

  • Determinism: are your programs probabilistic or deterministic? If deterministic, I think the programs require random coinflips as input.
  • Kind of Turing machine used: ordinary, prefix, monotone
  • Prediction length: finite vs infinite sequences
  • Input (to UTM) length: finite vs infinite
  • Solomonoff prior vs universal mixture; for universal mixture, you need to also specify the class over which you're mixing, and what you are summing over (programs, as in Eliezer's Arbital dialogue, or semimeasures, as in Legg's paper)
  • Discrete vs continuous? (is this the same as finite vs infinite?)

List of variants considered in various resources:

  • Solomonoff's original paper:
    • (section 3.1)
    • (section 3.2)
    • (section 3.3)
    • (section 3.4) this uses a universal mixture, but I think Solomonoff is using all computable measures (rather than enumerable semimeasures)
  • Eliezer's Solomonoff induction dialogue:[1]
  • Li and Vitanyi:[2]
    • (defined in theorem 4.3.1 and example 4.3.3) seems like a discrete universal mixture. Input is finite. The "programs" (lower semicomputable discrete semimeasures) are probabilistic.
    • universal a priori probability (definition 4.3.5): deterministic prefix machine, random coinflips as input, finite input (program), finite output
    • universal a priori probability (definition 4.5.6, p. 302)
    • (a version of that is a measure)
  • LessWrong Wiki:
  • Scholarpedia article:[3]
    • m: discrete, deterministic UTM (with random input), finite output, finite input (however, the "arises in three different ways" part gives other ways to view m)
    • M: continuous, infinite output, finite input (but maybe only because we consider minimal programs), deterministic UTM (with random coinflips)
  • Shane Legg's introduction:[4]
    • the mixture definitions here -- these aren't really the same as all the others. I think Legg introduces them because he's summing over all the semimeasures rather than over all programs (if he were summing over all the programs, he could have just used M instead)
  • Sterkenburg's "The Foundations of Solomonoff Prediction":
    • -- this seems similar to Solomonoff's section 3.3
    • algorithmic probability (p. 29): monotone machine
    • -- I think this is the same as Solomonoff's section 3.4, eq. 13
    • (p. 33): universal mixture of semicomputable semimeasures
  • Marcus Hutter's papers...
  • Herrmann's "An Introduction to Solomonoff Induction With an Application to Scientific Confirmation Theory":
  • Logical Induction Arbital page:[5]
    • deterministic, regular (rather than prefix-free) Turing machines

Type of hypothesis

What is the type of a hypothesis? The type varies by formalization. For example:

  • In classical statistical inference, a hypothesis is a constant that happens to be unknown, so e.g. it is a real number. This hypothesis then induces a probability distribution (called a "model").
  • In Bayesian statistical inference, a hypothesis is an event (i.e. a subset of the sample space). There is a random variable H representing the hypothesis, and you get each individual hypothesis (e.g. the event H = h) by setting H to different values h.
  • In Solomonoff induction a hypothesis is a program (?) whose type depends on the variant of Solomonoff induction being used.
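To make the "hypothesis = program" view concrete, here is a toy Bayesian update over a hand-picked class of three deterministic "programs" with an invented 2^-length prior. Everything here (the hypothesis class, the lengths, the function names) is illustrative only, not taken from any of the cited formalizations:

```python
# Toy hypotheses-as-programs: each "program" maps a position n to a bit,
# and carries an invented description length used for its 2^-length prior.
hypotheses = {
    "all_zeros": (1, lambda n: 0),
    "all_ones":  (2, lambda n: 1),
    "alternate": (3, lambda n: n % 2),
}

def posterior(data):
    """Posterior weights: 2^-length for programs consistent with data, else 0.

    Assumes at least one hypothesis is consistent with the data.
    """
    w = {}
    for name, (length, prog) in hypotheses.items():
        consistent = all(prog(n) == bit for n, bit in enumerate(data))
        w[name] = 2.0 ** -length if consistent else 0.0
    total = sum(w.values())
    return {name: v / total for name, v in w.items()}

def predict_next(data):
    """Mixture probability that the next bit is 1."""
    post = posterior(data)
    return sum(p * hypotheses[name][1](len(data)) for name, p in post.items())
```

After seeing [0], both "all_zeros" and "alternate" remain consistent, so the prediction for the next bit is a prior-weighted mixture of the surviving programs' outputs.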

Significance of random coin flips

[6]

[7] (p. 4)

"""The prior AIXI is based on is the Solomonoff prior. It is defined as the output of a universal Turing machine (UTM) whose inputs are coin-flips.

In other words, feed a UTM a random program. Normally, you’d think of a UTM as only being able to simulate deterministic machines. Here, however, the initial inputs can instruct the UTM to use the rest of the infinite input tape as a source of randomness to simulate a stochastic Turing machine.

Combining this with the previous idea about viewing Bayesian learning as a way of allocating “trust” to “experts” which meets a bounded loss condition, we can see the Solomonoff prior as a kind of ideal machine learning algorithm which can learn to act like any algorithm you might come up with, no matter how clever."""[8] -- One thing that's not clear to me about this interpretation: what if we want something to have probability 1/3? We can approximate it arbitrarily well using coinflips, but we can never get it exactly. Contrast this with probabilistic programs, which can output 1/3 exactly.
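One possible resolution (my own sketch; the function name and setup are mine): a deterministic program that may read unboundedly many coinflips can realize probability exactly 1/3 while halting with probability 1, by comparing the random bits against the binary expansion 0.010101... of 1/3 and deciding at the first disagreement:

```python
import random

def bernoulli_one_third(flip=lambda: random.randint(0, 1)):
    """Return 1 with probability exactly 1/3 using fair coinflips.

    Interprets the coinflips as the binary expansion of a uniform U in
    [0, 1) and reports whether U < 1/3 = 0.010101..._2.  The outcome is
    decided at the first bit that disagrees with 1/3's expansion.
    """
    k = 1
    while True:
        target = (k - 1) % 2  # binary digits of 1/3: 0, 1, 0, 1, ...
        b = flip()
        if b != target:
            # U < 1/3 iff the first disagreeing bit is below the target digit
            return 1 if b < target else 0
        k += 1
```

Each loop iteration terminates the procedure with probability 1/2, so the program halts with probability 1; only programs reading a *bounded* number of flips are restricted to dyadic probabilities.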

Measures vs semimeasures

Multiple ways to think about semimeasures:

  • "defective" measures that can be patched up by adding an extra element
  • probability distribution over finite and infinite strings[5] -- the "semi" part comes from the inclusion of the finite strings: if all the finite strings had probability 0, we would have a measure.
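A toy way to see the "patched up by adding an extra element" view (the representation and helper names are mine): represent μ by its values on finite strings; then the deficit μ(x) − μ(x0) − μ(x1) is exactly the mass the patched-up measure assigns to halting at the finite string x:

```python
def deficit(mu, x):
    """Mass mu assigns to stopping exactly at the finite string x."""
    return mu.get(x, 0.0) - mu.get(x + "0", 0.0) - mu.get(x + "1", 0.0)

def is_semimeasure(mu, strings):
    """Check mu("") <= 1 and mu(x) >= mu(x0) + mu(x1) on the given strings."""
    return mu.get("", 0.0) <= 1.0 and all(
        deficit(mu, x) >= -1e-12 for x in strings
    )

# A semimeasure that is not a measure: half the mass at the root never
# flows down to longer strings (it "halts" at the empty string).
mu = {"": 1.0, "0": 0.25, "1": 0.25, "00": 0.25}
```

For a proper measure every deficit is zero, so all the mass flows out to infinite sequences.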

Computable, computably enumerable, lower semicomputable, upper semicomputable

there is "computable" in the recursion theory sense, and then there is "computable" in the lower and upper approximations sense. I think these are equivalent; the latter is just trying to define the former in the case where the domain/range are uncountable.

Turing machines, prefix machines, monotone machines

Kraft's inequality
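The statement itself (standard; the helper names are mine): any prefix-free set of binary codewords satisfies Σ_p 2^(−|p|) ≤ 1, which is why the weights 2^(−|p|) over a prefix machine's programs sum to at most 1:

```python
from itertools import combinations

def is_prefix_free(codes):
    """True iff no codeword is a proper prefix of another."""
    return not any(
        a.startswith(b) or b.startswith(a) for a, b in combinations(codes, 2)
    )

def kraft_sum(codes):
    """Kraft sum: sum of 2^(-length) over the codewords."""
    return sum(2.0 ** -len(p) for p in codes)

# A complete prefix-free code: Kraft sum exactly 1.
codes = ["0", "10", "110", "111"]
```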

prefix machine = prefix-free machine?

prefix machines are used for discrete version, and monotone machines are used for continuous version?

What functions can prefix/monotone machines compute? Can they compute everything a regular Turing machine can, just with an "interactive" component added? Or is there some Turing-computable function that prefix/monotone machines can't compute? Since the work tape is bidirectional and we can move freely on it, we can start by copying the program to it, then go back to the leftmost nonblank square, execute the program like a normal Turing machine, and finally copy the resulting tape onto the write-only output tape. This only works if the program is finite, though.

Suppose the set of input programs is prefix-free. Then if a program p is a prefix of a program q, we must have p = q, so we automatically get that the output on p is a prefix of the output on q. So prefix-freeness implies monotonicity.

Hutter says (p. 11) that for a given output x, the set of programs producing x forms a prefix code. At first I was confused about this. What if a program prints "+" each time it reads a 0 and does nothing when it reads a 1? Then on input "000" it prints "+++", but if the input were "0001", it seems like this would also print "+++", so the set of programs that print "+++" wouldn't form a prefix code (because "000" is a prefix of "0001")! But looking more closely, Hutter's definition requires that the program p "is to the left of the input head when the last bit of x is output". When the last bit of "+++" is output, the thing to the left of the input head is "000". So by this definition, "0001" does not count as a program for "+++". This is somewhat unintuitive (I wouldn't have expected it from the notation), but at least now it makes sense to say that the set of programs for a given output forms a prefix code.
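The toy machine from this paragraph can be simulated directly (my sketch; the function name and the 6-bit cutoff are arbitrary): enumerate bit strings, and count p as a program for an output only when the whole of p has been consumed at the moment the last output symbol is printed. Then "0001" is excluded and the program set is prefix-free:

```python
def programs_for(output, max_len=6):
    """Bit strings p that the toy machine (0 -> print '+', 1 -> print
    nothing) maps to `output`, where the last output symbol is printed
    exactly as the final bit of p is consumed (Hutter's convention)."""
    progs = []
    for n in range(1, max_len + 1):
        for m in range(2 ** n):
            p = format(m, "0" + str(n) + "b")
            printed = ""
            consumed_at_last_output = None
            for i, b in enumerate(p):
                if b == "0":
                    printed += "+"
                    consumed_at_last_output = p[: i + 1]
            if printed == output and consumed_at_last_output == p:
                progs.append(p)
    return progs
```

The qualifying programs are exactly the strings with three 0s that end in a 0, and no such string is a prefix of another (an extension would have to contain a fourth 0).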

So why do we sometimes use a prefix machine and sometimes a monotone machine? If we're dealing with finite output strings, we can require that the machine halts, so the program itself contains the halting instruction. But when we're dealing with infinite output strings, we also want to allow for the possibility that the machine never halts, and that that's okay too. And if we allow non-halting programs, we need to be more careful about which strings count as programs.

I have an intuition that says if we take monotone machines but restrict to finite output, then we get exactly prefix machines.

consider program p:

print("+")
halt()

and program p':

print("+")
print("")
halt()

Because of the halting instruction, p is not a prefix of p'. But if we drop the halting instructions and work with programs like q:

print("+")

and q':

print("+")
print("")

then q is a prefix of q'. Monotone machines prevent this kind of thing by saying that a string only counts as a program if the machine has just finished printing a non-empty string by the time it reads the string's last bit.

What if we take the set of all monotone programs, and insert halt() at every possible location?
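The monotonicity property itself is easy to check on a toy machine (my sketch: a machine that prints "+" for each 0 it reads and nothing for each 1): extending the input can only extend the output:

```python
def M(p):
    """A toy monotone machine: emit '+' for every 0 read, nothing for 1."""
    return "".join("+" for b in p if b == "0")

def is_prefix(a, b):
    """True iff string a is a prefix of string b."""
    return b.startswith(a)
```

Monotonicity here means: whenever p is a prefix of q, M(p) is a prefix of M(q), so the machine's output on a growing input tape only ever grows.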

Applications

  • there are some standard applications; you can find them in the standard references
  • AIXI
  • Carl's analogy to psychophysical laws -- Rob Bensinger's post about AIXI also talks about a "qualia factory"
  • malign prior stuff
  • comparison to logical induction

References

  1. "Solomonoff induction: Intro Dialogue (Math 2)". Arbital.
  2. Ming Li; Paul M.B. Vitanyi. An Introduction to Kolmogorov Complexity and Its Applications (3rd ed).
  3. Marcus Hutter; Shane Legg; Paul M.B. Vitanyi. "Algorithmic probability". Scholarpedia. 2007.
  4. Shane Legg. "Solomonoff Induction". 2004.
  5. Alex Appel. "Logical Induction (incomplete)". 2017.
  6. Marcus Hutter. "On Universal Prediction and Bayesian Confirmation".
  7. Ian Wood; Peter Sunehag; Marcus Hutter. "(Non-)Equivalence of Universal Priors".
  8. https://intelligence.org/2018/11/02/embedded-models/