Variants of Solomonoff induction

This page lists some '''variants of Solomonoff induction'''. [[Solomonoff induction]] is an idealized method of sequence prediction.


For determinism, I think "deterministic" is the same as "Solomonoff prior" and "stochastic" is the same as "universal mixture". Sterkenburg calls deterministic versions a "bottom-up approach" whereas the universal mixture is a "top-down approach" (p. 30).<ref name="sterkenburg"/> For deterministic variants, the type of universal machine must be specified. With universal mixtures, one must specify two things: the weighting to use, and the class of distributions to consider.
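To make these two ingredients concrete, here is a minimal Python sketch (my own illustration, not from any of the cited sources): the model class is a toy family of Bernoulli coins with biases <math>p_k = 1/(k+2)</math>, and the weighting is <math>w_k = 2^{-(k+1)}</math>, so that <math>\sum_k w_k \leq 1</math>. A real universal mixture would instead range over all semicomputable semimeasures.

<syntaxhighlight lang="python">
# Toy "top-down" universal mixture: both ingredients (the weighting and the
# model class) are explicit. Illustration only; the model class here is far
# smaller than the class of all semicomputable semimeasures.

def bernoulli(p):
    """Return mu(x) for an i.i.d. coin with P(1) = p, over finite bit strings."""
    def mu(x):
        ones = x.count("1")
        return p ** ones * (1 - p) ** (len(x) - ones)
    return mu

def xi(x, n_models=50):
    """Truncated mixture xi(x) = sum_k w_k * mu_k(x), with w_k = 2^-(k+1)."""
    return sum(
        2.0 ** -(k + 1) * bernoulli(1 / (k + 2))(x)
        for k in range(n_models)
    )

print(xi("0000"))  # larger than xi("1111"): most coins in the toy class favor 0
print(xi("1111"))
</syntaxhighlight>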


For discrete vs continuous, I think this just means whether the prior we define is over finite strings or over infinite sequences (where we want to know the probability of an infinite sequence starting with a given finite string). I'm not sure how to tell whether a given formula is discrete or continuous. One difference seems to be that with discrete semimeasures, we only require that the sum is at most 1, whereas with continuous semimeasures we also require that <math>\mu(x) \geq \mu(x0) + \mu(x1)</math> (see e.g. p. 5 of [https://arxiv.org/pdf/1508.05733.pdf] and p. 294 of Li and Vitanyi). Apparently one way to think of the discrete version is to think of the sample space as the natural numbers, i.e. one-letter strings from a countably infinite alphabet (see p. 265 of Li and Vitanyi).
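As a sanity check on this difference, here is a small Python sketch (hypothetical helper functions of my own) that tests a finite table of probabilities against each condition. The fair-coin measure on prefixes passes the continuous condition but fails the discrete one, since the masses of a string and its extensions overlap.

<syntaxhighlight lang="python">
from itertools import product

def is_discrete_semimeasure(table):
    """Discrete condition: total mass over all strings is at most 1."""
    return sum(table.values()) <= 1.0

def is_continuous_semimeasure(table, max_len):
    """Continuous condition: mu(x) >= mu(x0) + mu(x1), checked up to max_len."""
    return all(
        table.get(x, 0.0) >= table.get(x + "0", 0.0) + table.get(x + "1", 0.0)
        for x in table if len(x) < max_len
    )

# The fair-coin measure on finite prefixes: mu(x) = 2^-|x|.
fair = {"".join(bits): 2.0 ** -n for n in range(4) for bits in product("01", repeat=n)}
print(is_continuous_semimeasure(fair, max_len=3))  # True (with equality everywhere)
print(is_discrete_semimeasure(fair))               # False: the total mass is 4
</syntaxhighlight>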
 
NOTE: as of July 2019 I'm no longer sure that discrete vs continuous is actually a distinction. I think all(?) versions of Solomonoff induction use continuous semimeasures. I'm not sure why the discrete one can't be used, though; e.g. "For applications such as the prediction of growing sequences it is necessary to define a similar distribution on infinite binary sequences" [http://www.scholarpedia.org/article/Algorithmic_probability]. Solomonoff also says similar things in at least one of his papers.
February 2020 update: there is a discrete vs continuous distinction after all. With a discrete semimeasure, <math>\mu(xy\mid x) = 0</math> for all non-empty strings <math>y</math>, since we are considering ordinary Turing machines that halt: if a TM halts with output <math>x</math>, then it can't also halt with output <math>xy</math>. So the universal discrete semimeasure can't be used for sequence prediction, since all conditional probabilities are zero. To do sequence prediction, we need to switch to monotone TMs and allow partial outputs.
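To illustrate the point about prediction, here is a sketch reusing the toy mixture <code>xi</code> from the sketch above (again my own illustration, with a mixture standing in for the universal semimeasure <math>M</math>): prediction takes ratios <math>M(xb)/M(x)</math>, which is only informative when extending the string doesn't force the probability to zero.

<syntaxhighlight lang="python">
# Sequence prediction via conditionals of a continuous-style semimeasure:
# P(next bit = b | x) = M(xb) / M(x). With the universal *discrete*
# semimeasure (halting machines), this ratio would be 0 for every
# extension, as noted above.

def predict_next_bit(x, M):
    """Posterior probability that the next bit is 1, given the prefix x."""
    return M(x + "1") / M(x)

# The toy mixture concentrates on the best coin in its class (p = 1/2),
# so after "1111" the prediction is close to 1/2 rather than 0.
print(predict_next_bit("1111", xi))  # roughly 0.48
</syntaxhighlight>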


{| class="sortable wikitable"
|-
! Source !! Formula !! Determinism !! Discrete vs continuous !! Notes
|-
| LessWrong Wiki<ref>https://wiki.lesswrong.com/wiki/Solomonoff_induction</ref> || <math>m(y_0) = \sum_{p \in \mathcal P : U(p) = y_0} 2^{-\ell(p)}</math> where <math>\mathcal P</math> is the set of self-delimiting programs || Deterministic; prefix Turing machine || Discrete ||
|-
| Scholarpedia discrete universal a priori probability<ref name="scholarpedia">Marcus Hutter; Shane Legg; Paul M.B. Vitanyi. [http://www.scholarpedia.org/article/Algorithmic_probability "Algorithmic probability"]. ''Scholarpedia''. 2007.</ref> || <math>m(x) = \sum_{p:U(p)=x} 2^{-\ell(p)}</math> where the sum is over halting (i.e. self-delimiting) programs || Deterministic; prefix Turing machine || Discrete || In this notation, <math>U(p) = x</math> means that <math>U</math>, when run with program <math>p</math>, halts having just read all of <math>p</math> (and not beyond <math>p</math>). This means that "halting program" is the same thing as "self-delimiting program".
|-
| Scholarpedia continuous universal a priori probability<ref name="scholarpedia"/> || <math>M(x) = \sum_{p : U(p) = x*} 2^{-\ell(p)}</math> where the sum is over minimal programs || Deterministic; monotone Turing machine || Continuous ||
|-
| Sterkenburg (p. 22)<ref name="sterkenburg">Tom Florian Sterkenburg. "The Foundations of Solomonoff Prediction". February 2013.</ref> || <math>P_{\mathrm{I}}(\sigma) = \lim_{n\to\infty} \frac{|T_{\sigma,n}|}{|T_n|}</math> where <math>\sigma</math> is a finite string, <math>T_n</math> is the set of all halting (valid) inputs of length <math>n</math> to the reference machine <math>U</math>, and <math>T_{\sigma,n}</math> is the set of all halting (valid) inputs of length <math>n</math> that output something starting with <math>\sigma</math> || Deterministic; ordinary Turing machine || Discrete (I think because we are only considering halting inputs, the sample space is the set of all finite strings, which is a countable/discrete set) || This seems similar to Solomonoff's section 3.3. Also, I think this is different from the Scholarpedia <math>m(x)</math> because here we allow anything whose output starts with <math>\sigma</math> to count toward the probability, rather than exactly <math>\sigma</math>.
|-
| Sterkenburg (p. 24)<ref name="sterkenburg"/> || <math>P'_{\mathrm{II}}(\sigma) = 2^{-|\tau_\mathrm{min}|}</math> where <math>\tau_\mathrm{min}</math> is the shortest program <math>\tau</math> such that <math>U(\tau) = \sigma</math> (i.e. the shortest program that causes the reference machine to output <math>\sigma</math> and halt) || Deterministic; universal Turing machine, or universal prefix machine (to get a probability distribution; see remark on p. 27) || Discrete? || This formula does not define a probability distribution over strings <math>\sigma</math> because the sum of probabilities does not converge.
|-
| Sterkenburg (p. 25)<ref name="sterkenburg"/> || <math>P''_{\mathrm{II}}(\sigma) = \lim_{n\to\infty} \sum_{\tau \in T_{\sigma,n}} 2^{-|\tau|}</math> where <math>T_{\sigma,n}</math> is the set of all programs <math>\tau</math> of length <math>n</math> such that <math>U(\tau)</math> begins with <math>\sigma</math> || Deterministic; universal Turing machine || || <math>P''_{\mathrm{II}}(\sigma)</math> is divergent even for a single <math>\sigma</math>, so this is not actually a workable version, but is intended as a stepping stone.
|-
| Sterkenburg (p. 26)<ref name="sterkenburg"/> || <math>P_{\mathrm{II}}(\sigma) = \lim_{\epsilon\to 0} \lim_{n\to \infty} \sum_{\tau \in T_{\sigma,n}} \left(\frac{1-\epsilon}{2}\right)^{|\tau|}</math> || Deterministic; universal Turing machine, or universal prefix machine (to get a probability distribution; see remark on p. 27) || || The use of <math>\epsilon</math> is a hack to get the sum to converge.
|-
| Sterkenburg (p. 29)<ref name="sterkenburg"/> || <math>Q_U(\sigma) = \sum_{\tau \in T_\sigma} 2^{-|\tau|}</math> where <math>T_\sigma</math> is the set of minimal descriptions of <math>\sigma</math> (i.e. the set of programs that output something starting with <math>\sigma</math>, such that if one removes one bit from the end of the program, it no longer outputs something starting with <math>\sigma</math>) || Deterministic; universal monotone machine || Continuous? ||
|-
| Sterkenburg (p. 31)<ref name="sterkenburg"/> || <math>P_{\mathrm{IV}}(\sigma) = \lim_{n\to\infty} \sum_i f_{i,n}P_i(\sigma)</math> || Stochastic || || I think this is the same as Solomonoff's section 3.4, eq. 13, except that Solomonoff takes the limit of the <math>f_{i,n}</math> separately and then plugs it into the formula, rather than taking the limit of the sum.
|-
| Sterkenburg (p. 33)<ref name="sterkenburg"/> || <math>\xi_w(\sigma) = \sum_i w(\mu_i)\mu_i(\sigma)</math> || Stochastic; the weighting is unspecified, except for the requirement that <math>w(\mu_i)>0</math> for all <math>i</math> and that <math>\sum_i w(\mu_i) \leq 1</math>; the model class is all semicomputable semimeasures || ||
|-
| Arbital<ref name="eliezer-dialogue">[https://arbital.com/p/solomonoff_induction/?l=1hh "Solomonoff induction: Intro Dialogue (Math 2)"]. Arbital.</ref> || <math>\mathbb{S}\mathrm{ol}(s_{\preceq n}) = \sum_{\mathrm{prog} \in \mathcal{P}} 2^{-\mathrm{length}(\mathrm{prog})} \cdot \mathbb P_{\mathrm{prog}}(s_{\preceq n})</math> || Stochastic; the weighting is 1/2 to the power of the length of the program; the model class is all programs that define a probability measure (?), where the programs are given using a prefix-free code || ||
|-
| Solomonoff section 3.1, eq. 1<ref name="solomonoff">R. J. Solomonoff. [https://www.sciencedirect.com/science/article/pii/S0019995864902232 "A formal theory of inductive inference. Part I"]. 1964.</ref> || <math>\lim_{\epsilon\to0} \lim_{n\to\infty} \sum_{k=1}^{r^n} \sum_{i=1}^\infty ((1-\epsilon)/2)^{N_{(S_{TC_{n,k}})_i}}</math> (this might be incorrect). Solomonoff is using the notation <math>C_{n,k}</math> to mean that he doesn't care about what happens as long as the desired string <math>T</math> starts the output. In more modern notation we would say <math>M(p)=T*</math> or <math>T \preceq M(p)</math> rather than <math>M(p) = TC_{n,k}</math> (although note that with Solomonoff's notation, we are also restricting the length of the output, so in the modern notation we would have to also require that <math>|M(p)| = |T| + n</math>). || Deterministic; universal Turing machine || || Solomonoff originally gave the conditional probability of seeing <math>a</math> given we've already seen <math>T</math>, which he writes <math>P(a,T,M_1)</math> but which in more common notation would be something like <math>P_{M_1}(Ta\mid T)</math>.
|-
| Solomonoff section 3.2, eq. 7<ref name="solomonoff"/> || <math>\sum_{i=1}^\infty 2^{-N(T,i)}</math> where <math>N(T,i)</math> is the number of bits in the <math>i</math>th minimal program for <math>T</math> || Deterministic; universal monotone machine || ||
|-
| Solomonoff section 3.3, eq. 9 and 11<ref name="solomonoff"/> || <math>\lim_{R\to\infty} N_T/N_R</math>, where <math>N_R</math> is the number of programs of length <math>R</math> that cause the machine to halt eventually, and <math>N_T</math> is the number of those programs that cause the machine to output something starting with <math>T</math> || Deterministic; either a universal Turing machine or a universal monotone machine || ||
|-
| Solomonoff section 3.3, eq. 10<ref name="solomonoff"/> || <math>\sum_{n=1}^\infty \sum_{k=1}^{r^n} \sum_{i=1}^\infty ((1-\epsilon)/2)^{N_{(S_{TC_{n,k}})_i}}</math> || Deterministic || || Solomonoff says this follows from equation 10, but I'm not sure how. In more modern notation, I think this can be written as <math>\sum_{p:U(p)=T*} \left(\frac{1-\epsilon}{2}\right)^{|p|}</math>
|-
| Solomonoff section 3.4, eq. 13<ref name="solomonoff"/> || <math>\sum_{i=1}^\infty f_i P_i(T)</math> || Stochastic || || I think Solomonoff is using all computable measures (rather than enumerable semimeasures).
|-
| Legg<ref name="shane-legg">Shane Legg. [http://www.vetta.org/documents/legg-1996-solomonoff-induction.pdf "Solomonoff Induction"]. 2004.</ref> || <math>\mathbf M(x) = \sum_{\mu \in \mathcal M^R} Q^{-H(\mu)} \mu(x)</math> where <math>\mathcal M^R</math> is the set of all computable semimeasures, <math>Q</math> is the number of symbols in the alphabet, and <math>Q^{-H(\mu)}</math> is an approximation of <math>\sum_{\mathcal U(p,x)=\mu(x)} Q^{-|p|}</math> || Stochastic || || Legg is summing over all the semimeasures rather than programs that compute semimeasures, which is why he needs to define the weights <math>Q^{-H(\mu)}</math> rather than just taking <math>2^{-|p|}</math>.
|}
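For a concrete feel for the deterministic ("bottom-up") formulas above, here is a Python sketch that brute-forces a lower bound on a quantity like the Scholarpedia <math>m(x)</math>. The machine is entirely made up for illustration (the real definition uses a universal prefix machine, and <math>m</math> is only semicomputable from below), but its programs are self-delimiting in the same sense: a program counts only if the machine halts having read exactly the whole program.

<syntaxhighlight lang="python">
from collections import defaultdict
from itertools import product

def run_toy_machine(program):
    """Run a made-up 2-bit-opcode machine on a bit string. Returns the output
    if the machine halts having read exactly the whole program, else None."""
    out = []
    i = 0
    while i + 1 < len(program):
        op = program[i:i + 2]
        i += 2
        if op == "00":            # append a 0 to the output
            out.append("0")
        elif op == "01":          # append a 1 to the output
            out.append("1")
        elif op == "10":          # erase the last output bit, if any
            if out:
                out.pop()
        else:                     # "11" = HALT
            # Self-delimiting: the program must end exactly at the HALT opcode.
            return "".join(out) if i == len(program) else None
    return None                   # ran off the end without halting

def approx_m(max_len=16):
    """Lower-bound m(x) = sum_{p : U(p) = x} 2^-|p| by enumerating every
    program of up to max_len bits. (Every program of this toy machine halts
    within |p|/2 steps; a real machine would also need a step bound, which
    is why m can only be approximated from below.)"""
    m = defaultdict(float)
    for n in range(2, max_len + 1, 2):
        for bits in product("01", repeat=n):
            x = run_toy_machine("".join(bits))
            if x is not None:
                m[x] += 2.0 ** -n
    return dict(m)

print(sorted(approx_m().items(), key=lambda kv: -kv[1])[:5])
</syntaxhighlight>

Raising <code>max_len</code> only ever increases the estimates, matching the sense in which <math>m</math> is semicomputable from below; the strings with the shortest programs dominate the list.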



==References==

<references/>