<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://machinelearning.subwiki.org/w/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=IssaRice</id>
	<title>Machinelearning - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://machinelearning.subwiki.org/w/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=IssaRice"/>
	<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/wiki/Special:Contributions/IssaRice"/>
	<updated>2026-05-22T08:59:50Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.41.2</generator>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Taking_inf_and_sup_separately&amp;diff=3597</id>
		<title>User:IssaRice/Taking inf and sup separately</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Taking_inf_and_sup_separately&amp;diff=3597"/>
		<updated>2023-10-20T22:24:59Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes a trick that is sometimes helpful in analysis.&lt;br /&gt;
&lt;br /&gt;
==Statement==&lt;br /&gt;
&lt;br /&gt;
Let &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt; be bounded subsets of the real line. Suppose that for every &amp;lt;math&amp;gt;a\in A&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;b\in B&amp;lt;/math&amp;gt; we have &amp;lt;math&amp;gt;a\geq b&amp;lt;/math&amp;gt;. Then &amp;lt;math&amp;gt;\inf(A)\geq \sup(B)&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
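Actually, &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt; do not have to be assumed bounded: if both are nonempty, the hypothesis already makes every element of &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt; a lower bound of &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; and every element of &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; an upper bound of &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;. I think they can even be empty, provided we work in the extended reals with the conventions &amp;lt;math&amp;gt;\inf(\emptyset) = +\infty&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\sup(\emptyset) = -\infty&amp;lt;/math&amp;gt;.&lt;br /&gt;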
&lt;br /&gt;
==Proof==&lt;br /&gt;
&lt;br /&gt;
Let &amp;lt;math&amp;gt;a\in A&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;b\in B&amp;lt;/math&amp;gt; be arbitrary. We have by hypothesis &amp;lt;math&amp;gt;a\geq b&amp;lt;/math&amp;gt;. Since &amp;lt;math&amp;gt;b&amp;lt;/math&amp;gt; is arbitrary, we have that &amp;lt;math&amp;gt;a&amp;lt;/math&amp;gt; is an upper bound of the set &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;, so taking the supremum over &amp;lt;math&amp;gt;b&amp;lt;/math&amp;gt; we have &amp;lt;math&amp;gt;a \geq \sup(B)&amp;lt;/math&amp;gt; (remember, &amp;lt;math&amp;gt;\sup(B)&amp;lt;/math&amp;gt; is the &#039;&#039;least&#039;&#039; upper bound, whereas &amp;lt;math&amp;gt;a&amp;lt;/math&amp;gt; is just another upper bound). Since &amp;lt;math&amp;gt;a&amp;lt;/math&amp;gt; was arbitrary, we see that &amp;lt;math&amp;gt;\sup(B)&amp;lt;/math&amp;gt; is a lower bound of the set &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt;. Taking the infimum over &amp;lt;math&amp;gt;a&amp;lt;/math&amp;gt;, we have &amp;lt;math&amp;gt;\inf(A) \geq \sup(B)&amp;lt;/math&amp;gt;, as required.&lt;br /&gt;
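&lt;br /&gt;
As a quick sanity check (an illustrative example chosen here, not taken from any of the cited texts), take &amp;lt;math&amp;gt;A := \{1 + 1/n : n \geq 1\}&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;B := \{1 - 1/n : n \geq 1\}&amp;lt;/math&amp;gt;. Every &amp;lt;math&amp;gt;a\in A&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;b\in B&amp;lt;/math&amp;gt; satisfy &amp;lt;math&amp;gt;a &amp;gt; 1 &amp;gt; b&amp;lt;/math&amp;gt;, so the hypothesis holds, and indeed&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\inf(A) = 1 = \sup(B)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In particular, the conclusion cannot be improved to a strict inequality, even when &amp;lt;math&amp;gt;a &amp;gt; b&amp;lt;/math&amp;gt; is strict for every pair.&lt;br /&gt;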
&lt;br /&gt;
==Applications==&lt;br /&gt;
&lt;br /&gt;
===liminf vs limsup===&lt;br /&gt;
&lt;br /&gt;
(Notation from Tao&#039;s &#039;&#039;Analysis I&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
Let &amp;lt;math&amp;gt;(a_n)_{n=m}^\infty&amp;lt;/math&amp;gt; be a sequence of real numbers. Let &amp;lt;math&amp;gt;L^- := \liminf_{n\to\infty} a_n&amp;lt;/math&amp;gt; and let &amp;lt;math&amp;gt;L^+ := \limsup_{n\to\infty} a_n&amp;lt;/math&amp;gt;. Then we have &amp;lt;math&amp;gt;L^- \leq L^+&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Consider the sequences &amp;lt;math&amp;gt;(a^-_N)_{N=m}^\infty&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;(a^+_N)_{N=m}^\infty&amp;lt;/math&amp;gt; defined by &amp;lt;math&amp;gt;a^-_N := \inf(a_n)_{n=N}^\infty&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;a^+_N := \sup(a_n)_{n=N}^\infty&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Now consider the sets &amp;lt;math&amp;gt;A := \{a^+_N : N \geq m\}&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;B := \{a^-_N : N \geq m\}&amp;lt;/math&amp;gt;. If we can show that &amp;lt;math&amp;gt;a^+_j \geq a^-_k&amp;lt;/math&amp;gt; for arbitrary &amp;lt;math&amp;gt;j,k\geq m&amp;lt;/math&amp;gt;, then we can apply the trick to these sets to conclude that &amp;lt;math&amp;gt;L^+ = \inf(a^+_N)_{N=m}^\infty = \inf(A) \geq \sup(B) = \sup(a^-_N)_{N=m}^\infty = L^-&amp;lt;/math&amp;gt;.&lt;br /&gt;
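&lt;br /&gt;
One way to check the hypothesis is sketched here, using only the definitions above (the auxiliary index &amp;lt;math&amp;gt;n&amp;lt;/math&amp;gt; and the dummy index &amp;lt;math&amp;gt;i&amp;lt;/math&amp;gt; are introduced just for this illustration). Fix &amp;lt;math&amp;gt;j,k\geq m&amp;lt;/math&amp;gt; and let &amp;lt;math&amp;gt;n := \max(j,k)&amp;lt;/math&amp;gt;. Since &amp;lt;math&amp;gt;n \geq j&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;n \geq k&amp;lt;/math&amp;gt;, the term &amp;lt;math&amp;gt;a_n&amp;lt;/math&amp;gt; appears in both tails, so&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;a^+_j = \sup(a_i)_{i=j}^\infty \geq a_n \geq \inf(a_i)_{i=k}^\infty = a^-_k&amp;lt;/math&amp;gt;&lt;br /&gt;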
&lt;br /&gt;
===Comparison principle===&lt;br /&gt;
&lt;br /&gt;
This technique, in modified form where we take two sups separately or two infs separately, can also be used to show that if &amp;lt;math&amp;gt;a_n \leq b_n&amp;lt;/math&amp;gt; for all &amp;lt;math&amp;gt;n&amp;lt;/math&amp;gt;, then &amp;lt;math&amp;gt;\sup(a_n)_{n=0}^\infty \leq \sup(b_n)_{n=0}^\infty&amp;lt;/math&amp;gt;, &amp;lt;math&amp;gt;\inf(a_n)_{n=0}^\infty \leq \inf(b_n)_{n=0}^\infty&amp;lt;/math&amp;gt;, &amp;lt;math&amp;gt;\limsup_{n\to\infty} a_n \leq \limsup_{n\to\infty} b_n&amp;lt;/math&amp;gt;, and &amp;lt;math&amp;gt;\liminf_{n\to\infty} a_n \leq \liminf_{n\to\infty} b_n&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
===Lower and upper Riemann integral===&lt;br /&gt;
&lt;br /&gt;
(Notation from Tao&#039;s &#039;&#039;Analysis I&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
Let &amp;lt;math&amp;gt;I&amp;lt;/math&amp;gt; be a bounded interval on the real line, and let &amp;lt;math&amp;gt;f : I \to \mathbf R&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We have&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\overline{\int}_I f := \inf\left\{p.c.\int_I g : g\text{ is a p.c. function on }I\text{ that majorizes }f\right\}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\underline{\int}_I f := \sup\left\{p.c.\int_I g : g\text{ is a p.c. function on }I\text{ that minorizes }f\right\}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We want to show &amp;lt;math&amp;gt;\underline{\int}_I f \leq \overline{\int}_I f&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Define&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;A := \left\{p.c.\int_I g : g\text{ is a p.c. function on }I\text{ that majorizes }f\right\}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;B := \left\{p.c.\int_I g : g\text{ is a p.c. function on }I\text{ that minorizes }f\right\}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then we have &amp;lt;math&amp;gt;\overline{\int}_I f = \inf(A)&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\underline{\int}_I f = \sup(B)&amp;lt;/math&amp;gt;. To apply the trick all we need to do is to let &amp;lt;math&amp;gt;g&amp;lt;/math&amp;gt; be a p.c. function on &amp;lt;math&amp;gt;I&amp;lt;/math&amp;gt; that majorizes &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt;, and let &amp;lt;math&amp;gt;h&amp;lt;/math&amp;gt; be a p.c. function on &amp;lt;math&amp;gt;I&amp;lt;/math&amp;gt; that minorizes &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt;, and show that &amp;lt;math&amp;gt;p.c.\int_I g\geq p.c.\int_I h&amp;lt;/math&amp;gt;.&lt;br /&gt;
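&lt;br /&gt;
A sketch of that last step, assuming the monotonicity property of the piecewise constant integral (if one p.c. function majorizes another on &amp;lt;math&amp;gt;I&amp;lt;/math&amp;gt;, then its p.c. integral is at least as large): since &amp;lt;math&amp;gt;g&amp;lt;/math&amp;gt; majorizes &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;h&amp;lt;/math&amp;gt; minorizes &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt;, for every &amp;lt;math&amp;gt;x \in I&amp;lt;/math&amp;gt; we have&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;g(x) \geq f(x) \geq h(x)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
so &amp;lt;math&amp;gt;g&amp;lt;/math&amp;gt; majorizes &amp;lt;math&amp;gt;h&amp;lt;/math&amp;gt;, and monotonicity gives &amp;lt;math&amp;gt;p.c.\int_I g\geq p.c.\int_I h&amp;lt;/math&amp;gt;.&lt;br /&gt;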
&lt;br /&gt;
==Alternating series test==&lt;br /&gt;
&lt;br /&gt;
(This one is more of a failed application.)&lt;br /&gt;
&lt;br /&gt;
Each even partial sum is at least as large as each odd partial sum, so the inf over the even partial sums is at least as large as the sup over the odd partial sums. This actually isn&#039;t strong enough to prove what we want, since the trick only yields an inequality, not that the two kinds of partial sums approach a common limit. We actually need the stronger condition that the even partial sums are a decreasing sequence, that the odd partial sums are an increasing sequence, and that eventually their difference becomes arbitrarily small.&lt;br /&gt;
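&lt;br /&gt;
To see concretely why the inequality alone is not enough, here is an illustrative example (the series and the indexing convention &amp;lt;math&amp;gt;S_N := \sum_{n=0}^N (-1)^n&amp;lt;/math&amp;gt; are assumptions made for this example). For every &amp;lt;math&amp;gt;k \geq 0&amp;lt;/math&amp;gt; we have&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;S_{2k} = 1, \qquad S_{2k+1} = 0&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Every even partial sum is at least as large as every odd partial sum, and the trick correctly gives &amp;lt;math&amp;gt;\inf_k S_{2k} = 1 \geq 0 = \sup_k S_{2k+1}&amp;lt;/math&amp;gt;, yet the series diverges because the gap between consecutive partial sums never shrinks.&lt;br /&gt;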
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
After I wrote this page, I found the same theorem in Apostol&#039;s &#039;&#039;Calculus&#039;&#039; (volume 1, 2nd edition, p. 28) in the section &amp;quot;Fundamental properties of the supremum and infimum&amp;quot;.&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Taking_inf_and_sup_separately&amp;diff=3596</id>
		<title>User:IssaRice/Taking inf and sup separately</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Taking_inf_and_sup_separately&amp;diff=3596"/>
		<updated>2023-10-20T22:23:45Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes a trick that is sometimes helpful in analysis.&lt;br /&gt;
&lt;br /&gt;
==Statement==&lt;br /&gt;
&lt;br /&gt;
Let &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt; be bounded subsets of the real line. Suppose that for every &amp;lt;math&amp;gt;a\in A&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;b\in B&amp;lt;/math&amp;gt; we have &amp;lt;math&amp;gt;a\geq b&amp;lt;/math&amp;gt;. Then &amp;lt;math&amp;gt;\inf(A)\geq \sup(B)&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
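Actually, &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt; do not have to be assumed bounded: if both are nonempty, the hypothesis already makes every element of &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt; a lower bound of &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; and every element of &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; an upper bound of &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;. I think they can even be empty, provided we work in the extended reals with the conventions &amp;lt;math&amp;gt;\inf(\emptyset) = +\infty&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\sup(\emptyset) = -\infty&amp;lt;/math&amp;gt;.&lt;br /&gt;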
&lt;br /&gt;
==Proof==&lt;br /&gt;
&lt;br /&gt;
Let &amp;lt;math&amp;gt;a\in A&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;b\in B&amp;lt;/math&amp;gt; be arbitrary. We have by hypothesis &amp;lt;math&amp;gt;a\geq b&amp;lt;/math&amp;gt;. Since &amp;lt;math&amp;gt;b&amp;lt;/math&amp;gt; is arbitrary, we have that &amp;lt;math&amp;gt;a&amp;lt;/math&amp;gt; is an upper bound of the set &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;, so taking the supremum over &amp;lt;math&amp;gt;b&amp;lt;/math&amp;gt; we have &amp;lt;math&amp;gt;a \geq \sup(B)&amp;lt;/math&amp;gt; (remember, &amp;lt;math&amp;gt;\sup(B)&amp;lt;/math&amp;gt; is the &#039;&#039;least&#039;&#039; upper bound, whereas &amp;lt;math&amp;gt;a&amp;lt;/math&amp;gt; is just another upper bound). Since &amp;lt;math&amp;gt;a&amp;lt;/math&amp;gt; was arbitrary, we see that &amp;lt;math&amp;gt;\sup(B)&amp;lt;/math&amp;gt; is a lower bound of the set &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt;. Taking the infimum over &amp;lt;math&amp;gt;a&amp;lt;/math&amp;gt;, we have &amp;lt;math&amp;gt;\inf(A) \geq \sup(B)&amp;lt;/math&amp;gt;, as required.&lt;br /&gt;
&lt;br /&gt;
==Applications==&lt;br /&gt;
&lt;br /&gt;
===liminf vs limsup===&lt;br /&gt;
&lt;br /&gt;
(Notation from Tao&#039;s &#039;&#039;Analysis I&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
Let &amp;lt;math&amp;gt;(a_n)_{n=m}^\infty&amp;lt;/math&amp;gt; be a sequence of real numbers. Let &amp;lt;math&amp;gt;L^- := \liminf_{n\to\infty} a_n&amp;lt;/math&amp;gt; and let &amp;lt;math&amp;gt;L^+ := \limsup_{n\to\infty} a_n&amp;lt;/math&amp;gt;. Then we have &amp;lt;math&amp;gt;L^- \leq L^+&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Consider the sequences &amp;lt;math&amp;gt;(a^-_N)_{N=m}^\infty&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;(a^+_N)_{N=m}^\infty&amp;lt;/math&amp;gt; defined by &amp;lt;math&amp;gt;a^-_N := \inf(a_n)_{n=N}^\infty&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;a^+_N := \sup(a_n)_{n=N}^\infty&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Now consider the sets &amp;lt;math&amp;gt;A := \{a^+_N : N \geq m\}&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;B := \{a^-_N : N \geq m\}&amp;lt;/math&amp;gt;. If we can show that &amp;lt;math&amp;gt;a^+_j \geq a^-_k&amp;lt;/math&amp;gt; for arbitrary &amp;lt;math&amp;gt;j,k\geq m&amp;lt;/math&amp;gt;, then we can apply the trick to these sets to conclude that &amp;lt;math&amp;gt;L^+ = \inf(a^+_N)_{N=m}^\infty = \inf(A) \geq \sup(B) = \sup(a^-_N)_{N=m}^\infty = L^-&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
===Comparison principle===&lt;br /&gt;
&lt;br /&gt;
This technique can also be used to show that if &amp;lt;math&amp;gt;a_n \leq b_n&amp;lt;/math&amp;gt; for all &amp;lt;math&amp;gt;n&amp;lt;/math&amp;gt;, then &amp;lt;math&amp;gt;\sup(a_n)_{n=0}^\infty \leq \sup(b_n)_{n=0}^\infty&amp;lt;/math&amp;gt;, &amp;lt;math&amp;gt;\inf(a_n)_{n=0}^\infty \leq \inf(b_n)_{n=0}^\infty&amp;lt;/math&amp;gt;, &amp;lt;math&amp;gt;\limsup_{n\to\infty} a_n \leq \limsup_{n\to\infty} b_n&amp;lt;/math&amp;gt;, and &amp;lt;math&amp;gt;\liminf_{n\to\infty} a_n \leq \liminf_{n\to\infty} b_n&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
===Lower and upper Riemann integral===&lt;br /&gt;
&lt;br /&gt;
(Notation from Tao&#039;s &#039;&#039;Analysis I&#039;&#039;.)&lt;br /&gt;
&lt;br /&gt;
Let &amp;lt;math&amp;gt;I&amp;lt;/math&amp;gt; be a bounded interval on the real line, and let &amp;lt;math&amp;gt;f : I \to \mathbf R&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We have&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\overline{\int}_I f := \inf\left\{p.c.\int_I g : g\text{ is a p.c. function on }I\text{ that majorizes }f\right\}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\underline{\int}_I f := \sup\left\{p.c.\int_I g : g\text{ is a p.c. function on }I\text{ that minorizes }f\right\}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We want to show &amp;lt;math&amp;gt;\underline{\int}_I f \leq \overline{\int}_I f&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Define&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;A := \left\{p.c.\int_I g : g\text{ is a p.c. function on }I\text{ that majorizes }f\right\}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;B := \left\{p.c.\int_I g : g\text{ is a p.c. function on }I\text{ that minorizes }f\right\}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then we have &amp;lt;math&amp;gt;\overline{\int}_I f = \inf(A)&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\underline{\int}_I f = \sup(B)&amp;lt;/math&amp;gt;. To apply the trick all we need to do is to let &amp;lt;math&amp;gt;g&amp;lt;/math&amp;gt; be a p.c. function on &amp;lt;math&amp;gt;I&amp;lt;/math&amp;gt; that majorizes &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt;, and let &amp;lt;math&amp;gt;h&amp;lt;/math&amp;gt; be a p.c. function on &amp;lt;math&amp;gt;I&amp;lt;/math&amp;gt; that minorizes &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt;, and show that &amp;lt;math&amp;gt;p.c.\int_I g\geq p.c.\int_I h&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Alternating series test==&lt;br /&gt;
&lt;br /&gt;
(This one is more of a failed application.)&lt;br /&gt;
&lt;br /&gt;
Each even partial sum is at least as large as each odd partial sum, so the inf over the even partial sums is at least as large as the sup over the odd partial sums. This actually isn&#039;t strong enough to prove what we want, since the trick only yields an inequality, not that the two kinds of partial sums approach a common limit. We actually need the stronger condition that the even partial sums are a decreasing sequence, that the odd partial sums are an increasing sequence, and that eventually their difference becomes arbitrarily small.&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
After I wrote this page, I found the same theorem in Apostol&#039;s &#039;&#039;Calculus&#039;&#039; (volume 1, 2nd edition, p. 28) in the section &amp;quot;Fundamental properties of the supremum and infimum&amp;quot;.&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Computability_and_logic/Property_table_of_theories_of_arithmetic&amp;diff=3595</id>
		<title>User:IssaRice/Computability and logic/Property table of theories of arithmetic</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Computability_and_logic/Property_table_of_theories_of_arithmetic&amp;diff=3595"/>
		<updated>2023-09-13T20:26:41Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{| class=&amp;quot;sortable wikitable&amp;quot;&lt;br /&gt;
! Theory !! Finitely axiomatized? !! Recursively axiomatized? !! Negation-complete? !! Sound? !! Consistent?&lt;br /&gt;
|-&lt;br /&gt;
| Baby Arithmetic || No || Yes || Yes || ||&lt;br /&gt;
|-&lt;br /&gt;
| True Arithmetic || No || No || Yes || ||&lt;br /&gt;
|-&lt;br /&gt;
| Peano Arithmetic || No || Yes || No || ||&lt;br /&gt;
|-&lt;br /&gt;
| Robinson Arithmetic || Yes || Yes || No || ||&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
==See also==&lt;br /&gt;
&lt;br /&gt;
* [[User:IssaRice/Computability and logic/Summary table of sets in computability]]&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Computability_and_logic/Property_table_of_theories_of_arithmetic&amp;diff=3594</id>
		<title>User:IssaRice/Computability and logic/Property table of theories of arithmetic</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Computability_and_logic/Property_table_of_theories_of_arithmetic&amp;diff=3594"/>
		<updated>2023-09-13T20:25:26Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{| class=&amp;quot;sortable wikitable&amp;quot;&lt;br /&gt;
! Theory !! Finitely axiomatized? !! Recursively axiomatized? !! Negation-complete? !! Sound? !! Consistent?&lt;br /&gt;
|-&lt;br /&gt;
| Baby Arithmetic&lt;br /&gt;
|-&lt;br /&gt;
| True Arithmetic&lt;br /&gt;
|-&lt;br /&gt;
| Peano Arithmetic&lt;br /&gt;
|-&lt;br /&gt;
| Robinson Arithmetic&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
==See also==&lt;br /&gt;
&lt;br /&gt;
* [[User:IssaRice/Computability and logic/Summary table of sets in computability]]&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Computability_and_logic/Summary_table_of_sets_in_computability&amp;diff=3593</id>
		<title>User:IssaRice/Computability and logic/Summary table of sets in computability</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Computability_and_logic/Summary_table_of_sets_in_computability&amp;diff=3593"/>
		<updated>2023-09-13T20:25:23Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{| class=&amp;quot;sortable wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Set !! Enumerable (countable)? !! Recursive? !! Primitive recursive? !! Recursively enumerable/semirecursive? !! Co-recursively enumerable?&lt;br /&gt;
|-&lt;br /&gt;
| Set of natural numbers || Yes || Yes || Yes || Yes || Yes&lt;br /&gt;
|-&lt;br /&gt;
| Set of even positive integers || Yes || Yes || Yes || Yes || Yes&lt;br /&gt;
|-&lt;br /&gt;
| Set of rational numbers || Yes || Yes || Yes || Yes || Yes&lt;br /&gt;
|-&lt;br /&gt;
| Set of real numbers || No || No || No || No ||&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;math&amp;gt;\mathsf{HALT} = \{\langle m, n\rangle : \text{Machine } m \text{ halts on input }n\}&amp;lt;/math&amp;gt; || Yes || No || No || Yes || No&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;math&amp;gt;K = \{x : x \in W_x\}&amp;lt;/math&amp;gt; (where &amp;lt;math&amp;gt;W_x&amp;lt;/math&amp;gt; is the domain of &amp;lt;math&amp;gt;\varphi_x&amp;lt;/math&amp;gt;) || Yes || No || No || Yes || No&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;math&amp;gt;\overline{K} = \overline{\{x : x \in W_x\}} = \{x : x \notin W_x\}&amp;lt;/math&amp;gt; || Yes || No || No || No || Yes&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;math&amp;gt;\{x : \varphi_x(x) = 0\}&amp;lt;/math&amp;gt;&amp;lt;ref group=notes&amp;gt;See Stillwell&#039;s Reverse Mathematics, p. 76 for discussion.&amp;lt;/ref&amp;gt; || Yes || No || No || Yes || No&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;math&amp;gt;\{x : \varphi_x(x) = 1\}&amp;lt;/math&amp;gt; || Yes || No || No || Yes || No&lt;br /&gt;
|-&lt;br /&gt;
| Set of theorems of Peano arithmetic || Yes || No || No || Yes || No&lt;br /&gt;
|-&lt;br /&gt;
| Set of truths of arithmetic ([[wikipedia:True arithmetic|true arithmetic]]) || Yes || No || No || No || No&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Add expressible and capturable as columns to the table?&lt;br /&gt;
&lt;br /&gt;
I think one thing to stress is that some sets are bad on account of their size (i.e. they are &amp;quot;too big&amp;quot;) while other sets are bad on account of their &#039;&#039;shape&#039;&#039;. For instance, subsets of the natural numbers that are not recursive are fully contained in sets that &#039;&#039;are&#039;&#039; recursive; so it is not that they are &amp;quot;too big&amp;quot; (because even some proper supersets are recursive), but that their shape is too intricate. On the other hand, a set like &amp;lt;math&amp;gt;\mathbf R&amp;lt;/math&amp;gt; just has too many elements, so it is not the shape that matters.&lt;br /&gt;
&lt;br /&gt;
==Notes==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references group=&amp;quot;notes&amp;quot;/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==See also==&lt;br /&gt;
&lt;br /&gt;
* [[User:IssaRice/Computability and logic/Property table of theories of arithmetic]]&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Computability_and_logic/Property_table_of_theories_of_arithmetic&amp;diff=3592</id>
		<title>User:IssaRice/Computability and logic/Property table of theories of arithmetic</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Computability_and_logic/Property_table_of_theories_of_arithmetic&amp;diff=3592"/>
		<updated>2023-09-13T20:24:24Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{| class=&amp;quot;sortable wikitable&amp;quot;&lt;br /&gt;
! Theory !! Finitely axiomatized? !! Recursively axiomatized? !! Negation-complete? !! Sound? !! Consistent?&lt;br /&gt;
|-&lt;br /&gt;
| Baby Arithmetic&lt;br /&gt;
|-&lt;br /&gt;
| True Arithmetic&lt;br /&gt;
|-&lt;br /&gt;
| Peano Arithmetic&lt;br /&gt;
|-&lt;br /&gt;
| Robinson Arithmetic&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Computability_and_logic/Property_table_of_theories_of_arithmetic&amp;diff=3591</id>
		<title>User:IssaRice/Computability and logic/Property table of theories of arithmetic</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Computability_and_logic/Property_table_of_theories_of_arithmetic&amp;diff=3591"/>
		<updated>2023-09-13T20:24:11Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: Created page with &amp;quot;{| ! Theory !! Finitely axiomatized? !! Recursively axiomatized? !! Negation-complete? !! Sound? !! Consistent? |- | Baby Arithmetic |- | True Arithmetic |- | Peano Arithmetic...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{|&lt;br /&gt;
! Theory !! Finitely axiomatized? !! Recursively axiomatized? !! Negation-complete? !! Sound? !! Consistent?&lt;br /&gt;
|-&lt;br /&gt;
| Baby Arithmetic&lt;br /&gt;
|-&lt;br /&gt;
| True Arithmetic&lt;br /&gt;
|-&lt;br /&gt;
| Peano Arithmetic&lt;br /&gt;
|-&lt;br /&gt;
| Robinson Arithmetic&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Linear_algebra/Equivalent_statements_for_injectivity_and_surjectivity&amp;diff=3590</id>
		<title>User:IssaRice/Linear algebra/Equivalent statements for injectivity and surjectivity</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Linear_algebra/Equivalent_statements_for_injectivity_and_surjectivity&amp;diff=3590"/>
		<updated>2023-08-30T19:42:11Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Let &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; be an &amp;lt;math&amp;gt;m \times n&amp;lt;/math&amp;gt; matrix. That&#039;s a matrix with &amp;lt;math&amp;gt;m&amp;lt;/math&amp;gt; rows and &amp;lt;math&amp;gt;n&amp;lt;/math&amp;gt; columns, which you can also think of as a map &amp;lt;math&amp;gt;\mathbf R^n \to \mathbf R^m&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! Injective !! Surjective !! Bijective&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; is injective || &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; is surjective || &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; is bijective&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; has a left inverse || &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; has a right inverse || &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; has both a left and right inverse (which turn out to be the same)&lt;br /&gt;
|-&lt;br /&gt;
| for each &amp;lt;math&amp;gt;b&amp;lt;/math&amp;gt;, the equation &amp;lt;math&amp;gt;Ax = b&amp;lt;/math&amp;gt; has at most one solution (in other words, a solution may not exist, but if it does, it is unique) || for each &amp;lt;math&amp;gt;b&amp;lt;/math&amp;gt;, the equation &amp;lt;math&amp;gt;Ax = b&amp;lt;/math&amp;gt; has at least one solution (in other words, a solution always exists, but it may not be unique) || for each &amp;lt;math&amp;gt;b&amp;lt;/math&amp;gt;, the equation &amp;lt;math&amp;gt;Ax = b&amp;lt;/math&amp;gt; has exactly one solution (in other words, a solution always exists, and it is unique)&lt;br /&gt;
|-&lt;br /&gt;
| the columns of &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; are linearly independent || the columns of &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; span &amp;lt;math&amp;gt;\mathbf R^m&amp;lt;/math&amp;gt; || the columns of &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; are a basis of &amp;lt;math&amp;gt;\mathbf R^m&amp;lt;/math&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| the rows of &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; span &amp;lt;math&amp;gt;\mathbf R^n&amp;lt;/math&amp;gt; || the rows of &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; are linearly independent || the rows of &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; are a basis of &amp;lt;math&amp;gt;\mathbf R^n&amp;lt;/math&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| || || &amp;lt;math&amp;gt;\det(A) \ne 0&amp;lt;/math&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; has rank &amp;lt;math&amp;gt;n&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; has rank &amp;lt;math&amp;gt;m&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; has rank &amp;lt;math&amp;gt;n=m&amp;lt;/math&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| in the row echelon form of &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt;, there is a pivot in every column || in the row echelon form of &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt;, there is a pivot in every row || in the row echelon form of &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt;, there is a pivot in every column and every row&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;math&amp;gt;\operatorname{null} A = \{0\}&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\operatorname{range} A = \mathbf R^m&amp;lt;/math&amp;gt; ||&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;math&amp;gt;\dim \operatorname{null} A = 0&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\dim \operatorname{null} A = n-m&amp;lt;/math&amp;gt; ||&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;math&amp;gt;\dim \operatorname{range} A = n&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\dim \operatorname{range} A = m&amp;lt;/math&amp;gt; ||&lt;br /&gt;
|}&lt;br /&gt;
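&lt;br /&gt;
As a small worked example of the first column (an illustration chosen here, not taken from the linked references), let &amp;lt;math&amp;gt;m = 3&amp;lt;/math&amp;gt;, &amp;lt;math&amp;gt;n = 2&amp;lt;/math&amp;gt;, and let &amp;lt;math&amp;gt;A : \mathbf R^2 \to \mathbf R^3&amp;lt;/math&amp;gt; be the map&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;A(x_1, x_2) = (x_1, x_2, 0)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Its columns &amp;lt;math&amp;gt;(1,0,0)&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;(0,1,0)&amp;lt;/math&amp;gt; are linearly independent, so &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; has rank &amp;lt;math&amp;gt;2 = n&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\operatorname{null} A = \{0\}&amp;lt;/math&amp;gt;, which makes &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; injective. The map &amp;lt;math&amp;gt;B(y_1, y_2, y_3) = (y_1, y_2)&amp;lt;/math&amp;gt; is a left inverse, since &amp;lt;math&amp;gt;B(A(x_1, x_2)) = (x_1, x_2)&amp;lt;/math&amp;gt;. On the other hand, the rank is not &amp;lt;math&amp;gt;m = 3&amp;lt;/math&amp;gt;, so &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; is not surjective: the equation &amp;lt;math&amp;gt;Ax = (0,0,1)&amp;lt;/math&amp;gt; has no solution.&lt;br /&gt;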
&lt;br /&gt;
==Characterizations of injectivity==&lt;br /&gt;
&lt;br /&gt;
===left inverse===&lt;br /&gt;
&lt;br /&gt;
===Ax=b has at most one solution===&lt;br /&gt;
&lt;br /&gt;
===linearly independent columns===&lt;br /&gt;
&lt;br /&gt;
===spanning rows===&lt;br /&gt;
&lt;br /&gt;
===rank n===&lt;br /&gt;
&lt;br /&gt;
===pivot in every column===&lt;br /&gt;
&lt;br /&gt;
===null space = {0}===&lt;br /&gt;
&lt;br /&gt;
===zero-dimensional null space===&lt;br /&gt;
&lt;br /&gt;
===dimension of range = n===&lt;br /&gt;
&lt;br /&gt;
==External links==&lt;br /&gt;
&lt;br /&gt;
* http://davidjekel.com/wp-content/uploads/2019/07/Linear_Algebra_Equivalences.pdf&lt;br /&gt;
* Hubbard and Hubbard&#039;s &#039;&#039;Vector Calculus, Linear Algebra, and Differential Forms: A Unified Approach&#039;&#039; also has a similar table in section 2.5 (kernels, images, and the dimension formula).&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Strength_of_a_mathematical_statement&amp;diff=3589</id>
		<title>User:IssaRice/Strength of a mathematical statement</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Strength_of_a_mathematical_statement&amp;diff=3589"/>
		<updated>2023-07-02T20:42:21Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: /* External links */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;In mathematics, one talks about statements being &amp;quot;stronger&amp;quot; than others, &amp;quot;more general&amp;quot; than others, a method being &amp;quot;more powerful&amp;quot; than others, etc. This page tries to point out some of the subtleties of this way of speaking.&lt;br /&gt;
&lt;br /&gt;
==Interaction of negation and strength==&lt;br /&gt;
&lt;br /&gt;
Negating a strong statement produces a weak statement, and negating a weak statement produces a strong statement. If a statement has strong and weak components, then the flip occurs at each stage. For example, in &amp;lt;math display=&amp;quot;inline&amp;quot;&amp;gt;\forall x W(x)&amp;lt;/math&amp;gt; with &amp;lt;math display=&amp;quot;inline&amp;quot;&amp;gt;W(x)&amp;lt;/math&amp;gt; a weak statement, negating it produces &amp;lt;math display=&amp;quot;inline&amp;quot;&amp;gt;\exists x \neg W(x)&amp;lt;/math&amp;gt;, where the strong &amp;lt;math display=&amp;quot;inline&amp;quot;&amp;gt;\forall x&amp;lt;/math&amp;gt; has become the weak &amp;lt;math display=&amp;quot;inline&amp;quot;&amp;gt;\exists x&amp;lt;/math&amp;gt;, and the weak &amp;lt;math display=&amp;quot;inline&amp;quot;&amp;gt;W(x)&amp;lt;/math&amp;gt; has become a strong &amp;lt;math display=&amp;quot;inline&amp;quot;&amp;gt;\neg W(x)&amp;lt;/math&amp;gt;. See Gowers&#039;s posts for more discussion on this.&lt;br /&gt;
&lt;br /&gt;
==Strong vs subset==&lt;br /&gt;
&lt;br /&gt;
A puzzle: why do we say P is stronger than Q if P is a subset of Q, but we also say that a theorem is stronger if it is more general (so bigger)?&lt;br /&gt;
&lt;br /&gt;
* One reply/intuition uses something like possible world semantics, e.g. see Wei Dai&#039;s post on Aumann&#039;s agreement theorem.&amp;lt;ref&amp;gt;Wei Dai. [https://www.lesswrong.com/posts/JdK3kr4ug9kJvKzGy/probability-space-and-aumann-agreement &amp;quot;Probability Space &amp;amp; Aumann Agreement&amp;quot;]. December 10, 2009.&amp;lt;/ref&amp;gt; There is just one possible world (a single &amp;lt;math&amp;gt;\omega \in \Omega&amp;lt;/math&amp;gt;), but our information state is the set of all possible worlds that we cannot distinguish, so the less we know, the more possible worlds we think we could be in.&lt;br /&gt;
* One visualization is to use a Venn diagram. The stronger the statement, the more our movement is restricted, as we are forced to be in more and more sets.&lt;br /&gt;
* When we say a strong statement like &amp;lt;math&amp;gt;\forall x P(x)&amp;lt;/math&amp;gt;, we are saying &amp;lt;math&amp;gt;P(x_1) \wedge P(x_2) \wedge \cdots \wedge P(x_n)&amp;lt;/math&amp;gt;. When we say a weak statement like &amp;lt;math&amp;gt;\exists x P(x)&amp;lt;/math&amp;gt;, we are saying &amp;lt;math&amp;gt;P(x_1) \vee P(x_2) \vee \cdots \vee P(x_n)&amp;lt;/math&amp;gt;. It seems like in both cases we are accumulating more and more things.&lt;br /&gt;
* But if we&#039;re working in a proof system, &amp;lt;math display=&amp;quot;inline&amp;quot;&amp;gt;\forall x P(x)&amp;lt;/math&amp;gt; means we have all of &amp;lt;math display=&amp;quot;inline&amp;quot;&amp;gt;P(x_1), \ldots, P(x_n)&amp;lt;/math&amp;gt; separately, whereas with &amp;lt;math display=&amp;quot;inline&amp;quot;&amp;gt;\exists x P(x)&amp;lt;/math&amp;gt; we only have one long statement &amp;lt;math&amp;gt;P(x_1) \vee P(x_2) \vee \cdots \vee P(x_n)&amp;lt;/math&amp;gt;.&lt;br /&gt;
* In causal inference, I think &amp;lt;math display=&amp;quot;inline&amp;quot;&amp;gt;X \perp\!\!\!\perp Y\cup W&amp;lt;/math&amp;gt; is stronger than &amp;lt;math display=&amp;quot;inline&amp;quot;&amp;gt;(X \perp\!\!\!\perp Y) \vee (X \perp\!\!\!\perp W)&amp;lt;/math&amp;gt;, even though both seem to use a single &amp;quot;or&amp;quot;-type operation. But if &amp;lt;math display=&amp;quot;inline&amp;quot;&amp;gt;Y&amp;lt;/math&amp;gt; and &amp;lt;math display=&amp;quot;inline&amp;quot;&amp;gt;W&amp;lt;/math&amp;gt; are disjoint, then I think the former is true while the latter may be false. I think this is similar to how &amp;lt;math display=&amp;quot;inline&amp;quot;&amp;gt;\forall x\in X(P(x))&amp;lt;/math&amp;gt; is usually stronger than &amp;lt;math display=&amp;quot;inline&amp;quot;&amp;gt;\exists x\in X(P(x))&amp;lt;/math&amp;gt;, unless &amp;lt;math display=&amp;quot;inline&amp;quot;&amp;gt;X = \emptyset&amp;lt;/math&amp;gt;.&lt;br /&gt;
* Maybe another way to state the puzzle is this: &amp;quot;P is stronger than Q&amp;quot; ↔ &amp;quot;P implies Q&amp;quot; ↔ &amp;quot;Q is at least as true as P&amp;quot; ↔ &amp;quot;Q ≥ P&amp;quot; (as truth values T=1 and F=0) ↔ &amp;quot;Q is &#039;at least as powerful as&#039; P&amp;quot;! Obviously, the last link is the problem.&lt;br /&gt;
* Let&#039;s say we have &amp;lt;math display=&amp;quot;inline&amp;quot;&amp;gt;P(x) \wedge P(y) \wedge P(z)&amp;lt;/math&amp;gt;. Then we can deduce &amp;lt;math display=&amp;quot;inline&amp;quot;&amp;gt;P(y)&amp;lt;/math&amp;gt;. So we can say &amp;lt;math display=&amp;quot;inline&amp;quot;&amp;gt;(P(x) \wedge P(y) \wedge P(z)) \implies P(y)&amp;lt;/math&amp;gt;. Let&#039;s visualize this by drawing each of &amp;lt;math display=&amp;quot;inline&amp;quot;&amp;gt;P(x), P(y), P(z)&amp;lt;/math&amp;gt; as points. Then if we know &amp;lt;math display=&amp;quot;inline&amp;quot;&amp;gt;P(x) \wedge P(y) \wedge P(z)&amp;lt;/math&amp;gt;, the set of statements we know is &amp;lt;math display=&amp;quot;inline&amp;quot;&amp;gt;A := \{P(x), P(y), P(z)\}&amp;lt;/math&amp;gt;. The set of statements we are trying to prove is &amp;lt;math display=&amp;quot;inline&amp;quot;&amp;gt;B := \{P(y)\}&amp;lt;/math&amp;gt;. But now notice something strange: &amp;lt;math display=&amp;quot;inline&amp;quot;&amp;gt;A&amp;lt;/math&amp;gt; is stronger than &amp;lt;math display=&amp;quot;inline&amp;quot;&amp;gt;B&amp;lt;/math&amp;gt;, but we have &amp;lt;math display=&amp;quot;inline&amp;quot;&amp;gt;B \subsetneq A&amp;lt;/math&amp;gt;. A question might be: how do we visualize &amp;lt;math display=&amp;quot;inline&amp;quot;&amp;gt;P(x) \vee P(y) \vee P(z)&amp;lt;/math&amp;gt; in this scheme? My first thought was &amp;quot;Maybe we need three copies of the diagram, so that we have &amp;lt;math display=&amp;quot;inline&amp;quot;&amp;gt;(P(x) \wedge P(y) \wedge P(z)) \implies P(x)&amp;lt;/math&amp;gt;, &amp;lt;math display=&amp;quot;inline&amp;quot;&amp;gt;(P(x) \wedge P(y) \wedge P(z)) \implies P(y)&amp;lt;/math&amp;gt;, and &amp;lt;math display=&amp;quot;inline&amp;quot;&amp;gt;(P(x) \wedge P(y) \wedge P(z)) \implies P(z)&amp;lt;/math&amp;gt;&amp;quot;. 
But maybe a better way to think of this is that each set such as &amp;lt;math display=&amp;quot;inline&amp;quot;&amp;gt;B&amp;lt;/math&amp;gt; above is a &#039;&#039;microcosm&#039;&#039;. Once you&#039;re in &amp;lt;math display=&amp;quot;inline&amp;quot;&amp;gt;B&amp;lt;/math&amp;gt;, it&#039;s not as small as you thought! You&#039;re actually in the set &amp;lt;math display=&amp;quot;inline&amp;quot;&amp;gt;\{P(y), P(y) \vee P(x), P(y) \vee P(z), P(y) \vee P(x) \vee P(z)\}&amp;lt;/math&amp;gt;. And once you&#039;re in this microcosm/&amp;quot;kingdom&amp;quot;, you can navigate to wherever you please. This explains why a strong statement is bigger (in this visualization): when we start out with more statements, our &amp;quot;kingdom&amp;quot; is bigger (we get more combinations of OR statements for free). So we can navigate to many other smaller sets of sentences (villages, towns, whatever).&lt;br /&gt;
* The above versus a joint distribution: symbolically, there is a contrast between &amp;lt;math display=&amp;quot;inline&amp;quot;&amp;gt;\Pr(x,y,z)&amp;lt;/math&amp;gt; (the joint distribution specifies an elementary event, which is small, whereas a marginal distribution specifies a &amp;quot;lumped together&amp;quot; event, which is large) and &amp;lt;math display=&amp;quot;inline&amp;quot;&amp;gt;P(x),P(y),P(z)&amp;lt;/math&amp;gt; (the more statements we know, the larger the set of statements we know).&lt;br /&gt;
* What do I mean by &amp;quot;navigate to&amp;quot;? Basically I mean &amp;lt;math display=&amp;quot;inline&amp;quot;&amp;gt;\vdash&amp;lt;/math&amp;gt; from mathematical logic. If &amp;lt;math display=&amp;quot;inline&amp;quot;&amp;gt;A&amp;lt;/math&amp;gt; and &amp;lt;math display=&amp;quot;inline&amp;quot;&amp;gt;B&amp;lt;/math&amp;gt; are sets of sentences, then &amp;quot;if we have &amp;lt;math display=&amp;quot;inline&amp;quot;&amp;gt;A&amp;lt;/math&amp;gt;, we can navigate to &amp;lt;math display=&amp;quot;inline&amp;quot;&amp;gt;B&amp;lt;/math&amp;gt;&amp;quot; means &amp;lt;math display=&amp;quot;inline&amp;quot;&amp;gt;A \vdash B&amp;lt;/math&amp;gt;.&lt;br /&gt;
* Maybe a better notation would be &amp;lt;math&amp;gt;P\subseteq Q \iff \forall x(P(x)\implies Q(x)) \iff \forall x \{\varphi : Q(x) \vdash \varphi\} \subseteq \{\varphi : P(x) \vdash \varphi\} \implies \{\varphi : Q \vdash \varphi\} \subseteq \{\varphi : P \vdash \varphi\}&amp;lt;/math&amp;gt;&lt;br /&gt;
* The identity &amp;lt;math&amp;gt;\left(\bigcap_{\alpha \in I} A_\alpha\right) \cap \left(\bigcap_{\alpha\in J} A_\alpha\right) = \bigcap_{\alpha\in I\cup J} A_\alpha&amp;lt;/math&amp;gt; (for nonempty &amp;lt;math&amp;gt;I,J&amp;lt;/math&amp;gt;) also seems like part of this, where the appearance of a &amp;quot;union&amp;quot; actually makes the statement stronger.&lt;br /&gt;
* Nate&#039;s comment: &amp;quot;the way I think of it is that there are fewer ways to write a fn with a more general type (for the same reasons there’s less you can say about that is true about all apples than about any one particular apple), so if your function is in the general type and you write it in the general type then the typechecker verifies that you didn’t accidentally depend on any specifics.&amp;quot; [https://www.facebook.com/satvik.beri/posts/641556838954?comment_id=641557093444&amp;amp;reply_comment_id=641575107344&amp;amp;comment_tracking=%7B%22tn%22%3A%22R%22%7D]&lt;br /&gt;
* &amp;quot;What is true of one apple may not be true of another apple; thus more can be said about a single apple than about all the apples in the world.&amp;quot; [https://www.readthesequences.com/The-Twelve-Virtues-Of-Rationality] See also [https://www.readthesequences.com/The-Virtue-Of-Narrowness]&lt;br /&gt;
* There is an analogy with markup languages that I&#039;m not sure is completely correct but that I thought was interesting: there are two approaches to &amp;quot;unifying&amp;quot; or &amp;quot;single-source publishing&amp;quot; all your work. You could either work in a &amp;quot;super language&amp;quot; that uses the union of features of all markup languages you want to export to, or you could work in a &amp;quot;crippled language&amp;quot; that uses the intersection of features of all markup languages you want to export to.&lt;br /&gt;
* This is essentially [[wikipedia:Covariance and contravariance (computer_science)|contravariance]]:&lt;br /&gt;
** I think &amp;lt;math&amp;gt;P \subseteq Q&amp;lt;/math&amp;gt; iff for all sets &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt;, we have &amp;lt;math&amp;gt;Q \subseteq A \implies P \subseteq A&amp;lt;/math&amp;gt;. Also, &amp;quot;the worlds where &amp;lt;math&amp;gt;Q\subseteq A&amp;lt;/math&amp;gt;&amp;quot; is a subset of &amp;quot;the worlds where &amp;lt;math&amp;gt;P\subseteq A&amp;lt;/math&amp;gt;&amp;quot; iff &amp;lt;math&amp;gt;P\subseteq Q&amp;lt;/math&amp;gt;. Similarly, in subtyping, &amp;lt;math&amp;gt;Q\to A&amp;lt;/math&amp;gt; is a subtype of &amp;lt;math&amp;gt;P\to A&amp;lt;/math&amp;gt; iff &amp;lt;math&amp;gt;P&amp;lt;/math&amp;gt; is a subtype of &amp;lt;math&amp;gt;Q&amp;lt;/math&amp;gt;. Notice that all of these have the same &amp;quot;form&amp;quot; as the &amp;lt;math&amp;gt;Q \vdash \varphi&amp;lt;/math&amp;gt; stuff from above.&lt;br /&gt;
&lt;br /&gt;
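The subset characterization in the contravariance bullet above can be checked by brute force on a small finite universe. The following Python sketch is only an illustration: the names powerset, anti_monotone_holds, characterization_agrees and supersets, and the three-element universe, are assumptions made for this example and do not come from any of the sources cited above.&lt;br /&gt;
&lt;br /&gt;
```python
from itertools import chain, combinations


def powerset(universe):
    """Return every subset of `universe` as a frozenset."""
    items = list(universe)
    subsets = chain.from_iterable(
        combinations(items, size) for size in range(len(items) + 1)
    )
    return [frozenset(s) for s in subsets]


# Hypothetical small universe, chosen only to keep the search tiny.
UNIVERSE = {1, 2, 3}
ALL_SETS = powerset(UNIVERSE)


def anti_monotone_holds(p, q, all_sets):
    """Check that every set A containing q also contains p."""
    return all(p.issubset(a) for a in all_sets if q.issubset(a))


def characterization_agrees():
    """Compare 'p is a subset of q' with the 'for all A' condition, pair by pair."""
    return all(
        p.issubset(q) == anti_monotone_holds(p, q, ALL_SETS)
        for p in ALL_SETS
        for q in ALL_SETS
    )


def supersets(s):
    """The 'worlds' A in which s is contained in A."""
    return {a for a in ALL_SETS if s.issubset(a)}


print(characterization_agrees())
```
&lt;br /&gt;
In this sketch the smaller set has the larger collection of supersets: supersets(q) is contained in supersets(p) exactly when p is contained in q, which is the same reversal of direction as in the subtyping rule for function arguments.&lt;br /&gt;
&lt;br /&gt;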
==Proving a stronger statement==&lt;br /&gt;
&lt;br /&gt;
* Charles Chapman Pugh: &amp;quot;It may seem paradoxical at first, but a specific math problem can be harder to solve than some abstract generalization of it.&amp;quot; (&#039;&#039;Real Mathematical Analysis&#039;&#039;, p. 51.)&lt;br /&gt;
&lt;br /&gt;
==Strength in a more informal sense==&lt;br /&gt;
&lt;br /&gt;
* https://gowers.wordpress.com/2008/12/28/how-can-one-equivalent-statement-be-stronger-than-another/&lt;br /&gt;
&lt;br /&gt;
In this sense, two logically equivalent statements can have different strengths, i.e. strength is not measured by the logical power of the statement.&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==External links==&lt;br /&gt;
&lt;br /&gt;
* https://gowers.wordpress.com/2011/09/26/basic-logic-connectives-not/ (search &amp;quot;strong&amp;quot;)&lt;br /&gt;
* https://gowers.wordpress.com/2011/10/02/basic-logic-relationships-between-statements-negation/ (search &amp;quot;strong&amp;quot;)&lt;br /&gt;
* https://math.stackexchange.com/a/3316825/35525&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Computability_and_logic/Expresses_versus_captures&amp;diff=3588</id>
		<title>User:IssaRice/Computability and logic/Expresses versus captures</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Computability_and_logic/Expresses_versus_captures&amp;diff=3588"/>
		<updated>2023-04-23T07:52:02Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: /* Capturing functions */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The &#039;&#039;&#039;expresses versus captures&#039;&#039;&#039; distinction is an important one in mathematical logic, but unfortunately the terminology differs wildly between different texts. The table in the &amp;quot;Comparison of usage patterns&amp;quot; section below gives a comparison.&lt;br /&gt;
&lt;br /&gt;
* Expressing is done by a language. There is only one form of expressing; I think this follows from the [[wikipedia:Law of excluded middle]].&lt;br /&gt;
* Capturing is done by a theory or by axioms. There are two forms of capturing: strong capture (corresponding to deciding), and weak capture (corresponding to recognizing, or semi-deciding).&lt;br /&gt;
&lt;br /&gt;
==Comparing strengths==&lt;br /&gt;
&lt;br /&gt;
For the predicate version of expresses/captures, does one imply the other?&lt;br /&gt;
&lt;br /&gt;
It turns out that given a sound theory, &amp;quot;captures&amp;quot; implies &amp;quot;expresses&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
However, even for a &amp;quot;nice&amp;quot; theory, the implication in the other direction does not hold. A good example is the provability property for the theory, which takes a Goedel number of a sentence and is true iff that sentence is provable. This property turns out to be expressible but not capturable.&lt;br /&gt;
&lt;br /&gt;
==Capturing functions==&lt;br /&gt;
&lt;br /&gt;
For functions, it seems like there are at least four different strengths.&lt;br /&gt;
&lt;br /&gt;
# &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt; is captured by &amp;lt;math&amp;gt;\phi(x,y)&amp;lt;/math&amp;gt; iff for all &amp;lt;math&amp;gt;m,n&amp;lt;/math&amp;gt; (i) if &amp;lt;math&amp;gt;f(m) = n&amp;lt;/math&amp;gt; then &amp;lt;math&amp;gt;T \vdash \phi(\overline{m}, \overline{n})&amp;lt;/math&amp;gt; and (ii) &amp;lt;math&amp;gt;T \vdash \exists y (\phi(\overline{m}, y) \wedge \forall v(\phi(\overline{m}, v) \to v=y))&amp;lt;/math&amp;gt;.&amp;lt;ref name=&amp;quot;smith&amp;quot;&amp;gt;Peter Smith. Godel book, p. 119, 120, 122.&amp;lt;/ref&amp;gt;&lt;br /&gt;
# &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt; is captured by &amp;lt;math&amp;gt;\phi(x,y)&amp;lt;/math&amp;gt; iff for all &amp;lt;math&amp;gt;m,n&amp;lt;/math&amp;gt;, if &amp;lt;math&amp;gt;f(m) = n&amp;lt;/math&amp;gt;, then &amp;lt;math&amp;gt;T \vdash \forall y (\phi(\overline m,y) \leftrightarrow y = \overline n)&amp;lt;/math&amp;gt;.&amp;lt;ref name=&amp;quot;smith&amp;quot;/&amp;gt;&lt;br /&gt;
# &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt; is captured by &amp;lt;math&amp;gt;\phi(x,y)&amp;lt;/math&amp;gt; iff for all &amp;lt;math&amp;gt;m,n&amp;lt;/math&amp;gt; (i) if &amp;lt;math&amp;gt;f(m)=n&amp;lt;/math&amp;gt; then &amp;lt;math&amp;gt;T \vdash \phi(\overline m, \overline n)&amp;lt;/math&amp;gt;, and (ii) if &amp;lt;math&amp;gt;f(m)\ne n&amp;lt;/math&amp;gt; then &amp;lt;math&amp;gt;T \vdash \neg \phi(\overline m, \overline n)&amp;lt;/math&amp;gt;.&amp;lt;ref name=&amp;quot;smith&amp;quot;/&amp;gt;&lt;br /&gt;
# &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt; is captured by &amp;lt;math&amp;gt;\phi(x,y)&amp;lt;/math&amp;gt; iff (i) for all &amp;lt;math&amp;gt;m,n&amp;lt;/math&amp;gt;, if &amp;lt;math&amp;gt;f(m) = n&amp;lt;/math&amp;gt; then &amp;lt;math&amp;gt;T \vdash \phi(\overline m, \overline n)&amp;lt;/math&amp;gt;, and (ii) we have &amp;lt;math&amp;gt;T \vdash \forall x \exists y (\phi(x,y) \wedge \forall v (\phi(x,v) \to v=y))&amp;lt;/math&amp;gt;.&amp;lt;ref name=&amp;quot;smith&amp;quot;/&amp;gt;&lt;br /&gt;
# &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt; is captured by &amp;lt;math&amp;gt;\phi(x,y)&amp;lt;/math&amp;gt; iff for all &amp;lt;math&amp;gt;m,n&amp;lt;/math&amp;gt; (i) if &amp;lt;math&amp;gt;f(m)=n&amp;lt;/math&amp;gt; then &amp;lt;math&amp;gt;T \vdash \phi(\overline m, \overline n)&amp;lt;/math&amp;gt;, and (ii) if &amp;lt;math&amp;gt;f(m)\ne n&amp;lt;/math&amp;gt; then &amp;lt;math&amp;gt;T \nvdash \phi(\overline m, \overline n)&amp;lt;/math&amp;gt;.&amp;lt;ref&amp;gt;Leary and Kristiansen. A Friendly Introduction to Mathematical Logic (2nd ed). p. 121&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
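Some of these definitions are related by short arguments. The following is a sketch of three such implications, worked out here as a reading aid and not quoted from Smith or from Leary and Kristiansen. The numbers refer to the list above, f is assumed to be total, and the consistency of T in the last step is an added assumption.&lt;br /&gt;
&lt;br /&gt;
```latex
% (4) implies (1): clause (i) is shared; for clause (ii),
% instantiate the universally quantified x at the numeral \overline{m}.
T \vdash \forall x\, \exists y\, (\phi(x,y) \wedge \forall v(\phi(x,v) \to v=y))
\;\Longrightarrow\;
T \vdash \exists y\, (\phi(\overline{m},y) \wedge \forall v(\phi(\overline{m},v) \to v=y))

% (2) implies (1): suppose f(m) = n and
%   T \vdash \forall y\, (\phi(\overline{m},y) \leftrightarrow y = \overline{n}).
% Instantiating y at \overline{n} and using \overline{n} = \overline{n} gives clause (i):
T \vdash \phi(\overline{m},\overline{n})
% The left-to-right half of the biconditional gives uniqueness:
T \vdash \forall v\, (\phi(\overline{m},v) \to v = \overline{n})
% Together these give clause (ii), with \overline{n} as the witness for y.

% (3) implies (5), assuming T is consistent: if f(m) \ne n then
T \vdash \neg\phi(\overline{m},\overline{n})
\;\Longrightarrow\;
T \nvdash \phi(\overline{m},\overline{n})
```
&lt;br /&gt;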
I tried reading Peter Smith&#039;s [https://www.logicmatters.net/resources/pdfs/godelbook/GodelBookLM.pdf#chapter.16 explanation in IGT2] for why one version of functional capture is preferred (in particular why it&#039;s better for a theory like Q or PA to &amp;quot;know&amp;quot; that the two-place formula is functional), but I am still confused. I can&#039;t tell where, downstream in some proof that leads up to the first incompleteness theorem, this definition gets used in a way that the naive &amp;quot;just make sure the theory can case-by-case prove the graph relation of the function&amp;quot; approach &#039;&#039;doesn&#039;t&#039;&#039; work. He says that it basically doesn&#039;t matter which definition you pick, but then still prefers one definition that&#039;s &#039;&#039;not&#039;&#039; the naive one, without elaborating on where exactly his preferred definition is helpful compared to the naive version. Actually, I think he does say at the beginning of that section that the crucial thing is in section 17.3, something to do with Gödel&#039;s β-function trick.&lt;br /&gt;
&lt;br /&gt;
I think I&#039;m still confused. Smith writes, &amp;quot;But our preferred somewhat stronger notion is the one that we will need to work with in smoothly proving the key Theorem 17.1. So that&#039;s why we concentrate on it.&amp;quot; Theorem 17.1 is the statement that Q is p.r. adequate. But the definition of p.r. adequacy itself uses the definition of capturing functions! So maybe the situation is this: it doesn&#039;t matter which version of p.r. adequacy we use (possibly even what we can call &amp;quot;weak p.r. adequacy&amp;quot;, the idea that the theory can weakly capture any p.r. function, is sufficient to carry on with Gödel&#039;s proof), but weak p.r. adequacy is very difficult to prove directly (i.e. without first proving regular p.r. adequacy).&lt;br /&gt;
&lt;br /&gt;
==Comparison of usage patterns==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;sortable wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Text !! &amp;quot;Expresses&amp;quot; !! &amp;quot;Captures&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| Peter Smith. Godel book (see especially footnote 9 on p. 45) || expresses || captures&lt;br /&gt;
|-&lt;br /&gt;
| Leary &amp;amp; Kristiansen || defines || represents&lt;br /&gt;
|-&lt;br /&gt;
| Goldrei || defines (but the book also uses &amp;quot;represents&amp;quot;)&amp;lt;ref&amp;gt;Goldrei. &#039;&#039;Propositional and Predicate Calculus&#039;&#039;. p. 137.&amp;lt;/ref&amp;gt; ||&lt;br /&gt;
|-&lt;br /&gt;
| Boolos, Burgess, Jeffrey (5th ed) || arithmetically defines&amp;lt;ref name=&amp;quot;boolos&amp;quot;&amp;gt;George S. Boolos; John P. Burgess; Richard C. Jeffrey. &#039;&#039;Computability and Logic&#039;&#039; (5th ed). p. 199 for &amp;quot;arithmetically defines&amp;quot;. p. 207 for &amp;quot;defines&amp;quot;.&amp;lt;/ref&amp;gt; || defines (for sets), represents (for functions)&amp;lt;ref name=&amp;quot;boolos&amp;quot;/&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| Wikipedia || [[wikipedia:Arithmetical set|arithmetically defines]]; I think [https://en.wikipedia.org/wiki/Tarski&#039;s_undefinability_theorem#Statement_of_the_theorem this page] uses &amp;quot;defines&amp;quot; in the expresses sense (though I&#039;m not sure; this sense of &amp;quot;defines&amp;quot; seems different) || [https://en.wikipedia.org/wiki/Diagonal_lemma#Background this page] uses &amp;quot;represents&amp;quot;, but I don&#039;t think there&#039;s a standalone article for the concept&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Computability_and_logic/Expresses_versus_captures&amp;diff=3587</id>
		<title>User:IssaRice/Computability and logic/Expresses versus captures</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Computability_and_logic/Expresses_versus_captures&amp;diff=3587"/>
		<updated>2023-04-23T07:36:38Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: /* Capturing functions */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The &#039;&#039;&#039;expresses versus captures&#039;&#039;&#039; distinction is an important one in mathematical logic, but unfortunately the terminology differs wildly between different texts. The table in the &amp;quot;Comparison of usage patterns&amp;quot; section below gives a comparison.&lt;br /&gt;
&lt;br /&gt;
* Expressing is done by a language. There is only one form of expressing; I think this follows from the [[wikipedia:Law of excluded middle]].&lt;br /&gt;
* Capturing is done by a theory or by axioms. There are two forms of capturing: strong capture (corresponding to deciding), and weak capture (corresponding to recognizing, or semi-deciding).&lt;br /&gt;
&lt;br /&gt;
==Comparing strengths==&lt;br /&gt;
&lt;br /&gt;
For the predicate version of expresses/captures, does one imply the other?&lt;br /&gt;
&lt;br /&gt;
It turns out that given a sound theory, &amp;quot;captures&amp;quot; implies &amp;quot;expresses&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
However, even for a &amp;quot;nice&amp;quot; theory, the implication in the other direction does not hold. A good example is the provability property for the theory, which takes a Goedel number of a sentence and is true iff that sentence is provable. This property turns out to be expressible but not capturable.&lt;br /&gt;
&lt;br /&gt;
==Capturing functions==&lt;br /&gt;
&lt;br /&gt;
For functions, it seems like there are at least four different strengths.&lt;br /&gt;
&lt;br /&gt;
# &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt; is captured by &amp;lt;math&amp;gt;\phi(x,y)&amp;lt;/math&amp;gt; iff for all &amp;lt;math&amp;gt;m,n&amp;lt;/math&amp;gt; (i) if &amp;lt;math&amp;gt;f(m) = n&amp;lt;/math&amp;gt; then &amp;lt;math&amp;gt;T \vdash \phi(\overline{m}, \overline{n})&amp;lt;/math&amp;gt; and (ii) &amp;lt;math&amp;gt;T \vdash \exists y (\phi(\overline{m}, y) \wedge \forall v(\phi(\overline{m}, v) \to v=y))&amp;lt;/math&amp;gt;.&amp;lt;ref name=&amp;quot;smith&amp;quot;&amp;gt;Peter Smith. Godel book, p. 119, 120, 122.&amp;lt;/ref&amp;gt;&lt;br /&gt;
# &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt; is captured by &amp;lt;math&amp;gt;\phi(x,y)&amp;lt;/math&amp;gt; iff for all &amp;lt;math&amp;gt;m,n&amp;lt;/math&amp;gt;, if &amp;lt;math&amp;gt;f(m) = n&amp;lt;/math&amp;gt;, then &amp;lt;math&amp;gt;T \vdash \forall y (\phi(\overline m,y) \leftrightarrow y = \overline n)&amp;lt;/math&amp;gt;.&amp;lt;ref name=&amp;quot;smith&amp;quot;/&amp;gt;&lt;br /&gt;
# &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt; is captured by &amp;lt;math&amp;gt;\phi(x,y)&amp;lt;/math&amp;gt; iff for all &amp;lt;math&amp;gt;m,n&amp;lt;/math&amp;gt; (i) if &amp;lt;math&amp;gt;f(m)=n&amp;lt;/math&amp;gt; then &amp;lt;math&amp;gt;T \vdash \phi(\overline m, \overline n)&amp;lt;/math&amp;gt;, and (ii) if &amp;lt;math&amp;gt;f(m)\ne n&amp;lt;/math&amp;gt; then &amp;lt;math&amp;gt;T \vdash \neg \phi(\overline m, \overline n)&amp;lt;/math&amp;gt;.&amp;lt;ref name=&amp;quot;smith&amp;quot;/&amp;gt;&lt;br /&gt;
# &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt; is captured by &amp;lt;math&amp;gt;\phi(x,y)&amp;lt;/math&amp;gt; iff (i) for all &amp;lt;math&amp;gt;m,n&amp;lt;/math&amp;gt;, if &amp;lt;math&amp;gt;f(m) = n&amp;lt;/math&amp;gt; then &amp;lt;math&amp;gt;T \vdash \phi(\overline m, \overline n)&amp;lt;/math&amp;gt;, and (ii) we have &amp;lt;math&amp;gt;T \vdash \forall x \exists y (\phi(x,y) \wedge \forall v (\phi(x,v) \to v=y))&amp;lt;/math&amp;gt;.&amp;lt;ref name=&amp;quot;smith&amp;quot;/&amp;gt;&lt;br /&gt;
# &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt; is captured by &amp;lt;math&amp;gt;\phi(x,y)&amp;lt;/math&amp;gt; iff for all &amp;lt;math&amp;gt;m,n&amp;lt;/math&amp;gt; (i) if &amp;lt;math&amp;gt;f(m)=n&amp;lt;/math&amp;gt; then &amp;lt;math&amp;gt;T \vdash \phi(\overline m, \overline n)&amp;lt;/math&amp;gt;, and (ii) if &amp;lt;math&amp;gt;f(m)\ne n&amp;lt;/math&amp;gt; then &amp;lt;math&amp;gt;T \nvdash \phi(\overline m, \overline n)&amp;lt;/math&amp;gt;.&amp;lt;ref&amp;gt;Leary and Kristiansen. A Friendly Introduction to Mathematical Logic (2nd ed). p. 121&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
I tried reading Peter Smith&#039;s [https://www.logicmatters.net/resources/pdfs/godelbook/GodelBookLM.pdf#chapter.16 explanation in IGT2] for why one version of functional capture is preferred (in particular why it&#039;s better for a theory like Q or PA to &amp;quot;know&amp;quot; that the two-place formula is functional), but I am still confused. I can&#039;t tell where, downstream in some proof that leads up to the first incompleteness theorem, this definition gets used in a way that the naive &amp;quot;just make sure the theory can case-by-case prove the graph relation of the function&amp;quot; approach &#039;&#039;doesn&#039;t&#039;&#039; work. He says that it basically doesn&#039;t matter which definition you pick, but then still prefers one definition that&#039;s &#039;&#039;not&#039;&#039; the naive one, without elaborating on where exactly his preferred definition is helpful compared to the naive version. Actually, I think he does say at the beginning of that section that the crucial thing is in section 17.3, something to do with Gödel&#039;s β-function trick.&lt;br /&gt;
&lt;br /&gt;
==Comparison of usage patterns==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;sortable wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Text !! &amp;quot;Expresses&amp;quot; !! &amp;quot;Captures&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| Peter Smith. Godel book (see especially footnote 9 on p. 45) || expresses || captures&lt;br /&gt;
|-&lt;br /&gt;
| Leary &amp;amp; Kristiansen || defines || represents&lt;br /&gt;
|-&lt;br /&gt;
| Goldrei || defines (but the book also uses &amp;quot;represents&amp;quot;)&amp;lt;ref&amp;gt;Goldrei. &#039;&#039;Propositional and Predicate Calculus&#039;&#039;. p. 137.&amp;lt;/ref&amp;gt; ||&lt;br /&gt;
|-&lt;br /&gt;
| Boolos, Burgess, Jeffrey (5th ed) || arithmetically defines&amp;lt;ref name=&amp;quot;boolos&amp;quot;&amp;gt;George S. Boolos; John P. Burgess; Richard C. Jeffrey. &#039;&#039;Computability and Logic&#039;&#039; (5th ed). p. 199 for &amp;quot;arithmetically defines&amp;quot;. p. 207 for &amp;quot;defines&amp;quot;.&amp;lt;/ref&amp;gt; || defines (for sets), represents (for functions)&amp;lt;ref name=&amp;quot;boolos&amp;quot;/&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| Wikipedia || [[wikipedia:Arithmetical set|arithmetically defines]]; I think [https://en.wikipedia.org/wiki/Tarski&#039;s_undefinability_theorem#Statement_of_the_theorem this page] uses &amp;quot;defines&amp;quot; in the expresses sense (though I&#039;m not sure; this sense of &amp;quot;defines&amp;quot; seems different) || [https://en.wikipedia.org/wiki/Diagonal_lemma#Background this page] uses &amp;quot;represents&amp;quot;, but I don&#039;t think there&#039;s a standalone article for the concept&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Computability_and_logic/Expresses_versus_captures&amp;diff=3586</id>
		<title>User:IssaRice/Computability and logic/Expresses versus captures</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Computability_and_logic/Expresses_versus_captures&amp;diff=3586"/>
		<updated>2023-04-23T07:32:46Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: /* Capturing functions */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The &#039;&#039;&#039;expresses versus captures&#039;&#039;&#039; distinction is an important one in mathematical logic, but unfortunately the terminology differs wildly between different texts. The table in the &amp;quot;Comparison of usage patterns&amp;quot; section below gives a comparison.&lt;br /&gt;
&lt;br /&gt;
* Expressing is done by a language. There is only one form of expressing; I think this follows from the [[wikipedia:Law of excluded middle]].&lt;br /&gt;
* Capturing is done by a theory or by axioms. There are two forms of capturing: strong capture (corresponding to deciding), and weak capture (corresponding to recognizing, or semi-deciding).&lt;br /&gt;
&lt;br /&gt;
==Comparing strengths==&lt;br /&gt;
&lt;br /&gt;
For the predicate version of expresses/captures, does one imply the other?&lt;br /&gt;
&lt;br /&gt;
It turns out that given a sound theory, &amp;quot;captures&amp;quot; implies &amp;quot;expresses&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
However, even for a &amp;quot;nice&amp;quot; theory, the implication in the other direction does not hold. A good example is the provability property for the theory, which takes a Goedel number of a sentence and is true iff that sentence is provable. This property turns out to be expressible but not capturable.&lt;br /&gt;
&lt;br /&gt;
==Capturing functions==&lt;br /&gt;
&lt;br /&gt;
For functions, it seems like there are at least four different strengths.&lt;br /&gt;
&lt;br /&gt;
# &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt; is captured by &amp;lt;math&amp;gt;\phi(x,y)&amp;lt;/math&amp;gt; iff for all &amp;lt;math&amp;gt;m,n&amp;lt;/math&amp;gt; (i) if &amp;lt;math&amp;gt;f(m) = n&amp;lt;/math&amp;gt; then &amp;lt;math&amp;gt;T \vdash \phi(\overline{m}, \overline{n})&amp;lt;/math&amp;gt; and (ii) &amp;lt;math&amp;gt;T \vdash \exists y (\phi(\overline{m}, y) \wedge \forall v(\phi(\overline{m}, v) \to v=y))&amp;lt;/math&amp;gt;.&amp;lt;ref name=&amp;quot;smith&amp;quot;&amp;gt;Peter Smith. Godel book, p. 119, 120, 122.&amp;lt;/ref&amp;gt;&lt;br /&gt;
# &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt; is captured by &amp;lt;math&amp;gt;\phi(x,y)&amp;lt;/math&amp;gt; iff for all &amp;lt;math&amp;gt;m,n&amp;lt;/math&amp;gt;, if &amp;lt;math&amp;gt;f(m) = n&amp;lt;/math&amp;gt;, then &amp;lt;math&amp;gt;T \vdash \forall y (\phi(\overline m,y) \leftrightarrow y = \overline n)&amp;lt;/math&amp;gt;.&amp;lt;ref name=&amp;quot;smith&amp;quot;/&amp;gt;&lt;br /&gt;
# &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt; is captured by &amp;lt;math&amp;gt;\phi(x,y)&amp;lt;/math&amp;gt; iff for all &amp;lt;math&amp;gt;m,n&amp;lt;/math&amp;gt; (i) if &amp;lt;math&amp;gt;f(m)=n&amp;lt;/math&amp;gt; then &amp;lt;math&amp;gt;T \vdash \phi(\overline m, \overline n)&amp;lt;/math&amp;gt;, and (ii) if &amp;lt;math&amp;gt;f(m)\ne n&amp;lt;/math&amp;gt; then &amp;lt;math&amp;gt;T \vdash \neg \phi(\overline m, \overline n)&amp;lt;/math&amp;gt;.&amp;lt;ref name=&amp;quot;smith&amp;quot;/&amp;gt;&lt;br /&gt;
# &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt; is captured by &amp;lt;math&amp;gt;\phi(x,y)&amp;lt;/math&amp;gt; iff (i) for all &amp;lt;math&amp;gt;m,n&amp;lt;/math&amp;gt;, if &amp;lt;math&amp;gt;f(m) = n&amp;lt;/math&amp;gt; then &amp;lt;math&amp;gt;T \vdash \phi(\overline m, \overline n)&amp;lt;/math&amp;gt;, and (ii) we have &amp;lt;math&amp;gt;T \vdash \forall x \exists y (\phi(x,y) \wedge \forall v (\phi(x,v) \to v=y))&amp;lt;/math&amp;gt;.&amp;lt;ref name=&amp;quot;smith&amp;quot;/&amp;gt;&lt;br /&gt;
# &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt; is captured by &amp;lt;math&amp;gt;\phi(x,y)&amp;lt;/math&amp;gt; iff for all &amp;lt;math&amp;gt;m,n&amp;lt;/math&amp;gt; (i) if &amp;lt;math&amp;gt;f(m)=n&amp;lt;/math&amp;gt; then &amp;lt;math&amp;gt;T \vdash \phi(\overline m, \overline n)&amp;lt;/math&amp;gt;, and (ii) if &amp;lt;math&amp;gt;f(m)\ne n&amp;lt;/math&amp;gt; then &amp;lt;math&amp;gt;T \nvdash \phi(\overline m, \overline n)&amp;lt;/math&amp;gt;.&amp;lt;ref&amp;gt;Leary and Kristiansen. A Friendly Introduction to Mathematical Logic (2nd ed). p. 121&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
I tried reading Peter Smith&#039;s [https://www.logicmatters.net/resources/pdfs/godelbook/GodelBookLM.pdf#chapter.16 explanation in IGT2] for why one version of functional capture is preferred (in particular why it&#039;s better for a theory like Q or PA to &amp;quot;know&amp;quot; that the two-place formula is functional), but I am still confused. I can&#039;t tell where, downstream in some proof that leads up to the first incompleteness theorem, this definition gets used in a way that the naive &amp;quot;just make sure the theory can case-by-case prove the graph relation of the function&amp;quot; approach &#039;&#039;doesn&#039;t&#039;&#039; work. He says that it basically doesn&#039;t matter which definition you pick, but then still prefers one definition that&#039;s &#039;&#039;not&#039;&#039; the naive one, without elaborating on where exactly his preferred definition is helpful compared to the naive version.&lt;br /&gt;
&lt;br /&gt;
==Comparison of usage patterns==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;sortable wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Text !! &amp;quot;Expresses&amp;quot; !! &amp;quot;Captures&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| Peter Smith. Godel book (see especially footnote 9 on p. 45) || expresses || captures&lt;br /&gt;
|-&lt;br /&gt;
| Leary &amp;amp; Kristiansen || defines || represents&lt;br /&gt;
|-&lt;br /&gt;
| Goldrei || defines (but the book also uses &amp;quot;represents&amp;quot;)&amp;lt;ref&amp;gt;Goldrei. &#039;&#039;Propositional and Predicate Calculus&#039;&#039;. p. 137.&amp;lt;/ref&amp;gt; ||&lt;br /&gt;
|-&lt;br /&gt;
| Boolos, Burgess, Jeffrey (5th ed) || arithmetically defines&amp;lt;ref name=&amp;quot;boolos&amp;quot;&amp;gt;George S. Boolos; John P. Burgess; Richard C. Jeffrey. &#039;&#039;Computability and Logic&#039;&#039; (5th ed). p. 199 for &amp;quot;arithmetically defines&amp;quot;. p. 207 for &amp;quot;defines&amp;quot;.&amp;lt;/ref&amp;gt; || defines (for sets), represents (for functions)&amp;lt;ref name=&amp;quot;boolos&amp;quot;/&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| Wikipedia || [[wikipedia:Arithmetical set|arithmetically defines]], and I think [https://en.wikipedia.org/wiki/Tarski&#039;s_undefinability_theorem#Statement_of_the_theorem this page] uses &amp;quot;defines&amp;quot; in the expresses sense (though I&#039;m not sure; this sense of &amp;quot;defines&amp;quot; seems different) || [https://en.wikipedia.org/wiki/Diagonal_lemma#Background this page] uses &amp;quot;represents&amp;quot;, but I don&#039;t think there&#039;s a standalone article for the concept&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Computability_and_logic/Expresses_versus_captures&amp;diff=3585</id>
		<title>User:IssaRice/Computability and logic/Expresses versus captures</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Computability_and_logic/Expresses_versus_captures&amp;diff=3585"/>
		<updated>2023-04-23T07:31:48Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: /* Capturing functions */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The &#039;&#039;&#039;expresses versus captures&#039;&#039;&#039; distinction is an important one in mathematical logic, but unfortunately the terminology differs wildly between different texts. The following table gives a comparison.&lt;br /&gt;
&lt;br /&gt;
* Expressing is done by a language. There is only one form of expressing; I think this follows from the [[wikipedia:Law of excluded middle]].&lt;br /&gt;
* Capturing is done by a theory or by axioms. There are two forms of capturing: strong capture (corresponding to deciding), and weak capture (corresponding to recognizing, or semi-deciding).&lt;br /&gt;
&lt;br /&gt;
==Comparing strengths==&lt;br /&gt;
&lt;br /&gt;
For the predicate version of expresses/captures, does one imply the other?&lt;br /&gt;
&lt;br /&gt;
It turns out that given a sound theory, &amp;quot;captures&amp;quot; implies &amp;quot;expresses&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
However, even for a &amp;quot;nice&amp;quot; theory, the implication in the other direction does not hold. A good example is the provability property for the theory, which takes a Goedel number of a sentence and is true iff that sentence is provable. This property turns out to be expressible but not capturable.&lt;br /&gt;
&lt;br /&gt;
==Capturing functions==&lt;br /&gt;
&lt;br /&gt;
For functions, it seems like there are at least four different strengths.&lt;br /&gt;
&lt;br /&gt;
# &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt; is captured by &amp;lt;math&amp;gt;\phi(x,y)&amp;lt;/math&amp;gt; iff for all &amp;lt;math&amp;gt;m,n&amp;lt;/math&amp;gt; (i) if &amp;lt;math&amp;gt;f(m) = n&amp;lt;/math&amp;gt; then &amp;lt;math&amp;gt;T \vdash \phi(\overline{m}, \overline{n})&amp;lt;/math&amp;gt; and (ii) &amp;lt;math&amp;gt;T \vdash \exists y (\phi(\overline{m}, y) \wedge \forall v(\phi(\overline{m}, v) \to v=y))&amp;lt;/math&amp;gt;.&amp;lt;ref name=&amp;quot;smith&amp;quot;&amp;gt;Peter Smith. Godel book, p. 119, 120, 122.&amp;lt;/ref&amp;gt;&lt;br /&gt;
# &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt; is captured by &amp;lt;math&amp;gt;\phi(x,y)&amp;lt;/math&amp;gt; iff for all &amp;lt;math&amp;gt;m,n&amp;lt;/math&amp;gt;, if &amp;lt;math&amp;gt;f(m) = n&amp;lt;/math&amp;gt;, then &amp;lt;math&amp;gt;T \vdash \forall y (\phi(\overline m,y) \leftrightarrow y = \overline n)&amp;lt;/math&amp;gt;.&amp;lt;ref name=&amp;quot;smith&amp;quot;/&amp;gt;&lt;br /&gt;
# &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt; is captured by &amp;lt;math&amp;gt;\phi(x,y)&amp;lt;/math&amp;gt; iff for all &amp;lt;math&amp;gt;m,n&amp;lt;/math&amp;gt; (i) if &amp;lt;math&amp;gt;f(m)=n&amp;lt;/math&amp;gt; then &amp;lt;math&amp;gt;T \vdash \phi(\overline m, \overline n)&amp;lt;/math&amp;gt;, and (ii) if &amp;lt;math&amp;gt;f(m)\ne n&amp;lt;/math&amp;gt; then &amp;lt;math&amp;gt;T \vdash \neg \phi(\overline m, \overline n)&amp;lt;/math&amp;gt;.&amp;lt;ref name=&amp;quot;smith&amp;quot;/&amp;gt;&lt;br /&gt;
# &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt; is captured by &amp;lt;math&amp;gt;\phi(x,y)&amp;lt;/math&amp;gt; iff (i) for all &amp;lt;math&amp;gt;m,n&amp;lt;/math&amp;gt;, if &amp;lt;math&amp;gt;f(m) = n&amp;lt;/math&amp;gt; then &amp;lt;math&amp;gt;T \vdash \phi(\overline m, \overline n)&amp;lt;/math&amp;gt;, and (ii) we have &amp;lt;math&amp;gt;T \vdash \forall x \exists y (\phi(x,y) \wedge \forall v (\phi(x,v) \to v=y))&amp;lt;/math&amp;gt;.&amp;lt;ref name=&amp;quot;smith&amp;quot;/&amp;gt;&lt;br /&gt;
# &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt; is captured by &amp;lt;math&amp;gt;\phi(x,y)&amp;lt;/math&amp;gt; iff for all &amp;lt;math&amp;gt;m,n&amp;lt;/math&amp;gt; (i) if &amp;lt;math&amp;gt;f(m)=n&amp;lt;/math&amp;gt; then &amp;lt;math&amp;gt;T \vdash \phi(\overline m, \overline n)&amp;lt;/math&amp;gt;, and (ii) if &amp;lt;math&amp;gt;f(m)\ne n&amp;lt;/math&amp;gt; then &amp;lt;math&amp;gt;T \nvdash \phi(\overline m, \overline n)&amp;lt;/math&amp;gt;.&amp;lt;ref&amp;gt;Leary and Kristiansen. A Friendly Introduction to Mathematical Logic (2nd ed). p. 121&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
I tried reading Peter Smith&#039;s explanation in IGT2 for why one version of functional capture is preferred (in particular why it&#039;s better for the theory like Q or PA to &amp;quot;know&amp;quot; that the two-place formula is functional), but I am still confused. I can&#039;t tell where, downstream in some proof that leads up to the first incompleteness theorem, this definition gets used in a way that the naive &amp;quot;just make sure the theory can case-by-case prove the graph relation of the function&amp;quot; approach &#039;&#039;doesn&#039;t&#039;&#039; work. Like he says that it basically doesn&#039;t matter which definition you pick, but then still prefers one definition that&#039;s &#039;&#039;not&#039;&#039; the naive one, without elaborating on where exactly his preferred definition is helpful compared to the naive version.&lt;br /&gt;
&lt;br /&gt;
==Comparison of usage patterns==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;sortable wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Text !! &amp;quot;Expresses&amp;quot; !! &amp;quot;Captures&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| Peter Smith. Godel book (see especially footnote 9 on p. 45) || expresses || captures&lt;br /&gt;
|-&lt;br /&gt;
| Leary &amp;amp; Kristiansen || defines || represents&lt;br /&gt;
|-&lt;br /&gt;
| Goldrei || defines (but the book also uses &amp;quot;represents&amp;quot;)&amp;lt;ref&amp;gt;Goldrei. &#039;&#039;Propositional and Predicate Calculus&#039;&#039;. p. 137.&amp;lt;/ref&amp;gt; ||&lt;br /&gt;
|-&lt;br /&gt;
| Boolos, Burgess, Jeffrey (5th ed) || arithmetically defines&amp;lt;ref name=&amp;quot;boolos&amp;quot;&amp;gt;George S. Boolos; John P. Burgess; Richard C. Jeffrey. &#039;&#039;Computability and Logic&#039;&#039; (5th ed). p. 199 for &amp;quot;arithmetically defines&amp;quot;. p. 207 for &amp;quot;defines&amp;quot;.&amp;lt;/ref&amp;gt; || defines (for sets), represents (for functions)&amp;lt;ref name=&amp;quot;boolos&amp;quot;/&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| Wikipedia || [[wikipedia:Arithmetical set|arithmetically defines]], and I think [https://en.wikipedia.org/wiki/Tarski&#039;s_undefinability_theorem#Statement_of_the_theorem this page] uses &amp;quot;defines&amp;quot; in the expresses sense (though I&#039;m not sure; this sense of &amp;quot;defines&amp;quot; seems different) || [https://en.wikipedia.org/wiki/Diagonal_lemma#Background this page] uses &amp;quot;represents&amp;quot;, but I don&#039;t think there&#039;s a standalone article for the concept&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Computability_and_logic/Expresses_versus_captures&amp;diff=3584</id>
		<title>User:IssaRice/Computability and logic/Expresses versus captures</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Computability_and_logic/Expresses_versus_captures&amp;diff=3584"/>
		<updated>2023-04-23T07:31:22Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: /* Capturing functions */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The &#039;&#039;&#039;expresses versus captures&#039;&#039;&#039; distinction is an important one in mathematical logic, but unfortunately the terminology differs wildly between different texts. The following table gives a comparison.&lt;br /&gt;
&lt;br /&gt;
* Expressing is done by a language. There is only one form of expressing; I think this follows from the [[wikipedia:Law of excluded middle]].&lt;br /&gt;
* Capturing is done by a theory or by axioms. There are two forms of capturing: strong capture (corresponding to deciding), and weak capture (corresponding to recognizing, or semi-deciding).&lt;br /&gt;
&lt;br /&gt;
==Comparing strengths==&lt;br /&gt;
&lt;br /&gt;
For the predicate version of expresses/captures, does one imply the other?&lt;br /&gt;
&lt;br /&gt;
It turns out that given a sound theory, &amp;quot;captures&amp;quot; implies &amp;quot;expresses&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
However, even for a &amp;quot;nice&amp;quot; theory, the implication in the other direction does not hold. A good example is the provability property for the theory, which takes a Goedel number of a sentence and is true iff that sentence is provable. This property turns out to be expressible but not capturable.&lt;br /&gt;
&lt;br /&gt;
==Capturing functions==&lt;br /&gt;
&lt;br /&gt;
For functions, it seems like there are at least four different strengths.&lt;br /&gt;
&lt;br /&gt;
# &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt; is captured by &amp;lt;math&amp;gt;\phi(x,y)&amp;lt;/math&amp;gt; iff for all &amp;lt;math&amp;gt;m,n&amp;lt;/math&amp;gt; (i) if &amp;lt;math&amp;gt;f(m) = n&amp;lt;/math&amp;gt; then &amp;lt;math&amp;gt;T \vdash \phi(\overline{m}, \overline{n})&amp;lt;/math&amp;gt; and (ii) &amp;lt;math&amp;gt;T \vdash \exists y (\phi(\overline{m}, y) \wedge \forall v(\phi(\overline{m}, v) \to v=y))&amp;lt;/math&amp;gt;.&amp;lt;ref name=&amp;quot;smith&amp;quot;&amp;gt;Peter Smith. Godel book, p. 119, 120, 122.&amp;lt;/ref&amp;gt;&lt;br /&gt;
# &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt; is captured by &amp;lt;math&amp;gt;\phi(x,y)&amp;lt;/math&amp;gt; iff for all &amp;lt;math&amp;gt;m,n&amp;lt;/math&amp;gt;, if &amp;lt;math&amp;gt;f(m) = n&amp;lt;/math&amp;gt;, then &amp;lt;math&amp;gt;T \vdash \forall y (\phi(\overline m,y) \leftrightarrow y = \overline n)&amp;lt;/math&amp;gt;.&amp;lt;ref name=&amp;quot;smith&amp;quot;/&amp;gt;&lt;br /&gt;
# &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt; is captured by &amp;lt;math&amp;gt;\phi(x,y)&amp;lt;/math&amp;gt; iff for all &amp;lt;math&amp;gt;m,n&amp;lt;/math&amp;gt; (i) if &amp;lt;math&amp;gt;f(m)=n&amp;lt;/math&amp;gt; then &amp;lt;math&amp;gt;T \vdash \phi(\overline m, \overline n)&amp;lt;/math&amp;gt;, and (ii) if &amp;lt;math&amp;gt;f(m)\ne n&amp;lt;/math&amp;gt; then &amp;lt;math&amp;gt;T \vdash \neg \phi(\overline m, \overline n)&amp;lt;/math&amp;gt;.&amp;lt;ref name=&amp;quot;smith&amp;quot;/&amp;gt;&lt;br /&gt;
# &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt; is captured by &amp;lt;math&amp;gt;\phi(x,y)&amp;lt;/math&amp;gt; iff (i) for all &amp;lt;math&amp;gt;m,n&amp;lt;/math&amp;gt;, if &amp;lt;math&amp;gt;f(m) = n&amp;lt;/math&amp;gt; then &amp;lt;math&amp;gt;T \vdash \phi(\overline m, \overline n)&amp;lt;/math&amp;gt;, and (ii) we have &amp;lt;math&amp;gt;T \vdash \forall x \exists y (\phi(x,y) \wedge \forall v (\phi(x,v) \to v=y))&amp;lt;/math&amp;gt;.&amp;lt;ref name=&amp;quot;smith&amp;quot;/&amp;gt;&lt;br /&gt;
# &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt; is captured by &amp;lt;math&amp;gt;\phi(x,y)&amp;lt;/math&amp;gt; iff for all &amp;lt;math&amp;gt;m,n&amp;lt;/math&amp;gt; (i) if &amp;lt;math&amp;gt;f(m)=n&amp;lt;/math&amp;gt; then &amp;lt;math&amp;gt;T \vdash \phi(\overline m, \overline n)&amp;lt;/math&amp;gt;, and (ii) if &amp;lt;math&amp;gt;f(m)\ne n&amp;lt;/math&amp;gt; then &amp;lt;math&amp;gt;T \nvdash \phi(\overline m, \overline n)&amp;lt;/math&amp;gt;.&amp;lt;ref&amp;gt;Leary and Kristiansen. A Friendly Introduction to Mathematical Logic (2nd ed). p. 121&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
I tried reading Peter Smith&#039;s explanation in IGT2 for why one version of functional capture is preferred (in particular why it&#039;s better for the theory like Q or PA to &amp;quot;know&amp;quot; that the two-place formula is functional), but I am still confused. I can&#039;t tell where, downstream in some proof that leads up to the first incompleteness theorem, this definition gets used in a way that the naive &amp;quot;just make sure the theory can case-by-case prove the graph relation of the function&amp;quot; approach &#039;&#039;doesn&#039;t&#039;&#039; work. Like he says that it basically doesn&#039;t matter which definition you pick, but then still prefers one definition that&#039;s &#039;&#039;not&#039;&#039; the naive one, without elaborating on where exactly his preferred definition is helpful compared to the naive version.&lt;br /&gt;
&lt;br /&gt;
==Comparison of usage patterns==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;sortable wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Text !! &amp;quot;Expresses&amp;quot; !! &amp;quot;Captures&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| Peter Smith. Godel book (see especially footnote 9 on p. 45) || expresses || captures&lt;br /&gt;
|-&lt;br /&gt;
| Leary &amp;amp; Kristiansen || defines || represents&lt;br /&gt;
|-&lt;br /&gt;
| Goldrei || defines (but the book also uses &amp;quot;represents&amp;quot;)&amp;lt;ref&amp;gt;Goldrei. &#039;&#039;Propositional and Predicate Calculus&#039;&#039;. p. 137.&amp;lt;/ref&amp;gt; ||&lt;br /&gt;
|-&lt;br /&gt;
| Boolos, Burgess, Jeffrey (5th ed) || arithmetically defines&amp;lt;ref name=&amp;quot;boolos&amp;quot;&amp;gt;George S. Boolos; John P. Burgess; Richard C. Jeffrey. &#039;&#039;Computability and Logic&#039;&#039; (5th ed). p. 199 for &amp;quot;arithmetically defines&amp;quot;. p. 207 for &amp;quot;defines&amp;quot;.&amp;lt;/ref&amp;gt; || defines (for sets), represents (for functions)&amp;lt;ref name=&amp;quot;boolos&amp;quot;/&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| Wikipedia || [[wikipedia:Arithmetical set|arithmetically defines]], and I think [https://en.wikipedia.org/wiki/Tarski&#039;s_undefinability_theorem#Statement_of_the_theorem this page] uses &amp;quot;defines&amp;quot; in the expresses sense (though I&#039;m not sure; this sense of &amp;quot;defines&amp;quot; seems different) || [https://en.wikipedia.org/wiki/Diagonal_lemma#Background this page] uses &amp;quot;represents&amp;quot;, but I don&#039;t think there&#039;s a standalone article for the concept&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Computability_and_logic/Expresses_versus_captures&amp;diff=3583</id>
		<title>User:IssaRice/Computability and logic/Expresses versus captures</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Computability_and_logic/Expresses_versus_captures&amp;diff=3583"/>
		<updated>2023-04-23T07:25:21Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: /* Capturing functions */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The &#039;&#039;&#039;expresses versus captures&#039;&#039;&#039; distinction is an important one in mathematical logic, but unfortunately the terminology differs wildly between different texts. The following table gives a comparison.&lt;br /&gt;
&lt;br /&gt;
* Expressing is done by a language. There is only one form of expressing; I think this follows from the [[wikipedia:Law of excluded middle]].&lt;br /&gt;
* Capturing is done by a theory or by axioms. There are two forms of capturing: strong capture (corresponding to deciding), and weak capture (corresponding to recognizing, or semi-deciding).&lt;br /&gt;
&lt;br /&gt;
==Comparing strengths==&lt;br /&gt;
&lt;br /&gt;
For the predicate version of expresses/captures, does one imply the other?&lt;br /&gt;
&lt;br /&gt;
It turns out that given a sound theory, &amp;quot;captures&amp;quot; implies &amp;quot;expresses&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
However, even for a &amp;quot;nice&amp;quot; theory, the implication in the other direction does not hold. A good example is the provability property for the theory, which takes a Goedel number of a sentence and is true iff that sentence is provable. This property turns out to be expressible but not capturable.&lt;br /&gt;
&lt;br /&gt;
==Capturing functions==&lt;br /&gt;
&lt;br /&gt;
For functions, it seems like there are at least four different strengths.&lt;br /&gt;
&lt;br /&gt;
# &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt; is captured by &amp;lt;math&amp;gt;\phi(x,y)&amp;lt;/math&amp;gt; iff for all &amp;lt;math&amp;gt;m,n&amp;lt;/math&amp;gt; (i) if &amp;lt;math&amp;gt;f(m) = n&amp;lt;/math&amp;gt; then &amp;lt;math&amp;gt;T \vdash \phi(\overline{m}, \overline{n})&amp;lt;/math&amp;gt; and (ii) &amp;lt;math&amp;gt;T \vdash \exists y (\phi(\overline{m}, y) \wedge \forall v(\phi(\overline{m}, v) \to v=y))&amp;lt;/math&amp;gt;.&amp;lt;ref name=&amp;quot;smith&amp;quot;&amp;gt;Peter Smith. Godel book, p. 119, 120, 122.&amp;lt;/ref&amp;gt;&lt;br /&gt;
# &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt; is captured by &amp;lt;math&amp;gt;\phi(x,y)&amp;lt;/math&amp;gt; iff for all &amp;lt;math&amp;gt;m,n&amp;lt;/math&amp;gt;, if &amp;lt;math&amp;gt;f(m) = n&amp;lt;/math&amp;gt;, then &amp;lt;math&amp;gt;T \vdash \forall y (\phi(\overline m,y) \leftrightarrow y = \overline n)&amp;lt;/math&amp;gt;.&amp;lt;ref name=&amp;quot;smith&amp;quot;/&amp;gt;&lt;br /&gt;
# &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt; is captured by &amp;lt;math&amp;gt;\phi(x,y)&amp;lt;/math&amp;gt; iff for all &amp;lt;math&amp;gt;m,n&amp;lt;/math&amp;gt; (i) if &amp;lt;math&amp;gt;f(m)=n&amp;lt;/math&amp;gt; then &amp;lt;math&amp;gt;T \vdash \phi(\overline m, \overline n)&amp;lt;/math&amp;gt;, and (ii) if &amp;lt;math&amp;gt;f(m)\ne n&amp;lt;/math&amp;gt; then &amp;lt;math&amp;gt;T \vdash \neg \phi(\overline m, \overline n)&amp;lt;/math&amp;gt;.&amp;lt;ref name=&amp;quot;smith&amp;quot;/&amp;gt;&lt;br /&gt;
# &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt; is captured by &amp;lt;math&amp;gt;\phi(x,y)&amp;lt;/math&amp;gt; iff (i) for all &amp;lt;math&amp;gt;m,n&amp;lt;/math&amp;gt;, if &amp;lt;math&amp;gt;f(m) = n&amp;lt;/math&amp;gt; then &amp;lt;math&amp;gt;T \vdash \phi(\overline m, \overline n)&amp;lt;/math&amp;gt;, and (ii) we have &amp;lt;math&amp;gt;T \vdash \forall x \exists y (\phi(x,y) \wedge \forall v (\phi(x,v) \to v=y))&amp;lt;/math&amp;gt;.&amp;lt;ref name=&amp;quot;smith&amp;quot;/&amp;gt;&lt;br /&gt;
# &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt; is captured by &amp;lt;math&amp;gt;\phi(x,y)&amp;lt;/math&amp;gt; iff for all &amp;lt;math&amp;gt;m,n&amp;lt;/math&amp;gt; (i) if &amp;lt;math&amp;gt;f(m)=n&amp;lt;/math&amp;gt; then &amp;lt;math&amp;gt;T \vdash \phi(\overline m, \overline n)&amp;lt;/math&amp;gt;, and (ii) if &amp;lt;math&amp;gt;f(m)\ne n&amp;lt;/math&amp;gt; then &amp;lt;math&amp;gt;T \nvdash \phi(\overline m, \overline n)&amp;lt;/math&amp;gt;.&amp;lt;ref&amp;gt;Leary and Kristiansen. A Friendly Introduction to Mathematical Logic (2nd ed). p. 121&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
I tried reading Peter Smith&#039;s explanation in IGT2 for why one version of functional capture is preferred (in particular why it&#039;s better for the theory like Q or PA to &amp;quot;know&amp;quot; that the two-place formula is functional), but I am still confused. I can&#039;t tell where, downstream in some proof that leads up to the first incompleteness theorem, this definition gets used in a way that the naive &amp;quot;just make sure the theory can case-by-case prove the graph relation of the function&amp;quot; approach &#039;&#039;doesn&#039;t&#039;&#039; work.&lt;br /&gt;
&lt;br /&gt;
==Comparison of usage patterns==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;sortable wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Text !! &amp;quot;Expresses&amp;quot; !! &amp;quot;Captures&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
| Peter Smith. Godel book (see especially footnote 9 on p. 45) || expresses || captures&lt;br /&gt;
|-&lt;br /&gt;
| Leary &amp;amp; Kristiansen || defines || represents&lt;br /&gt;
|-&lt;br /&gt;
| Goldrei || defines (but the book also uses &amp;quot;represents&amp;quot;)&amp;lt;ref&amp;gt;Goldrei. &#039;&#039;Propositional and Predicate Calculus&#039;&#039;. p. 137.&amp;lt;/ref&amp;gt; ||&lt;br /&gt;
|-&lt;br /&gt;
| Boolos, Burgess, Jeffrey (5th ed) || arithmetically defines&amp;lt;ref name=&amp;quot;boolos&amp;quot;&amp;gt;George S. Boolos; John P. Burgess; Richard C. Jeffrey. &#039;&#039;Computability and Logic&#039;&#039; (5th ed). p. 199 for &amp;quot;arithmetically defines&amp;quot;. p. 207 for &amp;quot;defines&amp;quot;.&amp;lt;/ref&amp;gt; || defines (for sets), represents (for functions)&amp;lt;ref name=&amp;quot;boolos&amp;quot;/&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| Wikipedia || [[wikipedia:Arithmetical set|arithmetically defines]], and I think [https://en.wikipedia.org/wiki/Tarski&#039;s_undefinability_theorem#Statement_of_the_theorem this page] uses &amp;quot;defines&amp;quot; in the expresses sense (though I&#039;m not sure; this sense of &amp;quot;defines&amp;quot; seems different) || [https://en.wikipedia.org/wiki/Diagonal_lemma#Background this page] uses &amp;quot;represents&amp;quot;, but I don&#039;t think there&#039;s a standalone article for the concept&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Computability_and_logic/Some_important_distinctions_and_equivalences_in_introductory_mathematical_logic&amp;diff=3582</id>
		<title>User:IssaRice/Computability and logic/Some important distinctions and equivalences in introductory mathematical logic</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Computability_and_logic/Some_important_distinctions_and_equivalences_in_introductory_mathematical_logic&amp;diff=3582"/>
		<updated>2023-04-21T18:40:34Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page lists some important distinctions and equivalences in introductory mathematical logic and computability theory. Bizarrely, most books won&#039;t even mention these distinctions, so you will probably be &#039;&#039;very&#039;&#039; confused at the start as you inevitably conflate the distinct ideas and can&#039;t confidently make the connection between equivalent ideas.&lt;br /&gt;
&lt;br /&gt;
This page is kind of a &amp;quot;list of things I wish someone told me&amp;quot; as I was learning basic logic. It&#039;s not intended as a standalone introduction to mathematical logic, but rather a supplement to more standard resources. I imagine this page being too easy or completely obvious to people who have studied mathematical logic, and incomprehensible to people who have not studied it. There should be a &amp;quot;sweet spot&amp;quot; somewhere in between, of someone who is struggling with making sense of mathematical logic. In particular, if you are working through Boolos/Burgess/Jeffrey, this might be the post for you!&lt;br /&gt;
&lt;br /&gt;
For the LessWrong audience: (1) there have been a number of posts on mathematical logic before; (2) some folks are studying this material in preparation for doing AI alignment research; (3) making distinctions between concepts that seem similar at first seems like a useful rationality skill [https://www.lesswrong.com/posts/ALCnqX6Xx8bpFMZq3/the-cartoon-guide-to-loeb-s-theorem#8YBFpWjAbosrHzCvW] [https://www.greaterwrong.com/posts/3FoMuCLqZggTxoC3S/logical-pinpointing/comment/vPaWniDjMQxoyLN7a]. Other applications? (metaethics, buckets errors, &amp;quot;your strength as a rationalist&amp;quot; vs &amp;quot;your consistency as a theory&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
==Completeness==&lt;br /&gt;
&lt;br /&gt;
The term &amp;quot;complete&amp;quot; can apply to a &#039;&#039;logic&#039;&#039;, in which case it is also called &amp;quot;semantically complete&amp;quot;. If a logic is semantically complete, it means that if a set of sentences &amp;lt;math&amp;gt;\Delta&amp;lt;/math&amp;gt; semantically implies a sentence &amp;lt;math&amp;gt;\phi&amp;lt;/math&amp;gt; (i.e. every interpretation that makes every sentence in &amp;lt;math&amp;gt;\Delta&amp;lt;/math&amp;gt; true also makes &amp;lt;math&amp;gt;\phi&amp;lt;/math&amp;gt; true), then it is possible to prove &amp;lt;math&amp;gt;\phi&amp;lt;/math&amp;gt; using &amp;lt;math&amp;gt;\Delta&amp;lt;/math&amp;gt; as the set of assumptions. This meaning of completeness is the topic of Gödel&#039;s completeness theorem. In addition, this form of completeness can be stated in another way by saying that any consistent set of sentences is satisfiable (has a model); some proofs of Gödel&#039;s completeness theorem use this formulation (for a proof of the equivalence, see [https://machinelearning.subwiki.org/wiki/User:IssaRice/Computability_and_logic/Semantic_completeness#Alternative_formulation here]).&lt;br /&gt;
&lt;br /&gt;
The term &amp;quot;complete&amp;quot; can also apply to a &#039;&#039;theory&#039;&#039;, in which case it is also called &amp;quot;negation-complete&amp;quot;. If a theory is negation-complete, it means that for every sentence &amp;lt;math&amp;gt;\phi&amp;lt;/math&amp;gt;, the theory proves either &amp;lt;math&amp;gt;\phi&amp;lt;/math&amp;gt; or &amp;lt;math&amp;gt;\neg \phi&amp;lt;/math&amp;gt; (if it proves both, it is still negation-complete, but it is also inconsistent, so is not an interesting theory). This meaning of completeness is the topic of Gödel&#039;s first incompleteness theorem (which states that certain theories of interest are &#039;&#039;not&#039;&#039; negation-complete). Completeness for theories can also be stated in the following way: a theory &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt; is incomplete iff it does not decide every sentence (i.e. there exists an undecidable sentence).&lt;br /&gt;
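&lt;br /&gt;
To make &amp;quot;decides&amp;quot; concrete, here is a minimal Python sketch using toy propositional theories over two atoms. All the names in it (ATOMS, valuations, models_of, proves, decides) are made up for illustration. One big assumption: provability is replaced by &amp;quot;true in every model of the axioms&amp;quot;, which agrees with provability for propositional logic by soundness and completeness. The theories that Gödel&#039;s first incompleteness theorem talks about are first-order, so this is only an analogy.&lt;br /&gt;
&lt;br /&gt;
```python
from itertools import product

ATOMS = ["p", "q"]  # hypothetical atoms of the toy language


def valuations():
    """Yield every truth assignment to ATOMS."""
    for values in product([False, True], repeat=len(ATOMS)):
        yield dict(zip(ATOMS, values))


def models_of(axioms):
    """All valuations that make every axiom true."""
    return [v for v in valuations() if all(ax(v) for ax in axioms)]


def proves(axioms, sentence):
    # ASSUMPTION: semantic consequence stands in for provability here.
    return all(sentence(v) for v in models_of(axioms))


def decides(axioms, sentence):
    """The theory proves the sentence or proves its negation."""
    def negation(v):
        return not sentence(v)
    return proves(axioms, sentence) or proves(axioms, negation)


def p(v):
    return v["p"]


def q(v):
    return v["q"]


def not_p(v):
    return not v["p"]
```
&lt;br /&gt;
With these definitions, the theory whose only axiom is p does not decide q, so it is not negation-complete. The inconsistent theory with axioms p and not_p has no models, so it proves everything and is trivially negation-complete.&lt;br /&gt;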
&lt;br /&gt;
These two ideas are actually related. See Peter Smith&#039;s footnote on this.&lt;br /&gt;
&lt;br /&gt;
The lesson with completeness is to always be sure what kind of object it is being applied to.&lt;br /&gt;
&lt;br /&gt;
Outside of logic, mathematics has [https://en.wikipedia.org/wiki/Completeness#Mathematics many different meanings of &amp;quot;completeness&amp;quot;]. However, these are usually easy to tell apart, since the contexts are so different.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Exercise&#039;&#039;&#039;. (Give the one about how goedel&#039;s completeness and incompleteness theorems seem contradictory)&lt;br /&gt;
&lt;br /&gt;
==Soundness==&lt;br /&gt;
&lt;br /&gt;
The term &amp;quot;sound&amp;quot; can apply to a &#039;&#039;logic&#039;&#039; (this is the topic of the soundness theorem) or to a &#039;&#039;theory&#039;&#039;. Soundness of a logic is about truth in all interpretations, while soundness of a theory is about truth in a specific interpretation. Unless the axioms of a theory are just valid sentences (in which case the axioms add nothing beyond what the logic already has, assuming the logic is complete), the axioms will in general be false in some interpretations. This is what makes theories interesting: they have non-logical content.&lt;br /&gt;
&lt;br /&gt;
==Truth in intended interpretation vs truth in all interpretations==&lt;br /&gt;
&lt;br /&gt;
Let &amp;lt;math&amp;gt;P&amp;lt;/math&amp;gt; be a predicate, and let &amp;lt;math&amp;gt;c&amp;lt;/math&amp;gt; be a constant. Is the sentence &amp;lt;math&amp;gt;P(c)&amp;lt;/math&amp;gt; true? You would rightly say that the truth depends on what meanings we gave to the predicate and the constant. What about the sentence &amp;lt;math&amp;gt;P(c)\vee \neg P(c)&amp;lt;/math&amp;gt;? Now we might be tempted to say that this sentence is true. Someone might say &amp;quot;But you don&#039;t even know what &amp;lt;math&amp;gt;P&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;c&amp;lt;/math&amp;gt; &#039;&#039;mean&#039;&#039;! How can you know if it&#039;s true or not?&amp;quot; One might respond that no matter what meanings we gave to &amp;lt;math&amp;gt;P&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;c&amp;lt;/math&amp;gt;, the sentence would come out true. In other words, there is no way to assign meanings to &amp;lt;math&amp;gt;P&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;c&amp;lt;/math&amp;gt; that would make the sentence false.&lt;br /&gt;
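&lt;br /&gt;
The claim that no assignment of meanings makes the sentence false can be spot-checked by brute force on small finite domains. Below is a rough Python sketch, assuming an interpretation is just a finite domain, a subset of it for &amp;lt;math&amp;gt;P&amp;lt;/math&amp;gt;, and an element of it for &amp;lt;math&amp;gt;c&amp;lt;/math&amp;gt;. The function names are hypothetical. Checking domains up to size 3 is &#039;&#039;not&#039;&#039; a proof of validity, since validity quantifies over all interpretations, including infinite ones; it only illustrates the difference between the two sentences.&lt;br /&gt;
&lt;br /&gt;
```python
from itertools import product


def interpretations(max_size):
    """Yield (domain, extension of P, value of c) for small finite domains."""
    for size in range(1, max_size + 1):
        domain = range(size)
        # Every subset of the domain is a possible meaning for P.
        for membership in product([False, True], repeat=size):
            extension = {d for d in domain if membership[d]}
            # Every element of the domain is a possible meaning for c.
            for c in domain:
                yield domain, extension, c


def atom(extension, c):
    """P(c)"""
    return c in extension


def excluded_middle(extension, c):
    """P(c) or not P(c)"""
    return (c in extension) or (c not in extension)


def true_everywhere(sentence, max_size=3):
    """True iff the sentence holds in every interpretation enumerated above."""
    return all(sentence(ext, c) for _, ext, c in interpretations(max_size))
```
&lt;br /&gt;
Here true_everywhere(excluded_middle) comes out True, while true_everywhere(atom) comes out False, because some interpretation puts the meaning of &amp;lt;math&amp;gt;c&amp;lt;/math&amp;gt; outside the meaning of &amp;lt;math&amp;gt;P&amp;lt;/math&amp;gt;.&lt;br /&gt;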
&lt;br /&gt;
For more on truth in all interpretations (validity) versus truth in the intended interpretation (also called the natural reading or standard interpretation), see [[../Intended interpretation versus all interpretations]].&lt;br /&gt;
&lt;br /&gt;
==All interpretations vs all interpretations of the non-logical symbols of the language==&lt;br /&gt;
&lt;br /&gt;
When we use &amp;lt;math&amp;gt;\models&amp;lt;/math&amp;gt;, we still keep the interpretation of the &#039;&#039;logical&#039;&#039; symbols fixed in the usual/intended way; only the non-logical symbols get reinterpreted. See also footnote 5 [https://www.logicmatters.net/resources/pdfs/GWT2edn.pdf#page=14 here].&lt;br /&gt;
&lt;br /&gt;
==The models symbol==&lt;br /&gt;
&lt;br /&gt;
The models symbol, &amp;lt;math&amp;gt;\models&amp;lt;/math&amp;gt;, comes up a lot in logic. But its meaning can be quite different depending on how it is used.&lt;br /&gt;
&lt;br /&gt;
The two basic variations are:&lt;br /&gt;
&lt;br /&gt;
* When a set of sentences comes before the symbol, like in &amp;lt;math&amp;gt;\Gamma \models \phi&amp;lt;/math&amp;gt;. In this case, we are talking about the truth in all interpretations. When &amp;lt;math&amp;gt;\Gamma = \emptyset&amp;lt;/math&amp;gt;, it is sometimes omitted, so that we just write &amp;lt;math&amp;gt;\models \phi&amp;lt;/math&amp;gt;.&lt;br /&gt;
* When a structure comes before the symbol, like in &amp;lt;math&amp;gt;\mathfrak A \models \phi&amp;lt;/math&amp;gt;. In this case, we are talking about the truth in just that one structure.&lt;br /&gt;
&lt;br /&gt;
For more detail, see [[../Models symbol]].&lt;br /&gt;
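&lt;br /&gt;
The two readings can be compared in a propositional toy setting, where a truth assignment plays the role of a structure. The sketch below is only an analogy under that assumption; the names ATOMS, valuations, true_in and entails are invented for illustration and do not come from any of the books mentioned on this page.&lt;br /&gt;
&lt;br /&gt;
```python
from itertools import product

ATOMS = ["p", "q"]  # hypothetical atoms for the toy language


def valuations():
    """Yield every truth assignment (the toy analogue of a structure)."""
    for values in product([False, True], repeat=len(ATOMS)):
        yield dict(zip(ATOMS, values))


def true_in(valuation, sentence):
    """Structure reading: the sentence is true in this one valuation."""
    return sentence(valuation)


def entails(gamma, sentence):
    """Set reading: true in every valuation that satisfies all of gamma."""
    return all(
        true_in(v, sentence)
        for v in valuations()
        if all(true_in(v, g) for g in gamma)
    )


def p(v):
    return v["p"]


def p_or_q(v):
    return v["p"] or v["q"]


def p_or_not_p(v):
    return v["p"] or not v["p"]
```
&lt;br /&gt;
For example, entails([p], p_or_q) holds but entails([p_or_q], p) does not, and entails([], p_or_not_p) is the toy version of writing &amp;lt;math&amp;gt;\models \phi&amp;lt;/math&amp;gt; with the empty set omitted. By contrast, true_in only ever looks at the single valuation it is given.&lt;br /&gt;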
&lt;br /&gt;
==Syntax vs semantics==&lt;br /&gt;
&lt;br /&gt;
I have a rough draft quiz about this that I should post. Also see &amp;lt;ref&amp;gt;https://www.hedonisticlearning.com/posts/the-pedagogy-of-logic-a-rant.html#syntax-versus-semantics&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;proves&amp;quot; vs &amp;quot;semantically implies&amp;quot; (absence of counterexample)&lt;br /&gt;
* deduction vs truth tables&lt;br /&gt;
&lt;br /&gt;
In computability theory, &amp;quot;syntax&amp;quot; corresponds to algorithms and &amp;quot;semantics&amp;quot; corresponds to functions computable via algorithms. See Rice&#039;s theorem, where this distinction becomes especially important.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;T \vdash \neg \phi&amp;lt;/math&amp;gt; is very different from &amp;lt;math&amp;gt;T \not\vdash \phi&amp;lt;/math&amp;gt;, but &amp;lt;math&amp;gt;\mathfrak A \models \neg \phi&amp;lt;/math&amp;gt; is equivalent to &amp;lt;math&amp;gt;\mathfrak A \not\models \phi&amp;lt;/math&amp;gt; (in fact, the latter is the definition of the former!).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Exercise.&#039;&#039;&#039; Is &amp;lt;math&amp;gt;T \models \neg \phi&amp;lt;/math&amp;gt; equivalent to &amp;lt;math&amp;gt;T \not\models \phi&amp;lt;/math&amp;gt;?&lt;br /&gt;
&lt;br /&gt;
==Sentence vs wff vs formula vs closed formula==&lt;br /&gt;
&lt;br /&gt;
* formula = wff (you can read some of the history [https://en.wikipedia.org/wiki/Well-formed_formula#Usage_of_the_terminology here])&lt;br /&gt;
* sentence = closed formula&lt;br /&gt;
* the exception is that in propositional logic, since there are no quantifiers, it doesn&#039;t make sense to distinguish between formulas and sentences, so I think everyone just calls these things formulas.&lt;br /&gt;
&lt;br /&gt;
==Enumerating vs computably enumerating vs primitive-recursively enumerating==&lt;br /&gt;
&lt;br /&gt;
First of all, there are a lot of terms that mean the same thing, so to clarify: &amp;quot;computably enumerable&amp;quot; means &amp;quot;recursively enumerable&amp;quot; means &amp;quot;effectively enumerable&amp;quot; (although the last one might be used as the informal counterpart, so would be equivalent to the formal counterparts only given the Church-Turing thesis).&lt;br /&gt;
&lt;br /&gt;
Let &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; be a set of natural numbers. We might say that &amp;lt;math&amp;gt;f : \mathbf N \to A&amp;lt;/math&amp;gt; &#039;&#039;enumerates&#039;&#039; &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; iff &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt; is surjective. Why? because we can list out the elements of &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; as follows: &amp;lt;math&amp;gt;f(0), f(1), f(2), \ldots&amp;lt;/math&amp;gt;. Since &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt; is surjective, every element of &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; will show up in this list at some point, i.e. we have &amp;quot;enumerated&amp;quot; the elements of &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt;. Call a set enumerable if there is a function that enumerates it. For those who have studied countability/uncountability, the definition of &amp;quot;enumerable&amp;quot; given here is the same as &amp;quot;at most countable&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
The idea in computability theory is that we don&#039;t want just any enumeration. Since we are studying computable functions, we want to restrict our attention to &#039;&#039;computable&#039;&#039; enumerations. This leads to the idea of computably enumerable sets.&lt;br /&gt;
&lt;br /&gt;
Given that primitive recursive functions are a strictly smaller class compared to partial recursive functions, one might suspect that sets which can be enumerated by primitive recursive functions are also a strictly smaller class compared to sets which can be enumerated by partial recursive functions. One inclusion is easy: suppose that a set is primitive-recursively enumerable. Then we have some primitive recursive function &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt; that enumerates it. But primitive recursive functions are recursive, so &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt; is recursive. Thus the set is recursively enumerable. That&#039;s one direction, but it&#039;s not obvious whether the converse holds. It turns out that the converse does hold (at least for nonempty sets), so the two classes coincide.&lt;br /&gt;
&lt;br /&gt;
NOTE: in [[Solomonoff induction]] and related areas there is a concept of &amp;quot;enumerable&amp;quot; which is different from the enumeration used here.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Exercise&#039;&#039;&#039; (from Boolos). Show that enumerable = &amp;quot;at most countable&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
==algorithm vs program vs source code vs index vs Godel number==&lt;br /&gt;
&lt;br /&gt;
I think all of these are essentially the same in that they are all ways of encoding some objects one cares about, but I notice that my &amp;quot;mental imagery&amp;quot; is different for them.&lt;br /&gt;
&lt;br /&gt;
Formulas are identified by their syntax (how they are written) rather than by their meaning, whereas computable partial functions are identified by their meaning (behavior) rather than their syntax. This means that if you are told a partial function is computable you can find &#039;&#039;some&#039;&#039; algorithm that computes it, but it won&#039;t be &amp;quot;the&amp;quot; algorithm, since there are many algorithms that compute it. In the case of formulas, you can find &#039;&#039;the&#039;&#039; formula given a godel number.&lt;br /&gt;
&lt;br /&gt;
* Godel number I think is used most in logic when numbering formulas. It&#039;s also the case that when numbering formulas, no two godel numbers have the same formula, i.e. the mapping from godel numbers to formulas is injective. (This is not the case when numbering computable functions.)&lt;br /&gt;
* When I think of &amp;quot;algorithm&amp;quot; or &amp;quot;program&amp;quot;, I imagine it being a string rather than a natural number (it seems most concrete to me to imagine a Python program). This distinction turns out to not be important, since we can just think of strings as integers. The term &amp;quot;algorithm&amp;quot; especially makes me think about the actual behavior, whereas with &amp;quot;index&amp;quot; I don&#039;t think about it.&lt;br /&gt;
* When I think of &amp;quot;index&amp;quot; I think of natural numbers. I also imagine there being some sequence of indices, whereas with &amp;quot;algorithm&amp;quot; or &amp;quot;program&amp;quot; I can think of them in isolation.&lt;br /&gt;
&lt;br /&gt;
TODO: mention lambda calculus, where there is no distinction between the different &amp;quot;levels&amp;quot;, so e.g. we get a fixed point result without doing any encoding.&lt;br /&gt;
&lt;br /&gt;
see also: https://machinelearning.subwiki.org/wiki/User:IssaRice/Computability_and_logic/Index_and_program&lt;br /&gt;
&lt;br /&gt;
==theory vs axioms==&lt;br /&gt;
&lt;br /&gt;
I think &amp;quot;theory&amp;quot; and &amp;quot;axioms&amp;quot; can often be used interchangeably in logic. For instance, to say that a theory proves a sentence is the same as saying the axioms prove the sentence, since everything in the theory comes from the axioms.&lt;br /&gt;
&lt;br /&gt;
Synecdoche problem: in mathematics more generally, when referring to a structured space (such as a group) it is conventional to use the same name for both the space itself and the set of objects living in that space. See [https://terrytao.wordpress.com/2015/09/29/275a-notes-0-foundations-of-probability-theory/#comment-460004 these] [https://terrytao.wordpress.com/2009/01/12/245b-notes-1-the-stone-and-loomis-sikorski-representation-theorems-optional/#comment-48246 two] comments by Terence Tao. I think a similar thing happens with the word &amp;quot;theory&amp;quot; in logic. Some books say a theory is just a set of sentences -- any set of sentences. Other books require that this set be closed under deductions (it would be pretty strange to have a theory where &amp;lt;math&amp;gt;p&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;q&amp;lt;/math&amp;gt; are theorems but something deducible from them, such as &amp;lt;math&amp;gt;p\wedge q&amp;lt;/math&amp;gt;, is not a theorem). Yet other books single out the starting set of sentences (as non-logical axioms) and then call the set of sentences that can be proved from it the theorems; the &amp;quot;theory&amp;quot; is then the overall name applied to the whole structure.&lt;br /&gt;
&lt;br /&gt;
In practice this distinction doesn&#039;t seem to matter much.&lt;br /&gt;
&lt;br /&gt;
==Deciding==&lt;br /&gt;
&lt;br /&gt;
The terms &amp;quot;decides&amp;quot;, &amp;quot;decidable&amp;quot;, &amp;quot;deciding&amp;quot;, etc. show up a lot in computability and logic. There are several meanings here, which makes things pretty confusing.&lt;br /&gt;
&lt;br /&gt;
* Decidable set or relation: there must be an algorithm that, given any input in some class (e.g. a natural number), tells us whether or not the input is in the set or has the property. The thing to contrast &amp;quot;decidable&amp;quot; with is recognizing or semi-deciding, which says that if the input is in the set, we must say so in a finite amount of time, but if it isn&#039;t, then we don&#039;t need to produce any answer.&lt;br /&gt;
* Decidable theory: given a sentence, we must say whether or not the sentence is provable in the theory. This is an instance of the previous point, if we think of a theory as a set of sentences.&lt;br /&gt;
* A theory deciding a sentence: a theory &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt; decides a sentence &amp;lt;math&amp;gt;\phi&amp;lt;/math&amp;gt; if &amp;lt;math&amp;gt;T \vdash \phi&amp;lt;/math&amp;gt; or &amp;lt;math&amp;gt;T \vdash \neg\phi&amp;lt;/math&amp;gt;. In other words, the theory must prove or refute the sentence. This means that a theory is negation-complete iff it decides every sentence. Note that saying &amp;lt;math&amp;gt;T \vdash \neg\phi&amp;lt;/math&amp;gt; is very different from saying &amp;lt;math&amp;gt;T \not\vdash \phi&amp;lt;/math&amp;gt;.&lt;br /&gt;
* Decidable logic: to say that a logic (such as propositional logic or first-order logic) is decidable means that there is an algorithm that will tell us whether or not any given sentence is valid (i.e. true under every interpretation). In the case of first-order logic, the question of whether it is decidable is called the &#039;&#039;Entscheidungsproblem&#039;&#039;. Note also that the decidability of first-order logic can be stated in several equivalent ways.&lt;br /&gt;
&lt;br /&gt;
==algorithm vs function computed by algorithm==&lt;br /&gt;
&lt;br /&gt;
see [[../Function versus algorithm]]&lt;br /&gt;
&lt;br /&gt;
Things that algorithms have that functions computable by algorithms don&#039;t have:&lt;br /&gt;
&lt;br /&gt;
* The type &amp;lt;math&amp;gt;\mathbf N&amp;lt;/math&amp;gt;&lt;br /&gt;
* A runtime&lt;br /&gt;
* A specific memory usage (note: you can prove things about functions like &amp;quot;it can&#039;t be computed using less than such-and-such memory&amp;quot;, but you can always use &#039;&#039;more&#039;&#039; memory to compute a function, so there isn&#039;t a specific number)&lt;br /&gt;
&lt;br /&gt;
Things that functions computable by algorithms have that algorithms don&#039;t have:&lt;br /&gt;
&lt;br /&gt;
* The type &amp;lt;math&amp;gt;\mathbf N \to \mathbf N&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Further reading==&lt;br /&gt;
&lt;br /&gt;
===Diagonalization redpill===&lt;br /&gt;
&lt;br /&gt;
As you&#039;ve gone along studying computability and logic, I&#039;m sure you&#039;ve had thoughts like &amp;quot;hmm, Cantor&#039;s theorem, Russell&#039;s paradox, and the halting problem all seem pretty much the same&amp;quot; or &amp;quot;the recursion theorem and the diagonalization lemma seem suspiciously similar&amp;quot;. You may even have heard of Lawvere&#039;s theorem, which is supposed to tie all these things together, but you may have said &amp;quot;eh, I don&#039;t know any category theory and I&#039;m not sure it&#039;s worth learning it just to understand diagonalization better&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
My suggestion is to eventually read Yanofsky&#039;s paper, [https://arxiv.org/abs/math/0305282 &amp;quot;A Universal Approach to Self-Referential Paradoxes, Incompleteness and Fixed Points&amp;quot;]. The paper covers Lawvere&#039;s theorem and ties all the diagonalization results together &#039;&#039;without using any category theory&#039;&#039;. The paper is highly readable (assuming you&#039;ve already studied computability and logic). I suggested reading the paper &amp;quot;eventually&amp;quot; because I&#039;m not sure reading it immediately before/after studying logic produces the best &amp;quot;emotional impact&amp;quot;. Personally, I found the paper after struggling a lot with trying to make sense of diagonalization (specifically, I had been spending a few days trying to see how the recursion theorem and diagonalization lemma were the same).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Exercise&#039;&#039;&#039;. Find the flaw in Yanofsky&#039;s recursion theorem proof.&lt;br /&gt;
&lt;br /&gt;
===Reverse mathematics===&lt;br /&gt;
&lt;br /&gt;
Stillwell&#039;s book.&lt;br /&gt;
&lt;br /&gt;
==Language, logic, formal system, system, theory, structure==&lt;br /&gt;
&lt;br /&gt;
Nate Soares already covered some of this in his post.&lt;br /&gt;
&lt;br /&gt;
Here&#039;s a table that might help to keep things straight.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Name !! Examples&lt;br /&gt;
|-&lt;br /&gt;
| Language ||&lt;br /&gt;
|-&lt;br /&gt;
| Signature ||&lt;br /&gt;
|-&lt;br /&gt;
| Logic || Propositional logic, first-order logic&lt;br /&gt;
|-&lt;br /&gt;
| Theory || Robinson arithmetic, Peano arithmetic&lt;br /&gt;
|-&lt;br /&gt;
| Structure || The natural numbers&lt;br /&gt;
|-&lt;br /&gt;
| Proof system || Hilbert system, natural deduction&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
language/signature manage the symbols&lt;br /&gt;
&lt;br /&gt;
logic (and arity function of signature) manage the syntax&lt;br /&gt;
&lt;br /&gt;
theory/axioms manage the content-specific rules of how to push symbols around&lt;br /&gt;
&lt;br /&gt;
proof system manages the content-agnostic rules of how to push symbols around&lt;br /&gt;
&lt;br /&gt;
structure manages how to interpret the symbol pushing&lt;br /&gt;
&lt;br /&gt;
==notes==&lt;br /&gt;
&lt;br /&gt;
* structure vs interpretation vs model: synecdoche problem&lt;br /&gt;
* interpretation: there&#039;s interpretation in the &amp;quot;structure&amp;quot; sense and also interpretation in the sense of embedding one theory inside another theory (e.g. doing arithmetic with sets)&lt;br /&gt;
* proves logically vs proves in a particular theory: each theory adds some non-logical axioms that allows it to prove things that aren&#039;t logically valid&lt;br /&gt;
* deciding (corresponding to computable &#039;&#039;total&#039;&#039; functions) vs recognizing (semi-deciding, computably enumerating; corresponding to computable &#039;&#039;partial&#039;&#039; functions)&lt;br /&gt;
* expresses vs captures (strong capture, weak capture): see [[../Expresses versus captures]]&lt;br /&gt;
* primitive recursive vs (general, &amp;amp;mu;) recursive&lt;br /&gt;
* &amp;lt;math&amp;gt;\Delta_0 = \Sigma_0 = \Pi_0&amp;lt;/math&amp;gt; vs &amp;lt;math&amp;gt;\Delta_1 = \Sigma_1 \cap \Pi_1&amp;lt;/math&amp;gt;&lt;br /&gt;
* &amp;lt;math&amp;gt;\Sigma_n&amp;lt;/math&amp;gt; formulas vs relations vs sets&lt;br /&gt;
* enumerating vs computably enumerating vs primitively recursively enumerating&lt;br /&gt;
* different kinds of deduction systems: axiomatic, natural deduction, sequents, trees&lt;br /&gt;
* model as representation vs that which is represented (in mathematical logic, it is that which is represented, but in other disciplines a different meaning of &amp;quot;model&amp;quot; is used)&lt;br /&gt;
* two different usages of satisfiable for formulas with a free variable? (i.e. whether the formula must be true for all variable assignment functions or just true for some)&lt;br /&gt;
* after covering completeness and decidability, cover https://machinelearning.subwiki.org/wiki/User:IssaRice/Computability_and_logic/List_of_possibilities_for_completeness_and_decidability&lt;br /&gt;
&lt;br /&gt;
There are also some important &#039;&#039;equivalences&#039;&#039; to keep in mind:&lt;br /&gt;
&lt;br /&gt;
* Turing computable = computable = recursive = unbounded loops&lt;br /&gt;
* Primitive recursive = bounded loops&lt;br /&gt;
* computably enumerating = semi-deciding&lt;br /&gt;
* computably enumerating in order = deciding&lt;br /&gt;
* Kleene&#039;s T predicate, godel beta function, Prf(m,n) https://machinelearning.subwiki.org/wiki/User:IssaRice/Computability_and_logic/Bounded_computation_trick&lt;br /&gt;
* &amp;lt;math&amp;gt;\exists&amp;lt;/math&amp;gt; vs mu-search&lt;br /&gt;
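&lt;br /&gt;
A minimal Python sketch of one direction of the &amp;quot;computably enumerating in order = deciding&amp;quot; equivalence above: given a strictly increasing enumeration of a set, membership can be decided by walking the list until the target is reached or passed. The names squares and decide are made up for illustration, and the perfect squares are just a convenient example set.&lt;br /&gt;
&lt;br /&gt;
```python
from itertools import count


def squares():
    # Hypothetical example: enumerate the perfect squares in increasing order.
    for n in count():
        yield n * n


def decide(target, enumerate_in_order):
    # Walk the strictly increasing enumeration until we reach or pass the target.
    for value in enumerate_in_order():
        if value == target:
            return True
        if value > target:
            # Every later value is even larger, so the target never appears.
            return False
    # The enumeration ended, so the set is finite and the target is not in it.
    return False
```
&lt;br /&gt;
Without the ordering assumption the same loop would only semi-decide: it would still halt with True on members, but it could run forever on non-members.&lt;br /&gt;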
&lt;br /&gt;
==External links==&lt;br /&gt;
&lt;br /&gt;
* https://www.greaterwrong.com/posts/MG8Yhsxqu9JY4xRPr/mental-context-for-model-theory&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3581</id>
		<title>User:IssaRice/Shapley value</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3581"/>
		<updated>2023-04-08T21:15:42Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;most expositions of the Shapley value SUCK BALLS because they try to sum over the subsets excluding the player in question (usually called &amp;quot;player i&amp;quot;). so here we go, here&#039;s a TRUE REDPILLED exposition of the shapley value!&lt;br /&gt;
&lt;br /&gt;
first of all, what&#039;s the shapley value even trying to do? once we understand it in words, we can just convert our verbal understanding into symbols. and then we will be done.&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
So, the shapley value is an &#039;&#039;average&#039;&#039;. but what kind of average? an &#039;&#039;arithmetic average&#039;&#039;. well, an arithmetic average takes a specific form. it looks like this. if you&#039;re averaging the elements of some set &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;, then the arithmetic average &amp;lt;math&amp;gt;\bar{X}&amp;lt;/math&amp;gt; is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\bar X = \frac{1}{|X|} \sum_{x\in X} f(x)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We throw in the function f because the elements of X might not be numbers. or even if they &#039;&#039;are&#039;&#039; numbers, you might want to apply some weighting other than the default one (the identity function).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Now, let&#039;s take the ugly-ass formula for the shapley value that you always see:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\ (n - |S| - 1)!}{n!} (v(S \cup \{i\}) - v(S))&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
how is &#039;&#039;that&#039;&#039; supposed to be an average? well first of all, we said above that the shapley value is averaging over all &#039;&#039;sequences&#039;&#039; of ways to add the n players. one way to formalize the concept of a &amp;quot;sequence&amp;quot; or &amp;quot;ordering&amp;quot; is to use permutations. a permutation is just a function that reorders the elements of a set. so each sequence corresponds to a permutation. we can recover a sequence &amp;lt;math&amp;gt;(x_1, x_2, \ldots, x_n)&amp;lt;/math&amp;gt; by defining the permutation &amp;lt;math&amp;gt;\sigma(k) := x_k&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
So in what sense is the shapley value an average? if &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the set of players, then we can define the set of all permutations &amp;lt;math&amp;gt;\mathrm{Sym}(N)&amp;lt;/math&amp;gt; on &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt;. (This is also denoted as &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt; and called the &amp;quot;symmetric group of degree n&amp;quot; since &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the &amp;quot;default&amp;quot; set of size n.)&lt;br /&gt;
&lt;br /&gt;
since the shapley value is an &#039;&#039;average&#039;&#039; and we are in particular averaging over all sequences, we want to rewrite the formula as something that looks like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f(\sigma)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
And in fact, at this point we know enough to convert our verbal understanding into a formula like the one above.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\varphi_i(v) = \frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} (v(\{k : \sigma(k) &amp;lt; \sigma(i)\} \cup \{i\}) - v(\{k : \sigma(k) &amp;lt; \sigma(i)\}))&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
a relevant fact is that the size of &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt; is &amp;lt;math&amp;gt;n!&amp;lt;/math&amp;gt;.&lt;br /&gt;
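&lt;br /&gt;
here is a small Python sketch that computes both the average over all orderings and the usual subset formula, so the two can be compared on a toy game. the characteristic function v and the three-player set PLAYERS are made up purely for illustration; each ordering is represented as a tuple listing the players in the order they join.&lt;br /&gt;
&lt;br /&gt;
```python
from fractions import Fraction
from itertools import combinations, permutations
from math import factorial

PLAYERS = (1, 2, 3)


def v(coalition):
    # Hypothetical characteristic function, chosen only for illustration:
    # the square of the coalition size, plus a bonus of 3 if player 1 is in it.
    s = frozenset(coalition)
    return len(s) ** 2 + (3 if 1 in s else 0)


def shapley_by_orderings(i):
    # Average the marginal contribution of player i over all n! join orders.
    total = Fraction(0)
    for order in permutations(PLAYERS):
        before = order[:order.index(i)]
        total += v(before + (i,)) - v(before)
    return total / factorial(len(PLAYERS))


def shapley_by_subsets(i):
    # The usual subset formula, with weight |S|! (n - |S| - 1)! / n! for each S.
    n = len(PLAYERS)
    others = [p for p in PLAYERS if p != i]
    total = Fraction(0)
    for size in range(n):
        for subset in combinations(others, size):
            weight = Fraction(factorial(size) * factorial(n - size - 1), factorial(n))
            total += weight * (v(subset + (i,)) - v(subset))
    return total
```
&lt;br /&gt;
with this made-up v, both functions give 6 for player 1 and 3 for each of players 2 and 3. the subset weights are just counting how many of the n! orderings put exactly the set S before player i.&lt;br /&gt;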
&lt;br /&gt;
&lt;br /&gt;
There is another way to look at this. or rather, a way to extend our understanding. a common thing done in algebra is to [https://en.wikipedia.org/wiki/Symmetrization#n_variables &#039;&#039;symmetrize&#039;&#039;] a function by adding up all the permutations of the variables. a symmetric function is one in which you can interchange any of the variables and the function will stay the same. not in the trivial sense that &amp;lt;math&amp;gt;f(x,y)&amp;lt;/math&amp;gt; is &amp;quot;the same&amp;quot; as &amp;lt;math&amp;gt;f(y,x)&amp;lt;/math&amp;gt; since x and y are &amp;quot;just variables&amp;quot;. but rather, in the sense that &amp;lt;math&amp;gt;f(x,y)=f(y,x)&amp;lt;/math&amp;gt;. in symbols:&lt;br /&gt;
&lt;br /&gt;
* we are NOT saying &amp;lt;math&amp;gt;(x,y) \mapsto f(x,y) = (y,x) \mapsto f(y,x)&amp;lt;/math&amp;gt;. this is trivially true for all functions!&lt;br /&gt;
* but rather: &amp;lt;math&amp;gt;(x,y) \mapsto f(x,y) = (x,y) \mapsto f(y,x)&amp;lt;/math&amp;gt;. or in other words: &amp;lt;math&amp;gt;\forall x \forall y [f(x,y) = f(y,x)]&amp;lt;/math&amp;gt; (this is not trivially true! it&#039;s false for many functions including &amp;lt;math&amp;gt;f(x,y) := x-y&amp;lt;/math&amp;gt;)&lt;br /&gt;
&lt;br /&gt;
in the case of the shapley value, the &amp;quot;marginal contribution&amp;quot; function is NOT symmetric. so the naive fix that we would hope would work is to symmetrize it by adding all the possible permutations of the variables, forming a new function.&lt;br /&gt;
&lt;br /&gt;
wait, what? what even &#039;&#039;is&#039;&#039; the &amp;quot;marginal contribution function&amp;quot;?? for a player i of interest, it&#039;s the function that gives player i&#039;s marginal contribution, given an arbitrary sequence of players as input. let&#039;s say we are given a sequence &amp;lt;math&amp;gt;(x_1, x_2, \ldots, x_n)&amp;lt;/math&amp;gt;. what&#039;s player i&#039;s marginal contribution in this sequence? well, if x1 = i, then player i is the first player to join, so the marginal contribution is &amp;lt;math&amp;gt;v(\{i\}) - v(\emptyset) = v(\{i\})&amp;lt;/math&amp;gt;. if x2=i, then the marginal contribution of player i is &amp;lt;math&amp;gt;v(\{x_1, i\}) - v(\{x_1\})&amp;lt;/math&amp;gt;. and so on. in general, if &amp;lt;math&amp;gt;x_j = i&amp;lt;/math&amp;gt; then player i&#039;s marginal contribution is &amp;lt;math&amp;gt;v(\{x_1, \ldots, x_{j-1}, i\}) - v(\{x_1, \ldots, x_{j-1}\})&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
as i said, this function, which we can call &amp;lt;math&amp;gt;f_i&amp;lt;/math&amp;gt;, is not symmetric. but we can symmetrize &amp;lt;math&amp;gt;f_i&amp;lt;/math&amp;gt; by adding up all the possible orderings of the input variables:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sum_{\sigma \in \mathrm{Sym}(n)} f_i(x_{\sigma(1)}, \ldots, x_{\sigma(n)})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
given a permutation &amp;lt;math&amp;gt;\sigma \in \mathrm{Sym}(n)&amp;lt;/math&amp;gt; and a function &amp;lt;math&amp;gt;f : X^n \to \mathbf R&amp;lt;/math&amp;gt;, we can define the permutation of the function &amp;lt;math&amp;gt;\sigma^* : (X^n \to \mathbf R) \to X^n \to \mathbf R&amp;lt;/math&amp;gt; by:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sigma^*(f) := (x_1, \ldots, x_n) \mapsto f(x_{\sigma(1)}, \ldots, x_{\sigma(n)})&amp;lt;/math&amp;gt;&lt;br /&gt;
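&lt;br /&gt;
a rough Python sketch of this construction, assuming argument positions are numbered from 0 rather than 1 and a permutation is given as a tuple of positions. the names permute, symmetrize and weighted are hypothetical, and weighted is just an example of a non-symmetric function.&lt;br /&gt;
&lt;br /&gt;
```python
from itertools import permutations


def permute(sigma, f):
    # sigma is a tuple giving a reordering of the argument positions 0..n-1.
    def permuted(*xs):
        return f(*(xs[k] for k in sigma))
    return permuted


def symmetrize(f, n):
    # Sum the permuted copies of f over all n! reorderings of its arguments.
    def symmetrized(*xs):
        return sum(permute(sigma, f)(*xs) for sigma in permutations(range(n)))
    return symmetrized


def weighted(x, y):
    # Hypothetical non-symmetric example: swapping the arguments changes the value.
    return x + 2 * y


sym = symmetrize(weighted, 2)
```
&lt;br /&gt;
here weighted(1, 2) and weighted(2, 1) differ, but sym(1, 2) and sym(2, 1) both evaluate to 9.&lt;br /&gt;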
&lt;br /&gt;
By an abuse of notation, we can drop the star in &amp;lt;math&amp;gt;\sigma^*&amp;lt;/math&amp;gt; and just call the resulting extension &amp;lt;math&amp;gt;\sigma&amp;lt;/math&amp;gt;. Using this new notation, we can rewrite the symmetrization as&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sum_{\sigma \in \mathrm{Sym}(n)} \sigma(f_i)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
There are now two problems:&lt;br /&gt;
&lt;br /&gt;
# the expression above is in terms of variables &amp;lt;math&amp;gt;x_1, \ldots, x_n&amp;lt;/math&amp;gt; or in other words it&#039;s a &#039;&#039;function&#039;&#039;, not a number; we need to evaluate it somehow&lt;br /&gt;
# the expression is not &#039;&#039;normalized&#039;&#039;. the total value of the grand coalition is &amp;lt;math&amp;gt;v(\{1, \ldots, n\})&amp;lt;/math&amp;gt;. we need to make sure that the shapley values we give to the players add up to &amp;lt;math&amp;gt;v(\{1, \ldots, n\})&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
problem 1 is easy to solve. because we symmetrized the function, we can evaluate the function on &#039;&#039;any&#039;&#039; ordering of the players!&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\left(\sum_{\sigma \in \mathrm{Sym}(n)} \sigma(f_i)\right) (1, \ldots, n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
is one such way to get a number. but really, the ordering of (1, ..., n) does not matter.&lt;br /&gt;
&lt;br /&gt;
as for problem 2, we want to find a value of &amp;lt;math&amp;gt;\alpha&amp;lt;/math&amp;gt; such that the following is satisfied:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;v(\{1,\ldots,n\}) = \alpha \sum_{i=1}^n \left(\sum_{\sigma \in \mathrm{Sym}(n)} \sigma(f_i)\right) (1, \ldots, n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
so just rearrange to solve for &amp;lt;math&amp;gt;\alpha&amp;lt;/math&amp;gt;, and we are done. at least, conceptually. it would be nice if we could prove that &amp;lt;math&amp;gt;\alpha = \frac{1}{|\mathrm{Sym}(n)|}&amp;lt;/math&amp;gt;. this is not hard to do (just interchange the two sums. for fixed sigma, the form of f_i means you can telescope-sum it to get &amp;lt;math&amp;gt;v(N)-v(\emptyset) = v(N)&amp;lt;/math&amp;gt; in each inner sum. see [https://en.wikipedia.org/wiki/Shapley_value#Efficiency wikipedia] for details.)&lt;br /&gt;
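&lt;br /&gt;
the telescoping step can be checked numerically with a short Python sketch. the characteristic function v and the player set PLAYERS are made up for illustration (with v of the empty set equal to 0), and marginal plays the role of &amp;lt;math&amp;gt;f_i&amp;lt;/math&amp;gt; applied to one ordering.&lt;br /&gt;
&lt;br /&gt;
```python
from itertools import permutations

PLAYERS = (1, 2, 3)


def v(coalition):
    # Hypothetical characteristic function with v(empty set) = 0.
    s = frozenset(coalition)
    return len(s) ** 2 + (3 if 1 in s else 0)


def marginal(i, order):
    # Marginal contribution of player i when players join in the given order.
    j = order.index(i)
    return v(order[:j + 1]) - v(order[:j])


def total_for_order(order):
    # Inner sum over all players for one fixed ordering; consecutive terms cancel.
    return sum(marginal(i, order) for i in order)


totals = [total_for_order(order) for order in permutations(PLAYERS)]
```
&lt;br /&gt;
every entry of totals equals v(PLAYERS), so the double sum is n! times the value of the grand coalition, which is where the factor of 1/n! comes from.&lt;br /&gt;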
&lt;br /&gt;
So the fully honest shapley value expression is the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\frac{v(\{1,\ldots,n\})}{\sum_{i=1}^n \left(\sum_{\sigma \in \mathrm{Sym}(n)} \sigma(f_i)\right) (1, \ldots, n)} \left(\sum_{\sigma \in \mathrm{Sym}(n)} \sigma(f_i)\right) (1, \ldots, n)&amp;lt;/math&amp;gt;&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3580</id>
		<title>User:IssaRice/Shapley value</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3580"/>
		<updated>2023-04-08T21:14:50Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;most expositions of the Shapley value SUCK BALLS because they try to sum over the subsets excluding the player in question (usually called &amp;quot;player i&amp;quot;). so here we go, here&#039;s a TRUE REDPILLED exposition of the shapley value!&lt;br /&gt;
&lt;br /&gt;
first of all, what&#039;s the shapley value even trying to do? once we understand it in words, we can just convert our verbal understanding into symbols. and then we will be done.&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
So, the shapley value is an &#039;&#039;average&#039;&#039;. but what kind of average? an &#039;&#039;arithmetic average&#039;&#039;. well, an arithmetic average takes a specific form. it looks like this. if you&#039;re averaging the elements of some set &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;, then the arithmetic average &amp;lt;math&amp;gt;\bar{X}&amp;lt;/math&amp;gt; is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\bar X = \frac{1}{|X|} \sum_{x\in X} f(x)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We throw in the function f because the elements of X might not be numbers. or even if they &#039;&#039;are&#039;&#039; numbers, you might want to apply some weighting other than the default one (the identity function).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Now, let&#039;s take the ugly-ass formula for the shapley value that you always see:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\ (n - |S| - 1)!}{n!} (v(S \cup \{i\}) - v(S))&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
how is &#039;&#039;that&#039;&#039; supposed to be an average? well first of all, we said above that the shapley value is averaging over all &#039;&#039;sequences&#039;&#039; of ways to add the n players. one way to formalize the concept of a &amp;quot;sequence&amp;quot; or &amp;quot;ordering&amp;quot; is to use permutations. a permutation is just a function that reorders the elements of a set. so each sequence corresponds to a permutation. we can recover a sequence &amp;lt;math&amp;gt;(x_1, x_2, \ldots, x_n)&amp;lt;/math&amp;gt; by defining the permutation &amp;lt;math&amp;gt;\sigma(k) := x_k&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
So in what sense is the shapley value an average? if &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the set of players, then we can define the set of all permutations &amp;lt;math&amp;gt;\mathrm{Sym}(N)&amp;lt;/math&amp;gt; on &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt;. (This is also denoted as &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt; and called the &amp;quot;symmetric group of degree n&amp;quot; since &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the &amp;quot;default&amp;quot; set of size n.)&lt;br /&gt;
&lt;br /&gt;
since the shapley value is an &#039;&#039;average&#039;&#039; and we are in particular averaging over all sequences, we want to rewrite the formula as something that looks like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f(\sigma)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
And in fact, at this point we know enough to convert our verbal understanding into a formula like the one above.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\varphi_i(v) = \frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} (v(\{k : \sigma(k) &amp;lt; \sigma(i)\} \cup \{i\}) - v(\{k : \sigma(k) &amp;lt; \sigma(i)\}))&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
a relevant fact is that the size of &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt; is &amp;lt;math&amp;gt;n!&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is another way to look at this. or rather, a way to extend our understanding. a common thing done in algebra is to [https://en.wikipedia.org/wiki/Symmetrization#n_variables &#039;&#039;symmetrize&#039;&#039;] a function by adding up all the permutations of the variables. a symmetric function is one in which you can interchange any of the variables and the function will stay the same. not in the trivial sense that &amp;lt;math&amp;gt;f(x,y)&amp;lt;/math&amp;gt; is &amp;quot;the same&amp;quot; as &amp;lt;math&amp;gt;f(y,x)&amp;lt;/math&amp;gt; since x and y are &amp;quot;just variables&amp;quot;. but rather, in the sense that &amp;lt;math&amp;gt;f(x,y)=f(y,x)&amp;lt;/math&amp;gt;. in symbols:&lt;br /&gt;
&lt;br /&gt;
* we are NOT saying &amp;lt;math&amp;gt;(x,y) \mapsto f(x,y) = (y,x) \mapsto f(y,x)&amp;lt;/math&amp;gt;. this is trivially true for all functions!&lt;br /&gt;
* but rather: &amp;lt;math&amp;gt;(x,y) \mapsto f(x,y) = (x,y) \mapsto f(y,x)&amp;lt;/math&amp;gt;. or in other words: &amp;lt;math&amp;gt;\forall x \forall y [f(x,y) = f(y,x)]&amp;lt;/math&amp;gt; (this is not trivially true! it&#039;s false for many functions including &amp;lt;math&amp;gt;f(x,y) := x-y&amp;lt;/math&amp;gt;)&lt;br /&gt;
&lt;br /&gt;
in the case of the shapley value, the &amp;quot;marginal contribution&amp;quot; function is NOT symmetric. so the naive fix that we would hope would work is to symmetrize it by adding all the possible permutations of the variables, forming a new function.&lt;br /&gt;
&lt;br /&gt;
wait, what? what even &#039;&#039;is&#039;&#039; the &amp;quot;marginal contribution function&amp;quot;?? for a player i of interest, it&#039;s the function that gives player i&#039;s marginal contribution, given an arbitrary sequence of players as input. let&#039;s say we are given a sequence &amp;lt;math&amp;gt;(x_1, x_2, \ldots, x_n)&amp;lt;/math&amp;gt;. what&#039;s player i&#039;s marginal contribution in this sequence? well, if x1 = i, then player i is the first player to join, so the marginal contribution is &amp;lt;math&amp;gt;v(\{i\}) - v(\emptyset) = v(\{i\})&amp;lt;/math&amp;gt;. if x2=i, then the marginal contribution of player i is &amp;lt;math&amp;gt;v(\{x_1, i\}) - v(\{x_1\})&amp;lt;/math&amp;gt;. and so on. in general, if &amp;lt;math&amp;gt;x_j = i&amp;lt;/math&amp;gt; then player i&#039;s marginal contribution is &amp;lt;math&amp;gt;v(\{x_1, \ldots, x_{j-1}, i\}) - v(\{x_1, \ldots, x_{j-1}\})&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
as i said, this function, which we can call &amp;lt;math&amp;gt;f_i&amp;lt;/math&amp;gt;, is not symmetric. but we can symmetrize &amp;lt;math&amp;gt;f_i&amp;lt;/math&amp;gt; by adding up all the possible orderings of the input variables:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sum_{\sigma \in \mathrm{Sym}(n)} f_i(x_{\sigma(1)}, \ldots, x_{\sigma(n)})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
given a permutation &amp;lt;math&amp;gt;\sigma \in \mathrm{Sym}(n)&amp;lt;/math&amp;gt; and a function &amp;lt;math&amp;gt;f : X^n \to \mathbf R&amp;lt;/math&amp;gt;, we can define the permutation of the function &amp;lt;math&amp;gt;\sigma^* : (X^n \to \mathbf R) \to X^n \to \mathbf R&amp;lt;/math&amp;gt; by:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sigma^*(f) := (x_1, \ldots, x_n) \mapsto f(x_{\sigma(1)}, \ldots, x_{\sigma(n)})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By an abuse of notation, we can drop the star in &amp;lt;math&amp;gt;\sigma^*&amp;lt;/math&amp;gt; and just call the resulting extension &amp;lt;math&amp;gt;\sigma&amp;lt;/math&amp;gt;. Using this new notation, we can rewrite the symmetrization as&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sum_{\sigma \in \mathrm{Sym}(n)} \sigma(f_i)&amp;lt;/math&amp;gt;&lt;br /&gt;
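&lt;br /&gt;
to make the symmetrization concrete, here is a minimal python sketch. the helper name symmetrize and the example function f(x, y) = x - 2y are made up for illustration and are not part of the exposition; positions are 0-based here rather than 1-based.&lt;br /&gt;
&lt;br /&gt;
```python
from itertools import permutations


def symmetrize(f, n):
    """Return the function sending (x_1, ..., x_n) to the sum of f over all reorderings of its inputs."""
    def symmetrized(*xs):
        # each sigma is a tuple of positions; feed f the inputs in that order
        return sum(f(*(xs[k] for k in sigma)) for sigma in permutations(range(n)))
    return symmetrized


def f(x, y):
    # hypothetical example function; it is not symmetric
    return x - 2 * y


g = symmetrize(f, 2)

print(f(3, 5), f(5, 3))  # the two values differ
print(g(3, 5), g(5, 3))  # the two values agree
```
&lt;br /&gt;
the original f gives different values on (3, 5) and (5, 3), while the symmetrized g gives the same value on both orderings.&lt;br /&gt;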
&lt;br /&gt;
There are now two problems:&lt;br /&gt;
&lt;br /&gt;
# the expression above is in terms of variables &amp;lt;math&amp;gt;x_1, \ldots, x_n&amp;lt;/math&amp;gt; or in other words it&#039;s a &#039;&#039;function&#039;&#039;, not a number; we need to evaluate it somehow&lt;br /&gt;
# the expression is not &#039;&#039;normalized&#039;&#039;. the total value of the grand coalition is &amp;lt;math&amp;gt;v(\{1, \ldots, n\})&amp;lt;/math&amp;gt;. we need to make sure that the shapley values we give to the players add up to exactly &amp;lt;math&amp;gt;v(\{1, \ldots, n\})&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
problem 1 is easy to solve. because we symmetrized the function, we can evaluate the function on &#039;&#039;any&#039;&#039; ordering of the players!&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\left(\sum_{\sigma \in \mathrm{Sym}(n)} \sigma(f_i)\right) (1, \ldots, n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
is one such way to get a number. but really, the ordering of (1, ..., n) does not matter.&lt;br /&gt;
&lt;br /&gt;
as for problem 2, we want to find a value of &amp;lt;math&amp;gt;\alpha&amp;lt;/math&amp;gt; such that the following is satisfied:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;v(\{1,\ldots,n\}) = \alpha \sum_{i=1}^n \left(\sum_{\sigma \in \mathrm{Sym}(n)} \sigma(f_i)\right) (1, \ldots, n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
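so just rearrange to solve for &amp;lt;math&amp;gt;\alpha&amp;lt;/math&amp;gt;, and we are done. at least, conceptually. it would be nice if we could prove that &amp;lt;math&amp;gt;\alpha = \frac{1}{|\mathrm{Sym}(n)|}&amp;lt;/math&amp;gt;. this is not hard to do: just interchange the two sums. for fixed &amp;lt;math&amp;gt;\sigma&amp;lt;/math&amp;gt;, the form of &amp;lt;math&amp;gt;f_i&amp;lt;/math&amp;gt; means the sum over &amp;lt;math&amp;gt;i&amp;lt;/math&amp;gt; telescopes to &amp;lt;math&amp;gt;v(N)-v(\emptyset) = v(N)&amp;lt;/math&amp;gt;. there are &amp;lt;math&amp;gt;|\mathrm{Sym}(n)| = n!&amp;lt;/math&amp;gt; such inner sums, so the double sum equals &amp;lt;math&amp;gt;n!\, v(N)&amp;lt;/math&amp;gt;, and hence &amp;lt;math&amp;gt;\alpha = \frac{1}{n!}&amp;lt;/math&amp;gt; (assuming &amp;lt;math&amp;gt;v(N) \neq 0&amp;lt;/math&amp;gt;). see [https://en.wikipedia.org/wiki/Shapley_value#Efficiency wikipedia] for details.&lt;br /&gt;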
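&lt;br /&gt;
here is a small numerical sanity check that &amp;lt;math&amp;gt;\alpha&amp;lt;/math&amp;gt; really comes out as &amp;lt;math&amp;gt;1/n!&amp;lt;/math&amp;gt;. it is only a sketch under assumptions: the game v below is a made-up example with &amp;lt;math&amp;gt;v(\emptyset) = 0&amp;lt;/math&amp;gt;, and the helper names marginal_contribution and symmetrized_at are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
from fractions import Fraction
from itertools import permutations
from math import factorial


def marginal_contribution(i, v):
    """Build f_i: given a sequence of players, return player i's marginal contribution in it."""
    def f_i(*xs):
        j = xs.index(i)              # position at which player i joins
        before = frozenset(xs[:j])   # players who joined earlier
        return v(before | {i}) - v(before)
    return f_i


def symmetrized_at(f, point):
    """Evaluate the symmetrization of f at point: sum f over every reordering of point."""
    return sum(f(*reordered) for reordered in permutations(point))


def v(coalition):
    # hypothetical game, chosen so that v of the empty coalition is 0
    return len(coalition) ** 2 + (3 if 1 in coalition else 0)


n = 4
players = tuple(range(1, n + 1))
f_1 = marginal_contribution(1, v)

# the double sum over players i and permutations sigma
double_sum = sum(symmetrized_at(marginal_contribution(i, v), players) for i in players)
grand = v(frozenset(players))
alpha = Fraction(grand, double_sum)

print(alpha, Fraction(1, factorial(n)))
```
&lt;br /&gt;
the same script can also be used to check that the ordering of the evaluation point does not matter, by comparing symmetrized_at(f_1, players) with symmetrized_at(f_1, players[::-1]).&lt;br /&gt;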
&lt;br /&gt;
So the fully honest shapley value expression is the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\frac{v(\{1,\ldots,n\})}{\sum_{i=1}^n \left(\sum_{\sigma \in \mathrm{Sym}(n)} \sigma(f_i)\right) (1, \ldots, n)} \sum_{\sigma \in \mathrm{Sym}(n)} f_i(\sigma(1), \ldots, \sigma(n))&amp;lt;/math&amp;gt;&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3579</id>
		<title>User:IssaRice/Shapley value</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3579"/>
		<updated>2023-04-08T21:03:17Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;most expositions of the Shapley value SUCK BALLS because they try to sum over the subsets excluding the player in question (usually called &amp;quot;player i&amp;quot;). so here we go, here&#039;s a TRUE REDPILLED exposition of the shapley value!&lt;br /&gt;
&lt;br /&gt;
first of all, what&#039;s the shapley value even trying to do? once we understand it in words, we can just convert our verbal understanding into symbols. and then we will be done.&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
So, the shapley value is an &#039;&#039;average&#039;&#039;. but what kind of average? an &#039;&#039;arithmetic average&#039;&#039;. well, an arithmetic average takes a specific form. it looks like this. if you&#039;re averaging the elements of some set &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;, then the arithmetic average &amp;lt;math&amp;gt;\bar{X}&amp;lt;/math&amp;gt; is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\bar X = \frac{1}{|X|} \sum_{x\in X} f(x)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We throw in the function f because the elements of X might not be numbers. or even if they &#039;&#039;are&#039;&#039; numbers, you might want to apply some weighting other than the default one (the identity function).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Now, let&#039;s take the ugly-ass formula for the shapley value that you always see:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\ (n - |S| - 1)!}{n!} (v(S \cup \{i\}) - v(S))&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
how is &#039;&#039;that&#039;&#039; supposed to be an average? well first of all, we said above that the shapley value is averaging over all the &#039;&#039;sequences&#039;&#039; in which the n players can be added. one way to formalize the concept of a &amp;quot;sequence&amp;quot; or &amp;quot;ordering&amp;quot; is to use permutations. a permutation is just a function that reorders the elements of a set. so each sequence corresponds to a permutation: a sequence &amp;lt;math&amp;gt;(x_1, x_2, \ldots, x_n)&amp;lt;/math&amp;gt; in which every player appears exactly once corresponds to the permutation &amp;lt;math&amp;gt;\sigma(k) := x_k&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
So in what sense is the shapley value an average? if &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the set of players, then we can define the set of all permutations &amp;lt;math&amp;gt;\mathrm{Sym}(N)&amp;lt;/math&amp;gt; on &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt;. (This is also denoted as &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt; and called the &amp;quot;symmetric group of degree n&amp;quot; since &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the &amp;quot;default&amp;quot; set of size n.)&lt;br /&gt;
&lt;br /&gt;
since the shapley value is an &#039;&#039;average&#039;&#039; and we are in particular averaging over all sequences, we want to rewrite the formula as something that looks like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f(\sigma)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
And in fact, at this point we know enough to convert our verbal understanding into a formula like the one above.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\varphi_i(v) = \frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} (v(\{k : \sigma(k) &amp;lt; \sigma(i)\} \cup \{i\}) - v(\{k : \sigma(k) &amp;lt; \sigma(i)\}))&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
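a relevant fact is that the size of &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt; is &amp;lt;math&amp;gt;n!&amp;lt;/math&amp;gt;. also note that in this formula &amp;lt;math&amp;gt;\sigma(k)&amp;lt;/math&amp;gt; is read as the &#039;&#039;position&#039;&#039; at which player &amp;lt;math&amp;gt;k&amp;lt;/math&amp;gt; joins, so &amp;lt;math&amp;gt;\{k : \sigma(k) &amp;lt; \sigma(i)\}&amp;lt;/math&amp;gt; is the set of players who join before player &amp;lt;math&amp;gt;i&amp;lt;/math&amp;gt;. this is the inverse of the permutation &amp;lt;math&amp;gt;k \mapsto x_k&amp;lt;/math&amp;gt; that lists the players in joining order, but since the sum ranges over all of &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt;, both conventions give the same value.&lt;br /&gt;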
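&lt;br /&gt;
as a sanity check that the subset formula and the average over orderings agree, here is a minimal python sketch that computes both on a toy game. the game v (player 1 has to team up with player 2 or player 3 to produce anything) and the function names are assumptions made up for this example.&lt;br /&gt;
&lt;br /&gt;
```python
from fractions import Fraction
from itertools import combinations, permutations
from math import factorial


def shapley_by_subsets(i, players, v):
    """The usual formula: weighted sum over subsets S that exclude player i."""
    n = len(players)
    others = [p for p in players if p != i]
    total = Fraction(0)
    for size in range(n):
        # weight |S|! (n - |S| - 1)! / n! depends only on the size of S
        weight = Fraction(factorial(size) * factorial(n - size - 1), factorial(n))
        for subset in combinations(others, size):
            s = frozenset(subset)
            total += weight * (v(s | {i}) - v(s))
    return total


def shapley_by_orderings(i, players, v):
    """Plain arithmetic average of i's marginal contribution over all n! orderings."""
    total = Fraction(0)
    for order in permutations(players):
        before = frozenset(order[:order.index(i)])  # players who join before i
        total += v(before | {i}) - v(before)
    return total / factorial(len(players))


def v(coalition):
    # hypothetical toy game: worth 1 if player 1 is joined by player 2 or 3, else 0
    return 1 if 1 in coalition and (2 in coalition or 3 in coalition) else 0


players = (1, 2, 3)
by_subsets = [shapley_by_subsets(i, players, v) for i in players]
by_orderings = [shapley_by_orderings(i, players, v) for i in players]
print(by_subsets, by_orderings)
```
&lt;br /&gt;
on this toy game both functions give 2/3 to player 1 and 1/6 to each of players 2 and 3, and the three values add up to the value of the grand coalition.&lt;br /&gt;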
&lt;br /&gt;
&lt;br /&gt;
There is another way to look at this. or rather, a way to extend our understanding. a common thing done in algebra is to [https://en.wikipedia.org/wiki/Symmetrization#n_variables &#039;&#039;symmetrize&#039;&#039;] a function by adding up all the permutations of the variables. a symmetric function is one in which you can interchange any of the variables and the function will stay the same. not in the trivial sense that &amp;lt;math&amp;gt;f(x,y)&amp;lt;/math&amp;gt; is &amp;quot;the same&amp;quot; as &amp;lt;math&amp;gt;f(y,x)&amp;lt;/math&amp;gt; since x and y are &amp;quot;just variables&amp;quot;. but rather, in the sense that &amp;lt;math&amp;gt;f(x,y)=f(y,x)&amp;lt;/math&amp;gt;. in symbols:&lt;br /&gt;
&lt;br /&gt;
* we are NOT saying &amp;lt;math&amp;gt;(x,y) \mapsto f(x,y) = (y,x) \mapsto f(y,x)&amp;lt;/math&amp;gt;. this is trivially true for all functions!&lt;br /&gt;
* but rather: &amp;lt;math&amp;gt;(x,y) \mapsto f(x,y) = (x,y) \mapsto f(y,x)&amp;lt;/math&amp;gt;. or in other words: &amp;lt;math&amp;gt;\forall x \forall y [f(x,y) = f(y,x)]&amp;lt;/math&amp;gt; (this is not trivially true! it&#039;s false for many functions including &amp;lt;math&amp;gt;f(x,y) := x-y&amp;lt;/math&amp;gt;)&lt;br /&gt;
&lt;br /&gt;
in the case of the shapley value, the &amp;quot;marginal contribution&amp;quot; function is NOT symmetric. so the naive fix that we would hope would work is to symmetrize it by adding all the possible permutations of the variables, forming a new function.&lt;br /&gt;
&lt;br /&gt;
wait, what? what even &#039;&#039;is&#039;&#039; the &amp;quot;marginal contribution function&amp;quot;?? for a player i of interest, it&#039;s the function that gives player i&#039;s marginal contribution, given an arbitrary sequence of players as input. let&#039;s say we are given a sequence &amp;lt;math&amp;gt;(x_1, x_2, \ldots, x_n)&amp;lt;/math&amp;gt;. what&#039;s player i&#039;s marginal contribution in this sequence? well, if x1 = i, then player i is the first player to join, so the marginal contribution is &amp;lt;math&amp;gt;v(\{i\}) - v(\emptyset) = v(\{i\})&amp;lt;/math&amp;gt;. if x2=i, then the marginal contribution of player i is &amp;lt;math&amp;gt;v(\{x_1, i\}) - v(\{x_1\})&amp;lt;/math&amp;gt;. and so on. in general, if &amp;lt;math&amp;gt;x_j = i&amp;lt;/math&amp;gt; then player i&#039;s marginal contribution is &amp;lt;math&amp;gt;v(\{x_1, \ldots, x_{j-1}, i\}) - v(\{x_1, \ldots, x_{j-1}\})&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
as i said, this function, which we can call &amp;lt;math&amp;gt;f_i&amp;lt;/math&amp;gt;, is not symmetric. but we can symmetrize &amp;lt;math&amp;gt;f_i&amp;lt;/math&amp;gt; by adding up all the possible orderings of the input variables:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sum_{\sigma \in \mathrm{Sym}(n)} f_i(x_{\sigma(1)}, \ldots, x_{\sigma(n)})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
given a permutation &amp;lt;math&amp;gt;\sigma \in \mathrm{Sym}(n)&amp;lt;/math&amp;gt; and a function &amp;lt;math&amp;gt;f : X^n \to \mathbf R&amp;lt;/math&amp;gt;, we can define the permutation of the function &amp;lt;math&amp;gt;\sigma^* : (X^n \to \mathbf R) \to X^n \to \mathbf R&amp;lt;/math&amp;gt; by:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sigma^*(f) := (x_1, \ldots, x_n) \mapsto f(x_{\sigma(1)}, \ldots, x_{\sigma(n)})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By an abuse of notation, we can drop the star in &amp;lt;math&amp;gt;\sigma^*&amp;lt;/math&amp;gt; and just call the resulting extension &amp;lt;math&amp;gt;\sigma&amp;lt;/math&amp;gt;. Using this new notation, we can rewrite the symmetrization as&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sum_{\sigma \in \mathrm{Sym}(n)} \sigma(f_i)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
There are now two problems:&lt;br /&gt;
&lt;br /&gt;
# the expression above is in terms of variables &amp;lt;math&amp;gt;x_1, \ldots, x_n&amp;lt;/math&amp;gt; or in other words it&#039;s a &#039;&#039;function&#039;&#039;, not a number; we need to evaluate it somehow&lt;br /&gt;
# the expression is not &#039;&#039;normalized&#039;&#039;. the total value of the grand coalition is &amp;lt;math&amp;gt;v(\{1, \ldots, n\})&amp;lt;/math&amp;gt;. we need to make sure that the shapley values we give to the players add up to exactly &amp;lt;math&amp;gt;v(\{1, \ldots, n\})&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
problem 1 is easy to solve. because we symmetrized the function, we can evaluate the function on &#039;&#039;any&#039;&#039; ordering of the players!&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\left(\sum_{\sigma \in \mathrm{Sym}(n)} \sigma(f_i)\right) (1, \ldots, n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
is one such way to get a number. but really, the ordering of (1, ..., n) does not matter.&lt;br /&gt;
&lt;br /&gt;
as for problem 2, we want to find a value of &amp;lt;math&amp;gt;\alpha&amp;lt;/math&amp;gt; such that the following is satisfied:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;v(\{1,\ldots,n\}) = \alpha \sum_{i=1}^n \left(\sum_{\sigma \in \mathrm{Sym}(n)} \sigma(f_i)\right) (1, \ldots, n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
so just rearrange to solve for &amp;lt;math&amp;gt;\alpha&amp;lt;/math&amp;gt;, and we are done. at least, conceptually. it would be nice if we could prove that &amp;lt;math&amp;gt;\alpha = \frac{1}{|\mathrm{Sym}(n)|}&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
the Shapley value is &amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f_i(\sigma(1), \ldots, \sigma(n))&amp;lt;/math&amp;gt;&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3578</id>
		<title>User:IssaRice/Shapley value</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3578"/>
		<updated>2023-04-08T21:00:54Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;most expositions of the Shapley value SUCK BALLS because they try to sum over the subsets excluding the player in question (usually called &amp;quot;player i&amp;quot;). so here we go, here&#039;s a TRUE REDPILLED exposition of the shapley value!&lt;br /&gt;
&lt;br /&gt;
first of all, what&#039;s the shapley value even trying to do? once we understand it in words, we can just convert our verbal understanding into symbols. and then we will be done.&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
So, the shapley value is an &#039;&#039;average&#039;&#039;. but what kind of average? an &#039;&#039;arithmetic average&#039;&#039;. well, an arithmetic average takes a specific form. it looks like this. if you&#039;re averaging the elements of some set &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;, then the arithmetic average &amp;lt;math&amp;gt;\bar{X}&amp;lt;/math&amp;gt; is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\bar X = \frac{1}{|X|} \sum_{x\in X} f(x)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We throw in the function f because the elements of X might not be numbers. or even if they &#039;&#039;are&#039;&#039; numbers, you might want to apply some weighting other than the default one (the identity function).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Now, let&#039;s take the ugly-ass formula for the shapley value that you always see:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\ (n - |S| - 1)!}{n!} (v(S \cup \{i\}) - v(S))&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
how is &#039;&#039;that&#039;&#039; supposed to be an average? well first of all, we said above that the shapley value is averaging over all the &#039;&#039;sequences&#039;&#039; in which the n players can be added. one way to formalize the concept of a &amp;quot;sequence&amp;quot; or &amp;quot;ordering&amp;quot; is to use permutations. a permutation is just a function that reorders the elements of a set. so each sequence corresponds to a permutation: a sequence &amp;lt;math&amp;gt;(x_1, x_2, \ldots, x_n)&amp;lt;/math&amp;gt; in which every player appears exactly once corresponds to the permutation &amp;lt;math&amp;gt;\sigma(k) := x_k&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
So in what sense is the shapley value an average? if &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the set of players, then we can define the set of all permutations &amp;lt;math&amp;gt;\mathrm{Sym}(N)&amp;lt;/math&amp;gt; on &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt;. (This is also denoted as &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt; and called the &amp;quot;symmetric group of degree n&amp;quot; since &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the &amp;quot;default&amp;quot; set of size n.)&lt;br /&gt;
&lt;br /&gt;
since the shapley value is an &#039;&#039;average&#039;&#039; and we are in particular averaging over all sequences, we want to rewrite the formula as something that looks like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f(\sigma)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
And in fact, at this point we know enough to convert our verbal understanding into a formula like the one above.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\varphi_i(v) = \frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} (v(\{k : \sigma(k) &amp;lt; \sigma(i)\} \cup \{i\}) - v(\{k : \sigma(k) &amp;lt; \sigma(i)\}))&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
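a relevant fact is that the size of &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt; is &amp;lt;math&amp;gt;n!&amp;lt;/math&amp;gt;. also note that in this formula &amp;lt;math&amp;gt;\sigma(k)&amp;lt;/math&amp;gt; is read as the &#039;&#039;position&#039;&#039; at which player &amp;lt;math&amp;gt;k&amp;lt;/math&amp;gt; joins, so &amp;lt;math&amp;gt;\{k : \sigma(k) &amp;lt; \sigma(i)\}&amp;lt;/math&amp;gt; is the set of players who join before player &amp;lt;math&amp;gt;i&amp;lt;/math&amp;gt;. this is the inverse of the permutation &amp;lt;math&amp;gt;k \mapsto x_k&amp;lt;/math&amp;gt; that lists the players in joining order, but since the sum ranges over all of &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt;, both conventions give the same value.&lt;br /&gt;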
&lt;br /&gt;
&lt;br /&gt;
There is another way to look at this. or rather, a way to extend our understanding. a common thing done in algebra is to [https://en.wikipedia.org/wiki/Symmetrization#n_variables &#039;&#039;symmetrize&#039;&#039;] a function by adding up all the permutations of the variables. a symmetric function is one in which you can interchange any of the variables and the function will stay the same. not in the trivial sense that &amp;lt;math&amp;gt;f(x,y)&amp;lt;/math&amp;gt; is &amp;quot;the same&amp;quot; as &amp;lt;math&amp;gt;f(y,x)&amp;lt;/math&amp;gt; since x and y are &amp;quot;just variables&amp;quot;. but rather, in the sense that &amp;lt;math&amp;gt;f(x,y)=f(y,x)&amp;lt;/math&amp;gt;. in symbols:&lt;br /&gt;
&lt;br /&gt;
* we are NOT saying &amp;lt;math&amp;gt;(x,y) \mapsto f(x,y) = (y,x) \mapsto f(y,x)&amp;lt;/math&amp;gt;. this is trivially true for all functions!&lt;br /&gt;
* but rather: &amp;lt;math&amp;gt;(x,y) \mapsto f(x,y) = (x,y) \mapsto f(y,x)&amp;lt;/math&amp;gt;. or in other words: &amp;lt;math&amp;gt;\forall x \forall y [f(x,y) = f(y,x)]&amp;lt;/math&amp;gt; (this is not trivially true! it&#039;s false for many functions including &amp;lt;math&amp;gt;f(x,y) := x-y&amp;lt;/math&amp;gt;)&lt;br /&gt;
&lt;br /&gt;
in the case of the shapley value, the &amp;quot;marginal contribution&amp;quot; function is NOT symmetric. so the naive fix that we would hope would work is to symmetrize it by adding all the possible permutations of the variables, forming a new function.&lt;br /&gt;
&lt;br /&gt;
wait, what? what even &#039;&#039;is&#039;&#039; the &amp;quot;marginal contribution function&amp;quot;?? for a player i of interest, it&#039;s the function that gives player i&#039;s marginal contribution, given an arbitrary sequence of players as input. let&#039;s say we are given a sequence &amp;lt;math&amp;gt;(x_1, x_2, \ldots, x_n)&amp;lt;/math&amp;gt;. what&#039;s player i&#039;s marginal contribution in this sequence? well, if x1 = i, then player i is the first player to join, so the marginal contribution is &amp;lt;math&amp;gt;v(\{i\}) - v(\emptyset) = v(\{i\})&amp;lt;/math&amp;gt;. if x2=i, then the marginal contribution of player i is &amp;lt;math&amp;gt;v(\{x_1, i\}) - v(\{x_1\})&amp;lt;/math&amp;gt;. and so on. in general, if &amp;lt;math&amp;gt;x_j = i&amp;lt;/math&amp;gt; then player i&#039;s marginal contribution is &amp;lt;math&amp;gt;v(\{x_1, \ldots, x_{j-1}, i\}) - v(\{x_1, \ldots, x_{j-1}\})&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
as i said, this function, which we can call &amp;lt;math&amp;gt;f_i&amp;lt;/math&amp;gt;, is not symmetric. but we can symmetrize &amp;lt;math&amp;gt;f_i&amp;lt;/math&amp;gt; by adding up all the possible orderings of the input variables:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sum_{\sigma \in \mathrm{Sym}(n)} f_i(x_{\sigma(1)}, \ldots, x_{\sigma(n)})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
given a permutation &amp;lt;math&amp;gt;\sigma \in \mathrm{Sym}(n)&amp;lt;/math&amp;gt; and a function &amp;lt;math&amp;gt;f : X^n \to \mathbf R&amp;lt;/math&amp;gt;, we can define the permutation of the function &amp;lt;math&amp;gt;\sigma^* : (X^n \to \mathbf R) \to X^n \to \mathbf R&amp;lt;/math&amp;gt; by:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sigma^*(f) := (x_1, \ldots, x_n) \mapsto f(x_{\sigma(1)}, \ldots, x_{\sigma(n)})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By an abuse of notation, we can drop the star in &amp;lt;math&amp;gt;\sigma^*&amp;lt;/math&amp;gt; and just call the resulting extension &amp;lt;math&amp;gt;\sigma&amp;lt;/math&amp;gt;. Using this new notation, we can rewrite the symmetrization as&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sum_{\sigma \in \mathrm{Sym}(n)} \sigma(f_i)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
There are now two problems:&lt;br /&gt;
&lt;br /&gt;
# the expression above is in terms of variables &amp;lt;math&amp;gt;x_1, \ldots, x_n&amp;lt;/math&amp;gt; or in other words it&#039;s a &#039;&#039;function&#039;&#039;, not a number; we need to evaluate it somehow&lt;br /&gt;
# the expression is not &#039;&#039;normalized&#039;&#039;. the total value of the grand coalition is &amp;lt;math&amp;gt;v(\{1, \ldots, n\})&amp;lt;/math&amp;gt;. we need to make sure that the shapley values we give to the players add up to exactly &amp;lt;math&amp;gt;v(\{1, \ldots, n\})&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
problem 1 is easy to solve. because we symmetrized the function, we can evaluate the function on &#039;&#039;any&#039;&#039; ordering of the players!&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\left(\sum_{\sigma \in \mathrm{Sym}(n)} \sigma(f_i)\right) (1, \ldots, n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
is one such way to get a number. but really, the ordering of (1, ..., n) does not matter.&lt;br /&gt;
&lt;br /&gt;
as for problem 2, we want to find a value of &amp;lt;math&amp;gt;\alpha&amp;lt;/math&amp;gt; such that the following is satisfied:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;v(\{1,\ldots,n\}) = \alpha \sum_{i=1}^n \left(\sum_{\sigma \in \mathrm{Sym}(n)} \sigma(f_i)\right) (1, \ldots, n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
the Shapley value is &amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f_i(\sigma(1), \ldots, \sigma(n))&amp;lt;/math&amp;gt;&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3577</id>
		<title>User:IssaRice/Shapley value</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3577"/>
		<updated>2023-04-08T20:58:39Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;most expositions of the Shapley value SUCK BALLS because they try to sum over the subsets excluding the player in question (usually called &amp;quot;player i&amp;quot;). so here we go, here&#039;s a TRUE REDPILLED exposition of the shapley value!&lt;br /&gt;
&lt;br /&gt;
first of all, what&#039;s the shapley value even trying to do? once we understand it in words, we can just convert our verbal understanding into symbols. and then we will be done.&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
So, the shapley value is an &#039;&#039;average&#039;&#039;. but what kind of average? an &#039;&#039;arithmetic average&#039;&#039;. well, an arithmetic average takes a specific form. it looks like this. if you&#039;re averaging the elements of some set &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;, then the arithmetic average &amp;lt;math&amp;gt;\bar{X}&amp;lt;/math&amp;gt; is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\bar X = \frac{1}{|X|} \sum_{x\in X} f(x)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We throw in the function f because the elements of X might not be numbers. or even if they &#039;&#039;are&#039;&#039; numbers, you might want to apply some weighting other than the default one (the identity function).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Now, let&#039;s take the ugly-ass formula for the shapley value that you always see:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\ (n - |S| - 1)!}{n!} (v(S \cup \{i\}) - v(S))&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
how is &#039;&#039;that&#039;&#039; supposed to be an average? well first of all, we said above that the shapley value is averaging over all the &#039;&#039;sequences&#039;&#039; in which the n players can be added. one way to formalize the concept of a &amp;quot;sequence&amp;quot; or &amp;quot;ordering&amp;quot; is to use permutations. a permutation is just a function that reorders the elements of a set. so each sequence corresponds to a permutation: a sequence &amp;lt;math&amp;gt;(x_1, x_2, \ldots, x_n)&amp;lt;/math&amp;gt; in which every player appears exactly once corresponds to the permutation &amp;lt;math&amp;gt;\sigma(k) := x_k&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
So in what sense is the shapley value an average? if &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the set of players, then we can define the set of all permutations &amp;lt;math&amp;gt;\mathrm{Sym}(N)&amp;lt;/math&amp;gt; on &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt;. (This is also denoted as &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt; and called the &amp;quot;symmetric group of degree n&amp;quot; since &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the &amp;quot;default&amp;quot; set of size n.)&lt;br /&gt;
&lt;br /&gt;
since the shapley value is an &#039;&#039;average&#039;&#039; and we are in particular averaging over all sequences, we want to rewrite the formula as something that looks like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f(\sigma)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
And in fact, at this point we know enough to convert our verbal understanding into a formula like the one above.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\varphi_i(v) = \frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} (v(\{k : \sigma(k) &amp;lt; \sigma(i)\} \cup \{i\}) - v(\{k : \sigma(k) &amp;lt; \sigma(i)\}))&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
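a relevant fact is that the size of &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt; is &amp;lt;math&amp;gt;n!&amp;lt;/math&amp;gt;. also note that in this formula &amp;lt;math&amp;gt;\sigma(k)&amp;lt;/math&amp;gt; is read as the &#039;&#039;position&#039;&#039; at which player &amp;lt;math&amp;gt;k&amp;lt;/math&amp;gt; joins, so &amp;lt;math&amp;gt;\{k : \sigma(k) &amp;lt; \sigma(i)\}&amp;lt;/math&amp;gt; is the set of players who join before player &amp;lt;math&amp;gt;i&amp;lt;/math&amp;gt;. this is the inverse of the permutation &amp;lt;math&amp;gt;k \mapsto x_k&amp;lt;/math&amp;gt; that lists the players in joining order, but since the sum ranges over all of &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt;, both conventions give the same value.&lt;br /&gt;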
&lt;br /&gt;
&lt;br /&gt;
There is another way to look at this. or rather, a way to extend our understanding. a common thing done in algebra is to [https://en.wikipedia.org/wiki/Symmetrization#n_variables &#039;&#039;symmetrize&#039;&#039;] a function by adding up all the permutations of the variables. a symmetric function is one in which you can interchange any of the variables and the function will stay the same. not in the trivial sense that &amp;lt;math&amp;gt;f(x,y)&amp;lt;/math&amp;gt; is &amp;quot;the same&amp;quot; as &amp;lt;math&amp;gt;f(y,x)&amp;lt;/math&amp;gt; since x and y are &amp;quot;just variables&amp;quot;. but rather, in the sense that &amp;lt;math&amp;gt;f(x,y)=f(y,x)&amp;lt;/math&amp;gt;. in symbols:&lt;br /&gt;
&lt;br /&gt;
* we are NOT saying &amp;lt;math&amp;gt;(x,y) \mapsto f(x,y) = (y,x) \mapsto f(y,x)&amp;lt;/math&amp;gt;. this is trivially true for all functions!&lt;br /&gt;
* but rather: &amp;lt;math&amp;gt;(x,y) \mapsto f(x,y) = (x,y) \mapsto f(y,x)&amp;lt;/math&amp;gt;. or in other words: &amp;lt;math&amp;gt;\forall x \forall y [f(x,y) = f(y,x)]&amp;lt;/math&amp;gt; (this is not trivially true! it&#039;s false for many functions including &amp;lt;math&amp;gt;f(x,y) := x-y&amp;lt;/math&amp;gt;)&lt;br /&gt;
&lt;br /&gt;
in the case of the shapley value, the &amp;quot;marginal contribution&amp;quot; function is NOT symmetric. so the naive fix that we would hope would work is to symmetrize it by adding all the possible permutations of the variables, forming a new function.&lt;br /&gt;
&lt;br /&gt;
wait, what? what even &#039;&#039;is&#039;&#039; the &amp;quot;marginal contribution function&amp;quot;?? for a player i of interest, it&#039;s the function that gives player i&#039;s marginal contribution, given an arbitrary sequence of players as input. let&#039;s say we are given a sequence &amp;lt;math&amp;gt;(x_1, x_2, \ldots, x_n)&amp;lt;/math&amp;gt;. what&#039;s player i&#039;s marginal contribution in this sequence? well, if x1 = i, then player i is the first player to join, so the marginal contribution is &amp;lt;math&amp;gt;v(\{i\}) - v(\emptyset) = v(\{i\})&amp;lt;/math&amp;gt;. if x2=i, then the marginal contribution of player i is &amp;lt;math&amp;gt;v(\{x_1, i\}) - v(\{x_1\})&amp;lt;/math&amp;gt;. and so on. in general, if &amp;lt;math&amp;gt;x_j = i&amp;lt;/math&amp;gt; then player i&#039;s marginal contribution is &amp;lt;math&amp;gt;v(\{x_1, \ldots, x_{j-1}, i\}) - v(\{x_1, \ldots, x_{j-1}\})&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
as i said, this function, which we can call &amp;lt;math&amp;gt;f_i&amp;lt;/math&amp;gt;, is not symmetric. but we can symmetrize &amp;lt;math&amp;gt;f_i&amp;lt;/math&amp;gt; by adding up all the possible orderings of the input variables:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sum_{\sigma \in \mathrm{Sym}(n)} f_i(x_{\sigma(1)}, \ldots, x_{\sigma(n)})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
given a permutation &amp;lt;math&amp;gt;\sigma \in \mathrm{Sym}(n)&amp;lt;/math&amp;gt; and a function &amp;lt;math&amp;gt;f : X^n \to \mathbf R&amp;lt;/math&amp;gt;, we can define the permutation of the function &amp;lt;math&amp;gt;\sigma^* : (X^n \to \mathbf R) \to X^n \to \mathbf R&amp;lt;/math&amp;gt; by:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sigma^*(f) := (x_1, \ldots, x_n) \mapsto f(x_{\sigma(1)}, \ldots, x_{\sigma(n)})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By an abuse of notation, we can drop the star in &amp;lt;math&amp;gt;\sigma^*&amp;lt;/math&amp;gt; and just call the resulting extension &amp;lt;math&amp;gt;\sigma&amp;lt;/math&amp;gt;. Using this new notation, we can rewrite the symmetrization as&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sum_{\sigma \in \mathrm{Sym}(n)} \sigma(f_i)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
There are now two problems:&lt;br /&gt;
&lt;br /&gt;
# the expression above is in terms of variables &amp;lt;math&amp;gt;x_1, \ldots, x_n&amp;lt;/math&amp;gt; or in other words it&#039;s a &#039;&#039;function&#039;&#039;, not a number; we need to evaluate it somehow&lt;br /&gt;
# the expression is not &#039;&#039;normalized&#039;&#039;. the total value of the grand coalition is &amp;lt;math&amp;gt;v(\{1, \ldots, n\})&amp;lt;/math&amp;gt;. we need to make sure that the shapley values we give to the players add up to exactly &amp;lt;math&amp;gt;v(\{1, \ldots, n\})&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
problem 1 is easy to solve. because we symmetrized the function, we can evaluate the function on &#039;&#039;any&#039;&#039; ordering of the players!&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\left(\sum_{\sigma \in \mathrm{Sym}(n)} \sigma(f_i)\right) (1, \ldots, n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
the Shapley value is &amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f_i(\sigma(1), \ldots, \sigma(n))&amp;lt;/math&amp;gt;&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3576</id>
		<title>User:IssaRice/Shapley value</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3576"/>
		<updated>2023-04-08T20:57:27Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;most expositions of the Shapley value SUCK BALLS because they try to sum over the subsets excluding the player in question (usually called &amp;quot;player i&amp;quot;). so here we go, here&#039;s a TRUE REDPILLED exposition of the shapley value!&lt;br /&gt;
&lt;br /&gt;
first of all, what&#039;s the shapley value even trying to do? once we understand it in words, we can just convert our verbal understanding into symbols. and then we will be done.&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
So, the shapley value is an &#039;&#039;average&#039;&#039;. but what kind of average? an &#039;&#039;arithmetic average&#039;&#039;. well, an arithmetic average takes a specific form. it looks like this. if you&#039;re averaging the elements of some set &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;, then the arithmetic average &amp;lt;math&amp;gt;\bar{X}&amp;lt;/math&amp;gt; is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\bar X = \frac{1}{|X|} \sum_{x\in X} f(x)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We throw in the function f because the elements of X might not be numbers. or even if they &#039;&#039;are&#039;&#039; numbers, you might want to apply some weighting other than the default one (the identity function).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Now, let&#039;s take the ugly-ass formula for the shapley value that you always see:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\ (n - |S| - 1)!}{n!} (v(S \cup \{i\}) - v(S))&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
how is &#039;&#039;that&#039;&#039; supposed to be an average? well first of all, we said above that the shapley value is averaging over all &#039;&#039;sequences&#039;&#039; of ways to add the n players. one way to formalize the concept of a &amp;quot;sequence&amp;quot; or &amp;quot;ordering&amp;quot; is to use permutations. a permutation is just a function that reorders the elements of a set. so each sequence corresponds to a permutation. we can recover a sequence &amp;lt;math&amp;gt;(x_1, x_2, \ldots, x_n)&amp;lt;/math&amp;gt; by defining the permutation &amp;lt;math&amp;gt;\sigma(k) := x_k&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
So in what sense is the shapley value an average? if &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the set of players, then we can define the set of all permutations &amp;lt;math&amp;gt;\mathrm{Sym}(N)&amp;lt;/math&amp;gt; on &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt;. (This is also denoted as &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt; and called the &amp;quot;symmetric group of degree n&amp;quot; since &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the &amp;quot;default&amp;quot; set of size n.)&lt;br /&gt;
&lt;br /&gt;
since the shapley value is an &#039;&#039;average&#039;&#039; and we are in particular averaging over all sequences, we want to rewrite the formula as something that looks like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f(\sigma)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
And in fact, at this point we know enough to convert our verbal understanding into a formula like the one above.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\varphi_i(v) = \frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} (v(\{k : \sigma(k) &amp;lt; \sigma(i)\} \cup \{i\}) - v(\{k : \sigma(k) &amp;lt; \sigma(i)\}))&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
a relevant fact is that the size of &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt; is &amp;lt;math&amp;gt;n!&amp;lt;/math&amp;gt;.&lt;br /&gt;
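&lt;br /&gt;
as a quick sanity check (a worked example added for illustration; the two-player setup is an assumption, not something from the derivation above), take &amp;lt;math&amp;gt;n = 2&amp;lt;/math&amp;gt;. then &amp;lt;math&amp;gt;\mathrm{Sym}(2)&amp;lt;/math&amp;gt; has just two elements, the identity and the swap, so the average for player 1 has two terms:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\varphi_1(v) = \frac{1}{2} \left( (v(\{1\}) - v(\emptyset)) + (v(\{1, 2\}) - v(\{2\})) \right)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
the first term is the ordering where player 1 joins first, and the second is the ordering where player 1 joins after player 2. the subset formula gives the same thing, since the weights &amp;lt;math&amp;gt;\frac{0!\ 1!}{2!}&amp;lt;/math&amp;gt; (for &amp;lt;math&amp;gt;S = \emptyset&amp;lt;/math&amp;gt;) and &amp;lt;math&amp;gt;\frac{1!\ 0!}{2!}&amp;lt;/math&amp;gt; (for &amp;lt;math&amp;gt;S = \{2\}&amp;lt;/math&amp;gt;) both equal &amp;lt;math&amp;gt;\frac{1}{2}&amp;lt;/math&amp;gt;.&lt;br /&gt;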
&lt;br /&gt;
&lt;br /&gt;
There is another way to look at this. or rather, a way to extend our understanding. a common thing done in algebra is to [https://en.wikipedia.org/wiki/Symmetrization#n_variables &#039;&#039;symmetrize&#039;&#039;] a function by adding up all the permutations of the variables. a symmetric function is one in which you can interchange any of the variables and the function will stay the same. not in the trivial sense that &amp;lt;math&amp;gt;f(x,y)&amp;lt;/math&amp;gt; is &amp;quot;the same&amp;quot; as &amp;lt;math&amp;gt;f(y,x)&amp;lt;/math&amp;gt; since x and y are &amp;quot;just variables&amp;quot;. but rather, in the sense that &amp;lt;math&amp;gt;f(x,y)=f(y,x)&amp;lt;/math&amp;gt;. in symbols:&lt;br /&gt;
&lt;br /&gt;
* we are NOT saying &amp;lt;math&amp;gt;(x,y) \mapsto f(x,y) = (y,x) \mapsto f(y,x)&amp;lt;/math&amp;gt;. this is trivially true for all functions!&lt;br /&gt;
* but rather: &amp;lt;math&amp;gt;(x,y) \mapsto f(x,y) = (x,y) \mapsto f(y,x)&amp;lt;/math&amp;gt;. or in other words: &amp;lt;math&amp;gt;\forall x \forall y [f(x,y) = f(y,x)]&amp;lt;/math&amp;gt; (this is not trivially true! it&#039;s false for many functions including &amp;lt;math&amp;gt;f(x,y) := x-y&amp;lt;/math&amp;gt;)&lt;br /&gt;
&lt;br /&gt;
in the case of the shapley value, the &amp;quot;marginal contribution&amp;quot; function is NOT symmetric. so the naive fix that we would hope would work is to symmetrize it by adding all the possible permutations of the variables, forming a new function.&lt;br /&gt;
&lt;br /&gt;
wait, what? what even &#039;&#039;is&#039;&#039; the &amp;quot;marginal contribution function&amp;quot;?? for a player i of interest, it&#039;s the function that gives player i&#039;s marginal contribution, given an arbitrary sequence of players as input. let&#039;s say we are given a sequence &amp;lt;math&amp;gt;(x_1, x_2, \ldots, x_n)&amp;lt;/math&amp;gt;. what&#039;s player i&#039;s marginal contribution in this sequence? well, if &amp;lt;math&amp;gt;x_1 = i&amp;lt;/math&amp;gt;, then player i is the first player to join, so the marginal contribution is &amp;lt;math&amp;gt;v(\{i\}) - v(\emptyset) = v(\{i\})&amp;lt;/math&amp;gt;. if &amp;lt;math&amp;gt;x_2 = i&amp;lt;/math&amp;gt;, then the marginal contribution of player i is &amp;lt;math&amp;gt;v(\{x_1, i\}) - v(\{x_1\})&amp;lt;/math&amp;gt;. and so on. in general, if &amp;lt;math&amp;gt;x_j = i&amp;lt;/math&amp;gt; then player i&#039;s marginal contribution is &amp;lt;math&amp;gt;v(\{x_1, \ldots, x_{j-1}, i\}) - v(\{x_1, \ldots, x_{j-1}\})&amp;lt;/math&amp;gt;.&lt;br /&gt;
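&lt;br /&gt;
for example (an illustration with three players, assumed here just to make the definition concrete), take &amp;lt;math&amp;gt;i = 2&amp;lt;/math&amp;gt; and feed in three different orderings:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;f_2(1, 2, 3) = v(\{1, 2\}) - v(\{1\}), \qquad f_2(3, 2, 1) = v(\{3, 2\}) - v(\{3\}), \qquad f_2(2, 1, 3) = v(\{2\}) - v(\emptyset)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
these three values need not be equal, so reordering the inputs can change the output.&lt;br /&gt;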
&lt;br /&gt;
as i said, this function, which we can call &amp;lt;math&amp;gt;f_i&amp;lt;/math&amp;gt;, is not symmetric. but we can symmetrize &amp;lt;math&amp;gt;f_i&amp;lt;/math&amp;gt; by adding up all the possible orderings of the input variables:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sum_{\sigma \in \mathrm{Sym}(n)} f_i(x_{\sigma(1)}, \ldots, x_{\sigma(n)})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
given a permutation &amp;lt;math&amp;gt;\sigma \in \mathrm{Sym}(n)&amp;lt;/math&amp;gt; and a function &amp;lt;math&amp;gt;f : X^n \to \mathbf R&amp;lt;/math&amp;gt;, we can define the permutation of the function &amp;lt;math&amp;gt;\sigma^* : (X^n \to \mathbf R) \to X^n \to \mathbf R&amp;lt;/math&amp;gt; by:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sigma^*(f) := (x_1, \ldots, x_n) \mapsto f(x_{\sigma(1)}, \ldots, x_{\sigma(n)})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By an abuse of notation, we can drop the star in &amp;lt;math&amp;gt;\sigma^*&amp;lt;/math&amp;gt; and just call the resulting extension &amp;lt;math&amp;gt;\sigma&amp;lt;/math&amp;gt;. Using this new notation, we can rewrite the symmetrization as&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sum_{\sigma \in \mathrm{Sym}(n)} \sigma(f_i)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
There are now two problems:&lt;br /&gt;
&lt;br /&gt;
# the expression above is in terms of variables &amp;lt;math&amp;gt;x_1, \ldots, x_n&amp;lt;/math&amp;gt; or in other words it&#039;s a &#039;&#039;function&#039;&#039;, not a number; we need to evaluate it somehow&lt;br /&gt;
# the expression is not &#039;&#039;normalized&#039;&#039;. the total value of the grand coalition is &amp;lt;math&amp;gt;v(\{1, \ldots, n\})&amp;lt;/math&amp;gt;. we need to make sure that the shapley values we give to the players add up to exactly &amp;lt;math&amp;gt;v(\{1, \ldots, n\})&amp;lt;/math&amp;gt;.&lt;br /&gt;
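&lt;br /&gt;
here is one way to see what the normalizing constant has to be (a sketch, assuming &amp;lt;math&amp;gt;v(\emptyset) = 0&amp;lt;/math&amp;gt; as above). for any fixed ordering &amp;lt;math&amp;gt;(x_1, \ldots, x_n)&amp;lt;/math&amp;gt; of the players, the marginal contributions of all the players telescope:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sum_{i=1}^n f_i(x_1, \ldots, x_n) = \sum_{j=1}^n \left( v(\{x_1, \ldots, x_j\}) - v(\{x_1, \ldots, x_{j-1}\}) \right) = v(\{1, \ldots, n\}) - v(\emptyset) = v(\{1, \ldots, n\})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
so if we evaluate the symmetrized expression on an ordering of the players and add it up over all players, we get &amp;lt;math&amp;gt;|\mathrm{Sym}(n)| \cdot v(\{1, \ldots, n\})&amp;lt;/math&amp;gt;, one copy of the grand coalition&#039;s value for each ordering. dividing by &amp;lt;math&amp;gt;|\mathrm{Sym}(n)| = n!&amp;lt;/math&amp;gt; fixes the normalization.&lt;br /&gt;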
&lt;br /&gt;
&lt;br /&gt;
the Shapley value is &amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f_i(\sigma(1), \ldots, \sigma(n))&amp;lt;/math&amp;gt;&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3575</id>
		<title>User:IssaRice/Shapley value</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3575"/>
		<updated>2023-04-08T20:52:51Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;most expositions of the Shapley value SUCK BALLS because they try to sum over the subsets excluding the player in question (usually called &amp;quot;player i&amp;quot;). so here we go, here&#039;s a TRUE REDPILLED exposition of the shapley value!&lt;br /&gt;
&lt;br /&gt;
first of all, what&#039;s the shapley value even trying to do? once we understand it in words, we can just convert our verbal understanding into symbols. and then we will be done.&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
So, the shapley value is an &#039;&#039;average&#039;&#039;. but what kind of average? an &#039;&#039;arithmetic average&#039;&#039;. well, an arithmetic average takes a specific form. it looks like this. if you&#039;re averaging the elements of some set &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;, then the arithmetic average &amp;lt;math&amp;gt;\bar{X}&amp;lt;/math&amp;gt; is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\bar X = \frac{1}{|X|} \sum_{x\in X} f(x)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We throw in the function f because the elements of X might not be numbers. or even if they &#039;&#039;are&#039;&#039; numbers, you might want to apply some weighting other than the default one (the identity function).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Now, let&#039;s take the ugly-ass formula for the shapley value that you always see:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\ (n - |S| - 1)!}{n!} (v(S \cup \{i\}) - v(S))&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
how is &#039;&#039;that&#039;&#039; supposed to be an average? well first of all, we said above that the shapley value is averaging over all &#039;&#039;sequences&#039;&#039; of ways to add the n players. one way to formalize the concept of a &amp;quot;sequence&amp;quot; or &amp;quot;ordering&amp;quot; is to use permutations. a permutation is just a function that reorders the elements of a set. so each sequence corresponds to a permutation. we can recover a sequence &amp;lt;math&amp;gt;(x_1, x_2, \ldots, x_n)&amp;lt;/math&amp;gt; by defining the permutation &amp;lt;math&amp;gt;\sigma(k) := x_k&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
So in what sense is the shapley value an average? if &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the set of players, then we can define the set of all permutations &amp;lt;math&amp;gt;\mathrm{Sym}(N)&amp;lt;/math&amp;gt; on &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt;. (This is also denoted as &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt; and called the &amp;quot;symmetric group of degree n&amp;quot; since &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the &amp;quot;default&amp;quot; set of size n.)&lt;br /&gt;
&lt;br /&gt;
since the shapley value is an &#039;&#039;average&#039;&#039; and we are in particular averaging over all sequences, we want to rewrite the formula as something that looks like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f(\sigma)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
And in fact, at this point we know enough to convert our verbal understanding into a formula like the one above.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\varphi_i(v) = \frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} (v(\{k : \sigma(k) &amp;lt; \sigma(i)\} \cup \{i\}) - v(\{k : \sigma(k) &amp;lt; \sigma(i)\}))&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
a relevant fact is that the size of &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt; is &amp;lt;math&amp;gt;n!&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is another way to look at this. or rather, a way to extend our understanding. a common thing done in algebra is to [https://en.wikipedia.org/wiki/Symmetrization#n_variables &#039;&#039;symmetrize&#039;&#039;] a function by adding up all the permutations of the variables. a symmetric function is one in which you can interchange any of the variables and the function will stay the same. not in the trivial sense that &amp;lt;math&amp;gt;f(x,y)&amp;lt;/math&amp;gt; is &amp;quot;the same&amp;quot; as &amp;lt;math&amp;gt;f(y,x)&amp;lt;/math&amp;gt; since x and y are &amp;quot;just variables&amp;quot;. but rather, in the sense that &amp;lt;math&amp;gt;f(x,y)=f(y,x)&amp;lt;/math&amp;gt;. in symbols:&lt;br /&gt;
&lt;br /&gt;
* we are NOT saying &amp;lt;math&amp;gt;(x,y) \mapsto f(x,y) = (y,x) \mapsto f(y,x)&amp;lt;/math&amp;gt;. this is trivially true for all functions!&lt;br /&gt;
* but rather: &amp;lt;math&amp;gt;(x,y) \mapsto f(x,y) = (x,y) \mapsto f(y,x)&amp;lt;/math&amp;gt;. or in other words: &amp;lt;math&amp;gt;\forall x \forall y [f(x,y) = f(y,x)]&amp;lt;/math&amp;gt; (this is not trivially true! it&#039;s false for many functions including &amp;lt;math&amp;gt;f(x,y) := x-y&amp;lt;/math&amp;gt;)&lt;br /&gt;
&lt;br /&gt;
in the case of the shapley value, the &amp;quot;marginal contribution&amp;quot; function is NOT symmetric. so the naive fix that we would hope would work is to symmetrize it by adding all the possible permutations of the variables, forming a new function.&lt;br /&gt;
&lt;br /&gt;
wait, what? what even &#039;&#039;is&#039;&#039; the &amp;quot;marginal contribution function&amp;quot;?? for a player i of interest, it&#039;s the function that gives player i&#039;s marginal contribution, given an arbitrary sequence of players as input. let&#039;s say we are given a sequence &amp;lt;math&amp;gt;(x_1, x_2, \ldots, x_n)&amp;lt;/math&amp;gt;. what&#039;s player i&#039;s marginal contribution in this sequence? well, if &amp;lt;math&amp;gt;x_1 = i&amp;lt;/math&amp;gt;, then player i is the first player to join, so the marginal contribution is &amp;lt;math&amp;gt;v(\{i\}) - v(\emptyset) = v(\{i\})&amp;lt;/math&amp;gt;. if &amp;lt;math&amp;gt;x_2 = i&amp;lt;/math&amp;gt;, then the marginal contribution of player i is &amp;lt;math&amp;gt;v(\{x_1, i\}) - v(\{x_1\})&amp;lt;/math&amp;gt;. and so on. in general, if &amp;lt;math&amp;gt;x_j = i&amp;lt;/math&amp;gt; then player i&#039;s marginal contribution is &amp;lt;math&amp;gt;v(\{x_1, \ldots, x_{j-1}, i\}) - v(\{x_1, \ldots, x_{j-1}\})&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
as i said, this function, which we can call &amp;lt;math&amp;gt;f_i&amp;lt;/math&amp;gt;, is not symmetric. but we can symmetrize &amp;lt;math&amp;gt;f_i&amp;lt;/math&amp;gt; by adding up all the possible orderings of the input variables:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sum_{\sigma \in \mathrm{Sym}(n)} f_i(x_{\sigma(1)}, \ldots, x_{\sigma(n)})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
given a permutation &amp;lt;math&amp;gt;\sigma \in \mathrm{Sym}(n)&amp;lt;/math&amp;gt; and a function &amp;lt;math&amp;gt;f : X^n \to \mathbf R&amp;lt;/math&amp;gt;, we can define the permutation of the function &amp;lt;math&amp;gt;\sigma^* : (X^n \to \mathbf R) \to X^n \to \mathbf R&amp;lt;/math&amp;gt; by:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sigma^*(f) := (x_1, \ldots, x_n) \mapsto f(x_{\sigma(1)}, \ldots, x_{\sigma(n)})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By an abuse of notation, we can drop the star in &amp;lt;math&amp;gt;\sigma^*&amp;lt;/math&amp;gt; and just call the resulting extension &amp;lt;math&amp;gt;\sigma&amp;lt;/math&amp;gt;. Using this new notation, we can rewrite the symmetrization as&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sum_{\sigma \in \mathrm{Sym}(n)} \sigma(f_i)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
the Shapley value is &amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f_i(\sigma(1), \ldots, \sigma(n))&amp;lt;/math&amp;gt;&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3574</id>
		<title>User:IssaRice/Shapley value</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3574"/>
		<updated>2023-04-08T20:51:51Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;most expositions of the Shapley value SUCK BALLS because they try to sum over the subsets excluding the player in question (usually called &amp;quot;player i&amp;quot;). so here we go, here&#039;s a TRUE REDPILLED exposition of the shapley value!&lt;br /&gt;
&lt;br /&gt;
first of all, what&#039;s the shapley value even trying to do? once we understand it in words, we can just convert our verbal understanding into symbols. and then we will be done.&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
So, the shapley value is an &#039;&#039;average&#039;&#039;. but what kind of average? an &#039;&#039;arithmetic average&#039;&#039;. well, an arithmetic average takes a specific form. it looks like this. if you&#039;re averaging the elements of some set &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;, then the arithmetic average &amp;lt;math&amp;gt;\bar{X}&amp;lt;/math&amp;gt; is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\bar X = \frac{1}{|X|} \sum_{x\in X} f(x)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We throw in the function f because the elements of X might not be numbers. or even if they &#039;&#039;are&#039;&#039; numbers, you might want to apply some weighting other than the default one (the identity function).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Now, let&#039;s take the ugly-ass formula for the shapley value that you always see:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\ (n - |S| - 1)!}{n!} (v(S \cup \{i\}) - v(S))&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
how is &#039;&#039;that&#039;&#039; supposed to be an average? well first of all, we said above that the shapley value is averaging over all &#039;&#039;sequences&#039;&#039; of ways to add the n players. one way to formalize the concept of a &amp;quot;sequence&amp;quot; or &amp;quot;ordering&amp;quot; is to use permutations. a permutation is just a function that reorders the elements of a set. so each sequence corresponds to a permutation. we can recover a sequence &amp;lt;math&amp;gt;(x_1, x_2, \ldots, x_n)&amp;lt;/math&amp;gt; by defining the permutation &amp;lt;math&amp;gt;\sigma(k) := x_k&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
So in what sense is the shapley value an average? if &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the set of players, then we can define the set of all permutations &amp;lt;math&amp;gt;\mathrm{Sym}(N)&amp;lt;/math&amp;gt; on &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt;. (This is also denoted as &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt; and called the &amp;quot;symmetric group of degree n&amp;quot; since &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the &amp;quot;default&amp;quot; set of size n.)&lt;br /&gt;
&lt;br /&gt;
since the shapley value is an &#039;&#039;average&#039;&#039; and we are in particular averaging over all sequences, we want to rewrite the formula as something that looks like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f(\sigma)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
And in fact, at this point we know enough to convert our verbal understanding into a formula like the one above.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\varphi_i(v) = \frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} (v(\{k : \sigma(k) &amp;lt; \sigma(i)\} \cup \{i\}) - v(\{k : \sigma(k) &amp;lt; \sigma(i)\}))&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
a relevant fact is that the size of &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt; is &amp;lt;math&amp;gt;n!&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is another way to look at this. or rather, a way to extend our understanding. a common thing done in algebra is to [https://en.wikipedia.org/wiki/Symmetrization#n_variables &#039;&#039;symmetrize&#039;&#039;] a function by adding up all the permutations of the variables. a symmetric function is one in which you can interchange any of the variables and the function will stay the same. not in the trivial sense that &amp;lt;math&amp;gt;f(x,y)&amp;lt;/math&amp;gt; is &amp;quot;the same&amp;quot; as &amp;lt;math&amp;gt;f(y,x)&amp;lt;/math&amp;gt; since x and y are &amp;quot;just variables&amp;quot;. but rather, in the sense that &amp;lt;math&amp;gt;f(x,y)=f(y,x)&amp;lt;/math&amp;gt;. in symbols:&lt;br /&gt;
&lt;br /&gt;
* we are NOT saying &amp;lt;math&amp;gt;(x,y) \mapsto f(x,y) = (y,x) \mapsto f(y,x)&amp;lt;/math&amp;gt;. this is trivially true for all functions!&lt;br /&gt;
* but rather: &amp;lt;math&amp;gt;(x,y) \mapsto f(x,y) = (x,y) \mapsto f(y,x)&amp;lt;/math&amp;gt;. or in other words: &amp;lt;math&amp;gt;\forall x \forall y [f(x,y) = f(y,x)]&amp;lt;/math&amp;gt; (this is not trivially true! it&#039;s false for many functions including &amp;lt;math&amp;gt;f(x,y) := x-y&amp;lt;/math&amp;gt;)&lt;br /&gt;
&lt;br /&gt;
in the case of the shapley value, the &amp;quot;marginal contribution&amp;quot; function is NOT symmetric. so the naive fix that we would hope would work is to symmetrize it by adding all the possible permutations of the variables, forming a new function.&lt;br /&gt;
&lt;br /&gt;
wait, what? what even &#039;&#039;is&#039;&#039; the &amp;quot;marginal contribution function&amp;quot;?? for a player i of interest, it&#039;s the function that gives player i&#039;s marginal contribution, given an arbitrary sequence of players as input. let&#039;s say we are given a sequence &amp;lt;math&amp;gt;(x_1, x_2, \ldots, x_n)&amp;lt;/math&amp;gt;. what&#039;s player i&#039;s marginal contribution in this sequence? well, if &amp;lt;math&amp;gt;x_1 = i&amp;lt;/math&amp;gt;, then player i is the first player to join, so the marginal contribution is &amp;lt;math&amp;gt;v(\{i\}) - v(\emptyset) = v(\{i\})&amp;lt;/math&amp;gt;. if &amp;lt;math&amp;gt;x_2 = i&amp;lt;/math&amp;gt;, then the marginal contribution of player i is &amp;lt;math&amp;gt;v(\{x_1, i\}) - v(\{x_1\})&amp;lt;/math&amp;gt;. and so on. in general, if &amp;lt;math&amp;gt;x_j = i&amp;lt;/math&amp;gt; then player i&#039;s marginal contribution is &amp;lt;math&amp;gt;v(\{x_1, \ldots, x_{j-1}, i\}) - v(\{x_1, \ldots, x_{j-1}\})&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
as i said, this function, which we can call &amp;lt;math&amp;gt;f_i&amp;lt;/math&amp;gt;, is not symmetric. but we can symmetrize &amp;lt;math&amp;gt;f_i&amp;lt;/math&amp;gt; by adding up all the possible orderings of the input variables:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sum_{\sigma \in \mathrm{Sym}(n)} f_i(x_{\sigma(1)}, \ldots, x_{\sigma(n)})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
given a permutation &amp;lt;math&amp;gt;\sigma \in \mathrm{Sym}(n)&amp;lt;/math&amp;gt; and a function &amp;lt;math&amp;gt;f : X^n \to \mathbf R&amp;lt;/math&amp;gt;, we can define the permutation of the function &amp;lt;math&amp;gt;\sigma^* : (X^n \to \mathbf R) \to X^n \to \mathbf R&amp;lt;/math&amp;gt; by:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sigma^*(f) := (x_1, \ldots, x_n) \mapsto f(x_{\sigma(1)}, \ldots, x_{\sigma(n)})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By an abuse of notation, we can drop the star in &amp;lt;math&amp;gt;\sigma^*&amp;lt;/math&amp;gt; and just call the resulting extension &amp;lt;math&amp;gt;\sigma&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
the Shapley value is &amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f_i(\sigma(1), \ldots, \sigma(n))&amp;lt;/math&amp;gt;&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3573</id>
		<title>User:IssaRice/Shapley value</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3573"/>
		<updated>2023-04-08T20:48:22Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;most expositions of the Shapley value SUCK BALLS because they try to sum over the subsets excluding the player in question (usually called &amp;quot;player i&amp;quot;). so here we go, here&#039;s a TRUE REDPILLED exposition of the shapley value!&lt;br /&gt;
&lt;br /&gt;
first of all, what&#039;s the shapley value even trying to do? once we understand it in words, we can just convert our verbal understanding into symbols. and then we will be done.&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
So, the shapley value is an &#039;&#039;average&#039;&#039;. but what kind of average? an &#039;&#039;arithmetic average&#039;&#039;. well, an arithmetic average takes a specific form. it looks like this. if you&#039;re averaging the elements of some set &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;, then the arithmetic average &amp;lt;math&amp;gt;\bar{X}&amp;lt;/math&amp;gt; is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\bar X = \frac{1}{|X|} \sum_{x\in X} f(x)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We throw in the function f because the elements of X might not be numbers. or even if they &#039;&#039;are&#039;&#039; numbers, you might want to apply some weighting other than the default one (the identity function).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Now, let&#039;s take the ugly-ass formula for the shapley value that you always see:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\ (n - |S| - 1)!}{n!} (v(S \cup \{i\}) - v(S))&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
how is &#039;&#039;that&#039;&#039; supposed to be an average? well first of all, we said above that the shapley value is averaging over all &#039;&#039;sequences&#039;&#039; of ways to add the n players. one way to formalize the concept of a &amp;quot;sequence&amp;quot; or &amp;quot;ordering&amp;quot; is to use permutations. a permutation is just a function that reorders the elements of a set. so each sequence corresponds to a permutation. we can recover a sequence &amp;lt;math&amp;gt;(x_1, x_2, \ldots, x_n)&amp;lt;/math&amp;gt; by defining the permutation &amp;lt;math&amp;gt;\sigma(k) := x_k&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
So in what sense is the shapley value an average? if &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the set of players, then we can define the set of all permutations &amp;lt;math&amp;gt;\mathrm{Sym}(N)&amp;lt;/math&amp;gt; on &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt;. (This is also denoted as &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt; and called the &amp;quot;symmetric group of degree n&amp;quot; since &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the &amp;quot;default&amp;quot; set of size n.)&lt;br /&gt;
&lt;br /&gt;
since the shapley value is an &#039;&#039;average&#039;&#039; and we are in particular averaging over all sequences, we want to rewrite the formula as something that looks like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f(\sigma)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
And in fact, at this point we know enough to convert our verbal understanding into a formula like the one above.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\varphi_i(v) = \frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} (v(\{k : \sigma(k) &amp;lt; \sigma(i)\} \cup \{i\}) - v(\{k : \sigma(k) &amp;lt; \sigma(i)\}))&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
a relevant fact is that the size of &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt; is &amp;lt;math&amp;gt;n!&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is another way to look at this. or rather, a way to extend our understanding. a common thing done in algebra is to [https://en.wikipedia.org/wiki/Symmetrization#n_variables &#039;&#039;symmetrize&#039;&#039;] a function by adding up all the permutations of the variables. a symmetric function is one in which you can interchange any of the variables and the function will stay the same. not in the trivial sense that &amp;lt;math&amp;gt;f(x,y)&amp;lt;/math&amp;gt; is &amp;quot;the same&amp;quot; as &amp;lt;math&amp;gt;f(y,x)&amp;lt;/math&amp;gt; since x and y are &amp;quot;just variables&amp;quot;. but rather, in the sense that &amp;lt;math&amp;gt;f(x,y)=f(y,x)&amp;lt;/math&amp;gt;. in symbols:&lt;br /&gt;
&lt;br /&gt;
* we are NOT saying &amp;lt;math&amp;gt;(x,y) \mapsto f(x,y) = (y,x) \mapsto f(y,x)&amp;lt;/math&amp;gt;. this is trivially true for all functions!&lt;br /&gt;
* but rather: &amp;lt;math&amp;gt;(x,y) \mapsto f(x,y) = (x,y) \mapsto f(y,x)&amp;lt;/math&amp;gt;. or in other words: &amp;lt;math&amp;gt;\forall x \forall y [f(x,y) = f(y,x)]&amp;lt;/math&amp;gt; (this is not trivially true! it&#039;s false for many functions including &amp;lt;math&amp;gt;f(x,y) := x-y&amp;lt;/math&amp;gt;)&lt;br /&gt;
&lt;br /&gt;
in the case of the shapley value, the &amp;quot;marginal contribution&amp;quot; function is NOT symmetric. so the naive fix that we would hope would work is to symmetrize it by adding all the possible permutations of the variables, forming a new function.&lt;br /&gt;
&lt;br /&gt;
wait, what? what even &#039;&#039;is&#039;&#039; the &amp;quot;marginal contribution function&amp;quot;?? for a player i of interest, it&#039;s the function that gives player i&#039;s marginal contribution, given an arbitrary sequence of players as input. let&#039;s say we are given a sequence &amp;lt;math&amp;gt;(x_1, x_2, \ldots, x_n)&amp;lt;/math&amp;gt;. what&#039;s player i&#039;s marginal contribution in this sequence? well, if &amp;lt;math&amp;gt;x_1 = i&amp;lt;/math&amp;gt;, then player i is the first player to join, so the marginal contribution is &amp;lt;math&amp;gt;v(\{i\}) - v(\emptyset) = v(\{i\})&amp;lt;/math&amp;gt;. if &amp;lt;math&amp;gt;x_2 = i&amp;lt;/math&amp;gt;, then the marginal contribution of player i is &amp;lt;math&amp;gt;v(\{x_1, i\}) - v(\{x_1\})&amp;lt;/math&amp;gt;. and so on. in general, if &amp;lt;math&amp;gt;x_j = i&amp;lt;/math&amp;gt; then player i&#039;s marginal contribution is &amp;lt;math&amp;gt;v(\{x_1, \ldots, x_{j-1}, i\}) - v(\{x_1, \ldots, x_{j-1}\})&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
as i said, this function, which we can call &amp;lt;math&amp;gt;f_i&amp;lt;/math&amp;gt;, is not symmetric. but we can symmetrize &amp;lt;math&amp;gt;f_i&amp;lt;/math&amp;gt; by adding up all the possible orderings of the input variables:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sum_{\sigma \in \mathrm{Sym}(n)} f_i(x_{\sigma(1)}, \ldots, x_{\sigma(n)})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
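for example, with two players (a worked case added here for illustration), the symmetrized sum for player 1 is &amp;lt;math&amp;gt;f_1(x_1, x_2) + f_1(x_2, x_1)&amp;lt;/math&amp;gt;. on the input &amp;lt;math&amp;gt;(1, 2)&amp;lt;/math&amp;gt; this is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;f_1(1,2) + f_1(2,1) = v(\{1\}) + \left( v(\{1,2\}) - v(\{2\}) \right)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and on the input &amp;lt;math&amp;gt;(2, 1)&amp;lt;/math&amp;gt; the same two terms appear in the opposite order, so the sum does not depend on the order of the inputs.&lt;br /&gt;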
&lt;br /&gt;
the Shapley value is &amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f_i(\sigma(1), \ldots, \sigma(n))&amp;lt;/math&amp;gt;&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Lebesgue_theory&amp;diff=3572</id>
		<title>User:IssaRice/Lebesgue theory</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Lebesgue_theory&amp;diff=3572"/>
		<updated>2023-04-08T20:37:57Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;some questions for now:&lt;br /&gt;
&lt;br /&gt;
* why all the asymmetry in the usual definitions? [https://mathoverflow.net/questions/308856/why-is-lebesgue-measure-theory-asymmetric/308888]&lt;br /&gt;
* why isn&#039;t the lebesgue integral defined as the area under the graph? pugh&#039;s book does it this way. why is the definition in terms of simple functions, or the inf-based definition that axler uses in MIRA, preferred by textbooks?&lt;br /&gt;
* what would a corresponding &amp;quot;riemann measure&amp;quot; look like for subsets of R^n? is that just the jordan measure?&lt;br /&gt;
* why is caratheodory&#039;s criterion for measurability defined the way it is? there was a good blog post i saw once that gave a picture but i don&#039;t remember if i was fully convinced.&lt;br /&gt;
* is the only difference between jordan and lebesgue measure that one has a finite number of boxes and the other has countably many boxes? seems like it [https://terrytao.files.wordpress.com/2012/12/gsm-126-tao5-measure-book.pdf#page=35]&lt;br /&gt;
** in that case, one question i have is, why can&#039;t we reach the lebesgue integral simply by taking partitions along the x-axis with countably many points, instead of finitely many points (as in the riemann integral)?&lt;br /&gt;
* related to the asymmetry question: why can&#039;t we define a set to be lebesgue measurable iff its outer and inner lebesgue measures coincide, just like with jordan measurability? it must be that caratheodory&#039;s definition generalizes better. so there must be a theorem like &amp;quot;if a set is bounded, then the outer and inner lebesgue measures coincide if and only if caratheodory&#039;s criterion is satisfied&amp;quot;. then [https://lw2.issarice.com/posts/rs2focaRymwvkW2jS/inversion-of-theorems-into-definitions-when-generalizing &#039;&#039;this theorem&#039;&#039; justifies] using the caratheodory criterion to try to measure unbounded sets.&lt;br /&gt;
** is there an analogue of the caratheodory theorem for jordan measure, to allow us to extend jordan measurability to unbounded sets? or is this not an interesting question to ask since if we try to measure such sets, the answer will always be infinity even for &amp;quot;thin&amp;quot; sets like Q?&lt;br /&gt;
* i&#039;ve always found the coin-counting analogy for riemann vs lebesgue integration confusing. why should one method be better than the other, when they both produce the same answer? if i was actually trying to quickly estimate how much i had in coins, i would just gather up all the highest denominations and count those, then add a &amp;quot;fuzz factor&amp;quot; to account for some error. i wouldn&#039;t even bother counting the pennies unless it seemed like there was a huge number of them.&lt;br /&gt;
* the riemann/jordan vs lebesgue difference has been described as finite vs countable, x-axis partitioning vs y-axis partitioning, topological boundary measure = 0 vs measure-theoretic boundary measure = 0, piecewise constant function vs simple function and maybe one other thing i am forgetting. but what is the essence of the difference here? why do all these separate distinctions turn out to be &amp;quot;the same thing&amp;quot;?&lt;br /&gt;
** why should partitioning the y-axis lead to being able to integrate more functions compared to partitioning the x-axis? the finite vs countable distinction makes sense, but the y vs x thing makes no sense to me.&lt;br /&gt;
* Apostol&#039;s analysis lists two non-equivalent definitions of riemann integrability i think. which one does the jordan-undergraph riemann integral pick out and why?&lt;br /&gt;
* since there is a notion of riemann sums, is there also an analogous notion of &amp;quot;lebesgue sums&amp;quot;? Yes; see pugh&#039;s book.&lt;br /&gt;
* pugh expresses riemann integrability in terms of the boundary of the function having zero &#039;&#039;lebesgue&#039;&#039; measure. why do we have to bring in lebesgue measure here? can&#039;t it be jordan measure?&lt;br /&gt;
* why can&#039;t we extend jordan measurability to unbounded sets by doing something analogous to improper riemann integration? like, we define a &amp;quot;measure&amp;quot; for some finite portion of the set parametrized by some bound, then take the limit as the bound goes to infinity.&lt;br /&gt;
* more general way of generating questions: there&#039;s some stuff we talk about a lot in &amp;quot;riemann/jordan land&amp;quot; (e.g. upper and lower sums and defining integrability when the two are equal), and some stuff we talk about a lot in &amp;quot;lebesgue land&amp;quot; (e.g. caratheodory criterion). for each thing we talk about in one of the lands, what is the analogue of it in the other land?&lt;br /&gt;
* if the lebesgue integral was just the riemann integral but dividing along the y-axis instead of the x-axis, then we should just be able to get the lebesgue integral by integrating f^-1 or something?&lt;br /&gt;
* the horizontal stacking picture people like to draw for the lebesgue integral seems deceptive? like, if you actually look at how a simple function is integrated, you kind of draw the horizontal &amp;quot;window&amp;quot; but only to find where the corresponding points are on the x-axis? and then you multiply the y value with the width along the x-axis, so the actual area for that bit is still a vertical rectangle. thank god, someone else [https://www.youtube.com/watch?v=LDNDTOVnKJk&amp;amp;lc=UgzgAWpd40oYJpsTfbx4AaABAg noticed this]: &amp;quot;I think you haven&#039;t understood the Lebesgue integral at all. The animation with horizontal rectangles is quite flawed. That&#039;s not how it works!! Rectangles are vertical, what happens is that we split the range of the image into different sections. In each of them, we take an arbitrary point, and multiply it by the inverse function of the section. That may be one or more VERTICAL rectangles.&amp;quot;&lt;br /&gt;
* my [https://www.youtube.com/watch?v=LDNDTOVnKJk&amp;amp;lc=UgyBsn4N7-nI9GjKbU54AaABAg comment]: &amp;quot;If the only difference between the Riemann and Lebesgue integral was dividing up along the x-axis vs y-axis, then the Jordan measure (which uses boxes and doesn&#039;t care about the x or y axis) should be able to find the area under the graph of any Lebesgue-integrable function, right? The fact that this is not possible I think means there is some other deeper difference between the two integrals.&amp;quot;&lt;br /&gt;
* trying to measure the set of irrational numbers in [0,1]: i think this should have length 1, but even though it&#039;s a bounded set, the sup of the inner measure seems to be 0? actually it does seem to be possible to construct a positive-measure subset that only contains irrationals [https://math.stackexchange.com/a/3932309/35525], which is pretty unintuitive. hmm, but this is still different from inner measure i think, because we aren&#039;t using a countable number of boxes inside the irrationals. instead we&#039;re starting by surrounding the rationals and then subtracting out those sets.&lt;br /&gt;
* ok here&#039;s another reason i think the y-axis thing is bullshit: we could define a &amp;quot;jordan integral&amp;quot; by taking the definition of the lebesgue integral via simple functions but replacing the measure with the jordan measure. i think the resulting integral would be equivalent to the riemann integral.&lt;br /&gt;
* if all lebesgue did was to replace &amp;quot;finite&amp;quot; with &amp;quot;countable&amp;quot; in the definition of jordan measure, why was lebesgue&#039;s theory considered so important/revolutionary? was it the fact that lebesgue also did all the legwork to prove that his measure/integral had all these nice properties?&lt;br /&gt;
* what if we used simple functions to define the riemann integral, or piecewise constant functions to define the lebesgue integral?&lt;br /&gt;
* i think a lot of definitions of the lebesgue integral &amp;quot;cheat&amp;quot; by using the lebesgue measure (it puts all the difficulty of measurement into the lebesgue measure part). the riemann integral doesn&#039;t use the jordan measure in most definitions; you just add up the rectangles yourself. can we just get the lebesgue integral by using a countable number of rectangles? or is there more to lebesgue than that? can we make use of measurability without using the measure itself, to define the lebesgue integral?&lt;br /&gt;
* the horizontal slab idea is coming from [https://en.wikipedia.org/wiki/Lebesgue_integration#Via_improper_Riemann_integral this definition]. the weird thing though is that almost no one defines the lebesgue integral that way??? so like, the image you present has nothing to do with the technical definition you give. also, the horizontal slab thing is just using the lebesgue measure! all the work is being done by the lebesgue measure! if you substituted the jordan measure instead then i think you would just get back the riemann integral again.&lt;br /&gt;
&lt;br /&gt;
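on the positive-measure set of irrationals mentioned in the list above: here is a minimal sketch of the standard covering construction, which i am assuming is the idea behind the linked answer. enumerate the rationals in [0,1], give the k-th one an open interval of length eps / 2^(k+1), and the total length removed stays below eps, so what is left has measure at least 1 - eps and contains no rationals. the code only handles finitely many rationals (those with small denominator), and the names rationals_in_unit_interval and cover_lengths are made up for this illustration.&lt;br /&gt;
&lt;br /&gt;
```python
from fractions import Fraction

def rationals_in_unit_interval(max_denominator):
    """Distinct rationals p/q in [0, 1] with q up to max_denominator."""
    seen = []
    for q in range(1, max_denominator + 1):
        for p in range(q + 1):
            r = Fraction(p, q)
            if r not in seen:
                seen.append(r)
    return seen

def cover_lengths(count, eps):
    """Length eps / 2**(k+1) for the k-th rational, k = 0, 1, ..."""
    return [eps / 2 ** (k + 1) for k in range(count)]

eps = Fraction(1, 10)
points = rationals_in_unit_interval(6)
lengths = cover_lengths(len(points), eps)

# geometric series: the total falls short of eps by exactly eps / 2**len(points)
total = sum(lengths)
shortfall = eps - total
print(len(points), total, shortfall)
```
&lt;br /&gt;
however many rationals are enumerated, the shortfall stays positive, which is why the countable union of intervals still has total length at most eps.&lt;br /&gt;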
pugh&#039;s book has more connections between riemann and lebesgue&lt;br /&gt;
&lt;br /&gt;
also, i am scared to ask, but how does all of this apply to the [https://en.wikipedia.org/wiki/Henstock%E2%80%93Kurzweil_integral gauge integral]?&lt;br /&gt;
&lt;br /&gt;
i think it&#039;s pretty bad that there seems to be no book that answers all of these questions, period, let alone in an easily understandable manner. you can tell these questions are not even asked in the textbooks because professional mathematicians are asking them on mathoverflow... e.g. [https://mathoverflow.net/questions/321916/why-isnt-integral-defined-as-the-area-under-the-graph-of-function] [https://mathoverflow.net/questions/308856/why-is-lebesgue-measure-theory-asymmetric/308888]&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3571</id>
		<title>User:IssaRice/Shapley value</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3571"/>
		<updated>2023-04-08T04:56:44Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;most expositions of the Shapley value SUCK BALLS because they try to sum over the subsets excluding the player in question (usually called &amp;quot;player i&amp;quot;). so here we go, here&#039;s a TRUE REDPILLED exposition of the shapley value!&lt;br /&gt;
&lt;br /&gt;
first of all, what&#039;s the shapley value even trying to do? once we understand it in words, we can just convert our verbal understanding into symbols. and then we will be done.&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
So, the shapley value is an &#039;&#039;average&#039;&#039;. but what kind of average? an &#039;&#039;arithmetic average&#039;&#039;. well, an arithmetic average takes a specific form. it looks like this. if you&#039;re averaging the elements of some set &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;, then the arithmetic average &amp;lt;math&amp;gt;\bar{X}&amp;lt;/math&amp;gt; is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\bar X = \frac{1}{|X|} \sum_{x\in X} f(x)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We throw in the function f because the elements of X might not be numbers. or even if they &#039;&#039;are&#039;&#039; numbers, you might want to apply some weighting other than the default one (the identity function).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Now, let&#039;s take the ugly-ass formula for the shapley value that you always see:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\ (n - |S| - 1)!}{n!} (v(S \cup \{i\}) - v(S))&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
how is &#039;&#039;that&#039;&#039; supposed to be an average? well first of all, we said above that the shapley value is averaging over all &#039;&#039;sequences&#039;&#039; in which the n players can be added. one way to formalize the concept of a &amp;quot;sequence&amp;quot; or &amp;quot;ordering&amp;quot; is to use permutations. a permutation is just a function that reorders the elements of a set. so each sequence corresponds to a permutation: a sequence &amp;lt;math&amp;gt;(x_1, x_2, \ldots, x_n)&amp;lt;/math&amp;gt; gives the permutation &amp;lt;math&amp;gt;\sigma(k) := x_k&amp;lt;/math&amp;gt;, and we can recover the sequence from the permutation as &amp;lt;math&amp;gt;(\sigma(1), \ldots, \sigma(n))&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
So in what sense is the shapley value an average? if &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the set of players, then we can define the set of all permutations &amp;lt;math&amp;gt;\mathrm{Sym}(N)&amp;lt;/math&amp;gt; on &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt;. (This is also denoted as &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt; and called the &amp;quot;symmetric group of degree n&amp;quot; since &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the &amp;quot;default&amp;quot; set of size n.)&lt;br /&gt;
&lt;br /&gt;
since the shapley value is an &#039;&#039;average&#039;&#039; and we are in particular averaging over all sequences, we want to rewrite the formula as something that looks like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f(\sigma)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
And in fact, at this point we know enough to convert our verbal understanding into a formula like the one above.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\varphi_i(v) = \frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} (v(\{k : \sigma(k) &amp;lt; \sigma(i)\} \cup \{i\}) - v(\{k : \sigma(k) &amp;lt; \sigma(i)\}))&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
a relevant fact is that the size of &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt; is &amp;lt;math&amp;gt;n!&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
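as a sanity check, here is a minimal sketch that computes both the subset formula and the plain average over permutations and compares them on a small game. assumptions: the game v (worth = square of the sum of the player labels) is made up for illustration, the function names are hypothetical, and in the permutation version &amp;lt;math&amp;gt;\sigma(k)&amp;lt;/math&amp;gt; is read as the &#039;&#039;position&#039;&#039; at which player k joins, which is the reading under which the set of players k with a smaller position than i is exactly the set of players who joined before i.&lt;br /&gt;
&lt;br /&gt;
```python
from fractions import Fraction
from itertools import combinations, permutations
from math import factorial

def shapley_subsets(n, v, i):
    """Weighted sum over subsets S of N minus {i} (the usual textbook formula)."""
    others = [k for k in range(1, n + 1) if k != i]
    total = Fraction(0)
    for size in range(n):
        weight = Fraction(factorial(size) * factorial(n - size - 1), factorial(n))
        for subset in combinations(others, size):
            S = frozenset(subset)
            total += weight * (v(S | {i}) - v(S))
    return total

def shapley_orderings(n, v, i):
    """Plain average over all n! permutations, reading sigma(k) as the
    position at which player k joins (an assumption about the notation)."""
    total = Fraction(0)
    for sigma in permutations(range(1, n + 1)):
        # sigma[k - 1] stands for sigma(k); earlier players have a smaller position
        before = frozenset(k for k in range(1, n + 1)
                           if sigma[k - 1] in range(1, sigma[i - 1]))
        total += v(before | {i}) - v(before)
    return total / factorial(n)

# hypothetical game, chosen only for illustration
def v(S):
    return sum(S) ** 2

via_subsets = [shapley_subsets(3, v, i) for i in (1, 2, 3)]
via_orderings = [shapley_orderings(3, v, i) for i in (1, 2, 3)]
print(via_subsets == via_orderings, via_subsets)
```
&lt;br /&gt;
for this game both versions give 6, 12 and 18 for players 1, 2 and 3, which add up to v(N) = 36. the weight in the subset formula is just the fraction of orderings in which exactly the players of S come before player i.&lt;br /&gt;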
&lt;br /&gt;
There is another way to look at this. or rather, a way to extend our understanding. a common thing done in algebra is to [https://en.wikipedia.org/wiki/Symmetrization#n_variables &#039;&#039;symmetrize&#039;&#039;] a function by adding up all the permutations of the variables. a symmetric function is one in which you can interchange any of the variables and the function will stay the same. not in the trivial sense that &amp;lt;math&amp;gt;f(x,y)&amp;lt;/math&amp;gt; is &amp;quot;the same&amp;quot; as &amp;lt;math&amp;gt;f(y,x)&amp;lt;/math&amp;gt; since x and y are &amp;quot;just variables&amp;quot;. but rather, in the sense that &amp;lt;math&amp;gt;f(x,y)=f(y,x)&amp;lt;/math&amp;gt;. in symbols:&lt;br /&gt;
&lt;br /&gt;
* we are NOT saying &amp;lt;math&amp;gt;(x,y) \mapsto f(x,y) = (y,x) \mapsto f(y,x)&amp;lt;/math&amp;gt;. this is trivially true for all functions!&lt;br /&gt;
* but rather: &amp;lt;math&amp;gt;(x,y) \mapsto f(x,y) = (x,y) \mapsto f(y,x)&amp;lt;/math&amp;gt;. or in other words: &amp;lt;math&amp;gt;\forall x \forall y [f(x,y) = f(y,x)]&amp;lt;/math&amp;gt; (this is not trivially true! it&#039;s false for many functions including &amp;lt;math&amp;gt;f(x,y) := x-y&amp;lt;/math&amp;gt;)&lt;br /&gt;
&lt;br /&gt;
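the distinction in the two bullets above can be checked mechanically. a small sketch, with hypothetical helper names is_symmetric and symmetrize: the first compares f on every reordering of its arguments over a finite grid of sample values (so it gives evidence, not a proof), and the second builds the symmetrized function by summing f over all orderings of its arguments.&lt;br /&gt;
&lt;br /&gt;
```python
from itertools import permutations, product

def is_symmetric(f, arity, samples):
    """True if f agrees on every reordering of its arguments, for all sample inputs."""
    return all(
        f(*args) == f(*reordered)
        for args in product(samples, repeat=arity)
        for reordered in permutations(args)
    )

def symmetrize(f):
    """Return the function that sums f over every ordering of its arguments."""
    def g(*args):
        return sum(f(*reordered) for reordered in permutations(args))
    return g

def difference(x, y):
    return x - y

def weighted(x, y):
    return x + 2 * y

samples = [0, 1, 2, 5]
print(is_symmetric(difference, 2, samples))            # x - y is not symmetric
print(is_symmetric(symmetrize(weighted), 2, samples))  # (x + 2y) + (y + 2x) is
print(symmetrize(difference)(1, 2), symmetrize(weighted)(1, 2))
```
&lt;br /&gt;
note that symmetrizing x - y gives the zero function, so symmetrizing always produces a symmetric function but can throw away information.&lt;br /&gt;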
in the case of the shapley value, the &amp;quot;marginal contribution&amp;quot; function is NOT symmetric. so the naive fix that we would hope would work is to symmetrize it by adding all the possible permutations of the variables, forming a new function.&lt;br /&gt;
&lt;br /&gt;
wait, what? what even &#039;&#039;is&#039;&#039; the &amp;quot;marginal contribution function&amp;quot;?? for a player i of interest, it&#039;s the function that gives player i&#039;s marginal contribution, given an arbitrary sequence of players as input. let&#039;s say we are given a sequence &amp;lt;math&amp;gt;(x_1, x_2, \ldots, x_n)&amp;lt;/math&amp;gt;. what&#039;s player i&#039;s marginal contribution in this sequence? well, if &amp;lt;math&amp;gt;x_1 = i&amp;lt;/math&amp;gt;, then player i is the first player to join, so the marginal contribution is &amp;lt;math&amp;gt;v(\{i\}) - v(\emptyset) = v(\{i\})&amp;lt;/math&amp;gt;. if &amp;lt;math&amp;gt;x_2 = i&amp;lt;/math&amp;gt;, then the marginal contribution of player i is &amp;lt;math&amp;gt;v(\{x_1, i\}) - v(\{x_1\})&amp;lt;/math&amp;gt;. and so on. in general, if &amp;lt;math&amp;gt;x_j = i&amp;lt;/math&amp;gt; then player i&#039;s marginal contribution is &amp;lt;math&amp;gt;v(\{x_1, \ldots, x_{j-1}, i\}) - v(\{x_1, \ldots, x_{j-1}\})&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
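the definition of the marginal contribution function translates almost word for word into code. a minimal sketch, assuming the sequence lists every player exactly once; the name marginal_contribution and the example game v are made up for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
def marginal_contribution(i, v, sequence):
    """f_i: player i's marginal contribution when players join in the given order."""
    j = sequence.index(i)              # 0-based position of player i in the sequence
    before = frozenset(sequence[:j])   # the players who joined earlier than i
    return v(before | {i}) - v(before)

# hypothetical game: worth = square of the sum of the player labels
def v(S):
    return sum(S) ** 2

first = marginal_contribution(1, v, (1, 2, 3))    # player 1 joins first
middle = marginal_contribution(1, v, (2, 1, 3))   # player 1 joins after player 2
last = marginal_contribution(1, v, (2, 3, 1))     # player 1 joins last
print(first, middle, last)
```
&lt;br /&gt;
the three results differ even though the same three players appear in each input, which is exactly the failure of symmetry.&lt;br /&gt;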
&lt;br /&gt;
the Shapley value is &amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f_i(\sigma(1), \ldots, \sigma(n))&amp;lt;/math&amp;gt;&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3570</id>
		<title>User:IssaRice/Shapley value</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3570"/>
		<updated>2023-04-08T04:51:08Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;most expositions of the Shapley value SUCK BALLS because they try to sum over the subsets excluding the player in question (usually called &amp;quot;player i&amp;quot;). so here we go, here&#039;s a TRUE REDPILLED exposition of the shapley value!&lt;br /&gt;
&lt;br /&gt;
first of all, what&#039;s the shapley value even trying to do? once we understand it in words, we can just convert our verbal understanding into symbols. and then we will be done.&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
So, the shapley value is an &#039;&#039;average&#039;&#039;. but what kind of average? an &#039;&#039;arithmetic average&#039;&#039;. well, an arithmetic average takes a specific form. it looks like this. if you&#039;re averaging the elements of some set &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;, then the arithmetic average &amp;lt;math&amp;gt;\bar{X}&amp;lt;/math&amp;gt; is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\bar X = \frac{1}{|X|} \sum_{x\in X} f(x)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We throw in the function f because the elements of X might not be numbers. or even if they &#039;&#039;are&#039;&#039; numbers, you might want to apply some weighting other than the default one (the identity function).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Now, let&#039;s take the ugly-ass formula for the shapley value that you always see:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\ (n - |S| - 1)!}{n!} (v(S \cup \{i\}) - v(S))&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
how is &#039;&#039;that&#039;&#039; supposed to be an average? well first of all, we said above that the shapley value is averaging over all &#039;&#039;sequences&#039;&#039; in which the n players can be added. one way to formalize the concept of a &amp;quot;sequence&amp;quot; or &amp;quot;ordering&amp;quot; is to use permutations. a permutation is just a function that reorders the elements of a set. so each sequence corresponds to a permutation: a sequence &amp;lt;math&amp;gt;(x_1, x_2, \ldots, x_n)&amp;lt;/math&amp;gt; gives the permutation &amp;lt;math&amp;gt;\sigma(k) := x_k&amp;lt;/math&amp;gt;, and we can recover the sequence from the permutation as &amp;lt;math&amp;gt;(\sigma(1), \ldots, \sigma(n))&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
So in what sense is the shapley value an average? if &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the set of players, then we can define the set of all permutations &amp;lt;math&amp;gt;\mathrm{Sym}(N)&amp;lt;/math&amp;gt; on &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt;. (This is also denoted as &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt; and called the &amp;quot;symmetric group of degree n&amp;quot; since &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the &amp;quot;default&amp;quot; set of size n.)&lt;br /&gt;
&lt;br /&gt;
since the shapley value is an &#039;&#039;average&#039;&#039; and we are in particular averaging over all sequences, we want to rewrite the formula as something that looks like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f(\sigma)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
And in fact, at this point we know enough to convert our verbal understanding into a formula like the one above.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\varphi_i(v) = \frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} (v(\{k : \sigma(k) &amp;lt; \sigma(i)\} \cup \{i\}) - v(\{k : \sigma(k) &amp;lt; \sigma(i)\}))&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
a relevant fact is that the size of &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt; is &amp;lt;math&amp;gt;n!&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is another way to look at this. or rather, a way to extend our understanding. a common thing done in algebra is to [https://en.wikipedia.org/wiki/Symmetrization#n_variables &#039;&#039;symmetrize&#039;&#039;] a function by adding up all the permutations of the variables. a symmetric function is one in which you can interchange any of the variables and the function will stay the same. not in the trivial sense that &amp;lt;math&amp;gt;f(x,y)&amp;lt;/math&amp;gt; is &amp;quot;the same&amp;quot; as &amp;lt;math&amp;gt;f(y,x)&amp;lt;/math&amp;gt; since x and y are &amp;quot;just variables&amp;quot;. but rather, in the sense that &amp;lt;math&amp;gt;f(x,y)=f(y,x)&amp;lt;/math&amp;gt;. in symbols:&lt;br /&gt;
&lt;br /&gt;
* we are NOT saying &amp;lt;math&amp;gt;(x,y) \mapsto f(x,y) = (y,x) \mapsto f(y,x)&amp;lt;/math&amp;gt;. this is trivially true for all functions!&lt;br /&gt;
* but rather: &amp;lt;math&amp;gt;(x,y) \mapsto f(x,y) = (x,y) \mapsto f(y,x)&amp;lt;/math&amp;gt;. or in other words: &amp;lt;math&amp;gt;\forall x \forall y [f(x,y) = f(y,x)]&amp;lt;/math&amp;gt; (this is not trivially true! it&#039;s false for many functions including &amp;lt;math&amp;gt;f(x,y) := x-y&amp;lt;/math&amp;gt;)&lt;br /&gt;
&lt;br /&gt;
in the case of the shapley value, the &amp;quot;marginal contribution&amp;quot; function is NOT symmetric. so the naive fix that we would hope would work is to symmetrize it by adding all the possible permutations of the variables, forming a new function.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
the Shapley value is &amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f_i(\sigma(1), \ldots, \sigma(n))&amp;lt;/math&amp;gt;&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3569</id>
		<title>User:IssaRice/Shapley value</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3569"/>
		<updated>2023-04-08T04:48:55Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;most expositions of the Shapley value SUCK BALLS because they try to sum over the subsets excluding the player in question (usually called &amp;quot;player i&amp;quot;). so here we go, here&#039;s a TRUE REDPILLED exposition of the shapley value!&lt;br /&gt;
&lt;br /&gt;
first of all, what&#039;s the shapley value even trying to do? once we understand it in words, we can just convert our verbal understanding into symbols. and then we will be done.&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
So, the shapley value is an &#039;&#039;average&#039;&#039;. but what kind of average? an &#039;&#039;arithmetic average&#039;&#039;. well, an arithmetic average takes a specific form. it looks like this. if you&#039;re averaging the elements of some set &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;, then the arithmetic average &amp;lt;math&amp;gt;\bar{X}&amp;lt;/math&amp;gt; is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\bar X = \frac{1}{|X|} \sum_{x\in X} f(x)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We throw in the function f because the elements of X might not be numbers. or even if they &#039;&#039;are&#039;&#039; numbers, you might want to apply some weighting other than the default one (the identity function).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Now, let&#039;s take the ugly-ass formula for the shapley value that you always see:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\ (n - |S| - 1)!}{n!} (v(S \cup \{i\}) - v(S))&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
how is &#039;&#039;that&#039;&#039; supposed to be an average? well first of all, we said above that the shapley value is averaging over all &#039;&#039;sequences&#039;&#039; in which the n players can be added. one way to formalize the concept of a &amp;quot;sequence&amp;quot; or &amp;quot;ordering&amp;quot; is to use permutations. a permutation is just a function that reorders the elements of a set. so each sequence corresponds to a permutation: a sequence &amp;lt;math&amp;gt;(x_1, x_2, \ldots, x_n)&amp;lt;/math&amp;gt; gives the permutation &amp;lt;math&amp;gt;\sigma(k) := x_k&amp;lt;/math&amp;gt;, and we can recover the sequence from the permutation as &amp;lt;math&amp;gt;(\sigma(1), \ldots, \sigma(n))&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
So in what sense is the shapley value an average? if &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the set of players, then we can define the set of all permutations &amp;lt;math&amp;gt;\mathrm{Sym}(N)&amp;lt;/math&amp;gt; on &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt;. (This is also denoted as &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt; and called the &amp;quot;symmetric group of degree n&amp;quot; since &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the &amp;quot;default&amp;quot; set of size n.)&lt;br /&gt;
&lt;br /&gt;
since the shapley value is an &#039;&#039;average&#039;&#039; and we are in particular averaging over all sequences, we want to rewrite the formula as something that looks like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f(\sigma)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
And in fact, at this point we know enough to convert our verbal understanding into a formula like the one above.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\varphi_i(v) = \frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} (v(\{k : \sigma(k) &amp;lt; \sigma(i)\} \cup \{i\}) - v(\{k : \sigma(k) &amp;lt; \sigma(i)\}))&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
a relevant fact is that the size of &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt; is &amp;lt;math&amp;gt;n!&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is another way to look at this. or rather, a way to extend our understanding. a common thing done in algebra is to [https://en.wikipedia.org/wiki/Symmetrization#n_variables &#039;&#039;symmetrize&#039;&#039;] a function by adding up all the permutations of the variables. a symmetric function is one in which you can interchange any of the variables and the function will stay the same. not in the trivial sense that &amp;lt;math&amp;gt;f(x,y)&amp;lt;/math&amp;gt; is &amp;quot;the same&amp;quot; as &amp;lt;math&amp;gt;f(y,x)&amp;lt;/math&amp;gt; since x and y are &amp;quot;just variables&amp;quot;. but rather, in the sense that &amp;lt;math&amp;gt;f(x,y)=f(y,x)&amp;lt;/math&amp;gt;. in symbols:&lt;br /&gt;
&lt;br /&gt;
* we are NOT saying &amp;lt;math&amp;gt;(x,y) \mapsto f(x,y) = (y,x) \mapsto f(y,x)&amp;lt;/math&amp;gt;. this is trivially true for all functions!&lt;br /&gt;
* but rather: &amp;lt;math&amp;gt;\forall x \forall y [f(x,y) = f(y,x)]&amp;lt;/math&amp;gt; (this is not trivially true! it&#039;s false for many functions including &amp;lt;math&amp;gt;f(x,y) := x-y&amp;lt;/math&amp;gt;)&lt;br /&gt;
&lt;br /&gt;
in the case of the shapley value, the &amp;quot;marginal contribution&amp;quot; function is NOT symmetric. so the naive fix that we would hope would work is to symmetrize it by adding all the possible permutations of the variables, forming a new function.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
the Shapley value is &amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f_i(\sigma(1), \ldots, \sigma(n))&amp;lt;/math&amp;gt;&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3568</id>
		<title>User:IssaRice/Shapley value</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3568"/>
		<updated>2023-04-08T04:46:57Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;most expositions of the Shapley value SUCK BALLS because they try to sum over the subsets excluding the player in question (usually called &amp;quot;player i&amp;quot;). so here we go, here&#039;s a TRUE REDPILLED exposition of the shapley value!&lt;br /&gt;
&lt;br /&gt;
first of all, what&#039;s the shapley value even trying to do? once we understand it in words, we can just convert our verbal understanding into symbols. and then we will be done.&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
So, the shapley value is an &#039;&#039;average&#039;&#039;. but what kind of average? an &#039;&#039;arithmetic average&#039;&#039;. well, an arithmetic average takes a specific form. it looks like this. if you&#039;re averaging the elements of some set &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;, then the arithmetic average &amp;lt;math&amp;gt;\bar{X}&amp;lt;/math&amp;gt; is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\bar X = \frac{1}{|X|} \sum_{x\in X} f(x)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We throw in the function f because the elements of X might not be numbers. or even if they &#039;&#039;are&#039;&#039; numbers, you might want to apply some weighting other than the default one (the identity function).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Now, let&#039;s take the ugly-ass formula for the shapley value that you always see:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\ (n - |S| - 1)!}{n!} (v(S \cup \{i\}) - v(S))&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
how is &#039;&#039;that&#039;&#039; supposed to be an average? well first of all, we said above that the shapley value is averaging over all &#039;&#039;sequences&#039;&#039; in which the n players can be added. one way to formalize the concept of a &amp;quot;sequence&amp;quot; or &amp;quot;ordering&amp;quot; is to use permutations. a permutation is just a function that reorders the elements of a set. so each sequence corresponds to a permutation: a sequence &amp;lt;math&amp;gt;(x_1, x_2, \ldots, x_n)&amp;lt;/math&amp;gt; gives the permutation &amp;lt;math&amp;gt;\sigma(k) := x_k&amp;lt;/math&amp;gt;, and we can recover the sequence from the permutation as &amp;lt;math&amp;gt;(\sigma(1), \ldots, \sigma(n))&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
So in what sense is the shapley value an average? if &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the set of players, then we can define the set of all permutations &amp;lt;math&amp;gt;\mathrm{Sym}(N)&amp;lt;/math&amp;gt; on &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt;. (This is also denoted as &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt; and called the &amp;quot;symmetric group of degree n&amp;quot; since &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the &amp;quot;default&amp;quot; set of size n.)&lt;br /&gt;
&lt;br /&gt;
since the shapley value is an &#039;&#039;average&#039;&#039; and we are in particular averaging over all sequences, we want to rewrite the formula as something that looks like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f(\sigma)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
And in fact, at this point we know enough to convert our verbal understanding into a formula like the one above.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\varphi_i(v) = \frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} (v(\{k : \sigma(k) &amp;lt; \sigma(i)\} \cup \{i\}) - v(\{k : \sigma(k) &amp;lt; \sigma(i)\}))&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
a relevant fact is that the size of &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt; is &amp;lt;math&amp;gt;n!&amp;lt;/math&amp;gt;.&lt;br /&gt;
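&lt;br /&gt;
to sanity-check that the permutation average really agrees with the ugly subset formula, here is a minimal python sketch. nothing in it comes from a library for games: the names shapley_by_orderings and shapley_by_subsets and the toy game v are made up for illustration, and an ordering is represented directly as a tuple of players rather than as a function sigma. it brute-forces all n! orderings with exact fractions, so it is only meant for small n.&lt;br /&gt;

```python
from fractions import Fraction
from itertools import combinations, permutations
from math import factorial


def shapley_by_orderings(players, v, i):
    """Average the marginal contribution of player i over all orderings."""
    orderings = list(permutations(players))
    total = Fraction(0)
    for order in orderings:
        # Players that were added before i in this ordering.
        before = frozenset(order[:order.index(i)])
        total += v(before | {i}) - v(before)
    return total / len(orderings)


def shapley_by_subsets(players, v, i):
    """The usual weighted sum over subsets S of the players other than i."""
    others = [p for p in players if p != i]
    n = len(players)
    total = Fraction(0)
    for size in range(n):
        # |S|! (n - |S| - 1)! / n! depends only on the size of S.
        weight = Fraction(factorial(size) * factorial(n - size - 1), factorial(n))
        for subset in combinations(others, size):
            s = frozenset(subset)
            total += weight * (v(s | {i}) - v(s))
    return total


# Hypothetical 3-player game: a coalition is worth the square of its size.
def v(coalition):
    return len(coalition) ** 2


players = (1, 2, 3)
by_orderings = [shapley_by_orderings(players, v, i) for i in players]
by_subsets = [shapley_by_subsets(players, v, i) for i in players]
```

the two lists should agree, because each subset S of the other players shows up as the set of predecessors of player i in exactly |S|! (n - |S| - 1)! of the n! orderings, which is where the weight in the subset formula comes from.&lt;br /&gt;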
&lt;br /&gt;
&lt;br /&gt;
There is another way to look at this. or rather, a way to extend our understanding. a common thing done in algebra is to [https://en.wikipedia.org/wiki/Symmetrization#n_variables &#039;&#039;symmetrize&#039;&#039;] a function by adding up all the permutations of the variables. a symmetric function is one in which you can interchange any of the variables and the function will stay the same. not in the trivial sense that &amp;lt;math&amp;gt;f(x,y)&amp;lt;/math&amp;gt; is &amp;quot;the same&amp;quot; as &amp;lt;math&amp;gt;f(y,x)&amp;lt;/math&amp;gt; since x and y are &amp;quot;just variables&amp;quot;. but rather, in the sense that &amp;lt;math&amp;gt;f(x,y)=f(y,x)&amp;lt;/math&amp;gt;. in symbols:&lt;br /&gt;
&lt;br /&gt;
* we are NOT saying &amp;lt;math&amp;gt;(x,y) \mapsto f(x,y) = (y,x) \mapsto f(y,x)&amp;lt;/math&amp;gt;. this is trivially true for all functions!&lt;br /&gt;
* but rather: &amp;lt;math&amp;gt;\forall x \forall y [f(x,y) = f(y,x)]&amp;lt;/math&amp;gt;&lt;br /&gt;
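&lt;br /&gt;
symmetrizing can be sketched in a few lines of python. this is only an illustration under some assumptions: the helper symmetrize and the example function f are made up here, and the helper averages over the reorderings (dividing by the number of permutations, as the Shapley value does) instead of just adding them up.&lt;br /&gt;

```python
from fractions import Fraction
from itertools import permutations


def symmetrize(f):
    """Return the average of f over all reorderings of its arguments."""
    def g(*args):
        values = [f(*p) for p in permutations(args)]
        return Fraction(sum(values), len(values))
    return g


# Hypothetical example: f(x, y) = x - 2y is not symmetric.
def f(x, y):
    return x - 2 * y


g = symmetrize(f)
swapped_f = (f(1, 5), f(5, 1))  # these two differ
swapped_g = (g(1, 5), g(5, 1))  # these two are equal
```

swapping the arguments changes f but not g, which is the second bullet above holding for g and failing for f.&lt;br /&gt;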
&lt;br /&gt;
&lt;br /&gt;
the Shapley value is &amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f_i(\sigma(1), \ldots, \sigma(n))&amp;lt;/math&amp;gt;&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3567</id>
		<title>User:IssaRice/Shapley value</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3567"/>
		<updated>2023-04-08T04:42:30Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;most expositions of the Shapley value SUCK BALLS because they try to sum over the subsets excluding the player in question (usually called &amp;quot;player i&amp;quot;). so here we go, here&#039;s a TRUE REDPILLED exposition of the shapley value!&lt;br /&gt;
&lt;br /&gt;
first of all, what&#039;s the shapley value even trying to do? once we understand it in words, we can just convert our verbal understanding into symbols. and then we will be done.&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
So, the shapley value is an &#039;&#039;average&#039;&#039;. but what kind of average? an &#039;&#039;arithmetic average&#039;&#039;. well, an arithmetic average takes a specific form. it looks like this. if you&#039;re averaging the elements of some set &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;, then the arithmetic average &amp;lt;math&amp;gt;\bar{X}&amp;lt;/math&amp;gt; is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\bar X = \frac{1}{|X|} \sum_{x\in X} f(x)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We throw in the function f because the elements of X might not be numbers. or even if they &#039;&#039;are&#039;&#039; numbers, you might want to apply some weighting other than the default one (the identity function).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Now, let&#039;s take the ugly-ass formula for the shapley value that you always see:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\ (n - |S| - 1)!}{n!} (v(S \cup \{i\}) - v(S))&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
how is &#039;&#039;that&#039;&#039; supposed to be an average? well first of all, we said above that the shapley value is averaging over all &#039;&#039;sequences&#039;&#039; of ways to add the n players. one way to formalize the concept of a &amp;quot;sequence&amp;quot; or &amp;quot;ordering&amp;quot; is to use permutations. a permutation is just a function that reorders the elements of a set. so each sequence corresponds to a permutation. we can encode a sequence &amp;lt;math&amp;gt;(x_1, x_2, \ldots, x_n)&amp;lt;/math&amp;gt; by defining the permutation &amp;lt;math&amp;gt;\sigma(k) := x_k&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
So in what sense is the shapley value an average? if &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the set of players, then we can define the set of all permutations &amp;lt;math&amp;gt;\mathrm{Sym}(N)&amp;lt;/math&amp;gt; on &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt;. (This is also denoted as &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt; and called the &amp;quot;symmetric group of degree n&amp;quot; since &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the &amp;quot;default&amp;quot; set of size n.)&lt;br /&gt;
&lt;br /&gt;
since the shapley value is an &#039;&#039;average&#039;&#039; and we are in particular averaging over all sequences, we want to rewrite the formula as something that looks like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f(\sigma)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
And in fact, at this point we know enough to convert our verbal understanding into a formula like the one above.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\varphi_i(v) = \frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} (v(\{k : \sigma(k) &amp;lt; \sigma(i)\} \cup \{i\}) - v(\{k : \sigma(k) &amp;lt; \sigma(i)\}))&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
a relevant fact is that the size of &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt; is &amp;lt;math&amp;gt;n!&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is another way to look at this. or rather, a way to extend our understanding. a common thing done in algebra is to [https://en.wikipedia.org/wiki/Symmetrization#n_variables &#039;&#039;symmetrize&#039;&#039;] a function by adding up all the permutations of the variables.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
the Shapley value is &amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f_i(\sigma(1), \ldots, \sigma(n))&amp;lt;/math&amp;gt;&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3566</id>
		<title>User:IssaRice/Shapley value</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3566"/>
		<updated>2023-04-08T04:40:28Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;most expositions of the Shapley value SUCK BALLS because they try to sum over the subsets excluding the player in question (usually called &amp;quot;player i&amp;quot;). so here we go, here&#039;s a TRUE REDPILLED exposition of the shapley value!&lt;br /&gt;
&lt;br /&gt;
first of all, what&#039;s the shapley value even trying to do? once we understand it in words, we can just convert our verbal understanding into symbols. and then we will be done.&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
So, the shapley value is an &#039;&#039;average&#039;&#039;. but what kind of average? an &#039;&#039;arithmetic average&#039;&#039;. well, an arithmetic average takes a specific form. it looks like this. if you&#039;re averaging the elements of some set &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;, then the arithmetic average &amp;lt;math&amp;gt;\bar{X}&amp;lt;/math&amp;gt; is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\bar X = \frac{1}{|X|} \sum_{x\in X} f(x)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We throw in the function f because the elements of X might not be numbers. or even if they &#039;&#039;are&#039;&#039; numbers, you might want to apply some weighting other than the default one (the identity function).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Now, let&#039;s take the ugly-ass formula for the shapley value that you always see:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\ (n - |S| - 1)!}{n!} (v(S \cup \{i\}) - v(S))&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
how is &#039;&#039;that&#039;&#039; supposed to be an average? well first of all, we said above that the shapley value is averaging over all &#039;&#039;sequences&#039;&#039; of ways to add the n players. one way to formalize the concept of a &amp;quot;sequence&amp;quot; or &amp;quot;ordering&amp;quot; is to use permutations. a permutation is just a function that reorders the elements of a set. so each sequence corresponds to a permutation. we can encode a sequence &amp;lt;math&amp;gt;(x_1, x_2, \ldots, x_n)&amp;lt;/math&amp;gt; by defining the permutation &amp;lt;math&amp;gt;\sigma(k) := x_k&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
So in what sense is the shapley value an average? if &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the set of players, then we can define the set of all permutations &amp;lt;math&amp;gt;\mathrm{Sym}(N)&amp;lt;/math&amp;gt; on &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt;. (This is also denoted as &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt; and called the &amp;quot;symmetric group of degree n&amp;quot; since &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the &amp;quot;default&amp;quot; set of size n.)&lt;br /&gt;
&lt;br /&gt;
since the shapley value is an &#039;&#039;average&#039;&#039; and we are in particular averaging over all sequences, we want to rewrite the formula as something that looks like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f(\sigma)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
And in fact, at this point we know enough to convert our verbal understanding into a formula like the one above.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\varphi_i(v) = \frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} (v(\{k : \sigma(k) &amp;lt; \sigma(i)\} \cup \{i\}) - v(\{k : \sigma(k) &amp;lt; \sigma(i)\}))&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
a relevant fact is that the size of &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt; is &amp;lt;math&amp;gt;n!&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
the Shapley value is &amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f_i(\sigma(1), \ldots, \sigma(n))&amp;lt;/math&amp;gt;&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3565</id>
		<title>User:IssaRice/Shapley value</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3565"/>
		<updated>2023-04-08T04:30:45Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;most expositions of the Shapley value SUCK BALLS because they try to sum over the subsets excluding the player in question (usually called &amp;quot;player i&amp;quot;). so here we go, here&#039;s a TRUE REDPILLED exposition of the shapley value!&lt;br /&gt;
&lt;br /&gt;
first of all, what&#039;s the shapley value even trying to do? once we understand it in words, we can just convert our verbal understanding into symbols. and then we will be done.&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
So, the shapley value is an &#039;&#039;average&#039;&#039;. but what kind of average? an &#039;&#039;arithmetic average&#039;&#039;. well, an arithmetic average takes a specific form. it looks like this. if you&#039;re averaging the elements of some set &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;, then the arithmetic average &amp;lt;math&amp;gt;\bar{X}&amp;lt;/math&amp;gt; is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\bar X = \frac{1}{|X|} \sum_{x\in X} f(x)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We throw in the function f because the elements of X might not be numbers. or even if they &#039;&#039;are&#039;&#039; numbers, you might want to apply some weighting other than the default one (the identity function).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Now, let&#039;s take the ugly-ass formula for the shapley value that you always see:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\ (n - |S| - 1)!}{n!} (v(S \cup \{i\}) - v(S))&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
how is &#039;&#039;that&#039;&#039; supposed to be an average? well first of all, we said above that the shapley value is averaging over all &#039;&#039;sequences&#039;&#039; of ways to add the n players. one way to formalize the concept of a &amp;quot;sequence&amp;quot; or &amp;quot;ordering&amp;quot; is to use permutations. a permutation is just a function that reorders the elements of a set. so each sequence corresponds to a permutation. we can encode a sequence &amp;lt;math&amp;gt;(x_1, x_2, \ldots, x_n)&amp;lt;/math&amp;gt; by defining the permutation &amp;lt;math&amp;gt;\sigma(k) := x_k&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
So in what sense is the shapley value an average? if &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the set of players, then we can define the set of all permutations &amp;lt;math&amp;gt;\mathrm{Sym}(N)&amp;lt;/math&amp;gt; on &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt;. (This is also denoted as &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt; and called the &amp;quot;symmetric group of degree n&amp;quot; since &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the &amp;quot;default&amp;quot; set of size n.)&lt;br /&gt;
&lt;br /&gt;
since the shapley value is an &#039;&#039;average&#039;&#039; and we are in particular averaging over all sequences, we want to rewrite the formula as something that looks like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f(\sigma)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
And in fact, at this point we know enough to convert our verbal understanding into a formula like the one above.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\phi_i(v) = \frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} (v(\{k : \sigma(k) &amp;lt; \sigma(i)\} \cup \{i\}) - v(\{k : \sigma(k) &amp;lt; \sigma(i)\}))&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
a relevant fact is that the size of &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt; is &amp;lt;math&amp;gt;n!&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
the Shapley value is &amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f_i(\sigma(1), \ldots, \sigma(n))&amp;lt;/math&amp;gt;&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3564</id>
		<title>User:IssaRice/Shapley value</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3564"/>
		<updated>2023-04-08T04:30:20Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;most expositions of the Shapley value SUCK BALLS because they try to sum over the subsets excluding the player in question (usually called &amp;quot;player i&amp;quot;). so here we go, here&#039;s a TRUE REDPILLED exposition of the shapley value!&lt;br /&gt;
&lt;br /&gt;
first of all, what&#039;s the shapley value even trying to do? once we understand it in words, we can just convert our verbal understanding into symbols. and then we will be done.&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
So, the shapley value is an &#039;&#039;average&#039;&#039;. but what kind of average? an &#039;&#039;arithmetic average&#039;&#039;. well, an arithmetic average takes a specific form. it looks like this. if you&#039;re averaging the elements of some set &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;, then the arithmetic average &amp;lt;math&amp;gt;\bar{X}&amp;lt;/math&amp;gt; is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\bar X = \frac{1}{|X|} \sum_{x\in X} f(x)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We throw in the function f because the elements of X might not be numbers. or even if they &#039;&#039;are&#039;&#039; numbers, you might want to apply some weighting other than the default one (the identity function).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Now, let&#039;s take the ugly-ass formula for the shapley value that you always see:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\ (n - |S| - 1)!}{n!} (v(S \cup \{i\}) - v(S))&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
how is &#039;&#039;that&#039;&#039; supposed to be an average? well first of all, we said above that the shapley value is averaging over all &#039;&#039;sequences&#039;&#039; of ways to add the n players. one way to formalize the concept of a &amp;quot;sequence&amp;quot; or &amp;quot;ordering&amp;quot; is to use permutations. a permutation is just a function that reorders the elements of a set. so each sequence corresponds to a permutation. we can encode a sequence &amp;lt;math&amp;gt;(x_1, x_2, \ldots, x_n)&amp;lt;/math&amp;gt; by defining the permutation &amp;lt;math&amp;gt;\sigma(k) := x_k&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
So in what sense is the shapley value an average? if &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the set of players, then we can define the set of all permutations &amp;lt;math&amp;gt;\mathrm{Sym}(N)&amp;lt;/math&amp;gt; on &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt;. (This is also denoted as &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt; and called the &amp;quot;symmetric group of degree n&amp;quot; since &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the &amp;quot;default&amp;quot; set of size n.)&lt;br /&gt;
&lt;br /&gt;
since the shapley value is an &#039;&#039;average&#039;&#039; and we are in particular averaging over all sequences, we want to rewrite the formula as something that looks like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f(\sigma)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
And in fact, at this point we know enough to convert our verbal understanding into a formula like the one above.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\phi_i(v) = \frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} (v(\{k : \sigma(k) &amp;lt; \sigma(i)\} \cup \{i\}) - v(\{k : \sigma(k) &amp;lt; \sigma(i)\}))&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
a relevant fact is that the size of &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt; is &amp;lt;math&amp;gt;n!&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
the Shapley value is &amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f_i(\sigma(1), \ldots, \sigma(n))&amp;lt;/math&amp;gt;&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3563</id>
		<title>User:IssaRice/Shapley value</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3563"/>
		<updated>2023-04-08T04:27:26Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;most expositions of the Shapley value SUCK BALLS because they try to sum over the subsets excluding the player in question (usually called &amp;quot;player i&amp;quot;). so here we go, here&#039;s a TRUE REDPILLED exposition of the shapley value!&lt;br /&gt;
&lt;br /&gt;
first of all, what&#039;s the shapley value even trying to do? once we understand it in words, we can just convert our verbal understanding into symbols. and then we will be done.&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
So, the shapley value is an &#039;&#039;average&#039;&#039;. but what kind of average? an &#039;&#039;arithmetic average&#039;&#039;. well, an arithmetic average takes a specific form. it looks like this. if you&#039;re averaging the elements of some set &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;, then the arithmetic average &amp;lt;math&amp;gt;\bar{X}&amp;lt;/math&amp;gt; is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\bar X = \frac{1}{|X|} \sum_{x\in X} f(x)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We throw in the function f because the elements of X might not be numbers. or even if they &#039;&#039;are&#039;&#039; numbers, you might want to apply some weighting other than the default one (the identity function).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Now, let&#039;s take the ugly-ass formula for the shapley value that you always see:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\ (n - |S| - 1)!}{n!} (v(S \cup \{i\}) - v(S))&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
how is &#039;&#039;that&#039;&#039; supposed to be an average? well first of all, we said above that the shapley value is averaging over all &#039;&#039;sequences&#039;&#039; of ways to add the n players. one way to formalize the concept of a &amp;quot;sequence&amp;quot; or &amp;quot;ordering&amp;quot; is to use permutations. a permutation is just a function that reorders the elements of a set. so each sequence corresponds to a permutation. we can encode a sequence &amp;lt;math&amp;gt;(x_1, x_2, \ldots, x_n)&amp;lt;/math&amp;gt; by defining the permutation &amp;lt;math&amp;gt;\sigma(k) := x_k&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
So in what sense is the shapley value an average? if &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the set of players, then we can define the set of all permutations &amp;lt;math&amp;gt;\mathrm{Sym}(N)&amp;lt;/math&amp;gt; on &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt;. (This is also denoted as &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt; and called the &amp;quot;symmetric group of degree n&amp;quot; since &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the &amp;quot;default&amp;quot; set of size n.)&lt;br /&gt;
&lt;br /&gt;
since the shapley value is an &#039;&#039;average&#039;&#039; and we are in particular averaging over all sequences, we want to rewrite the formula as something that looks like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f(\sigma)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
a relevant fact is that the size of &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt; is &amp;lt;math&amp;gt;n!&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
the Shapley value is &amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f_i(\sigma(1), \ldots, \sigma(n))&amp;lt;/math&amp;gt;&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3562</id>
		<title>User:IssaRice/Shapley value</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3562"/>
		<updated>2023-04-08T04:26:41Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;most expositions of the Shapley value SUCK BALLS because they try to sum over the subsets excluding the player in question (usually called &amp;quot;player i&amp;quot;). so here we go, here&#039;s a TRUE REDPILLED exposition of the shapley value!&lt;br /&gt;
&lt;br /&gt;
first of all, what&#039;s the shapley value even trying to do? once we understand it in words, we can just convert our verbal understanding into symbols. and then we will be done.&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
So, the shapley value is an &#039;&#039;average&#039;&#039;. but what kind of average? an &#039;&#039;arithmetic average&#039;&#039;. well, an arithmetic average takes a specific form. it looks like this. if you&#039;re averaging the elements of some set &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;, then the arithmetic average &amp;lt;math&amp;gt;\bar{X}&amp;lt;/math&amp;gt; is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\bar X = \frac{1}{|X|} \sum_{x\in X} f(x)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We throw in the function f because the elements of X might not be numbers. or even if they &#039;&#039;are&#039;&#039; numbers, you might want to apply some weighting other than the default one (the identity function).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Now, let&#039;s take the ugly-ass formula for the shapley value that you always see:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\ (n - |S| - 1)!}{n!} (v(S \cup \{i\}) - v(S))&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
how is &#039;&#039;that&#039;&#039; supposed to be an average? well first of all, we said above that the shapley value is averaging over all &#039;&#039;sequences&#039;&#039; of ways to add the n players. one way to formalize the concept of a &amp;quot;sequence&amp;quot; or &amp;quot;ordering&amp;quot; is to use permutations. a permutation is just a function that reorders the elements of a set. so each sequence corresponds to a permutation. we can encode a sequence &amp;lt;math&amp;gt;(x_1, x_2, \ldots, x_n)&amp;lt;/math&amp;gt; by defining the permutation &amp;lt;math&amp;gt;\sigma(k) := x_k&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
So in what sense is the shapley value an average? if &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the set of players, then we can define the set of all permutations &amp;lt;math&amp;gt;\mathrm{Sym}(N)&amp;lt;/math&amp;gt; on &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt;. (This is also denoted as &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt; and called the &amp;quot;symmetric group of degree n&amp;quot; since &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the &amp;quot;default&amp;quot; set of size n.)&lt;br /&gt;
&lt;br /&gt;
since the shapley value is an &#039;&#039;average&#039;&#039; and we are in particular averaging over all sequences, we want to rewrite the formula as something that looks like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f(\sigma)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
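a relevant fact is that the size of &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt; is &amp;lt;math&amp;gt;n!&amp;lt;/math&amp;gt;.&lt;br /&gt;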
&lt;br /&gt;
&lt;br /&gt;
the Shapley value is &amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f_i(\sigma(1), \ldots, \sigma(n))&amp;lt;/math&amp;gt;&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3561</id>
		<title>User:IssaRice/Shapley value</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3561"/>
		<updated>2023-04-08T04:21:44Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;most expositions of the Shapley value SUCK BALLS because they try to sum over the subsets excluding the player in question (usually called &amp;quot;player i&amp;quot;). so here we go, here&#039;s a TRUE REDPILLED exposition of the shapley value!&lt;br /&gt;
&lt;br /&gt;
first of all, what&#039;s the shapley value even trying to do? once we understand it in words, we can just convert our verbal understanding into symbols. and then we will be done.&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
So, the shapley value is an &#039;&#039;average&#039;&#039;. but what kind of average? an &#039;&#039;arithmetic average&#039;&#039;. well, an arithmetic average takes a specific form. it looks like this. if you&#039;re averaging the elements of some set &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;, then the arithmetic average &amp;lt;math&amp;gt;\bar{X}&amp;lt;/math&amp;gt; is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\bar X = \frac{1}{|X|} \sum_{x\in X} f(x)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We throw in the function f because the elements of X might not be numbers. or even if they &#039;&#039;are&#039;&#039; numbers, you might want to apply some weighting other than the default one (the identity function).&lt;br /&gt;
&lt;br /&gt;
So in what sense is the shapley value an average? if &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the set of players, then we can define the set of all permutations &amp;lt;math&amp;gt;\mathrm{Sym}(N)&amp;lt;/math&amp;gt; on &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt;. (This is also denoted as &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt; and called the &amp;quot;symmetric group of degree n&amp;quot; since &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the &amp;quot;default&amp;quot; set of size n.)&lt;br /&gt;
&lt;br /&gt;
Now, let&#039;s take the ugly-ass formula for the shapley value that you always see:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\ (n - |S| - 1)!}{n!} (v(S \cup \{i\}) - v(S))&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
how is &#039;&#039;that&#039;&#039; supposed to be an average?&lt;br /&gt;
&lt;br /&gt;
the Shapley value is &amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f_i(\sigma(1), \ldots, \sigma(n))&amp;lt;/math&amp;gt;&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3560</id>
		<title>User:IssaRice/Shapley value</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3560"/>
		<updated>2023-04-08T04:21:11Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;most expositions of the Shapley value SUCK BALLS because they try to sum over the subsets excluding the player in question (usually called &amp;quot;player i&amp;quot;). so here we go, here&#039;s a TRUE REDPILLED exposition of the shapley value!&lt;br /&gt;
&lt;br /&gt;
first of all, what&#039;s the shapley value even trying to do? once we understand it in words, we can just convert our verbal understanding into symbols. and then we will be done.&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
So, the shapley value is an &#039;&#039;average&#039;&#039;. but what kind of average? an &#039;&#039;arithmetic average&#039;&#039;. well, an arithmetic average takes a specific form. it looks like this. if you&#039;re averaging the elements of some set &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;, then the arithmetic average &amp;lt;math&amp;gt;\bar{X}&amp;lt;/math&amp;gt; is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\bar X = \frac{1}{|X|} \sum_{x\in X} f(x)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We throw in the function f because the elements of X might not be numbers. or even if they &#039;&#039;are&#039;&#039; numbers, you might want to apply some weighting other than the default one (the identity function).&lt;br /&gt;
&lt;br /&gt;
So in what sense is the shapley value an average? if &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the set of players, then we can define the set of all permutations &amp;lt;math&amp;gt;\mathrm{Sym}(N)&amp;lt;/math&amp;gt; on &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt;. (This is also denoted as &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt; and called the &amp;quot;symmetric group of degree n&amp;quot; since &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the &amp;quot;default&amp;quot; set of size n.)&lt;br /&gt;
&lt;br /&gt;
Now, let&#039;s take the ugly-ass formula for the shapley value that you always see:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\ (n - |S| - 1)!}{n!} (v(S \cup \{i\}) - v(S))&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
the Shapley value is &amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f_i(\sigma(1), \ldots, \sigma(n))&amp;lt;/math&amp;gt;&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3559</id>
		<title>User:IssaRice/Shapley value</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3559"/>
		<updated>2023-04-08T04:20:00Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;most expositions of the Shapley value SUCK BALLS because they try to sum over the subsets excluding the player in question (usually called &amp;quot;player i&amp;quot;). so here we go, here&#039;s a TRUE REDPILLED exposition of the shapley value!&lt;br /&gt;
&lt;br /&gt;
first of all, what&#039;s the shapley value even trying to do? once we understand it in words, we can just convert our verbal understanding into symbols. and then we will be done.&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
So, the shapley value is an &#039;&#039;average&#039;&#039;. but what kind of average? an &#039;&#039;arithmetic average&#039;&#039;. well, an arithmetic average takes a specific form. it looks like this. if you&#039;re averaging the elements of some set &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;, then the arithmetic average &amp;lt;math&amp;gt;\bar{X}&amp;lt;/math&amp;gt; is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\bar X = \frac{1}{|X|} \sum_{x\in X} f(x)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We throw in the function f because the elements of X might not be numbers. or even if they &#039;&#039;are&#039;&#039; numbers, you might want to apply some weighting other than the default one (the identity function).&lt;br /&gt;
&lt;br /&gt;
So in what sense is the shapley value an average? if &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the set of players, then we can define the set of all permutations &amp;lt;math&amp;gt;\mathrm{Sym}(N)&amp;lt;/math&amp;gt; on &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt;. (This is also denoted as &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt; and called the &amp;quot;symmetric group of degree n&amp;quot; since &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the &amp;quot;default&amp;quot; set of size n.)&lt;br /&gt;
&lt;br /&gt;
Now, let&#039;s take the ugly-ass formula for the shapley value that you always see:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\frac{1}{n!} \sum_{S \subseteq N \setminus \{i\}} |S|! (n - |S| - 1)! (v(S \cup \{i\}) - v(S))&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
the Shapley value is &amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f_i(\sigma(1), \ldots, \sigma(n))&amp;lt;/math&amp;gt;&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3558</id>
		<title>User:IssaRice/Shapley value</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3558"/>
		<updated>2023-04-08T04:18:09Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;most expositions of the Shapley value SUCK BALLS because they try to sum over the subsets excluding the player in question (usually called &amp;quot;player i&amp;quot;). so here we go, here&#039;s a TRUE REDPILLED exposition of the shapley value!&lt;br /&gt;
&lt;br /&gt;
first of all, what&#039;s the shapley value even trying to do? once we understand it in words, we can just convert our verbal understanding into symbols. and then we will be done.&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
So, the shapley value is an &#039;&#039;average&#039;&#039;. but what kind of average? an &#039;&#039;arithmetic average&#039;&#039;. well, an arithmetic average takes a specific form. it looks like this. if you&#039;re averaging the elements of some set &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;, then the arithmetic average &amp;lt;math&amp;gt;\bar{X}&amp;lt;/math&amp;gt; is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\bar X = \frac{1}{|X|} \sum_{x\in X} f(x)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We throw in the function f because the elements of X might not be numbers. or even if they &#039;&#039;are&#039;&#039; numbers, you might want to apply some weighting other than the default one (the identity function).&lt;br /&gt;
&lt;br /&gt;
So in what sense is the shapley value an average? if &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the set of players, then we can define the set of all permutations &amp;lt;math&amp;gt;\mathrm{Sym}(N)&amp;lt;/math&amp;gt; on &amp;lt;math&amp;gt;N&amp;lt;/math&amp;gt;. (This is also denoted as &amp;lt;math&amp;gt;\mathrm{Sym}(n)&amp;lt;/math&amp;gt; and called the &amp;quot;symmetric group of degree n&amp;quot; since &amp;lt;math&amp;gt;N = \{1, \ldots, n\}&amp;lt;/math&amp;gt; is the &amp;quot;default&amp;quot; set of size n.)&lt;br /&gt;
&lt;br /&gt;
the Shapley value is &amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f_i(\sigma(1), \ldots, \sigma(n))&amp;lt;/math&amp;gt;&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3557</id>
		<title>User:IssaRice/Shapley value</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3557"/>
		<updated>2023-04-08T04:15:42Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;most expositions of the Shapley value SUCK BALLS because they try to sum over the subsets excluding the player in question (usually called &amp;quot;player i&amp;quot;). so here we go, here&#039;s a TRUE REDPILLED exposition of the shapley value!&lt;br /&gt;
&lt;br /&gt;
first of all, what&#039;s the shapley value even trying to do? once we understand it in words, we can just convert our verbal understanding into symbols. and then we will be done.&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
So, the shapley value is an &#039;&#039;average&#039;&#039;. but what kind of average? an &#039;&#039;arithmetic average&#039;&#039;. well, an arithmetic average takes a specific form. it looks like this. if you&#039;re averaging the elements of some set &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;, then the arithmetic average &amp;lt;math&amp;gt;\bar{X}&amp;lt;/math&amp;gt; is&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\frac{1}{|X|} \sum_{x\in X} f(x)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We throw in the function f because the elements of X might not be numbers. or even if they &#039;&#039;are&#039;&#039; numbers, you might want to apply some weighting other than the default one (the identity function).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
the Shapley value is &amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f_i(\sigma(1), \ldots, \sigma(n))&amp;lt;/math&amp;gt;&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3556</id>
		<title>User:IssaRice/Shapley value</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3556"/>
		<updated>2023-04-08T04:12:43Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;most expositions of the Shapley value SUCK BALLS because they try to sum over the subsets excluding the player in question (usually called &amp;quot;player i&amp;quot;). so here we go, here&#039;s a TRUE REDPILLED exposition of the shapley value!&lt;br /&gt;
&lt;br /&gt;
first of all, what&#039;s the shapley value even trying to do? once we understand it in words, we can just convert our verbal understanding into symbols. and then we will be done.&lt;br /&gt;
&lt;br /&gt;
the Shapley value is &amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f_i(\sigma(1), \ldots, \sigma(n))&amp;lt;/math&amp;gt;&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3555</id>
		<title>User:IssaRice/Shapley value</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3555"/>
		<updated>2023-04-08T04:11:47Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;most expositions of the Shapley value SUCK BALLS because they try to sum over the subsets excluding the player in question (usually called &amp;quot;player i&amp;quot;). so here we go, here&#039;s a TRUE REDPILLED exposition of the shapley value!&lt;br /&gt;
&lt;br /&gt;
the Shapley value is &amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f_i(\sigma(1), \ldots, \sigma(n))&amp;lt;/math&amp;gt;&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3554</id>
		<title>User:IssaRice/Shapley value</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3554"/>
		<updated>2023-04-08T04:10:07Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;.&lt;br /&gt;
&lt;br /&gt;
the Shapley value is &amp;lt;math&amp;gt;\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f_i(\sigma(1), \ldots, \sigma(n))&amp;lt;/math&amp;gt;&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3553</id>
		<title>User:IssaRice/Shapley value</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Shapley_value&amp;diff=3553"/>
		<updated>2023-04-08T04:04:56Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: Created page with &amp;quot;.&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;.&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Linear_algebra/Singular_value_decomposition&amp;diff=3552</id>
		<title>User:IssaRice/Linear algebra/Singular value decomposition</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Linear_algebra/Singular_value_decomposition&amp;diff=3552"/>
		<updated>2022-10-01T21:10:52Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;the stupid textbooks don&#039;t tell you anything about SVD!!!! i think it&#039;s super helpful to look at all the &#039;&#039;wrong&#039;&#039; things one might say about SVD... we need to un-knot all those wrong intuitions. i&#039;ll list some knots that i have had.&lt;br /&gt;
&lt;br /&gt;
starting at this image: https://en.wikipedia.org/wiki/File:Singular-Value-Decomposition.svg&lt;br /&gt;
&lt;br /&gt;
* if A is an invertible matrix, then &amp;lt;math&amp;gt;A = E_1 \cdots E_m&amp;lt;/math&amp;gt; for some elementary matrices &amp;lt;math&amp;gt;E_1,\ldots,E_m&amp;lt;/math&amp;gt;. Swapping elementary matrices are orthogonal, and dilations are already diagonal. So we can write A as a product of orthogonal, diagonal, and shear matrices (the product of two orthogonal matrices is again orthogonal. right???). If we can prove SVD for shears, we can convert this to an alternating product of orthogonal and &#039;&#039;diagonal&#039;&#039; matrices. unfortunately, this doesn&#039;t seem to lead to a full proof of SVD (unless orthogonal and diagonal matrices somehow commute).&lt;br /&gt;
* one question one might have is, to get the behavior of M in the linked image, can&#039;t we just squish along the standard basis directions, then rotate? surely this would produce the same ellipse. And it would seem that we&#039;ve only required one rotation, instead of the two in SVD. That&#039;s true, but pay attention to where the basis vectors went. A squish followed by a rotation... would preserve orthogonality. But in M it is clear that these basis vectors are no longer orthogonal. So even though we have faithfully preserved the ellipse, we don&#039;t have the same transformation. i.e. &amp;lt;math&amp;gt;M(\{v : \|v\| = 1\}) = M&#039;(\{v : \|v\|=1\})&amp;lt;/math&amp;gt; need not imply &amp;lt;math&amp;gt;M=M&#039;&amp;lt;/math&amp;gt;, apparently.  This must be an artifact of the fact that a circle is an extremely symmetric shape, so lots of non-identical transformations can still produce the same image of a circle. I think if we started out with a square, we would not have the same image if we just instead stretched and then did a rotation (actually, maybe a square too is still too symmetric; see example [https://youtu.be/vSczTbgc8Rc?list=PLnQX-jgAF5pTZXPiD8ciEARRylD9brJXU&amp;amp;t=679 here]).&lt;br /&gt;
* (polar decomposition.) In the linked image, look at the axes of the final ellipse, labeled &amp;lt;math&amp;gt;\sigma_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\sigma_2&amp;lt;/math&amp;gt;. Call those vectors &amp;lt;math&amp;gt;u_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;u_2&amp;lt;/math&amp;gt;. So &amp;lt;math&amp;gt;u_1 = Mv_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;u_2 = Mv_2&amp;lt;/math&amp;gt; for some vectors v1 and v2. Now, backtrack along the arrows, starting from the final image, going through U, then Sigma, then V*. Pay attention to what it does to u1 and u2. In each step, the vectors remain orthogonal. So not only are u1 and u2 orthogonal, we must have that v1 and v2 are orthogonal. So now, couldn&#039;t we say, &amp;quot;take v1 and v2, squish along those axes. then rotate.&amp;quot; That seems to have required only one rotation. What&#039;s going on? The problem is that a diagonal matrix can only stretch along the standard basis. So &amp;quot;stretch along v1 and v2&amp;quot; can&#039;t be done via a diagonal matrix (unless v1 and v2 are the standard basis, of course). Let&#039;s say &amp;lt;math&amp;gt;M = RD&amp;lt;/math&amp;gt; where R is a rotation, and D is &amp;quot;stretch along v1 and v2&amp;quot;. So &amp;lt;math&amp;gt;Dv_1 = \lambda_1 v_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;Dv_2 = \lambda_2 v_2&amp;lt;/math&amp;gt;. Now, D is not a diagonal matrix, when viewed in the standard basis, but it &#039;&#039;is&#039;&#039; a diagonal matrix when viewed under the basis v1,v2. To get to the standard basis, we need to convert the incoming vectors like v1 into e1, then apply the stretching, then reconvert back to v1. In other words, we want to show &amp;lt;math&amp;gt;D = U\Sigma U^*&amp;lt;/math&amp;gt; for some diagonal matrix Sigma and orthogonal U. Just take &amp;lt;math&amp;gt;Ue_j=v_j&amp;lt;/math&amp;gt;, i.e. the matrix U is the matrix with columns [v1 v2]. 
Now, &amp;lt;math&amp;gt;Dv_j = U\Sigma U^* v_j = U\Sigma e_j = U\lambda_j e_j = \lambda_j v_j&amp;lt;/math&amp;gt; just like we wanted. So &amp;lt;math&amp;gt;M = RD = RU\Sigma U^*&amp;lt;/math&amp;gt;. Of course, RU is another orthogonal matrix, so we recover SVD.&lt;br /&gt;
** also, if any of the entries of &amp;lt;math&amp;gt;\Sigma&amp;lt;/math&amp;gt; are negative, we could have chosen a different v vector that would have made it positive, so we can assume that D is a positive operator (positive semi-definite).&lt;br /&gt;
* another question is: if we squish along the right orthogonal directions, can&#039;t we get away with not needing the extra rotation? after all, an ellipse can be squished into a circle without any rotations. what must be the case, although i can&#039;t explain it visually yet, is that if we do just that, then the standard basis vectors (yellow and pink) get mapped to the wrong spots. this &#039;&#039;might&#039;&#039; be an artifact of shears (the wikipedia SVD image is a shear). clearer to look at michael nielsen&#039;s [http://cognitivemedium.com/emm/images/tangent_definition.png image]. here, if we start with the ellipse and shrink along Ms and stretch along Mt, we do get a circle. but Ms doesn&#039;t go back to s, and Mt doesn&#039;t go back to t; for that we&#039;ll need the extra rotation.&lt;br /&gt;
* [https://math.stackexchange.com/questions/2899052/singular-value-decomposition-reconciling-the-maximal-stretching-and-spectral my old question]:&lt;br /&gt;
** what is this &amp;lt;math&amp;gt;\sqrt{T^*T}&amp;lt;/math&amp;gt; that axler keeps talking about? that&#039;s the &amp;quot;stretch along well-chosen orthogonal directions&amp;quot; operation that we start out with in polar decomposition.&lt;br /&gt;
** for proof (1), see [http://cognitivemedium.com/emm/emm.html michael nielsen]. basically, the maximal stretching direction has a tangent vector (on the ellipse) that is orthogonal to it, because if it &#039;&#039;wasn&#039;t&#039;&#039; orthogonal, then we could get an even more stretched out vector. the other piece that&#039;s required is that linear maps preserve tangency. i.e. if v(t) is a parametrization of a circle, and M is a matrix, then M(v(t)) traces out an ellipse as t varies. (i&#039;m using t as a parameter even though nielsen uses it as a vector. seriously, who the heck uses t for a vector??) the tangent vector on the circle at v(t) is v&#039;(t). this tangent vector gets mapped to M(v&#039;(t)). the tangent vector at M(v(t)) on the ellipse is &amp;lt;math display=inline&amp;gt;\frac{d}{dt} M(v(t))&amp;lt;/math&amp;gt;. now, by linearity of M and the definition of the derivative, we can basically &amp;quot;pull out&amp;quot; the M and see that &amp;lt;math display=inline&amp;gt;\frac{d}{dt} M(v(t)) = M(\frac{d}{dt} v(t))&amp;lt;/math&amp;gt;.&amp;lt;ref group=note&amp;gt;&amp;lt;math&amp;gt;\frac{d}{dt} M(v(t)) = \lim_{h\to0} \frac{Mv(t+h) - Mv(t)}{h} = \lim_{h\to0} \frac1h M(v(t+h)-v(t)) = \lim_{h\to0}M(\frac1h (v(t+h)-v(t))) = M(\frac{d}{dt} v(t))&amp;lt;/math&amp;gt;. You might also want to play around with an example like &amp;lt;math&amp;gt;\begin{pmatrix}3 &amp;amp; 0\\ 0 &amp;amp; 4\end{pmatrix}&amp;lt;/math&amp;gt;, which takes (cos t, sin t) to (3cos t, 4sin t). The tangent at the original point is (-sin t, cos t). The tangent at the image is (-3sin t, 4cos t), which is equal to the image of the tangent.&amp;lt;/ref&amp;gt; what this means is that if you have a point on the circle and its tangent, then you map both of them under M, then the tangent of the image of the point is the image of the tangent at the point.&amp;lt;ref group=note&amp;gt; I think another way to see this is via uniqueness of taylor approximations? 
like if v is a point on the circle, and u is the tangent vector at v, then points near v can be written as &amp;lt;math&amp;gt;v + \Delta u + O(\Delta^2)&amp;lt;/math&amp;gt;, and if we apply M to those points, we get &amp;lt;math&amp;gt;Mv + \Delta Mu + O(\Delta^2)&amp;lt;/math&amp;gt;. if taylor approximations are unique, then the fact that the term linear in &amp;lt;math&amp;gt;\Delta&amp;lt;/math&amp;gt; has Mu means that Mu must be tangent at Mv.&amp;lt;/ref&amp;gt; what this implies is that for our maximal stretch vector, since the tangent on the circle is orthogonal, the image of that tangent is also a tangent at the new place on the ellipse, and we already know that the tangent is orthogonal for the maximal stretch vector.&lt;br /&gt;
** so how does (2) find the same basis without talking about &amp;quot;maximal stretching&amp;quot;? well, in (2), &amp;lt;math&amp;gt;\sqrt{T^*T}&amp;lt;/math&amp;gt; &#039;&#039;means&#039;&#039; &amp;quot;stretch along well-chosen orthogonal directions&amp;quot; -- it&#039;s the positive operator that appears in polar decomposition. and if we stretch along orthogonal directions, then surely one of them has to be the maximal stretching direction (rather than, say, some direction intermediate between two of the axes).&lt;br /&gt;
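&lt;br /&gt;
The polar decomposition bullet above (M = RD, with D a stretch along orthogonal directions that is not diagonal in the standard basis) can be sketched in plain Python for the 2x2 case. This is a rough check under assumptions: the closed-form angle is not from this page, and it is only meant for 2x2 real matrices with positive determinant. The shear matrix is a hypothetical stand-in for the one in the Wikipedia picture.&lt;br /&gt;
&lt;br /&gt;
```python
import math


def matmul(A, B):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def polar_2x2(M):
    """Split M (positive determinant assumed) as M = R D, R a rotation, D symmetric."""
    (a, b), (c, d) = M
    theta = math.atan2(c - b, a + d)   # the angle that makes R^T M symmetric
    R = [[math.cos(theta), -math.sin(theta)],
         [math.sin(theta), math.cos(theta)]]
    R_T = [[R[0][0], R[1][0]], [R[0][1], R[1][1]]]
    D = matmul(R_T, M)                 # the "stretch along v1 and v2" part
    return R, D


M = [[1.0, 1.0], [0.0, 1.0]]           # a shear (assumed example matrix)
R, D = polar_2x2(M)
print("D =", D)                        # symmetric, with nonzero off-diagonal entries
print("R D =", matmul(R, D))           # reproduces M up to rounding
```
&lt;br /&gt;
D comes out symmetric but not diagonal, which matches the point that the stretch happens along v1 and v2 rather than along the standard basis; diagonalizing D as &amp;lt;math&amp;gt;U\Sigma U^*&amp;lt;/math&amp;gt; is what brings in the second orthogonal matrix.&lt;br /&gt;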
&lt;br /&gt;
==See also==&lt;br /&gt;
&lt;br /&gt;
* https://machinelearning.subwiki.org/wiki/User:IssaRice/Linear_algebra/Classification_of_operators -- performing SVD on some nicer operators allows you to skip some of the steps, resulting in a simpler decomposition.&lt;br /&gt;
&lt;br /&gt;
==Footnotes==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references group=&amp;quot;note&amp;quot;/&amp;gt;&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Linear_algebra/Singular_value_decomposition&amp;diff=3551</id>
		<title>User:IssaRice/Linear algebra/Singular value decomposition</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=User:IssaRice/Linear_algebra/Singular_value_decomposition&amp;diff=3551"/>
		<updated>2022-10-01T21:08:40Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;the stupid textbooks don&#039;t tell you anything about SVD!!!! i think it&#039;s super helpful to look at all the &#039;&#039;wrong&#039;&#039; things one might say about SVD... we need to un-knot all those wrong intuitions. i&#039;ll list some knots that i have had.&lt;br /&gt;
&lt;br /&gt;
starting at this image: https://en.wikipedia.org/wiki/File:Singular-Value-Decomposition.svg&lt;br /&gt;
&lt;br /&gt;
* if A is an invertible matrix, then &amp;lt;math&amp;gt;A = E_1 \cdots E_m&amp;lt;/math&amp;gt; for some elementary matrices &amp;lt;math&amp;gt;E_1,\ldots,E_m&amp;lt;/math&amp;gt;. Swapping elementary matrices are orthogonal, and dilations are already diagonal. So we can write A as a product of orthogonal, diagonal, and shear matrices (the product of two orthogonal matrices is again orthogonal. right???). If we can prove SVD for shears, we can convert this to an alternating product of orthogonal and &#039;&#039;diagonal&#039;&#039; matrices. unfortunately, this doesn&#039;t seem to lead to a full proof of SVD (unless orthogonal and diagonal matrices somehow commute).&lt;br /&gt;
* one question one might have is, to get the behavior of M in the linked image, can&#039;t we just squish along the standard basis directions, then rotate? surely this would produce the same ellipse. And it would seem that we&#039;ve only required one rotation, instead of the two in SVD. That&#039;s true, but pay attention to where the basis vectors went. A squish followed by a rotation... would preserve orthogonality. But in M it is clear that these basis vectors are no longer orthogonal. So even though we have faithfully preserved the ellipse, we don&#039;t have the same transformation. i.e. &amp;lt;math&amp;gt;M(\{v : \|v\| = 1\}) = M&#039;(\{v : \|v\|=1\})&amp;lt;/math&amp;gt; need not imply &amp;lt;math&amp;gt;M=M&#039;&amp;lt;/math&amp;gt;, apparently.  This must be an artifact of the fact that a circle is an extremely symmetric shape, so lots of non-identical transformations can still produce the same image of a circle. I think if we started out with a square, we would not have the same image if we just instead stretched and then did a rotation.&lt;br /&gt;
* (polar decomposition.) In the linked image, look at the axes of the final ellipse, labeled &amp;lt;math&amp;gt;\sigma_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\sigma_2&amp;lt;/math&amp;gt;. Call those vectors &amp;lt;math&amp;gt;u_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;u_2&amp;lt;/math&amp;gt;. So &amp;lt;math&amp;gt;u_1 = Mv_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;u_2 = Mv_2&amp;lt;/math&amp;gt; for some vectors v1 and v2. Now, backtrack along the arrows, starting from the final image, going through U, then Sigma, then V*. Pay attention to what it does to u1 and u2. In each step, the vectors remain orthogonal. So not only are u1 and u2 orthogonal, we must have that v1 and v2 are orthogonal. So now, couldn&#039;t we say, &amp;quot;take v1 and v2, squish along those axes. then rotate.&amp;quot; That seems to have required only one rotation. What&#039;s going on? The problem is that a diagonal matrix can only stretch along the standard basis. So &amp;quot;stretch along v1 and v2&amp;quot; can&#039;t be done via a diagonal matrix (unless v1 and v2 are the standard basis, of course). Let&#039;s say &amp;lt;math&amp;gt;M = RD&amp;lt;/math&amp;gt; where R is a rotation, and D is &amp;quot;stretch along v1 and v2&amp;quot;. So &amp;lt;math&amp;gt;Dv_1 = \lambda_1 v_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;Dv_2 = \lambda_2 v_2&amp;lt;/math&amp;gt;. Now, D is not a diagonal matrix, when viewed in the standard basis, but it &#039;&#039;is&#039;&#039; a diagonal matrix when viewed under the basis v1,v2. To get to the standard basis, we need to convert the incoming vectors like v1 into e1, then apply the stretching, then reconvert back to v1. In other words, we want to show &amp;lt;math&amp;gt;D = U\Sigma U^*&amp;lt;/math&amp;gt; for some diagonal matrix Sigma and orthogonal U. Just take &amp;lt;math&amp;gt;Ue_j=v_j&amp;lt;/math&amp;gt;, i.e. the matrix U is the matrix with columns [v1 v2]. 
Now, &amp;lt;math&amp;gt;Dv_j = U\Sigma U^* v_j = U\Sigma e_j = U\lambda_j e_j = \lambda_j v_j&amp;lt;/math&amp;gt; just like we wanted. So &amp;lt;math&amp;gt;M = RD = RU\Sigma U^*&amp;lt;/math&amp;gt;. Of course, RU is another orthogonal matrix, so we recover SVD.&lt;br /&gt;
** also, if any of the entries of &amp;lt;math&amp;gt;\Sigma&amp;lt;/math&amp;gt; are negative, we could have chosen a different v vector that would have made it positive, so we can assume that D is a positive operator (positive semi-definite).&lt;br /&gt;
* another question is: if we squish along the right orthogonal directions, can&#039;t we get away with not needing the extra rotation? after all, an ellipse can be squished into a circle without any rotations. what must be the case, although i can&#039;t explain it visually yet, is that if we do just that, then the standard basis vectors (yellow and pink) get mapped to the wrong spots. this &#039;&#039;might&#039;&#039; be an artifact of shears (the wikipedia SVD image is a shear). clearer to look at michael nielsen&#039;s [http://cognitivemedium.com/emm/images/tangent_definition.png image]. here, if we start with the ellipse and shrink along Ms and stretch along Mt, we do get a circle. but Ms doesn&#039;t go back to s, and Mt doesn&#039;t go back to t; for that we&#039;ll need the extra rotation.&lt;br /&gt;
* [https://math.stackexchange.com/questions/2899052/singular-value-decomposition-reconciling-the-maximal-stretching-and-spectral my old question]:&lt;br /&gt;
** what is this &amp;lt;math&amp;gt;\sqrt{T^*T}&amp;lt;/math&amp;gt; that axler keeps talking about? that&#039;s the &amp;quot;stretch along well-chosen orthogonal directions&amp;quot; operation that we start out with in polar decomposition.&lt;br /&gt;
** for proof (1), see [http://cognitivemedium.com/emm/emm.html michael nielsen]. basically, the maximal stretching direction has a tangent vector (on the ellipse) that is orthogonal to it, because if it &#039;&#039;wasn&#039;t&#039;&#039; orthogonal, then we could get an even more stretched out vector. the other piece that&#039;s required is that linear maps preserve tangency. i.e. if v(t) is a parametrization of a circle, and M is a matrix, then M(v(t)) traces out an ellipse as t varies. (i&#039;m using t as a parameter even though nielsen uses it as a vector. seriously, who the heck uses t for a vector??) the tangent vector on the circle at v(t) is v&#039;(t). this tangent vector gets mapped to M(v&#039;(t)). the tangent vector at M(v(t)) on the ellipse is &amp;lt;math display=inline&amp;gt;\frac{d}{dt} M(v(t))&amp;lt;/math&amp;gt;. now, by linearity of M and the definition of the derivative, we can basically &amp;quot;pull out&amp;quot; the M and see that &amp;lt;math display=inline&amp;gt;\frac{d}{dt} M(v(t)) = M(\frac{d}{dt} v(t))&amp;lt;/math&amp;gt;.&amp;lt;ref group=note&amp;gt;&amp;lt;math&amp;gt;\frac{d}{dt} M(v(t)) = \lim_{h\to0} \frac{Mv(t+h) - Mv(t)}{h} = \lim_{h\to0} \frac1h M(v(t+h)-v(t)) = \lim_{h\to0}M(\frac1h (v(t+h)-v(t))) = M(\frac{d}{dt} v(t))&amp;lt;/math&amp;gt;. You might also want to play around with an example like &amp;lt;math&amp;gt;\begin{pmatrix}3 &amp;amp; 0\\ 0 &amp;amp; 4\end{pmatrix}&amp;lt;/math&amp;gt;, which takes (cos t, sin t) to (3cos t, 4sin t). The tangent at the original point is (-sin t, cos t). The tangent at the image is (-3sin t, 4cos t), which is equal to the image of the tangent.&amp;lt;/ref&amp;gt; what this means is that if you have a point on the circle and its tangent, then you map both of them under M, then the tangent of the image of the point is the image of the tangent at the point.&amp;lt;ref group=note&amp;gt; I think another way to see this is via uniqueness of taylor approximations? 
like if v is a point on the circle, and u is the tangent vector at v, then points near v can be written as &amp;lt;math&amp;gt;v + \Delta u + O(\Delta^2)&amp;lt;/math&amp;gt;, and if we apply M to those points, we get &amp;lt;math&amp;gt;Mv + \Delta Mu + O(\Delta^2)&amp;lt;/math&amp;gt;. if taylor approximations are unique, then the fact that the term linear in &amp;lt;math&amp;gt;\Delta&amp;lt;/math&amp;gt; has Mu means that Mu must be tangent at Mv.&amp;lt;/ref&amp;gt; what this implies is that for our maximal stretch vector, since the tangent on the circle is orthogonal, the image of that tangent is also a tangent at the new place on the ellipse, and we already know that the tangent is orthogonal for the maximal stretch vector.&lt;br /&gt;
** so how does (2) find the same basis without talking about &amp;quot;maximal stretching&amp;quot;? well, in (2), &amp;lt;math&amp;gt;\sqrt{T^*T}&amp;lt;/math&amp;gt; &#039;&#039;means&#039;&#039; &amp;quot;stretch along well-chosen orthogonal directions&amp;quot; -- it&#039;s the positive operator that appears in polar decomposition. and if we stretch along orthogonal directions, then surely one of them has to be the maximal stretching direction (rather than, say, some direction intermediate between two of the axes).&lt;br /&gt;
&lt;br /&gt;
==See also==&lt;br /&gt;
&lt;br /&gt;
* https://machinelearning.subwiki.org/wiki/User:IssaRice/Linear_algebra/Classification_of_operators -- performing SVD on some nicer operators allows you to skip some of the steps, resulting in a simpler decomposition.&lt;br /&gt;
&lt;br /&gt;
==Footnotes==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references group=&amp;quot;note&amp;quot;/&amp;gt;&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=Summary_table_of_probability_terms&amp;diff=3550</id>
		<title>Summary table of probability terms</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=Summary_table_of_probability_terms&amp;diff=3550"/>
		<updated>2022-07-14T18:16:48Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: /* Dependencies */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page is a &#039;&#039;&#039;summary table of probability terms&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
==Table==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;sortable wikitable&amp;quot;&lt;br /&gt;
! Term !! Notation !! Type !! Definition !! Notes&lt;br /&gt;
|-&lt;br /&gt;
| Reals || &amp;lt;math&amp;gt;\mathbf R&amp;lt;/math&amp;gt; || ||&lt;br /&gt;
|-&lt;br /&gt;
| Borel subsets of the reals || &amp;lt;math&amp;gt;\mathcal B&amp;lt;/math&amp;gt; || ||&lt;br /&gt;
|-&lt;br /&gt;
| A Borel set || &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathcal B&amp;lt;/math&amp;gt; ||&lt;br /&gt;
|-&lt;br /&gt;
| [[Sample space]] || &amp;lt;math&amp;gt;\Omega&amp;lt;/math&amp;gt; || ||&lt;br /&gt;
|-&lt;br /&gt;
| Outcome || &amp;lt;math&amp;gt;\omega&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\Omega&amp;lt;/math&amp;gt; ||&lt;br /&gt;
|-&lt;br /&gt;
| Events or measurable sets || &amp;lt;math&amp;gt;\mathcal F&amp;lt;/math&amp;gt; || ||&lt;br /&gt;
|-&lt;br /&gt;
| Probability measure || &amp;lt;math&amp;gt;\mathbf P&amp;lt;/math&amp;gt; or &amp;lt;math&amp;gt;\Pr&amp;lt;/math&amp;gt; or &amp;lt;math&amp;gt;\mathbf P_{\mathcal F}&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathcal F \to [0,1]&amp;lt;/math&amp;gt; ||&lt;br /&gt;
|-&lt;br /&gt;
| Probability triple or probability space || &amp;lt;math&amp;gt;(\Omega, \mathcal F, \mathbf P)&amp;lt;/math&amp;gt; || ||&lt;br /&gt;
|-&lt;br /&gt;
| Distribution || &amp;lt;math&amp;gt;\mu&amp;lt;/math&amp;gt; or &amp;lt;math&amp;gt;\mathcal D&amp;lt;/math&amp;gt; or &amp;lt;math&amp;gt;D&amp;lt;/math&amp;gt; or &amp;lt;math&amp;gt;\mathbf P_{\mathcal B}&amp;lt;/math&amp;gt; or &amp;lt;math&amp;gt;\mathcal L(X)&amp;lt;/math&amp;gt; or &amp;lt;math&amp;gt;\mathbf P X^{-1}&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathcal B \to [0,1]&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;B \mapsto \mathbf P(X \in B)&amp;lt;/math&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| Induced probability space || &amp;lt;math&amp;gt;(\mathbf R, \mathcal B, \mu)&amp;lt;/math&amp;gt; || ||&lt;br /&gt;
|-&lt;br /&gt;
| Cumulative distribution function or CDF || &amp;lt;math&amp;gt;F_X&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathbf R \to [0,1]&amp;lt;/math&amp;gt; ||&lt;br /&gt;
|-&lt;br /&gt;
| Probability density function or PDF || &amp;lt;math&amp;gt;f_X&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathbf R \to [0,\infty)&amp;lt;/math&amp;gt; ||&lt;br /&gt;
|-&lt;br /&gt;
| [[Random variable]] || &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\Omega \to \mathbf R&amp;lt;/math&amp;gt; ||&lt;br /&gt;
|-&lt;br /&gt;
| Preimage of random variable || &amp;lt;math&amp;gt;X^{-1}&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;2^{\mathbf R} \to 2^{\Omega}&amp;lt;/math&amp;gt; but all we need is &amp;lt;math&amp;gt;\mathcal B \to \mathcal F&amp;lt;/math&amp;gt; ||&lt;br /&gt;
|-&lt;br /&gt;
| Indicator of &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;1_A&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\Omega \to \{0,1\}&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;1_A(\omega) = \begin{cases}1 &amp;amp; \omega\in A \\ 0 &amp;amp; \omega \not\in A\end{cases}&amp;lt;/math&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| [[Expectation]] || &amp;lt;math&amp;gt;\mathbf E&amp;lt;/math&amp;gt; or &amp;lt;math&amp;gt;\mathrm E&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;(\Omega \to \mathbf R) \to \mathbf R&amp;lt;/math&amp;gt; ||&lt;br /&gt;
|-&lt;br /&gt;
| || &amp;lt;math&amp;gt;X \in B&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathcal F&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\{\omega \in \Omega : X(\omega) \in B\}&amp;lt;/math&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| || &amp;lt;math&amp;gt;X=x&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathcal F&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\{\omega \in \Omega : X(\omega) = x\}&amp;lt;/math&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| || &amp;lt;math&amp;gt;X\leq x&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathcal F&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\{\omega \in \Omega : X(\omega) \leq x\}&amp;lt;/math&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| Function of a random variable, where &amp;lt;math&amp;gt;f\colon \mathbf R \to \mathbf R&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;f(X)&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\Omega \to \mathbf R&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;f\circ X&amp;lt;/math&amp;gt; ||&lt;br /&gt;
|-&lt;br /&gt;
| [[Expected value]] of &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathbf E(X)&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathbf R&amp;lt;/math&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| || &amp;lt;math&amp;gt;\mathbf E(X\mid Y=y)&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathbf R&amp;lt;/math&amp;gt; ||&lt;br /&gt;
|-&lt;br /&gt;
| || &amp;lt;math&amp;gt;\mathbf E(X\mid Y)&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\Omega \to \mathbf R&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\omega \mapsto \mathbf E(X\mid Y=Y(\omega))&amp;lt;/math&amp;gt;?&lt;br /&gt;
|-&lt;br /&gt;
| Utility function || &amp;lt;math&amp;gt;u&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathbf R \to \mathbf R&amp;lt;/math&amp;gt; || || I &#039;&#039;think&#039;&#039; this is what the type must be, based on how it&#039;s used. But we usually think of the utility function as assigning numbers to outcomes; but if that is so, it must be a random variable! What&#039;s up with that? (2022-07-14: I think in probability theory, we usually discuss only real random variables, since that allows us to do a lot more with them like take expected value. But in fields like AI, we consider more general random variables &amp;lt;math&amp;gt;\Omega \to \mathcal O&amp;lt;/math&amp;gt; that take values in some space of outcomes &amp;lt;math&amp;gt;\mathcal O&amp;lt;/math&amp;gt;. We can&#039;t &amp;quot;average over&amp;quot; outcomes so we can&#039;t really take expected values anymore, but this allows us to make the utility function more general so we get &amp;lt;math&amp;gt;u : \mathcal O \to \mathbf R&amp;lt;/math&amp;gt;.)&lt;br /&gt;
|-&lt;br /&gt;
| Expected utility of &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathbf{EU}(X)&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathbf R&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathbf E(u(X))&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;u\circ X&amp;lt;/math&amp;gt; is indeed a random variable, so the type check passes.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
All the utility stuff isn&#039;t really related to machine learning. It&#039;s more related to the decision theory stuff I&#039;m learning. I&#039;m putting it here for now for convenience but might move it later.&lt;br /&gt;
&lt;br /&gt;
TODO add &amp;quot;probability distribution over S&amp;quot; and &amp;quot;probability distribution on A&amp;quot; [https://arxiv.org/pdf/1711.00363.pdf]&lt;br /&gt;
&lt;br /&gt;
Li and Vitanyi (&#039;&#039;An Introduction to Kolmogorov Complexity and Its Applications&#039;&#039;, p. 19) call the probability measure on &amp;lt;math&amp;gt;\mathcal F&amp;lt;/math&amp;gt; a probability distribution over S (the sample space).&lt;br /&gt;
&lt;br /&gt;
TODO: add probability mass function (defined only for discrete random variables)&lt;br /&gt;
&lt;br /&gt;
==Dependencies==&lt;br /&gt;
&lt;br /&gt;
Let &amp;lt;math&amp;gt;(\Omega, \mathcal F, \mathbf P)&amp;lt;/math&amp;gt; be a probability space.&lt;br /&gt;
&lt;br /&gt;
* Given a random variable &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;, we can compute its distribution &amp;lt;math&amp;gt;\mu&amp;lt;/math&amp;gt;. How? Just let &amp;lt;math&amp;gt;\mu(B) = \mathbf P_{\mathcal F}(X \in B)&amp;lt;/math&amp;gt;&lt;br /&gt;
* Given a random variable, we can compute the probability density function. How?&lt;br /&gt;
* Given a random variable, we can compute the cumulative distribution function. How?&lt;br /&gt;
* Given a distribution, we can retrieve a random variable. But this random variable is not unique? This is why we can say stuff like &amp;quot;let &amp;lt;math&amp;gt;X\sim \mathcal D&amp;lt;/math&amp;gt;&amp;quot;.&lt;br /&gt;
* Given a distribution &amp;lt;math&amp;gt;\mu&amp;lt;/math&amp;gt;, we can compute its density function. How? Just find the derivative of &amp;lt;math&amp;gt;\mu((-\infty,x])&amp;lt;/math&amp;gt;. (?) (2022-07-14: something something Radon–Nikodym theorem...)&lt;br /&gt;
* Given a cumulative distribution function, we can compute the random variable. (Right?) (2022-07-14: but a CDF is like a distribution, so the random variable won&#039;t be unique.)&lt;br /&gt;
* Given a probability density function, can we get everything else? Don&#039;t we just have to integrate to get the cdf, which gets us the random variable and the distribution?&lt;br /&gt;
* Given a cumulative distribution function, how do we get the distribution? We have &amp;lt;math&amp;gt;F_X(x) = \mathbf P_{\mathcal F}(X\leq x) = \mathbf P_{\mathcal B}((-\infty,x])&amp;lt;/math&amp;gt;, which gets us some of what the distribution &amp;lt;math&amp;gt;\mathbf P_{\mathcal B}&amp;lt;/math&amp;gt; maps to, but &amp;lt;math&amp;gt;\mathcal B&amp;lt;/math&amp;gt; is bigger than this. What do we do about the other values we need to map? We can compute half-open intervals like &amp;lt;math&amp;gt;F_X(b) - F_X(a) = \mathbf P_{\mathcal F}(a &amp;lt; X\leq b) = \mathbf P_{\mathcal B}((a,b])&amp;lt;/math&amp;gt;. And we can apparently do the same for unions and limiting operations.&lt;br /&gt;
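&lt;br /&gt;
As a concrete check of the relationships in the list above, here is a minimal Python sketch on a hypothetical finite probability space (two fair dice, with X their sum; none of these names come from the sources cited on this page). The distribution is computed as the pushforward of P, the CDF is read off from the distribution, and the difference of two CDF values is compared against the probability of an interval.&lt;br /&gt;
&lt;br /&gt;

```python
from fractions import Fraction
from itertools import product

# Hypothetical toy probability space: two fair dice.
omega = list(product(range(1, 7), repeat=2))   # sample space
prob = {w: Fraction(1, 36) for w in omega}     # P on single outcomes

def X(w):
    """Random variable: the sum of the two dice."""
    return w[0] + w[1]

def P(event):
    """Probability of an event given as a collection of outcomes."""
    return sum(prob[w] for w in event)

def mu(B):
    """Distribution of X: mu(B) = P(X in B), with B given as a predicate on reals."""
    return P([w for w in omega if B(X(w))])

def cdf(x):
    """F_X(x) = mu((-inf, x])."""
    return mu(lambda v: x >= v)

a, b = 4, 9
print(cdf(b) - cdf(a))              # probability that X lands in (a, b]
print(mu(lambda v: b >= v > a))     # same number, computed from mu directly
print(mu(lambda v: b >= v >= a))    # [a, b] additionally counts P(X = a)
```

On this example the first two printed values agree, while the third is larger because the closed interval also contains the point a, which X hits with positive probability.&lt;br /&gt;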
&lt;br /&gt;
==Philosophical details about the sample space==&lt;br /&gt;
&lt;br /&gt;
Given a random variable &amp;lt;math&amp;gt;X : \Omega \to \mathbf R&amp;lt;/math&amp;gt; and any reasonable predicate &amp;lt;math&amp;gt;P&amp;lt;/math&amp;gt; about &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;, we can replace &amp;lt;math&amp;gt;P(X)&amp;lt;/math&amp;gt; with its extension &amp;lt;math&amp;gt;\{\omega \in \Omega : P(X(\omega))\} = \{\omega \in \Omega : X(\omega) \in B\}&amp;lt;/math&amp;gt; for some &amp;lt;math&amp;gt;B \in \mathcal B&amp;lt;/math&amp;gt;. And from then on, we can write &amp;lt;math&amp;gt;\mathbf P_{\mathcal F}(X\in B)&amp;lt;/math&amp;gt; as &amp;lt;math&amp;gt;\mathbf P_{\mathcal F}(X^{-1}(B)) = \mathbf P_{\mathcal B}(B) = \mu(B)&amp;lt;/math&amp;gt;. In other words, we can just work with Borel sets of the reals (measuring them with the distribution) rather than the original events (measuring them with the original probability measure). Where did &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt; go? &amp;lt;math&amp;gt;\mathbf P_{\mathcal F} \circ X^{-1} = \mathbf P_{\mathcal B}&amp;lt;/math&amp;gt;, so you can write &amp;lt;math&amp;gt;\mathbf P_{\mathcal B}&amp;lt;/math&amp;gt; using &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;. But once you already have &amp;lt;math&amp;gt;\mathbf P_{\mathcal B}&amp;lt;/math&amp;gt;, you don&#039;t need to know what &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt; is.&lt;br /&gt;
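&lt;br /&gt;
The point that the distribution forgets the sample space can be illustrated with a minimal Python sketch. The two probability spaces below (a fair coin and a fair die) and the helper names distribution and expected_value are made up for illustration. The random variables live on different sample spaces but induce the same distribution, so anything computed from the distribution alone, such as the expected value, cannot tell them apart.&lt;br /&gt;
&lt;br /&gt;

```python
from fractions import Fraction

def distribution(omega, prob, X):
    """Push P forward along X: returns {value: P(X = value)}."""
    mu = {}
    for w in omega:
        mu[X(w)] = mu.get(X(w), Fraction(0)) + prob[w]
    return mu

def expected_value(mu):
    """Expected value computed from the distribution alone."""
    return sum(x * p for x, p in mu.items())

# Hypothetical space 1: one fair coin, X1 = 1 on heads.
omega1 = ["H", "T"]
prob1 = {w: Fraction(1, 2) for w in omega1}

def X1(w):
    return 1 if w == "H" else 0

# Hypothetical space 2: one fair die, X2 = 1 on even faces.
omega2 = [1, 2, 3, 4, 5, 6]
prob2 = {w: Fraction(1, 6) for w in omega2}

def X2(w):
    return 1 if w % 2 == 0 else 0

mu1 = distribution(omega1, prob1, X1)
mu2 = distribution(omega2, prob2, X2)

print(mu1 == mu2)                                # same distribution
print(expected_value(mu1), expected_value(mu2))  # hence same expected value
```

This is also one way to read the earlier remark that a distribution does not determine a unique random variable: X1 and X2 are different functions on different domains with the same distribution.&lt;br /&gt;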
&lt;br /&gt;
==See also==&lt;br /&gt;
&lt;br /&gt;
* [[Summary table of multivariable derivatives]]&lt;br /&gt;
* [[Comparison of machine learning textbooks]]&lt;br /&gt;
&lt;br /&gt;
==External links==&lt;br /&gt;
&lt;br /&gt;
* [https://terrytao.wordpress.com/2010/01/01/254a-notes-0-a-review-of-probability-theory/ 254A, Notes 0: A review of probability theory] and [https://terrytao.wordpress.com/2015/09/29/275a-notes-0-foundations-of-probability-theory/ 275A, Notes 0: Foundations of probability theory] by [[wikipedia:Terence Tao|Terence Tao]]&lt;br /&gt;
* [http://dsp.ucsd.edu/~kreutz/PEI-05%20Support%20Files/Basic%20Random%20Variables%20Concepts.pdf Basic Random Variable Concepts] by Kenneth Kreutz-Delgado&lt;br /&gt;
* Various questions on Mathematics Stack Exchange:&lt;br /&gt;
** https://math.stackexchange.com/questions/2233731/discarding-random-variables-in-favor-of-a-domain-less-definition&lt;br /&gt;
** https://math.stackexchange.com/questions/18198/what-are-the-sample-spaces-when-talking-about-continuous-random-variables&lt;br /&gt;
** https://math.stackexchange.com/questions/2233721/the-true-domain-of-random-variables&lt;br /&gt;
** https://math.stackexchange.com/questions/712734/domain-of-a-random-variable-sample-space-or-probability-space&lt;br /&gt;
** https://math.stackexchange.com/questions/23006/the-role-of-the-hidden-probability-space-on-which-random-variables-are-defined&lt;br /&gt;
** https://math.stackexchange.com/questions/1612012/how-should-i-understand-the-probability-space-omega-mathcalf-p-what-d&lt;br /&gt;
** https://math.stackexchange.com/questions/2531810/why-does-probability-theory-insist-on-sample-spaces&lt;br /&gt;
** https://math.stackexchange.com/questions/1690289/what-is-a-probability-distribution&lt;br /&gt;
** https://math.stackexchange.com/questions/1073744/distinguishing-probability-measure-function-and-distribution&lt;br /&gt;
** https://math.stackexchange.com/questions/57027/concept-of-probability-distribution&lt;br /&gt;
* Tim Gowers:&lt;br /&gt;
** https://gowers.wordpress.com/2010/09/01/icm2010-fourth-day/ (search for &amp;quot;random variable&amp;quot;)&lt;br /&gt;
** https://mathoverflow.net/questions/12516/a-random-variable-is-it-a-function-or-an-equivalence-class-of-functions&lt;br /&gt;
&lt;br /&gt;
[[Category:Probability]]&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=Summary_table_of_probability_terms&amp;diff=3549</id>
		<title>Summary table of probability terms</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=Summary_table_of_probability_terms&amp;diff=3549"/>
		<updated>2022-07-14T18:15:36Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: /* Dependencies */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page is a &#039;&#039;&#039;summary table of probability terms&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
==Table==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;sortable wikitable&amp;quot;&lt;br /&gt;
! Term !! Notation !! Type !! Definition !! Notes&lt;br /&gt;
|-&lt;br /&gt;
| Reals || &amp;lt;math&amp;gt;\mathbf R&amp;lt;/math&amp;gt; || ||&lt;br /&gt;
|-&lt;br /&gt;
| Borel subsets of the reals || &amp;lt;math&amp;gt;\mathcal B&amp;lt;/math&amp;gt; || ||&lt;br /&gt;
|-&lt;br /&gt;
| A Borel set || &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathcal B&amp;lt;/math&amp;gt; ||&lt;br /&gt;
|-&lt;br /&gt;
| [[Sample space]] || &amp;lt;math&amp;gt;\Omega&amp;lt;/math&amp;gt; || ||&lt;br /&gt;
|-&lt;br /&gt;
| Outcome || &amp;lt;math&amp;gt;\omega&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\Omega&amp;lt;/math&amp;gt; ||&lt;br /&gt;
|-&lt;br /&gt;
| Events or measurable sets || &amp;lt;math&amp;gt;\mathcal F&amp;lt;/math&amp;gt; || ||&lt;br /&gt;
|-&lt;br /&gt;
| Probability measure || &amp;lt;math&amp;gt;\mathbf P&amp;lt;/math&amp;gt; or &amp;lt;math&amp;gt;\Pr&amp;lt;/math&amp;gt; or &amp;lt;math&amp;gt;\mathbf P_{\mathcal F}&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathcal F \to [0,1]&amp;lt;/math&amp;gt; ||&lt;br /&gt;
|-&lt;br /&gt;
| Probability triple or probability space || &amp;lt;math&amp;gt;(\Omega, \mathcal F, \mathbf P)&amp;lt;/math&amp;gt; || ||&lt;br /&gt;
|-&lt;br /&gt;
| Distribution || &amp;lt;math&amp;gt;\mu&amp;lt;/math&amp;gt; or &amp;lt;math&amp;gt;\mathcal D&amp;lt;/math&amp;gt; or &amp;lt;math&amp;gt;D&amp;lt;/math&amp;gt; or &amp;lt;math&amp;gt;\mathbf P_{\mathcal B}&amp;lt;/math&amp;gt; or &amp;lt;math&amp;gt;\mathcal L(X)&amp;lt;/math&amp;gt; or &amp;lt;math&amp;gt;\mathbf P X^{-1}&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathcal B \to [0,1]&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;B \mapsto \mathbf P(X \in B)&amp;lt;/math&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| Induced probability space || &amp;lt;math&amp;gt;(\mathbf R, \mathcal B, \mu)&amp;lt;/math&amp;gt; || ||&lt;br /&gt;
|-&lt;br /&gt;
| Cumulative distribution function or CDF || &amp;lt;math&amp;gt;F_X&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathbf R \to [0,1]&amp;lt;/math&amp;gt; ||&lt;br /&gt;
|-&lt;br /&gt;
| Probability density function or PDF || &amp;lt;math&amp;gt;f_X&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathbf R \to [0,\infty)&amp;lt;/math&amp;gt; ||&lt;br /&gt;
|-&lt;br /&gt;
| [[Random variable]] || &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\Omega \to \mathbf R&amp;lt;/math&amp;gt; ||&lt;br /&gt;
|-&lt;br /&gt;
| Preimage of random variable || &amp;lt;math&amp;gt;X^{-1}&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;2^{\mathbf R} \to 2^{\Omega}&amp;lt;/math&amp;gt; but all we need is &amp;lt;math&amp;gt;\mathcal B \to \mathcal F&amp;lt;/math&amp;gt; ||&lt;br /&gt;
|-&lt;br /&gt;
| Indicator of &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;1_A&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\Omega \to \{0,1\}&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;1_A(\omega) = \begin{cases}1 &amp;amp; \omega\in A \\ 0 &amp;amp; \omega \not\in A\end{cases}&amp;lt;/math&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| [[Expectation]] || &amp;lt;math&amp;gt;\mathbf E&amp;lt;/math&amp;gt; or &amp;lt;math&amp;gt;\mathrm E&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;(\Omega \to \mathbf R) \to \mathbf R&amp;lt;/math&amp;gt; ||&lt;br /&gt;
|-&lt;br /&gt;
| || &amp;lt;math&amp;gt;X \in B&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathcal F&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\{\omega \in \Omega : X(\omega) \in B\}&amp;lt;/math&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| || &amp;lt;math&amp;gt;X=x&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathcal F&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\{\omega \in \Omega : X(\omega) = x\}&amp;lt;/math&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| || &amp;lt;math&amp;gt;X\leq x&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathcal F&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\{\omega \in \Omega : X(\omega) \leq x\}&amp;lt;/math&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| Function of a random variable, where &amp;lt;math&amp;gt;f\colon \mathbf R \to \mathbf R&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;f(X)&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\Omega \to \mathbf R&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;f\circ X&amp;lt;/math&amp;gt; ||&lt;br /&gt;
|-&lt;br /&gt;
| [[Expected value]] of &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathbf E(X)&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathbf R&amp;lt;/math&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| || &amp;lt;math&amp;gt;\mathbf E(X\mid Y=y)&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathbf R&amp;lt;/math&amp;gt; ||&lt;br /&gt;
|-&lt;br /&gt;
| || &amp;lt;math&amp;gt;\mathbf E(X\mid Y)&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\Omega \to \mathbf R&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\omega \mapsto \mathbf E(X\mid Y=Y(\omega))&amp;lt;/math&amp;gt;?&lt;br /&gt;
|-&lt;br /&gt;
| Utility function || &amp;lt;math&amp;gt;u&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathbf R \to \mathbf R&amp;lt;/math&amp;gt; || || I &#039;&#039;think&#039;&#039; this is what the type must be, based on how it&#039;s used. But we usually think of the utility function as assigning numbers to outcomes; but if that is so, it must be a random variable! What&#039;s up with that? (2022-07-14: I think in probability theory, we usually discuss only real random variables, since that allows us to do a lot more with them like take expected value. But in fields like AI, we consider more general random variables &amp;lt;math&amp;gt;\Omega \to \mathcal O&amp;lt;/math&amp;gt; that take values in some space of outcomes &amp;lt;math&amp;gt;\mathcal O&amp;lt;/math&amp;gt;. We can&#039;t &amp;quot;average over&amp;quot; outcomes so we can&#039;t really take expected values anymore, but this allows us to make the utility function more general so we get &amp;lt;math&amp;gt;u : \mathcal O \to \mathbf R&amp;lt;/math&amp;gt;.)&lt;br /&gt;
|-&lt;br /&gt;
| Expected utility of &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathbf{EU}(X)&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathbf R&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathbf E(u(X))&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;u\circ X&amp;lt;/math&amp;gt; is indeed a random variable, so the type check passes.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
All the utility stuff isn&#039;t really related to machine learning. It&#039;s more related to the decision theory stuff I&#039;m learning. I&#039;m putting it here for now for convenience but might move it later.&lt;br /&gt;
&lt;br /&gt;
TODO add &amp;quot;probability distribution over S&amp;quot; and &amp;quot;probability distribution on A&amp;quot; [https://arxiv.org/pdf/1711.00363.pdf]&lt;br /&gt;
&lt;br /&gt;
Li and Vitanyi (&#039;&#039;An Introduction to Kolmogorov Complexity and Its Applications&#039;&#039;, p. 19) call the probability measure on &amp;lt;math&amp;gt;\mathcal F&amp;lt;/math&amp;gt; a probability distribution over S (the sample space).&lt;br /&gt;
&lt;br /&gt;
TODO: add probability mass function (defined only for discrete random variables)&lt;br /&gt;
&lt;br /&gt;
==Dependencies==&lt;br /&gt;
&lt;br /&gt;
Let &amp;lt;math&amp;gt;(\Omega, \mathcal F, \mathbf P)&amp;lt;/math&amp;gt; be a probability space.&lt;br /&gt;
&lt;br /&gt;
* Given a random variable &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;, we can compute its distribution &amp;lt;math&amp;gt;\mu&amp;lt;/math&amp;gt;. How? Just let &amp;lt;math&amp;gt;\mu(B) = \mathbf P_{\mathcal F}(X \in B)&amp;lt;/math&amp;gt;&lt;br /&gt;
* Given a random variable, we can compute the probability density function. How?&lt;br /&gt;
* Given a random variable, we can compute the cumulative distribution function. How?&lt;br /&gt;
* Given a distribution, we can retrieve a random variable. But this random variable is not unique? This is why we can say stuff like &amp;quot;let &amp;lt;math&amp;gt;X\sim \mathcal D&amp;lt;/math&amp;gt;&amp;quot;.&lt;br /&gt;
* Given a distribution &amp;lt;math&amp;gt;\mu&amp;lt;/math&amp;gt;, we can compute its density function. How? Just find the derivative of &amp;lt;math&amp;gt;\mu((-\infty,x])&amp;lt;/math&amp;gt;. (?) (2022-07-14: something something Radon–Nikodym theorem...)&lt;br /&gt;
* Given a cumulative distribution function, we can compute the random variable. (Right?)&lt;br /&gt;
* Given a probability density function, can we get everything else? Don&#039;t we just have to integrate to get the cdf, which gets us the random variable and the distribution?&lt;br /&gt;
* Given a cumulative distribution function, how do we get the distribution? We have &amp;lt;math&amp;gt;F_X(x) = \mathbf P_{\mathcal F}(X\leq x) = \mathbf P_{\mathcal B}((-\infty,x])&amp;lt;/math&amp;gt;, which gets us some of what the distribution &amp;lt;math&amp;gt;\mathbf P_{\mathcal B}&amp;lt;/math&amp;gt; maps to, but &amp;lt;math&amp;gt;\mathcal B&amp;lt;/math&amp;gt; is bigger than this. What do we do about the other values we need to map? We can compute half-open intervals like &amp;lt;math&amp;gt;F_X(b) - F_X(a) = \mathbf P_{\mathcal F}(a &amp;lt; X\leq b) = \mathbf P_{\mathcal B}((a,b])&amp;lt;/math&amp;gt;. And we can apparently do the same for unions and limiting operations.&lt;br /&gt;
&lt;br /&gt;
==Philosophical details about the sample space==&lt;br /&gt;
&lt;br /&gt;
Given a random variable &amp;lt;math&amp;gt;X : \Omega \to \mathbf R&amp;lt;/math&amp;gt; and any reasonable predicate &amp;lt;math&amp;gt;P&amp;lt;/math&amp;gt; about &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;, we can replace &amp;lt;math&amp;gt;P(X)&amp;lt;/math&amp;gt; with its extension &amp;lt;math&amp;gt;\{\omega \in \Omega : P(X(\omega))\} = \{\omega \in \Omega : X(\omega) \in B\}&amp;lt;/math&amp;gt; for some &amp;lt;math&amp;gt;B \in \mathcal B&amp;lt;/math&amp;gt;. And from then on, we can write &amp;lt;math&amp;gt;\mathbf P_{\mathcal F}(X\in B)&amp;lt;/math&amp;gt; as &amp;lt;math&amp;gt;\mathbf P_{\mathcal F}(X^{-1}(B)) = \mathbf P_{\mathcal B}(B) = \mu(B)&amp;lt;/math&amp;gt;. In other words, we can just work with Borel sets of the reals (measuring them with the distribution) rather than the original events (measuring them with the original probability measure). Where did &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt; go? &amp;lt;math&amp;gt;\mathbf P_{\mathcal F} \circ X^{-1} = \mathbf P_{\mathcal B}&amp;lt;/math&amp;gt;, so you can write &amp;lt;math&amp;gt;\mathbf P_{\mathcal B}&amp;lt;/math&amp;gt; using &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;. But once you already have &amp;lt;math&amp;gt;\mathbf P_{\mathcal B}&amp;lt;/math&amp;gt;, you don&#039;t need to know what &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt; is.&lt;br /&gt;
&lt;br /&gt;
==See also==&lt;br /&gt;
&lt;br /&gt;
* [[Summary table of multivariable derivatives]]&lt;br /&gt;
* [[Comparison of machine learning textbooks]]&lt;br /&gt;
&lt;br /&gt;
==External links==&lt;br /&gt;
&lt;br /&gt;
* [https://terrytao.wordpress.com/2010/01/01/254a-notes-0-a-review-of-probability-theory/ 254A, Notes 0: A review of probability theory] and [https://terrytao.wordpress.com/2015/09/29/275a-notes-0-foundations-of-probability-theory/ 275A, Notes 0: Foundations of probability theory] by [[wikipedia:Terence Tao|Terence Tao]]&lt;br /&gt;
* [http://dsp.ucsd.edu/~kreutz/PEI-05%20Support%20Files/Basic%20Random%20Variables%20Concepts.pdf Basic Random Variable Concepts] by Kenneth Kreutz-Delgado&lt;br /&gt;
* Various questions on Mathematics Stack Exchange:&lt;br /&gt;
** https://math.stackexchange.com/questions/2233731/discarding-random-variables-in-favor-of-a-domain-less-definition&lt;br /&gt;
** https://math.stackexchange.com/questions/18198/what-are-the-sample-spaces-when-talking-about-continuous-random-variables&lt;br /&gt;
** https://math.stackexchange.com/questions/2233721/the-true-domain-of-random-variables&lt;br /&gt;
** https://math.stackexchange.com/questions/712734/domain-of-a-random-variable-sample-space-or-probability-space&lt;br /&gt;
** https://math.stackexchange.com/questions/23006/the-role-of-the-hidden-probability-space-on-which-random-variables-are-defined&lt;br /&gt;
** https://math.stackexchange.com/questions/1612012/how-should-i-understand-the-probability-space-omega-mathcalf-p-what-d&lt;br /&gt;
** https://math.stackexchange.com/questions/2531810/why-does-probability-theory-insist-on-sample-spaces&lt;br /&gt;
** https://math.stackexchange.com/questions/1690289/what-is-a-probability-distribution&lt;br /&gt;
** https://math.stackexchange.com/questions/1073744/distinguishing-probability-measure-function-and-distribution&lt;br /&gt;
** https://math.stackexchange.com/questions/57027/concept-of-probability-distribution&lt;br /&gt;
* Tim Gowers:&lt;br /&gt;
** https://gowers.wordpress.com/2010/09/01/icm2010-fourth-day/ (search for &amp;quot;random variable&amp;quot;)&lt;br /&gt;
** https://mathoverflow.net/questions/12516/a-random-variable-is-it-a-function-or-an-equivalence-class-of-functions&lt;br /&gt;
&lt;br /&gt;
[[Category:Probability]]&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
	<entry>
		<id>https://machinelearning.subwiki.org/w/index.php?title=Summary_table_of_probability_terms&amp;diff=3548</id>
		<title>Summary table of probability terms</title>
		<link rel="alternate" type="text/html" href="https://machinelearning.subwiki.org/w/index.php?title=Summary_table_of_probability_terms&amp;diff=3548"/>
		<updated>2022-07-14T18:08:21Z</updated>

		<summary type="html">&lt;p&gt;IssaRice: /* Table */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page is a &#039;&#039;&#039;summary table of probability terms&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
==Table==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;sortable wikitable&amp;quot;&lt;br /&gt;
! Term !! Notation !! Type !! Definition !! Notes&lt;br /&gt;
|-&lt;br /&gt;
| Reals || &amp;lt;math&amp;gt;\mathbf R&amp;lt;/math&amp;gt; || ||&lt;br /&gt;
|-&lt;br /&gt;
| Borel subsets of the reals || &amp;lt;math&amp;gt;\mathcal B&amp;lt;/math&amp;gt; || ||&lt;br /&gt;
|-&lt;br /&gt;
| A Borel set || &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathcal B&amp;lt;/math&amp;gt; ||&lt;br /&gt;
|-&lt;br /&gt;
| [[Sample space]] || &amp;lt;math&amp;gt;\Omega&amp;lt;/math&amp;gt; || ||&lt;br /&gt;
|-&lt;br /&gt;
| Outcome || &amp;lt;math&amp;gt;\omega&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\Omega&amp;lt;/math&amp;gt; ||&lt;br /&gt;
|-&lt;br /&gt;
| Events or measurable sets || &amp;lt;math&amp;gt;\mathcal F&amp;lt;/math&amp;gt; || ||&lt;br /&gt;
|-&lt;br /&gt;
| Probability measure || &amp;lt;math&amp;gt;\mathbf P&amp;lt;/math&amp;gt; or &amp;lt;math&amp;gt;\Pr&amp;lt;/math&amp;gt; or &amp;lt;math&amp;gt;\mathbf P_{\mathcal F}&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathcal F \to [0,1]&amp;lt;/math&amp;gt; ||&lt;br /&gt;
|-&lt;br /&gt;
| Probability triple or probability space || &amp;lt;math&amp;gt;(\Omega, \mathcal F, \mathbf P)&amp;lt;/math&amp;gt; || ||&lt;br /&gt;
|-&lt;br /&gt;
| Distribution || &amp;lt;math&amp;gt;\mu&amp;lt;/math&amp;gt; or &amp;lt;math&amp;gt;\mathcal D&amp;lt;/math&amp;gt; or &amp;lt;math&amp;gt;D&amp;lt;/math&amp;gt; or &amp;lt;math&amp;gt;\mathbf P_{\mathcal B}&amp;lt;/math&amp;gt; or &amp;lt;math&amp;gt;\mathcal L(X)&amp;lt;/math&amp;gt; or &amp;lt;math&amp;gt;\mathbf P X^{-1}&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathcal B \to [0,1]&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;B \mapsto \mathbf P(X \in B)&amp;lt;/math&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| Induced probability space || &amp;lt;math&amp;gt;(\mathbf R, \mathcal B, \mu)&amp;lt;/math&amp;gt; || ||&lt;br /&gt;
|-&lt;br /&gt;
| Cumulative distribution function or CDF || &amp;lt;math&amp;gt;F_X&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathbf R \to [0,1]&amp;lt;/math&amp;gt; ||&lt;br /&gt;
|-&lt;br /&gt;
| Probability density function or PDF || &amp;lt;math&amp;gt;f_X&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathbf R \to [0,\infty)&amp;lt;/math&amp;gt; ||&lt;br /&gt;
|-&lt;br /&gt;
| [[Random variable]] || &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\Omega \to \mathbf R&amp;lt;/math&amp;gt; ||&lt;br /&gt;
|-&lt;br /&gt;
| Preimage of random variable || &amp;lt;math&amp;gt;X^{-1}&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;2^{\mathbf R} \to 2^{\Omega}&amp;lt;/math&amp;gt; but all we need is &amp;lt;math&amp;gt;\mathcal B \to \mathcal F&amp;lt;/math&amp;gt; ||&lt;br /&gt;
|-&lt;br /&gt;
| Indicator of &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;1_A&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\Omega \to \{0,1\}&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;1_A(\omega) = \begin{cases}1 &amp;amp; \omega\in A \\ 0 &amp;amp; \omega \not\in A\end{cases}&amp;lt;/math&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| [[Expectation]] || &amp;lt;math&amp;gt;\mathbf E&amp;lt;/math&amp;gt; or &amp;lt;math&amp;gt;\mathrm E&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;(\Omega \to \mathbf R) \to \mathbf R&amp;lt;/math&amp;gt; ||&lt;br /&gt;
|-&lt;br /&gt;
| || &amp;lt;math&amp;gt;X \in B&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathcal F&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\{\omega \in \Omega : X(\omega) \in B\}&amp;lt;/math&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| || &amp;lt;math&amp;gt;X=x&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathcal F&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\{\omega \in \Omega : X(\omega) = x\}&amp;lt;/math&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| || &amp;lt;math&amp;gt;X\leq x&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathcal F&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\{\omega \in \Omega : X(\omega) \leq x\}&amp;lt;/math&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| Function of a random variable, where &amp;lt;math&amp;gt;f\colon \mathbf R \to \mathbf R&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;f(X)&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\Omega \to \mathbf R&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;f\circ X&amp;lt;/math&amp;gt; ||&lt;br /&gt;
|-&lt;br /&gt;
| [[Expected value]] of &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathbf E(X)&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathbf R&amp;lt;/math&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| || &amp;lt;math&amp;gt;\mathbf E(X\mid Y=y)&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathbf R&amp;lt;/math&amp;gt; ||&lt;br /&gt;
|-&lt;br /&gt;
| || &amp;lt;math&amp;gt;\mathbf E(X\mid Y)&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\Omega \to \mathbf R&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\omega \mapsto \mathbf E(X\mid Y=Y(\omega))&amp;lt;/math&amp;gt;?&lt;br /&gt;
|-&lt;br /&gt;
| Utility function || &amp;lt;math&amp;gt;u&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathbf R \to \mathbf R&amp;lt;/math&amp;gt; || || I &#039;&#039;think&#039;&#039; this is what the type must be, based on how it&#039;s used. However, we usually think of the utility function as assigning numbers to outcomes, and if that is so, it must itself be a random variable! What&#039;s up with that? (2022-07-14: I think in probability theory, we usually discuss only real random variables, since that allows us to do a lot more with them, such as taking expected values. But in fields like AI, we consider more general random variables &amp;lt;math&amp;gt;\Omega \to \mathcal O&amp;lt;/math&amp;gt; that take values in some space of outcomes &amp;lt;math&amp;gt;\mathcal O&amp;lt;/math&amp;gt;. We can&#039;t &amp;quot;average over&amp;quot; outcomes, so we can&#039;t really take expected values anymore, but this allows us to make the utility function more general, so we get &amp;lt;math&amp;gt;u : \mathcal O \to \mathbf R&amp;lt;/math&amp;gt;.)&lt;br /&gt;
|-&lt;br /&gt;
| Expected utility of &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathbf{EU}(X)&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathbf R&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;\mathbf E(u(X))&amp;lt;/math&amp;gt; || &amp;lt;math&amp;gt;u\circ X&amp;lt;/math&amp;gt; is indeed a random variable, so the type check passes.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
All the utility stuff isn&#039;t really related to machine learning. It&#039;s more related to the decision theory stuff I&#039;m learning. I&#039;m putting it here for now for convenience but might move it later.&lt;br /&gt;
&lt;br /&gt;
TODO add &amp;quot;probability distribution over S&amp;quot; and &amp;quot;probability distribution on A&amp;quot; [https://arxiv.org/pdf/1711.00363.pdf]&lt;br /&gt;
&lt;br /&gt;
Li and Vitanyi (&#039;&#039;An Introduction to Kolmogorov Complexity and Its Applications&#039;&#039;, p. 19) call the probability measure on &amp;lt;math&amp;gt;\mathcal F&amp;lt;/math&amp;gt; a probability distribution over S (the sample space).&lt;br /&gt;
&lt;br /&gt;
TODO: add probability mass function (defined only for discrete random variables)&lt;br /&gt;
&lt;br /&gt;
==Dependencies==&lt;br /&gt;
&lt;br /&gt;
Let &amp;lt;math&amp;gt;(\Omega, \mathcal F, \mathbf P)&amp;lt;/math&amp;gt; be a probability space.&lt;br /&gt;
&lt;br /&gt;
* Given a random variable &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;, we can compute its distribution &amp;lt;math&amp;gt;\mu&amp;lt;/math&amp;gt;. How? Just let &amp;lt;math&amp;gt;\mu(B) = \mathbf P_{\mathcal F}(X \in B)&amp;lt;/math&amp;gt;.&lt;br /&gt;
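* Given a random variable, we can compute the probability density function, provided one exists (a discrete random variable, for example, has none). How? Differentiate the cumulative distribution function: &amp;lt;math&amp;gt;f_X(x) = \frac{d}{dx} F_X(x)&amp;lt;/math&amp;gt; wherever the derivative exists.&lt;br /&gt;
* Given a random variable, we can compute the cumulative distribution function. How? Just let &amp;lt;math&amp;gt;F_X(x) = \mathbf P_{\mathcal F}(X \leq x)&amp;lt;/math&amp;gt;.&lt;br /&gt;
* Given a distribution &amp;lt;math&amp;gt;\mu&amp;lt;/math&amp;gt;, we can retrieve a random variable with that distribution, but this random variable is not unique. One choice is to take &amp;lt;math&amp;gt;(\mathbf R, \mathcal B, \mu)&amp;lt;/math&amp;gt; as the probability space and let &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt; be the identity map; many other probability spaces and maps give the same distribution. This is why we can say stuff like &amp;quot;let &amp;lt;math&amp;gt;X\sim \mathcal D&amp;lt;/math&amp;gt;&amp;quot; without specifying the sample space.&lt;br /&gt;
* Given a distribution &amp;lt;math&amp;gt;\mu&amp;lt;/math&amp;gt;, we can compute its density function when one exists (that is, when &amp;lt;math&amp;gt;\mu&amp;lt;/math&amp;gt; is absolutely continuous with respect to Lebesgue measure). How? Just find the derivative of &amp;lt;math&amp;gt;x \mapsto \mu((-\infty,x])&amp;lt;/math&amp;gt;, which exists almost everywhere.&lt;br /&gt;
* Given a cumulative distribution function &amp;lt;math&amp;gt;F&amp;lt;/math&amp;gt;, we can construct a random variable with that cumulative distribution function, though again not a unique one. One standard choice is &amp;lt;math&amp;gt;X = F^{-1}(U)&amp;lt;/math&amp;gt;, where &amp;lt;math&amp;gt;U&amp;lt;/math&amp;gt; is uniform on &amp;lt;math&amp;gt;(0,1)&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;F^{-1}(u) = \inf\{x : F(x) \geq u\}&amp;lt;/math&amp;gt; is the generalized inverse.&lt;br /&gt;
* Given a probability density function &amp;lt;math&amp;gt;f&amp;lt;/math&amp;gt;, can we get everything else? Yes, up to the choice of random variable: integrating gives the cdf &amp;lt;math&amp;gt;F(x) = \int_{-\infty}^x f(t)\,dt&amp;lt;/math&amp;gt;, which determines the distribution and hence a (non-unique) random variable.&lt;br /&gt;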
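* Given a cumulative distribution function, how do we get the distribution? We have &amp;lt;math&amp;gt;F_X(x) = \mathbf P_{\mathcal F}(X\leq x) = \mathbf P_{\mathcal B}((-\infty,x])&amp;lt;/math&amp;gt;, which gives the values of the distribution &amp;lt;math&amp;gt;\mathbf P_{\mathcal B}&amp;lt;/math&amp;gt; on half-lines, but &amp;lt;math&amp;gt;\mathcal B&amp;lt;/math&amp;gt; contains many more sets than these. What do we do about the other sets we need to measure? Differences give half-open intervals: &amp;lt;math&amp;gt;F_X(b) - F_X(a) = \mathbf P_{\mathcal F}(a &amp;lt; X\leq b) = \mathbf P_{\mathcal B}((a,b])&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;a \leq b&amp;lt;/math&amp;gt;. Countable disjoint unions and monotone limits of such intervals are handled by countable additivity and continuity of the measure. Since the half-lines &amp;lt;math&amp;gt;(-\infty,x]&amp;lt;/math&amp;gt; form a family closed under finite intersection that generates &amp;lt;math&amp;gt;\mathcal B&amp;lt;/math&amp;gt;, these values determine &amp;lt;math&amp;gt;\mathbf P_{\mathcal B}&amp;lt;/math&amp;gt; uniquely.&lt;br /&gt;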
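&lt;br /&gt;
A worked example may make these dependencies concrete. The choice of distribution is an assumption made purely for illustration: suppose &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt; has the exponential distribution with rate 1.&lt;br /&gt;
&lt;br /&gt;
* The cumulative distribution function is &amp;lt;math&amp;gt;F_X(x) = 1 - e^{-x}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;x \geq 0&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;F_X(x) = 0&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;x &amp;lt; 0&amp;lt;/math&amp;gt;.&lt;br /&gt;
* Differentiating gives the density &amp;lt;math&amp;gt;f_X(x) = e^{-x}&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;x &amp;gt; 0&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;f_X(x) = 0&amp;lt;/math&amp;gt; for &amp;lt;math&amp;gt;x &amp;lt; 0&amp;lt;/math&amp;gt;.&lt;br /&gt;
* For &amp;lt;math&amp;gt;0 \leq a \leq b&amp;lt;/math&amp;gt;, the distribution assigns &amp;lt;math&amp;gt;\mu((a,b]) = F_X(b) - F_X(a) = e^{-a} - e^{-b}&amp;lt;/math&amp;gt;.&lt;br /&gt;
* Going the other way, if &amp;lt;math&amp;gt;U&amp;lt;/math&amp;gt; is uniform on &amp;lt;math&amp;gt;(0,1)&amp;lt;/math&amp;gt;, then for &amp;lt;math&amp;gt;x \geq 0&amp;lt;/math&amp;gt; we have &amp;lt;math&amp;gt;\mathbf P(-\ln(1-U) \leq x) = \mathbf P(U \leq 1 - e^{-x}) = 1 - e^{-x}&amp;lt;/math&amp;gt;, so &amp;lt;math&amp;gt;-\ln(1-U)&amp;lt;/math&amp;gt; is one of many random variables with this distribution.&lt;br /&gt;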
&lt;br /&gt;
==Philosophical details about the sample space==&lt;br /&gt;
&lt;br /&gt;
Given a random variable &amp;lt;math&amp;gt;X : \Omega \to \mathbf R&amp;lt;/math&amp;gt; and any reasonable predicate &amp;lt;math&amp;gt;P&amp;lt;/math&amp;gt; about &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;, we can replace &amp;lt;math&amp;gt;P(X)&amp;lt;/math&amp;gt; with its extension &amp;lt;math&amp;gt;\{\omega \in \Omega : P(X(\omega))\} = \{\omega \in \Omega : X(\omega) \in B\}&amp;lt;/math&amp;gt; for some &amp;lt;math&amp;gt;B \in \mathcal B&amp;lt;/math&amp;gt;. And from then on, we can write &amp;lt;math&amp;gt;\mathbf P_{\mathcal F}(X\in B)&amp;lt;/math&amp;gt; as &amp;lt;math&amp;gt;\mathbf P_{\mathcal F}(X^{-1}(B)) = \mathbf P_{\mathcal B}(B) = \mu(B)&amp;lt;/math&amp;gt;. In other words, we can just work with Borel sets of the reals (measuring them with the distribution) rather than the original events (measuring them with the original probability measure). Where did &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt; go? &amp;lt;math&amp;gt;\mathbf P_{\mathcal F} \circ X^{-1} = \mathbf P_{\mathcal B}&amp;lt;/math&amp;gt;, so you can write &amp;lt;math&amp;gt;\mathbf P_{\mathcal B}&amp;lt;/math&amp;gt; using &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;. But once you already have &amp;lt;math&amp;gt;\mathbf P_{\mathcal B}&amp;lt;/math&amp;gt;, you don&#039;t need to know what &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt; is.&lt;br /&gt;
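&lt;br /&gt;
For a small concrete instance (an illustrative setup assumed here, not taken from the sources below), consider two fair coin flips: &amp;lt;math&amp;gt;\Omega = \{\mathrm{HH}, \mathrm{HT}, \mathrm{TH}, \mathrm{TT}\}&amp;lt;/math&amp;gt;, &amp;lt;math&amp;gt;\mathcal F = 2^{\Omega}&amp;lt;/math&amp;gt;, &amp;lt;math&amp;gt;\mathbf P_{\mathcal F}&amp;lt;/math&amp;gt; uniform, and &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt; the number of heads. Then &amp;lt;math&amp;gt;X^{-1}(\{1\}) = \{\mathrm{HT}, \mathrm{TH}\}&amp;lt;/math&amp;gt;, so &amp;lt;math&amp;gt;\mu(\{1\}) = \mathbf P_{\mathcal F}(\{\mathrm{HT}, \mathrm{TH}\}) = 1/2&amp;lt;/math&amp;gt;, and similarly &amp;lt;math&amp;gt;\mu(\{0\}) = \mu(\{2\}) = 1/4&amp;lt;/math&amp;gt;. The distribution &amp;lt;math&amp;gt;\mu&amp;lt;/math&amp;gt; does not remember that HT and TH were distinct outcomes: any other random variable, on any other sample space, that takes the values 0, 1, 2 with these three probabilities has the same &amp;lt;math&amp;gt;\mu&amp;lt;/math&amp;gt;.&lt;br /&gt;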
&lt;br /&gt;
==See also==&lt;br /&gt;
&lt;br /&gt;
* [[Summary table of multivariable derivatives]]&lt;br /&gt;
* [[Comparison of machine learning textbooks]]&lt;br /&gt;
&lt;br /&gt;
==External links==&lt;br /&gt;
&lt;br /&gt;
* [https://terrytao.wordpress.com/2010/01/01/254a-notes-0-a-review-of-probability-theory/ 254A, Notes 0: A review of probability theory] and [https://terrytao.wordpress.com/2015/09/29/275a-notes-0-foundations-of-probability-theory/ 275A, Notes 0: Foundations of probability theory] by [[wikipedia:Terence Tao|Terence Tao]]&lt;br /&gt;
* [http://dsp.ucsd.edu/~kreutz/PEI-05%20Support%20Files/Basic%20Random%20Variables%20Concepts.pdf Basic Random Variable Concepts] by Kenneth Kreutz-Delgado&lt;br /&gt;
* Various questions on Mathematics Stack Exchange:&lt;br /&gt;
** https://math.stackexchange.com/questions/2233731/discarding-random-variables-in-favor-of-a-domain-less-definition&lt;br /&gt;
** https://math.stackexchange.com/questions/18198/what-are-the-sample-spaces-when-talking-about-continuous-random-variables&lt;br /&gt;
** https://math.stackexchange.com/questions/2233721/the-true-domain-of-random-variables&lt;br /&gt;
** https://math.stackexchange.com/questions/712734/domain-of-a-random-variable-sample-space-or-probability-space&lt;br /&gt;
** https://math.stackexchange.com/questions/23006/the-role-of-the-hidden-probability-space-on-which-random-variables-are-defined&lt;br /&gt;
** https://math.stackexchange.com/questions/1612012/how-should-i-understand-the-probability-space-omega-mathcalf-p-what-d&lt;br /&gt;
** https://math.stackexchange.com/questions/2531810/why-does-probability-theory-insist-on-sample-spaces&lt;br /&gt;
** https://math.stackexchange.com/questions/1690289/what-is-a-probability-distribution&lt;br /&gt;
** https://math.stackexchange.com/questions/1073744/distinguishing-probability-measure-function-and-distribution&lt;br /&gt;
** https://math.stackexchange.com/questions/57027/concept-of-probability-distribution&lt;br /&gt;
* Tim Gowers:&lt;br /&gt;
** https://gowers.wordpress.com/2010/09/01/icm2010-fourth-day/ (search for &amp;quot;random variable&amp;quot;)&lt;br /&gt;
** https://mathoverflow.net/questions/12516/a-random-variable-is-it-a-function-or-an-equivalence-class-of-functions&lt;br /&gt;
&lt;br /&gt;
[[Category:Probability]]&lt;/div&gt;</summary>
		<author><name>IssaRice</name></author>
	</entry>
</feed>