most expositions of the Shapley value SUCK BALLS because they try to sum over the subsets excluding the player in question (usually called "player i"). so here we go, here's a TRUE REDPILLED exposition of the shapley value!

first of all, what's the shapley value even trying to do? once we understand it in words, we can just convert our verbal understanding into symbols. and then we will be done.

...

So, the shapley value is an ''average''. but what kind of average? an ''arithmetic average''. well, an arithmetic average takes a specific form. it looks like this. if you're averaging the elements of some set <math>X</math>, then the arithmetic average <math>\bar{X}</math> is

<math>\bar X = \frac{1}{|X|} \sum_{x\in X} f(x)</math>

We throw in the function f because the elements of X might not be numbers. or even if they ''are'' numbers, you might want to apply some weighting other than the default one (the identity function).

Now, let's take the ugly-ass formula for the shapley value that you always see:

<math>\sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\ (n - |S| - 1)!}{n!} (v(S \cup \{i\}) - v(S))</math>

how is ''that'' supposed to be an average? well first of all, we said above that the shapley value is averaging over all ''sequences'' of ways to add the n players. one way to formalize the concept of a "sequence" or "ordering" is to use permutations. a permutation is just a function that reorders the elements of a set. so each sequence corresponds to a permutation. we can recover a sequence <math>(x_1, x_2, \ldots, x_n)</math> by defining the permutation <math>\sigma(k) := x_k</math>.

So in what sense is the shapley value an average? if <math>N = \{1, \ldots, n\}</math> is the set of players, then we can define the set of all permutations <math>\mathrm{Sym}(N)</math> on <math>N</math>. (This is also denoted as <math>\mathrm{Sym}(n)</math> and called the "symmetric group of degree n" since <math>N = \{1, \ldots, n\}</math> is the "default" set of size n.)

since the shapley value is an ''average'' and we are in particular averaging over all sequences, we want to rewrite the formula as something that looks like:

<math>\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f(\sigma)</math>

And in fact, at this point we know enough to convert our verbal understanding into a formula like the one above.

<math>\varphi_i(v) = \frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} (v(\{k : \sigma(k) < \sigma(i)\} \cup \{i\}) - v(\{k : \sigma(k) < \sigma(i)\}))</math>

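a relevant fact is that the size of <math>\mathrm{Sym}(n)</math> is <math>n!</math>, so the factor in front is just <math>\frac{1}{n!}</math>. (in this formula, <math>\sigma(k)</math> is read as the ''position'' at which player k joins, so <math>\{k : \sigma(k) < \sigma(i)\}</math> is the set of players who joined before player i. that's the inverse of the convention <math>\sigma(k) := x_k</math> used earlier, but since <math>\sigma \mapsto \sigma^{-1}</math> is a bijection from <math>\mathrm{Sym}(n)</math> to itself, the sum over all permutations comes out the same either way.)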
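to sanity-check that the permutation average and the ugly subset formula really are the same number, here is a minimal python sketch. the game <code>v</code> (a three-player "glove game" where player 1 holds a left glove and players 2 and 3 each hold a right glove) is an assumption made up for illustration, as are all the function names. the dictionary <code>sigma</code> maps each player to the position at which they join.

```python
from fractions import Fraction
from itertools import combinations, permutations
from math import factorial

# Hypothetical game, not from the text: player 1 owns a left glove,
# players 2 and 3 each own a right glove; a coalition is worth 1 if it
# can form a pair and 0 otherwise.
players = (1, 2, 3)

def v(coalition):
    return 1 if 1 in coalition and (2 in coalition or 3 in coalition) else 0

def shapley_by_permutations(i):
    """Plain average of player i's marginal contribution over Sym(N)."""
    total = Fraction(0)
    for image in permutations(players):
        sigma = dict(zip(players, image))  # sigma[k] = position of player k
        before = frozenset(k for k in players if sigma[k] < sigma[i])
        total += v(before | {i}) - v(before)
    return total / factorial(len(players))

def shapley_by_subsets(i):
    """The usual weighted sum over subsets S of N minus {i}."""
    n = len(players)
    others = [k for k in players if k != i]
    total = Fraction(0)
    for size in range(n):
        weight = Fraction(factorial(size) * factorial(n - size - 1), factorial(n))
        for subset in combinations(others, size):
            S = frozenset(subset)
            total += weight * (v(S | {i}) - v(S))
    return total

for i in players:
    print(i, shapley_by_permutations(i), shapley_by_subsets(i))
```

the weight <math>\frac{|S|!\ (n-|S|-1)!}{n!}</math> in the subset version is just the fraction of orderings in which exactly the players in S come before player i, which is why the two functions agree.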

There is another way to look at this. or rather, a way to extend our understanding. a common thing done in algebra is to [https://en.wikipedia.org/wiki/Symmetrization#n_variables ''symmetrize''] a function by adding up all the permutations of the variables. a symmetric function is one in which you can interchange any of the variables and the function will stay the same. not in the trivial sense that <math>f(x,y)</math> is "the same" as <math>f(y,x)</math> since x and y are "just variables". but rather, in the sense that <math>f(x,y)=f(y,x)</math>. in symbols:

* we are NOT saying <math>(x,y) \mapsto f(x,y) = (y,x) \mapsto f(y,x)</math>. this is trivially true for all functions!
* but rather: <math>(x,y) \mapsto f(x,y) = (x,y) \mapsto f(y,x)</math>. or in other words: <math>\forall x \forall y [f(x,y) = f(y,x)]</math> (this is not trivially true! it's false for many functions including <math>f(x,y) := x-y</math>)
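a small python sketch of the distinction above, under the assumption that checking a handful of sample points is good enough for illustration (it is a spot check, not a proof). the helper names <code>is_symmetric</code> and <code>symmetrize</code> and the example function <code>lopsided</code> are made up here.

```python
from itertools import permutations

def is_symmetric(f, samples):
    """Spot-check f(x, y) == f(y, x) on a finite list of sample points."""
    return all(f(x, y) == f(y, x) for x in samples for y in samples)

def symmetrize(f, n):
    """Return the function that sums f over every reordering of its n arguments."""
    def g(*xs):
        return sum(f(*(xs[k] for k in sigma)) for sigma in permutations(range(n)))
    return g

def subtract(x, y):
    return x - y

def lopsided(x, y):
    # Hypothetical non-symmetric function, chosen only for illustration.
    return x * x + 3 * y

samples = [0, 1, 2, 5]
print(is_symmetric(subtract, samples))                 # x - y is not symmetric
print(is_symmetric(symmetrize(subtract, 2), samples))  # its symmetrization is
print(symmetrize(lopsided, 2)(1, 2), symmetrize(lopsided, 2)(2, 1))
```

symmetrizing <math>x - y</math> gives the zero function, which is symmetric but boring; symmetrizing <code>lopsided</code> gives <math>x^2 + y^2 + 3x + 3y</math>.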

in the case of the shapley value, the "marginal contribution" function is NOT symmetric. so the naive fix that we would hope would work is to symmetrize it by adding all the possible permutations of the variables, forming a new function.

wait, what? what even ''is'' the "marginal contribution function"?? for a player i of interest, it's the function that gives player i's marginal contribution, given an arbitrary sequence of players as input. let's say we are given a sequence <math>(x_1, x_2, \ldots, x_n)</math>. what's player i's marginal contribution in this sequence? well, if x1 = i, then player i is the first player to join, so the marginal contribution is <math>v(\{i\}) - v(\emptyset) = v(\{i\})</math>. if x2=i, then the marginal contribution of player i is <math>v(\{x_1, i\}) - v(\{x_1\})</math>. and so on. in general, if <math>x_j = i</math> then player i's marginal contribution is <math>v(\{x_1, \ldots, x_{j-1}, i\}) - v(\{x_1, \ldots, x_{j-1}\})</math>.

as i said, this function, which we can call <math>f_i</math>, is not symmetric. but we can symmetrize <math>f_i</math> by adding up all the possible orderings of the input variables:

<math>\sum_{\sigma \in \mathrm{Sym}(n)} f_i(x_{\sigma(1)}, \ldots, x_{\sigma(n)})</math>

given a permutation <math>\sigma \in \mathrm{Sym}(n)</math> and a function <math>f : X^n \to \mathbf R</math>, we can define the permutation of the function <math>\sigma^* : (X^n \to \mathbf R) \to X^n \to \mathbf R</math> by:

<math>\sigma^*(f) := (x_1, \ldots, x_n) \mapsto f(x_{\sigma(1)}, \ldots, x_{\sigma(n)})</math>

By an abuse of notation, we can drop the star in <math>\sigma^*</math> and just call the resulting extension <math>\sigma</math>.

the Shapley value is <math>\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f_i(\sigma(1), \ldots, \sigma(n))</math>
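here is a minimal python sketch of the marginal contribution function <math>f_i</math> and of the average over all orderings. the game (players 1, 2, 3, where a coalition is worth the square of the sum of its members) is a hypothetical example and the function names are made up.

```python
from fractions import Fraction
from itertools import permutations

players = (1, 2, 3)

def v(coalition):
    # Hypothetical characteristic function: square of the sum of the members.
    return sum(coalition) ** 2

def marginal_contribution(i, sequence):
    """f_i: what player i adds when the players join in the order `sequence`."""
    j = sequence.index(i)
    before = frozenset(sequence[:j])
    return v(before | {i}) - v(before)

def shapley(i):
    """Average f_i over every ordering of the players."""
    orderings = list(permutations(players))
    return Fraction(sum(marginal_contribution(i, s) for s in orderings), len(orderings))

# f_1 is not symmetric: swapping the first two arguments changes its value.
print(marginal_contribution(1, (1, 2, 3)), marginal_contribution(1, (2, 1, 3)))
print([shapley(i) for i in players])
```

in this example the three averages add up to <math>v(N) = 36</math>, as they should, since in every single ordering the marginal contributions telescope to <math>v(N)</math>.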

User:IssaRice/Shapley value

2023-04-08T20:48:22Z

IssaRice:

User:IssaRice/Lebesgue theory

User:IssaRice/Shapley value

2023-04-08T04:10:07Z

IssaRice:

.

the Shapley value is <math>\frac{1}{|\mathrm{Sym}(n)|} \sum_{\sigma \in \mathrm{Sym}(n)} f_i(\sigma(1), \ldots, \sigma(n))</math>

User:IssaRice/Shapley value

2023-04-08T04:04:56Z

IssaRice: Created page with "."

User:IssaRice/Linear algebra/Singular value decomposition

2022-10-01T21:10:52Z

IssaRice:

the stupid textbooks don't tell you anything about SVD!!!! i think it's super helpful to look at all the ''wrong'' things one might say about SVD... we need to un-knot all those wrong intuitions. i'll list some knots that i have had.

starting at this image: https://en.wikipedia.org/wiki/File:Singular-Value-Decomposition.svg

* if A is an invertible matrix, then <math>A = E_1 \cdots E_m</math> for some elementary matrices <math>E_1,\ldots,E_m</math>. Swapping elementary matrices are orthogonal (they're permutation matrices), and dilations are diagonal. So we can write A as a product of orthogonal, diagonal, and shear matrices (the product of two orthogonal matrices is again orthogonal, and the product of two diagonal matrices is again diagonal). If we can prove SVD for shears, we can convert this to an alternating product of orthogonal and ''diagonal'' matrices. unfortunately, this doesn't seem to lead to a full proof of SVD (unless orthogonal and diagonal matrices somehow commute).
* one question one might have is, to get the behavior of M in the linked image, can't we just squish along the standard basis directions, then rotate? surely this would produce the same ellipse. And it would seem that we've only required one rotation, instead of the two in SVD. That's true, but pay attention to where the basis vectors went. A squish followed by a rotation... would preserve orthogonality. But in M it is clear that these basis vectors are no longer orthogonal. So even though we have faithfully preserved the ellipse, we don't have the same transformation. i.e. <math>M(\{v : \|v\| = 1\}) = M'(\{v : \|v\|=1\})</math> need not imply <math>M=M'</math>, apparently. This must be an artifact of the fact that a circle is an extremely symmetric shape, so lots of non-identical transformations can still produce the same image of a circle. I think if we started out with a square, we would not have the same image if we just instead stretched and then did a rotation (actually, maybe a square too is still too symmetric; see example [https://youtu.be/vSczTbgc8Rc?list=PLnQX-jgAF5pTZXPiD8ciEARRylD9brJXU&t=679 here]).
* (polar decomposition.) In the linked image, look at the axes of the final ellipse, labeled <math>\sigma_1</math> and <math>\sigma_2</math>. Call those vectors <math>u_1</math> and <math>u_2</math>. So <math>u_1 = Mv_1</math> and <math>u_2 = Mv_2</math> for some vectors v1 and v2. Now, backtrack along the arrows, starting from the final image, going through U, then Sigma, then V*. Pay attention to what it does to u1 and u2. In each step, the vectors remain orthogonal. So not only are u1 and u2 orthogonal, we must have that v1 and v2 are orthogonal. So now, couldn't we say, "take v1 and v2, squish along those axes. then rotate." That seems to have required only one rotation. What's going on? The problem is that a diagonal matrix can only stretch along the standard basis. So "stretch along v1 and v2" can't be done via a diagonal matrix (unless v1 and v2 are the standard basis, of course). Let's say <math>M = RD</math> where R is a rotation, and D is "stretch along v1 and v2". So <math>Dv_1 = \lambda_1 v_1</math> and <math>Dv_2 = \lambda_2 v_2</math>. Now, D is not a diagonal matrix, when viewed in the standard basis, but it ''is'' a diagonal matrix when viewed under the basis v1,v2. To get to the standard basis, we need to convert the incoming vectors like v1 into e1, then apply the stretching, then reconvert back to v1. In other words, we want to show <math>D = U\Sigma U^*</math> for some diagonal matrix Sigma and orthogonal U. Just take <math>Ue_j=v_j</math>, i.e. the matrix U is the matrix with columns [v1 v2]. Now, <math>Dv_j = U\Sigma U^* v_j = U\Sigma e_j = U\lambda_j e_j = \lambda_j v_j</math> just like we wanted. So <math>M = RD = RU\Sigma U^*</math>. Of course, RU is another orthogonal matrix, so we recover SVD.
** also, if any of the entries of <math>\Sigma</math> are negative, we could have chosen a different v vector that would have made it positive, so we can assume that D is a positive operator (positive semi-definite).
* another question is: if we squish along the right orthogonal directions, can't we get away with not needing the extra rotation? after all, an ellipse can be squished into a circle without any rotations. what must be the case, although i can't explain it visually yet, is that if we do just that, then the standard basis vectors (yellow and pink) get mapped to the wrong spots. this ''might'' be an artifact of shears (the wikipedia SVD image is a shear). clearer to look at michael nielsen's [http://cognitivemedium.com/emm/images/tangent_definition.png image]. here, if we start with the ellipse and shrink along Ms and stretch along Mt, we do get a circle. but Ms doesn't go back to s, and Mt doesn't go back to t; for that we'll need the extra rotation.
* [https://math.stackexchange.com/questions/2899052/singular-value-decomposition-reconciling-the-maximal-stretching-and-spectral my old question]:
** what is this <math>\sqrt{T^*T}</math> that axler keeps talking about? that's the "stretch along well-chosen orthogonal directions" operation that we start out with in polar decomposition.
** for proof (1), see [http://cognitivemedium.com/emm/emm.html michael nielsen]. basically, the maximal stretching direction has a tangent vector (on the ellipse) that is orthogonal to it, because if it ''wasn't'' orthogonal, then we could get an even more stretched out vector. the other piece that's required is that linear maps preserve tangency. i.e. if v(t) is a parametrization of a circle, and M is a matrix, then M(v(t)) traces out an ellipse as t varies. (i'm using t as a parameter even though nielsen uses it as a vector. seriously, who the heck uses t for a vector??) the tangent vector on the circle at v(t) is v'(t). this tangent vector gets mapped to M(v'(t)). the tangent vector at M(v(t)) on the ellipse is <math display=inline>\frac{d}{dt} M(v(t))</math>. now, by linearity of M and the definition of the derivative, we can basically "pull out" the M and see that <math display=inline>\frac{d}{dt} M(v(t)) = M(\frac{d}{dt} v(t))</math>.<ref group=note><math>\frac{d}{dt} M(v(t)) = \lim_{h\to0} \frac{Mv(t+h) - Mv(t)}{h} = \lim_{h\to0} \frac1h M(v(t+h)-v(t)) = \lim_{h\to0}M(\frac1h (v(t+h)-v(t))) = M(\frac{d}{dt} v(t))</math>. You might also want to play around with an example like <math>\begin{pmatrix}3 & 0\\ 0 & 4\end{pmatrix}</math>, which takes (cos t, sin t) to (3cos t, 4sin t). The tangent at the original point is (-sin t, cos t). The tangent at the image is (-3sin t, 4cos t), which is equal to the image of the tangent.</ref> what this means is that if you have a point on the circle and its tangent, then you map both of them under M, then the tangent of the image of the point is the image of the tangent at the point.<ref group=note> I think another way to see this is via uniqueness of taylor approximations? like if v is a point on the circle, and u is the tangent vector at v, then points near v can be written as <math>v + \Delta u + O(\Delta^2)</math>, and if we apply M to those points, we get <math>Mv + \Delta Mu + O(\Delta^2)</math>. if taylor approximations are unique, then the fact that the term linear in <math>\Delta</math> has Mu means that Mu must be tangent at Mv.</ref> what this implies is that for our maximal stretch vector, since the tangent on the circle is orthogonal, the image of that tangent is also a tangent at the new place on the ellipse, and we already know that the tangent is orthogonal for the maximal stretch vector.
** so how does (2) find the same basis without talking about "maximal stretching"? well, in (2), <math>\sqrt{T^*T}</math> ''means'' "stretch along well-chosen orthogonal directions" -- it's the positive operator that appears in polar decomposition. and if we stretch along orthogonal directions, then surely one of them has to be the maximal stretching direction (rather than, say, some direction intermediate between two of the axes).
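the "squish then rotate gives the same ellipse but a different map" knot from the list above can be checked numerically. this is a minimal sketch in plain python for the 2×2 case; the particular angles and singular values (3 and 1) are arbitrary assumptions, and the helper names are made up. <code>M</code> is built as <math>U\Sigma V^*</math> and <code>M_prime</code> is the one-rotation version <math>U\Sigma</math>.

```python
import math

def rot(theta):
    """2x2 rotation matrix by angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(A):
    return [[A[j][i] for j in range(2)] for i in range(2)]

def apply(A, x):
    return [A[0][0] * x[0] + A[0][1] * x[1], A[1][0] * x[0] + A[1][1] * x[1]]

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

def column(A, j):
    return [A[0][j], A[1][j]]

# Arbitrary choices for illustration.
U = rot(math.pi / 6)
Sigma = [[3.0, 0.0], [0.0, 1.0]]
V = rot(math.pi / 4)

M_prime = matmul(U, Sigma)         # squish along the standard basis, then rotate
M = matmul(M_prime, transpose(V))  # the full U Sigma V*

def on_ellipse(w):
    """Undo U, then undo Sigma; w is on the ellipse iff we land on the unit circle."""
    y = apply(transpose(U), w)
    return math.isclose((y[0] / 3.0) ** 2 + (y[1] / 1.0) ** 2, 1.0, abs_tol=1e-9)

circle = [[math.cos(2 * math.pi * k / 100), math.sin(2 * math.pi * k / 100)]
          for k in range(100)]
same_image = all(on_ellipse(apply(M, x)) and on_ellipse(apply(M_prime, x))
                 for x in circle)

# Columns are the images of the standard basis vectors e1 and e2.
print(same_image)
print(dot(column(M_prime, 0), column(M_prime, 1)))  # approximately 0
print(dot(column(M, 0), column(M, 1)))              # clearly nonzero
```

both maps send the unit circle onto the same ellipse, but only <code>M_prime</code> keeps the images of the standard basis vectors orthogonal, so the two maps are different.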

==See also==

* https://machinelearning.subwiki.org/wiki/User:IssaRice/Linear_algebra/Classification_of_operators -- performing SVD on some nicer operators allows you to skip some of the steps, resulting in a simpler decomposition.

==Footnotes==

<references group="note"/>

User:IssaRice/Linear algebra/Singular value decomposition

2022-10-01T21:08:40Z

IssaRice:

the stupid textbooks don't tell you anything about SVD!!!! i think it's super helpful to look at all the ''wrong'' things one might say about SVD... we need to un-knot all those wrong intuitions. i'll list some knots that i have had.

starting at this image: https://en.wikipedia.org/wiki/File:Singular-Value-Decomposition.svg

* if A is an invertible matrix, then <math>A = E_1 \cdots E_m</math> for some elementary matrices <math>E_1,\ldots,E_m</math>. Dilations and swapping elementary matrices obviously involve only orthogonal operations. So we can write A as an alternating product of orthogonal and shear matrices (the product of two orthogonal matrices is again orthogonal. right???). If we can prove SVD for shears, we can convert this to an alternating product of orthogonal and ''diagonal'' matrices. unfortunately, this doesn't seem to lead to a full proof of SVD (unless orthogonal and diagonal matrices somehow commute).
* one question one might have is, to get the behavior of M in the linked image, can't we just squish along the standard basis directions, then rotate? surely this would produce the same ellipse. And it would seem that we've only required one rotation, instead of the two in SVD. That's true, but pay attention to where the basis vectors went. A squish followed by a rotation... would preserve orthogonality. But in M it is clear that these basis vectors are no longer orthogonal. So even though we have faithfully preserved the ellipse, we don't have the same transformation. i.e. <math>M(\{v : \|v\| = 1\}) = M'(\{v : \|v\|=1\})</math> need not imply <math>M=M'</math>, apparently. This must be an artifact of the fact that a circle is an extremely symmetric shape, so lots of non-identical transformations can still produce the same image of a circle. I think if we started out with a square, we would not have the same image if we just instead stretched and then did a rotation.
* (polar decomposition.) In the linked image, look at the axes of the final ellipse, labeled <math>\sigma_1</math> and <math>\sigma_2</math>. Call those vectors <math>u_1</math> and <math>u_2</math>. So <math>u_1 = Mv_1</math> and <math>u_2 = Mv_2</math> for some vectors v1 and v2. Now, backtrack along the arrows, starting from the final image, going through U, then Sigma, then V*. Pay attention to what it does to u1 and u2. In each step, the vectors remain orthogonal. So not only are u1 and u2 orthogonal, we must have that v1 and v2 are orthogonal. So now, couldn't we say, "take v1 and v2, squish along those axes. then rotate." That seems to have required only one rotation. What's going on? The problem is that a diagonal matrix can only stretch along the standard basis. So "stretch along v1 and v2" can't be done via a diagonal matrix (unless v1 and v2 are the standard basis, of course). Let's say <math>M = RD</math> where R is a rotation, and D is "stretch along v1 and v2". So <math>Dv_1 = \lambda_1 v_1</math> and <math>Dv_2 = \lambda_2 v_2</math>. Now, D is not a diagonal matrix, when viewed in the standard basis, but it ''is'' a diagonal matrix when viewed under the basis v1,v2. To get to the standard basis, we need to convert the incoming vectors like v1 into e1, then apply the stretching, then reconvert back to v1. In other words, we want to show <math>D = U\Sigma U^*</math> for some diagonal matrix Sigma and orthogonal U. Just take <math>Ue_j=v_j</math>, i.e. the matrix U is the matrix with columns [v1 v2]. Now, <math>Dv_j = U\Sigma U^* v_j = U\Sigma e_j = U\lambda_j e_j = \lambda_j v_j</math> just like we wanted. So <math>M = RD = RU\Sigma U^*</math>. Of course, RU is another orthogonal matrix, so we recover SVD.
** also, if any of the entries of <math>\Sigma</math> are negative, we could have chosen a different v vector that would have made it positive, so we can assume that D is a positive operator (positive semi-definite).
* another question is: if we squish along the right orthogonal directions, can't we get away with not needing the extra rotation? after all, an ellipse can be squished into a circle without any rotations. what must be the case, although i can't explain it visually yet, is that if we do just that, then the standard basis vectors (yellow and pink) get mapped to the wrong spots. this ''might'' be an artifact of shears (the wikipedia SVD image is a shear). clearer to look at michael nielsen's [http://cognitivemedium.com/emm/images/tangent_definition.png image]. here, if we start with the ellipse and shrink along Ms and stretch along Mt, we do get a circle. but Ms doesn't go back to s, and Mt doesn't go back to t; for that we'll need the extra rotation.
* [https://math.stackexchange.com/questions/2899052/singular-value-decomposition-reconciling-the-maximal-stretching-and-spectral my old question]:
** what is this <math>\sqrt{T^*T}</math> that axler keeps talking about? that's the "stretch along well-chosen orthogonal directions" operation that we start out with in polar decomposition.
** for proof (1), see [http://cognitivemedium.com/emm/emm.html michael nielsen]. basically, the maximal stretching direction has a tangent vector (on the ellipse) that is orthogonal to it, because if it ''wasn't'' orthogonal, then we could get an even more stretched out vector. the other piece that's required is that linear maps preserve tangency. i.e. if v(t) is a parametrization of a circle, and M is a matrix, then M(v(t)) traces out an ellipse as t varies. (i'm using t as a parameter even though nielsen uses it as a vector. seriously, who the heck uses t for a vector??) the tangent vector on the circle at v(t) is v'(t). this tangent vector gets mapped to M(v'(t)). the tangent vector at M(v(t)) on the ellipse is <math display=inline>\frac{d}{dt} M(v(t))</math>. now, by linearity of M and the definition of the derivative, we can basically "pull out" the M and see that <math display=inline>\frac{d}{dt} M(v(t)) = M(\frac{d}{dt} v(t))</math>.<ref group=note><math>\frac{d}{dt} M(v(t)) = \lim_{h\to0} \frac{Mv(t+h) - Mv(t)}{h} = \lim_{h\to0} \frac1h M(v(t+h)-v(t)) = \lim_{h\to0}M(\frac1h (v(t+h)-v(t))) = M(\frac{d}{dt} v(t))</math>. You might also want to play around with an example like <math>\begin{pmatrix}3 & 0\\ 0 & 4\end{pmatrix}</math>, which takes (cos t, sin t) to (3cos t, 4sin t). The tangent at the original point is (-sin t, cos t). The tangent at the image is (-3sin t, 4cos t), which is equal to the image of the tangent.</ref> what this means is that if you have a point on the circle and its tangent, then you map both of them under M, then the tangent of the image of the point is the image of the tangent at the point.<ref group=note> I think another way to see this is is via uniqueness of taylor approximations? like if v is a point on the circle, and u is the tangent vector at v, then points near v can be written as <math>v + \Delta u + O(\Delta^2)</math>, and if we apply M to those points, we get <math>Mv + \Delta Mu + O(\Delta^2)</math>. 
if taylor approximations are unique, then the fact that the term linear in <math>\Delta</math> has Mu means that Mu must be tangent at Mv.</ref> what this implies is that for our maximal stretch vector, since the tangent on the circle is orthogonal, the image of that tangent is also a tangent at the new place on the ellipse, and we already know that the tangent is orthogonal for the maximal stretch vector.
** so how does (2) find the same basis without talking about "maximal stretching"? well, in (2), <math>\sqrt{T^*T}</math> ''means'' "stretch along well-chosen orthogonal directions" -- it's the positive operator that appears in polar decomposition. and if we stretch along orthogonal directions, then surely one of them has to be the maximal stretching direction (rather than, say, some direction intermediate between two of the axes).

==See also==

* https://machinelearning.subwiki.org/wiki/User:IssaRice/Linear_algebra/Classification_of_operators -- performing SVD on some nicer operators allows you to skip some of the steps, resulting in a simpler decomposition.

==Footnotes==

<references group="note"/>

Summary table of probability terms

2022-07-14T18:16:48Z

IssaRice: /* Dependencies */

This page is a '''summary table of probability terms'''.

==Table==

{| class="sortable wikitable"
! Term !! Notation !! Type !! Definition !! Notes
|-
| Reals || <math>\mathbf R</math> || ||
|-
| Borel subsets of the reals || <math>\mathcal B</math> || ||
|-
| A Borel set || <math>B</math> || <math>\mathcal B</math> ||
|-
| [[Sample space]] || <math>\Omega</math> || ||
|-
| Outcome || <math>\omega</math> || <math>\Omega</math> ||
|-
| Events or measurable sets || <math>\mathcal F</math> || ||
|-
| Probability measure || <math>\mathbf P</math> or <math>\Pr</math> or <math>\mathbf P_{\mathcal F}</math> || <math>\mathcal F \to [0,1]</math> ||
|-
| Probability triple or probability space || <math>(\Omega, \mathcal F, \mathbf P)</math> || ||
|-
| Distribution || <math>\mu</math> or <math>\mathcal D</math> or <math>D</math> or <math>\mathbf P_{\mathcal B}</math> or <math>\mathcal L(X)</math> or <math>\mathbf P X^{-1}</math> || <math>\mathcal B \to [0,1]</math> || <math>B \mapsto \mathbf P(X \in B)</math>
|-
| Induced probability space || <math>(\mathbf R, \mathcal B, \mu)</math> || ||
|-
| Cumulative distribution function or CDF || <math>F_X</math> || <math>\mathbf R \to [0,1]</math> ||
|-
| Probability density function or PDF || <math>f_X</math> || <math>\mathbf R \to [0,\infty)</math> ||
|-
| [[Random variable]] || <math>X</math> || <math>\Omega \to \mathbf R</math> ||
|-
| Preimage of random variable || <math>X^{-1}</math> || <math>2^{\mathbf R} \to 2^{\Omega}</math> but all we need is <math>\mathcal B \to \mathcal F</math> ||
|-
| Indicator of <math>A</math> || <math>1_A</math> || <math>\Omega \to \{0,1\}</math> || <math>1_A(\omega) = \begin{cases}1 & \omega\in A \\ 0 & \omega \not\in A\end{cases}</math>
|-
| [[Expectation]] || <math>\mathbf E</math> or <math>\mathrm E</math> || <math>(\Omega \to \mathbf R) \to \mathbf R</math> ||
|-
| || <math>X \in B</math> || <math>\mathcal F</math> || <math>\{\omega \in \Omega : X(\omega) \in B\}</math>
|-
| || <math>X=x</math> || <math>\mathcal F</math> || <math>\{\omega \in \Omega : X(\omega) = x\}</math>
|-
| || <math>X\leq x</math> || <math>\mathcal F</math> || <math>\{\omega \in \Omega : X(\omega) \leq x\}</math>
|-
| Function of a random variable, where <math>f\colon \mathbf R \to \mathbf R</math> || <math>f(X)</math> || <math>\Omega \to \mathbf R</math> || <math>f\circ X</math> ||
|-
| [[Expected value]] of <math>X</math> || <math>\mathbf E(X)</math> || <math>\mathbf R</math> ||
|-
| || <math>\mathbf E(X\mid Y=y)</math> || <math>\mathbf R</math> ||
|-
| || <math>\mathbf E(X\mid Y)</math> || <math>\Omega \to \mathbf R</math> || <math>\omega \mapsto \mathbf E(X\mid Y=Y(\omega))</math>?
|-
| Utility function || <math>u</math> || <math>\mathbf R \to \mathbf R</math> || || I ''think'' this is what the type must be, based on how it's used. But we usually think of the utility function as assigning numbers to outcomes; but if that is so, it must be a random variable! What's up with that? (2022-07-14: I think in probability theory, we usually discuss only real random variables, since that allows us to do a lot more with them like take expected value. But in fields like AI, we consider more general random variables <math>\Omega \to \mathcal O</math> that take values in some space of outcomes <math>\mathcal O</math>. We can't "average over" outcomes so we can't really take expected values anymore, but this allows us to make the utility function more general so we get <math>u : \mathcal O \to \mathbf R</math>.)
|-
| Expected utility of <math>X</math> || <math>\mathbf{EU}(X)</math> || <math>\mathbf R</math> || <math>\mathbf E(u(X))</math> || <math>u\circ X</math> is indeed a random variable, so the type check passes.
|}
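The type check for expected utility can be played out on a finite sample space. The following is a minimal Python sketch, not a general implementation: the sample space (two coin flips), the random variable (number of heads), and the utility function are all assumptions chosen for illustration.

```python
from fractions import Fraction

# Hypothetical finite probability space: two fair coin flips.
omega = ["HH", "HT", "TH", "TT"]
P = {w: Fraction(1, 4) for w in omega}

def X(w):
    """Random variable Omega -> R: the number of heads."""
    return w.count("H")

def u(x):
    """Hypothetical utility function R -> R."""
    return x ** 2

def compose(f, Y):
    """f o Y, which again has type Omega -> R."""
    return lambda w: f(Y(w))

def E(Y):
    """Expectation (Omega -> R) -> R on a finite sample space."""
    return sum(Y(w) * P[w] for w in omega)

expected_value = E(X)
expected_utility = E(compose(u, X))
print(expected_value, expected_utility, u(expected_value))
```

Because <code>compose(u, X)</code> is again a function on the sample space, it can be fed to <code>E</code> directly. In this example <math>\mathbf E(u(X))</math> and <math>u(\mathbf E(X))</math> come out different, so the order of operations matters.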

All the utility stuff isn't really related to machine learning. It's more related to the decision theory stuff I'm learning. I'm putting it here for now for convenience but might move it later.

TODO add "probability distribution over S" and "probability distribution on A" [https://arxiv.org/pdf/1711.00363.pdf]

Li and Vitanyi (''An Introduction to Kolmogorov Complexity and Its Applications'', p. 19) call the probability measure on <math>\mathcal F</math> a probability distribution over S (the sample space).

TODO: add probability mass function (defined only for discrete random variables)

==Dependencies==

Let <math>(\Omega, \mathcal F, \mathbf P)</math> be a probability space.

* Given a random variable <math>X</math>, we can compute its distribution <math>\mu</math>. How? Just let <math>\mu(B) = \mathbf P_{\mathcal F}(X \in B)</math>
* Given a random variable, we can compute the probability density function. How?
* Given a random variable, we can compute the cumulative distribution function. How?
* Given a distribution, we can retrieve a random variable. But this random variable is not unique? This is why we can say stuff like "let <math>X\sim \mathcal D</math>".
* Given a distribution <math>\mu</math>, we can compute its density function. How? Just find the derivative of <math>\mu((-\infty,x])</math>. (?) (2022-07-14: something something Radon–Nikodym theorem...)
* Given a cumulative distribution function, we can compute the random variable. (Right?) (2022-07-14: but a CDF is like a distribution, so the random variable won't be unique.)
* Given a probability density function, can we get everything else? Don't we just have to integrate to get the cdf, which gets us the random variable and the distribution?
* Given a cumulative distribution function, how do we get the distribution? We have <math>F_X(x) = \mathbf P_{\mathcal F}(X\leq x) = \mathbf P_{\mathcal B}((-\infty,x])</math>, which gets us some of what the distribution <math>\mathbf P_{\mathcal B}</math> maps to, but <math>\mathcal B</math> is bigger than this. What do we do about the other values we need to map? We can compute half-open intervals like <math>F_X(b) - F_X(a) = \mathbf P_{\mathcal F}(a < X\leq b) = \mathbf P_{\mathcal B}((a,b])</math>. And we can apparently do the same for unions and limiting operations.
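
Several of the dependencies above can be checked by hand on a finite sample space. The following is a minimal Python sketch under stated assumptions: the sample space is a fair die, the random variable is the face value mod 3, and finite sets of numbers stand in for Borel sets. All names are made up for illustration.

```python
from fractions import Fraction

# Hypothetical probability space: one roll of a fair die.
omega = [1, 2, 3, 4, 5, 6]
P = {w: Fraction(1, 6) for w in omega}

def X(w):
    """Random variable Omega -> R (here: the face value mod 3)."""
    return w % 3

def prob(event):
    """The probability measure on events (subsets of omega)."""
    return sum((P[w] for w in event), Fraction(0))

def mu(B):
    """Distribution of X: B |-> P(X in B), computed via the preimage of B."""
    return prob({w for w in omega if X(w) in B})

def F(x):
    """CDF of X: x |-> P(X <= x)."""
    return prob({w for w in omega if X(w) <= x})

print(mu({1}), F(1))
print(F(2) - F(0), mu({1, 2}), mu({0, 1, 2}))
```

Note that <math>F_X(2) - F_X(0)</math> matches the measure of <math>\{1, 2\}</math> (the values in the half-open interval <math>(0, 2]</math>) and not the measure of <math>\{0, 1, 2\}</math>.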

==Philosophical details about the sample space==

Given a random variable <math>X : \Omega \to \mathbf R</math> and any reasonable predicate <math>P</math> about <math>X</math>, we can replace <math>P(X)</math> with its extension <math>\{\omega \in \Omega : P(X(\omega))\} = \{\omega \in \Omega : X(\omega) \in B\}</math> for some <math>B \in \mathcal B</math>. And from then on, we can write <math>\mathbf P_{\mathcal F}(X\in B)</math> as <math>\mathbf P_{\mathcal F}(X^{-1}(B)) = \mathbf P_{\mathcal B}(B) = \mu(B)</math>. In other words, we can just work with Borel sets of the reals (measuring them with the distribution) rather than the original events (measuring them with the original probability measure). Where did <math>X</math> go? <math>\mathbf P_{\mathcal F} \circ X^{-1} = \mathbf P_{\mathcal B}</math>, so you can write <math>\mathbf P_{\mathcal B}</math> using <math>X</math>. But once you already have <math>\mathbf P_{\mathcal B}</math>, you don't need to know what <math>X</math> is.
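
The claim that the distribution forgets the random variable can be illustrated with a minimal Python sketch. The two sample spaces (a fair coin and a fair die) and the three random variables below are assumptions chosen for illustration; they are different functions, on different domains in one case, yet they push forward to the same distribution.

```python
from collections import defaultdict
from fractions import Fraction

def distribution(space, X):
    """Pushforward of the probability measure along X: value x |-> P(X = x)."""
    mu = defaultdict(Fraction)
    for w, p in space.items():
        mu[X(w)] += p
    return dict(mu)

# Two hypothetical probability spaces.
coin = {"H": Fraction(1, 2), "T": Fraction(1, 2)}
die = {w: Fraction(1, 6) for w in range(1, 7)}

def heads(w):
    return 1 if w == "H" else 0

def odd(w):
    return w % 2

def low(w):
    return 1 if w <= 3 else 0

print(distribution(coin, heads))
print(distribution(die, odd))
print(distribution(die, low))
```

All three calls produce the same distribution, which is one way to read statements like "let <math>X \sim \mathcal D</math>": only the distribution is pinned down, not the underlying function or sample space.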

==See also==

* [[Summary table of multivariable derivatives]]
* [[Comparison of machine learning textbooks]]

==External links==

* [https://terrytao.wordpress.com/2010/01/01/254a-notes-0-a-review-of-probability-theory/ 254A, Notes 0: A review of probability theory] and [https://terrytao.wordpress.com/2015/09/29/275a-notes-0-foundations-of-probability-theory/ 275A, Notes 0: Foundations of probability theory] by [[wikipedia:Terence Tao|Terence Tao]]
* [http://dsp.ucsd.edu/~kreutz/PEI-05%20Support%20Files/Basic%20Random%20Variables%20Concepts.pdf Basic Random Variable Concepts] by Kenneth Kreutz-Delgado
* Various questions on Mathematics Stack Exchange:
** https://math.stackexchange.com/questions/2233731/discarding-random-variables-in-favor-of-a-domain-less-definition
** https://math.stackexchange.com/questions/18198/what-are-the-sample-spaces-when-talking-about-continuous-random-variables
** https://math.stackexchange.com/questions/2233721/the-true-domain-of-random-variables
** https://math.stackexchange.com/questions/712734/domain-of-a-random-variable-sample-space-or-probability-space
** https://math.stackexchange.com/questions/23006/the-role-of-the-hidden-probability-space-on-which-random-variables-are-defined
** https://math.stackexchange.com/questions/1612012/how-should-i-understand-the-probability-space-omega-mathcalf-p-what-d
** https://math.stackexchange.com/questions/2531810/why-does-probability-theory-insist-on-sample-spaces
** https://math.stackexchange.com/questions/1690289/what-is-a-probability-distribution
** https://math.stackexchange.com/questions/1073744/distinguishing-probability-measure-function-and-distribution
** https://math.stackexchange.com/questions/57027/concept-of-probability-distribution
* Tim Gowers:
** https://gowers.wordpress.com/2010/09/01/icm2010-fourth-day/ (search for "random variable")
** https://mathoverflow.net/questions/12516/a-random-variable-is-it-a-function-or-an-equivalence-class-of-functions

[[Category:Probability]]

Summary table of probability terms

2022-07-14T18:15:36Z

IssaRice: /* Dependencies */

This page is a '''summary table of probability terms'''.

==Table==

{| class="sortable wikitable"
! Term !! Notation !! Type !! Definition !! Notes
|-
| Reals || <math>\mathbf R</math> || ||
|-
| Borel subsets of the reals || <math>\mathcal B</math> || ||
|-
| A Borel set || <math>B</math> || <math>\mathcal B</math> ||
|-
| [[Sample space]] || <math>\Omega</math> || ||
|-
| Outcome || <math>\omega</math> || <math>\Omega</math> ||
|-
| Events or measurable sets || <math>\mathcal F</math> || ||
|-
| Probability measure || <math>\mathbf P</math> or <math>\Pr</math> or <math>\mathbf P_{\mathcal F}</math> || <math>\mathcal F \to [0,1]</math> ||
|-
| Probability triple or probability space || <math>(\Omega, \mathcal F, \mathbf P)</math> || ||
|-
| Distribution || <math>\mu</math> or <math>\mathcal D</math> or <math>D</math> or <math>\mathbf P_{\mathcal B}</math> or <math>\mathcal L(X)</math> or <math>\mathbf P X^{-1}</math> || <math>\mathcal B \to [0,1]</math> || <math>B \mapsto \mathbf P(X \in B)</math>
|-
| Induced probability space || <math>(\mathbf R, \mathcal B, \mu)</math> || ||
|-
| Cumulative distribution function or CDF || <math>F_X</math> || <math>\mathbf R \to [0,1]</math> ||
|-
| Probability density function or PDF || <math>f_X</math> || <math>\mathbf R \to [0,\infty)</math> ||
|-
| [[Random variable]] || <math>X</math> || <math>\Omega \to \mathbf R</math> ||
|-
| Preimage of random variable || <math>X^{-1}</math> || <math>2^{\mathbf R} \to 2^{\Omega}</math> but all we need is <math>\mathcal B \to \mathcal F</math> ||
|-
| Indicator of <math>A</math> || <math>1_A</math> || <math>\Omega \to \{0,1\}</math> || <math>1_A(\omega) = \begin{cases}1 & \omega\in A \\ 0 & \omega \not\in A\end{cases}</math>
|-
| [[Expectation]] || <math>\mathbf E</math> or <math>\mathrm E</math> || <math>(\Omega \to \mathbf R) \to \mathbf R</math> ||
|-
| || <math>X \in B</math> || <math>\mathcal F</math> || <math>\{\omega \in \Omega : X(\omega) \in B\}</math>
|-
| || <math>X=x</math> || <math>\mathcal F</math> || <math>\{\omega \in \Omega : X(\omega) = x\}</math>
|-
| || <math>X\leq x</math> || <math>\mathcal F</math> || <math>\{\omega \in \Omega : X(\omega) \leq x\}</math>
|-
| Function of a random variable, where <math>f\colon \mathbf R \to \mathbf R</math> || <math>f(X)</math> || <math>\Omega \to \mathbf R</math> || <math>f\circ X</math> ||
|-
| [[Expected value]] of <math>X</math> || <math>\mathbf E(X)</math> || <math>\mathbf R</math>
|-
| || <math>\mathbf E(X\mid Y=y)</math> || <math>\mathbf R</math> ||
|-
| || <math>\mathbf E(X\mid Y)</math> || <math>\Omega \to \mathbf R</math> || <math>\omega \mapsto \mathbf E(X\mid Y=Y(\omega))</math>?
|-
| Utility function || <math>u</math> || <math>\mathbf R \to \mathbf R</math> || || I ''think'' this is what the type must be, based on how it's used. But we usually think of the utility function as assigning numbers to outcomes; but if that is so, it must be a random variable! What's up with that? (2022-07-14: I think in probability theory, we usually discuss only real random variables, since that allows us to do a lot more with them like take expected value. But in fields like AI, we consider more general random variables <math>\Omega \to \mathcal O</math> that take values in some space of outcomes <math>\mathcal O</math>. We can't "average over" outcomes so we can't really take expected values anymore, but this allows us to make the utility function more general so we get <math>u : \mathcal O \to \mathbf R</math>.)
|-
| Expected utility of <math>X</math> || <math>\mathbf{EU}(X)</math> || <math>\mathbf R</math> || <math>\mathbf E(u(X))</math> || <math>u\circ X</math> is indeed a random variable, so the type check passes.
|}

All the utility stuff isn't really related to machine learning. It's more related to the decision theory stuff I'm learning. I'm putting it here for now for convenience but might move it later.
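
As a sanity check on the types in the utility rows of the table, here is a minimal Python sketch on a finite sample space. The weather outcomes, the probabilities, and the utility numbers are all made up for illustration; the only point is that <math>X</math> takes values in a non-numeric outcome space <math>\mathcal O</math>, while <math>u \circ X</math> is a real random variable, so taking its expectation type-checks.

```python
from fractions import Fraction

# Hypothetical finite sample space with its probability measure (point masses).
omega = ["sunny", "cloudy", "rainy"]
P = {"sunny": Fraction(1, 2), "cloudy": Fraction(1, 3), "rainy": Fraction(1, 6)}

def X(w):
    """Outcome-valued random variable Omega -> O (outcomes are labels, not numbers)."""
    return "picnic" if w == "sunny" else "stay home"

def u(o):
    """Utility function O -> R (illustrative values)."""
    return {"picnic": 10, "stay home": 2}[o]

def E(Z):
    """Expectation of a real random variable Z: Omega -> R on a finite space."""
    return sum(P[w] * Z(w) for w in omega)

# Expected utility EU(X) = E(u o X). Note that E(X) itself would not make sense
# here, since "picnic" and "stay home" cannot be averaged.
EU = E(lambda w: u(X(w)))
print(EU)
```

On a finite space the expectation is just a weighted sum, which is why <code>E</code> can be written in one line; the general definition needs an integral.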

TODO add "probability distribution over S" and "probability distribution on A" [https://arxiv.org/pdf/1711.00363.pdf]

Li and Vitanyi (''An Introduction to Kolmogorov Complexity and Its Applications'', p. 19) call the probability measure on <math>\mathcal F</math> a "probability distribution over S", where S is the sample space.

TODO: add probability mass function (defined only for discrete random variables)

==Dependencies==

Let <math>(\Omega, \mathcal F, \mathbf P)</math> be a probability space.

* Given a random variable <math>X</math>, we can compute its distribution <math>\mu</math>. How? Just let <math>\mu(B) = \mathbf P_{\mathcal F}(X \in B)</math>
* Given a random variable, we can compute the probability density function. How?
* Given a random variable, we can compute the cumulative distribution function. How?
* Given a distribution, we can retrieve a random variable. But this random variable is not unique? This is why we can say stuff like "let <math>X\sim \mathcal D</math>".
* Given a distribution <math>\mu</math>, we can compute its density function. How? Just find the derivative of <math>\mu((-\infty,x])</math>. (?) (2022-07-14: something something Radon–Nikodym theorem...)
* Given a cumulative distribution function, we can compute the random variable. (Right?)
* Given a probability density function, can we get everything else? Don't we just have to integrate to get the cdf, which gets us the random variable and the distribution?
* Given a cumulative distribution function, how do we get the distribution? We have <math>F_X(x) = \mathbf P_{\mathcal F}(X\leq x) = \mathbf P_{\mathcal B}((-\infty,x])</math>, which gets us some of what the distribution <math>\mathbf P_{\mathcal B}</math> maps to, but <math>\mathcal B</math> is bigger than this. What do we do about the other values we need to map? We can compute half-open intervals like <math>F_X(b) - F_X(a) = \mathbf P_{\mathcal F}(a < X\leq b) = \mathbf P_{\mathcal B}((a,b])</math>. (The left endpoint is excluded: if <math>X</math> has an atom at <math>a</math>, then <math>\mathbf P_{\mathcal B}([a,b])</math> is larger by <math>\mathbf P_{\mathcal F}(X = a)</math>.) And we can apparently do the same for unions and limiting operations.
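
Some of the dependencies above can be checked by brute force on a finite probability space. The following Python sketch assumes a concrete example that is not part of the notes above (two fair dice, with <math>X</math> the sum), and represents a Borel set <math>B</math> as a predicate on reals. It computes the distribution as <math>\mu(B) = \mathbf P_{\mathcal F}(X \in B)</math>, the CDF as <math>F_X(x) = \mu((-\infty,x])</math>, and compares <math>F_X(b) - F_X(a)</math> with the measure of the half-open and closed intervals.

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered pairs of two fair dice (an illustrative choice).
omega = list(product(range(1, 7), repeat=2))

def P(event):
    """Uniform probability measure on subsets of omega (F is the power set)."""
    return Fraction(len(set(event)), len(omega))

def X(w):
    """Random variable Omega -> R: the sum of the two dice."""
    return w[0] + w[1]

def mu(B):
    """Distribution of X: B |-> P(X in B), with B given as a predicate on reals."""
    return P([w for w in omega if B(X(w))])

def F(x):
    """CDF of X: F(x) = P(X <= x) = mu((-inf, x])."""
    return mu(lambda t: t <= x)

a, b = 4, 7
half_open = mu(lambda t: a < t <= b)   # mu((a, b])
closed = mu(lambda t: a <= t <= b)     # mu([a, b]), includes the atom at a
print(F(b) - F(a), half_open, closed)  # prints: 5/12 5/12 1/2
```

The difference <math>F_X(b) - F_X(a)</math> matches the half-open interval and not the closed one, because <math>X</math> puts positive probability on the point <math>a = 4</math>.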

==Philosophical details about the sample space==

Given a random variable <math>X : \Omega \to \mathbf R</math> and any reasonable predicate <math>P</math> about <math>X</math>, we can replace <math>P(X)</math> with its extension <math>\{\omega \in \Omega : P(X(\omega))\} = \{\omega \in \Omega : X(\omega) \in B\}</math> for some <math>B \in \mathcal B</math>. And from then on, we can write <math>\mathbf P_{\mathcal F}(X\in B)</math> as <math>\mathbf P_{\mathcal F}(X^{-1}(B)) = \mathbf P_{\mathcal B}(B) = \mu(B)</math>. In other words, we can just work with Borel sets of the reals (measuring them with the distribution) rather than the original events (measuring them with the original probability measure). Where did <math>X</math> go? <math>\mathbf P_{\mathcal F} \circ X^{-1} = \mathbf P_{\mathcal B}</math>, so you can write <math>\mathbf P_{\mathcal B}</math> using <math>X</math>. But once you already have <math>\mathbf P_{\mathcal B}</math>, you don't need to know what <math>X</math> is.
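
To make "you don't need to know what <math>X</math> is" concrete, here is a small Python sketch with two hypothetical setups: a fair coin with <math>X</math> the indicator of heads, and a fair die with <math>Y</math> the indicator of an even face. The sample spaces and the random variables are different, but the pushforward <math>\mathbf P_{\mathcal F} \circ X^{-1}</math> is the same distribution in both cases, and any probability of the form <math>\mathbf P(X \in B)</math> can then be computed from that distribution alone.

```python
from fractions import Fraction

def pushforward(omega, P, X):
    """Distribution of X as a dict {value: probability}, i.e. P o X^{-1} on points."""
    mu = {}
    for w in omega:
        mu[X(w)] = mu.get(X(w), Fraction(0)) + P[w]
    return mu

# Space 1: a fair coin, X = 1 on heads.
coin = ["H", "T"]
P_coin = {w: Fraction(1, 2) for w in coin}
X = lambda w: 1 if w == "H" else 0

# Space 2: a fair die, Y = 1 on even faces.
die = [1, 2, 3, 4, 5, 6]
P_die = {w: Fraction(1, 6) for w in die}
Y = lambda w: 1 if w % 2 == 0 else 0

mu_X = pushforward(coin, P_coin, X)
mu_Y = pushforward(die, P_die, Y)

def prob(mu, B):
    """mu(B) for a predicate B on reals, using only the distribution."""
    return sum((p for x, p in mu.items() if B(x)), Fraction(0))

print(mu_X == mu_Y)                       # same distribution, different sample spaces
print(prob(mu_X, lambda x: x >= 1))       # computed without touching omega or X
```

Once <code>mu_X</code> is in hand, <code>prob</code> never refers to <code>coin</code>, <code>die</code>, <code>X</code>, or <code>Y</code>.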

==See also==

* [[Summary table of multivariable derivatives]]
* [[Comparison of machine learning textbooks]]

==External links==

* [https://terrytao.wordpress.com/2010/01/01/254a-notes-0-a-review-of-probability-theory/ 254A, Notes 0: A review of probability theory] and [https://terrytao.wordpress.com/2015/09/29/275a-notes-0-foundations-of-probability-theory/ 275A, Notes 0: Foundations of probability theory] by [[wikipedia:Terence Tao|Terence Tao]]
* [http://dsp.ucsd.edu/~kreutz/PEI-05%20Support%20Files/Basic%20Random%20Variables%20Concepts.pdf Basic Random Variable Concepts] by Kenneth Kreutz-Delgado
* Various questions on Mathematics Stack Exchange:
** https://math.stackexchange.com/questions/2233731/discarding-random-variables-in-favor-of-a-domain-less-definition
** https://math.stackexchange.com/questions/18198/what-are-the-sample-spaces-when-talking-about-continuous-random-variables
** https://math.stackexchange.com/questions/2233721/the-true-domain-of-random-variables
** https://math.stackexchange.com/questions/712734/domain-of-a-random-variable-sample-space-or-probability-space
** https://math.stackexchange.com/questions/23006/the-role-of-the-hidden-probability-space-on-which-random-variables-are-defined
** https://math.stackexchange.com/questions/1612012/how-should-i-understand-the-probability-space-omega-mathcalf-p-what-d
** https://math.stackexchange.com/questions/2531810/why-does-probability-theory-insist-on-sample-spaces
** https://math.stackexchange.com/questions/1690289/what-is-a-probability-distribution
** https://math.stackexchange.com/questions/1073744/distinguishing-probability-measure-function-and-distribution
** https://math.stackexchange.com/questions/57027/concept-of-probability-distribution
* Tim Gowers:
** https://gowers.wordpress.com/2010/09/01/icm2010-fourth-day/ (search for "random variable")
** https://mathoverflow.net/questions/12516/a-random-variable-is-it-a-function-or-an-equivalence-class-of-functions

[[Category:Probability]]

Summary table of probability terms

2022-07-14T18:08:21Z

IssaRice: /* Table */

This page is a '''summary table of probability terms'''.

==Table==

{| class="sortable wikitable"
! Term !! Notation !! Type !! Definition !! Notes
|-
| Reals || <math>\mathbf R</math> || ||
|-
| Borel subsets of the reals || <math>\mathcal B</math> || ||
|-
| A Borel set || <math>B</math> || <math>\mathcal B</math> ||
|-
| [[Sample space]] || <math>\Omega</math> || ||
|-
| Outcome || <math>\omega</math> || <math>\Omega</math> ||
|-
| Events or measurable sets || <math>\mathcal F</math> || ||
|-
| Probability measure || <math>\mathbf P</math> or <math>\Pr</math> or <math>\mathbf P_{\mathcal F}</math> || <math>\mathcal F \to [0,1]</math> ||
|-
| Probability triple or probability space || <math>(\Omega, \mathcal F, \mathbf P)</math> || ||
|-
| Distribution || <math>\mu</math> or <math>\mathcal D</math> or <math>D</math> or <math>\mathbf P_{\mathcal B}</math> or <math>\mathcal L(X)</math> or <math>\mathbf P X^{-1}</math> || <math>\mathcal B \to [0,1]</math> || <math>B \mapsto \mathbf P(X \in B)</math>
|-
| Induced probability space || <math>(\mathbf R, \mathcal B, \mu)</math> || ||
|-
| Cumulative distribution function or CDF || <math>F_X</math> || <math>\mathbf R \to [0,1]</math> ||
|-
| Probability density function or PDF || <math>f_X</math> || <math>\mathbf R \to [0,\infty)</math> ||
|-
| [[Random variable]] || <math>X</math> || <math>\Omega \to \mathbf R</math> ||
|-
| Preimage of random variable || <math>X^{-1}</math> || <math>2^{\mathbf R} \to 2^{\Omega}</math> but all we need is <math>\mathcal B \to \mathcal F</math> ||
|-
| Indicator of <math>A</math> || <math>1_A</math> || <math>\Omega \to \{0,1\}</math> || <math>1_A(\omega) = \begin{cases}1 & \omega\in A \\ 0 & \omega \not\in A\end{cases}</math>
|-
| [[Expectation]] || <math>\mathbf E</math> or <math>\mathrm E</math> || <math>(\Omega \to \mathbf R) \to \mathbf R</math> ||
|-
| || <math>X \in B</math> || <math>\mathcal F</math> || <math>\{\omega \in \Omega : X(\omega) \in B\}</math>
|-
| || <math>X=x</math> || <math>\mathcal F</math> || <math>\{\omega \in \Omega : X(\omega) = x\}</math>
|-
| || <math>X\leq x</math> || <math>\mathcal F</math> || <math>\{\omega \in \Omega : X(\omega) \leq x\}</math>
|-
| Function of a random variable, where <math>f\colon \mathbf R \to \mathbf R</math> || <math>f(X)</math> || <math>\Omega \to \mathbf R</math> || <math>f\circ X</math> ||
|-
| [[Expected value]] of <math>X</math> || <math>\mathbf E(X)</math> || <math>\mathbf R</math>
|-
| || <math>\mathbf E(X\mid Y=y)</math> || <math>\mathbf R</math> ||
|-
| || <math>\mathbf E(X\mid Y)</math> || <math>\Omega \to \mathbf R</math> || <math>\omega \mapsto \mathbf E(X\mid Y=Y(\omega))</math>?
|-
| Utility function || <math>u</math> || <math>\mathbf R \to \mathbf R</math> || || I ''think'' this is what the type must be, based on how it's used. But we usually think of the utility function as assigning numbers to outcomes; but if that is so, it must be a random variable! What's up with that? (2022-07-14: I think in probability theory, we usually discuss only real random variables, since that allows us to do a lot more with them like take expected value. But in fields like AI, we consider more general random variables <math>\Omega \to \mathcal O</math> that take values in some space of outcomes <math>\mathcal O</math>. We can't "average over" outcomes so we can't really take expected values anymore, but this allows us to make the utility function more general so we get <math>u : \mathcal O \to \mathbf R</math>.)
|-
| Expected utility of <math>X</math> || <math>\mathbf{EU}(X)</math> || <math>\mathbf R</math> || <math>\mathbf E(u(X))</math> || <math>u\circ X</math> is indeed a random variable, so the type check passes.
|}

All the utility stuff isn't really related to machine learning. It's more related to the decision theory stuff I'm learning. I'm putting it here for now for convenience but might move it later.

TODO add "probability distribution over S" and "probability distribution on A" [https://arxiv.org/pdf/1711.00363.pdf]

Li and Vitanyi (''An Introduction to Kolmogorov Complexity and Its Applications'', p. 19) call the probability measure on <math>\mathcal F</math> a "probability distribution over S", where S is the sample space.

TODO: add probability mass function (defined only for discrete random variables)

==Dependencies==

Let <math>(\Omega, \mathcal F, \mathbf P)</math> be a probability space.

* Given a random variable <math>X</math>, we can compute its distribution <math>\mu</math>. How? Just let <math>\mu(B) = \mathbf P_{\mathcal F}(X \in B)</math>
* Given a random variable, we can compute the probability density function. How?
* Given a random variable, we can compute the cumulative distribution function. How?
* Given a distribution, we can retrieve a random variable. But this random variable is not unique? This is why we can say stuff like "let <math>X\sim \mathcal D</math>".
* Given a distribution <math>\mu</math>, we can compute its density function. How? Just find the derivative of <math>\mu((-\infty,x])</math>. (?)
* Given a cumulative distribution function, we can compute the random variable. (Right?)
* Given a probability density function, can we get everything else? Don't we just have to integrate to get the cdf, which gets us the random variable and the distribution?
* Given a cumulative distribution function, how do we get the distribution? We have <math>F_X(x) = \mathbf P_{\mathcal F}(X\leq x) = \mathbf P_{\mathcal B}((-\infty,x])</math>, which gets us some of what the distribution <math>\mathbf P_{\mathcal B}</math> maps to, but <math>\mathcal B</math> is bigger than this. What do we do about the other values we need to map? We can compute half-open intervals like <math>F_X(b) - F_X(a) = \mathbf P_{\mathcal F}(a < X\leq b) = \mathbf P_{\mathcal B}((a,b])</math>. (The left endpoint is excluded: if <math>X</math> has an atom at <math>a</math>, then <math>\mathbf P_{\mathcal B}([a,b])</math> is larger by <math>\mathbf P_{\mathcal F}(X = a)</math>.) And we can apparently do the same for unions and limiting operations.
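
On the question of getting a random variable back from a CDF: one standard construction is to take <math>\Omega = (0,1)</math> with the uniform measure and let <math>X = F^{-1}</math>. The Python sketch below assumes a CDF that is continuous and strictly increasing on its support, so that an ordinary inverse exists, and uses the rate-1 exponential distribution purely as an example. It also exhibits a second, different function on the same <math>\Omega</math> with the same CDF, which illustrates why the random variable recovered from a distribution is not unique.

```python
import math

def F(x):
    """CDF of the rate-1 exponential distribution (an illustrative choice)."""
    return 1.0 - math.exp(-x) if x >= 0 else 0.0

def X(w):
    """Random variable on Omega = (0, 1) with the uniform measure: X = F^{-1}."""
    return -math.log(1.0 - w)

def X_alt(w):
    """A different function on the same Omega that has the same distribution."""
    return -math.log(w)

def cdf_of(rv, x, n=20_000):
    """Approximate P(rv <= x): the fraction of grid midpoints w in (0, 1) with rv(w) <= x."""
    hits = sum(1 for k in range(n) if rv((k + 0.5) / n) <= x)
    return hits / n

for x in (0.5, 1.0, 2.0):
    print(x, F(x), cdf_of(X, x), cdf_of(X_alt, x))
```

The grid of midpoints stands in for the uniform measure on <math>(0,1)</math>, so <code>cdf_of</code> is only an approximation; the three printed columns should agree to about four decimal places even though <code>X</code> and <code>X_alt</code> disagree at almost every <math>\omega</math>.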

==Philosophical details about the sample space==

Given a random variable <math>X : \Omega \to \mathbf R</math> and any reasonable predicate <math>P</math> about <math>X</math>, we can replace <math>P(X)</math> with its extension <math>\{\omega \in \Omega : P(X(\omega))\} = \{\omega \in \Omega : X(\omega) \in B\}</math> for some <math>B \in \mathcal B</math>. And from then on, we can write <math>\mathbf P_{\mathcal F}(X\in B)</math> as <math>\mathbf P_{\mathcal F}(X^{-1}(B)) = \mathbf P_{\mathcal B}(B) = \mu(B)</math>. In other words, we can just work with Borel sets of the reals (measuring them with the distribution) rather than the original events (measuring them with the original probability measure). Where did <math>X</math> go? <math>\mathbf P_{\mathcal F} \circ X^{-1} = \mathbf P_{\mathcal B}</math>, so you can write <math>\mathbf P_{\mathcal B}</math> using <math>X</math>. But once you already have <math>\mathbf P_{\mathcal B}</math>, you don't need to know what <math>X</math> is.

==See also==

* [[Summary table of multivariable derivatives]]
* [[Comparison of machine learning textbooks]]

==External links==

* [https://terrytao.wordpress.com/2010/01/01/254a-notes-0-a-review-of-probability-theory/ 254A, Notes 0: A review of probability theory] and [https://terrytao.wordpress.com/2015/09/29/275a-notes-0-foundations-of-probability-theory/ 275A, Notes 0: Foundations of probability theory] by [[wikipedia:Terence Tao|Terence Tao]]
* [http://dsp.ucsd.edu/~kreutz/PEI-05%20Support%20Files/Basic%20Random%20Variables%20Concepts.pdf Basic Random Variable Concepts] by Kenneth Kreutz-Delgado
* Various questions on Mathematics Stack Exchange:
** https://math.stackexchange.com/questions/2233731/discarding-random-variables-in-favor-of-a-domain-less-definition
** https://math.stackexchange.com/questions/18198/what-are-the-sample-spaces-when-talking-about-continuous-random-variables
** https://math.stackexchange.com/questions/2233721/the-true-domain-of-random-variables
** https://math.stackexchange.com/questions/712734/domain-of-a-random-variable-sample-space-or-probability-space
** https://math.stackexchange.com/questions/23006/the-role-of-the-hidden-probability-space-on-which-random-variables-are-defined
** https://math.stackexchange.com/questions/1612012/how-should-i-understand-the-probability-space-omega-mathcalf-p-what-d
** https://math.stackexchange.com/questions/2531810/why-does-probability-theory-insist-on-sample-spaces
** https://math.stackexchange.com/questions/1690289/what-is-a-probability-distribution
** https://math.stackexchange.com/questions/1073744/distinguishing-probability-measure-function-and-distribution
** https://math.stackexchange.com/questions/57027/concept-of-probability-distribution
* Tim Gowers:
** https://gowers.wordpress.com/2010/09/01/icm2010-fourth-day/ (search for "random variable")
** https://mathoverflow.net/questions/12516/a-random-variable-is-it-a-function-or-an-equivalence-class-of-functions

[[Category:Probability]]