User:IssaRice/Linear algebra/Singular value decomposition

the stupid textbooks don't tell you anything about SVD!!!! i think it's super helpful to look at all the ''wrong'' things one might say about SVD... we need to un-knot all those wrong intuitions. i'll list some knots that i have had.

starting at this image: https://en.wikipedia.org/wiki/File:Singular-Value-Decomposition.svg


* if A is an invertible matrix, then <math>A = E_1 \cdots E_m</math> for some elementary matrices <math>E_1,\ldots,E_m</math>. Swapping matrices are orthogonal and dilations are diagonal, so those two kinds of elementary matrices already have trivial SVDs; the only problematic ones are the shears. So we can write A as a product of orthogonal, diagonal, and shear matrices. If we can prove SVD for shears (the first numerical sketch after this list checks it for one shear), we can rewrite A as an alternating product of orthogonal and ''diagonal'' matrices, merging adjacent orthogonal factors (the product of two orthogonal matrices is again orthogonal, since <math>(PQ)^*(PQ) = Q^*P^*PQ = I</math>). unfortunately, this doesn't seem to lead to a full proof of SVD, since the alternating product doesn't collapse to a single orthogonal-diagonal-orthogonal product (unless orthogonal and diagonal matrices somehow commuted, which they don't in general).
* one question one might have is, to get the behavior of M in the linked image, can't we just squish along the standard basis directions, then rotate? surely this would produce the same ellipse. And it would seem that we've only required one rotation, instead of the two in SVD. That's true, but pay attention to where the basis vectors went. A squish followed by a rotation... would preserve orthogonality. But in M it is clear that these basis vectors are no longer orthogonal. So even though we have faithfully reproduced the ellipse, we don't have the same transformation. i.e. <math>M(\{v : \|v\| = 1\}) = M'(\{v : \|v\|=1\})</math> need not imply <math>M=M'</math> (the second sketch after this list demonstrates this). This must be an artifact of the fact that a circle is an extremely symmetric shape, so lots of non-identical transformations can still produce the same image of a circle. I think if we started out with a square instead, stretching and then rotating would not reproduce the same image (actually, maybe a square too is still too symmetric; see the example [https://youtu.be/vSczTbgc8Rc?list=PLnQX-jgAF5pTZXPiD8ciEARRylD9brJXU&t=679 here]).
* (polar decomposition.) In the linked image, look at the axes of the final ellipse, labeled <math>\sigma_1</math> and <math>\sigma_2</math>. Call those vectors <math>u_1</math> and <math>u_2</math>. So <math>u_1 = Mv_1</math> and <math>u_2 = Mv_2</math> for some vectors v1 and v2. Now, backtrack along the arrows, starting from the final image, going through U, then Sigma, then V*. Pay attention to what each step does to u1 and u2. In each step, these particular vectors remain orthogonal (after undoing U they lie along the standard basis, which <math>\Sigma^{-1}</math> keeps orthogonal; undoing V* is another orthogonal map). So not only are u1 and u2 orthogonal; v1 and v2 must be orthogonal as well. So now, couldn't we say "take v1 and v2, squish along those axes, then rotate"? That seems to require only one rotation. What's going on? The problem is that a diagonal matrix can only stretch along the standard basis. So "stretch along v1 and v2" can't be done via a diagonal matrix (unless v1 and v2 are the standard basis, of course). Let's say <math>M = RD</math> where R is a rotation and D is "stretch along v1 and v2", so <math>Dv_1 = \lambda_1 v_1</math> and <math>Dv_2 = \lambda_2 v_2</math>. Now, D is not a diagonal matrix when viewed in the standard basis, but it ''is'' diagonal when viewed in the basis v1, v2. To express it in the standard basis, we need to convert the incoming vectors like v1 into e1, then apply the stretching, then convert back. In other words, we want to show <math>D = U\Sigma U^*</math> for the diagonal matrix <math>\Sigma = \operatorname{diag}(\lambda_1, \lambda_2)</math> and some orthogonal U. Just take <math>Ue_j=v_j</math>, i.e. U is the matrix with columns [v1 v2]. Now <math>Dv_j = U\Sigma U^* v_j = U\Sigma e_j = U\lambda_j e_j = \lambda_j v_j</math>, just like we wanted. So <math>M = RD = RU\Sigma U^*</math>. Of course, RU is another orthogonal matrix, so we recover SVD. (the third sketch after this list runs this construction numerically.)
** also, if any of the <math>\lambda_j</math> are negative, we can flip the sign along the corresponding direction and absorb that reflection into the orthogonal factor R, so we can assume that D is a positive operator (positive semi-definite).
* another question is: if we squish along the right orthogonal directions, can't we get away with not needing the extra rotation? after all, an ellipse can be squished into a circle without any rotations. what must be the case, although i can't explain it visually yet, is that if we do just that, then the standard basis vectors (yellow and pink) get mapped to the wrong spots. this ''might'' be an artifact of shears (the wikipedia SVD image is of a shear). it's clearer to look at michael nielsen's [http://cognitivemedium.com/emm/images/tangent_definition.png image]. here, if we start with the ellipse and shrink along Ms and stretch along Mt, we do get back a circle. but Ms doesn't go back to s, and Mt doesn't go back to t; for that we need the extra rotation.
* [https://math.stackexchange.com/questions/2899052/singular-value-decomposition-reconciling-the-maximal-stretching-and-spectral my old question]:
** what is this <math>\sqrt{T^*T}</math> that axler keeps talking about? that's the "stretch along well-chosen orthogonal directions" operation that we start out with in polar decomposition (the third sketch after this list computes it explicitly as <math>V\Sigma V^*</math>).
** for proof (1), see [http://cognitivemedium.com/emm/emm.html michael nielsen]. basically, the maximal stretching direction has a tangent vector (on the ellipse) that is orthogonal to it, because if it ''wasn't'' orthogonal, then we could get an even more stretched-out vector. the other piece that's required is that linear maps preserve tangency. i.e. if v(t) is a parametrization of a circle, and M is a matrix, then M(v(t)) traces out an ellipse as t varies. (i'm using t as a parameter even though nielsen uses it as a vector. seriously, who the heck uses t for a vector??) the tangent vector on the circle at v(t) is v'(t). this tangent vector gets mapped to M(v'(t)). the tangent vector at M(v(t)) on the ellipse is <math display=inline>\frac{d}{dt} M(v(t))</math>. now, by linearity of M and the definition of the derivative, we can basically "pull out" the M and see that <math display=inline>\frac{d}{dt} M(v(t)) = M(\frac{d}{dt} v(t))</math>.<ref group=note><math>\frac{d}{dt} M(v(t)) = \lim_{h\to0} \frac{Mv(t+h) - Mv(t)}{h} = \lim_{h\to0} \frac1h M(v(t+h)-v(t)) = \lim_{h\to0}M(\tfrac1h (v(t+h)-v(t))) = M(\frac{d}{dt} v(t))</math>. You might also want to play around with an example like <math>\begin{pmatrix}3 & 0\\ 0 & 4\end{pmatrix}</math>, which takes <math>(\cos t, \sin t)</math> to <math>(3\cos t, 4\sin t)</math>. The tangent at the original point is <math>(-\sin t, \cos t)</math>. The tangent at the image is <math>(-3\sin t, 4\cos t)</math>, which is equal to the image of the tangent.</ref> what this means is that if you take a point on the circle and its tangent, and map both of them under M, then the tangent at the image of the point is the image of the tangent at the point<ref group=note>I think another way to see this is via uniqueness of taylor approximations: if v is a point on the circle, and u is the tangent vector at v, then points near v can be written as <math>v + \Delta u + O(\Delta^2)</math>, and if we apply M to those points, we get <math>Mv + \Delta Mu + O(\Delta^2)</math>. if taylor approximations are unique, then the fact that the term linear in <math>\Delta</math> has Mu means that Mu must be tangent at Mv.</ref> (the last sketch after this list checks this on the footnote's diagonal example). what this implies for our maximal stretch vector: its tangent on the circle is orthogonal to it, the image of that tangent is a tangent at the image point on the ellipse, and we already know that the tangent on the ellipse is orthogonal at the maximal stretch direction. so M maps this particular orthogonal pair to an orthogonal pair.
** so how does (2) find the same basis without talking about "maximal stretching"? well, in (2), <math>\sqrt{T^*T}</math> ''means'' "stretch along well-chosen orthogonal directions" -- it's the positive operator that appears in polar decomposition. and if we stretch along orthogonal directions, then surely one of them has to be the maximal stretching direction (rather than, say, some direction intermediate between two of the axes).
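To make the first bullet concrete, here is a quick numerical sketch (a minimal check in numpy; the specific shear <math>\begin{pmatrix}1 & 1\\ 0 & 1\end{pmatrix}</math> is my assumption, standing in for the matrix in the linked image) showing that even a single shear already needs two ''different'' orthogonal factors in its SVD:

<syntaxhighlight lang="python">
import numpy as np

# an assumed stand-in for the shear in the linked Wikipedia image
M = np.array([[1.0, 1.0],
              [0.0, 1.0]])

U, s, Vt = np.linalg.svd(M)

# the SVD really is (orthogonal) @ (diagonal) @ (orthogonal)
assert np.allclose(U @ np.diag(s) @ Vt, M)
assert np.allclose(U.T @ U, np.eye(2))
assert np.allclose(Vt @ Vt.T, np.eye(2))

# the two orthogonal factors are genuinely different:
# if U equaled V (up to sign), M = U Sigma V^T would be symmetric,
# but a shear is not symmetric
V = Vt.T
assert not np.allclose(U, V) and not np.allclose(U, -V)

print(s)  # singular values of the shear: approximately [1.618, 0.618]
</syntaxhighlight>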
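Second, a numerical version of the "same ellipse, different map" point (same assumed shear): composing M with any rotation of the ''source'' circle changes the map but not the image ellipse.

<syntaxhighlight lang="python">
import numpy as np

M = np.array([[1.0, 1.0],   # the shear again
              [0.0, 1.0]])

# Q rotates the source circle, permuting its points among themselves,
# so M and M @ Q send the unit circle to the same ellipse
theta = 0.7  # arbitrary angle
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
M2 = M @ Q

# the image of the unit circle under an invertible A is the ellipse
# { x : x^T (A A^T)^{-1} x = 1 }, so equal A A^T means equal ellipses
assert np.allclose(M @ M.T, M2 @ M2.T)
assert not np.allclose(M, M2)  # ...but the maps themselves differ
</syntaxhighlight>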
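Third, a sketch of the polar decomposition construction from the bullet above, which also makes <math>\sqrt{T^*T}</math> concrete. (Careful with notation: numpy's V here, the matrix whose columns are the <math>v_j</math>, plays the role of the matrix called U in that bullet.)

<syntaxhighlight lang="python">
import numpy as np

M = np.array([[1.0, 1.0],   # the shear again
              [0.0, 1.0]])

U, s, Vt = np.linalg.svd(M)   # M = U @ diag(s) @ Vt
V = Vt.T

D = V @ np.diag(s) @ Vt       # "stretch along v1, v2" = V Sigma V^*
R = U @ Vt                    # the single leftover rotation

assert np.allclose(R @ D, M)               # M = R D, the polar decomposition
assert np.allclose(R.T @ R, np.eye(2))     # R is orthogonal
assert np.allclose(D, D.T)                 # D is symmetric...
assert np.all(np.linalg.eigvalsh(D) >= 0)  # ...and positive semi-definite
assert np.allclose(D @ D, M.T @ M)         # i.e. D = sqrt(M^* M)
</syntaxhighlight>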
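Finally, the tangency-preservation fact, checked on the diagonal example from the footnote:

<syntaxhighlight lang="python">
import numpy as np

# the footnote's example: diag(3, 4) takes (cos t, sin t) to (3 cos t, 4 sin t)
M = np.diag([3.0, 4.0])

t = 0.9  # arbitrary parameter value
tangent_on_circle = np.array([-np.sin(t), np.cos(t)])        # v'(t)
tangent_on_ellipse = np.array([-3*np.sin(t), 4*np.cos(t)])   # d/dt of M v(t)

# the image of the tangent equals the tangent at the image
assert np.allclose(M @ tangent_on_circle, tangent_on_ellipse)
</syntaxhighlight>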
 
==See also==
 
* https://machinelearning.subwiki.org/wiki/User:IssaRice/Linear_algebra/Classification_of_operators -- performing SVD on some nicer operators allows you to skip some of the steps, resulting in a simpler decomposition.
 
==Footnotes==
 
<references group="note"/>
