Some interesting problems in PCA

1. For PCA, in order to find U and W we need to find a solution to the following objective function:

argmin_ {U\in R^{d\times r}:U^TU=I}\sum^n_{i=1}\Vert x_i-UU^Tx_i\Vert_2^2

How can we simplify the equation above such that we transform the problem into a trace maximisation problem? Show the intermediate steps and the new objective function.

Solution:

\sum^n_{i=1}\Vert x_i-UU^Tx_i\Vert^2_2=\Vert X-UU^TX\Vert^2_F

=tr((X-UU^TX)^T(X-UU^TX))=tr(X^TX-2X^TUU^TX+X^TUU^TUU^TX)

=tr(X^TX)-tr(X^TUU^TX)=\Vert X\Vert^2_F-tr(U^TXX^TU)

where \Vert\cdot\Vert_F is the Frobenius norm and we used U^TU=I together with the cyclic property of the trace.

Now we turn the objective into a trace maximisation problem:
argmax_{U\in R^{d\times r}:U^TU=I}trace(U^TXX^TU)

=argmax_{U\in R^{d\times r}:U^TU=I}trace(U^T\sum^n_{i=1}\pmb{x}_i\pmb{x}_i^TU)

If we define \Sigma=\frac{1}{n}\sum^n_{i=1}\pmb{x}_i\pmb{x}_i^T, this is equivalent (up to the constant factor \frac{1}{n}) to:
argmax_{U\in R^{d\times r}:U^TU=I}trace(U^T\Sigma U)
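
As a sanity check, the following minimal NumPy sketch (on hypothetical random data, with r=2) verifies the identity numerically: for a U with orthonormal columns, here the top-r eigenvectors of \Sigma, the reconstruction error equals \Vert X\Vert_F^2 minus the trace term.

import numpy as np

rng = np.random.default_rng(0)
n, d, r = 200, 5, 2
X = rng.normal(size=(d, n))                # columns are the (centred) data points x_i
X -= X.mean(axis=1, keepdims=True)

Sigma = X @ X.T / n                        # Sigma = (1/n) sum_i x_i x_i^T
eigvals, eigvecs = np.linalg.eigh(Sigma)   # ascending eigenvalues
U = eigvecs[:, ::-1][:, :r]                # top-r eigenvectors as columns

recon_error = np.linalg.norm(X - U @ U.T @ X, 'fro') ** 2
trace_term = np.trace(U.T @ (X @ X.T) @ U)
assert np.isclose(recon_error, np.linalg.norm(X, 'fro') ** 2 - trace_term)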


2. With a mean-centered data matrix X, a key component of PCA is the spectral decomposition of a matrix M, defined as M=VDV^T, where:

M\in\mathbb{R}^{d\times d}

V is a matrix whose columns \pmb{v}_1, \pmb{v}_2, ..., \pmb{v}_d  are the eigenvectors of M

D is the diagonal matrix with D_{i,i}=\lambda_i and \lambda_i is the eigenvalue of \pmb{v}_i

Assume that obtaining the square matrix M is computationally expensive. Can you suggest an alternative method to obtain the eigenvectors of M?

Solution:

Consider the n\times n matrix K=X^TX, such that K_{ij}=\langle x_i, x_j\rangle

Suppose \pmb{v}^* is an eigenvector of K: K\pmb{v}^*=\lambda\pmb{v}^* for some \lambda\in \mathbb{R}.

Multiplying the equation from the left by \frac{1}{n}X and using the definition of K, we obtain:

\frac{1}{n}XX^TX\pmb{v}^*=\frac{1}{n}\lambda X\pmb{v}^*

That is, using the definition of \Sigma=\frac{1}{n}XX^T, we get that \Sigma(X\pmb{v}^*)=\frac{\lambda}{n}(X\pmb{v}^*)

Hence, provided X\pmb{v}^*\neq 0, \frac{X\pmb{v}^*}{\Vert X\pmb{v}^*\Vert} is an eigenvector of \Sigma with eigenvalue \frac{\lambda}{n}

Therefore, we can obtain the PCA solution by computing the eigenvectors (and eigenvalues) of the n\times n matrix K rather than of the d\times d matrix \Sigma, which is cheaper whenever n<d.
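
A minimal NumPy sketch of this trick on hypothetical data with n<d: an eigenvector \pmb{v}^* of the n\times n matrix K is mapped to an eigenvector of \Sigma via X\pmb{v}^*/\Vert X\pmb{v}^*\Vert.

import numpy as np

rng = np.random.default_rng(1)
d, n = 1000, 50                            # high-dimensional data, few samples
X = rng.normal(size=(d, n))
X -= X.mean(axis=1, keepdims=True)         # mean-centre the columns x_i

K = X.T @ X                                # n x n matrix with K_ij = <x_i, x_j>
lam, V_star = np.linalg.eigh(K)            # eigenpairs of K (ascending eigenvalues)

v_star = V_star[:, -1]                     # leading eigenvector of K
v = X @ v_star
v /= np.linalg.norm(v)                     # candidate eigenvector of Sigma

Sigma_v = X @ (X.T @ v) / n                # apply Sigma = (1/n) X X^T without forming it
assert np.allclose(Sigma_v, (lam[-1] / n) * v)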


3. In PCA, W is the compression matrix and U is the recovery matrix. The corresponding objective for finding W and U is phrased as:

argmin_{W\in\mathbb{R}^{r\times d},U\in\mathbb{R}^{d\times r}}\sum^n_{i=1}\Vert x_i-UWx_i\Vert^2_2

Prove the lemma: let (U,W) be a solution to the objective above. Then the columns of U are orthonormal (that is, U^TU is the r\times r identity matrix) and W=U^T.

Solution:

Choose any U and W and consider the mapping \pmb{x}\rightarrow UW\pmb{x}

The range of this mapping, R=\left\{ UW\pmb{x}:\pmb{x}\in\mathbb{R}^d\right\} is an r-dimensional linear subspace of \mathbb{R}^d .

Let V\in\mathbb{R}^{d\times r} be a matrix whose columns form an orthonormal basis of this subspace (i.e. V^TV=I ).

Each vector in R can be written as V\pmb{y} , where \pmb{y}\in\mathbb{R}^r .

For every \pmb{x}\in\mathbb{R}^d and \pmb{y}\in\mathbb{R}^r we have

\Vert \pmb{x}-V\pmb{y}\Vert^2_2=\Vert\pmb{x}\Vert^2_2+\Vert\pmb{y}\Vert^2_2-2\pmb{y}^T(V^T\pmb{x}) (using V^TV=I)

This expression is minimised (set the gradient with respect to \pmb{y} to zero) at \pmb{y}=V^T\pmb{x}.

Therefore, for each \pmb{x} we have that:

V\pmb{y}=VV^T\pmb{x}=argmin_{\tilde{\pmb{x}}\in R}\Vert\pmb{x}-\tilde{\pmb{x}}\Vert^2_2

This holds for all \pmb{x}_i and we can replace U,W by V, V^T without increasing the objective:

\sum^n_{i=1}\Vert\pmb{x}_i-UW\pmb{x}_i\Vert^2_2\geq\sum^n_{i=1}\Vert\pmb{x}_i-VV^T\pmb{x}_i\Vert^2_2

As this holds for any U,W, an optimal solution can always be taken with U=V (whose columns are orthonormal) and W=V^T=U^T, which completes the proof.
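
The argument can be illustrated numerically. A minimal NumPy sketch with hypothetical random matrices: for an arbitrary pair (U,W), projecting onto an orthonormal basis V of the range of UW never increases the reconstruction error.

import numpy as np

rng = np.random.default_rng(2)
n, d, r = 100, 6, 2
X = rng.normal(size=(d, n))
U = rng.normal(size=(d, r))                # arbitrary recovery matrix
W = rng.normal(size=(r, d))                # arbitrary compression matrix

Q, s, _ = np.linalg.svd(U @ W)             # left singular vectors span range(UW)
V = Q[:, :r]                               # orthonormal basis (rank r for generic U, W)

err_uw = np.sum(np.linalg.norm(X - U @ W @ X, axis=0) ** 2)
err_v = np.sum(np.linalg.norm(X - V @ V.T @ X, axis=0) ** 2)
assert err_v <= err_uw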


4. Let \Sigma=VDV^T be the spectral decomposition of \Sigma. D is a diagonal matrix such that D_{i,i} is the i-th largest eigenvalue of \Sigma. The columns of V are the corresponding eigenvectors, and V^TV=VV^T=I. Then the solution to argmax_{U\in\mathbb{R}^{d\times r}:U^TU=I}trace(U^T\Sigma U) is the matrix U whose columns are the first r eigenvectors of \Sigma.

 Solution:

Choose a matrix U\in\mathbb{R}^{d,r} with orthonormal columns and let B=V^TU. Then, VB=VV^TU=U. It follows that

U^T\Sigma U=B^TV^TVDV^TVB=B^TDB

and therefore

trace(U^T\Sigma U)=\sum^d_{j=1}D_{j,j}\sum^r_{i=1}B_{j,i}^2.

Note that B^TB=U^TVV^TU=U^TU=I. Hence the columns of B are orthonormal and \sum^d_{j=1}\sum^r_{i=1}B_{j,i}^2=r.

Define \tilde{B}\in\mathbb{R}^{d\times d} to be a matrix whose first r columns are the columns of B and which satisfies \tilde{B}^T\tilde{B}=I (that is, complete the columns of B to an orthonormal basis of \mathbb{R}^d). Then \tilde{B}\tilde{B}^T=I as well, so the rows of \tilde{B} also have unit norm: for every j, \sum^d_{i=1}\tilde{B}^2_{j,i}=1, which implies that \sum^r_{i=1}B_{j,i}^2\leq 1.

It follows that

trace(U^T\Sigma U)\leq max_{\beta\in [0,1]^d:\Vert\beta\Vert_1\leq r}\sum^d_{j=1}D_{j,j}\beta_j=\sum^r_{j=1}D_{j,j}

Therefore, for every matrix U\in\mathbb{R}^{d\times r} with orthonormal columns, the following inequality holds: trace(U^T\Sigma U)\leq\sum^r_{j=1}D_{j,j}

But if we set U to be the matrix with the r leading eigenvectors of \Sigma as its columns, we obtain trace(U^T\Sigma U)=\sum^r_{j=1}D_{j,j}, which attains the upper bound, so this U is an optimal solution.
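
A minimal NumPy check of this bound on a hypothetical covariance matrix: a random orthonormal U never exceeds the sum of the r largest eigenvalues, while the r leading eigenvectors attain it.

import numpy as np

rng = np.random.default_rng(3)
d, r = 8, 3
A = rng.normal(size=(d, d))
Sigma = A @ A.T / d                        # a positive semidefinite matrix playing the role of Sigma

eigvals, V = np.linalg.eigh(Sigma)         # ascending eigenvalues
U_opt = V[:, ::-1][:, :r]                  # r leading eigenvectors
bound = eigvals[::-1][:r].sum()            # sum of the r largest eigenvalues

U_rand, _ = np.linalg.qr(rng.normal(size=(d, r)))   # a random matrix with orthonormal columns
assert np.trace(U_rand.T @ Sigma @ U_rand) <= bound + 1e-10
assert np.isclose(np.trace(U_opt.T @ Sigma @ U_opt), bound)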


5. The variance captured by the first r eigenvectors of \Sigma is the sum over its r largest eigenvalues \sum^r_{i=1}\lambda_i.

Solution:

The variance in a dataset X is defined as

var(X)=\frac{1}{n}\sum^n_{i=1}\Vert x_i-\pmb{0}\Vert^2 (the mean is \pmb{0} because the data is centred)

=\frac{1}{n}\sum^n_{i=1}\Vert x_i\Vert^2=\frac{1}{n}\sum^n_{i=1}\langle x_i,x_i\rangle =tr(\Sigma)=tr(VDV^T)=tr(V^TVD)=tr(D)

=\sum^d_{i=1}D_{i,i}=\sum^d_{i=1}\lambda_i

The variance in a projected dataset WX, with W=[\pmb{v}_1,...,\pmb{v}_r]^T , is defined as

var(WX)=\frac{1}{n}\sum^n_{j=1}\Vert W\pmb{x}_j\Vert^2 =\frac{1}{n}\sum^n_{j=1}\langle  W\pmb{x}_j,W\pmb{x}_j\rangle =\frac{1}{n}\sum^n_{j=1}\pmb{x}_j^T W^TW\pmb{x}_j

=\frac{1}{n}\sum^n_{j=1}\pmb{x}_j^T(\pmb{v}_1\pmb{v}_1^T+...+\pmb{v}_r\pmb{v}_r^T)\pmb{x}_j=\sum^r_{i=1}\pmb{v}_i^T\left(\frac{1}{n}\sum^n_{j=1}\pmb{x}_j\pmb{x}_j^T\right)\pmb{v}_i=

(\pmb{v}_1^T\Sigma\pmb{v}_1+...+\pmb{v}_r^T\Sigma\pmb{v}_r)=\sum^r_{i=1}\pmb{v}_i^T\Sigma\pmb{v}_i=\sum^r_{i=1}\pmb{v}_i^T\lambda_i\pmb{v}_i=\sum^r_{i=1}\lambda_i

(The last step uses \Sigma\pmb{v}_i=\lambda_i\pmb{v}_i and \pmb{v}_i^T\pmb{v}_i=1: as in 4, the eigenvectors are orthonormal.)
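
A minimal NumPy check on hypothetical centred data: the variance of the projected dataset WX equals the sum of the r largest eigenvalues of \Sigma.

import numpy as np

rng = np.random.default_rng(4)
d, n, r = 6, 500, 2
X = rng.normal(size=(d, n)) * np.arange(1, d + 1)[:, None]   # give the coordinates unequal variances
X -= X.mean(axis=1, keepdims=True)

Sigma = X @ X.T / n
eigvals, V = np.linalg.eigh(Sigma)         # ascending eigenvalues
W = V[:, ::-1][:, :r].T                    # rows are the r leading eigenvectors v_1, ..., v_r

var_proj = np.sum(np.linalg.norm(W @ X, axis=0) ** 2) / n
assert np.isclose(var_proj, eigvals[::-1][:r].sum())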

DFS code and Minimum DFS code

Edge Code

We represent each edge as e=(start, end), where start and end are the starting and ending vertices of the edge. The edge code of e is defined as:

\left\{ Index_{start}, Index_{end},L_V(start), L_E(e), L_V(end)\right\}

where the Index of a vertex is its position in the DFS visiting order, L_V(\cdot) denotes a vertex label and L_E(\cdot) an edge label. The numbers marked in red in the following graph are the indices of the vertices.

[Figure: an example graph with its DFS tree; the vertices are indexed 0-4 in DFS visiting order (marked in red) and labelled X, Y, X, Z, Z; the edge labels are a, b, a, c, b, d.]

Under this DFS traversal, the graph has the following edge codes:

Forward edge: {0,1,X,a,Y}, {1,2,Y,b,X}, {2,3,X,c,Z}, {1,4,Y,d,Z}

Backward edge: {3,1,Z,b,Y}, {2,0,X,a,X}

It is easy to see that Index_{start}<Index_{end} for forward edges and Index_{start}>Index_{end} for backward edges.

The order of edges and DFS code

Suppose we have two edges

e_1=\left\{ Index_{start1},Index_{end1}\right\}

e_2=\left\{ Index_{start2},Index_{end2}\right\}

To determine the ordering between two edges we must consider: i) the types of the edges, ii) the visiting order (Index) of the nodes in each edge. (A Python sketch of the full comparison follows the three cases below.)

If e_1 and e_2 are both forward edges, then e_1 < e_2 if one of the following holds:

  1. Index_{end1}<Index_{end2}
  2. Index_{end1}=Index_{end2} and Index_{start1}>Index_{start2} (the forward edge that starts from the deeper vertex comes first)

If e_1 and e_2 are both backward edges, then e_1 < e_2 if one of the following holds:

  1. Index_{start1}<Index_{start2}
  2. Index_{start1}=Index_{start2} and Index_{end1}<Index_{end2}

If e_1 and e_2 are one forward edge and one backward edge, then e_1 < e_2 if one of the following two holds:

  1. e_1 is forward edge and e_2 is backward edge and Index_{end1}\leq Index_{start2}
  2. e_1 is backward edge and e_2 is forward edge and Index_{start1}<Index_{end2}
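
Putting the three cases together, here is a minimal Python sketch of the comparison; edges are represented as hypothetical (start, end) index pairs, and an edge is forward exactly when its start index is smaller than its end index.

def is_forward(edge):
    # edge = (start_index, end_index); forward edges grow the DFS tree
    start, end = edge
    return start < end

def edge_less_than(e1, e2):
    # Return True if e1 < e2 in the DFS edge order described above.
    s1, t1 = e1
    s2, t2 = e2
    f1, f2 = is_forward(e1), is_forward(e2)
    if f1 and f2:                          # both forward
        return t1 < t2 or (t1 == t2 and s1 > s2)
    if not f1 and not f2:                  # both backward
        return s1 < s2 or (s1 == s2 and t1 < t2)
    if f1 and not f2:                      # forward vs backward
        return t1 <= s2
    return s1 < t2                         # backward vs forward

# Example: the forward edge (1,2) precedes the backward edge (2,0)
assert edge_less_than((1, 2), (2, 0))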

The DFS code is just the concatenation of all the edge codes in this order, i.e. the DFS code for the above DFS tree is:

{0,1,X,a,Y,1,2,Y,b,X,2,0,X,a,X,2,3,X,c,Z,3,1,Z,b,Y,1,4,Y,d,Z}.
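
For concreteness, the same DFS code can be held as a list of 5-tuples (Index_{start}, Index_{end}, L_V(start), L_E(e), L_V(end)) in Python:

# DFS code of the example graph as an ordered list of edge codes
dfs_code = [
    (0, 1, 'X', 'a', 'Y'),
    (1, 2, 'Y', 'b', 'X'),
    (2, 0, 'X', 'a', 'X'),   # backward edge
    (2, 3, 'X', 'c', 'Z'),
    (3, 1, 'Z', 'b', 'Y'),   # backward edge
    (1, 4, 'Y', 'd', 'Z'),
]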

DFS Lexicographic Order and Minimum DFS code

The DFS lexicographic order of the DFS code is defined as follows:

If we have two DFS codes \alpha and \beta

\alpha={\alpha_0,...,\alpha_m}, \beta={\beta _0,...,\beta_n}

then \alpha\leq\beta if and only if either of the following is true

1.\exists t,0\leq t\leq min(m,n), \forall k<t, \alpha_k=\beta_k and \alpha_t<\beta_t

2.m\leq n and \forall k\leq m, \alpha_k=\beta_k

So we can order a list of DFS codes according to the DFS lexicographic order, and the minimum DFS code of a graph is defined as the smallest code among all of its possible DFS codes.
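
A minimal Python sketch of this comparison; it assumes a helper less_than(a, b) that compares two individual edge codes (for example by the edge order above, with ties broken by the vertex and edge labels), which is an illustrative assumption rather than a full gSpan implementation.

def dfs_code_less_equal(alpha, beta, less_than):
    # DFS lexicographic order: the first position where the two codes
    # differ decides; if neither code has a smaller element first, the
    # shorter code (a prefix of the other) is the smaller one.
    for a, b in zip(alpha, beta):
        if less_than(a, b):
            return True
        if less_than(b, a):
            return False
    return len(alpha) <= len(beta)

# The minimum DFS code of a graph is then the smallest code, under this
# order, among all DFS codes produced by its possible DFS traversals.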

Kernel PCA


Assume that we are dealing with centred data in feature space: \sum^n_{i=1}\phi(x_i)=0

The covariance matrix in feature space then takes the form C=\frac{1}{n}\sum^n_{i=1}\phi(x_i)\phi(x_i)^T

Then we have to find eigenvalues \lambda\geq0 and nonzero eigenvectors v in Hilbert space satisfying: \lambda v=Cv

All solutions v with \lambda\neq0 lie in the span of \phi(x_1),...,\phi(x_n), due to the fact that

\lambda v=Cv=\frac{1}{n}\sum^n_{i=1}\phi(x_i)\phi(x_i)^Tv=\frac{1}{n}\sum^n_{i=1}(\phi(x_i)^Tv)\phi(x_i)   (\phi(x_i)^Tv is an inner product between two vectors, so it is a scalar and can be moved to the front.)

We can derive two useful consequences from this equation. The first is:

If we take the inner product of both sides with a point \phi(x_j), we get:

\lambda\langle\phi(x_j),v\rangle=\langle\phi(x_j),Cv\rangle

The second consequence is that the eigenvector can be written as a linear combination of points in Hilbert space:

v=\sum^n_{i=1}\alpha_i\phi(x_i)  (with \alpha_i=\frac{\phi(x_i)^Tv}{n\lambda} read off from the equation above)

Substituting consequence 2 into consequence 1, we get:

\lambda\langle\phi(x_j),\sum^n_{i=1}\alpha_i\phi(x_i)\rangle=\langle\phi(x_j),\frac{1}{n}\sum^n_{k=1}\phi(x_k)\phi(x_k)^T\sum^n_{i=1}\alpha_i\phi(x_i)\rangle    (43)

There are two properties of the inner product that we will use:

\langle x, \lambda x'\rangle=\lambda\langle x,x'\rangle (\lambda is a constant)

\langle x,\sum_i\alpha_ix'_i\rangle =\sum_i\alpha_i\langle x,x'_i\rangle (\alpha_i is a constant)

Using these two properties, we can rewrite equation (43), for all j=1,…,n, as:

\lambda\sum^n_{i=1}\alpha_i\langle\phi(x_j),\phi(x_i)\rangle=\frac{1}{n}\sum^n_{i=1}\alpha_i\langle\phi(x_j),\sum^n_{k=1}\phi(x_k)\langle\phi(x_k),\phi(x_i)\rangle\rangle ;   (44)

\lambda\sum^n_{i=1}\alpha_i\langle\phi(x_j),\phi(x_i)\rangle=\frac{1}{n}\sum^n_{i=1}\alpha_i\sum^n_{k=1}\langle\phi(x_k),\phi(x_i)\rangle\langle\phi(x_j),\phi(x_k)\rangle;   (45)

n\lambda\sum^n_{i=1}\alpha_i\underbrace{\langle\phi(x_j),\phi(x_i)\rangle}_{K_{ji}}=\sum^n_{i=1}\alpha_i\underbrace{\sum^n_{k=1}\langle\phi(x_j),\phi(x_k)\rangle\langle\phi(x_k),\phi(x_i)\rangle}_{[K^2]_{ji}}.  (46)

The two underbraced terms use \langle\phi(x_j),\phi(x_k)\rangle=K(x_j,x_k)=K_{jk} and

\sum^n_{k=1}K_{jk}K_{ki}=[K\cdot K]_{ji}=[K^2]_{ji}  (This is just the definition of matrix multiplication: the j-th row of K multiplied by the i-th column of K gives the (j,i) entry of the product; since K is symmetric, this also equals [K^2]_{ij}.)

 

Kernel PCA as an eigenvector problem

Equation (46) can be rewritten as:

n\lambda\textbf{K}\pmb{\alpha}=\textbf{K}^2\pmb{\alpha}   (47)

where \pmb{\alpha} denotes the column vector with entries \alpha_1,...,\alpha_n

To find the solution of Equation (47), we solve the problem

n\lambda\pmb{\alpha}=\textbf{K}\pmb{\alpha}

which we obtain by multiplying (47) by \textbf{K}^{-1} from the left (assuming \textbf{K} is invertible); conversely, every solution of this simpler eigenvalue problem also satisfies (47).
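
A minimal NumPy sketch on hypothetical data, using an RBF kernel purely as an example (the kernel choice and the data are assumptions here); any eigenvector of \textbf{K} satisfies both the simpler problem and (47).

import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 2))               # 30 hypothetical points in R^2

# Example kernel matrix (an RBF kernel is assumed for illustration);
# the derivation above assumes the mapped data is already centred.
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)

evals, evecs = np.linalg.eigh(K)           # K alpha = (n lambda) alpha
alpha = evecs[:, -1]                       # leading eigenvector of K
n_lambda = evals[-1]                       # corresponding eigenvalue (= n lambda)

assert np.allclose(n_lambda * (K @ alpha), K @ (K @ alpha))   # equation (47)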

 

Normalizing the coefficients

We require the eigenvectors \pmb{v}^k to have unit length, that is, \langle \pmb{v}^k,\pmb{v}^k\rangle=1 for all k=1,…,r.

That means that

1=\langle\pmb{v}^k,\pmb{v}^k\rangle

=\sum^n_{i,j=1}\alpha_i^k\alpha^k_j\langle\phi(x_i),\phi(x_j)\rangle

=\sum^n_{i,j=1}\alpha^k_i\alpha^k_jK_{i,j}

=\langle\alpha^k,\pmb{K}\alpha^k\rangle=\lambda_k \langle\alpha^k,\alpha^k\rangle, where \lambda_k denotes the eigenvalue of \pmb{K} corresponding to \alpha^k.

As eigenvectors of \pmb{K}, the \alpha^k initially have unit norm.

Therefore we have to rescale them by \sqrt{\frac{1}{\lambda_k}} so that \Vert\alpha^k\Vert^2=\frac{1}{\lambda_k}, which makes \langle\pmb{v}^k,\pmb{v}^k\rangle=1.
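
Continuing the sketch above (same hypothetical RBF kernel), the normalisation and the resulting projections onto the kernel principal components, \langle\pmb{v}^k,\phi(x_j)\rangle=\sum_i\alpha^k_iK(x_i,x_j), look like this:

import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 2))
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)                # same hypothetical RBF kernel as above

evals, evecs = np.linalg.eigh(K)
r = 2
idx = np.argsort(evals)[::-1][:r]          # r leading eigenpairs of K
alphas = evecs[:, idx] / np.sqrt(evals[idx])   # rescale so that <v^k, v^k> = 1

# Check the normalisation: <v^k, v^k> = (alpha^k)^T K alpha^k = 1
assert np.allclose(np.diag(alphas.T @ K @ alphas), 1.0)

# Projection of the training points onto the kernel principal components
projections = K @ alphas                   # shape (n, r); column k holds <v^k, phi(x_j)>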