Some interesting problems in PCA

1. For PCA, in order to find U and W we need to find a solution to the following objective function:

argmin_ {U\in R^{d\times r}:U^TU=I}\sum^n_{i=1}\Vert x_i-UU^Tx_i\Vert_2^2

How can we simplify the equation above such that we transform the problem into a trace maximisation problem? Show the intermediate steps and the new objective function.

Solution:

\sum^n_{i=1}\Vert x_i-UU^Tx_i\Vert^2_2=\Vert X-UU^TX\Vert^2_F

=tr((X-UU^TX)^T(X-UU^TX))=tr(X^TX-2X^TUU^TX+X^TUU^TUU^TX)

=tr(X^TX)-tr(X^TUU^TX)=\Vert X\Vert^2_F-tr(U^TXX^TU)

where \Vert\cdot\Vert_F is the Frobenius norm and we used U^TU=I together with the cyclic property of the trace.

Now we turn the objective into a trace maximisation problem:
argmax_{U\in R^{d\times r}:U^TU=I}trace(U^TXX^TU)

=argmax_{U\in R^{d\times r}:U^TU=I}trace(U^T\sum^n_{i=1}\pmb{x}_i\pmb{x}_i^TU)

If we define \Sigma=\frac{1}{n}\sum^n_{i=1}\pmb{x}_i\pmb{x}_i^T, this is equivalent (up to the constant factor \frac{1}{n}) to:
argmax_{U\in R^{d\times r}:U^TU=I}trace(U^T\Sigma U)
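
As a sanity check, the following minimal NumPy sketch (on hypothetical random data, with r=2) verifies the identity numerically: for a U with orthonormal columns, here the top-r eigenvectors of \Sigma, the reconstruction error equals \Vert X\Vert_F^2 minus the trace term.

import numpy as np

rng = np.random.default_rng(0)
n, d, r = 200, 5, 2
X = rng.normal(size=(d, n))                # columns are the (centred) data points x_i
X -= X.mean(axis=1, keepdims=True)

Sigma = X @ X.T / n                        # Sigma = (1/n) sum_i x_i x_i^T
eigvals, eigvecs = np.linalg.eigh(Sigma)   # ascending eigenvalues
U = eigvecs[:, ::-1][:, :r]                # top-r eigenvectors as columns

recon_error = np.linalg.norm(X - U @ U.T @ X, 'fro') ** 2
trace_term = np.trace(U.T @ (X @ X.T) @ U)
assert np.isclose(recon_error, np.linalg.norm(X, 'fro') ** 2 - trace_term)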


2. With a mean-centered data matrix X, a key component of PCA is the spectral decomposition of a matrix M, defined as M=VDV^T, where:

M\in\mathbb{R}^{d\times d}

V is a matrix whose columns \pmb{v}_1, \pmb{v}_2, ..., \pmb{v}_d  are the eigenvectors of M

D is the diagonal matrix with D_{i,i}=\lambda_i and \lambda_i is the eigenvalue of \pmb{v}_i

Assume that obtaining the square matrix M is computationally expensive. Can you suggest an alternative method to obtain the eigenvectors of M?

Solution:

Consider the n\times n matrix K=X^TX, such that K_{ij}=\langle x_i, x_j\rangle

Suppose \pmb{v}^* is an eigenvector of K: K\pmb{v}^*=\lambda\pmb{v}^* for some \lambda\in \mathbb{R}.

Multiplying the equation from the left by \frac{1}{n}X and using the definition of K, we obtain:

\frac{1}{n}XX^TX\pmb{v}^*=\frac{1}{n}\lambda X\pmb{v}^*

That is, using the definition of \Sigma=\frac{1}{n}XX^T, we get that \Sigma(X\pmb{v}^*)=\frac{\lambda}{n}(X\pmb{v}^*)

Hence, provided X\pmb{v}^*\neq 0, \frac{X\pmb{v}^*}{\Vert X\pmb{v}^*\Vert} is an eigenvector of \Sigma with eigenvalue \frac{\lambda}{n}

Therefore, we can obtain the PCA solution by computing the eigenvectors (and eigenvalues) of the n\times n matrix K rather than of the d\times d matrix \Sigma, which is cheaper whenever n<d.
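
A minimal NumPy sketch of this trick on hypothetical data with n<d: an eigenvector \pmb{v}^* of the n\times n matrix K is mapped to an eigenvector of \Sigma via X\pmb{v}^*/\Vert X\pmb{v}^*\Vert.

import numpy as np

rng = np.random.default_rng(1)
d, n = 1000, 50                            # high-dimensional data, few samples
X = rng.normal(size=(d, n))
X -= X.mean(axis=1, keepdims=True)         # mean-centre the columns x_i

K = X.T @ X                                # n x n matrix with K_ij = <x_i, x_j>
lam, V_star = np.linalg.eigh(K)            # eigenpairs of K (ascending eigenvalues)

v_star = V_star[:, -1]                     # leading eigenvector of K
v = X @ v_star
v /= np.linalg.norm(v)                     # candidate eigenvector of Sigma

Sigma_v = X @ (X.T @ v) / n                # apply Sigma = (1/n) X X^T without forming it
assert np.allclose(Sigma_v, (lam[-1] / n) * v)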


3. In PCA, W is the compression matrix and U is the recovery matrix. The corresponding objective for finding W and U is phrased as:

argmin_{W\in\mathbb{R}^{r\times d},U\in\mathbb{R}^{d\times r}}\sum^n_{i=1}\Vert x_i-UWx_i\Vert^2_2

Prove the lemma: let (U,W) be a solution to the objective above. Then the columns of U are orthonormal (that is, U^TU is the r\times r identity matrix) and W=U^T.

Solution:

Choose any U and W and consider the mapping \pmb{x}\rightarrow UW\pmb{x}

The range of this mapping, R=\left\{ UW\pmb{x}:\pmb{x}\in\mathbb{R}^d\right\} is an r-dimensional linear subspace of \mathbb{R}^d .

Let V\in\mathbb{R}^{d\times r} be a matrix whose columns form an orthonormal basis of this subspace (i.e. V^TV=I ).

Each vector in R can be written as V\pmb{y} , where \pmb{y}\in\mathbb{R}^r .

For every \pmb{x}\in\mathbb{R}^d and \pmb{y}\in\mathbb{R}^r we have

\Vert \pmb{x}-V\pmb{y}\Vert^2_2=\Vert\pmb{x}\Vert^2_2+\Vert\pmb{y}\Vert^2_2-2\pmb{y}^T(V^T\pmb{x}) (using V^TV=I)

This expression is minimised (set the gradient with respect to \pmb{y} to zero) at \pmb{y}=V^T\pmb{x}.

Therefore, for each \pmb{x} we have that:

V\pmb{y}=VV^T\pmb{x}=argmin_{\tilde{\pmb{x}}\in R}\Vert\pmb{x}-\tilde{\pmb{x}}\Vert^2_2

This holds for all \pmb{x}_i and we can replace U,W by V, V^T without increasing the objective:

\sum^n_{i=1}\Vert\pmb{x}_i-UW\pmb{x}_i\Vert^2_2\geq\sum^n_{i=1}\Vert\pmb{x}_i-VV^T\pmb{x}_i\Vert^2_2

As this holds for any U,W, an optimal solution can always be taken with U=V (whose columns are orthonormal) and W=V^T=U^T, which completes the proof.
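
The argument can be illustrated numerically. A minimal NumPy sketch with hypothetical random matrices: for an arbitrary pair (U,W), projecting onto an orthonormal basis V of the range of UW never increases the reconstruction error.

import numpy as np

rng = np.random.default_rng(2)
n, d, r = 100, 6, 2
X = rng.normal(size=(d, n))
U = rng.normal(size=(d, r))                # arbitrary recovery matrix
W = rng.normal(size=(r, d))                # arbitrary compression matrix

Q, s, _ = np.linalg.svd(U @ W)             # left singular vectors span range(UW)
V = Q[:, :r]                               # orthonormal basis (rank r for generic U, W)

err_uw = np.sum(np.linalg.norm(X - U @ W @ X, axis=0) ** 2)
err_v = np.sum(np.linalg.norm(X - V @ V.T @ X, axis=0) ** 2)
assert err_v <= err_uw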


4. Let \Sigma=VDV^T be the spectral decomposition of \Sigma. D is a diagonal matrix such that D_{i,i} is the i-th largest eigenvalue of \Sigma. The columns of V are the corresponding eigenvectors, and V^TV=VV^T=I. Then the solution to argmax_{U\in\mathbb{R}^{d\times r}:U^TU=I}trace(U^T\Sigma U) is the matrix U whose columns are the first r eigenvectors of \Sigma.

 Solution:

Choose a matrix U\in\mathbb{R}^{d,r} with orthonormal columns and let B=V^TU. Then, VB=VV^TU=U. It follows that

U^T\Sigma U=B^TV^TVDV^TVB=B^TDB

and therefore

trace(U^T\Sigma U)=\sum^d_{j=1}D_{j,j}\sum^r_{i=1}B_{j,i}^2.

Note that B^TB=U^TVV^TU=U^TU=I. Hence the columns of B are orthonormal and \sum^d_{j=1}\sum^r_{i=1}B_{j,i}^2=r.

Define \tilde{B}\in\mathbb{R}^{d\times d} to be a matrix whose first r columns are the columns of B and which satisfies \tilde{B}^T\tilde{B}=I (that is, complete the columns of B to an orthonormal basis of \mathbb{R}^d). Then \tilde{B}\tilde{B}^T=I as well, so the rows of \tilde{B} also have unit norm: for every j, \sum^d_{i=1}\tilde{B}^2_{j,i}=1, which implies that \sum^r_{i=1}B_{j,i}^2\leq 1.

It follows that

trace(U^T\Sigma U)\leq max_{\beta\in [0,1]^d:\Vert\beta\Vert_1\leq r}\sum^d_{j=1}D_{j,j}\beta_j=\sum^r_{j=1}D_{j,j}

Therefore, for every matrix U\in\mathbb{R}^{d\times r} with orthonormal columns, the following inequality holds: trace(U^T\Sigma U)\leq\sum^r_{j=1}D_{j,j}

But if we set U to be the matrix with the r leading eigenvectors of \Sigma as its columns, we obtain trace(U^T\Sigma U)=\sum^r_{j=1}D_{j,j}, which attains the upper bound, so this U is an optimal solution.
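
A minimal NumPy check of this bound on a hypothetical covariance matrix: a random orthonormal U never exceeds the sum of the r largest eigenvalues, while the r leading eigenvectors attain it.

import numpy as np

rng = np.random.default_rng(3)
d, r = 8, 3
A = rng.normal(size=(d, d))
Sigma = A @ A.T / d                        # a positive semidefinite matrix playing the role of Sigma

eigvals, V = np.linalg.eigh(Sigma)         # ascending eigenvalues
U_opt = V[:, ::-1][:, :r]                  # r leading eigenvectors
bound = eigvals[::-1][:r].sum()            # sum of the r largest eigenvalues

U_rand, _ = np.linalg.qr(rng.normal(size=(d, r)))   # a random matrix with orthonormal columns
assert np.trace(U_rand.T @ Sigma @ U_rand) <= bound + 1e-10
assert np.isclose(np.trace(U_opt.T @ Sigma @ U_opt), bound)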


5. The variance captured by the first r eigenvectors of \Sigma is the sum over its r largest eigenvalues \sum^r_{i=1}\lambda_i.

Solution:

The variance in a dataset X is defined as

var(X)=\frac{1}{n}\sum^n_{i=1}\Vert x_i-\pmb{0}\Vert^2 (the mean is \pmb{0} because the data is centred)

=\frac{1}{n}\sum^n_{i=1}\Vert x_i\Vert^2=\frac{1}{n}\sum^n_{i=1}\langle x_i,x_i\rangle =tr(\Sigma)=tr(VDV^T)=tr(V^TVD)=tr(D)

=\sum^d_{i=1}D_{i,i}=\sum^d_{i=1}\lambda_i

The variance in a projected dataset WX, with W=[\pmb{v}_1,...,\pmb{v}_r]^T , is defined as

var(WX)=\frac{1}{n}\sum^n_{j=1}\Vert W\pmb{x}_j\Vert^2 =\frac{1}{n}\sum^n_{j=1}\langle  W\pmb{x}_j,W\pmb{x}_j\rangle =\frac{1}{n}\sum^n_{j=1}\pmb{x}_j^T W^TW\pmb{x}_j

=\frac{1}{n}\sum^n_{j=1}\pmb{x}_j^T(\pmb{v}_1\pmb{v}_1^T+...+\pmb{v}_r\pmb{v}_r^T)\pmb{x}_j=\sum^r_{i=1}\pmb{v}_i^T\left(\frac{1}{n}\sum^n_{j=1}\pmb{x}_j\pmb{x}_j^T\right)\pmb{v}_i=

(\pmb{v}_1^T\Sigma\pmb{v}_1+...+\pmb{v}_r^T\Sigma\pmb{v}_r)=\sum^r_{i=1}\pmb{v}_i^T\Sigma\pmb{v}_i=\sum^r_{i=1}\pmb{v}_i^T\lambda_i\pmb{v}_i=\sum^r_{i=1}\lambda_i

(The last step uses \Sigma\pmb{v}_i=\lambda_i\pmb{v}_i and \pmb{v}_i^T\pmb{v}_i=1: as in 4, the eigenvectors are orthonormal.)
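
A minimal NumPy check on hypothetical centred data: the variance of the projected dataset WX equals the sum of the r largest eigenvalues of \Sigma.

import numpy as np

rng = np.random.default_rng(4)
d, n, r = 6, 500, 2
X = rng.normal(size=(d, n)) * np.arange(1, d + 1)[:, None]   # give the coordinates unequal variances
X -= X.mean(axis=1, keepdims=True)

Sigma = X @ X.T / n
eigvals, V = np.linalg.eigh(Sigma)         # ascending eigenvalues
W = V[:, ::-1][:, :r].T                    # rows are the r leading eigenvectors v_1, ..., v_r

var_proj = np.sum(np.linalg.norm(W @ X, axis=0) ** 2) / n
assert np.isclose(var_proj, eigvals[::-1][:r].sum())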

DFS code and Minimum DFS code

Edge Code

We represent each edge as e=(start, end), where start and end are the starting and ending vertices of the edge. The edge code of e is defined as:

\left\{ Index_{start}, Index_{end},L_V(start), L_E(e), L_V(end)\right\}

where the Index of a vertex is its position in the DFS visiting order, L_V(\cdot) denotes a vertex label and L_E(\cdot) an edge label. The numbers marked in red in the following graph are the indices of the vertices.

[Figure: an example graph with its DFS tree; the vertices are indexed 0-4 in DFS visiting order (marked in red) and labelled X, Y, X, Z, Z; the edge labels are a, b, a, c, b, d.]

Under this DFS traversal, the graph has the following edge codes:

Forward edge: {0,1,X,a,Y}, {1,2,Y,b,X}, {2,3,X,c,Z}, {1,4,Y,d,Z}

Backward edge: {3,1,Z,b,Y}, {2,0,X,a,X}

It is easy to see that Index_{start}<Index_{end} for forward edges and Index_{start}>Index_{end} for backward edges.

The order of edges and DFS code

Suppose we have two edges

e_1=\left\{ Index_{start1},Index_{end1}\right\}

e_2=\left\{ Index_{start2},Index_{end2}\right\}

To determine the ordering between two edges we must consider: i) the types of the edges, ii) the visiting order (Index) of the nodes in each edge. (A Python sketch of the full comparison follows the three cases below.)

If e_1 and e_2 are both forward edges, then e_1 < e_2 if one of the following holds:

  1. Index_{end1}<Index_{end2}
  2. Index_{end1}=Index_{end2} and Index_{start1}>Index_{start2} (the forward edge that starts from the deeper vertex comes first)

If e_1 and e_2 are both backward edges, then e_1 < e_2 if one of the following holds:

  1. Index_{start1}<Index_{start2}
  2. Index_{start1}=Index_{start2} and Index_{end1}<Index_{end2}

If e_1 and e_2 are one forward edge and one backward edge, then e_1 < e_2 if one of the following two holds:

  1. e_1 is forward edge and e_2 is backward edge and Index_{end1}\leq Index_{start2}
  2. e_1 is backward edge and e_2 is forward edge and Index_{start1}<Index_{end2}
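
Putting the three cases together, here is a minimal Python sketch of the comparison; edges are represented as hypothetical (start, end) index pairs, and an edge is forward exactly when its start index is smaller than its end index.

def is_forward(edge):
    # edge = (start_index, end_index); forward edges grow the DFS tree
    start, end = edge
    return start < end

def edge_less_than(e1, e2):
    # Return True if e1 < e2 in the DFS edge order described above.
    s1, t1 = e1
    s2, t2 = e2
    f1, f2 = is_forward(e1), is_forward(e2)
    if f1 and f2:                          # both forward
        return t1 < t2 or (t1 == t2 and s1 > s2)
    if not f1 and not f2:                  # both backward
        return s1 < s2 or (s1 == s2 and t1 < t2)
    if f1 and not f2:                      # forward vs backward
        return t1 <= s2
    return s1 < t2                         # backward vs forward

# Example: the forward edge (1,2) precedes the backward edge (2,0)
assert edge_less_than((1, 2), (2, 0))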

The DFS code is just the concatenation of all the edge codes in this order, i.e. the DFS code for the above DFS tree is:

{0,1,X,a,Y,1,2,Y,b,X,2,0,X,a,X,2,3,X,c,Z,3,1,Z,b,Y,1,4,Y,d,Z}.
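
For concreteness, the same DFS code can be held as a list of 5-tuples (Index_{start}, Index_{end}, L_V(start), L_E(e), L_V(end)) in Python:

# DFS code of the example graph as an ordered list of edge codes
dfs_code = [
    (0, 1, 'X', 'a', 'Y'),
    (1, 2, 'Y', 'b', 'X'),
    (2, 0, 'X', 'a', 'X'),   # backward edge
    (2, 3, 'X', 'c', 'Z'),
    (3, 1, 'Z', 'b', 'Y'),   # backward edge
    (1, 4, 'Y', 'd', 'Z'),
]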

DFS Lexicographic Order and Minimum DFS code

The DFS lexicographic order of the DFS code is defined as follows:

If we have two DFS codes \alpha and \beta

\alpha={\alpha_0,...,\alpha_m}, \beta={\beta _0,...,\beta_n}

then \alpha\leq\beta if and only if either of the following is true

1.\exists t,0\leq t\leq min(m,n), \forall k<t, \alpha_k=\beta_k and \alpha_t<\beta_t

2.m\leq n and \forall k\leq m, \alpha_k=\beta_k

So we can order a list of DFS codes according to the DFS lexicographic order, and the minimum DFS code of a graph is defined as the smallest code among all of its possible DFS codes.
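
A minimal Python sketch of this comparison; it assumes a helper less_than(a, b) that compares two individual edge codes (for example by the edge order above, with ties broken by the vertex and edge labels), which is an illustrative assumption rather than a full gSpan implementation.

def dfs_code_less_equal(alpha, beta, less_than):
    # DFS lexicographic order: the first position where the two codes
    # differ decides; if neither code has a smaller element first, the
    # shorter code (a prefix of the other) is the smaller one.
    for a, b in zip(alpha, beta):
        if less_than(a, b):
            return True
        if less_than(b, a):
            return False
    return len(alpha) <= len(beta)

# The minimum DFS code of a graph is then the smallest code, under this
# order, among all DFS codes produced by its possible DFS traversals.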

Kernel PCA


Assume that we are dealing with centred data in feature space: \sum^n_{i=1}\phi(x_i)=0

The covariance matrix in feature space then takes the form C=\frac{1}{n}\sum^n_{i=1}\phi(x_i)\phi(x_i)^T

Then we have to find eigenvalues \lambda\geq0 and nonzero eigenvectors v in Hilbert space satisfying: \lambda v=Cv

All solutions v with \lambda\neq0 lie in the span of \phi(x_1),...,\phi(x_n), due to the fact that

\lambda v=Cv=\frac{1}{n}\sum^n_{i=1}\phi(x_i)\phi(x_i)^Tv=\frac{1}{n}\sum^n_{i=1}(\phi(x_i)^Tv)\phi(x_i)   (\phi(x_i)^Tv is an inner product between two vectors, so it is a scalar and can be moved to the front.)

We can derive two useful consequences from this equation. The first is:

If we take the inner product of both sides with a point \phi(x_j), we get:

\lambda\langle\phi(x_j),v\rangle=\langle\phi(x_j),Cv\rangle

The second consequence is that the eigenvector can be written as a linear combination of points in Hilbert space:

v=\sum^n_{i=1}\alpha_i\phi(x_i)  (with \alpha_i=\frac{\phi(x_i)^Tv}{n\lambda} read off from the equation above)

Substituting consequence 2 into consequence 1, we get:

\lambda\langle\phi(x_j),\sum^n_{i=1}\alpha_i\phi(x_i)\rangle=\langle\phi(x_j),\frac{1}{n}\sum^n_{k=1}\phi(x_k)\phi(x_k)^T\sum^n_{i=1}\alpha_i\phi(x_i)\rangle    (43)

There are two properties of the inner product that we will use:

\langle x, \lambda x'\rangle=\lambda\langle x,x'\rangle (\lambda is a constant)

\langle x,\sum_i\alpha_ix'_i\rangle =\sum_i\alpha_i\langle x,x'_i\rangle (\alpha_i is a constant)

Using these two properties, we can rewrite equation (43), for all j=1,…,n, as:

\lambda\sum^n_{i=1}\alpha_i\langle\phi(x_j),\phi(x_i)\rangle=\frac{1}{n}\sum^n_{i=1}\alpha_i\langle\phi(x_j),\sum^n_{k=1}\phi(x_k)\langle\phi(x_k),\phi(x_i)\rangle\rangle ;   (44)

\lambda\sum^n_{i=1}\alpha_i\langle\phi(x_j),\phi(x_i)\rangle=\frac{1}{n}\sum^n_{i=1}\alpha_i\sum^n_{k=1}\langle\phi(x_k),\phi(x_i)\rangle\langle\phi(x_j),\phi(x_k)\rangle;   (45)

n\lambda\sum^n_{i=1}\alpha_i\underbrace{\langle\phi(x_j),\phi(x_i)\rangle}_{K_{ji}}=\sum^n_{i=1}\alpha_i\underbrace{\sum^n_{k=1}\langle\phi(x_j),\phi(x_k)\rangle\langle\phi(x_k),\phi(x_i)\rangle}_{[K^2]_{ji}}.  (46)

The two underbraced terms use \langle\phi(x_j),\phi(x_k)\rangle=K(x_j,x_k)=K_{jk} and

\sum^n_{k=1}K_{jk}K_{ki}=[K\cdot K]_{ji}=[K^2]_{ji}  (This is just the definition of matrix multiplication: the j-th row of K multiplied by the i-th column of K gives the (j,i) entry of the product; since K is symmetric, this also equals [K^2]_{ij}.)

 

Kernel PCA as an eigenvector problem

Equation (46) can be rewritten as:

n\lambda\textbf{K}\pmb{\alpha}=\textbf{K}^2\pmb{\alpha}   (47)

where \pmb{\alpha} denotes the column vector with entries \alpha_1,...,\alpha_n

To find the solution of Equation (47), we solve the problem

n\lambda\pmb{\alpha}=\textbf{K}\pmb{\alpha}

which we obtain by multiplying (47) by \textbf{K}^{-1} from the left (assuming \textbf{K} is invertible); conversely, every solution of this simpler eigenvalue problem also satisfies (47).
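
A minimal NumPy sketch on hypothetical data, using an RBF kernel purely as an example (the kernel choice and the data are assumptions here); any eigenvector of \textbf{K} satisfies both the simpler problem and (47).

import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 2))               # 30 hypothetical points in R^2

# Example kernel matrix (an RBF kernel is assumed for illustration);
# the derivation above assumes the mapped data is already centred.
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)

evals, evecs = np.linalg.eigh(K)           # K alpha = (n lambda) alpha
alpha = evecs[:, -1]                       # leading eigenvector of K
n_lambda = evals[-1]                       # corresponding eigenvalue (= n lambda)

assert np.allclose(n_lambda * (K @ alpha), K @ (K @ alpha))   # equation (47)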

 

Normalizing the coefficients

We require the eigenvectors \pmb{v}^k to have unit length, that is, \langle \pmb{v}^k,\pmb{v}^k\rangle=1 for all k=1,…,r.

That means that

1=\langle\pmb{v}^k,\pmb{v}^k\rangle

=\sum^n_{i,j=1}\alpha_i^k\alpha^k_j\langle\phi(x_i),\phi(x_j)\rangle

=\sum^n_{i,j=1}\alpha^k_i\alpha^k_jK_{i,j}

=\langle\alpha^k,\pmb{K}\alpha^k\rangle=\lambda_k \langle\alpha^k,\alpha^k\rangle, where \lambda_k denotes the eigenvalue of \pmb{K} corresponding to \alpha^k.

As eigenvectors of \pmb{K}, the \alpha^k initially have unit norm.

Therefore we have to rescale them by \sqrt{\frac{1}{\lambda_k}} so that \Vert\alpha^k\Vert^2=\frac{1}{\lambda_k}, which makes \langle\pmb{v}^k,\pmb{v}^k\rangle=1.
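
Continuing the sketch above (same hypothetical RBF kernel), the normalisation and the resulting projections onto the kernel principal components, \langle\pmb{v}^k,\phi(x_j)\rangle=\sum_i\alpha^k_iK(x_i,x_j), look like this:

import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 2))
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)                # same hypothetical RBF kernel as above

evals, evecs = np.linalg.eigh(K)
r = 2
idx = np.argsort(evals)[::-1][:r]          # r leading eigenpairs of K
alphas = evecs[:, idx] / np.sqrt(evals[idx])   # rescale so that <v^k, v^k> = 1

# Check the normalisation: <v^k, v^k> = (alpha^k)^T K alpha^k = 1
assert np.allclose(np.diag(alphas.T @ K @ alphas), 1.0)

# Projection of the training points onto the kernel principal components
projections = K @ alphas                   # shape (n, r); column k holds <v^k, phi(x_j)>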