Importance Sampling

Despite its name, importance sampling is not a method for sampling from a probability distribution. It is a variant of Monte Carlo approximation, i.e. an approximation method.

In Monte Carlo approximation we approximate the expected value of f(x) :

E(f(x)) \approx \frac{1}{n}\sum^n_{i=1}f(x_i). Here the x_i are distributed according to the same distribution as X (X\sim P, x_i\sim P). But this relies on a big assumption: that we can efficiently draw samples from the true distribution P of the random variable X. What happens if we can't do that? It can certainly happen that we are unable to draw those samples efficiently.

Let’s focus on the density p.

E(f(x))=\int f(x)p(x)dx=\int f(x)\frac{p(x)}{q(x)}q(x)dx, for every PDF q such that q(x)=0 \rightarrow p(x)=0

We can view the formula like this

E(f(x))=\int f(x)p(x)dx=\int(f(x)\frac{p(x)}{q(x)})q(x)dx

We can now treat f(x)\frac{p(x)}{q(x)} as our new 'f(x)' and q(x) as the sampling distribution. So we can approximate the expectation by \frac{1}{n}\sum^n_{i=1}f(x_i)\frac{p(x_i)}{q(x_i)}, where the x_i are drawn according to q instead of p.

This is the importance sampling trick.

We can write the approximation as E(f(x))\approx\frac{1}{n}\sum^n_{i=1}f(x_i)w(x_i), where x_i\sim q and w(x)=\frac{p(x)}{q(x)}. The function w is called the importance weight.
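As a quick illustration, here is a minimal NumPy/SciPy sketch (the target p, proposal q and test function f are arbitrary choices of mine, not from the notes): it estimates E(f(X)) for X\sim N(0,1) by sampling from a wider proposal N(0,2^2) and reweighting with p/q.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Target density p = N(0, 1), proposal q = N(0, 2^2), test function f(x) = x^2.
# Under p, E[f(X)] is exactly 1 (the variance of a standard normal).
f = lambda x: x ** 2
n = 100_000

x = rng.normal(loc=0.0, scale=2.0, size=n)                             # x_i ~ q
w = norm.pdf(x, loc=0.0, scale=1.0) / norm.pdf(x, loc=0.0, scale=2.0)  # w(x_i) = p(x_i)/q(x_i)

estimate = np.mean(f(x) * w)                                           # (1/n) * sum_i f(x_i) w(x_i)
print(estimate)                                                        # close to 1.0
```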

Amortised Analysis: The Potential Method

Quote from Introduction to Algorithms.

In an amortised analysis, we average the time required to perform a sequence of data-structure operations over all the operations performed.

The potential method of amortised analysis represents the prepaid work as “potential energy”, or just “potential”, which can be released to pay for future operations. We associate the potential with the data structure as a whole rather than with specific objects within the data structure.

The potential method works as follows. We will perform n operations, starting with an initial data structure D_0 . For each i=1,2,…,n, we let c_i be the actual cost of the i th operation and D_i be the data structure that results after applying the i th operation to the data structure D_{i-1} . A potential function \Phi maps each data structure D_i to a real number \Phi(D_i) , which is the potential associated with data structure D_i . The amortised cost \hat{c_i} of the i th operation with respect to potential function \Phi is defined by

\hat{c_i}=c_i+\Phi(D_i)-\Phi(D_{i-1})     (1)

The amortised cost of each operation is therefore its actual cost plus the change in potential due to the operation. By equation (1), the total amortised cost of the n operations is

\sum^n_{i=1}\hat{c_i}=\sum^n_{i=1}(c_i+\Phi(D_i)-\Phi(D_{i-1}))

= \sum^n_{i=1}c_i+\Phi(D_n)-\Phi(D_0)


Consider an extensible array that can store an arbitrary number of integers. Each add operation inserts a new element after all the elements previously inserted. If there are no empty cells left, a new array of double the size is allocated, and all the data from the old array is copied to the corresponding entries in the new array. For instance, consider the following sequence of insertions, starting with an array of length 1 (figure omitted).

The table is doubled in the second, third, and fifth steps. Since each insertion takes O(n) time in the worst case, a simple analysis would yield a bound of O(n^2) time for n insertions. The true cost is much lower, however. Let's analyse a sequence of n operations using the potential method.

Suppose we can define a potential function \Phi on states of a data structure with the following properties:

  • \Phi(h_0)=0 , where h_0 is the initial state of the data structure.
  • \Phi(h_t)\geq 0 for all states h_t of the data structure occurring during the course of the computation.

Intuitively, the potential function will keep track of the precharged time at any point in the computation. It measures how much saved-up time is available to pay for expensive operations. It is analogous to the bank balance in the banker’s method. But interestingly, it depends only on the current state of the data structure, irrespective of the history of the computation that got it into that state.

We then define the amortised time of the operation as

c+\Phi(h')-\Phi(h),

where c is the actual cost of the operation and h and h' are the states of the data structure before and after the operation, respectively. Thus the amortised time is the actual time plus the change in potential. Ideally, \Phi should be defined so that the amortised time of each operation is small. Thus the change in potential should be positive for low-cost operations and negative for high-cost operations.

Now consider a sequence of n operations taking actual times c_0, c_1, c_2, ..., c_{n-1} and producing data structures h_1, h_2, ..., h_n starting from h_0 . The total amortised time is the sum of the individual amortised times:

(c_0+\Phi(h_1)-\Phi(h_0))+(c_1+\Phi(h_2)-\Phi(h_1))+...+(c_{n-1}+\Phi(h_n)-\Phi(h_{n-1}))=c_0+c_1+...+c_{n-1}+\Phi(h_n)-\Phi(h_0)=c_0+c_1+...+c_{n-1}+\Phi(h_n) .

Therefore the amortised time for a sequence of operations overestimates the actual time by \Phi(h_n) , which by assumption is always nonnegative. Thus the total amortised time is always an upper bound on the actual time.

For dynamically resizable arrays with resizing by doubling, we can use the potential function

\Phi(h)=2n-m ,

where n is the current number of elements and m is the current length of the array. If we start with an array of length 0 and allocate an array of length 1 when the first element is added, and thereafter double the array size whenever we need more space, we have \Phi(h_0)=0 and \Phi(h_t)\geq 0 for all t. The latter inequality holds because the number of elements is always at least half the size of the array.

Now we would like to show that adding an element takes amortised constant time. There are two cases.

  • If n<m , then the actual cost is 1, n increases by 1, and m does not change. The potential therefore increases by 2, so the amortised time is 1+2 = 3.
  • If n=m, then the array is doubled, so the actual time is n+1 . But the potential drops from n to 2, so the amortised time is n+1+(2-n)=3 .

In both cases, the amortised time is O(1).
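The following small Python sketch (my own illustration, not from the textbook) simulates the doubling array and checks that the actual cost plus the change in potential, with \Phi=2n-m, never exceeds 3 for any insertion.

```python
def simulate_insertions(num_ops: int) -> None:
    """Append into a doubling array and check the amortised cost of each insertion."""
    n, m = 0, 0                        # n = stored elements, m = current capacity
    phi = lambda n, m: 2 * n - m       # potential function Phi(h) = 2n - m

    for _ in range(num_ops):
        old_phi = phi(n, m)
        if n == m:                     # array is full: allocate double and copy
            m = max(1, 2 * m)
            actual_cost = n + 1        # copy n old elements + write the new one
        else:
            actual_cost = 1            # just write the new element
        n += 1
        amortised = actual_cost + phi(n, m) - old_phi
        assert amortised <= 3, (n, m, amortised)

simulate_insertions(1000)
print("every insertion has amortised cost at most 3")
```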

Eckart-Young Theorem

A Singular Value Decomposition (SVD) is a factorisation of a given matrix D\in\mathbb{R}^{n\times d} into three matrices:

D=L\Delta R^T

One can discard the singular vectors that correspond to zero singular values, to obtain the reduced SVD:
D=L_r\Delta_r R_r^T=\sum^r_{i=1}\delta_i\pmb{l}_i\pmb{r}_i^T 

Eckart-Young Theorem:

If D_r is the matrix defined as \sum^r_{i=1}\delta_i\pmb{l}_i\pmb{r}_i^T, then D_r is the best rank-r approximation of D, i.e. it minimises \Vert D-A\Vert_F over all matrices A of rank at most r.

The Frobenius norm of a matrix A\in\mathbb{R}^{n\times d}, \Vert A\Vert_F , is defined as

\Vert A\Vert_F=\sqrt{\sum^n_{i=1}\sum^d_{j=1}A_{i,j}^2}

Proof:

Assume D is of rank k (k>r).

Note that \Vert D-D_r\Vert_F=\Vert L\Delta R^T-D_r\Vert_F=\Vert\Delta-L^TD_rR\Vert_F , since multiplying by the orthogonal matrices L^T and R leaves the Frobenius norm unchanged.

Denoting N=L^TD_rR , we can compute the Frobenius norm as

\Vert\Delta-N\Vert^2_F=\sum_{i,j}\vert \Delta_{i,j}-N_{i,j}\vert^2=\sum^k_{i=1}\vert\delta_i-N_{i,i}\vert^2+\sum_{i>k}\vert N_{i,i}\vert^2+\sum_{i\neq j}\vert N_{i,j}\vert^2

This is minimised when all off-diagonal terms of N and all N_{i,i} for i>k are zero. Since N=L^TD_rR has rank at most r, the minimum of \sum_i\vert\delta_i-N_{i,i}\vert^2 is obtained for N_{i,i}=\delta_i for i=1,…,r, with all other N_{i,i} equal to zero.

We do not need the full L and R for computing D_r , only their first r columns. This can be seen by splitting L and R into blocks:

L=\left[ L_r,L_0 \right] and R=\left[ R_r,R_0\right]

and

D_r=\left[ L_r,L_0 \right]\begin{bmatrix}\Delta_r & 0\\ 0 & 0\end{bmatrix}\left[ R_r,R_0 \right]^T=L_r\Delta_r R_r^T
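As a quick numerical illustration of the theorem (a random matrix of my choosing, using NumPy's SVD):

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(size=(8, 5))
r = 2

# Full SVD: D = L @ diag(delta) @ R^T (numpy returns R^T directly).
L, delta, Rt = np.linalg.svd(D, full_matrices=False)

# Rank-r truncation D_r = sum_{i=1..r} delta_i * l_i r_i^T,
# built from only the first r columns of L and R.
D_r = L[:, :r] @ np.diag(delta[:r]) @ Rt[:r, :]

best_err = np.linalg.norm(D - D_r, "fro")
print(np.isclose(best_err, np.sqrt(np.sum(delta[r:] ** 2))))   # True

# Any other rank-r matrix (here a random one) has a larger Frobenius error.
B = rng.normal(size=(8, r)) @ rng.normal(size=(r, 5))
print(np.linalg.norm(D - B, "fro") >= best_err)                # True
```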

Some interesting problems in PCA

1. For PCA, in order to find U and W, we need to find a solution to the following objective:

argmin_ {U\in R^{d\times r}:U^TU=I}\sum^n_{i=1}\Vert x_i-UU^Tx_i\Vert_2^2

How can we simplify the equation above such that we transform the problem into a trace maximisation problem? Show the intermediate steps and the new objective function.

Solution:

\sum^n_{i=1}\Vert x_i-UU^Tx_i\Vert^2_2=\Vert X-UU^TX\Vert^2_F

=tr((X-UU^TX)^T(X-UU^TX))=tr(X^TX-2X^TUU^TX+X^TUU^TUU^TX)

=tr(X^TX)-tr(X^TUU^TX)=\Vert X\Vert^2_F-tr(U^TXX^TU), where we used U^TU=I and the cyclic property of the trace.

Since \Vert X\Vert^2_F does not depend on U, minimising this objective is equivalent to the trace maximisation problem:
argmax_{U\in R^{d\times r}:U^TU=I}trace(U^TXX^TU)

=argmax_{U\in R^{d\times r}:U^TU=I}trace(U^T\sum^n_{i=1}\pmb{x}_i\pmb{x}_i^TU)

If we define \Sigma=\frac{1}{n}\sum^n_{i=1}\pmb{x}_i\pmb{x}_i^T, this is equivalent (up to a constant factor) to:
argmax_{U\in R^{d\times r}:U^TU=I}trace(U^T\Sigma U)
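A quick NumPy check of this identity (random data and a random orthonormal U of my choosing; the columns of X are the points x_i):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, r = 5, 100, 2
X = rng.normal(size=(d, n))               # columns are the data points x_i

# A random d x r matrix with orthonormal columns, so that U^T U = I.
U, _ = np.linalg.qr(rng.normal(size=(d, r)))

reconstruction_error = np.sum(np.linalg.norm(X - U @ U.T @ X, axis=0) ** 2)
trace_form = np.linalg.norm(X, "fro") ** 2 - np.trace(U.T @ X @ X.T @ U)
print(np.isclose(reconstruction_error, trace_form))   # True
```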


2. With a mean-centred data matrix X, a key component of PCA is the spectral decomposition of a matrix M defined as M=VDV^T, where:

M\in\mathbb{R}^{d\times d}

V is a matrix whose columns \pmb{v}_1, \pmb{v}_2, ..., \pmb{v}_d  are the eigenvectors of M

D is the diagonal matrix with D_{i,i}=\lambda_i and \lambda_i is the eigenvalue of \pmb{v}_i

Assume that obtaining the square matrix M is computationally expensive. Can you suggest an alternative method to obtain the eigenvectors of M?

Solution:

Consider K=X^TX such that K_{ij}=\langle x_i, x_j\rangle

Suppose \pmb{v}^* is an eigenvector of K: K\pmb{v}^*=\lambda\pmb{v}^* for some \lambda\in \mathbb{R}.

Multiplying the equation by \frac{1}{n}X and using the definition of K we obtain:

\frac{1}{n}XX^TX\pmb{v}^*=\frac{1}{n}\lambda X\pmb{v}^*

But, using the definition of \Sigma, we get that \Sigma(X\pmb{v}^*)=\frac{\lambda}{n}(X\pmb{v}^*)

Hence \frac{X\pmb{v}^*}{\Vert X\pmb{v}^*\Vert} is an eigenvector of \Sigma with eigenvalue \frac{\lambda}{n}

Therefore, we can calculate the PCA solution by computing the eigenvectors of K rather than those of \Sigma. This is cheaper when n<d, since K is n\times n while \Sigma is d\times d.
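A minimal NumPy check of this trick (random data of my choosing; the columns of X are the points): mapping an eigenvector of K=X^TX through X gives an eigenvector of \Sigma=\frac{1}{n}XX^T with eigenvalue \lambda/n.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 6, 4
X = rng.normal(size=(d, n))              # columns are the data points x_i

K = X.T @ X                              # n x n Gram matrix, K_ij = <x_i, x_j>
Sigma = X @ X.T / n                      # d x d covariance matrix

lam, V = np.linalg.eigh(K)               # eigenpairs of K (ascending eigenvalues)
v_star, lam_star = V[:, -1], lam[-1]     # leading eigenpair of K

u = X @ v_star                           # candidate eigenvector of Sigma
u /= np.linalg.norm(u)                   # normalise as X v* / ||X v*||
print(np.allclose(Sigma @ u, (lam_star / n) * u))   # True
```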


3. In PCA, W is the compression matrix and U is the recovering matrix. The corresponding objective for finding W and U is phrased as:

argmin_{W\in\mathbb{R}^{r\times d},U\in\mathbb{R}^{d\times r}}\sum^n_{i=1}\Vert x_i-UWx_i\Vert^2_2     (1)

Prove the lemma: Let (U,W) be a solution to objective (1). Then the columns of U are orthonormal (that is, U^TU is the identity matrix of \mathbb{R}^r) and W=U^T.

Solution:

Choose any U and W and consider the mapping \pmb{x}\rightarrow UW\pmb{x}

The range of this mapping, R=\left\{ UW\pmb{x}:\pmb{x}\in\mathbb{R}^d\right\}, is a linear subspace of \mathbb{R}^d of dimension at most r.

Let V\in\mathbb{R}^{d\times r} be a matrix whose columns form an orthonormal basis of this subspace (i.e. V^TV=I ).

Each vector in R can be written as V\pmb{y} , where \pmb{y}\in\mathbb{R}^r .

For every \pmb{x}\in\mathbb{R}^d and \pmb{y}\in\mathbb{R}^r we have

\Vert \pmb{x}-V\pmb{y}\Vert^2_2=\Vert x\Vert^2+\Vert\pmb{y}\Vert^2-2\pmb{y}^T(V^T\pmb{x})

This expression is minimised over \pmb{y} at \pmb{y}=V^T\pmb{x} .

Therefore, for each \pmb{x} we have that:

V\pmb{y}=VV^T\pmb{x}=argmin_{\tilde{\pmb{x}}\in R}\Vert\pmb{x}-\tilde{\pmb{x}}\Vert^2_2

This holds for all \pmb{x}_i and we can replace U,W by V, V^T without increasing the objective:

\sum^n_{i=1}\Vert\pmb{x}_i-UW\pmb{x}_i\Vert^2_2\geq\sum^n_{i=1}\Vert\pmb{x}_i-VV^T\pmb{x}_i\Vert^2_2

As this holds for any U and W, an optimal solution can always be taken to be of the form (V,V^T) with V^TV=I, which proves the lemma.
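A small NumPy sanity check of this argument (random U, W and data of my choosing): projecting onto an orthonormal basis V of the range of UW never increases the reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 6, 2, 50
U = rng.normal(size=(d, r))              # an arbitrary recovery matrix
W = rng.normal(size=(r, d))              # an arbitrary compression matrix
X = rng.normal(size=(d, n))              # columns are the data points x_i

M = U @ W                                # the mapping x -> UWx has rank <= r
Q, _, _ = np.linalg.svd(M)
V = Q[:, :r]                             # orthonormal columns spanning range(M)

err_UW = np.sum(np.linalg.norm(X - M @ X, axis=0) ** 2)
err_V = np.sum(np.linalg.norm(X - V @ V.T @ X, axis=0) ** 2)
print(err_V <= err_UW)                   # True: (V, V^T) is never worse
```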


4. Let \Sigma=VDV^T be the spectral decomposition of \Sigma. D is a diagonal matrix such that D_{i,i} is the i-th largest eigenvalue of \Sigma. The columns of V are the corresponding eigenvectors, and V^TV=VV^T=I. Then the solution to argmax_{U\in\mathbb{R}^{d\times r}:U^TU=I}trace(U^T\Sigma U) is the matrix U whose columns are the first r eigenvectors of \Sigma.

 Solution:

Choose a matrix U\in\mathbb{R}^{d,r} with orthonormal columns and let B=V^TU. Then, VB=VV^TU=U. It follows that

U^T\Sigma U=B^TV^TVDV^TVB=B^TDB

and therefore

trace(U^T\Sigma U)=\sum^d_{j=1}D_{j,j}\sum^r_{i=1}B_{j,i}^2.

Note that B^TB=U^TVV^TU=U^TU=I. Hence the columns of B are orthonormal and \sum^d_{j=1}\sum^r_{i=1}B_{j,i}^2=r.

Define \tilde{B}\in\mathbb{R}^{d\times d} to be a matrix whose first r columns are the columns of B and which satisfies \tilde{B}^T\tilde{B}=I. Then \tilde{B} is orthogonal, so its rows are also orthonormal; hence for every j, \sum^d_{i=1}\tilde{B}^2_{j,i}=1, which implies that \sum^r_{i=1}B_{j,i}^2\leq 1.

It follows that

trace(U^T\Sigma U)\leq max_{\beta\in [0,1]^d:\Vert\beta\Vert_1\leq r}\sum^d_{j=1}D_{j,j}\beta_j=\sum^r_{j=1}D_{j,j}

Therefore for every matrix U\in\mathbb{R}^{d\times r} with orthonormal columns, the following inequality holds: trace(U^T\Sigma U)\leq\sum^r_{j=1}D_{j,j}

But if we set U to the matrix with the r leading eigenvectors of \Sigma as its columns, we obtain trace(U^T\Sigma U)=\sum^r_{j=1}D_{j,j}, so this choice attains the optimum.
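A quick NumPy illustration (a random covariance matrix of my choosing): the r leading eigenvectors attain \sum^r_{j=1}D_{j,j}, and randomly drawn orthonormal U do not exceed that value.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 6, 2
A = rng.normal(size=(d, d))
Sigma = A @ A.T / d                       # a random symmetric PSD matrix

eigvals, V = np.linalg.eigh(Sigma)        # ascending eigenvalues
U_opt = V[:, -r:]                         # the r leading eigenvectors
best = np.trace(U_opt.T @ Sigma @ U_opt)
print(np.isclose(best, np.sum(eigvals[-r:])))   # True

for _ in range(1000):                     # no random orthonormal U does better
    U, _ = np.linalg.qr(rng.normal(size=(d, r)))
    assert np.trace(U.T @ Sigma @ U) <= best + 1e-9
print("no sampled U exceeds the leading-eigenvector value")
```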


5. The variance captured by the first r eigenvectors of \Sigma is the sum of its r largest eigenvalues, \sum^r_{i=1}\lambda_i .

Solution:

The variance in a dataset X is defined as

var(X)=\frac{1}{n}\sum^n_{i=1}\Vert x_i-\pmb{0}\Vert^2 (the data is centred, so its mean is \pmb{0})

=\frac{1}{n}\sum^n_{i=1}\Vert x_i\Vert^2=\frac{1}{n}\sum^n_{i=1}\langle x_i,x_i\rangle =tr(\Sigma)=tr(VDV^T)=tr(DV^TV)=tr(D)

=\sum^d_{i=1}D_{i,i}=\sum^d_{i=1}\lambda_i

The variance in a projected dataset WX, with W=[\pmb{v}_1,...,\pmb{v}_r]^T , is defined as

var(WX)=\frac{1}{n}\sum^n_{j=1}\Vert W\pmb{x}_j\Vert^2 =\frac{1}{n}\sum^n_{j=1}\langle  W\pmb{x}_j,W\pmb{x}_j\rangle =\frac{1}{n}\sum^n_{j=1}\pmb{x}_j^T W^TW\pmb{x}_j

=\frac{1}{n}\sum^n_{j=1}\pmb{x}_j^T(\pmb{v}_1\pmb{v}_1^T+...+\pmb{v}_r\pmb{v}_r^T)\pmb{x}_j=\sum^r_{i=1}\pmb{v}_i^T\left(\frac{1}{n}\sum^n_{j=1}\pmb{x}_j\pmb{x}_j^T\right)\pmb{v}_i=

(\pmb{v}_1^T\Sigma\pmb{v}_1+...+\pmb{v}_r^T\Sigma\pmb{v}_r)=\sum^r_{i=1}\pmb{v}_i^T\Sigma\pmb{v}_i=\sum^r_{i=1}\pmb{v}_i^T\lambda_i\pmb{v}_i=\sum^r_{i=1}\lambda_i

(From the spectral decomposition in problem 4, the eigenvectors are orthonormal, so \pmb{v}_i^T\pmb{v}_i=1, which gives the last equality.)
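A short NumPy check of this statement (random centred data of my choosing; the columns of X are the points):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, r = 5, 200, 2
X = rng.normal(size=(d, n))
X -= X.mean(axis=1, keepdims=True)        # centre the data (columns are points)

Sigma = X @ X.T / n
eigvals, V = np.linalg.eigh(Sigma)        # ascending eigenvalues
W = V[:, -r:].T                           # W = [v_1, ..., v_r]^T (top-r eigenvectors)

var_projected = np.sum(np.linalg.norm(W @ X, axis=0) ** 2) / n
print(np.isclose(var_projected, np.sum(eigvals[-r:])))   # True
```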

DFS code and Minimum DFS code

Edge Code

We represent each edge e=\{start, end\} by its starting node start and ending node end. The edge code of e is defined as:

\left\{ Index_{start}, Index_{end},L_V(start), L_E(e), L_V(end)\right\}

where the index of a vertex is the number assigned to it in the order in which the DFS visits the vertices. In the example graph (figure omitted), the red numbers 0–4 are the vertex indices, the vertex labels are X, Y, X, Z, Z, and the edge labels are a, b, c, d.

There are many edge codes for this graph:

Forward edge: {0,1,X,a,Y}, {1,2,Y,b,X}, {2,3,X,c,Z}, {1,4,Y,d,Z}

Backward edge: {3,1,Z,b,Y}, {2,0,X,a,X}

Notice that Index_{start}<Index_{end} for forward edges and Index_{start}>Index_{end} for backward edges.

The order of edges and DFS code

Suppose we have two edges

e_1=\left\{ Index_{start1},Index_{end1}\right\}

e_2=\left\{ Index_{start2},Index_{end2}\right\}

To determine the ordering between 2 edges we must consider: i) the types of the edges, ii) the visiting order (Index) of the nodes in each edge.

If e_1 and e_2 are both forward edges, then e_1 < e_2 if one of the following holds:

  1. Index_{end1}<Index_{end2}
  2. Index_{end1}=Index_{end2} and Index_{start1}<Index_{start2}

If e_1 and e_2 are both backward edges, then e_1 < e_2 if one of the following holds:

  1. Index_{start1}<Index_{start2}
  2. Index_{start1}=Index_{start2} and Index_{end1}<Index_{end2}

If e_1 and e_2 are one forward edge and one backward edge, then e_1 < e_2 if one of the following two holds:

  1. e_1 is forward edge and e_2 is backward edge and Index_{end1}\leq Index_{start2}
  2. e_1 is backward edge and e_2 is forward edge and Index_{start1}<Index_{end2}

The DFS code is just the concatenation of all the edge codes in order. For the DFS tree above, the DFS code is:

{0,1,X,a,Y,1,2,Y,b,X,2,0,X,a,X,2,3,X,c,Z,3,1,Z,b,Y,1,4,Y,d,Z}.

DFS Lexicographic Order and Minimum DFS code

The DFS lexicographic order of the DFS code is defined as follows:

If we have two DFS codes \alpha and \beta

\alpha=\{\alpha_0,...,\alpha_m\}, \beta=\{\beta_0,...,\beta_n\}

then \alpha\leq\beta if and only if either of the following is true:

1. \exists t, 0\leq t\leq \min(m,n), such that \alpha_k=\beta_k for all k<t and \alpha_t<\beta_t

2. m\leq n and \alpha_k=\beta_k for all k\leq m

So we can order a list of DFS codes according to the DFS lexicographic order, and we define the minimum DFS code of a graph as the smallest one among all its possible DFS codes.
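Here is a small Python sketch (my own; the helper names are hypothetical) of the pairwise edge-ordering rules above. An edge is represented by its pair of indices (start, end) and is a forward edge exactly when start < end; the example verifies that the edge sequence of the DFS code above is indeed increasing.

```python
def is_forward(edge):
    """An edge (start, end) is a forward edge iff start < end."""
    start, end = edge
    return start < end

def edge_less_than(e1, e2):
    """Return True if e1 < e2 under the DFS edge-ordering rules."""
    s1, t1 = e1
    s2, t2 = e2
    if is_forward(e1) and is_forward(e2):          # both forward
        return t1 < t2 or (t1 == t2 and s1 < s2)
    if not is_forward(e1) and not is_forward(e2):  # both backward
        return s1 < s2 or (s1 == s2 and t1 < t2)
    if is_forward(e1):                             # e1 forward, e2 backward
        return t1 <= s2
    return s1 < t2                                 # e1 backward, e2 forward

# The edge sequence of the DFS code above, as (start, end) index pairs.
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 1), (1, 4)]
print(all(edge_less_than(a, b) for a, b in zip(edges, edges[1:])))   # True
```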

Kernel PCA


Assume that we are dealing with centred data in feature space: \sum^n_{i=1}\phi(x_i)=0

The covariance matrix in feature space then takes the form C=\frac{1}{n}\sum^n_{i=1}\phi(x_i)\phi(x_i)^T

Then we have to find eigenvalues \lambda\geq0 and nonzero eigenvectors v in Hilbert space satisfying: \lambda v=Cv

All solutions v with \lambda\neq0 lie in the span of \phi(x_1),...,\phi(x_n), due to the fact that

\lambda v=Cv=\frac{1}{n}\sum^n_{i=1}\phi(x_i)\phi(x_i)^Tv=\frac{1}{n}\sum^n_{i=1}(\phi(x_i)^Tv)\phi(x_i)   (\phi(x_i)^Tv is an inner product of two vectors, hence a scalar, so we can move it to the front.)

We can derive two useful consequences from this equation. The first is:

If we take the inner product of both sides with a point \phi(x_j), we get \lambda\phi(x_j)^Tv=\phi(x_j)^TCv, i.e.

\lambda\langle\phi(x_j),v\rangle=\langle\phi(x_j),Cv\rangle

The second consequence is that the eigenvector can be written as a linear combination of points in Hilbert space:

v=\sum^n_{i=1}\alpha_i\phi(x_i)  (with \alpha_i=\frac{\phi(x_i)^Tv}{n\lambda})

Substituting consequence 2 into consequence 1, we get:

\lambda\langle\phi(x_j),\sum^n_{i=1}\alpha_i\phi(x_i)\rangle=\langle\phi(x_j),\frac{1}{n}\sum^n_{k=1}\phi(x_k)\phi(x_k)^T\sum^n_{i=1}\alpha_i\phi(x_i)\rangle    (43)

Two properties of the inner product that we will use are:

\langle x, \lambda x'\rangle=\lambda\langle x,x'\rangle (\lambda is a constant)

\langle x,\sum_i\alpha_ix'_i\rangle =\sum_i\alpha_i\langle x,x'_i\rangle (\alpha_i is a constant)

Using these two properties, we can rewrite equation (43), for all j=1,…,n, as:

\lambda\sum^n_{i=1}\alpha_i\langle\phi(x_j),\phi(x_i)\rangle=\frac{1}{n}\sum^n_{i=1}\alpha_i\langle\phi(x_j),\sum^n_{k=1}\phi(x_k)\langle\phi(x_k),\phi(x_i)\rangle\rangle ;   (44)

\lambda\sum^n_{i=1}\alpha_i\langle\phi(x_j),\phi(x_i)\rangle=\frac{1}{n}\sum^n_{i=1}\alpha_i\sum^n_{k=1}\langle\phi(x_k),\phi(x_i)\rangle\langle\phi(x_j),\phi(x_k)\rangle;   (45)

n\lambda\sum^n_{i=1}\alpha_i\underbrace{\langle\phi(x_j),\phi(x_i)\rangle}_{K_{ji}}=\sum^n_{i=1}\alpha_i\underbrace{\sum^n_{k=1}\langle\phi(x_j),\phi(x_k)\rangle\langle\phi(x_k),\phi(x_i)\rangle}_{[K^2]_{ji}}.  (46)

The two underbraced terms involve the kernel matrix: \langle\phi(x_j),\phi(x_k)\rangle=K(x_j,x_k)=K_{jk}, and

\sum^n_{k=1}K_{jk}K_{ki}=[K\cdot K]_{ji}=[K^2]_{ji}  (this is just the definition of matrix multiplication: the j-th row of K multiplied by the i-th column of K gives the (j,i) entry of the product matrix).

 

Kernel PCA as an eigenvector problem

Equation (46) can be rewritten as:

n\lambda\textbf{K}\pmb{\alpha}=\textbf{K}^2\pmb{\alpha}   (47)

where \pmb{\alpha} denotes the column vector with entries \alpha_1,...,\alpha_n

To find the solution of Equation (47), we solve the problem

n\lambda\pmb{\alpha}=\textbf{K}\pmb{\alpha}

which we obtain by multiplying (47) by \textbf{K}^{-1} from the left.

 

Normalizing the coefficients

We require the eigenvectors \pmb{v}^k to have unit length, that is, \langle \pmb{v}^k,\pmb{v}^k\rangle=1 for all k=1,…,r.

That means that

1=\langle\pmb{v}^k,\pmb{v}^k\rangle

=\sum^n_{i,j=1}\alpha_i^k\alpha^k_j\langle\phi(x_i),\phi(x_j)\rangle

=\sum^n_{i,j=1}\alpha^k_i\alpha^k_jK_{i,j}

=\langle\pmb{\alpha}^k,\pmb{K}\pmb{\alpha}^k\rangle=\lambda_k \langle\pmb{\alpha}^k,\pmb{\alpha}^k\rangle

As eigenvectors of \pmb{K}, the \pmb{\alpha}^k initially have unit norm.

Therefore we have to rescale them by \sqrt{\frac{1}{\lambda_k}} to enforce that \langle\pmb{\alpha}^k,\pmb{\alpha}^k\rangle=\frac{1}{\lambda_k}.
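A compact NumPy sketch of these steps (my own illustration with an RBF kernel; like the notes, it assumes the data is already centred in feature space, so no kernel centring is applied):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                     # rows are the points x_i
n, r = X.shape[0], 2

# Kernel matrix K_ij = k(x_i, x_j), here an RBF kernel.
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-0.5 * sq_dists)

# Solve the eigenproblem of K (equivalent to n*lambda*alpha = K*alpha).
eigvals, eigvecs = np.linalg.eigh(K)             # ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Rescale the unit-norm eigenvectors so that <alpha^k, alpha^k> = 1/lambda_k.
alphas = eigvecs[:, :r] / np.sqrt(eigvals[:r])

# Projection of x_j onto the k-th component: <v^k, phi(x_j)> = sum_i alpha_i^k K(x_i, x_j).
projections = K @ alphas                         # n x r matrix of kernel PCA scores
print(projections.shape)                         # (20, 2)
```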

Multidimensional Scaling

Classical scaling

The task of MDS is to find the original Euclidean coordinates from a given distance matrix D.

The squared Euclidean distance between x_i and x_j is given by

d^2_{ij}=\sum^d_{k=1}(x_{ik}-x_{jk})^2=(x_i-x_j)^T(x_i-x_j)

B is the matrix of inner products of the coordinates. The general term b_{ij} of B is given by

b_{ij}=\sum^d_{k=1}x_{ik}x_{jk}=x_i^Tx_j

We can derive B from D:

d_{ij}^2=x_i^Tx_i+x_j^Tx_j-2x_i^Tx_j=b_{ii}+b_{jj}-2b_{ij}

For convenience, we centre the coordinate matrix X, which implies that \sum^n_{i=1}b_{ij}=0 (each column of B sums to zero); since the inner-product matrix is symmetric, we also have \sum^n_{j=1}b_{ij}=0

Summing over i, we obtain

\frac{1}{n}\sum^n_{i=1}d_{ij}^2=\frac{1}{n}\sum^n_{i=1}b_{ii}+b_{jj}-\frac{2}{n}\sum^n_{i=1}b_{ij}=\frac{1}{n}\sum^n_{i=1}b_{ii}+b_{jj}   (59), because \sum^n_{i=1}b_{ij}=0

b_{jj}=\frac{1}{n}\sum^n_{i=1}d^2_{ij}-\frac{1}{n}\sum^n_{i=1}b_{ii}

Summing over j, we obtain

\frac{1}{n}\sum^n_{j=1}d^2_{ij}=b_{ii}+\frac{1}{n}\sum^n_{j=1}b_{jj} (60)

b_{ii}=\frac{1}{n}\sum^n_{j=1}d^2_{ij}-\frac{1}{n}\sum^n_{j=1}b_{jj}

Averaging equation (60) over i, we obtain

\frac{1}{n}\sum^n_{i=1}\left(\frac{1}{n}\sum^n_{j=1}d^2_{ij}\right)=\frac{1}{n}\sum^n_{i=1}b_{ii}+\frac{1}{n}\sum^n_{i=1}\frac{1}{n}\sum^n_{j=1}b_{jj}

\frac{1}{n^2}\sum^n_{i=1}\sum^n_{j=1}d^2_{ij}=\frac{1}{n}\sum^n_{i=1}b_{ii}+\frac{1}{n}\sum^n_{j=1}b_{jj}=\frac{2}{n}\sum^n_{i=1}b_{ii}

We already know that b_{ij}=-\frac{1}{2}(d^2_{ij}-b_{ii}-b_{jj});

so b_{ij}=-\frac{1}{2}(d^2_{ij}-d^2_{i\bullet}-d^2_{\bullet j}+d^2_{\bullet\bullet}), where d^2_{i\bullet}=\frac{1}{n}\sum^n_{j=1}d^2_{ij}, d^2_{\bullet j}=\frac{1}{n}\sum^n_{i=1}d^2_{ij} and d^2_{\bullet\bullet}=\frac{1}{n^2}\sum^n_{i=1}\sum^n_{j=1}d^2_{ij}.

Define the matrix \pmb{A} as (a_{ij}) with a_{ij}=-\frac{1}{2}d^2_{ij}, and observe that

\pmb{B}=\pmb{HAH},

where \pmb{H}=\pmb{I}-\frac{1}{n}\pmb{11}^T is the centring matrix.

The inner product matrix \pmb{B} can be expressed as

\pmb{B}=\pmb{XX}^T

where X=(\pmb{x}_1,...,\pmb{x}_n)^T is the n\times p matrix of coordinates.

The rank of \pmb{B} is then

rank(\pmb{B})=rank(\pmb{XX}^T)=rank(\pmb{X})=p

\pmb{B} is symmetric, positive semi-definite, of rank p, and has p non-zero (positive) eigenvalues.

B can now be written as \pmb{B}=\pmb{V}_p\pmb{\Lambda}_p\pmb{V}_p^T

where \pmb{\Lambda}_p=diag(\lambda_1,...,\lambda_p) is the diagonal matrix of the non-zero eigenvalues of B, and \pmb{V}_p=(\pmb{v}_1,...,\pmb{v}_p) is the matrix of corresponding eigenvectors.

Hence the coordinate matrix X containing the point configuration in \mathbb{R}^p is given by \pmb{X}=\pmb{V}_p\pmb{\Lambda}_p^{\frac{1}{2}}

Pseudocode

The algorithm for recovering coordinates from distances D_{ij}=d_{ij} between pairs of points is as follows:

  1. Form the matrix \pmb{A} with A_{ij}=-\frac{1}{2}d_{ij}^2.
  2. Form the matrix \pmb{B}=\pmb{HAH}, where \pmb{H}=\pmb{I}-\frac{1}{n}\pmb{11}^T is the centring matrix.
  3. Find the spectral decomposition of \pmb{B}, \pmb{B}=\pmb{V\Lambda V}^T, where \Lambda is the diagonal matrix of eigenvalues and \pmb{V} is the matrix of corresponding eigenvectors.
  4. If the points were originally in a p-dimensional space, the first p eigenvalues of \pmb{B} are nonzero and the remaining n-p are zero. Discard these from \Lambda (rename it \Lambda_p ) and discard the corresponding eigenvectors from \pmb{V} (rename it \pmb{V}_p ).
  5. Find \pmb{X}=\pmb{V}_p\pmb{\Lambda}_p^{\frac{1}{2}}; the coordinates of the points are given by the rows of \pmb{X}.
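A minimal NumPy implementation of these five steps (my own sketch; the configuration is recovered only up to rotation and reflection, so the check compares pairwise distances rather than raw coordinates):

```python
import numpy as np
from scipy.spatial.distance import cdist

def classical_mds(D: np.ndarray, p: int) -> np.ndarray:
    """Recover an n x p coordinate matrix from an n x n distance matrix D."""
    n = D.shape[0]
    A = -0.5 * D ** 2                          # step 1: A_ij = -d_ij^2 / 2
    H = np.eye(n) - np.ones((n, n)) / n        # centring matrix H = I - (1/n) 1 1^T
    B = H @ A @ H                              # step 2
    eigvals, V = np.linalg.eigh(B)             # step 3 (ascending eigenvalues)
    eigvals, V = eigvals[-p:], V[:, -p:]       # step 4: keep the p largest
    return V * np.sqrt(eigvals)                # step 5: X = V_p Lambda_p^{1/2}

# Usage: pairwise distances of a known configuration are reproduced.
rng = np.random.default_rng(0)
X_true = rng.normal(size=(10, 3))
D = cdist(X_true, X_true)
X_rec = classical_mds(D, p=3)
print(np.allclose(cdist(X_rec, X_rec), D))     # True
```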