Importance Sampling

Despite its name, importance sampling is not a method for sampling from a probability distribution. It is a variant of Monte Carlo approximation, i.e. an approximation method.

In Monte Carlo approximation we approximate the expected value of f(x) :

E(f(x)) \approx \frac{1}{n}\sum^n_{i=1}f(x_i). Here the x_i are distributed according to the same distribution as X (X\sim P, x_i\sim P). But this relies on a big assumption: that we can efficiently draw samples from the true distribution P of the random variable X. What happens if we can't do that? It can certainly happen that we are unable to draw those samples efficiently.

Let’s focus on the density p.

E(f(x))=\int f(x)p(x)dx=\int f(x)\frac{p(x)}{q(x)}q(x)dx, for every PDF q such that q(x)=0 \rightarrow p(x)=0

We can view the formula like this

E(f(x))=\int f(x)p(x)dx=\int(f(x)\frac{p(x)}{q(x)})q(x)dx

We can now treat f(x)\frac{p(x)}{q(x)} as our new 'f(x)' and q(x) as the sampling distribution. So we can approximate the expectation by \frac{1}{n}\sum^n_{i=1}f(x_i)\frac{p(x_i)}{q(x_i)}, where the x_i are drawn according to q instead of p.

This is the importance sampling trick.

We can write the approximation as E(f(x))\approx\frac{1}{n}\sum^n_{i=1}f(x_i)w(x_i), where x_i\sim q and w(x)=\frac{p(x)}{q(x)}. The function w is called the importance weight.
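As a quick illustration, here is a minimal NumPy/SciPy sketch (the target p, proposal q and test function f are arbitrary choices of mine, not from the notes): it estimates E(f(X)) for X\sim N(0,1) by sampling from a wider proposal N(0,2^2) and reweighting with p/q.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Target density p = N(0, 1), proposal q = N(0, 2^2), test function f(x) = x^2.
# Under p, E[f(X)] is exactly 1 (the variance of a standard normal).
f = lambda x: x ** 2
n = 100_000

x = rng.normal(loc=0.0, scale=2.0, size=n)                             # x_i ~ q
w = norm.pdf(x, loc=0.0, scale=1.0) / norm.pdf(x, loc=0.0, scale=2.0)  # w(x_i) = p(x_i)/q(x_i)

estimate = np.mean(f(x) * w)                                           # (1/n) * sum_i f(x_i) w(x_i)
print(estimate)                                                        # close to 1.0
```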

Amortised Analysis: The Potential Method

Quote from Introduction to Algorithms.

In an amortised analysis, we average the time required to perform a sequence of data-structure operations over all the operations performed.

The potential method of amortised analysis represents the prepaid work as “potential energy”, or just “potential”, which can be released to pay for future operations. We associate the potential with the data structure as a whole rather than with specific objects within the data structure.

The potential method works as follows. We will perform n operations, starting with an initial data structure D_0 . For each i=1,2,…,n, we let c_i be the actual cost of the i th operation and D_i be the data structure that results after applying the i th operation to the data structure D_{i-1} . A potential function \Phi maps each data structure D_i to a real number \Phi(D_i) , which is the potential associated with data structure D_i . The amortised cost \hat{c_i} of the i th operation with respect to potential function \Phi is defined by

\hat{c_i}=c_i+\Phi(D_i)-\Phi(D_{i-1})     (1)

The amortised cost of each operation is therefore its actual cost plus the change in potential due to the operation. By equation (1), the total amortised cost of the n operations is

\sum^n_{i=1}\hat{c_i}=\sum^n_{i=1}(c_i+\Phi(D_i)-\Phi(D_{i-1}))

= \sum^n_{i=1}c_i+\Phi(D_n)-\Phi(D_0)


Consider an extensible array that can store an arbitrary number of integers. Each add operation inserts a new element after all the elements previously inserted. If there are no empty cells left, a new array of double the size is allocated, and all the data from the old array is copied to the corresponding entries in the new array. For instance, consider the following sequence of insertions, starting with an array of length 1 (figure omitted).

The table is doubled in the second, third, and fifth steps. Since each insertion takes O(n) time in the worst case, a simple analysis would yield a bound of O(n^2) time for n insertions. The true cost is much lower, however. Let's analyse a sequence of n operations using the potential method.

Suppose we can define a potential function \Phi on states of a data structure with the following properties:

  • \Phi(h_0)=0 , where h_0 is the initial state of the data structure.
  • \Phi(h_t)\geq 0 for all states h_t of the data structure occurring during the course of the computation.

Intuitively, the potential function will keep track of the precharged time at any point in the computation. It measures how much saved-up time is available to pay for expensive operations. It is analogous to the bank balance in the banker’s method. But interestingly, it depends only on the current state of the data structure, irrespective of the history of the computation that got it into that state.

We then define the amortised time of the operation as

c+\Phi(h')-\Phi(h),

where c is the actual cost of the operation and h and h' are the states of the data structure before and after the operation, respectively. Thus the amortised time is the actual time plus the change in potential. Ideally, \Phi should be defined so that the amortised time of each operation is small. Thus the change in potential should be positive for low-cost operations and negative for high-cost operations.

Now consider a sequence of n operations taking actual times c_0, c_1, c_2, ..., c_{n-1} and producing data structures h_1, h_2, ..., h_n starting from h_0 . The total amortised time is the sum of the individual amortised times:

(c_0+\Phi(h_1)-\Phi(h_0))+(c_1+\Phi(h_2)-\Phi(h_1))+...+(c_{n-1}+\Phi(h_n)-\Phi(h_{n-1}))=c_0+c_1+...+c_{n-1}+\Phi(h_n)-\Phi(h_0)=c_0+c_1+...+c_{n-1}+\Phi(h_n) .

Therefore the amortised time for a sequence of operations overestimates the actual time by \Phi(h_n) , which by assumption is always nonnegative. Thus the total amortised time is always an upper bound on the actual time.

For dynamically resizable arrays with resizing by doubling, we can use the potential function

\Phi(h)=2n-m ,

where n is the current number of elements and m is the current length of the array. If we start with an array of length 0 and allocate an array of length 1 when the first element is added, and thereafter double the array size whenever we need more space, we have \Phi(h_0)=0 and \Phi(h_t)\geq 0 for all t. The latter inequality holds because the number of elements is always at least half the size of the array.

Now we would like to show that adding an element takes amortised constant time. There are two cases.

  • If n<m , then the actual cost is 1, n increases by 1, and m does not change. The potential therefore increases by 2, so the amortised time is 1+2 = 3.
  • If n=m, then the array is doubled, so the actual time is n+1 . But the potential drops from n to 2, so the amortised time is n+1+(2-n)=3 .

In both cases, the amortised time is O(1).
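The following small Python sketch (my own illustration, not from the textbook) simulates the doubling array and checks that the actual cost plus the change in potential, with \Phi=2n-m, never exceeds 3 for any insertion.

```python
def simulate_insertions(num_ops: int) -> None:
    """Append into a doubling array and check the amortised cost of each insertion."""
    n, m = 0, 0                        # n = stored elements, m = current capacity
    phi = lambda n, m: 2 * n - m       # potential function Phi(h) = 2n - m

    for _ in range(num_ops):
        old_phi = phi(n, m)
        if n == m:                     # array is full: allocate double and copy
            m = max(1, 2 * m)
            actual_cost = n + 1        # copy n old elements + write the new one
        else:
            actual_cost = 1            # just write the new element
        n += 1
        amortised = actual_cost + phi(n, m) - old_phi
        assert amortised <= 3, (n, m, amortised)

simulate_insertions(1000)
print("every insertion has amortised cost at most 3")
```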

Eckart-Young Theorem

A Singular Value Decomposition (SVD) is a factorisation of a given matrix D\in\mathbb{R}^{n\times d} into three matrices:

D=L\Delta R^T

One can discard the singular vectors that correspond to zero singular values, to obtain the reduced SVD:
D=L_r\Delta_r R_r^T=\sum^r_{i=1}\delta_i\pmb{l}_i\pmb{r}_i^T 

Eckart-Young Theorem:

If D_r is the matrix defined as \sum^r_{i=1}\delta_i\pmb{l}_i\pmb{r}_i^T, then D_r is the best rank-r approximation of D, i.e. it minimises \Vert D-A\Vert_F over all matrices A of rank at most r.

The Frobenius norm of a matrix A\in\mathbb{R}^{n\times d}, \Vert A\Vert_F , is defined as

\Vert A\Vert_F=\sqrt{\sum^n_{i=1}\sum^d_{j=1}A_{i,j}^2}

Proof:

Assume D is of rank k (k>r).

Note that \Vert D-D_r\Vert_F=\Vert L\Delta R^T-D_r\Vert_F=\Vert\Delta-L^TD_rR\Vert_F , since multiplying by the orthogonal matrices L^T and R leaves the Frobenius norm unchanged.

Denoting N=L^TD_rR , we can compute the Frobenius norm as

\Vert\Delta-N\Vert^2_F=\sum_{i,j}\vert \Delta_{i,j}-N_{i,j}\vert^2=\sum^k_{i=1}\vert\delta_i-N_{i,i}\vert^2+\sum_{i>k}\vert N_{i,i}\vert^2+\sum_{i\neq j}\vert N_{i,j}\vert^2

This is minimised when all off-diagonal terms of N and all N_{i,i} for i>k are zero. Since N=L^TD_rR has rank at most r, the minimum of \sum_i\vert\delta_i-N_{i,i}\vert^2 is obtained for N_{i,i}=\delta_i for i=1,…,r, with all other N_{i,i} equal to zero.

We do not need the full L and R for computing D_r , only their first r columns. This can be seen by splitting L and R into blocks:

L=\left[ L_r,L_0 \right] and R=\left[ R_r,R_0\right]

and

D_r=\left[ L_r,L_0 \right]\begin{bmatrix}\Delta_r & 0\\ 0 & 0\end{bmatrix}\left[ R_r,R_0 \right]^T=L_r\Delta_r R_r^T
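As a quick numerical illustration of the theorem (a random matrix of my choosing, using NumPy's SVD):

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(size=(8, 5))
r = 2

# Full SVD: D = L @ diag(delta) @ R^T (numpy returns R^T directly).
L, delta, Rt = np.linalg.svd(D, full_matrices=False)

# Rank-r truncation D_r = sum_{i=1..r} delta_i * l_i r_i^T,
# built from only the first r columns of L and R.
D_r = L[:, :r] @ np.diag(delta[:r]) @ Rt[:r, :]

best_err = np.linalg.norm(D - D_r, "fro")
print(np.isclose(best_err, np.sqrt(np.sum(delta[r:] ** 2))))   # True

# Any other rank-r matrix (here a random one) has a larger Frobenius error.
B = rng.normal(size=(8, r)) @ rng.normal(size=(r, 5))
print(np.linalg.norm(D - B, "fro") >= best_err)                # True
```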

Some interesting problems in PCA

1. For PCA, in order to find U and W, we need to find a solution to the following objective:

argmin_ {U\in R^{d\times r}:U^TU=I}\sum^n_{i=1}\Vert x_i-UU^Tx_i\Vert_2^2

How can we simplify the equation above such that we transform the problem into a trace maximisation problem? Show the intermediate steps and the new objective function.

Solution:

\sum^n_{i=1}\Vert x_i-UU^Tx_i\Vert^2_2=\Vert X-UU^TX\Vert^2_F

=tr((X-UU^TX)^T(X-UU^TX))=tr(X^TX-2X^TUU^TX+X^TUU^TUU^TX)

=tr(X^TX)-tr(X^TUU^TX)=\Vert X\Vert^2_F-tr(U^TXX^TU), where we used U^TU=I and the cyclic property of the trace.

Since \Vert X\Vert^2_F does not depend on U, minimising this objective is equivalent to the trace maximisation problem:
argmax_{U\in R^{d\times r}:U^TU=I}trace(U^TXX^TU)

=argmax_{U\in R^{d\times r}:U^TU=I}trace(U^T\sum^n_{i=1}\pmb{x}_i\pmb{x}_i^TU)

If we define \Sigma=\frac{1}{n}\sum^n_{i=1}\pmb{x}_i\pmb{x}_i^T, this is equivalent (up to a constant factor) to:
argmax_{U\in R^{d\times r}:U^TU=I}trace(U^T\Sigma U)
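A quick NumPy check of this identity (random data and a random orthonormal U of my choosing; the columns of X are the points x_i):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, r = 5, 100, 2
X = rng.normal(size=(d, n))               # columns are the data points x_i

# A random d x r matrix with orthonormal columns, so that U^T U = I.
U, _ = np.linalg.qr(rng.normal(size=(d, r)))

reconstruction_error = np.sum(np.linalg.norm(X - U @ U.T @ X, axis=0) ** 2)
trace_form = np.linalg.norm(X, "fro") ** 2 - np.trace(U.T @ X @ X.T @ U)
print(np.isclose(reconstruction_error, trace_form))   # True
```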


2. With a mean-centred data matrix X, a key component of PCA is the spectral decomposition of a matrix M defined as M=VDV^T, where:

M\in\mathbb{R}^{d\times d}

V is a matrix whose columns \pmb{v}_1, \pmb{v}_2, ..., \pmb{v}_d  are the eigenvectors of M

D is the diagonal matrix with D_{i,i}=\lambda_i and \lambda_i is the eigenvalue of \pmb{v}_i

Assume that obtaining the square matrix M is computationally expensive. Can you suggest an alternative method to obtain the eigenvectors of M?

Solution:

Consider K=X^TX such that K_{ij}=\langle x_i, x_j\rangle

Suppose \pmb{v}^* is an eigenvector of K: K\pmb{v}^*=\lambda\pmb{v}^* for some \lambda\in \mathbb{R}.

Multiplying the equation by \frac{1}{n}X and using the definition of K we obtain:

\frac{1}{n}XX^TX\pmb{v}^*=\frac{1}{n}\lambda X\pmb{v}^*

But, using the definition of \Sigma, we get that \Sigma(X\pmb{v}^*)=\frac{\lambda}{n}(X\pmb{v}^*)

Hence \frac{X\pmb{v}^*}{\Vert X\pmb{v}^*\Vert} is an eigenvector of \Sigma with eigenvalue \frac{\lambda}{n}

Therefore, we can calculate the PCA solution by computing the eigenvectors of K rather than those of \Sigma. This is cheaper when n<d, since K is n\times n while \Sigma is d\times d.
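A minimal NumPy check of this trick (random data of my choosing; the columns of X are the points): mapping an eigenvector of K=X^TX through X gives an eigenvector of \Sigma=\frac{1}{n}XX^T with eigenvalue \lambda/n.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 6, 4
X = rng.normal(size=(d, n))              # columns are the data points x_i

K = X.T @ X                              # n x n Gram matrix, K_ij = <x_i, x_j>
Sigma = X @ X.T / n                      # d x d covariance matrix

lam, V = np.linalg.eigh(K)               # eigenpairs of K (ascending eigenvalues)
v_star, lam_star = V[:, -1], lam[-1]     # leading eigenpair of K

u = X @ v_star                           # candidate eigenvector of Sigma
u /= np.linalg.norm(u)                   # normalise as X v* / ||X v*||
print(np.allclose(Sigma @ u, (lam_star / n) * u))   # True
```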


3. In PCA, W is the compression matrix and U is the recovering matrix. The corresponding objective for finding W and U is phrased as:

argmin_{W\in\mathbb{R}^{r\times d},U\in\mathbb{R}^{d\times r}}\sum^n_{i=1}\Vert x_i-UWx_i\Vert^2_2     (1)

Prove the lemma: Let (U,W) be a solution to objective (1). Then the columns of U are orthonormal (that is, U^TU is the identity matrix of \mathbb{R}^r) and W=U^T.

Solution:

Choose any U and W and consider the mapping \pmb{x}\rightarrow UW\pmb{x}

The range of this mapping, R=\left\{ UW\pmb{x}:\pmb{x}\in\mathbb{R}^d\right\}, is a linear subspace of \mathbb{R}^d of dimension at most r.

Let V\in\mathbb{R}^{d\times r} be a matrix whose columns form an orthonormal basis of this subspace (i.e. V^TV=I ).

Each vector in R can be written as V\pmb{y} , where \pmb{y}\in\mathbb{R}^r .

For every \pmb{x}\in\mathbb{R}^d and \pmb{y}\in\mathbb{R}^r we have

\Vert \pmb{x}-V\pmb{y}\Vert^2_2=\Vert x\Vert^2+\Vert\pmb{y}\Vert^2-2\pmb{y}^T(V^T\pmb{x})

This expression is minimised over \pmb{y} at \pmb{y}=V^T\pmb{x} .

Therefore, for each \pmb{x} we have that:

V\pmb{y}=VV^T\pmb{x}=argmin_{\tilde{\pmb{x}}\in R}\Vert\pmb{x}-\tilde{\pmb{x}}\Vert^2_2

This holds for all \pmb{x}_i and we can replace U,W by V, V^T without increasing the objective:

\sum^n_{i=1}\Vert\pmb{x}_i-UW\pmb{x}_i\Vert^2_2\geq\sum^n_{i=1}\Vert\pmb{x}_i-VV^T\pmb{x}_i\Vert^2_2

As this holds for any U and W, an optimal solution can always be taken to be of the form (V,V^T) with V^TV=I, which proves the lemma.
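A small NumPy sanity check of this argument (random U, W and data of my choosing): projecting onto an orthonormal basis V of the range of UW never increases the reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 6, 2, 50
U = rng.normal(size=(d, r))              # an arbitrary recovery matrix
W = rng.normal(size=(r, d))              # an arbitrary compression matrix
X = rng.normal(size=(d, n))              # columns are the data points x_i

M = U @ W                                # the mapping x -> UWx has rank <= r
Q, _, _ = np.linalg.svd(M)
V = Q[:, :r]                             # orthonormal columns spanning range(M)

err_UW = np.sum(np.linalg.norm(X - M @ X, axis=0) ** 2)
err_V = np.sum(np.linalg.norm(X - V @ V.T @ X, axis=0) ** 2)
print(err_V <= err_UW)                   # True: (V, V^T) is never worse
```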


4. Let \Sigma=VDV^T be the spectral decomposition of \Sigma. D is a diagonal matrix such that D_{i,i} is the i-th largest eigenvalue of \Sigma. The columns of V are the corresponding eigenvectors, and V^TV=VV^T=I. Then the solution to argmax_{U\in\mathbb{R}^{d\times r}:U^TU=I}trace(U^T\Sigma U) is the matrix U whose columns are the first r eigenvectors of \Sigma.

 Solution:

Choose a matrix U\in\mathbb{R}^{d,r} with orthonormal columns and let B=V^TU. Then, VB=VV^TU=U. It follows that

U^T\Sigma U=B^TV^TVDV^TVB=B^TDB

and therefore

trace(U^T\Sigma U)=\sum^d_{j=1}D_{j,j}\sum^r_{i=1}B_{j,i}^2.

Note that B^TB=U^TVV^TU=U^TU=I. Hence the columns of B are orthonormal and \sum^d_{j=1}\sum^r_{i=1}B_{j,i}^2=r.

Define \tilde{B}\in\mathbb{R}^{d\times d} to be a matrix whose first r columns are the columns of B and which satisfies \tilde{B}^T\tilde{B}=I. Then \tilde{B} is orthogonal, so its rows are also orthonormal; hence for every j, \sum^d_{i=1}\tilde{B}^2_{j,i}=1, which implies that \sum^r_{i=1}B_{j,i}^2\leq 1.

It follows that

trace(U^T\Sigma U)\leq max_{\beta\in [0,1]^d:\Vert\beta\Vert_1\leq r}\sum^d_{j=1}D_{j,j}\beta_j=\sum^r_{j=1}D_{j,j}

Therefore for every matrix U\in\mathbb{R}^{d\times r} with orthonormal columns, the following inequality holds: trace(U^T\Sigma U)\leq\sum^r_{j=1}D_{j,j}

But if we set U to the matrix with the r leading eigenvectors of \Sigma as its columns, we obtain trace(U^T\Sigma U)=\sum^r_{j=1}D_{j,j}, so this choice attains the optimum.
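A quick NumPy illustration (a random covariance matrix of my choosing): the r leading eigenvectors attain \sum^r_{j=1}D_{j,j}, and randomly drawn orthonormal U do not exceed that value.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 6, 2
A = rng.normal(size=(d, d))
Sigma = A @ A.T / d                       # a random symmetric PSD matrix

eigvals, V = np.linalg.eigh(Sigma)        # ascending eigenvalues
U_opt = V[:, -r:]                         # the r leading eigenvectors
best = np.trace(U_opt.T @ Sigma @ U_opt)
print(np.isclose(best, np.sum(eigvals[-r:])))   # True

for _ in range(1000):                     # no random orthonormal U does better
    U, _ = np.linalg.qr(rng.normal(size=(d, r)))
    assert np.trace(U.T @ Sigma @ U) <= best + 1e-9
print("no sampled U exceeds the leading-eigenvector value")
```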


5. The variance captured by the first r eigenvectors of \Sigma is the sum of its r largest eigenvalues, \sum^r_{i=1}\lambda_i .

Solution:

The variance in a dataset X is defined as

var(X)=\frac{1}{n}\sum^n_{i=1}\Vert x_i-\pmb{0}\Vert^2 (the data is centred, so its mean is \pmb{0})

=\frac{1}{n}\sum^n_{i=1}\Vert x_i\Vert^2=\frac{1}{n}\sum^n_{i=1}\langle x_i,x_i\rangle =tr(\Sigma)=tr(VDV^T)=tr(DV^TV)=tr(D)

=\sum^d_{i=1}D_{i,i}=\sum^d_{i=1}\lambda_i

The variance in a projected dataset WX, with W=[\pmb{v}_1,...,\pmb{v}_r]^T , is defined as

var(WX)=\frac{1}{n}\sum^n_{j=1}\Vert W\pmb{x}_j\Vert^2 =\frac{1}{n}\sum^n_{j=1}\langle  W\pmb{x}_j,W\pmb{x}_j\rangle =\frac{1}{n}\sum^n_{j=1}\pmb{x}_j^T W^TW\pmb{x}_j

=\frac{1}{n}\sum^n_{j=1}\pmb{x}_j^T(\pmb{v}_1\pmb{v}_1^T+...+\pmb{v}_r\pmb{v}_r^T)\pmb{x}_j=\sum^r_{i=1}\pmb{v}_i^T\left(\frac{1}{n}\sum^n_{j=1}\pmb{x}_j\pmb{x}_j^T\right)\pmb{v}_i=

(\pmb{v}_1^T\Sigma\pmb{v}_1+...+\pmb{v}_r^T\Sigma\pmb{v}_r)=\sum^r_{i=1}\pmb{v}_i^T\Sigma\pmb{v}_i=\sum^r_{i=1}\pmb{v}_i^T\lambda_i\pmb{v}_i=\sum^r_{i=1}\lambda_i

(From the spectral decomposition in problem 4, the eigenvectors are orthonormal, so \pmb{v}_i^T\pmb{v}_i=1, which gives the last equality.)
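A short NumPy check of this statement (random centred data of my choosing; the columns of X are the points):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, r = 5, 200, 2
X = rng.normal(size=(d, n))
X -= X.mean(axis=1, keepdims=True)        # centre the data (columns are points)

Sigma = X @ X.T / n
eigvals, V = np.linalg.eigh(Sigma)        # ascending eigenvalues
W = V[:, -r:].T                           # W = [v_1, ..., v_r]^T (top-r eigenvectors)

var_projected = np.sum(np.linalg.norm(W @ X, axis=0) ** 2) / n
print(np.isclose(var_projected, np.sum(eigvals[-r:])))   # True
```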

DFS code and Minimum DFS code

Edge Code

We represent each edge e=\{start, end\} by its starting node start and ending node end. The edge code of e is defined as:

\left\{ Index_{start}, Index_{end},L_V(start), L_E(e), L_V(end)\right\}

where the index of a vertex is the number assigned to it in the order in which the DFS visits the vertices. In the example graph (figure omitted), the red numbers 0–4 are the vertex indices, the vertex labels are X, Y, X, Z, Z, and the edge labels are a, b, c, d.

There are many edge codes for this graph:

Forward edge: {0,1,X,a,Y}, {1,2,Y,b,X}, {2,3,X,c,Z}, {1,4,Y,d,Z}

Backward edge: {3,1,Z,b,Y}, {2,0,X,a,X}

Notice that Index_{start}<Index_{end} for forward edges and Index_{start}>Index_{end} for backward edges.

The order of edges and DFS code

Suppose we have two edges

e_1=\left\{ Index_{start1},Index_{end1}\right\}

e_2=\left\{ Index_{start2},Index_{end2}\right\}

To determine the ordering between 2 edges we must consider: i) the types of the edges, ii) the visiting order (Index) of the nodes in each edge.

If e_1 and e_2 are both forward edges, then e_1 < e_2 if one of the following holds:

  1. Index_{end1}<Index_{end2}
  2. Index_{end1}=Index_{end2} and Index_{start1}<Index_{start2}

If e_1 and e_2 are both backward edges, then e_1 < e_2 if one of the following holds:

  1. Index_{start1}<Index_{start2}
  2. Index_{start1}=Index_{start2} and Index_{end1}<Index_{end2}

If e_1 and e_2 are one forward edge and one backward edge, then e_1 < e_2 if one of the following two holds:

  1. e_1 is forward edge and e_2 is backward edge and Index_{end1}\leq Index_{start2}
  2. e_1 is backward edge and e_2 is forward edge and Index_{start1}<Index_{end2}

The DFS code is just the concatenation of all the edge codes in order. For the DFS tree above, the DFS code is:

{0,1,X,a,Y,1,2,Y,b,X,2,0,X,a,X,2,3,X,c,Z,3,1,Z,b,Y,1,4,Y,d,Z}.

DFS Lexicographic Order and Minimum DFS code

The DFS lexicographic order of the DFS code is defined as follows:

If we have two DFS codes \alpha and \beta

\alpha=\{\alpha_0,...,\alpha_m\}, \beta=\{\beta_0,...,\beta_n\}

then \alpha\leq\beta if and only if either of the following is true:

1. \exists t, 0\leq t\leq \min(m,n), such that \alpha_k=\beta_k for all k<t and \alpha_t<\beta_t

2. m\leq n and \alpha_k=\beta_k for all k\leq m

So we can order a list of DFS codes according to the DFS lexicographic order, and we define the minimum DFS code of a graph as the smallest one among all its possible DFS codes.
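Here is a small Python sketch (my own; the helper names are hypothetical) of the pairwise edge-ordering rules above. An edge is represented by its pair of indices (start, end) and is a forward edge exactly when start < end; the example verifies that the edge sequence of the DFS code above is indeed increasing.

```python
def is_forward(edge):
    """An edge (start, end) is a forward edge iff start < end."""
    start, end = edge
    return start < end

def edge_less_than(e1, e2):
    """Return True if e1 < e2 under the DFS edge-ordering rules."""
    s1, t1 = e1
    s2, t2 = e2
    if is_forward(e1) and is_forward(e2):          # both forward
        return t1 < t2 or (t1 == t2 and s1 < s2)
    if not is_forward(e1) and not is_forward(e2):  # both backward
        return s1 < s2 or (s1 == s2 and t1 < t2)
    if is_forward(e1):                             # e1 forward, e2 backward
        return t1 <= s2
    return s1 < t2                                 # e1 backward, e2 forward

# The edge sequence of the DFS code above, as (start, end) index pairs.
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 1), (1, 4)]
print(all(edge_less_than(a, b) for a, b in zip(edges, edges[1:])))   # True
```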

Kernel PCA


Assume that we are dealing with centred data in feature space: \sum^n_{i=1}\phi(x_i)=0

The covariance matrix in feature space then takes the form C=\frac{1}{n}\sum^n_{i=1}\phi(x_i)\phi(x_i)^T

Then we have to find eigenvalues \lambda\geq0 and nonzero eigenvectors v in Hilbert space satisfying: \lambda v=Cv

All solutions v with \lambda\neq0 lie in the span of \phi(x_1),...,\phi(x_n), due to the fact that

\lambda v=Cv=\frac{1}{n}\sum^n_{i=1}\phi(x_i)\phi(x_i)^Tv=\frac{1}{n}\sum^n_{i=1}(\phi(x_i)^Tv)\phi(x_i)   (\phi(x_i)^Tv is an inner product of two vectors, hence a scalar, so we can move it to the front.)

We can derive two useful consequences from this equation. The first is:

If we take the inner product of both sides with a point \phi(x_j), we get \lambda\phi(x_j)^Tv=\phi(x_j)^TCv, i.e.

\lambda\langle\phi(x_j),v\rangle=\langle\phi(x_j),Cv\rangle

The second consequence is that the eigenvector can be written as a linear combination of points in Hilbert space:

v=\sum^n_{i=1}\alpha_i\phi(x_i)  (with \alpha_i=\frac{\phi(x_i)^Tv}{n\lambda})

Substituting consequence 2 into consequence 1, we get:

\lambda\langle\phi(x_j),\sum^n_{i=1}\alpha_i\phi(x_i)\rangle=\langle\phi(x_j),\frac{1}{n}\sum^n_{k=1}\phi(x_k)\phi(x_k)^T\sum^n_{i=1}\alpha_i\phi(x_i)\rangle    (43)

Two properties of the inner product that we will use are:

\langle x, \lambda x'\rangle=\lambda\langle x,x'\rangle (\lambda is a constant)

\langle x,\sum_i\alpha_ix'_i\rangle =\sum_i\alpha_i\langle x,x'_i\rangle (\alpha_i is a constant)

Using these two properties, we can rewrite equation (43), for all j=1,…,n, as:

\lambda\sum^n_{i=1}\alpha_i\langle\phi(x_j),\phi(x_i)\rangle=\frac{1}{n}\sum^n_{i=1}\alpha_i\langle\phi(x_j),\sum^n_{k=1}\phi(x_k)\langle\phi(x_k),\phi(x_i)\rangle\rangle ;   (44)

\lambda\sum^n_{i=1}\alpha_i\langle\phi(x_j),\phi(x_i)\rangle=\frac{1}{n}\sum^n_{i=1}\alpha_i\sum^n_{k=1}\langle\phi(x_k),\phi(x_i)\rangle\langle\phi(x_j),\phi(x_k)\rangle;   (45)

n\lambda\sum^n_{i=1}\alpha_i\underbrace{\langle\phi(x_j),\phi(x_i)\rangle}_{K_{ji}}=\sum^n_{i=1}\alpha_i\underbrace{\sum^n_{k=1}\langle\phi(x_j),\phi(x_k)\rangle\langle\phi(x_k),\phi(x_i)\rangle}_{[K^2]_{ji}}.  (46)

The two underbraced terms involve the kernel matrix: \langle\phi(x_j),\phi(x_k)\rangle=K(x_j,x_k)=K_{jk}, and

\sum^n_{k=1}K_{jk}K_{ki}=[K\cdot K]_{ji}=[K^2]_{ji}  (this is just the definition of matrix multiplication: the j-th row of K multiplied by the i-th column of K gives the (j,i) entry of the product matrix).

 

Kernel PCA as an eigenvector problem

Equation (46) can be rewritten as:

n\lambda\textbf{K}\pmb{\alpha}=\textbf{K}^2\pmb{\alpha}   (47)

where \pmb{\alpha} denotes the column vector with entries \alpha_1,...,\alpha_n

To find the solution of Equation (47), we solve the problem

n\lambda\pmb{\alpha}=\textbf{K}\pmb{\alpha}

which we obtain by multiplying (47) by \textbf{K}^{-1} from the left.

 

Normalizing the coefficients

We require the eigenvectors \pmb{v}^k to have unit length, that is, \langle \pmb{v}^k,\pmb{v}^k\rangle=1 for all k=1,…,r.

That means that

1=\langle\pmb{v}^k,\pmb{v}^k\rangle

=\sum^n_{i,j=1}\alpha_i^k\alpha^k_j\langle\phi(x_i),\phi(x_j)\rangle

=\sum^n_{i,j=1}\alpha^k_i\alpha^k_jK_{i,j}

=\langle\pmb{\alpha}^k,\pmb{K}\pmb{\alpha}^k\rangle=\lambda_k \langle\pmb{\alpha}^k,\pmb{\alpha}^k\rangle

As eigenvectors of \pmb{K}, the \pmb{\alpha}^k initially have unit norm.

Therefore we have to rescale them by \sqrt{\frac{1}{\lambda_k}} to enforce that \langle\pmb{\alpha}^k,\pmb{\alpha}^k\rangle=\frac{1}{\lambda_k}.
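A compact NumPy sketch of these steps (my own illustration with an RBF kernel; like the notes, it assumes the data is already centred in feature space, so no kernel centring is applied):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                     # rows are the points x_i
n, r = X.shape[0], 2

# Kernel matrix K_ij = k(x_i, x_j), here an RBF kernel.
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-0.5 * sq_dists)

# Solve the eigenproblem of K (equivalent to n*lambda*alpha = K*alpha).
eigvals, eigvecs = np.linalg.eigh(K)             # ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Rescale the unit-norm eigenvectors so that <alpha^k, alpha^k> = 1/lambda_k.
alphas = eigvecs[:, :r] / np.sqrt(eigvals[:r])

# Projection of x_j onto the k-th component: <v^k, phi(x_j)> = sum_i alpha_i^k K(x_i, x_j).
projections = K @ alphas                         # n x r matrix of kernel PCA scores
print(projections.shape)                         # (20, 2)
```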

Multidimensional Scaling

Classical scaling

The task of MDS is to find the original Euclidean coordinates from a given distance matrix D.

The squared Euclidean distance between x_i and x_j is given by

d^2_{ij}=\sum^d_{k=1}(x_{ik}-x_{jk})^2=(x_i-x_j)^T(x_i-x_j)

B is the matrix of inner products of the coordinates. The general term b_{ij} of B is given by

b_{ij}=\sum^d_{k=1}x_{ik}x_{jk}=x_i^Tx_j

We can derive B from D:

d_{ij}^2=x_i^Tx_i+x_j^Tx_j-2x_i^Tx_j=b_{ii}+b_{jj}-2b_{ij}

For convenience, we centre the coordinate matrix X, which implies that \sum^n_{i=1}b_{ij}=0 (each column of B sums to zero); since the inner-product matrix is symmetric, we also have \sum^n_{j=1}b_{ij}=0

Summing over i, we obtain

\frac{1}{n}\sum^n_{i=1}d_{ij}^2=\frac{1}{n}\sum^n_{i=1}b_{ii}+b_{jj}-\frac{2}{n}\sum^n_{i=1}b_{ij}=\frac{1}{n}\sum^n_{i=1}b_{ii}+b_{jj}   (59), because \sum^n_{i=1}b_{ij}=0

b_{jj}=\frac{1}{n}\sum^n_{i=1}d^2_{ij}-\frac{1}{n}\sum^n_{i=1}b_{ii}

Summing over j, we obtain

\frac{1}{n}\sum^n_{j=1}d^2_{ij}=b_{ii}+\frac{1}{n}\sum^n_{j=1}b_{jj} (60)

b_{ii}=\frac{1}{n}\sum^n_{j=1}d^2_{ij}-\frac{1}{n}\sum^n_{j=1}b_{jj}

Averaging equation (60) over i, we obtain

\frac{1}{n}\sum^n_{i=1}\left(\frac{1}{n}\sum^n_{j=1}d^2_{ij}\right)=\frac{1}{n}\sum^n_{i=1}b_{ii}+\frac{1}{n}\sum^n_{i=1}\frac{1}{n}\sum^n_{j=1}b_{jj}

\frac{1}{n^2}\sum^n_{i=1}\sum^n_{j=1}d^2_{ij}=\frac{1}{n}\sum^n_{i=1}b_{ii}+\frac{1}{n}\sum^n_{j=1}b_{jj}=\frac{2}{n}\sum^n_{i=1}b_{ii}

We already know that b_{ij}=-\frac{1}{2}(d^2_{ij}-b_{ii}-b_{jj});

so b_{ij}=-\frac{1}{2}(d^2_{ij}-d^2_{i\bullet}-d^2_{\bullet j}+d^2_{\bullet\bullet}), where d^2_{i\bullet}=\frac{1}{n}\sum^n_{j=1}d^2_{ij}, d^2_{\bullet j}=\frac{1}{n}\sum^n_{i=1}d^2_{ij} and d^2_{\bullet\bullet}=\frac{1}{n^2}\sum^n_{i=1}\sum^n_{j=1}d^2_{ij}.

Define the matrix \pmb{A} as (a_{ij}) with a_{ij}=-\frac{1}{2}d^2_{ij}, and observe that

\pmb{B}=\pmb{HAH},

where \pmb{H}=\pmb{I}-\frac{1}{n}\pmb{11}^T is the centring matrix.

The inner product matrix \pmb{B} can be expressed as

\pmb{B}=\pmb{XX}^T

where X=(\pmb{x}_1,...,\pmb{x}_n)^T is the n\times p matrix of coordinates.

The rank of \pmb{B} is then

rank(\pmb{B})=rank(\pmb{XX}^T)=rank(\pmb{X})=p

\pmb{B} is symmetric, positive semi-definite, of rank p, and has p non-zero (positive) eigenvalues.

B can now be written as \pmb{B}=\pmb{V}_p\pmb{\Lambda}_p\pmb{V}_p^T

where \pmb{\Lambda}_p=diag(\lambda_1,...,\lambda_p) is the diagonal matrix of the non-zero eigenvalues of B, and \pmb{V}_p=(\pmb{v}_1,...,\pmb{v}_p) is the matrix of corresponding eigenvectors.

Hence the coordinate matrix X containing the point configuration in \mathbb{R}^p is given by \pmb{X}=\pmb{V}_p\pmb{\Lambda}_p^{\frac{1}{2}}

Pseudocode

The algorithm for recovering coordinates from distances D_{ij}=d_{ij} between pairs of points is as follows:

  1. Form the matrix \pmb{A} with A_{ij}=-\frac{1}{2}d_{ij}^2.
  2. Form the matrix \pmb{B}=\pmb{HAH}, where \pmb{H}=\pmb{I}-\frac{1}{n}\pmb{11}^T is the centring matrix.
  3. Find the spectral decomposition of \pmb{B}, \pmb{B}=\pmb{V\Lambda V}^T, where \Lambda is the diagonal matrix of eigenvalues and \pmb{V} is the matrix of corresponding eigenvectors.
  4. If the points were originally in a p-dimensional space, the first p eigenvalues of \pmb{B} are nonzero and the remaining n-p are zero. Discard these from \Lambda (rename it \Lambda_p ) and discard the corresponding eigenvectors from \pmb{V} (rename it \pmb{V}_p ).
  5. Find \pmb{X}=\pmb{V}_p\pmb{\Lambda}_p^{\frac{1}{2}}; the coordinates of the points are given by the rows of \pmb{X}.
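A minimal NumPy implementation of these five steps (my own sketch; the configuration is recovered only up to rotation and reflection, so the check compares pairwise distances rather than raw coordinates):

```python
import numpy as np
from scipy.spatial.distance import cdist

def classical_mds(D: np.ndarray, p: int) -> np.ndarray:
    """Recover an n x p coordinate matrix from an n x n distance matrix D."""
    n = D.shape[0]
    A = -0.5 * D ** 2                          # step 1: A_ij = -d_ij^2 / 2
    H = np.eye(n) - np.ones((n, n)) / n        # centring matrix H = I - (1/n) 1 1^T
    B = H @ A @ H                              # step 2
    eigvals, V = np.linalg.eigh(B)             # step 3 (ascending eigenvalues)
    eigvals, V = eigvals[-p:], V[:, -p:]       # step 4: keep the p largest
    return V * np.sqrt(eigvals)                # step 5: X = V_p Lambda_p^{1/2}

# Usage: pairwise distances of a known configuration are reproduced.
rng = np.random.default_rng(0)
X_true = rng.normal(size=(10, 3))
D = cdist(X_true, X_true)
X_rec = classical_mds(D, p=3)
print(np.allclose(cdist(X_rec, X_rec), D))     # True
```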