Principal Components Analysis

PCA Intuition

PCA looks for a single lower-dimensional linear subspace that captures most of the variation in the data. In other words, PCA aims to minimize the error introduced by projecting the data onto this linear subspace.

PCA Flow

The overall flow of PCA involves rotating the data to obtain uncorrelated features and then performing dimensionality reduction through projection (see the numpy sketch after the steps below).

  • Eigenvectors give the directions of the uncorrelated features

  • Eigenvalues are the variances of the new features

  • The dot product gives the projection of the data onto the uncorrelated features

  1. Deriving uncorrelated features through eigenvectors

    1. Mean Normalize data

    2. Get covariance matrix

    3. Perform SVD to get the eigenvector matrix (first matrix) and the eigenvalues (diagonal values of the second matrix)

    4. The eigenvectors should be ordered by descending eigenvalue

  2. Dot product to project the data matrix $X$ onto the first $k$ columns of the eigenvector matrix $U$ via $X' = XU[:k]$

  3. Calculate the percentage of variance explained via $\frac{\sum_{i=0}^{k} S_{ii}}{\sum_{j=0}^{d} S_{jj}}$
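A minimal numpy sketch of this flow, assuming the data matrix `X` stores one sample per row (so the covariance is computed as $X^TX/n$); the function name `pca_project` and the synthetic data are illustrative only:

```python
import numpy as np

def pca_project(X, k):
    """Project X (n_samples x n_features) onto its first k principal components."""
    # 1. Mean-normalize the data
    Xc = X - X.mean(axis=0)
    # 2. Covariance matrix of the features
    cov = Xc.T @ Xc / Xc.shape[0]
    # 3. SVD of the covariance matrix: columns of U are the eigenvectors,
    #    entries of S are the eigenvalues, already sorted in descending order
    U, S, _ = np.linalg.svd(cov)
    # 4. Project onto the first k eigenvectors
    X_proj = Xc @ U[:, :k]
    # Percentage of variance explained by the first k components
    explained = S[:k].sum() / S.sum()
    return X_proj, explained

# Example on synthetic correlated 2D data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 1.0], [0.0, 0.5]])
X1, ratio = pca_project(X, k=1)
print(X1.shape, round(ratio, 3))
```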

A simple 2D PCA walkthrough motivating error minimization and PCA as a rotation problem

PCA as a Linear Combination

SVD

From the SVD decomposition we know that any matrix $A_{m \times n}$ can be factorized as $A = U_{m\times m} S_{m\times n} V_{n\times n}^T$, where $U$ and $V$ are orthogonal matrices whose columns are orthonormal eigenvectors of $AA^T$ and $A^TA$ respectively, and $S$ is a diagonal matrix whose $r$ nonzero entries are the square roots of the positive eigenvalues of $AA^T$ and $A^TA$ (both have the same positive eigenvalues).
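These properties can be checked numerically; the small matrix below is arbitrary and only serves as an illustration:

```python
import numpy as np

A = np.arange(6.0).reshape(3, 2)              # any 3x2 matrix
U, s, Vt = np.linalg.svd(A)                   # full SVD: U is 3x3, Vt is 2x2
S = np.zeros((3, 2))
np.fill_diagonal(S, s)                        # place singular values on the diagonal of S
print(np.allclose(A, U @ S @ Vt))             # A = U S V^T
print(np.allclose(U.T @ U, np.eye(3)))        # U is orthogonal
print(np.allclose(Vt @ Vt.T, np.eye(2)))      # V is orthogonal
# singular values are square roots of the positive eigenvalues of A^T A
print(np.allclose(s ** 2, np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]))
```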

SVD as a linear combination of outer products

We can rewrite the SVD equation as $AV = US$. Applying the column view of matrix multiplication and the fact that $S$ is a diagonal matrix, we have:

$$
\begin{align*}
& Av_i = \sigma_i u_i \\
& \Longrightarrow A v_i v_i^T = \sigma_i u_i v_i^T
\end{align*}
$$

Summing over $i$ and using $\sum_i v_i v_i^T = I$ (since $V$ is orthogonal), this means we can think of the matrix $A$ as a sum of outer products of the $u_i$ and $v_i$:

$$A = \sigma_1 u_1 v_1^T + \dots + \sigma_r u_r v_r^T$$
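A small numerical check of this rank-one expansion (the matrix is arbitrary, purely illustrative):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # reduced SVD: r = 2 terms here
# Rebuild A as sigma_1 u_1 v_1^T + ... + sigma_r u_r v_r^T
A_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(len(s)))
print(np.allclose(A, A_rebuilt))                   # True
# Keeping only the leading term gives the best rank-1 approximation of A
A_rank1 = s[0] * np.outer(U[:, 0], Vt[0])
```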

Covariance Matrix and SVD

Now, in terms of the covariance matrix: if $X$ is already centered (with each column a data point), it can be written as

$$\Sigma = E[(X-\bar{X})(X-\bar{X})^T] = \frac{XX^T}{n}$$

We can subsequently perform SVD on this sample covariance matrix: $\Sigma = \sigma_1 u_1 v_1^T + \dots$. If the singular values $\sigma_i$ are ordered from largest to smallest, the impact of each term in this linear combination is also ordered that way.

Furthermore

  • The total variance of the data equals the trace of the matrix $\Sigma$, which equals the sum of $\Sigma$'s singular values, so we can compute the fraction of variance lost if we drop the smaller singular values (see the numerical check after this list)

  • The first eigenvector $u_1$ of $\Sigma$ points in the most important direction of the data

  • The error, calculated as the sum of squared perpendicular distances from each point to $u_1$, is minimized when SVD is used
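A minimal sketch checking the trace and leading-direction properties on synthetic data (features in rows, matching $\Sigma = XX^T/n$; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# 2 features (rows) x 500 samples (columns), then centered
X = np.array([[2.0, 0.0], [1.0, 0.5]]) @ rng.normal(size=(2, 500))
X = X - X.mean(axis=1, keepdims=True)

Sigma = X @ X.T / X.shape[1]        # sample covariance, Sigma = X X^T / n
U, s, Vt = np.linalg.svd(Sigma)     # singular values sorted descending

# Total variance = trace(Sigma) = sum of Sigma's singular values
print(np.allclose(np.trace(Sigma), s.sum()))
# Fraction of variance kept by the leading direction u_1
print(s[0] / s.sum())
# u_1, the most important direction of the data
print(U[:, 0])
```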

If $Z = AX$, then the covariance of $Z$ can be calculated as $V_z = A V_x A^T$, where $V_x$ is the covariance of $X$.
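A quick numerical sanity check of this identity on randomly generated data (names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 1000))        # 3 features (rows) x 1000 samples (columns)
A = rng.normal(size=(2, 3))           # linear map applied to every sample
Z = A @ X

Vx = np.cov(X, bias=True)             # covariance of X
Vz = np.cov(Z, bias=True)             # covariance of Z
print(np.allclose(Vz, A @ Vx @ A.T))  # V_z = A V_x A^T
```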

Given the SVD decomposition, applying $A$ to a vector $x$ can be understood as first performing a rotation $V^T$, then a scaling $S$, and finally another rotation $U$.
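The same interpretation traced step by step on a small, arbitrary example:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
U, s, Vt = np.linalg.svd(A)

x = np.array([1.0, -1.0])
rotated = Vt @ x                   # rotation by V^T
scaled = s * rotated               # scaling by the diagonal of S
result = U @ scaled                # rotation by U
print(np.allclose(result, A @ x))  # matches applying A directly
```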
