Principal Components Analysis
PCA Intuition
PCA looks for a single lower-dimensional linear subspace that captures most of the variation in the data. In other words, PCA aims to minimize the error introduced by projecting the data onto this linear subspace.
PCA Flow
The overall flow of PCA involves rotating the data to obtain uncorrelated features and then performing dimensionality reduction through projection.
Eigenvectors give the directions of the uncorrelated features
Eigenvalues are the variances of the new features
The dot product projects the data onto the uncorrelated features (a quick NumPy check of these facts follows below)
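A minimal NumPy sketch of these three facts, using 2-D toy data (the variable names are illustrative, not from the notes): projecting centered data onto the eigenvectors of its covariance matrix yields uncorrelated features whose variances equal the eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 0], cov=[[3, 2], [2, 2]], size=1000)

X_centered = X - X.mean(axis=0)
C = np.cov(X_centered, rowvar=False)      # covariance of the original features

eigvals, eigvecs = np.linalg.eigh(C)      # columns of eigvecs are the eigenvectors
Z = X_centered @ eigvecs                  # dot product: rotate into the new features

print(np.cov(Z, rowvar=False))            # ~diagonal, so the new features are uncorrelated
print(eigvals)                            # ~the variances of the new features
```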
Deriving uncorrelated features through eigenvectors
Mean-normalize the data
Compute the covariance matrix
Perform SVD on the covariance matrix to get the eigenvector matrix (the first matrix, $U$) and the eigenvalues (the diagonal of the second matrix, $S$)
The eigenvectors should be organized by descending eigenvalue
Use the dot product to project the data matrix onto the first $k$ columns of the eigenvector matrix: $X' = X\,U_{[:, :k]}$
Calculate the percentage of variance explained via $\sum_{i=1}^{k} S_{ii} \,/\, \sum_{j=1}^{d} S_{jj}$ (a NumPy sketch of these steps follows below)
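A minimal sketch of the steps above in NumPy, assuming the data matrix `X` has one sample per row (the function name and `n_components` are illustrative):

```python
import numpy as np

def pca_project(X, n_components=1):
    """PCA via SVD of the sample covariance matrix, following the steps above."""
    # 1. Mean-normalize the data
    X_centered = X - X.mean(axis=0)

    # 2. Covariance matrix (rows of X are samples)
    C = np.cov(X_centered, rowvar=False)

    # 3. SVD of the covariance matrix: columns of U are the eigenvectors,
    #    S holds the eigenvalues, already sorted in descending order
    U, S, Vt = np.linalg.svd(C)

    # 4. Project onto the first n_components eigenvectors
    X_projected = X_centered @ U[:, :n_components]

    # 5. Percentage of variance explained by the kept components
    explained = S[:n_components].sum() / S.sum()
    return X_projected, explained

# Usage on toy correlated data
rng = np.random.default_rng(1)
X = rng.multivariate_normal([0, 0], [[3, 2], [2, 2]], size=500)
X_proj, ratio = pca_project(X, n_components=1)
print(X_proj.shape, ratio)
```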
A simple 2D PCA walkthrough motivating variance minimization and framing PCA as a rotation problem
PCA as a Linear Combination
SVD
From the SVD decomposition we know that any matrix $A$ can be factorized as $A = U \Sigma V^{\top}$, where $U$ and $V$ are orthogonal matrices with orthonormal eigenvectors chosen from $A A^{\top}$ and $A^{\top} A$ respectively, and $\Sigma$ is a diagonal matrix with elements equal to the square roots of the positive eigenvalues of $A A^{\top}$ and $A^{\top} A$ (they have the same positive eigenvalues).
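A small NumPy check of these facts, using an arbitrary example matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))

U, S, Vt = np.linalg.svd(A, full_matrices=False)

# U and V have orthonormal columns
print(np.allclose(U.T @ U, np.eye(3)), np.allclose(Vt @ Vt.T, np.eye(3)))

# Singular values are the square roots of the positive eigenvalues of A^T A
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]          # descending order
print(np.allclose(S, np.sqrt(np.clip(eigvals, 0, None))))
```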
SVD as a linear combination of outer products
We can rewrite the SVD equation as $A = U \Sigma V^{\top}$; applying the column view of matrix multiplication and the fact that $\Sigma$ is a diagonal matrix, we have:
$$A = \sum_{i=1}^{r} \sigma_i \, u_i v_i^{\top}$$
This means that we can think of the matrix $A$ as a weighted sum of outer products of $u_i$ and $v_i$, with weights given by the singular values $\sigma_i$.
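A quick NumPy check of this rank-one decomposition, again on an arbitrary example matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Rebuild A as a weighted sum of outer products sigma_i * u_i v_i^T
A_rebuilt = sum(S[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(S)))
print(np.allclose(A, A_rebuilt))   # True
```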
Covariance Matrix and SVD
Now, in terms of the covariance matrix: if $X$ is already centered (one sample per row), the sample covariance can be written as $C = \frac{1}{n-1} X^{\top} X$.
We can subsequently do SVD on this sample covariance matrix $C$. If the singular values are ordered from largest to smallest, the impact of the corresponding outer-product terms in the linear combination is also ordered that way.
Furthermore
The total variance of the data equals the trace of the covariance matrix $C$, which equals the sum of squares of $X$'s singular values (up to the $\frac{1}{n-1}$ factor in the covariance). So we can compute the ratio of variance lost if we drop the smaller singular values (see the numerical check at the end of this section).
The first eigenvector of $C$ points in the most important direction of the data (the direction of greatest variance).
The error, calculated as the sum of squared perpendicular distances from each point to the projection subspace, is minimized when the subspace is spanned by the top eigenvectors obtained from the SVD.
If $Y = XW$, then the covariance of $Y$ can be calculated as $C_Y = W^{\top} C_X W$, where $C_X$ is the covariance of $X$.
Given the SVD decomposition $A = U \Sigma V^{\top}$, applying $A$ to a vector can be understood as first performing a rotation $V^{\top}$, then a scaling $\Sigma$, and finally another rotation $U$.
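A minimal NumPy sketch checking the facts above numerically (the matrices `X`, `W`, and `A` are arbitrary examples, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Total variance = trace(C) = sum of X's squared singular values / (n - 1)
X = rng.normal(size=(200, 4))
X = X - X.mean(axis=0)
n = X.shape[0]
C = X.T @ X / (n - 1)
s = np.linalg.svd(X, compute_uv=False)
print(np.allclose(np.trace(C), (s ** 2).sum() / (n - 1)))

# Ratio of variance lost if we keep only the first k components
k = 2
eigvals = s ** 2 / (n - 1)
print(eigvals[k:].sum() / eigvals.sum())

# Covariance of a linear transformation: C_Y = W^T C_X W
W = rng.normal(size=(4, 2))
Y = X @ W
print(np.allclose(np.cov(Y, rowvar=False), W.T @ np.cov(X, rowvar=False) @ W))

# Applying A = U Sigma V^T to a vector: rotate, scale, rotate
A = rng.normal(size=(3, 3))
x = rng.normal(size=3)
U, S, Vt = np.linalg.svd(A)
print(np.allclose(A @ x, U @ (S * (Vt @ x))))
```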