Principal Components Analysis
PCA Intuition
PCA looks for a single lower-dimensional linear subspace that captures most of the variation in the data. In other words, PCA aims to minimize the error introduced by projecting the data onto this linear subspace.
PCA Flow
The overall flow of PCA involves rotating the data to obtain uncorrelated features and then performing dimensionality reduction through projection.
Eigenvectors give the directions of the uncorrelated features
Eigenvalues give the variances of the new features
The dot product gives the projection of the data onto the uncorrelated features
Deriving uncorrelated features through eigenvectors
Mean-normalize the data
Compute the covariance matrix
Perform SVD to get the eigenvector matrix (the first matrix, $U$) and the eigenvalues (the diagonal values of the second matrix, $S$)
Organize the eigenvectors in order of descending eigenvalue
Use the dot product to project the data matrix $X$ onto the first $k$ columns of the eigenvector matrix $U$: $X' = X\,U_{[:,\,:k]}$
Calculate the percentage of variance explained via $\frac{\sum_{i=0}^{k} S_{ii}}{\sum_{j=0}^{d} S_{jj}}$
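These steps can be sketched in a few lines of numpy; this is a minimal illustration on randomly generated data, with all variable names assumed, not a reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # 200 samples, 5 features (assumed toy data)

# 1. Mean-normalize the data.
X_centered = X - X.mean(axis=0)

# 2. Compute the covariance matrix.
cov = X_centered.T @ X_centered / (X.shape[0] - 1)

# 3. SVD of the covariance matrix: columns of U are eigenvectors, S holds the eigenvalues.
U, S, Vt = np.linalg.svd(cov)            # numpy already returns S in descending order

# 4. Project onto the first k eigenvectors.
k = 2
X_projected = X_centered @ U[:, :k]      # X' = X U[:, :k]

# 5. Percentage of variance explained by the first k components.
explained = S[:k].sum() / S.sum()
print(X_projected.shape, round(explained, 3))
```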
A simple 2D PCA walkthrough motivates minimizing the projection error and framing PCA as a rotation problem.
PCA as a Linear Combination
SVD
From the SVD decomposition we know that any matrix $A_{m \times n}$ can be factorized as $A = U_{m \times m} S_{m \times n} V_{n \times n}^T$, where $U$ and $V$ are orthogonal matrices whose columns are orthonormal eigenvectors of $AA^T$ and $A^TA$ respectively, and $S$ is a diagonal matrix whose $r$ nonzero entries are the square roots of the positive eigenvalues of $AA^T$ and $A^TA$ (they have the same positive eigenvalues).
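As a quick numerical sanity check of this factorization, here is a small numpy sketch (the matrix size and random entries are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 3))                       # m = 4, n = 3

U, s, Vt = np.linalg.svd(A, full_matrices=True)   # U: 4x4, s: (3,), Vt: 3x3
S = np.zeros((4, 3))
S[:3, :3] = np.diag(s)                            # embed the singular values in an m x n diagonal matrix

print(np.allclose(A, U @ S @ Vt))                 # A = U S V^T
print(np.allclose(s**2, np.linalg.eigvalsh(A.T @ A)[::-1]))  # s_i^2 are the eigenvalues of A^T A
```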
SVD as a linear combination of outer products
We can rewrite the SVD equation as $AV = US$. Applying the column view of matrix multiplication and the fact that $S$ is a diagonal matrix, we have $Av_i = \sigma_i u_i$, and therefore
$$A = USV^T = \sum_{i=1}^{r} \sigma_i u_i v_i^T$$
This means that we can think of the matrix $A$ as a sum of outer products of $u_i$ and $v_i$, weighted by the singular values $\sigma_i$.
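A short numpy sketch of this rank-one view, using an assumed small random matrix for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 3))
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # thin SVD: U 4x3, s (3,), Vt 3x3

# A equals the sum of the rank-one terms sigma_i * outer(u_i, v_i).
A_rebuilt = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))
print(np.allclose(A, A_rebuilt))                   # True
```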
Covariance Matrix and SVD
Now, in terms of the covariance matrix: if $X$ is already centered (one sample per row), the sample covariance matrix can be written as $\Sigma = \frac{1}{n-1} X^T X$.
We can subsequently do SVD on this sample covariance matrix: $\Sigma = \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \dots$. Since the singular values $\sigma_i$ are ordered from largest to smallest, the impact of the corresponding rank-one terms is ordered the same way.
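A minimal sketch of this expansion on assumed synthetic data; keeping the terms with the largest $\sigma_i$ first gives the closest truncated approximations of $\Sigma$:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 3)) @ np.diag([2.0, 1.0, 0.3])   # assumed toy data
Xc = X - X.mean(axis=0)
Sigma = Xc.T @ Xc / (X.shape[0] - 1)        # sample covariance of the centered data

U, s, Vt = np.linalg.svd(Sigma)             # s is sorted from largest to smallest
for k in range(1, len(s) + 1):
    Sigma_k = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(k))
    print(k, np.linalg.norm(Sigma - Sigma_k))   # approximation error shrinks as terms are added
```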
Furthermore
The total variance of the data equals the trace of the matrix $\Sigma$, which equals the sum of $\Sigma$'s singular values (its eigenvalues, since $\Sigma$ is symmetric positive semidefinite). So we can compute the fraction of variance lost if we drop the smaller singular values (see the numerical sketch below).
The first eigenvector $u_1$ of $\Sigma$ points in the most important direction of the data (the direction of greatest variance).
The error, calculated as the sum of the squared perpendicular distances from each point to the line spanned by $u_1$, is minimized by this choice of direction.
If $Z = AX$, then the covariance of $Z$ can be calculated as $V_Z = A V_X A^T$, where $V_X$ is the covariance of $X$.
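The claims above can be checked numerically; this is a small sketch on assumed synthetic data, with the helper name perp_error and all shapes chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 3)) @ np.diag([3.0, 1.5, 0.5])   # assumed toy data, one sample per row
Xc = X - X.mean(axis=0)
Sigma = Xc.T @ Xc / (X.shape[0] - 1)
U, s, Vt = np.linalg.svd(Sigma)

# Total variance = trace(Sigma) = sum of Sigma's singular values.
print(np.isclose(np.trace(Sigma), s.sum()))

# Fraction of variance lost if only the first k singular values are kept.
k = 2
print(1 - s[:k].sum() / s.sum())

# u_1 gives a smaller sum of squared perpendicular distances than random unit directions.
def perp_error(u):
    # total squared norm of the points minus the squared norms of their projections onto u
    return (Xc ** 2).sum() - ((Xc @ u) ** 2).sum()

dirs = rng.normal(size=(1000, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
print(perp_error(U[:, 0]) <= min(perp_error(d) for d in dirs))

# If Z = A X (one observation per column), then V_Z = A V_X A^T.
Xcols = rng.normal(size=(3, 1000))          # 3 variables, 1000 observations
A = rng.normal(size=(2, 3))
print(np.allclose(np.cov(A @ Xcols), A @ np.cov(Xcols) @ A.T))
```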
Given the SVD decomposition, applying $A$ to a vector $x$ can be understood as first performing a rotation $V^T$, then a scaling $S$, and finally another rotation $U$.
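A minimal numpy sketch of this rotate-scale-rotate view, using an assumed square matrix so every factor acts on the same space:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=(2, 2))               # square for simplicity, so every factor is 2x2
x = rng.normal(size=2)

U, s, Vt = np.linalg.svd(A)
step1 = Vt @ x                            # rotation (possibly a reflection) by V^T
step2 = np.diag(s) @ step1                # scaling along the axes by the singular values
step3 = U @ step2                         # final rotation (possibly a reflection) by U
print(np.allclose(A @ x, step3))          # same result as applying A directly
```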