A simple guide to understand PCA
This post makes it easy to digest principal component analysis
Principal Component Analysis (PCA) is a method for reducing the dimensionality of (usually high-dimensional) data while retaining as much of its information as possible. The analysis is performed by determining the principal component vectors of a dataset (which is usually given as a matrix).
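Suppose a dataset is given as an n-by-p matrix, with the n samples as rows and the p variables as columns:
$$X = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix}$$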
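Then we compute its covariance matrix S, a p-by-p matrix with elements:
$$S_{jk} = \frac{1}{n} \sum_{i=1}^{n} \left(x_{ij} - \langle x \rangle_j\right)\left(x_{ik} - \langle x \rangle_k\right)$$
(Here the sum is normalised by n; dividing by n - 1 instead gives the unbiased estimate and only rescales the covariance by a constant factor.) In the formula of the covariance matrix, $\langle x \rangle_j$ and $\langle x \rangle_k$ are the averages of x over the j-th and k-th columns of the matrix, respectively:
$$\langle x \rangle_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}$$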
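For convenience, we introduce a “centered” matrix of the dataset, obtained by subtracting each column's average from its entries:
$$\tilde{X} = \begin{pmatrix} x_{11} - \langle x \rangle_1 & \cdots & x_{1p} - \langle x \rangle_p \\ \vdots & \ddots & \vdots \\ x_{n1} - \langle x \rangle_1 & \cdots & x_{np} - \langle x \rangle_p \end{pmatrix}$$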
With the centered matrix $\tilde{X}$, the covariance matrix S is rewritten as:
$$S = \frac{1}{n} \tilde{X}^{\mathsf{T}} \tilde{X}$$
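The centering and covariance steps can be sketched in NumPy as follows. This is only an illustration: the function name `covariance_matrix`, the variable `X_c` (the centered matrix) and the toy data are assumptions, and the 1/n normalisation is a choice (NumPy's own `np.cov` divides by n - 1 unless `bias=True` is passed).

```python
import numpy as np

def covariance_matrix(X):
    """Return (X_c, S): the column-centered data and its p-by-p covariance matrix."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    X_c = X - X.mean(axis=0)   # subtract each column's average
    S = X_c.T @ X_c / n        # assumed 1/n normalisation
    return X_c, S

# Hypothetical toy data: n = 5 samples (rows), p = 2 variables (columns)
X = np.array([[2.0, 1.0],
              [3.0, 5.0],
              [4.0, 3.0],
              [5.0, 6.0],
              [6.0, 5.0]])
X_c, S = covariance_matrix(X)
```

Every column of `X_c` averages to zero, and `S` agrees with `np.cov(X, rowvar=False, bias=True)`.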
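The covariance matrix is a real symmetric matrix, and therefore a normal matrix. This means that it can be diagonalized by a unitary matrix (for a real symmetric matrix, an orthogonal one), with its eigenvalues appearing on the diagonal.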
As for the diagonal matrix, note that we lose no generality by arranging the eigenvalues in descending order along the diagonal, starting with the largest.
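Accordingly, we have
$$U^{\mathsf{T}} S U = \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p), \qquad \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0,$$
where S is the covariance matrix, $U = (u_1, u_2, \ldots, u_p)$ is the unitary matrix, and its j-th column $u_j$, which satisfies $S u_j = \lambda_j u_j$, is the eigenvector for the j-th principal component.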
We then take the first principal component, the second principal component, and so on, in order of decreasing eigenvalue. (Note that the larger the eigenvalue, the larger the variance along that component.)
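As a rough sketch of the diagonalization and ordering step, one option is NumPy's `np.linalg.eigh`, which is meant for symmetric matrices and returns eigenvalues in ascending order, so they are reversed here. The helper name `sorted_eigh` and the 2-by-2 example matrix are illustrative assumptions.

```python
import numpy as np

def sorted_eigh(S):
    """Eigen-decompose a symmetric matrix, largest eigenvalue first."""
    eigvals, eigvecs = np.linalg.eigh(S)   # ascending eigenvalues, unit-length eigenvectors as columns
    order = np.argsort(eigvals)[::-1]      # indices for descending order
    return eigvals[order], eigvecs[:, order]

# Hypothetical covariance matrix with eigenvalues 3 and 1
S = np.array([[2.0, 1.0],
              [1.0, 2.0]])
lam, U = sorted_eigh(S)
```

The columns of `U` are the eigenvectors in principal-component order, and `U.T @ S @ U` is the diagonal matrix of `lam`. The sign of each eigenvector is arbitrary, so different libraries may return columns that differ by a factor of -1.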
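Now that we’ve obtained the unitary matrix (i.e. the set of eigenvectors of the symmetric matrix, each normalised to unit length), we use it to obtain the values (as vectors) of the data along each principal component, that is, in the transformed coordinate system. Writing $\tilde{X}$ for the centered data matrix and $u_j$ for the j-th eigenvector (the j-th column of the unitary matrix U), they are given by:
$$t_j = \tilde{X} u_j \quad (j = 1, \ldots, p), \qquad \text{or in matrix form} \quad T = \tilde{X} U$$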
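Putting the steps so far together, a minimal sketch of computing the principal component values might look like this. The function name `pca_scores`, the 1/n normalisation and the randomly generated data are assumptions made for illustration only.

```python
import numpy as np

def pca_scores(X):
    """Return (T, lam, U): scores, eigenvalues (descending) and eigenvectors."""
    X = np.asarray(X, dtype=float)
    X_c = X - X.mean(axis=0)               # centered matrix
    S = X_c.T @ X_c / X.shape[0]           # covariance matrix (assumed 1/n)
    lam, U = np.linalg.eigh(S)             # ascending order
    order = np.argsort(lam)[::-1]
    lam, U = lam[order], U[:, order]       # largest eigenvalue first
    T = X_c @ U                            # column j holds the j-th principal component
    return T, lam, U

# Hypothetical data: 200 samples of 3 correlated variables
rng = np.random.default_rng(0)
mix = np.array([[2.0, 0.0, 0.0],
                [0.5, 1.0, 0.0],
                [0.0, 0.3, 0.2]])
X = rng.normal(size=(200, 3)) @ mix
T, lam, U = pca_scores(X)
```

In this sketch each column of `T` averages to zero, its variance (with 1/n normalisation) equals the corresponding entry of `lam`, and different columns are uncorrelated.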
Principal component loading
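It’s also important to see to what extent the original components and the principal components are correlated. To see this, we calculate the covariance of the i-th column of X and the j-th principal component $t_j = \tilde{X} u_j$, writing $\tilde{X}$ for the centered matrix and $u_{ij}$ for the i-th element of the eigenvector $u_j$:
$$\mathrm{Cov}(x_i, t_j) = \frac{1}{n} \sum_{k=1}^{n} \left(x_{ki} - \langle x \rangle_i\right) t_{kj} = \frac{1}{n} \left(\tilde{X}^{\mathsf{T}} \tilde{X} u_j\right)_i = (S u_j)_i = \lambda_j u_{ij}$$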
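where we used the fact that the average of the j-th principal component vector is zero, because every column of the centered matrix averages to zero:
$$\langle t \rangle_j = \frac{1}{n} \sum_{k=1}^{n} t_{kj} = \sum_{l=1}^{p} \left( \frac{1}{n} \sum_{k=1}^{n} \left(x_{kl} - \langle x \rangle_l\right) \right) u_{lj} = 0$$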
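Also, we note that the variance of the j-th principal component vector is given by its eigenvalue:
$$\mathrm{Var}(t_j) = \frac{1}{n} t_j^{\mathsf{T}} t_j = u_j^{\mathsf{T}} S u_j = \lambda_j$$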
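Thus, the correlation coefficient between the j-th principal component and the i-th original component (the principal component loading) is:
$$r_{ij} = \frac{\mathrm{Cov}(x_i, t_j)}{\sqrt{\mathrm{Var}(x_i)\,\mathrm{Var}(t_j)}} = \frac{\lambda_j u_{ij}}{\sqrt{S_{ii}}\,\sqrt{\lambda_j}} = \frac{\sqrt{\lambda_j}\, u_{ij}}{\sqrt{S_{ii}}}$$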
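To make the loading concrete, here is a hedged NumPy sketch that computes the correlation between each original column and each principal component as sqrt(eigenvalue) times the eigenvector element, divided by the column's standard deviation. The function name `pca_loadings`, the 1/n normalisation and the random test data are assumptions, not part of the original post.

```python
import numpy as np

def pca_loadings(X):
    """Return (R, T): R[i, j] is the correlation of column i of X with component j."""
    X = np.asarray(X, dtype=float)
    X_c = X - X.mean(axis=0)                 # centered matrix
    S = X_c.T @ X_c / X.shape[0]             # covariance matrix (assumed 1/n)
    lam, U = np.linalg.eigh(S)
    order = np.argsort(lam)[::-1]
    lam, U = lam[order], U[:, order]         # largest eigenvalue first
    lam = np.clip(lam, 0.0, None)            # guard against tiny negative round-off
    sigma = np.sqrt(np.diag(S))              # standard deviation of each original column
    R = U * np.sqrt(lam) / sigma[:, None]    # R[i, j] = sqrt(lam_j) * u_ij / sigma_i
    return R, X_c @ U

# Hypothetical data: 100 samples of 3 correlated variables
rng = np.random.default_rng(1)
mix = np.array([[1.0, 0.4, 0.0],
                [0.0, 1.0, 0.3],
                [0.0, 0.0, 0.5]])
X = rng.normal(size=(100, 3)) @ mix
R, T = pca_loadings(X)

# Cross-check against correlations computed directly from the data
C = np.corrcoef(X, T, rowvar=False)[:3, 3:]
```

Here `R` and `C` should agree up to floating-point error, and the squared loadings in each row of `R` sum to 1, since together the components account for all of that column's variance.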