Today I am sharing a recent post from a nice blog called PoissonIsFish, written by a Portuguese (Brazilian?) speaker doing a PhD in Berlin, Germany.
It is a long but accessible demonstration and explanation of how to do Principal Component Analysis with the R programming language.
Principal component analysis (PCA) is routinely employed on a wide range of problems. From the detection of outliers to predictive modeling, PCA has the ability to project observations described by many variables onto a few orthogonal components defined where the data ‘stretch’ the most, rendering a simplified overview. PCA is particularly powerful for dealing with multicollinearity and with variables that outnumber the samples.
It is an unsupervised method, meaning it always looks for the greatest sources of variation regardless of the data structure. Its counterpart, partial least squares (PLS), is a supervised method that performs the same sort of covariance decomposition, albeit building a user-defined number of components (frequently designated latent variables) that minimize the SSE from predicting a specified outcome with ordinary least squares (OLS). PLS is worth an entire post of its own, so I will refrain from casting a second spotlight here.
In case PCA is entirely new to you, there is an excellent Primer from Nature Biotechnology that I highly recommend. Notwithstanding the focus on life sciences, it should still be clear to non-biologists.
There are numerous PCA formulations in the literature dating back as long as a century, but all in all PCA is pure linear algebra. One of the most popular methods is the singular value decomposition (SVD). The SVD algorithm breaks down a matrix X of size n × p into three pieces,

X = UΣVᵀ

where U is the matrix with the eigenvectors of XXᵀ, Σ is the diagonal matrix with the singular values and V is the matrix with the eigenvectors of XᵀX. These matrices are of size n × n, n × p and p × p, respectively. The key difference of the SVD compared to a matrix diagonalization (A = PDP⁻¹) is that U and V are distinct orthonormal (orthogonal and unit-vector) matrices.
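To make these identities concrete, here is a short self-contained check in base R (the matrix and seed are arbitrary choices for illustration). Note that base R's `svd` returns the compact form of the decomposition, so for a 4 × 5 matrix `u` is 4 × 4, `d` holds the four singular values and `v` is 5 × 4:

```r
# Self-contained check of the SVD identities (random matrix, arbitrary seed)
set.seed(1)
X <- matrix(rnorm(20), 4, 5)
sv <- svd(X)   # compact SVD: u is 4x4, d has 4 values, v is 5x4

# X is recovered as U %*% Sigma %*% t(V)
max(abs(X - sv$u %*% diag(sv$d) %*% t(sv$v)))   # ~ 0 (machine precision)

# The squared singular values are the eigenvalues of X %*% t(X),
# and the columns of u are the corresponding eigenvectors
max(abs(sv$d^2 - eigen(X %*% t(X))$values))     # ~ 0
```

Both differences are zero up to floating-point precision, matching the decomposition written above.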
PCA reduces the dimensions of your data set down to k principal components (PCs). The scores for the first k PCs result from multiplying the first k columns of U with the upper-left k × k submatrix of Σ. The loading factors of the kth PC are directly given by the kth row of Vᵀ. Consequently, multiplying all scores and loadings recovers X. You might as well keep in mind:
- PCs are ordered by the decreasing amount of variance explained
- PCs are orthogonal, i.e. uncorrelated with each other
- The columns of X should be mean-centered, so that the covariance matrix is given by cov(X) = XᵀX / (n − 1)
- SVD-based PCA does not tolerate missing values (but there are solutions we will cover shortly)
For a more elaborate explanation with introductory linear algebra, here is an excellent free SVD tutorial I found online. At any rate, I guarantee you can master PCA without fully understanding the process.
Let’s get started with R
Although there is a plethora of PCA methods available for R, I will only introduce two,
- prcomp, a default function from base R (the stats package)
- pcaMethods, a Bioconductor package that I frequently use for my own PCAs
I will start by demonstrating that prcomp is based on the SVD algorithm, using the base svd function.
# Generate scaled 4*5 matrix with random std normal samples
mat <- scale(matrix(rnorm(20), 4, 5))
# Perform PCA
myPCA <- prcomp(mat, scale. = F, center = F)
By default, prcomp will retrieve min(n, p) PCs. Therefore, in our setting we expect four PCs. The svd function will behave the same way:
# Perform SVD
mySVD <- svd(mat)
# the diagonal of Sigma mySVD$d is given as a vector
# we have 4 PCs, no need for a 5th column
sigma <- matrix(0, 4, 4)
diag(sigma) <- mySVD$d
# sigma is now our true Sigma matrix
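As a quick aside, the min(n, p) count of PCs mentioned above can be checked directly on matrices of different shapes (a self-contained illustration with made-up data):

```r
# Aside: prcomp returns min(n, p) PCs, whatever the shape of the input
set.seed(7)
tallX <- matrix(rnorm(40), 8, 5)    # n = 8, p = 5  -> 5 PCs
wideX <- matrix(rnorm(40), 4, 10)   # n = 4, p = 10 -> 4 PCs
ncol(prcomp(tallX)$x)   # 5
ncol(prcomp(wideX)$x)   # 4
```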
Now that we have the PCA and SVD objects, let us compare the respective scores and loadings. We will compare the scores from the PCA with the product of U and Σ from the SVD. In R, matrix multiplication is done with the operator %*%. Next, we will directly compare the loadings from the PCA with V from the SVD, and finally show that multiplying scores and loadings recovers X. I rounded the results to five decimal digits since they are not exactly the same! The function t retrieves a transposed matrix.
# Compare PCA scores with the SVD's U*Sigma
theoreticalScores <- mySVD$u %*% sigma
all(round(myPCA$x, 5) == round(theoreticalScores, 5)) # TRUE
# Compare PCA loadings with the SVD's V
all(round(myPCA$rotation, 5) == round(mySVD$v, 5)) # TRUE
# Show that mat == U*Sigma*t(V)
recoverMatSVD <- theoreticalScores %*% t(mySVD$v)
all(round(mat, 5) == round(recoverMatSVD, 5)) # TRUE
# Show that mat == scores*t(loadings)
recoverMatPCA <- myPCA$x %*% t(myPCA$rotation)
all(round(mat, 5) == round(recoverMatPCA, 5)) # TRUE
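One more connection worth knowing (a self-contained sketch of my own, with fresh variable names so it does not clobber the objects above): when the columns are mean-centered, the standard deviations prcomp reports in its sdev slot are simply the singular values divided by sqrt(n − 1).

```r
# prcomp's sdev equals the singular values scaled by sqrt(n - 1)
set.seed(3)
X2 <- scale(matrix(rnorm(20), 4, 5))   # centered (and scaled) 4x5 matrix
myPCA2 <- prcomp(X2, center = FALSE, scale. = FALSE)
mySVD2 <- svd(X2)
max(abs(myPCA2$sdev - mySVD2$d / sqrt(nrow(X2) - 1)))   # ~ 0
```

This is why squaring sdev and normalizing by the total gives the proportion of variance explained by each PC.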
- The SVD algorithm is founded on fundamental properties of linear algebra, including matrix diagonalization. SVD-based PCA takes part of its solution and retains a reduced number of orthogonal covariates that explain as much variance as possible.
- Use PCA when handling high-dimensional data. It is insensitive to correlation among variables and efficient in detecting sample outliers.
- If you plan to use PCA results for subsequent analyses, all care should be taken in the process. Although typically outperformed by more flexible methods, principal component regression (PCR) still benefits from interpretability and can be effective in many settings.
All feedback on these tutorials is very welcome; please head to the Contact tab and leave your comments. I also appreciate suggestions. Enjoy!