An EM solution to PCA

My PhD advisor recently introduced me to the solution of principle component analysis (PCA) that uses expectation maximization (EM), which led me to this paper that introduced this formulation. Because of how intuitive the algorithm is, I’d like to briefly explain it here.

The paper formulates PCA as a factor model:




The factors x are linearly transformed via C and have some additional Gaussian noise v.

The EM formulation falls out from using x as the latent factors.  Given some initial estimate C_0 and R_0, we can compute the posterior probabilities of x.  With these \hat{x}, we can find \hat{C} and \hat{R}.  And so we can do this iteratively, finding x given our best estimates of C and R, and then finding C and R given our updated x, and so on and so forth, until convergence.

This model becomes classical PCA in the zero noise limit.  i.e.


and we only need to care for solving C.  Because we are in the zero noise limit, given C we know exactly what x must be.  In other words, the posterior probability of x is a point mass. This greatly simplifies the E step of the EM algorithm.

In fact, in the E step, we obtain latent variables X=(C^TC)^{-1}C^Ty.  This intuitively corresponds to projecting the data onto our best guess of the principle component projection.

In the M step, we now update C:

C = YX^T(XX^T)^{-1}

This formula might look familiar as the closed form solution for ordinary least squares.  In fact this is precisely that. We find a new projection that minimizes the squared distance between y and x.

The whole process is akin to holding a wooden rod in place and affixing some 0-length springs from point objects to the rod, and then letting the rod go free to rotate according to the forces of the springs. Once the rod is stabilized, we reattach the springs in new locations.

Just as PCA finds the projection that maximizes the explained variance and minimizes total squared distance, we ultimately find the rod configuration that minimizes the total potential energy stored in the springs.   It is super elegant that by using EM, we find this connection from the generative model of PCA as a factor model to the traditional formulation of PCA as finding min/max problem of the squared distance/explained variance.

I have a diagrammatic view of one iteration of EM, where the circles represent data points and the squiggly lines represent springs.












Credits to Lior Pachter for introducing me to this concept, Valentine Svensson for feedback on this post, and Sam Roweis for coming up with the EM formulation for PCA.

Post script: after reading the paper that first introduced the EM solution to PCA, I noticed that it was written by a fellow Caltech graduate student in the 90’s, and upon googling ‘Sam Roweis’ I was very saddened to learn of his death in 2010 and of this great loss to science and mathematics.


How to read ROC curves

A common question I get asked is how to interpret a receiver operating character (ROC) curve.  What does the shape of the tell me?  Why do I care about area under the curve? This post aims to explain ROC curves from the fundamentals.

A ROC curve is used to graphically describe the behavior of a test that is used to discriminate case vs. control. For example, one can plot a ROC curve for a test that detects individuals who have cancer (case) from individuals who do not have cancer (control).   In particular, it describes a tradeoff between two measures of a test that are (1) how good the test is at picking up cases and (2) how good the test is at not picking up controls. In theory, a good test succeeds in both measures — it picks up all the cases, and it does not pick up all the controls.  However, rarely do we get a perfect test that separates all cases from all controls, whether there is no perfect test or because we have not discovered a better one. A tradeoff will need to be made between (1) finding true positives and (2) not finding true negatives.

(I call these measure 1 and measure 2 from now on. Measure 1 is technically referred to as sensitivity, true positive rate, and recall. Measure 2 is technically referred to as  false positive rate, fall-out and 1 – specificity.  I think these terms are unnecessarily confusing and will just refer to them as measure 1 and measure 2 from now on.)

Let’s take for example, we have designed a new test for cancer that assigns an individual a score between 1 and 100, 1 for lowest risk of cancer and 100 for highest risk of cancer.  Now we want to know how good this test is.  It also wants to know where to set the normal cutoff for the test, so that anyone who receives a score above this cutoff undergoes further medical evaluation for cancer and anyone who receives a score below this cutoff does not have to undergo more expensive or painful procedures.  One way to assess the accuracy of this cancer test and set the normal cutoff is to plot a ROC curve.

One can imagine that, if we set the normal cutoff to a score of 100, no one would be determined to have cancer (because everyone would get a score at or below 100). We would make no mistakes in calling a healthy individual sick (succeed at measure 2), and yet, we completely fail at picking up truly sick individuals (fail at measure 1).   On the other extreme, if the normal cutoff were set at 0, everyone taking the test would be determined to have cancer,  and we would call everyone sick.  Every truly sick person would be called sick (succeed in measure 1) but every healthy person would also be called sick (fail in measure 2).


These two extremes correspond to the two extremes of the ROC curve.



On the ROC curve, the x-axis corresponds to the number of controls (aka negatives) called erroneously. The y-axis corresponds to the number of cases (aka positives) called correctly.  Setting the cutoff such that no one is called sick corresponds to the circle (0% negatives called, 0% positives called).  Setting the cutoff such that everyone is called sick corresponds to the pentagon (100% negatives called, 100% positives called).  Somewhere between the two extremes is where we want to set a good threshold.

Let’s say our cancer drug is a perfect discriminator.  Then at some threshold, everyone below the threshold is healthy and everyone above the threshold is sick. This would correspond to the star.

ROC.002At this point of the curve, it calls all the positives but none of the negatives and is a perfect discriminator.  If we were to draw a line connecting the circle, star, and pentagon, the ROC curve for this perfect discriminator is produced:

ROC.003The area underneath this rectangle is merely 1 x 1, which is 1. The area under the curve (AUC) of a perfect discriminator is 1.

Conceptually, connecting the circle, star and pentagon is equivalent to plotting measure 1 and measure 2 at all possible thresholds.  The ROC curve is the plot all possible pairs of measure 1 and measure 2 of a test at all possible thresholds.

Let’s explore this idea further by considering random discriminator, instead of a perfect one.  In this random discriminator, it randomly assigns individuals scores between 1 and 100 (i.e. the test tells us nothing).  In this case, no matter where we set the threshold, we will pick up the same percentage of positives as negatives (y = x).   This corresponds to the following ROC curve depicted as the green line on y=x:



The area under this ROC curve, which is just the area of the isosceles triangle with two sides of length 1, is 0.5.  We can conclude that The area under the ROC curve of a random discriminator is 0.5.

Now, we can imagine a really really bad discriminator, where it calls all the negatives and ignores all the positives. It would have a ROC curve like this that corresponds like this, which corresponds to a AUC of 0.

ROC.005But then we can just ‘reverse-interpret’ the test and turn it into a perfect discriminator.  In this scenario where we can invert any test that performs worse than random, the worst test you can have is the random one with an AUC of 0.5.

In sum, the area under the curve (AUC) — how much greater it is than 0.5 and how close it is to 1 — gives us an indication of how accurate and useful the test is. For example, if the ROC curves below correspond to two tests (they both perform better than random), I would always use the green curve’s test over the red curve’s test, because it performs better at both measures no matter where the thresholds are set.


The ROC curve itself can also inform us about where to set a proper threshold.  Let’s say this is a cancer screen for prostate cancer, where the followup is a fine needle biopsy to confirm cancer, a procedure that is relatively cheap and does not generally lead to complications.  Then maybe I would be willing to trade off measure 2 for measure 1, so that I can pick up more positives, even if a few more individuals will unnecessarily need to undergo fine needle biopsy.  But if the followup is much serious or expensive, then maybe it is worth it to make sure we are confident that all our calls are positive.  The ROC curve helps assess where to set a threshold such that we obtain the measure 1 and measure 2 tradeoff that is suitable for the use case.