A common question I get asked is how to interpret a receiver operating characteristic (ROC) curve. What does the shape of the curve tell me? Why do I care about the area under the curve? This post aims to explain ROC curves from the fundamentals.
A ROC curve graphically describes the behavior of a test used to discriminate cases from controls. For example, one can plot a ROC curve for a test that distinguishes individuals who have cancer (cases) from individuals who do not have cancer (controls). In particular, it describes a tradeoff between two measures of a test: (1) how good the test is at picking up cases and (2) how good the test is at not picking up controls. In theory, a good test succeeds at both measures: it picks up all the cases, and it picks up none of the controls. In practice, however, we rarely get a perfect test that separates all cases from all controls, whether because no perfect test exists or because we have not yet discovered a better one. A tradeoff will need to be made between (1) finding true positives and (2) avoiding false positives.
(I call these measure 1 and measure 2 from now on. Measure 1 is technically referred to as sensitivity, true positive rate, or recall. Measure 2 is technically referred to as false positive rate, fall-out, or 1 − specificity. I think these terms are unnecessarily confusing, so I will stick with measure 1 and measure 2.)
Let’s say, for example, that we have designed a new test for cancer that assigns an individual a score between 1 and 100: 1 for the lowest risk of cancer and 100 for the highest. Now we want to know how good this test is. We also want to know where to set the normal cutoff for the test, so that anyone who receives a score above this cutoff undergoes further medical evaluation for cancer, and anyone who scores at or below this cutoff does not have to undergo more expensive or painful procedures. One way to assess the accuracy of this cancer test and set the normal cutoff is to plot a ROC curve.
One can imagine that, if we set the normal cutoff to a score of 100, no one would be determined to have cancer (because no one can score above 100). We would make no mistakes in calling a healthy individual sick (succeeding at measure 2), and yet we would completely fail to pick up truly sick individuals (failing at measure 1). At the other extreme, if the normal cutoff were set at 0, everyone taking the test would be determined to have cancer. Every truly sick person would be called sick (succeeding at measure 1), but every healthy person would also be called sick (failing at measure 2).
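The two extremes can be sketched in a few lines of code. This is a minimal illustration, assuming a made-up cohort of `(score, truly_has_cancer)` pairs; all names and numbers here are hypothetical.

```python
# Hypothetical cohort: (score, truly_has_cancer). All numbers are made up.
cohort = [(10, False), (25, False), (40, True), (55, False),
          (70, True), (85, True), (95, True)]

def measures_at_cutoff(data, cutoff):
    """Return (measure 1, measure 2) when scores above `cutoff` are called sick."""
    cases = [s for s, sick in data if sick]
    controls = [s for s, sick in data if not sick]
    measure1 = sum(s > cutoff for s in cases) / len(cases)       # fraction of cases picked up
    measure2 = sum(s > cutoff for s in controls) / len(controls)  # fraction of controls picked up
    return measure1, measure2

print(measures_at_cutoff(cohort, 100))  # (0.0, 0.0): no one called sick
print(measures_at_cutoff(cohort, 0))    # (1.0, 1.0): everyone called sick
```

A cutoff of 100 succeeds at measure 2 but fails at measure 1; a cutoff of 0 does the opposite.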
These two extremes correspond to the two extremes of the ROC curve.
On the ROC curve, the x-axis corresponds to the percentage of controls (aka negatives) called erroneously (measure 2). The y-axis corresponds to the percentage of cases (aka positives) called correctly (measure 1). Setting the cutoff such that no one is called sick corresponds to the circle (0% of negatives called, 0% of positives called). Setting the cutoff such that everyone is called sick corresponds to the pentagon (100% of negatives called, 100% of positives called). Somewhere between the two extremes is where we want to set a good threshold.
Let’s say our cancer test is a perfect discriminator. Then at some threshold, everyone below the threshold is healthy and everyone above it is sick. This would correspond to the star.
At this point on the curve, the test calls all the positives and none of the negatives, making it a perfect discriminator. If we draw a line connecting the circle, star, and pentagon, we produce the ROC curve for this perfect discriminator:
The area underneath this rectangle is simply 1 × 1 = 1. The area under the curve (AUC) of a perfect discriminator is 1.
Conceptually, connecting the circle, star, and pentagon is equivalent to plotting measure 1 against measure 2 at all possible thresholds. The ROC curve is the plot of all possible pairs of measure 1 and measure 2 for a test, one pair per threshold.
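The threshold sweep can be made concrete with a short sketch. Again the cohort is hypothetical; the point is that each cutoff yields one (measure 2, measure 1) point, and the collection of points is the ROC curve.

```python
# Hypothetical cohort: (score, truly_has_cancer). All numbers are made up.
cohort = [(10, False), (25, False), (40, True), (55, False),
          (70, True), (85, True), (95, True)]

def roc_points(data):
    """One (measure 2, measure 1) pair per cutoff, from strictest to loosest."""
    cases = [s for s, sick in data if sick]
    controls = [s for s, sick in data if not sick]
    points = []
    # Highest cutoff first (no one called sick), down to 0 (everyone called sick).
    for cutoff in sorted({s for s, _ in data} | {0}, reverse=True):
        m1 = sum(s > cutoff for s in cases) / len(cases)        # measure 1
        m2 = sum(s > cutoff for s in controls) / len(controls)  # measure 2
        points.append((m2, m1))
    return points

for m2, m1 in roc_points(cohort):
    print(f"measure 2 = {m2:.2f}, measure 1 = {m1:.2f}")
```

The first point printed is the circle (0, 0) and the last is the pentagon (1, 1); everything in between traces the curve.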
Let’s explore this idea further by considering a random discriminator instead of a perfect one. A random discriminator assigns individuals random scores between 1 and 100 (i.e., the test tells us nothing). In this case, no matter where we set the threshold, we will pick up the same percentage of positives as negatives (y = x). This corresponds to the following ROC curve, depicted as the green line on y = x:
The area under this ROC curve, which is just the area of a right triangle with two legs of length 1, is 0.5. We can conclude that the area under the ROC curve of a random discriminator is 0.5.
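Both areas can be checked numerically with the trapezoidal rule, a standard way to compute AUC from a list of curve points. This is a minimal sketch; the two point lists below are just the perfect and random curves described above.

```python
def auc(points):
    """Trapezoidal area under a ROC curve given (x, y) points sorted by x."""
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# Circle -> star -> pentagon for the perfect discriminator:
perfect = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
# The y = x line for the random discriminator:
random_line = [(0.0, 0.0), (1.0, 1.0)]

print(auc(perfect))      # 1.0
print(auc(random_line))  # 0.5
```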
Now, we can imagine a really, really bad discriminator that calls all the negatives and none of the positives. Its ROC curve would look like this, which corresponds to an AUC of 0.
But then we could just ‘reverse-interpret’ the test and turn it into a perfect discriminator. Since we can invert any test that performs worse than random, the worst test you can have is the random one, with an AUC of 0.5.
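Reverse-interpretation can be demonstrated with a rank-based view of AUC (the probability that a randomly chosen case outranks a randomly chosen control). The scores below are made up to be perfectly backwards: cases score low, controls score high.

```python
def auc_from_scores(cases, controls):
    """AUC as P(case score > control score), counting ties as half."""
    pairs = [(c, k) for c in cases for k in controls]
    wins = sum(1.0 if c > k else 0.5 if c == k else 0.0 for c, k in pairs)
    return wins / len(pairs)

# A hypothetical backwards test: every case scores LOWER than every control.
case_scores = [10, 20, 30]
control_scores = [70, 80, 90]
print(auc_from_scores(case_scores, control_scores))  # 0.0: worse than random

# Reverse-interpret: treat low scores as high risk by negating them.
flipped = auc_from_scores([-s for s in case_scores], [-s for s in control_scores])
print(flipped)  # 1.0: now a perfect discriminator
```

Inverting the reading of the test turns an AUC of 0 into an AUC of 1, which is why no useful test sits below the y = x line.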
In sum, the area under the curve (AUC), specifically how much greater it is than 0.5 and how close it is to 1, indicates how accurate and useful the test is. For example, if the ROC curves below correspond to two tests (both performing better than random), I would always use the green curve’s test over the red curve’s test, because it performs better on both measures no matter where the threshold is set.
The ROC curve itself can also inform us about where to set a proper threshold. Let’s say this is a screen for prostate cancer, where the followup is a fine needle biopsy to confirm cancer, a procedure that is relatively cheap and does not generally lead to complications. Then maybe I would be willing to trade off measure 2 for measure 1, so that I pick up more positives, even if a few more individuals unnecessarily undergo fine needle biopsy. But if the followup is much more serious or expensive, then it may be worth sacrificing some of measure 1 to be confident that every positive call is correct. The ROC curve helps us set a threshold that yields the measure 1 and measure 2 tradeoff suitable for the use case.
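When the two kinds of error are weighted equally, one common heuristic for picking a threshold is Youden’s J statistic: choose the cutoff that maximizes measure 1 minus measure 2. A weighted variant would encode the biopsy-cost reasoning above. This sketch uses a hypothetical cohort in which the cutoff it finds happens to separate cases and controls perfectly.

```python
# Hypothetical cohort: (score, truly_has_cancer). All numbers are made up.
cohort = [(10, False), (20, False), (30, False), (35, False),
          (40, True), (50, True), (60, True)]

def best_cutoff(data):
    """Cutoff maximizing Youden's J = measure 1 - measure 2."""
    cases = [s for s, sick in data if sick]
    controls = [s for s, sick in data if not sick]

    def j(cutoff):
        m1 = sum(s > cutoff for s in cases) / len(cases)
        m2 = sum(s > cutoff for s in controls) / len(controls)
        return m1 - m2

    return max(sorted({s for s, _ in data} | {0}), key=j)

print(best_cutoff(cohort))  # 35: every control is at or below it, every case above
```

To favor measure 1 (as in the cheap-biopsy scenario), one could instead maximize something like `m1 - 0.5 * m2`, penalizing false positives less.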