Roc curve from scratch python

Because you don’t learn it until you build it

Roc curve from scratch python

Photo by Nina PhotoLab on Unsplash

What about starting with a quote?.

“In the past, I’ve tried to teach machine learning using […] different programming languages […], and what I found is that students were able to learn the most productively […] using a relatively high level language like Octave.”, Andrew NG.

Building something from scratch was the method used by Andrew NG to teach his famous Coursera’s machine learning course (in plain Octave 😂), with one of the greatest ratings on the platform. Like Andrew, I truly believe that building things is the best way to learn because it forces us to understand every step of the algorithm. Unlike Andrew, I prefer to use Python and Numpy 😎 because of their simplicity and massive adoption. It sounds kind of crazy going directly against his advice, but the times change, and we can change too.

Evaluating machine learning models could be a challenging task. There are a vast of metrics, and just by looking at them, you might feel overwhelmed. The usual first approach is to check out accuracy, precision, and recall. But when you dig a little deeper, you will probably run into a ROC graph. The problem is that it isn’t as easy to understand as the others.

What to do then? Build it from scratch!.

The Receiving operating characteristic (ROC) graph attempts to interpret how good (or bad) a binary classifier is doing. There are several reasons why a simple confusion matrix isn’t enough to test your models. Still, the ROC representation solves incredibly well the following: the possibility to set more than one threshold in one visualization.

One of the following scenarios is true before we move on: the first is that you understood everything I said in the last paragraph, so I can keep going and start building the ROC curve. The second is that you didn’t understand much. If that is the case, I don’t want to look rude. Therefore, I have something for you.

If you feel confident about your knowledge, you can skip the next section. But if you don’t (or you need a little refresher), I encourage you to read it. Understanding the following concepts, it’s essential because the ROC curve is built upon them.

Foundations

Note: the following terms will be superficially tackled. In case you want a more detailed guide, look here or here.

  • Confusion Matrix: When predicting a binary classification problem, it’s usual to label the positive case as 1 and the negative as 0. When the prediction is equal to the actual value, it is a true case; otherwise, it is a negative one. This matrix aims to reveal information about the proportion of the combinations of these scenarios, including true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). In the visualization below, you can see an example.

Roc curve from scratch python

Image Created by Author.

Roc curve from scratch python

Image Created by Author.
  • True positive rate: This metric is calculated using the true positives and the false negatives according to the following formula.

Roc curve from scratch python

Image Created by Author.
  • False positive rate: This metric is calculated using the false positives and the true negatives according to the following formula.

Roc curve from scratch python

Image Created by Author.
  • Predict_proba: For most models, you can retrieve the probability of the prediction. Probabilities show if there is a difference in the confidence level of the model on the selected class. A higher predict_proba will represent higher confidence levels about the output.
  • Threshold: This value defines when models switch their opinion about the output, from positive to negative or vice-versa. By default, models will have a threshold of 0.5. If the value is higher than 0.5, the model will predict the sample as positive, but if it’s lower, it will predict it as negative.

ROC Curve From Scratch

The ROC graph has the true positive rate on the y axis and the false positive rate on the x axis. As you might be guessing, this implies that we need a way to create these metrics more than once to give the chart its natural shape. We need an algorithm to iteratively calculate these values.

Despite that there is an implementation of this metric in scikit-learn (which we will be visiting later), if you are already here, it’s a strong indication that you are brave enough to build instead of just copy-paste some code. To get an idea of what we will be actually doing, I prepared for you the following steps, along with visualizations… Enjoy!.

Step 1, choosing a threshold: As we discussed earlier, the ROC curve’s whole idea is to check out different thresholds, but how? Well, that’s part of our job. There are different ways to do it, but we will take the simplest. Just by setting the thresholds into equally distant partitions, we can solve our first dilemma.

Roc curve from scratch python

Image Created by Author.

The thresholds that we need to look at are equal to the number of partitions we set, plus one. We will iterate over every threshold defined in this step.

Step 2, threshold comparison: In every iteration, we must compare the predicted probability against the current threshold. If the threshold is higher than the predicted probability, we label the sample as a 0, and with 1 on the contrary. If you aren’t still clear about this, I’m sure the next illustration will help.

Roc curve from scratch python

Image Created by Author.

In the visualization, there are two examples of different iterations. The number of positive predicted cases for a high threshold is always lower or equal compared to a smaller one.

Step 3, calculating TPR and FPR: We are nearly done with our algorithm. The last part is to calculate the TPR and FPR at every iteration. The method is simple. It’s precisely the same we saw in the last section. The only difference is that we need to save the TPR and FPR in a list before going into the next iteration. The list of TPRs and FPRs pairs is the line in the ROC curve. So, we are officially done!

I know you want another visualization. You can see how different thresholds change the value of our TPR and FPR.

Roc curve from scratch python

Image Created by Author.

Now that you are an expert in the algorithm, it’s time to start building! The first step before starting is to have some probabilities and some predictions. To address that issue quickly, we will gather it using scikit-learn (it’s not cheating because it is just an input for the algorithm).

Note: There might be slight changes in the results for your case because I didn’t set the random_state parameter on “make_classification”.

To start, we need a method to replicate step 3, which is accomplished by the following.

The core of the algorithm is to iterate over the thresholds defined in step 1. We go through steps 2 & 3 to add the TPR and FPR pair to the list at every iteration.

Do you think it will work? 🤞

Roc curve from scratch python

Image Created by Author.

Obviously, it was going to work 🤣. Using ten partitions, we obtained our first ROC graph. But you can see how increasing the number of partitions gives us a better approximation of the curve.

Roc curve from scratch python

Image Created by Author.

But let’s compare our result with the scikit-learn’s implementation.

Roc curve from scratch python

Image Created by Author.

Pretty much the same 😎. But we are not over yet. The ROC curve comes along with a metric: “the area under the curve”.

AUC From Scratch

The area under the curve in the ROC graph is the primary metric to determine if the classifier is doing well. The higher the value, the higher the model performance. This metric’s maximum theoric value is 1, but it’s usually a little less than that.

The AUC can be calculated for functions using the integral of the function between 0 and 1.

But in this case, it’s not that simple to create a function. Nonetheless, a good approximation is to calculate the area, separating it into smaller pieces (rectangles and triangles).

Roc curve from scratch python

Image Created by Author.

As the number increases, the area under the triangles becomes more negligible, so we can ignore it. We have our last challenge, though: calculate the AUC value. What we have to do is to sum every area of the rectangles we just draw. Despite not being the optimal implementation, we will use a for loop to make it easier for you to catch.

Roc curve from scratch python

Image Created by Author.

Again, we compare it against scikit-learn’s implementation

Roc curve from scratch python

Image Created by Author.

There is a minimal difference because of the point’s locations, but the value is almost the same.

We are done! 🙌

There are improvements to be made to the algorithm, but it was just for pedagogical purposes. I really hope that seeing every step, helps you to interpret better the metrics.

See you soon

Now it’s time for you to decide. What worked for you the best, Octave or Python. I will wait for your answer in the comments!. I really hope that this blog was somehow interesting to you.

Roc curve from scratch python

I’m also on Linkedin and Twitter. I will gladly talk with you!
In case you feel like reading a little more, check out some of my recent posts:

How do you make a ROC curve in Python?

Step 1 - Import the library - GridSearchCv. ... .
Step 2 - Setup the Data. ... .
Step 3 - Spliting the data and Training the model. ... .
Step 5 - Using the models on test dataset. ... .
Step 6 - Creating False and True Positive Rates and printing Scores. ... .
Step 7 - Ploting ROC Curves..

How do you make a ROC curve from scratch?

ROC Curve in Machine Learning with Python.
Step 1: Import the roc python libraries and use roc_curve() to get the threshold, TPR, and FPR. ... .
Step 2: For AUC use roc_auc_score() python function for ROC..
Step 3: Plot the ROC curve..
Step 4: Print the predicted probabilities of class 1 (malignant cancer).

How do I plot ROC curve?

To plot the ROC curve, we need to calculate the TPR and FPR for many different thresholds (This step is included in all relevant libraries as scikit-learn ). For each threshold, we plot the FPR value in the x-axis and the TPR value in the y-axis. We then join the dots with a line. That's it!

How do you get AUC from ROC curve in Python?

ROC Curves and AUC in Python The AUC for the ROC can be calculated using the roc_auc_score() function. Like the roc_curve() function, the AUC function takes both the true outcomes (0,1) from the test set and the predicted probabilities for the 1 class.