Evaluation Metrics in Machine Learning: Top 10 Metrics You Should Know
Choosing the right metric is a crucial step in any Machine Learning project. Every model must be evaluated to check how well it has learnt from the training data and how it performs on unseen test data. These measures are called performance metrics, and they differ for regression and classification models.
By the end of this tutorial, you will know the most widely used regression and classification metrics, what each one measures, and when to choose one over another.
Regression problems involve predicting a continuous-valued target from a set of independent features. This is a type of supervised learning in which we compare each prediction with the actual value and compute the difference, called the error term. The smaller the error, the better the model's performance. Several regression metrics are in wide use today; let's go over them one by one.
Enrol for the Machine Learning Course from the World’s top Universities. Earn Masters, Executive PGP, or Advanced Certificate Programs to fast-track your career.
Mean Squared Error (MSE) is the most widely used regression metric. It computes the error for each instance as the square of (Y_Pred – Y_Actual). Squaring changes the usual error calculation in two important ways. First, a raw error can be negative; squaring turns every error into a positive term, so the errors can be added without cancelling each other out.
Second, squaring magnifies errors that are already large and shrinks errors with values less than 1. This magnifying effect penalises the instances where the error is large. MSE is also highly preferred because it is differentiable at every point, which makes it convenient for computing the gradient of the loss function.
The shortcoming of MSE is that squaring the error terms overestimates large errors and changes the units of the metric. Root Mean Squared Error (RMSE) takes the square root of MSE, which reduces that magnifying effect and brings the metric back into the same units as the target. This is useful when large errors are undesirable but you still want a score on the original scale.
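As a quick illustration, here is a minimal sketch of both metrics using NumPy and scikit-learn; the y_true and y_pred arrays are made-up values for demonstration:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])  # actual target values
y_pred = np.array([2.5, 5.0, 4.0, 8.0])  # model predictions

mse = mean_squared_error(y_true, y_pred)  # average of squared errors
rmse = np.sqrt(mse)                       # back in the target's units

print(f"MSE:  {mse:.3f}")   # 0.875
print(f"RMSE: {rmse:.3f}")  # 0.935
```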
Mean Absolute Error (MAE) computes the error as the absolute value of (Y_Pred – Y_Actual). Unlike MSE, it does not overestimate large errors and is robust to outliers; by the same token, it is unsuitable for applications where outliers demand special attention. MAE is a linear score, meaning all individual differences are weighted equally.
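A matching sketch for MAE on the same illustrative arrays; note how the largest error (1.5) is not magnified the way squaring magnifies it in MSE:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)  # mean of |y_pred - y_true|
print(f"MAE: {mae:.3f}")  # (0.5 + 0.0 + 1.5 + 1.0) / 4 = 0.750
```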
R Squared is a goodness-of-fit measure for regression models. It measures how closely the data points scatter around the fitted regression line and is also called the Coefficient of Determination. A higher R Squared value means there is less difference between the values predicted by the model and the actual observed values.
The R Squared value keeps increasing as more and more features are added to the model. This makes R Squared an unreliable measure of performance on its own: it can report a large value even when the added features contribute nothing.
In regression analysis, R Squared is used to determine the strength of the relationship between the features and the target. In simple terms, it measures how much of the variation in the dependent variable your model explains, on a 0–100% scale. R Squared is one minus the ratio of the Residual Sum of Squares (SSR) to the Total Sum of Squares (SST). It is defined as:
R Sqr = 1 – SSR/SST, where
SSR is the sum of the squares of the differences between the actual observed values Y and the predicted values Y_Pred. SST is the sum of the squares of the differences between the actual observed values Y and their average Y_Avg.
Generally, the higher the R Sqr, the better the model. But is that always the case? No.
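Here is a minimal sketch that computes R Squared both by hand, from the definition above, and with scikit-learn's r2_score (the arrays are illustrative):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

ssr = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
sst = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares

print(1 - ssr / sst)             # ~0.724, by the definition above
print(r2_score(y_true, y_pred))  # same value from scikit-learn
```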
Adjusted R Squared overcomes R Squared's inability to correctly reflect the change in model performance when more features are added. The plain R Squared value shows an incomplete picture and can be very misleading.
In essence, the R Sqr value always increases when new features are added, even if a feature actually hurts the model's performance, so you might not notice when your model has started to overfit.
Adjusted R Sqr corrects for this by penalising the number of variables: its value decreases when a feature does not improve the model. We use Adjusted R Sqr to compare the goodness-of-fit of regression models that contain different numbers of independent variables.
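The standard formula is Adjusted R Sqr = 1 – (1 – R Sqr) × (n – 1) / (n – p – 1), where n is the number of samples and p the number of independent variables. A minimal sketch, with illustrative values for n and p:

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """n = number of samples, p = number of independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Same R Squared but many more features: the adjusted score drops,
# signalling that the extra variables are not pulling their weight.
print(adjusted_r2(0.80, n=100, p=5))   # ~0.789
print(adjusted_r2(0.80, n=100, p=30))  # ~0.713
```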
Read: Cross-Validation in Machine Learning
Just like regression, classification has its own set of metrics, and different metrics suit different types of classification problems and data. Let's go over them one by one.
Accuracy is the most straightforward metric for classification. It simply calculates the percentage of correct predictions out of the total number of instances. For example, if 90 out of 100 instances are predicted correctly, the accuracy is 90%. However, accuracy is not the right metric for most classification tasks because it ignores class imbalance.
For a fuller picture of model performance, we need to look at how many false positives and false negatives the model produced. Precision tells us what proportion of the instances predicted as positive are actually positive, i.e. the correctly predicted positives out of all positive predictions. Recall tells us what proportion of the actual positives the model managed to find, i.e. the correctly predicted positives out of all actual positives.
A Confusion Matrix lays out the True Positives, True Negatives, False Positives, and False Negatives in a single table, showing how the predictions line up against the actual classes. It is an NxN matrix, where N is the number of classes. The Confusion Matrix is not so confusing after all!
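The sketch below ties accuracy, precision, and recall back to the confusion matrix on a small set of illustrative binary labels:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)                   # 4 1 1 4

print(accuracy_score(y_true, y_pred))   # (TP + TN) / total = 0.8
print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 0.8
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 0.8
```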
F1 Score combines Precision and Recall into a single averaged metric: it is the harmonic mean of the two. The choice of mean matters. If recall is 1, i.e. 100%, and precision is 0, the arithmetic mean would report 0.5, but the harmonic mean, and hence the F1 Score, is 0. The harmonic mean penalises extreme values far more heavily.
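A tiny sketch makes the difference concrete: with precision 0 and recall 1, the arithmetic mean still looks respectable while the harmonic mean collapses to 0:

```python
def f1(precision: float, recall: float) -> float:
    # Harmonic mean of precision and recall; defined as 0 when both are 0.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

precision, recall = 0.0, 1.0
print((precision + recall) / 2)  # arithmetic mean: 0.5, misleadingly high
print(f1(precision, recall))     # harmonic mean (F1): 0.0
```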
Check out: 5 Types of Classification Algorithms in Machine Learning
Accuracy and F1 Score are not good metrics when the data is imbalanced. The AUC (Area Under the Curve) of the ROC (Receiver Operating Characteristic) curve tells us how well the model separates the classes. The higher the score, the better the model is at predicting 0s as 0s and 1s as 1s. The ROC curve is plotted with the True Positive Rate (TPR) on the Y-axis and the False Positive Rate (FPR) on the X-axis:
TPR = TP / (TP + FN)
FPR = FP / (FP + TN)
If the AUC comes out to be 1, the model separates the classes completely and predicts every instance correctly.
If it is 0.5, there is no separability: the model's predictions are no better than random.
If it is 0, the model is predicting the classes inverted, that is, 0s as 1s and 1s as 0s.
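A minimal sketch of ROC AUC, assuming the model outputs predicted probabilities (the scores below are illustrative):

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.3]  # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points for plotting
auc = roc_auc_score(y_true, y_score)
print(f"AUC: {auc:.3f}")  # 0.938: strong separability on this toy data
```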
Evaluation metrics are numerical measurements used to rate the effectiveness of a model. They let us gauge performance by contrasting the model's predictions with the actual results, shedding light on its strengths, weaknesses, and overall behaviour.
Predictive models classify or forecast based on input data. They can be broadly divided into two categories: regression models and classification models. Regression models are used when the output variable is continuous, while classification models are employed when the output is categorical.
In marketing and customer relationship management (CRM) settings, gain and lift charts are frequently used evaluation tools. They show the improvement a model provides over random selection, which helps in judging the efficacy of predictive models and their capacity to identify the most valuable cases.
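As a rough sketch of the idea, decile-wise lift can be computed by sorting instances by predicted score and comparing each bin's response rate with the overall rate; the scores and the simulated outcomes below are purely illustrative:

```python
import numpy as np

def decile_lift(y_true, y_score, bins=10):
    order = np.argsort(y_score)[::-1]     # highest-scored instances first
    y_sorted = np.asarray(y_true)[order]
    overall_rate = y_sorted.mean()        # baseline: random selection
    return [chunk.mean() / overall_rate   # lift > 1 beats random selection
            for chunk in np.array_split(y_sorted, bins)]

rng = np.random.default_rng(0)
scores = rng.random(1000)                           # fake model scores
outcomes = (rng.random(1000) < scores).astype(int)  # score-correlated labels
print([round(l, 2) for l in decile_lift(outcomes, scores)])
```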
The Kolmogorov–Smirnov (KS) chart assesses the effectiveness of binary classification models. It measures the largest gap between the cumulative distributions of the positive and negative examples; a larger KS value denotes a more effective model.
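One way to compute the KS statistic is SciPy's two-sample KS test on the model scores of the two classes; the score distributions here are simulated for illustration:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
pos_scores = rng.normal(0.7, 0.1, 500)  # scores given to actual positives
neg_scores = rng.normal(0.4, 0.1, 500)  # scores given to actual negatives

ks_stat, p_value = ks_2samp(pos_scores, neg_scores)
print(f"KS statistic: {ks_stat:.3f}")  # closer to 1 = better separation
```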
Log Loss is a common evaluation statistic for classification problems. It measures the discrepancy between the predicted probabilities and the actual outcomes; a lower Log Loss value indicates a more accurate model.
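A minimal sketch with scikit-learn's log_loss; note how the single confidently wrong prediction dominates the score:

```python
from sklearn.metrics import log_loss

y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.1, 0.8, 0.9]  # last prediction is confidently wrong

print(f"Log loss: {log_loss(y_true, y_prob):.3f}")  # ~0.684
```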
The Gini Coefficient is another evaluation statistic for classification problems. It is directly related to the ROC AUC (Gini = 2 × AUC – 1) and measures how well the model separates the positive and negative classes; a higher Gini Coefficient indicates a better-performing model.
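Since Gini = 2 × AUC – 1, it is a one-liner once the AUC is known (toy labels and scores again):

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.3]

gini = 2 * roc_auc_score(y_true, y_score) - 1  # Gini = 2 * AUC - 1
print(f"Gini: {gini:.3f}")  # 0.875, since the AUC here is 0.9375
```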
The Concordant–Discordant Ratio is used in ranking and survival-analysis tasks. It gauges how well the predicted ordering agrees with the actual outcomes: a pair is concordant when the instance with the positive outcome receives the higher score. A higher ratio indicates a better capacity to order instances correctly.
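A brute-force sketch of the idea: for every (positive, negative) pair, check whether the positive instance received the higher score (the function name and data are illustrative):

```python
def concordance(y_true, y_score):
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    concordant = sum(p > n for p in pos for n in neg)  # positive ranked higher
    discordant = sum(p < n for p in pos for n in neg)  # negative ranked higher
    ties = len(pos) * len(neg) - concordant - discordant
    return concordant, discordant, ties

y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.3]
print(concordance(y_true, y_score))  # (15, 1, 0) out of 16 pairs
```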
Cross-validation evaluates the effectiveness of predictive models by repeatedly splitting the data into training and testing sets. It helps estimate how well a model will generalise to new data and guards against overfitting.
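A minimal sketch of 5-fold cross-validation with scikit-learn; the dataset, model, and scoring choice are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)  # high max_iter so it converges

scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())  # average and spread across the 5 folds
```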
In addition to the ones described above, several other performance measures are frequently used in machine learning, including the R2-Score for regression models, logarithmic loss, and classification accuracy. Each offers a different viewpoint on the model's performance and can be used to assess particular criteria.
Logarithmic loss, often known as Log Loss, measures a classification model's performance by penalising inaccurate predictions. It considers the predicted probabilities rather than just the predicted labels.
The R2-Score, or coefficient of determination, is an evaluation metric for regression models. It calculates the percentage of the dependent variable's variance that the independent variables can account for; the higher the R2-Score, the better the model fits the data.
In this article, we discussed the most widely used performance metrics for classification and regression, which makes them crucial to know. For classification, there are still more metrics designed specifically for multi-class and multi-label problems, such as the Kappa Score, Precision at K, and Average Precision at K.
If you’re interested to learn more about machine learning, check out IIIT-B & upGrad’s PG Diploma in Machine Learning & AI which is designed for working professionals and offers 450+ hours of rigorous training, 30+ case studies & assignments, IIIT-B Alumni status, 5+ practical hands-on capstone projects & job assistance with top firms.