Methods for comparing data mining models

How do we compare the relative performance of several data mining models? Previously, we discussed some basic model evaluation methods and metrics. Here we look at several more: ROC curves, the kappa statistic, mean squared error, mean absolute error, relative squared error, relative absolute error, and the coefficient of correlation.

ROC curves

Receiver Operating Characteristic (ROC) curves plot the true positive rate against the false positive rate as the classification threshold varies, thereby characterising the trade-off between hits and false alarms.

The area under the curve (AUC) summarises how well the model separates the two classes: the larger the area, the better the model. An area of 0.5 corresponds to random guessing, so if the area is less than 0.5, random guesses will outperform the model.
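As a rough illustration, the sketch below builds the points of a ROC curve by sweeping a threshold down through the classifier's scores, then estimates the area under the curve with the trapezoidal rule. The function names `roc_points` and `auc` and the sample data are hypothetical, and the sketch assumes binary labels (1 for positive, 0 for negative) with both classes present.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points, one per distinct score."""
    positives = sum(labels)
    negatives = len(labels) - positives
    # Rank instances from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last instance sharing this score (handles ties).
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve, estimated with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical test-set labels and classifier scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(labels, scores)
print(curve)
print(auc(curve))
```

In this made-up example, eight of the nine positive/negative pairs are ranked correctly, so the area comes out at 8/9. A classifier that gives every instance the same score yields an area of 0.5.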

Kappa statistic

The kappa statistic is an index that compares correct classifications against chance classifications. It can be thought of as a chance-corrected measurement of agreement. Possible values range from -1 for complete disagreement, to 1 for perfect agreement.

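Kappa = (TP + TN – Ctp – Ctn) / (Total – Ctp – Ctn)

where TP and TN are the observed counts of true positives and true negatives, and Ctp and Ctn are the counts of true positives and true negatives expected by chance.

In this case Kappa = (80 + 40 – 60 – 20) / (135 – 60 – 20) = 40 / 55 = 72.7%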
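The calculation can be sketched in a few lines, using the usual chance-corrected form (observed agreement minus chance agreement, divided by total minus chance agreement). The function names are hypothetical. The second helper assumes the standard way of deriving the chance counts from the row and column totals of a two-class confusion matrix, which the text above does not spell out.

```python
def kappa_from_counts(tp, tn, chance_tp, chance_tn, total):
    """Kappa from observed correct counts and the counts expected by chance."""
    observed = tp + tn
    expected = chance_tp + chance_tn
    return (observed - expected) / (total - expected)


def kappa_from_confusion(tp, fp, fn, tn):
    """Kappa from a two-class confusion matrix.

    Assumption: chance counts come from the marginal totals, e.g.
    chance_tp = (actual positives * predicted positives) / total.
    """
    total = tp + fp + fn + tn
    chance_tp = (tp + fn) * (tp + fp) / total
    chance_tn = (tn + fp) * (tn + fn) / total
    return kappa_from_counts(tp, tn, chance_tp, chance_tn, total)


# The figures quoted in the text: (80 + 40 - 60 - 20) / (135 - 60 - 20).
print(round(kappa_from_counts(80, 40, 60, 20, 135), 3))  # 0.727

# A hypothetical confusion matrix, for illustration only.
print(kappa_from_confusion(tp=20, fp=5, fn=10, tn=15))
```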

Mean Square Error

MSE is used to evaluate models that predict numeric (not nominal) values. Together with its square root, the root mean squared error (RMSE), it is the most commonly used measure. The direction of the error does not matter, because the square of any number is never negative. However, squaring also tends to exaggerate the effect of outliers.
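A minimal sketch of both measures follows. The function names and the sample values are made up for illustration; the last prediction is deliberately far off to show how a single outlier dominates the squared error.

```python
import math


def mean_squared_error(actual, predicted):
    """Average of the squared differences between predictions and actual values."""
    return sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the MSE, expressed in the same units as the target."""
    return math.sqrt(mean_squared_error(actual, predicted))


# Hypothetical data: three errors of size 1 and one outlier error of size 10.
actual = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 11.0, 15.0, 26.0]

print(mean_squared_error(actual, predicted))       # 25.75
print(root_mean_squared_error(actual, predicted))
```

Here the squared errors are 1, 1, 1 and 100, so the one outlier contributes 100 of the 103 units of total squared error.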

Mean absolute error

MAE takes the absolute value of each error (making all negative errors positive) and averages them. As with MSE, the direction of the error does not matter; however, the effects of outliers are not as exaggerated.
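To see the difference in outlier sensitivity, the sketch below computes the MAE alongside the root mean squared error for two sets of hypothetical predictions that differ only in one outlying value. The function names and data are illustrative only.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between predictions and actual values."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the mean squared error, included here for comparison."""
    mse = sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)


actual = [10.0, 12.0, 14.0, 16.0]
no_outlier = [11.0, 11.0, 15.0, 17.0]    # every error has size 1
with_outlier = [11.0, 11.0, 15.0, 26.0]  # last error has size 10

for name, predicted in [("no outlier", no_outlier), ("with outlier", with_outlier)]:
    mae = mean_absolute_error(actual, predicted)
    rmse = root_mean_squared_error(actual, predicted)
    print(f"{name}: MAE={mae:.2f} RMSE={rmse:.2f}")
```

Without the outlier both measures equal 1. With it, the MAE rises to 3.25 while the RMSE rises to a little over 5, which reflects the extra weight that squaring gives to large errors.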

Relative Square Error

Sometimes it is necessary to know the relative rather than the absolute error value. An MSE or MAE of 500 is huge if your average instance has a value of 1000. If your average instance has a value of 1 million, then 500 is a very small error. Thus, to objectively compare the error rates of two numeric models, one needs to consider the size of the error relative to the average value of the instances.
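One common way to make the error relative, assumed here rather than taken from the text, is to divide the model's total error by the total error of a simple baseline that always predicts the mean of the actual values. The sketch below applies that idea to both the squared and the absolute error; the function names and the data are hypothetical.

```python
def relative_squared_error(actual, predicted):
    """Total squared error of the model divided by that of a predict-the-mean baseline."""
    mean_actual = sum(actual) / len(actual)
    model_error = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    # The baseline error is zero if all actual values are identical (division by zero).
    baseline_error = sum((a - mean_actual) ** 2 for a in actual)
    return model_error / baseline_error


def relative_absolute_error(actual, predicted):
    """Total absolute error of the model divided by that of a predict-the-mean baseline."""
    mean_actual = sum(actual) / len(actual)
    model_error = sum(abs(p - a) for a, p in zip(actual, predicted))
    baseline_error = sum(abs(a - mean_actual) for a in actual)
    return model_error / baseline_error


# Hypothetical data on two very different scales.
small_actual, small_predicted = [900, 1000, 1100], [950, 1000, 1050]
large_actual = [a * 1000 for a in small_actual]
large_predicted = [p * 1000 for p in small_predicted]

print(relative_squared_error(small_actual, small_predicted))   # 0.25
print(relative_squared_error(large_actual, large_predicted))   # 0.25
print(relative_absolute_error(small_actual, small_predicted))  # 0.5
```

Under this definition a value below 1 means the model beats the baseline, and the result does not change when the data is rescaled, which is what makes models on differently sized targets comparable.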


Coefficient of correlation

The coefficient of correlation measures the strength of the linear relationship between two attributes. Its value ranges from -1 for perfect negative correlation to 1 for perfect positive correlation, with 0 indicating no linear correlation.

To use correlation in evaluating a model, one computes the correlation coefficient between the actual values in the test data set and the predictions of the model. The model whose correlation coefficient is closest to +1 is deemed the best predictor.
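As a sketch of this procedure, the code below computes the Pearson correlation coefficient between actual values and each model's predictions, then picks the model with the highest coefficient. The function name `correlation`, the model names, and the data are all hypothetical.

```python
import math


def correlation(actual, predicted):
    """Pearson correlation coefficient between actual values and predictions."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    covariance = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    spread_a = sum((a - mean_a) ** 2 for a in actual)
    spread_p = sum((p - mean_p) ** 2 for p in predicted)
    # Undefined (division by zero) if either series is constant.
    return covariance / math.sqrt(spread_a * spread_p)


# Hypothetical test-set values and predictions from three candidate models.
actual = [1.0, 2.0, 3.0, 4.0]
predictions = {
    "model_a": [1.1, 1.9, 3.2, 3.8],  # close to the actual values
    "model_b": [4.0, 3.0, 2.0, 1.0],  # exactly reversed
    "model_c": [2.0, 4.0, 6.0, 8.0],  # always double the actual value
}

coefficients = {name: correlation(actual, p) for name, p in predictions.items()}
best = max(coefficients, key=coefficients.get)
print(coefficients)
print(best)
```

Note that `model_c` achieves a coefficient of 1 even though every prediction is twice the true value: correlation rewards predictions that move in step with the actual values, not predictions that are close to them, so it is worth reading alongside an error measure.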

Deciding which performance metric to use

The performance metric is decided on a case-by-case basis, according to the needs of the problem domain. One should consider what the costs of each type of error are, and therefore which errors one is trying to minimise. In addition, some metrics are applicable only to numeric problems, and others only to nominal problems.

When mining data, it is important to use several learning algorithms, produce several models, and then evaluate each of them. This lends itself to an iterative process where, after evaluation, one may:

  • Select a different algorithm
  • Use different parameters for the algorithm
  • Alter the pre-processing used
  • Collect new data or different data
  • Redefine the problem entirely

Copyright © 2008-present Brendan Graetz