Evaluation

In this section, we describe the evaluation measures that are used to rank the different algorithms. First, the performance measures for multi-class classification are introduced. Second, we describe which measures are calculated for the participating algorithms, how they are presented and how the algorithms will be ranked.

1      Performance measures

1.1      Accuracy for multi-class classification

Classification accuracy, or its complement the error rate, is often used to assess classification performance. For binary classification, accuracy is defined as the number of correctly classified samples divided by the total number of samples. When extending the accuracy measure to three-class classification, two main choices can be made (Hand & Till, 2001). The difference between them is whether or not the distinction between the other two classes is taken into account when the performance for one class is assessed.

Table 1 shows the confusion matrix for classification of three classes.

Table 1: Confusion matrix for a three-class classification problem

                                  True class
                             c_0        c_1        c_2
Hypothesized class  C_0    n_{0,0}    n_{0,1}    n_{0,2}
                    C_1    n_{1,0}    n_{1,1}    n_{1,2}
                    C_2    n_{2,0}    n_{2,1}    n_{2,2}
Column totals                n_0        n_1        n_2

For the simple calculation of the accuracy [1], the diagonal elements of the matrix, i.e. the correctly classified samples of each class, are summed and divided by the total number of samples (n).

    \text{accuracy} = \frac{n_{0,0} + n_{1,1} + n_{2,2}}{n}                                                [1]

The alternative, the average accuracy [2], assesses the accuracy separately for each class without distinguishing between the other two classes. For the calculation of the accuracy for i = 0, the true positive samples (tp_i) are n_{0,0}, and the true negative samples (tn_i) are n_{1,1}, n_{1,2}, n_{2,1} and n_{2,2}. The separate per-class accuracies are averaged to give a final accuracy.

    \text{accuracy}_{\text{average}} = \frac{1}{3} \sum_{i=0}^{2} \frac{tp_i + tn_i}{n}                    [2]

Equation [2] is mainly of interest when the class sizes are very different. In this challenge, we use the accuracy of equation [1], as it provides a better measure of the overall classification accuracy (Hand & Till, 2001).
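As an illustration, both accuracy definitions could be computed from a confusion matrix laid out as in Table 1 with a few lines of Python (a sketch using numpy; the function names and counts are illustrative and not those of the official evaluation scripts):

    import numpy as np

    def accuracy(cm):
        """Accuracy [1]: sum of the diagonal (correctly classified samples) over all samples."""
        cm = np.asarray(cm, dtype=float)
        return np.trace(cm) / cm.sum()

    def average_accuracy(cm):
        """Average accuracy [2]: per-class accuracy (tp_i + tn_i) / n, averaged over the classes."""
        cm = np.asarray(cm, dtype=float)
        n = cm.sum()
        per_class = []
        for i in range(cm.shape[0]):
            tp = cm[i, i]
            # tn_i: samples that are neither hypothesized nor truly of class i
            tn = n - cm[i, :].sum() - cm[:, i].sum() + cm[i, i]
            per_class.append((tp + tn) / n)
        return float(np.mean(per_class))

    # Rows = hypothesized class, columns = true class, as in Table 1 (made-up counts).
    cm = [[50, 5, 2],
          [4, 40, 6],
          [1, 5, 45]]
    print(accuracy(cm), average_accuracy(cm))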

1.2       AUC for multi-class classification

The receiver operating characteristic (ROC) is a two-dimensional graph with the sensitivity on the y-axis and 1-specificity on the x-axis. The performance of a classifier can be visualized as a curve in this graph by applying a range of thresholds to the probabilistic output of the classifier and calculating the sensitivity and specificity at each threshold. The area under this curve (AUC) is a performance measure that is equivalent to the probability that a randomly chosen positive sample will receive a higher probability of being positively classified than a randomly chosen negative sample (Fawcett, 2006). The advantage of ROC analysis, and accordingly of the AUC measure, is that the performance of a classifier is measured independently of the chosen threshold.

With more than two classes, the ROC curve becomes very complex: for three-class classification, the ROC surface has six dimensions. Therefore, multi-class ROC analysis is commonly reduced to multiple per-class or pairwise ROC curves (Fawcett, 2006).
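For instance, per-class (one-vs-rest) ROC curves can be obtained from the output probabilities with standard tooling; the snippet below is a sketch using scikit-learn, with randomly generated labels and probabilities as stand-ins for real classifier output:

    import numpy as np
    from sklearn.metrics import roc_curve, auc

    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 3, size=100)          # labels: 0 = CN, 1 = MCI, 2 = AD (assumed coding)
    probs = rng.dirichlet(np.ones(3), size=100)    # rows sum to 1: output probabilities per class

    for i, name in enumerate(["CN", "MCI", "AD"]):
        # One-vs-rest: class i against the other two classes.
        fpr, tpr, _ = roc_curve(y_true == i, probs[:, i])
        print(name, "one-vs-rest AUC =", auc(fpr, tpr))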

As for the accuracy in the previous section, the multi-class AUC can be determined using two different approaches. The difference between the two approaches is whether or not the third class is taken into account when the difference between a pair of classes is assessed.

First, Provost and Domingos (2001) calculate the multi-class AUC by generating an ROC curve for every class and measuring the per-class AUCs. These per-class AUCs are then averaged using the class priors:

    \text{AUC}_{\text{total}} = \sum_{i=0}^{2} \text{AUC}(c_i) \, p(c_i)                                   [3]

This method has the advantage that the separate ROC curves can easily be generated and visualized. However, because an AUC is calculated for every class separately, the method is sensitive to the class distributions. The class priors are therefore used as weights in the averaging, but the total AUC still depends on the class sizes.
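A minimal sketch of this prior-weighted averaging (equation [3]), with the class priors p(c_i) estimated from the test labels, could look as follows; recent versions of scikit-learn offer a comparable measure via roc_auc_score(..., multi_class='ovr', average='weighted').

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def multiclass_auc_prior_weighted(y_true, probs):
        """Equation [3]: one-vs-rest AUC per class, averaged with the class priors as weights."""
        y_true = np.asarray(y_true)
        total = 0.0
        for i in range(probs.shape[1]):
            auc_i = roc_auc_score(y_true == i, probs[:, i])  # AUC of class c_i vs. the other two classes
            p_i = np.mean(y_true == i)                       # class prior p(c_i), estimated from the labels
            total += auc_i * p_i
        return total

    # e.g. multiclass_auc_prior_weighted(y_true, probs) with the arrays from the previous snippet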

Second, Hand and Till (2001) proposed a different method for the multi-class AUC, which is based on calculating an AUC for every pair of classes without using information from the third class. The method relies on the principle that the AUC is equivalent to the probability that a randomly chosen member of class c_i will have a larger estimated probability of belonging to class c_i than a randomly chosen member of class c_j. Using this principle, the AUC can be calculated directly from the ranks of the test samples instead of first calculating the ROC curves. To achieve this, the n_i class c_i test samples and the n_j class c_j test samples are ranked together in increasing order of the output probability for class c_i, and S_i is the sum of the ranks of the class c_i test samples. The AUC for a class c_i given another class c_j, \hat{A}(c_i \mid c_j), is then given by equation [4]; see (Hand & Till, 2001) for the complete derivation.

    \hat{A}(c_i \mid c_j) = \frac{S_i - n_i (n_i + 1)/2}{n_i n_j}                                          [4]

For situations with three or more classes, \hat{A}(c_i \mid c_j) \neq \hat{A}(c_j \mid c_i) in general. Therefore, the average of both is used:

    \hat{A}(c_i, c_j) = \frac{1}{2} \left[ \hat{A}(c_i \mid c_j) + \hat{A}(c_j \mid c_i) \right]           [5]

The overall AUC [6] is obtained by averaging equation [5] over all pairs of classes; for c classes, the number of pairs of classes is c(c-1)/2:

    \text{AUC}_{\text{total}} = \frac{2}{c(c-1)} \sum_{i<j} \hat{A}(c_i, c_j)                              [6]

In contrast to the accuracy, the AUC does not require a threshold on the classifier's output probabilities and therefore generally does not rely on the class priors (Hand & Till, 2001). However, the first multi-class approach does depend on the class priors, as these are used for averaging the per-class AUCs. Therefore, the second approach is adopted, as this method is independent of the class sizes (Fawcett, 2006).
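A minimal sketch of the adopted pairwise measure (equations [4-6]), computed directly from ranks as described above, is given below; it assumes integer labels 0, 1 and 2 that index the columns of the probability matrix, and is not the official evaluation code.

    import numpy as np
    from itertools import combinations
    from scipy.stats import rankdata

    def auc_i_given_j(y_true, probs, i, j):
        """Equation [4]: rank the class-i and class-j samples by the output probability for class i."""
        mask = (y_true == i) | (y_true == j)
        ranks = rankdata(probs[mask, i])          # increasing order; ties receive average ranks
        labels = y_true[mask]
        n_i, n_j = np.sum(labels == i), np.sum(labels == j)
        s_i = ranks[labels == i].sum()            # S_i: sum of the ranks of the class-i samples
        return (s_i - n_i * (n_i + 1) / 2) / (n_i * n_j)

    def multiclass_auc_hand_till(y_true, probs):
        """Equations [5] and [6]: average A(c_i, c_j) over all c(c-1)/2 pairs of classes."""
        y_true = np.asarray(y_true)
        classes = np.unique(y_true)
        pair_aucs = [(auc_i_given_j(y_true, probs, i, j) + auc_i_given_j(y_true, probs, j, i)) / 2
                     for i, j in combinations(classes, 2)]
        return float(np.mean(pair_aucs))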

1.3       True positive fraction

For binary classification in computer-aided diagnosis, the sensitivity and the specificity are often reported in addition to the accuracy. For this multi-class application, the true positive fraction (TPF, [7]) of each of the three classes provides the same information: the TPF of the diseased classes (TPF_AD, TPF_MCI) can be interpreted as the two-class sensitivity, and the TPF of the control group (TPF_CN) equals the two-class specificity.

    \text{TPF}_i = \frac{n_{i,i}}{n_i}                                                                     [7]
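Computed from the confusion matrix of Table 1 (rows hypothesized, columns true), the three TPF values could be obtained as in this brief sketch (made-up counts; class order CN, MCI, AD assumed):

    import numpy as np

    def true_positive_fractions(cm):
        """Equation [7]: TPF_i = n_{i,i} / n_i, with n_i the column total of true class i."""
        cm = np.asarray(cm, dtype=float)
        return np.diag(cm) / cm.sum(axis=0)

    cm = [[50, 5, 2],
          [4, 40, 6],
          [1, 5, 45]]
    tpf_cn, tpf_mci, tpf_ad = true_positive_fractions(cm)
    print(tpf_cn, tpf_mci, tpf_ad)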

1.4       Confidence intervals and statistical testing

Confidence intervals on the accuracy, AUC and TPF will be determined with bootstrapping on the test set (1000 resamples). To assess whether the difference in performance between two algorithms is significant, the McNemar test (Dietterich, 1996) is used. 
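As a rough sketch of these two procedures, the percentile bootstrap (1000 resamples) and a McNemar test with continuity correction could be implemented as follows; the official evaluation scripts may use a different variant.

    import numpy as np
    from scipy.stats import chi2

    def bootstrap_ci(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
        """Percentile bootstrap confidence interval for the accuracy on the test set."""
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        rng = np.random.default_rng(seed)
        n = len(y_true)
        accs = []
        for _ in range(n_resamples):
            idx = rng.integers(0, n, size=n)          # resample the test cases with replacement
            accs.append(np.mean(y_true[idx] == y_pred[idx]))
        return np.percentile(accs, [100 * alpha / 2, 100 * (1 - alpha / 2)])

    def mcnemar_p_value(y_true, pred_a, pred_b):
        """McNemar test on the cases where exactly one of the two algorithms is correct."""
        y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
        b = np.sum((pred_a == y_true) & (pred_b != y_true))   # A correct, B wrong
        c = np.sum((pred_a != y_true) & (pred_b == y_true))   # A wrong, B correct
        stat = (abs(b - c) - 1) ** 2 / (b + c)                # chi-square statistic with continuity correction
        return chi2.sf(stat, df=1)                            # p-value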

2     Final results and ranking

In this challenge, teams will submit results on a large test set. Submitting the diagnostic label for each sample is obligatory; additionally submitting the output probabilities for each label is preferred. The Python scripts for evaluation can be downloaded from the submit page.

Accuracy [1] and the TPF_i [7] for the three classes are calculated from the diagnostic labels. For every class, an ROC curve is calculated from the output probabilities (Figure 1), showing the ability of the classifier to separate that class from the other two classes. An overall AUC is calculated using equations [4-6]. These measures are calculated separately on the test and training sets.

For every team, a confusion matrix (Table 1) is computed on the test data.

Table 2 reports the results on the training data for every team. This table can be made by every team with the provided evaluation scripts.

Table 3 and Table 4 report the results on the test data, and will be presented at the workshop.

Submitted algorithms are ranked based on the accuracy on the test set (Table 4). Algorithms for which output probabilities are available are also ranked on the AUC on the test set. The algorithm with the best accuracy (rank = 1) on the test set is the winning algorithm. If two or more algorithms have equal accuracy, they are assigned the average rank.
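The ranking, including average ranks for ties, can for example be produced with scipy's rankdata; the accuracies below are hypothetical.

    from scipy.stats import rankdata

    # Hypothetical test-set accuracies per team (higher is better).
    accuracies = {"Team 1": 90.0, "Team 2": 70.0, "Team 3": 80.0, "Team 4": 80.0}

    # rankdata ranks in ascending order and assigns the average rank to ties,
    # so rank the negated accuracies to give rank 1 to the best algorithm.
    ranks = rankdata([-a for a in accuracies.values()], method="average")
    for team, rank in zip(accuracies, ranks):
        print(team, rank)   # Team 1: 1.0, Team 2: 4.0, Team 3 and Team 4: 2.5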

Figure 1: Example ROC curves for each of the three classes separately.

Table 2: Performance on the training data for Team X

(This is just an example; these results are not real.)

Data               | Accuracy (CI)    | TPF_CN (CI)      | TPF_MCI (CI)     | TPF_AD (CI)      | AUC (CI)         | AUC_CN (CI)      | AUC_MCI (CI)     | AUC_AD (CI)
All training data  | 90.0 (85.0-95.0) | 95.0 (90.0-98.0) | 95.0 (90.0-98.0) | 95.0 (90.0-98.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0)
EMC training data  | 90.0 (85.0-95.0) | 95.0 (90.0-98.0) | 95.0 (90.0-98.0) | 95.0 (90.0-98.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0)
UP training data   | 90.0 (85.0-95.0) | 95.0 (90.0-98.0) | 95.0 (90.0-98.0) | 95.0 (90.0-98.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0)
VUmc training data | 90.0 (85.0-95.0) | 95.0 (90.0-98.0) | 95.0 (90.0-98.0) | 95.0 (90.0-98.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0)

Table 3: Performance on the test data for Team X

(This is just an example; these results are not real.)

Data           | Accuracy (CI)    | TPF_CN (CI)      | TPF_MCI (CI)     | TPF_AD (CI)      | AUC (CI)         | AUC_CN (CI)      | AUC_MCI (CI)     | AUC_AD (CI)
All test data  | 90.0 (85.0-95.0) | 95.0 (90.0-98.0) | 95.0 (90.0-98.0) | 95.0 (90.0-98.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0)
EMC test data  | 90.0 (85.0-95.0) | 95.0 (90.0-98.0) | 95.0 (90.0-98.0) | 95.0 (90.0-98.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0)
UP test data   | 90.0 (85.0-95.0) | 95.0 (90.0-98.0) | 95.0 (90.0-98.0) | 95.0 (90.0-98.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0)
VUmc test data | 90.0 (85.0-95.0) | 95.0 (90.0-98.0) | 95.0 (90.0-98.0) | 95.0 (90.0-98.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0)

Table 4: Performance and ranking of the algorithms based on the test data

(This is just an example; these results are not real.)

Method | Accuracy (CI)    | TPF_CN (CI)      | TPF_MCI (CI)     | TPF_AD (CI)      | AUC (CI)         | AUC_CN (CI)      | AUC_MCI (CI)     | AUC_AD (CI)      | Rank (accuracy) | Rank (AUC)
Team 1 | 90.0 (85.0-95.0) | 95.0 (90.0-98.0) | 95.0 (90.0-98.0) | 95.0 (90.0-98.0) | 88.8 (85.0-91.0) | 88.8 (85.0-91.0) | 88.8 (85.0-91.0) | 88.8 (85.0-91.0) | 1               | 1
Team 2 | 70.0 (85.0-95.0) | 75.0 (70.0-78.0) | 75.0 (70.0-78.0) | 75.0 (70.0-78.0) | 66.6 (65.0-71.0) | 66.6 (65.0-71.0) | 66.6 (65.0-71.0) | 66.6 (65.0-71.0) | 3               | 3
Team 3 | 80.0 (85.0-95.0) | 85.0 (80.0-88.0) | 85.0 (80.0-88.0) | 85.0 (80.0-88.0) | 77.7 (75.0-81.0) | 77.7 (75.0-81.0) | 77.7 (75.0-81.0) | 77.7 (75.0-81.0) | 2               | 2

2.1       Missing data

If an algorithm fails to produce an output for certain subjects, these subjects are considered misclassified and assigned to a fourth class. This fourth class is taken into account in the calculation of all performance measures mentioned above. For the per-class ROC curves, the sensitivity and specificity are calculated on the subjects that were classified by the algorithm and are subsequently scaled to the total dataset to take the missing samples into account.
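As an illustration of this rule, the sketch below maps missing outputs to an extra label before the accuracy and the (here 4x3) confusion matrix are computed; the rescaling of the per-class ROC curves is not shown, and the labels are made up.

    import numpy as np

    MISSING = 3   # extra label for subjects the algorithm failed to classify

    def fill_missing(y_pred):
        """Replace missing outputs (encoded as None in this sketch) by the fourth-class label."""
        return np.array([MISSING if p is None else p for p in y_pred])

    y_true = np.array([0, 1, 2, 2, 0])              # made-up true labels
    y_pred = fill_missing([0, 1, None, 2, None])    # two subjects without an output

    # Accuracy over all subjects: the missing ones count as misclassified.
    print(np.mean(y_pred == y_true))

    # Confusion matrix with 4 hypothesized classes (incl. missing) and 3 true classes.
    cm = np.zeros((4, 3), dtype=int)
    for p, t in zip(y_pred, y_true):
        cm[p, t] += 1
    print(cm)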

References

Dietterich, T. (1996). Statistical tests for comparing supervised classification learning algorithms. Oregon State University Technical Report, 1–24.

Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recogn Lett, 27(8), 861–874.

Hand, D. J., & Till, R. J. (2001). A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems. Mach Learn, 45, 171–186.

Provost, F., & Domingos, P. (2001). Well-trained PETs: Improving probability estimation trees. CeDER Working Paper #IS-00-04. New York, NY, USA. Retrieved from http://medcontent.metapress.com/index/A65RM03P4874243N.pdf