Evaluation
In this section, we describe the specific evaluation measures that are used to rank the different algorithms. First, the performance measures for multi-class classification are introduced. Second, we describe which measures are calculated for the participating algorithms, how these are presented, and how the algorithms are ranked.
1 Performance measures
1.1 Accuracy for multi-class classification
Classification accuracy, or equivalently the error rate, is often used to assess classification performance. For binary classification, accuracy is defined as the number of correctly classified samples divided by the total number of samples. When extending the accuracy measure to three-class classification, two main choices can be made (Hand & Till, 2001). They differ in whether or not the distinction between the other two classes is taken into account when the performance for one class is assessed.
Table 1 shows the confusion matrix for classification of three classes.
Table 1: Confusion matrix for a three-class classification problem

| Hypothesized class | True class c_{0} | True class c_{1} | True class c_{2} |
|---|---|---|---|
| C_{0} | n_{0,0} | n_{0,1} | n_{0,2} |
| C_{1} | n_{1,0} | n_{1,1} | n_{1,2} |
| C_{2} | n_{2,0} | n_{2,1} | n_{2,2} |
| Column totals | n_{0} | n_{1} | n_{2} |
For simple calculation of the accuracy [1], all diagonal elements of the matrix, i.e. the correctly classified samples, are divided by the total number of samples (n):

acc = (n_{0,0} + n_{1,1} + n_{2,2}) / n. [1]
The alternative, the average accuracy [2], assesses the accuracy separately for each class without distinguishing between the other two classes. For calculation of the accuracy for class i=0, the true positive samples (tp_{0}) are n_{0,0}, and the true negative samples (tn_{0}) are n_{1,1}, n_{1,2}, n_{2,1} and n_{2,2}. The separate per-class accuracies are averaged to give a final accuracy:

acc_{avg} = (1/3) Σ_{i=0}^{2} (tp_{i} + tn_{i}) / n. [2]
Equation [2] is mainly applicable when the class sizes are very different. In this challenge, we use the accuracy in equation [1], as it provides a better measure of the overall classification accuracy (Hand & Till, 2001).
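As a concrete illustration, both accuracy variants can be computed from a confusion matrix laid out as in Table 1 (rows: hypothesized class, columns: true class). This is a minimal sketch in the spirit of the challenge's Python evaluation scripts, not the scripts themselves; the function names and the example matrix are made up.

```python
import numpy as np

def simple_accuracy(cm):
    # Equation [1]: diagonal (correctly classified) over all samples.
    return np.trace(cm) / cm.sum()

def average_accuracy(cm):
    # Equation [2]: per-class accuracy, ignoring confusions among
    # the remaining classes, averaged over the classes.
    n = cm.sum()
    per_class = []
    for i in range(cm.shape[0]):
        tp = cm[i, i]
        # tn: samples neither hypothesized nor truly class i
        rest = np.delete(np.delete(cm, i, axis=0), i, axis=1)
        per_class.append((tp + rest.sum()) / n)
    return np.mean(per_class)

# made-up example confusion matrix (Table 1 layout)
cm = np.array([[50, 5, 2],
               [3, 40, 4],
               [1, 6, 45]])
print(simple_accuracy(cm))   # 135/156 ≈ 0.865
print(average_accuracy(cm))  # 426/468 ≈ 0.910
```

Note that the average accuracy is higher here: confusions between the two non-target classes do not count against class i in equation [2].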
1.2 AUC for multi-class classification
The receiver operating characteristic (ROC) is a two-dimensional graph with the sensitivity on the y-axis and 1-specificity on the x-axis. The performance of a classifier can be visualized as a curve in this graph by applying a range of thresholds to the probabilistic output of the classifier and calculating the sensitivity and specificity. The area under this curve (AUC) is a performance measure which is equivalent to the probability that a randomly chosen positive sample will have a higher probability of being positively classified than a randomly chosen negative sample (Fawcett, 2006). The advantage of ROC analysis, and accordingly of the AUC measure, is that the performance of a classifier is measured independently of the chosen threshold.
With more than two classes, the ROC curve becomes very complex; for three-class classification, the ROC surface is 6-dimensional. Therefore, multi-class ROC analysis is often generalized to multiple per-class or pairwise ROC curves (Fawcett, 2006).
Like the accuracy in the previous section, the multi-class AUC can be determined using two different approaches. The approaches differ in whether or not the third class is taken into account when the difference between a pair of classes is assessed.
First, Provost and Domingos (2001) calculate the multi-class AUC by generating a ROC curve for every class and measuring the per-class AUCs. These per-class AUCs are averaged using the class priors p(c_{i}):

AUC_{total} = Σ_{i} AUC(c_{i}) · p(c_{i}). [3]
This method has the advantage that the separate ROC curves can be easily generated and visualized. However, because it calculates an AUC for every class separately, it is sensitive to the class distributions; the class priors are used in the averaging, but the total AUC still depends on the class sizes.
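A sketch of this prior-weighted averaging, using the pairwise-comparison form of the binary AUC. The function names and example data are illustrative assumptions, not part of the challenge scripts.

```python
import numpy as np

def binary_auc(pos_scores, neg_scores):
    # Probability that a random positive sample scores higher than a
    # random negative one; ties count as one half (Fawcett, 2006).
    pos = np.asarray(pos_scores)[:, None]
    neg = np.asarray(neg_scores)[None, :]
    return (pos > neg).mean() + 0.5 * (pos == neg).mean()

def auc_prior_weighted(probs, labels):
    # Equation [3]: one-vs-rest AUC per class, weighted by class priors.
    labels = np.asarray(labels)
    total = 0.0
    for i in range(probs.shape[1]):
        prior = (labels == i).mean()
        total += prior * binary_auc(probs[labels == i, i],
                                    probs[labels != i, i])
    return total

# made-up, perfectly separable example: each sample's true class
# receives the highest output probability
labels = np.array([0, 0, 1, 1, 2, 2])
probs = np.array([[0.8, 0.1, 0.1], [0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1], [0.2, 0.7, 0.1],
                  [0.1, 0.1, 0.8], [0.1, 0.2, 0.7]])
print(auc_prior_weighted(probs, labels))  # 1.0 for this separable example
```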
Second, Hand and Till (2001) proposed a different method for multi-class AUC, which is based on calculating an AUC for every pair of classes without using information from the third class. The method is based on the principle that the AUC is equivalent to the probability that a randomly chosen member of class c_{i} will have a larger estimated probability of belonging to class c_{i} than a randomly chosen member of class c_{j}. Using this principle, the AUC can be calculated directly from the ranks of the test samples instead of first calculating the ROC curves. To achieve this, the class c_{i} and c_{j} test samples are ranked in increasing order of the output probability for class c_{i}, and S_{i} is the sum of the ranks of the class c_{i} test samples. The AUC for a class c_{i} given another class c_{j}, Â(i|j), is then given by equation [4]; see Hand & Till (2001) for the complete derivation:

Â(i|j) = (S_{i} - n_{i}(n_{i}+1)/2) / (n_{i} n_{j}), [4]

where n_{i} and n_{j} are the numbers of class c_{i} and c_{j} test samples.
For situations with three or more classes, Â(i|j) ≠ Â(j|i). Therefore, the average of both is used:

Â(i,j) = (Â(i|j) + Â(j|i)) / 2. [5]
The overall AUC is obtained by averaging equation [5] over all pairs of classes. The number of pairs of classes is c(c-1)/2:

M = (2 / (c(c-1))) Σ_{i<j} Â(i,j). [6]
In contrast to the accuracy, AUC measurement does not require a threshold on the classifier's output probabilities, and therefore the AUC generally does not depend on the class priors (Hand & Till, 2001). However, the first multi-class approach does depend on the class priors, as these are used for averaging the per-class AUCs. Therefore, the second approach for AUC is adopted, as this method is independent of the class sizes (Fawcett, 2006).
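The rank-based computation of equations [4]-[6] can be sketched as follows. The sketch assumes no tied probabilities within a class pair (Hand & Till use midranks for ties); function names and example data are illustrative.

```python
import numpy as np

def a_given(p_i, labels, i, j):
    # Equation [4]: rank the class c_i and c_j samples by the output
    # probability for class c_i; S_i is the rank sum of the c_i samples.
    mask = (labels == i) | (labels == j)
    scores, is_i = p_i[mask], labels[mask] == i
    ranks = np.empty(mask.sum())
    ranks[scores.argsort()] = np.arange(1, mask.sum() + 1)
    n_i, n_j = is_i.sum(), (~is_i).sum()
    return (ranks[is_i].sum() - n_i * (n_i + 1) / 2) / (n_i * n_j)

def hand_till_auc(probs, labels):
    # Equations [5] and [6]: average A(i|j) and A(j|i) per pair,
    # then average over all c(c-1)/2 pairs of classes.
    c = probs.shape[1]
    pair_aucs = [(a_given(probs[:, i], labels, i, j)
                  + a_given(probs[:, j], labels, j, i)) / 2  # equation [5]
                 for i in range(c) for j in range(i + 1, c)]
    return 2 * sum(pair_aucs) / (c * (c - 1))                # equation [6]

# made-up, perfectly separable example
labels = np.array([0, 0, 1, 1, 2, 2])
probs = np.array([[0.8, 0.1, 0.1], [0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1], [0.2, 0.7, 0.1],
                  [0.1, 0.1, 0.8], [0.1, 0.2, 0.7]])
print(hand_till_auc(probs, labels))  # 1.0 for this separable example
```

Because only the class c_{i} and c_{j} samples enter each pairwise term, the third class has no influence on Â(i,j), which is what makes the measure insensitive to class sizes.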
1.3 True positive fraction
For binary classifications in computer-aided diagnosis, the sensitivity and the specificity are often reported in addition to the accuracy. For this multi-class application, the true positive fraction (TPF, [7]) for each of the three classes provides the same information. The TPFs for the diseased classes (TPF_{AD}, TPF_{MCI}) can be interpreted as two-class sensitivities, and the TPF for the control group equals the two-class specificity:

TPF_{i} = n_{i,i} / n_{i}. [7]
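From the confusion matrix in Table 1, the TPF per class follows directly. An illustrative sketch (the matrix values are made up):

```python
import numpy as np

def true_positive_fraction(cm):
    # Equation [7]: n_{i,i} / n_{i}, with n_{i} the column total,
    # i.e. the number of samples whose true class is i (Table 1 layout).
    return np.diag(cm) / cm.sum(axis=0)

cm = np.array([[50, 5, 2],
               [3, 40, 4],
               [1, 6, 45]])
print(true_positive_fraction(cm))  # one TPF per class
```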
1.4 Confidence intervals and statistical testing
Confidence intervals on the accuracy, AUC and TPF will be determined with bootstrapping on the test set (1000 resamples). To assess whether the difference in performance between two algorithms is significant, the McNemar test (Dietterich, 1996) is used.
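A minimal sketch of both procedures: a percentile bootstrap on the test set, and the chi-square form of McNemar's test on the discordant classifications. The function names, the continuity correction, and the fixed seed are illustrative assumptions, not taken from the challenge's evaluation scripts.

```python
import numpy as np

def bootstrap_ci(y_true, y_pred, metric, n_resamples=1000, alpha=0.05, seed=0):
    # Percentile bootstrap confidence interval on a performance metric.
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = [metric(y_true[idx], y_pred[idx])
             for idx in rng.integers(0, n, size=(n_resamples, n))]
    return tuple(np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)]))

def mcnemar_chi2(correct_a, correct_b):
    # McNemar statistic (with continuity correction) on the samples
    # where exactly one of the two algorithms is correct.
    b = np.sum(correct_a & ~correct_b)   # A correct, B wrong
    c = np.sum(~correct_a & correct_b)   # A wrong, B correct
    return (abs(int(b) - int(c)) - 1) ** 2 / (b + c)

# made-up example: 95% CI on accuracy
accuracy = lambda yt, yp: np.mean(yt == yp)
y_true = np.array([0, 1, 2, 0, 1, 2])
y_pred = np.array([0, 1, 2, 0, 1, 1])
lo, hi = bootstrap_ci(y_true, y_pred, accuracy)
```

The McNemar statistic is compared against a chi-square distribution with one degree of freedom to decide significance.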
2 Final results and ranking
In this challenge, teams will submit outcomes on a large test set. Submitting the diagnostic label for each sample is obligatory; additionally submitting the output probabilities for each label is preferred. The Python scripts for evaluation can be downloaded from the submit page.
Accuracy [1] and the TPF_{i} [7] for the three classes are calculated from the diagnostic labels. For every class, an ROC curve is calculated from the output probabilities (Figure 1), showing the ability of the classifier to separate that class from the other two classes. A general AUC is calculated using equations [4]-[6]. These measures are calculated separately on the training and test sets.
For every team, a confusion matrix (Table 1) is made based on the test data.
Table 2 reports the results on the training data for every team. This table can be made by every team with the provided evaluation scripts.
Table 3 and Table 4 report the results on the test data, and will be presented at the workshop.
Submitted algorithms are ranked based on accuracy on the test set (Table 4). Algorithms for which output probabilities are available are also ranked on the AUC on the test set. The algorithm with the best accuracy (rank=1) on the test set is the winning algorithm. In case two or more algorithms have equal accuracy, the average rank is assigned to these algorithms.
Figure 1: Example ROC curves for each of the three classes separately.
Table 2: Performance on the training data for Team X (this is just an example, these results are not real)

| Data | Accuracy (CI) | TPF_{CN} (CI) | TPF_{MCI} (CI) | TPF_{AD} (CI) | AUC (CI) | AUC_{CN} (CI) | AUC_{MCI} (CI) | AUC_{AD} (CI) |
|---|---|---|---|---|---|---|---|---|
| All training data | 90.0 (85.0-95.0) | 95.0 (90.0-98.0) | 95.0 (90.0-98.0) | 95.0 (90.0-98.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0) |
| EMC training data | 90.0 (85.0-95.0) | 95.0 (90.0-98.0) | 95.0 (90.0-98.0) | 95.0 (90.0-98.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0) |
| UP training data | 90.0 (85.0-95.0) | 95.0 (90.0-98.0) | 95.0 (90.0-98.0) | 95.0 (90.0-98.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0) |
| VUmc training data | 90.0 (85.0-95.0) | 95.0 (90.0-98.0) | 95.0 (90.0-98.0) | 95.0 (90.0-98.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0) |
Table 3: Performance on the test data for Team X (this is just an example, these results are not real)

| Data | Accuracy (CI) | TPF_{CN} (CI) | TPF_{MCI} (CI) | TPF_{AD} (CI) | AUC (CI) | AUC_{CN} (CI) | AUC_{MCI} (CI) | AUC_{AD} (CI) |
|---|---|---|---|---|---|---|---|---|
| All test data | 90.0 (85.0-95.0) | 95.0 (90.0-98.0) | 95.0 (90.0-98.0) | 95.0 (90.0-98.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0) |
| EMC test data | 90.0 (85.0-95.0) | 95.0 (90.0-98.0) | 95.0 (90.0-98.0) | 95.0 (90.0-98.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0) |
| UP test data | 90.0 (85.0-95.0) | 95.0 (90.0-98.0) | 95.0 (90.0-98.0) | 95.0 (90.0-98.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0) |
| VUmc test data | 90.0 (85.0-95.0) | 95.0 (90.0-98.0) | 95.0 (90.0-98.0) | 95.0 (90.0-98.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0) | 88.8 (83.0-91.0) |
Table 4: Performance and ranking of the algorithms based on the test data (this is just an example, these results are not real)

| Method | Accuracy (CI) | TPF_{CN} (CI) | TPF_{MCI} (CI) | TPF_{AD} (CI) | AUC (CI) | AUC_{CN} (CI) | AUC_{MCI} (CI) | AUC_{AD} (CI) | Rank based on accuracy | Rank based on AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| Team 1 | 90.0 (85.0-95.0) | 95.0 (90.0-98.0) | 95.0 (90.0-98.0) | 95.0 (90.0-98.0) | 88.8 (85.0-91.0) | 88.8 (85.0-91.0) | 88.8 (85.0-91.0) | 88.8 (85.0-91.0) | 1 | 1 |
| Team 2 | 70.0 (85.0-95.0) | 75.0 (70.0-78.0) | 75.0 (70.0-78.0) | 75.0 (70.0-78.0) | 66.6 (65.0-71.0) | 66.6 (65.0-71.0) | 66.6 (65.0-71.0) | 66.6 (65.0-71.0) | 3 | 3 |
| Team 3 | 80.0 (85.0-95.0) | 85.0 (80.0-88.0) | 85.0 (80.0-88.0) | 85.0 (80.0-88.0) | 77.7 (75.0-81.0) | 77.7 (75.0-81.0) | 77.7 (75.0-81.0) | 77.7 (75.0-81.0) | 2 | 2 |
2.1 Missing data
If an algorithm fails to produce an output for certain subjects, these subjects are considered misclassified as a fourth class. This fourth class is taken into account in the calculation of all performance measures mentioned above. For the per-class ROC curves, sensitivity and specificity are calculated on the subjects that were classified by the algorithm and subsequently scaled to the total dataset to take the missing samples into account.
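The fourth-class bookkeeping can be sketched as follows; the label value 3 for the missing class and the function name are arbitrary illustrative choices.

```python
import numpy as np

def fill_missing_as_fourth_class(y_pred, output_available, missing_label=3):
    # Subjects without an algorithm output are assigned a fourth label,
    # so they can never match a true class and count as misclassified
    # in the accuracy and TPF calculations.
    y = np.asarray(y_pred).copy()
    y[~np.asarray(output_available)] = missing_label
    return y

y_pred = np.array([0, 1, 2, 0])
available = np.array([True, True, False, True])
print(fill_missing_as_fourth_class(y_pred, available))  # [0 1 3 0]
```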
References
Dietterich, T. (1996). Statistical tests for comparing supervised classification learning algorithms. Oregon State University Technical Report, 1–24.
Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recogn Lett, 27(8), 861–874.
Hand, D. J., & Till, R. J. (2001). A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems. Mach Learn, 45, 171–186.
Provost, F., & Domingos, P. (2001). Well-trained PETs: Improving probability estimation trees. CeDER Working Paper #IS-00-04. New York, NY, USA. Retrieved from http://medcontent.metapress.com/index/A65RM03P4874243N.pdf