Random subsampling and k-fold cross-validation are two common methods of resampling [Gei75, Sch93]. In random subsampling, the data is randomly partitioned into disjoint training and test sets multiple times. Accuracies obtained from each partition are averaged. In k-fold cross-validation, the data is randomly split into k mutually exclusive subsets of approximately equal size. A learning algorithm is trained and tested k times; each time it is tested on one of the k folds and trained using the remaining k-1 folds. The cross-validation estimate of accuracy is the overall number of correct classifications, divided by the number of examples in the data. The random subsampling method has the advantage that it can be repeated an indefinite number of times. However, it has the disadvantage that the test sets are not independently drawn with respect to the underlying distribution of examples D. Because of this, using a t-test for paired differences with random subsampling can lead to increased chance of Type I error, that is, identifying a significant difference when one does not actually exist [Die88]. Using a t-test on the accuracies produced on each fold of k-fold cross-validation has lower chance of Type I error but may not give a stable estimate of accuracy. It is common practice to repeat k-fold cross-validation n times in order to provide a stable estimate. However, this of course renders the test sets non-independent and increases the chance of Type I error. Unfortunately, there is no satisfactory solution to this problem. Alternative tests suggested by Dietterich [Die88] have low chance of Type I error but high chance of Type II error, that is, failing to identify a significant difference when one does actually exist.

Stratification is a process often applied during random subsampling and k-fold cross-validation.
Stratification ensures that the class distribution from the whole dataset is preserved in the training and test sets. Stratification has been shown to help reduce the variance of the estimated accuracy, especially for datasets with many classes [Koh95b]. Stratified random subsampling with a paired t-test is used herein to evaluate accuracy. Appendix D reports results for the major experiments using the 5×2cv paired t-test recommended by Dietterich [Die88]. As stated above, this test has decreased chance of Type I error, but increased chance of Type II error (see the appendix for details).
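
The evaluation procedure described above (stratified random subsampling followed by a paired t-test on the per-split accuracies) can be sketched roughly as follows. This is an illustrative sketch only, not the code used for the experiments reported here. All function names are hypothetical, the two learners are trivial stand-ins for the algorithms being compared, and the one-third test fraction, the ten repetitions, and the toy one-dimensional data are assumptions made purely for the example.

```python
import math
import random
import statistics
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction, rng):
    """Split indices into train/test so each class keeps its overall proportion."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_fraction))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test


def train_majority(X, y, train):
    """Stand-in learner A: always predict the most frequent training class."""
    majority = Counter(y[i] for i in train).most_common(1)[0][0]
    return lambda x: majority


def train_nearest_mean(X, y, train):
    """Stand-in learner B: predict the class whose training mean is closest."""
    values = defaultdict(list)
    for i in train:
        values[y[i]].append(X[i])
    means = {c: statistics.mean(v) for c, v in values.items()}
    return lambda x: min(means, key=lambda c: abs(x - means[c]))


def accuracy(predict, X, y, test):
    """Fraction of test examples classified correctly."""
    return sum(predict(X[i]) == y[i] for i in test) / len(test)


def paired_t_statistic(acc_a, acc_b):
    """t statistic for the paired differences acc_a[i] - acc_b[i]."""
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    if sd == 0:  # identical differences on every split
        return 0.0 if mean == 0 else math.copysign(math.inf, mean)
    return mean / (sd / math.sqrt(len(diffs)))


def compare(X, y, learner_a, learner_b, repeats=10, test_fraction=1 / 3, seed=0):
    """Evaluate both learners on the same stratified random splits."""
    rng = random.Random(seed)
    acc_a, acc_b = [], []
    for _ in range(repeats):
        train, test = stratified_split(y, test_fraction, rng)
        acc_a.append(accuracy(learner_a(X, y, train), X, y, test))
        acc_b.append(accuracy(learner_b(X, y, train), X, y, test))
    return acc_a, acc_b, paired_t_statistic(acc_a, acc_b)


# Toy data (assumed): two classes with a 2:1 class ratio, one numeric feature.
rng = random.Random(0)
y = [0] * 60 + [1] * 30
X = [rng.gauss(3.0 * label, 1.0) for label in y]

acc_a, acc_b, t = compare(X, y, train_majority, train_nearest_mean)
# With 10 splits there are 9 degrees of freedom; the two-tailed 5% critical
# value of the t distribution is then about 2.262.
print(f"mean A={statistics.mean(acc_a):.3f} mean B={statistics.mean(acc_b):.3f} t={t:.2f}")
```

Both learners are trained and tested on exactly the same splits, which is what makes the differences paired. Because the test sets drawn on different repetitions overlap, the resulting t statistic is subject to the inflated Type I error discussed above and should be read with that caveat in mind.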