Evaluating classification accuracy: the impact of resampling and dataset size
Correct prediction is an important criterion in evaluating classifiers in a
supervised learning context. The accuracy rate is a widely accepted indicator of
how likely a classifier is to predict correctly, that is, the complement of its
probability of misclassification. Nevertheless, true accuracy
remains unknown in most cases since it is not always possible to include the
whole population in a study, and it is difficult to calculate the probability
distribution of the data. Therefore, researchers often rely on computing an
estimate from the available data through sampling. When the available dataset is
small, it is common to rely on a resampling technique for accuracy
estimation. In this paper, we study how resampling and non-resampling
estimation methods, applied to datasets of different sizes, affect the variance
of the sample distribution. Initial results indicate a significant difference
in the variance of the sample distribution between resampling and
non-resampling estimation. We also found that the larger the dataset, the less
significant the difference in variance.
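
To make the comparison concrete, the following is a minimal sketch of how the variance of a resampling estimate and a non-resampling estimate could be compared across dataset sizes. It is not the experimental setup of this paper: the abstract does not name the resampling technique, the classifier, or the data, so the sketch assumes k-fold cross-validation as the resampling method, a single holdout split as the non-resampling method, a nearest-class-mean classifier, and synthetic one-dimensional Gaussian data. All function names and parameters are hypothetical.

```python
import random
import statistics


def make_dataset(n, rng):
    """Draw n labelled points (assumed data): class 0 ~ N(0, 1), class 1 ~ N(1.5, 1)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        data.append((rng.gauss(1.5 * label, 1.0), label))
    return data


def fit_threshold(train):
    """Nearest-class-mean classifier in 1-D: returns the two class means."""
    means = []
    for label in (0, 1):
        xs = [x for x, y in train if y == label]
        # Fall back to the nominal centre if a class is absent from the split.
        means.append(statistics.fmean(xs) if xs else 1.5 * label)
    return means


def accuracy(means, test):
    """Fraction of test points assigned to the class with the nearer mean."""
    correct = sum(
        1 for x, y in test
        if (0 if abs(x - means[0]) <= abs(x - means[1]) else 1) == y
    )
    return correct / len(test)


def holdout_estimate(data, rng, test_fraction=0.3):
    """Non-resampling estimate (assumed): one random train/test split."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test, train = shuffled[:n_test], shuffled[n_test:]
    return accuracy(fit_threshold(train), test)


def kfold_estimate(data, rng, k=10):
    """Resampling estimate (assumed): mean accuracy over k cross-validation folds."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    k = min(k, len(shuffled))  # avoid empty folds on very small datasets
    folds = [shuffled[i::k] for i in range(k)]
    scores = []
    for i, test in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        scores.append(accuracy(fit_threshold(train), test))
    return statistics.fmean(scores)


def variance_of_estimates(estimator, n, repeats=200, seed=0):
    """Sample variance of the accuracy estimate over repeated datasets of size n."""
    rng = random.Random(seed)
    estimates = [estimator(make_dataset(n, rng), rng) for _ in range(repeats)]
    return statistics.variance(estimates)


if __name__ == "__main__":
    for n in (20, 100, 500):
        v_holdout = variance_of_estimates(holdout_estimate, n)
        v_kfold = variance_of_estimates(kfold_estimate, n)
        print(f"n={n:4d}  holdout var={v_holdout:.5f}  k-fold var={v_kfold:.5f}")
```

Running the script prints one line per dataset size, so the two variances can be compared as the size grows. The figures depend on the assumed data, classifier, and seed, and should not be read as reproducing the results reported here.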