Algorithm validation best practices
In the absence of established guidelines, validation of Raman techniques has been poorly standardized. This often results in biased estimates of accuracy and predictive capability. Here we provide general guidelines for the validation of Raman algorithms for classification and regression. These guidelines apply to chemometrics, machine learning, and deep learning alike.
Training/validation/test split
Splitting a Raman data set into training, validation, and test sets is the ideal assessment of predictive capability and should in general always be used. The training set is used to develop the model. The validation set is used to assess the degree to which the model is overfitting, so that training can be stopped before this occurs. The test set represents a completely independent data set and is used to report the predictive metrics.
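The three-way split described above can be sketched as follows. This is a minimal illustration using synthetic spectra and scikit-learn's `train_test_split`; the array shapes, split fractions, and random seed are assumptions, not prescriptions from the guideline.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a Raman data set: 100 spectra x 500 wavenumbers
# (illustrative only; replace with real spectra and class labels).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 500))
y = rng.integers(0, 2, size=100)

# First carve off the independent test set (20% of the data), then split
# the remainder into training (60% overall) and validation (20% overall).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

The test set is split off first so that it never influences any modeling decision; only the training and validation sets are touched during model development.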
Cross validation
When sample sizes are limited, cross validation can be used to 1) assess when the algorithm overfits and 2) estimate the performance metrics. In cross validation, one or several spectra are left out, a model is built on the remaining training data, and the left-out spectra are then predicted. This is repeated iteratively until all samples have been predicted. It is very important here to ensure the following:
- Cross validation should be performed so that it is as unbiased as possible: each left-out spectrum (or batch of spectra) should come from a specific sample. For instance, in the case of tissues from multiple patients, all spectra from a particular patient should be left out together. This is absolutely necessary in order to develop robust algorithms and obtain accurate estimates of predictive capability.
- Reported metrics (e.g., accuracy, ROC, R²) should be based on the cross-validated outcomes, not on the trained model.
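Both points above can be sketched with scikit-learn's `GroupKFold`, which keeps all spectra from one group (e.g., one patient) in the same fold. The patient structure and classifier below are illustrative assumptions, not part of the guideline.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-in: 20 patients with 5 spectra each, one class per patient
# (illustrative only; replace with real spectra, labels, and patient IDs).
rng = np.random.default_rng(0)
patients = np.repeat(np.arange(20), 5)           # patient ID per spectrum
y = np.repeat(rng.integers(0, 2, size=20), 5)    # class label per patient
X = rng.normal(size=(100, 50)) + y[:, None]      # weak class-dependent signal

# GroupKFold guarantees that all spectra from a given patient end up in the
# same fold, so no patient appears in both training and held-out data.
cv = GroupKFold(n_splits=5)
y_cv = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, groups=patients)

# Report metrics on the cross-validated predictions, not on a model
# refit to the full data set.
print(accuracy_score(y, y_cv))
```

Leaving out individual spectra instead of whole patients would let near-duplicate spectra from the same patient sit on both sides of the split, inflating the reported accuracy.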
Common mistakes
- Exhaustive optimization of hyperparameters: there is a good chance that, by luck alone, some model will appear to have predictive capability. It is therefore good practice to keep count of the number of optimizations performed.
- Leakage between training and test data will result in overoptimistic performance estimates.
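One common source of such leakage is fitting preprocessing (e.g., scaling or baseline statistics) on the full data set before splitting. A minimal sketch of the safe pattern, assuming scikit-learn and a standard-scaling step:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Pure-noise synthetic data: no real class signal exists here.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 300))
y = rng.integers(0, 2, size=100)

# Correct: the scaler is wrapped in a Pipeline, so it is refit on each
# training fold only; statistics from held-out spectra never leak in.
# (Fitting StandardScaler on all of X before cross validation would
# leak test-set statistics into training.)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(scores.mean())
```

On noise data like this, a leak-free estimate should hover near chance level; a figure well above chance would itself be a warning sign of leakage or luck.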