Cross-validation is a method for assessing the performance of a learning algorithm by partitioning the training data into multiple subsamples. It helps detect overfitting by ensuring the model is always evaluated on data it was not trained on, rather than being both trained and tested on the same points.

K-Fold Cross-Validation

  1. Randomly shuffle the dataset and split it into K equally-sized subsets (“folds”).
  2. For each fold i (from 1 to K):
        a. Train the model on all folds except fold i.
        b. Evaluate the validation error on fold i.
  3. Average the validation errors to estimate the model’s generalization error.
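As a concrete illustration, here is a minimal from-scratch sketch of the procedure in Python. The linear model and mean-squared-error metric are arbitrary choices for the example, not part of the method itself.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    def k_fold_cv_error(X, y, k=5, seed=0):
        """Estimate generalization error with K-fold cross-validation."""
        rng = np.random.default_rng(seed)
        indices = rng.permutation(len(X))            # step 1: shuffle
        folds = np.array_split(indices, k)           # step 1: split into K folds
        errors = []
        for i in range(k):                           # step 2: loop over folds
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            model = LinearRegression()
            model.fit(X[train_idx], y[train_idx])    # 2a: train on the other K-1 folds
            preds = model.predict(X[folds[i]])
            errors.append(mean_squared_error(y[folds[i]], preds))  # 2b: error on fold i
        return np.mean(errors)                       # step 3: average the fold errors

Calling k_fold_cv_error(X, y, k=5) on NumPy arrays X and y returns the averaged validation error from step 3.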

The choice of K balances the bias of the error estimate against computational cost:

  • Lower K (e.g., K = 5) runs faster but yields a higher-bias estimate, since each model is trained on a smaller fraction of the data.
  • Higher K (e.g., K = 10) gives a lower-bias estimate but takes more time, since more models must be trained.
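
In practice the loop is rarely written by hand; a library routine such as scikit-learn's cross_val_score runs the same procedure for any value of K. The Ridge model and the diabetes dataset below are placeholders chosen only to make the snippet self-contained.

    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    X, y = load_diabetes(return_X_y=True)
    model = Ridge(alpha=1.0)

    # Same model, two choices of K: K = 10 trains twice as many models as
    # K = 5, each on a slightly larger training set.
    for k in (5, 10):
        scores = cross_val_score(model, X, y, cv=k, scoring="neg_mean_squared_error")
        print(f"K = {k}: mean MSE = {-scores.mean():.1f}")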

Leave-One-Out Cross-Validation (LOOCV)

  1. Special case of K-fold where K = n (the number of data points).
  2. For each data point i:
        a. Train the model on all data except point i.
        b. Evaluate the model on the single held-out point i.
  3. Average all errors to get the overall validation score.

LOOCV provides an almost unbiased estimate of the generalization error, but it is computationally expensive for large datasets since it requires n training runs (one per data point).
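
A minimal LOOCV sketch, again in Python, using scikit-learn's LeaveOneOut splitter; the dataset and linear model are arbitrary stand-ins. The loop fits one model per data point, which is exactly where the cost of n training runs comes from.

    import numpy as np
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import LeaveOneOut

    X, y = load_diabetes(return_X_y=True)      # n = 442 points -> 442 training runs
    model = LinearRegression()

    errors = []
    for train_idx, test_idx in LeaveOneOut().split(X):
        model.fit(X[train_idx], y[train_idx])  # train on the other n - 1 points
        pred = model.predict(X[test_idx])      # evaluate on the single held-out point
        errors.append(mean_squared_error(y[test_idx], pred))

    print(f"LOOCV error estimate: {np.mean(errors):.1f}")   # average over all n errors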