In Section 3.6 (Decision Trees), on page 8, the authors describe a study by Satoła and Satoła (2024) that used a decision tree model to predict subclinical mastitis in dairy cows. The text states:
“To avoid overfitting, the tree was pruned by limiting its depth, and the training set was split 80:20 to prevent data leakage.”
This statement is scientifically flawed and reveals a misunderstanding of data leakage.
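For context, the workflow described in the quote corresponds roughly to the sketch below. It is a minimal scikit-learn reconstruction on synthetic stand-in data, not the authors' actual code, and the depth limit of 5 is an arbitrary illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for cow-level features and subclinical-mastitis labels.
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.8, 0.2],
                           random_state=0)

# 80:20 train-test split, as described in the quoted sentence.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

# Depth-limited ("pruned") tree to curb overfitting.
tree = DecisionTreeClassifier(max_depth=5, random_state=0)
tree.fit(X_train, y_train)
print("Held-out accuracy:", tree.score(X_test, y_test))
```

Limiting the tree depth does address overfitting, but nothing in this workflow addresses data leakage; the two problems are distinct.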
Data leakage occurs when information that would not be available at prediction time, most often information from the test set, is used to build the model, leading to overly optimistic performance estimates that do not generalize to real-world data.
A simple 80:20 train-test split does not prevent data leakage. It only separates data into training and testing sets.
Data leakage can still occur through:
- preprocessing steps fitted on the entire dataset before the split (e.g., normalization, imputation);
- feature selection performed on the entire dataset (illustrated in the sketch below);
- temporal or spatial dependencies between training and test samples;
- duplicated or repeated samples across splits (e.g., records from the same cow or herd appearing in both sets).
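As an illustration of the second pathway, the following scikit-learn sketch on synthetic data shows how selecting features on the full dataset inflates the held-out estimate, and how fitting the selection step inside a pipeline on the training fold only avoids it. The data, the k=10 selection, and the depth limit are assumptions for illustration, not taken from the paper.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: many noisy features, only a few truly informative.
X, y = make_classification(n_samples=500, n_features=200, n_informative=5,
                           random_state=0)

# LEAKY: features are chosen using the labels of the FULL dataset, so the
# test set influences the model's inputs even though an 80:20 split follows.
X_selected = SelectKBest(f_classif, k=10).fit_transform(X, y)
X_tr, X_te, y_tr, y_te = train_test_split(X_selected, y, test_size=0.20,
                                           random_state=0)
leaky = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)
print("Leaky estimate:    ", leaky.score(X_te, y_te))

# LEAK-FREE: split first, then fit the selector and the tree on the
# training fold only; the test fold is only transformed and scored.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=0)
model = Pipeline([
    ("select", SelectKBest(f_classif, k=10)),
    ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
])
model.fit(X_tr, y_tr)
print("Leak-free estimate:", model.score(X_te, y_te))
```

For the dependency and duplication pathways, splitting by group (e.g., scikit-learn's GroupShuffleSplit or GroupKFold keyed on cow or herd ID) is the usual safeguard.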