In Section 3.6 (Decision Trees), on page 8, the authors describe a study by Satoła and Satoła (2024) that used a decision tree model to predict subclinical mastitis in dairy cows. The text states:
“To avoid overfitting, the tree was pruned by limiting its depth, and the training set was split 80:20 to prevent data leakage.”
This statement is scientifically flawed and reveals a misunderstanding of data leakage.
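For context, the workflow described in the quote corresponds roughly to the sketch below. It is a minimal scikit-learn reconstruction on synthetic stand-in data, not the authors' actual code, and the depth limit of 5 is an arbitrary illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for cow-level features and subclinical-mastitis labels.
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.8, 0.2],
                           random_state=0)

# 80:20 train-test split, as described in the quoted sentence.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

# Depth-limited ("pruned") tree to curb overfitting.
tree = DecisionTreeClassifier(max_depth=5, random_state=0)
tree.fit(X_train, y_train)
print("Held-out accuracy:", tree.score(X_test, y_test))
```

Limiting the tree depth does address overfitting, but nothing in this workflow addresses data leakage; the two problems are distinct.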
Data leakage occurs when information that would not be available at prediction time, most often information from the test set, is used to build the model, leading to overly optimistic performance estimates that do not generalize to real-world data.
A simple 80:20 train-test split does not prevent data leakage. It only separates data into training and testing sets.
Data leakage can still occur through:
- preprocessing steps fitted on the entire dataset before the split (e.g., normalization, imputation);
- feature selection performed on the entire dataset (illustrated in the sketch below);
- temporal or spatial dependencies between training and test samples;
- duplicated or repeated samples across splits (e.g., records from the same cow or herd appearing in both sets).
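As an illustration of the second pathway, the following scikit-learn sketch on synthetic data shows how selecting features on the full dataset inflates the held-out estimate, and how fitting the selection step inside a pipeline on the training fold only avoids it. The data, the k=10 selection, and the depth limit are assumptions for illustration, not taken from the paper.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: many noisy features, only a few truly informative.
X, y = make_classification(n_samples=500, n_features=200, n_informative=5,
                           random_state=0)

# LEAKY: features are chosen using the labels of the FULL dataset, so the
# test set influences the model's inputs even though an 80:20 split follows.
X_selected = SelectKBest(f_classif, k=10).fit_transform(X, y)
X_tr, X_te, y_tr, y_te = train_test_split(X_selected, y, test_size=0.20,
                                           random_state=0)
leaky = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)
print("Leaky estimate:    ", leaky.score(X_te, y_te))

# LEAK-FREE: split first, then fit the selector and the tree on the
# training fold only; the test fold is only transformed and scored.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=0)
model = Pipeline([
    ("select", SelectKBest(f_classif, k=10)),
    ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
])
model.fit(X_tr, y_tr)
print("Leak-free estimate:", model.score(X_te, y_te))
```

For the dependency and duplication pathways, splitting by group (e.g., scikit-learn's GroupShuffleSplit or GroupKFold keyed on cow or herd ID) is the usual safeguard.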