ScienceGuardians

Applications of Machine Learning in Food Safety and HACCP Monitoring of Animal-Source Foods

Authors: Panagiota-Kyriaki Revelou, Efstathia Tsakali, Anthimia Batrinou, Irini F. Strati
Journal: Foods
Publisher: MDPI AG
Publish date: 2025-3-8
ISSN: 2304-8158
DOI: 10.3390/foods14060922

In Section 3.6 (Decision Trees), on Page 8, the authors describe a study by Satoła and Satoła (2024) that used a Decision Tree model to predict subclinical mastitis in dairy cows. The text states:

“To avoid overfitting, the tree was pruned by limiting its depth, and the training set was split 80:20 to prevent data leakage.”
This statement is scientifically flawed and reveals a misunderstanding of data leakage.

Data leakage occurs when information from outside the training dataset is used to create the model, often leading to overly optimistic performance estimates that do not generalize to real-world data.
A simple 80:20 train-test split does not, by itself, prevent data leakage. It only partitions the data into training and testing sets; it does not control how either set was prepared.
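
To illustrate the point, here is a minimal sketch in plain Python, assuming a single numeric feature with made-up values (they are not taken from the reviewed study or from Satoła and Satoła). The helper names `split_80_20`, `fit_scaler`, and `transform` are hypothetical. It shows that the same 80:20 split can be used in a leaky way (scaling statistics computed on all rows) or in a leakage-safe way (statistics computed on the training rows only).

```python
import random
import statistics


def split_80_20(rows, seed=0):
    """Shuffle a copy of rows and split it 80:20 into (train, test)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def fit_scaler(values):
    """Estimate (mean, std) from the given values only."""
    return statistics.fmean(values), statistics.pstdev(values)


def transform(values, mean, std):
    """Standardize values with previously fitted statistics."""
    return [(v - mean) / std for v in values]


# Hypothetical feature values, generated for illustration only.
rng = random.Random(42)
values = [rng.gauss(200.0, 50.0) for _ in range(100)]

# Leaky variant: the statistics are computed on ALL rows, so the test rows
# influence the preprocessing even though a split happens afterwards.
leaky_mean, leaky_std = fit_scaler(values)

# Leakage-safe variant: split first, then fit the statistics on the
# training rows only and reuse them unchanged on the test rows.
train, test = split_80_20(values)
safe_mean, safe_std = fit_scaler(train)
train_scaled = transform(train, safe_mean, safe_std)
test_scaled = transform(test, safe_mean, safe_std)
```

In both variants the split ratio is identical; what differs is the order of operations, which the ratio alone says nothing about.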

Data leakage can still occur through preprocessing steps applied before the split (e.g., normalization, imputation), feature selection using the entire dataset, temporal or spatial dependencies between samples, and duplicated samples across splits.
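
Two of these routes, duplicated samples and dependencies between related samples, can be checked or avoided with very little code. The sketch below is an assumption-laden illustration: the `cow_id` field, the record layout, and the helper names `find_cross_split_duplicates` and `group_split` are hypothetical, and nothing here is claimed to match the data structure of the study under discussion.

```python
import random


def find_cross_split_duplicates(train_rows, test_rows):
    """Return the set of rows (as tuples) that appear in both splits."""
    return set(map(tuple, train_rows)) & set(map(tuple, test_rows))


def group_split(records, group_key, test_fraction=0.2, seed=0):
    """Split so that all records sharing a group id land in the same split."""
    groups = sorted({r[group_key] for r in records})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(test_fraction * len(groups)))
    test_groups = set(groups[:n_test])
    train = [r for r in records if r[group_key] not in test_groups]
    test = [r for r in records if r[group_key] in test_groups]
    return train, test


# Hypothetical records: three repeated measurements for each of ten animals.
records = [
    {"cow_id": cow, "scc": 100 + 10 * cow + day}
    for cow in range(10)
    for day in range(3)
]

# Splitting by animal keeps correlated measurements out of both sets at once.
train, test = group_split(records, "cow_id")
train_ids = {r["cow_id"] for r in train}
test_ids = {r["cow_id"] for r in test}

# Exact-duplicate check on a toy example with one shared row.
shared = find_cross_split_duplicates([[1, 2], [3, 4]], [[3, 4], [5, 6]])
```

A row-level random split of the same records would, in most cases, place measurements from the same animal on both sides, which is one way a nominally clean 80:20 split can still leak.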
