This paper reports an average AUC of approximately 0.8 across nine toxicity endpoints in the AI-SHIPS project, using machine learning models trained on a mixture of in vitro and in silico data. However, only ~400 compounds had measured in vitro data; for the remaining ~1600 compounds, the in vitro values were themselves predicted by QSAR models. Given this heavy reliance on predicted labels to train downstream prediction models, how reliable and unbiased are the final toxicity predictions, especially when ground-truth experimental validation is limited to a relatively small subset of compounds? Could this layered modeling approach compound prediction errors or reinforce biases inherited from the earlier models?
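To make the concern concrete, here is a toy simulation (entirely synthetic data and a plain logistic-regression stand-in, not the AI-SHIPS models or descriptors; the 400/1600 split mirrors the numbers above, everything else is an illustrative assumption). A first-stage model fit on 400 "measured" compounds pseudo-labels the other 1600, and a second-stage model is trained on the mixed labels; both are then scored against held-out ground truth:

```python
import numpy as np

rng = np.random.default_rng(0)

def auc(y, s):
    """Rank-based (Mann-Whitney) AUC of scores s against binary labels y."""
    order = np.argsort(s)
    ranks = np.empty(len(s))
    ranks[order] = np.arange(1, len(s) + 1)
    pos = y == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def fit_logreg(X, y, lr=0.5, steps=2000):
    """Plain gradient-descent logistic regression (no regularization)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

# Synthetic "chemical space": 2000 training compounds + a held-out test set
# whose true labels play the role of experimental ground truth.
d = 10
w_true = rng.normal(size=d)
X = rng.normal(size=(2000, d))
y = (rng.random(2000) < 1 / (1 + np.exp(-X @ w_true))).astype(float)
X_test = rng.normal(size=(5000, d))
y_test = (rng.random(5000) < 1 / (1 + np.exp(-X_test @ w_true))).astype(float)

# Stage 1 ("QSAR"): fit on the 400 compounds with measured labels,
# then hard-pseudo-label the remaining 1600.
w_qsar = fit_logreg(X[:400], y[:400])
pseudo = (1 / (1 + np.exp(-X[400:] @ w_qsar)) > 0.5).astype(float)

# Stage 2: train on 400 measured labels + 1600 pseudo-labels.
y_mixed = np.concatenate([y[:400], pseudo])
w_stage2 = fit_logreg(X, y_mixed)

auc_measured_only = auc(y_test, X_test @ w_qsar)
auc_layered = auc(y_test, X_test @ w_stage2)
print(f"AUC, stage-1 model (400 measured labels): {auc_measured_only:.3f}")
print(f"AUC, stage-2 model (400 measured + 1600 pseudo-labels): {auc_layered:.3f}")
```

Because 80% of the stage-2 training labels are generated by the stage-1 model, the stage-2 decision boundary largely reproduces stage 1's, errors included: the extra 1600 "data points" add little information beyond what the pseudo-labeler already encodes, which is the compounding/bias-reinforcement risk the question raises.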
