The reported performance is strong: F1 scores above 0.9 for multi-class classification (Fig. 4) and AUC values up to 0.987 (Table 4). However, the evaluation appears limited to curated datasets, and the authors do not report performance across clinically meaningful subgroups (e.g., age, ethnicity, comorbidity). Moreover, although the study integrates datasets from two countries, generalizability to other populations and imaging protocols remains uncertain. How would the models perform in resource-constrained settings with lower-quality imaging or incomplete patient records? Without external validation or deployment trials, the reported metrics may overestimate the model's effectiveness in actual clinical workflows.
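To make the subgroup concern concrete, the kind of stratified reporting being requested can be sketched as below. This is an illustrative example only: the data, subgroup labels, and the simulated performance gap are synthetic placeholders, not the authors' data or results.

```python
# Illustrative sketch: per-subgroup F1 and AUC, the stratified view that
# aggregate metrics can hide. All data below is synthetic.
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

rng = np.random.default_rng(0)
n = 600
# Hypothetical subgroup label (e.g., an age band); in practice this would
# come from patient metadata such as age, ethnicity, or comorbidity status.
subgroup = rng.choice(["age<65", "age>=65"], size=n)
y_true = rng.integers(0, 2, size=n)

# Simulate a model whose scores are noisier (weaker) on the older subgroup.
noise = np.where(subgroup == "age>=65", 0.45, 0.15)
y_score = np.clip(y_true + rng.normal(0.0, 1.0, n) * noise, 0.0, 1.0)
y_pred = (y_score >= 0.5).astype(int)

# Report metrics separately for each subgroup rather than pooled.
for g in np.unique(subgroup):
    mask = subgroup == g
    f1 = f1_score(y_true[mask], y_pred[mask])
    auc = roc_auc_score(y_true[mask], y_score[mask])
    print(f"{g}: n={mask.sum()}, F1={f1:.3f}, AUC={auc:.3f}")
```

A table of such per-subgroup metrics (with confidence intervals) would let readers judge whether the headline F1 and AUC figures hold across the patient populations the model would actually serve.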
