For the performance comparison between models to be conclusive, did all of the source papers use exactly the same data splits (e.g., the same timestamps T1/T2)? Was the strategy for sampling candidate items during evaluation the same for every model? Finally, could you confirm whether the hyperparameter tuning procedures and computational budgets were comparable across the different studies?