The study provides valuable insights into multimodal identity perception, demonstrating that combining face and voice stimuli does not significantly enhance matching accuracy compared with unimodal stimuli. Could the authors discuss how the results might vary with familiar individuals or with more culturally diverse stimulus sets? Given that the matching tasks used a sequential design, it would also be useful to consider how simultaneous presentation might influence outcomes, as this could provide further context for the observed patterns. Thank you!