The paper states that, after data cleaning, MatKG contains over 70,000 entities and 5.4 million unique triples, whereas the initial extraction yielded roughly half a million entities and 11 million triples. Given this substantial reduction, how were potential biases introduced by the cleaning steps (e.g., removal of non-ASCII entities, fuzzy clustering, and ChatGPT-based standardization) evaluated, and what measures were taken to ensure that critical information was not lost or misrepresented in the final knowledge graph?
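
To make the concern concrete, the minimal sketch below (standard-library Python; the entity names, the 0.75 similarity threshold, and the greedy clustering strategy are illustrative assumptions, not the authors' actual pipeline) shows how a non-ASCII filter and fuzzy clustering can drop or conflate valid materials entities:

```python
from difflib import SequenceMatcher

# Hypothetical surface forms; the real MatKG vocabulary is far larger.
entities = ["BaTiO3", "BaTiO₃", "barium titanate", "Ba-titanate", "α-Fe2O3"]

# Step 1: non-ASCII filter. "BaTiO₃" and "α-Fe2O3" are discarded even though
# both denote legitimate materials (subscript digits, Greek phase labels).
ascii_only = [e for e in entities if e.isascii()]

# Step 2: greedy fuzzy clustering. Names whose string similarity exceeds a
# threshold are merged under one canonical label, so distinct surface forms
# (and potentially distinct concepts) can be conflated.
def cluster(names, threshold=0.75):
    clusters = []
    for name in names:
        for members in clusters:
            if SequenceMatcher(None, name.lower(), members[0].lower()).ratio() >= threshold:
                members.append(name)  # merged into an existing cluster
                break
        else:
            clusters.append([name])
    return clusters

print(ascii_only)           # ['BaTiO3', 'barium titanate', 'Ba-titanate']
print(cluster(ascii_only))  # [['BaTiO3'], ['barium titanate', 'Ba-titanate']]
```

Cases like these are exactly where silent information loss or entity conflation could enter the final graph.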
