1. Figures 2 and 9 report percentage increases of over 10,000% for terms like underscore (e.g., from 0.013% to 1.37% in Fig. 9). Such extreme percentages are driven mathematically by near-zero 2022 baselines: the Fig. 9 example corresponds to an absolute change of about 1.36 percentage points. The authors report neither absolute changes nor confidence intervals for these ratios, so the visual impact is disproportionate and potentially misleading. Why were raw effect sizes (e.g., percentage-point differences) not emphasized instead?
2. The authors identified five new terms (e.g., heighten, foster) by selecting words that showed a “clear step-change in frequency in 2024” in Environmental Science abstracts. This criterion selects on the outcome: it essentially guarantees a post-2022 increase and biases the set toward terms that mechanically support the study’s hypothesis. How do the authors rule out that this selection inflates the apparent LLM effect?
3. Figure 14 shows that retracted papers in 2024 had higher proportions of LLM-associated terms (15.8%) than published papers (8.1%). The authors call this “hypothesis-generating” but present it without controlling for field, journal quality, preprint status, or retraction reason (e.g., fraud vs. honest error). Given these confounders, why include this figure at all if it cannot support any substantive claim about LLMs and retractions?
4. Figures 4 and 5 show that delve and underscore had the highest absolute 2024 percentages in Arts & Humanities, Social Sciences, and Business, yet the text concludes that “growth of LLM terms was highest in STEM fields” (based on relative percentage increases). This conclusion selectively favors relative over absolute metrics. Which metric is scientifically appropriate here, and why?
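
To make the concern in comment 1 concrete, the following is a minimal sketch of the arithmetic behind the Fig. 9 example, together with one standard way to attach a confidence interval to a ratio of proportions (the Katz log method). The percentages 0.013% and 1.37% are taken from the manuscript; the abstract counts (13 and 1,370 out of 100,000 per year) are hypothetical values chosen only to match those percentages, since the underlying counts are not reported, and the function names are illustrative.

```python
import math


def relative_increase_pct(before_pct: float, after_pct: float) -> float:
    """Relative increase, in percent, from before_pct to after_pct."""
    return (after_pct - before_pct) / before_pct * 100.0


def absolute_change_pp(before_pct: float, after_pct: float) -> float:
    """Absolute change in percentage points."""
    return after_pct - before_pct


def ratio_with_ci(x1: int, n1: int, x2: int, n2: int, z: float = 1.96):
    """Ratio of proportions (x2/n2) / (x1/n1) with a Katz log-method interval.

    z = 1.96 gives an approximate 95% interval.
    """
    ratio = (x2 / n2) / (x1 / n1)
    se_log = math.sqrt(1 / x1 - 1 / n1 + 1 / x2 - 1 / n2)
    low = math.exp(math.log(ratio) - z * se_log)
    high = math.exp(math.log(ratio) + z * se_log)
    return ratio, low, high


# Percentages reported in Fig. 9 for "underscore".
before, after = 0.013, 1.37
rel = relative_increase_pct(before, after)
abs_pp = absolute_change_pp(before, after)

# HYPOTHETICAL counts (not from the manuscript): 13 and 1,370 abstracts
# containing the term, out of 100,000 abstracts in each year.
ratio, low, high = ratio_with_ci(13, 100_000, 1_370, 100_000)

print(f"relative increase: {rel:.0f}%")
print(f"absolute change:   {abs_pp:.3f} percentage points")
print(f"ratio: {ratio:.1f} (approx. 95% CI {low:.1f} to {high:.1f})")
```

Under these assumed counts the interval around the ratio is wide, because its width is dominated by the small 2022 count. Reporting the percentage-point change alongside such an interval would let readers judge the effect size directly.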