You used the GPT-3.5 API with embedding-based search over recent literature, while ChatGPT and GPT-4 were tested via the web interface without retrieval augmentation. Isn't this an unfair comparison? The "superior" performance of the GPT-3.5 API setup might simply reflect that you fed it newer text, not that the model is actually better at identifying emerging trends. How do you disentangle the effect of your retrieval pipeline from the model's analytical capability?
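To make the suggestion concrete, a factorial ablation that crosses model with retrieval would separate the two factors. The sketch below is only illustrative: the model names, the prompt, and `retrieve_recent_abstracts` are placeholders, not your actual pipeline.

```python
# Hypothetical 2x2 ablation: each model is run both with and without retrieval,
# using the same prompts, so retrieval and model effects can be separated.
from itertools import product
from openai import OpenAI

client = OpenAI()

def retrieve_recent_abstracts(query: str, k: int = 5) -> list[str]:
    """Placeholder for the authors' embedding-based search over recent literature."""
    raise NotImplementedError

def ask(model: str, question: str, use_retrieval: bool) -> str:
    context = "\n\n".join(retrieve_recent_abstracts(question)) if use_retrieval else ""
    messages = [
        {"role": "system", "content": "You identify emerging research directions."},
        {"role": "user", "content": f"{context}\n\n{question}".strip()},
    ]
    resp = client.chat.completions.create(model=model, messages=messages)
    return resp.choices[0].message.content

# Same question, retrieval toggled per cell; score each answer with the paper's rubric.
for model, use_retrieval in product(["gpt-3.5-turbo", "gpt-4"], [False, True]):
    answer = ask(model, "What are the emerging directions in <field>?", use_retrieval)
```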
In your topic modeling (Fig. 4), you set the number of clusters by a rule of thumb based on paper counts (≤1000 → 5 clusters, etc.). That seems arbitrary; why not select it with coherence scores or cross-validation? And the "unique keywords" appear only in Table S1, not in the main paper. How do we know your cluster interpretations aren't just post-hoc storytelling?
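For reference, a coherence-based sweep over candidate cluster counts is straightforward. This is a minimal sketch assuming an LDA-style pipeline; your actual topic model may differ, but the same sweep applies to any method with a tunable k.

```python
# Pick the number of topics by maximizing c_v coherence over a candidate range,
# and report the full coherence curve rather than only the argmax.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

def pick_k(tokenized_docs, candidate_ks=range(3, 21)):
    dictionary = Dictionary(tokenized_docs)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    scores = {}
    for k in candidate_ks:
        lda = LdaModel(corpus, num_topics=k, id2word=dictionary, random_state=0)
        scores[k] = CoherenceModel(
            model=lda, texts=tokenized_docs, dictionary=dictionary, coherence="c_v"
        ).get_coherence()
    best_k = max(scores, key=scores.get)
    return best_k, scores
```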
You claim LLMs identify "emerging research directions," but you never check whether these directions are genuinely emerging or merely rephrasings of existing hot topics. Did you compare them against expert surveys or horizon-scanning studies? Without such a comparison, how do we know this is real insight generation rather than elegant pattern recognition?
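One possible check, purely hypothetical and not claimed to be in the paper: embed the LLM's suggested directions and compare them against both an expert-curated horizon-scanning list and an already-established topic list. High similarity to the former and low similarity to the latter would support "emerging"; the reverse would suggest recycling. The encoder name and lists below are illustrative.

```python
# Compare LLM-proposed directions with (a) expert horizon-scan items and
# (b) established hot topics, via max cosine similarity per direction.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def max_similarity(directions: list[str], reference: list[str]) -> list[float]:
    d_emb = encoder.encode(directions, convert_to_tensor=True)
    r_emb = encoder.encode(reference, convert_to_tensor=True)
    return util.cos_sim(d_emb, r_emb).max(dim=1).values.tolist()

llm_directions = ["..."]       # directions proposed by the LLM
expert_horizon = ["..."]       # expert survey / horizon-scanning items
established_topics = ["..."]   # already-dominant topics from prior years

novelty_support = max_similarity(llm_directions, expert_horizon)
recycled_signal = max_similarity(llm_directions, established_topics)
```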