You state that access to data is “universal but uneven,” with most EDGAR downloads now made by bots rather than human readers. If everyone’s AI is mining the same public data, doesn’t this risk “algorithmic herding” and the rapid erosion of any alpha from these sources? How sustainable can such data-driven advantages be?
You note that firms are already altering their disclosures (e.g., reducing their use of negative words) to game machine readers. Doesn’t this strategic adaptation threaten the long-term reliability of NLP models that rely on such data?
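To make the concern concrete, here is a minimal sketch of a Loughran–McDonald-style negative word count (my illustration, not necessarily your exact measure). The word list is a tiny hypothetical subset of the real dictionary, and the example sentences are invented:

```python
import re
from collections import Counter

# Tiny illustrative subset of a Loughran-McDonald-style negative word list;
# the real dictionary contains a few thousand entries.
NEGATIVE_WORDS = {"loss", "impairment", "litigation", "adverse", "decline"}

def negative_tone(text: str) -> float:
    """Fraction of tokens that appear in the negative word list."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    return sum(counts[w] for w in NEGATIVE_WORDS) / len(tokens)

# Hypothetical before/after sentences: the second swaps softer synonyms for
# flagged words without changing the economic substance.
original  = "We recorded an impairment loss and expect an adverse decline in margins."
rewritten = "We recorded a one-time charge and expect some moderation in margins."
print(f"original:  {negative_tone(original):.3f}")
print(f"rewritten: {negative_tone(rewritten):.3f}")
```

The measured tone drops purely from the substitution of synonyms, which is exactly the channel a strategic filer can exploit; how do you see dictionary-based measures retaining validity as issuers adapt?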
You detail advanced models such as BERT and GPT, but what is a realistic path for a typical researcher to use them? The computational cost and “black box” nature of these models are significant hurdles. Can you offer practical guidance for applying these techniques without vast computational resources?
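As one possible reading of “practical,” the sketch below runs an already fine-tuned checkpoint on a CPU for inference only, with no training. It assumes the Hugging Face transformers package and the publicly released ProsusAI/finbert checkpoint; both are my assumptions for illustration, not tools your paper prescribes:

```python
# Inference-only use of an already fine-tuned model; no GPU or training needed.
from transformers import pipeline

# Assumed checkpoint: ProsusAI/finbert, a publicly released BERT model
# fine-tuned for financial sentiment (positive / negative / neutral).
classifier = pipeline("text-classification", model="ProsusAI/finbert")

sentences = [
    "The company reported record quarterly revenue and raised guidance.",
    "Management disclosed a material weakness in internal controls.",
]
for sentence, result in zip(sentences, classifier(sentences)):
    print(f"{result['label']:>8}  {result['score']:.2f}  {sentence}")
```

Would you consider this kind of inference-only workflow a reasonable default recommendation for researchers without access to GPU clusters, or do you see fine-tuning as unavoidable for credible results?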