You note in the results (Page 7) that there were inherent differences in the calculation engines between the AI-generated R models (using the heemod package) and the original Excel models. Specifically: (1) Discounting was applied on a per-cycle basis in the R models versus a year-by-year basis in Excel. (2) The R models assumed 100% progression-free survival state occupancy in the first model cycle, whereas one Excel model applied a half-cycle correction. This raises a critical question: Is your study demonstrating the replication of models, or the re-creation of models with different underlying methodological assumptions that happen to produce similar point estimates (ICERs)?
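To make the first divergence concrete, here is a minimal base-R sketch of the two discounting conventions; the rate, cycle length, and cost stream are assumed for illustration and are not taken from the paper.

```r
# Illustrative only: contrast per-cycle vs. year-by-year discounting
# for an assumed monthly-cycle cost stream at a 3.5% annual rate.
rate <- 0.035
cycle_length <- 1 / 12                 # one-month cycles, expressed in years
costs <- rep(1000, 24)                 # assumed undiscounted cost per cycle
cycles <- seq_along(costs)

# Per-cycle discounting (the convention of the AI-generated R models):
# each cycle receives its own factor based on elapsed time.
df_per_cycle <- (1 + rate)^(-(cycles - 1) * cycle_length)

# Year-by-year discounting (the Excel convention): all cycles falling
# within the same model year share a single annual factor.
df_annual <- (1 + rate)^(-floor((cycles - 1) * cycle_length))

c(per_cycle = sum(costs * df_per_cycle),
  annual    = sum(costs * df_annual))
```

In this toy example the two totals differ by well under one percent, which is consistent with a marginal shift in the ICER, yet the two conventions remain methodologically distinct.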
The distinction is non-trivial. In health technology assessment (HTA), the validity of a model depends not only on its output but also on the correctness and transparency of its methods. A process that automatically generates code but silently alters foundational techniques (such as discounting conventions or half-cycle corrections) is not replicating the original work; it is building a different, albeit numerically similar, model. This could compromise the study’s central conclusion about the feasibility of automation for HTA submissions, where methodological fidelity is as important as numerical accuracy.
Were these methodological differences (discounting period, half-cycle correction) deliberate simplifications you introduced, or an unintended consequence of GPT-4 relying on heemod package defaults? If unintended, does this suggest that, without extremely precise prompt engineering, LLMs may default to their own “standard” methodologies rather than faithfully implementing the study-specific methods described in the prompt?
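For the second divergence, a similarly hedged sketch (hypothetical trace and cost values, not the paper’s data) shows how counting 100% first-cycle occupancy differs from a trapezoidal half-cycle correction:

```r
# Hypothetical five-cycle progression-free survival trace, illustrative only.
pfs <- c(1.00, 0.90, 0.80, 0.70, 0.60)   # proportion progression-free at cycle start
cost_pfs <- 5000                          # assumed per-cycle cost while progression-free

# No correction: full occupancy counted at the start of every cycle,
# including 100% in the first cycle (the R models' convention).
uncorrected <- sum(pfs * cost_pfs)

# Half-cycle correction (trapezoidal weighting): first and last cycles are
# weighted by 0.5 to approximate mid-cycle membership (the convention of
# one of the Excel models).
weights <- c(0.5, rep(1, length(pfs) - 2), 0.5)
corrected <- sum(pfs * weights * cost_pfs)

c(uncorrected = uncorrected, half_cycle = corrected)
```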
How do you reconcile the goal of “replication” with the acceptance of these methodological divergences? Should the benchmark for success in such automation studies be stricter, requiring not only output alignment but also procedural fidelity?
Could this issue be more widespread? You checked for errors line by line, but would a reviewer focused on methodological adherence (as an HTA body’s reviewer would be) classify the use of a different discounting convention as a “major error” in the context of model specification, even if it shifts the ICER only marginally?