Page 4, under the section “Invalid data and corresponding processing measures”, and implicitly in the reliability results (Table 1, Supplementary Table 2).
The authors stated:
“In the calculation of Fleiss kappa, all invalid data in category A are considered to constitute an independent classification, and the invalid data in category B are treated as different classifications based on the values (if the rating is ‘2 or 3’, it is recorded as 2.5) generated by the LLMs.”
This approach is methodologically incorrect for calculating Fleiss’ kappa, a statistic designed for nominal categorical data in which each rater assigns every item to exactly one of a fixed, predefined set of discrete categories (any ordinal structure among the categories is ignored by the statistic).
By creating a new category for non-integer responses (e.g., “2 or 3” recorded as 2.5), the authors are introducing a continuous numerical value into a categorical agreement analysis.
Fleiss’ kappa requires that all possible categories be discrete and mutually exclusive. Introducing interpolated values (like 2.5) violates this principle and makes the kappa calculation mathematically and conceptually invalid.
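For reference, with N items, n ratings per item, and a fixed set of k mutually exclusive categories, let n_ij denote the number of raters assigning item i to category j. Fleiss’ kappa is defined as

\[
\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e},
\qquad
\bar{P} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{n(n-1)}\left(\sum_{j=1}^{k} n_{ij}^{2} - n\right),
\qquad
\bar{P}_e = \sum_{j=1}^{k}\left(\frac{1}{Nn}\sum_{i=1}^{N} n_{ij}\right)^{2}.
\]

Every sum runs over the discrete category index j. An interpolated label such as 2.5 carries no “between 2 and 3” meaning here; the statistic simply treats it as one more nominal category, so hedged responses end up scored as agreements with one another.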
The proper way to handle such ambiguous responses would be to either:
1. Exclude them from the reliability analysis, or
2. Define a set of valid discrete categories in advance (e.g., 1, 2, 3, 4, and “invalid”) and assign non-integer responses to an “invalid” category.
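As a minimal sketch of the second option, assuming the statsmodels implementation of Fleiss’ kappa is used (the ratings, category codes, and recode helper below are hypothetical, not taken from the study):

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical ratings: rows are items, columns are repeated LLM runs
# ("raters"). Valid categories are the discrete grades 1-4; any ambiguous
# response (e.g. "2 or 3") is mapped to a single "invalid" category coded 0,
# rather than to an interpolated value such as 2.5.
raw = [
    ["2", "2", "2 or 3"],
    ["1", "1", "1"],
    ["4", "3", "4"],
    ["3", "3", "3"],
]

VALID = {"1": 1, "2": 2, "3": 3, "4": 4}

def recode(response):
    # Map a response to a discrete category code; 0 means invalid/ambiguous.
    return VALID.get(response.strip(), 0)

ratings = np.array([[recode(r) for r in item] for item in raw])

# aggregate_raters turns the (items x raters) matrix into the
# (items x categories) count table that fleiss_kappa expects.
table, categories = aggregate_raters(ratings)
print("categories:", categories)
print("Fleiss' kappa:", fleiss_kappa(table))
```

Either remedy keeps the category set discrete, mutually exclusive, and fixed in advance, which is what the statistic assumes.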
This error undermines the validity of the reported reliability metrics (Fleiss’ kappa values in Table 1). For instance, the “almost perfect” reliability reported for gpt-3.5-ft-0 and gpt-3.5-API-0 (kappa ≈ 0.98) may be artificially inflated or distorted due to this improper handling of continuous values in a categorical agreement statistic.
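To make the inflation mechanism concrete, here is a deliberately extreme, fully invented example (again using statsmodels): when the model hedges identically across repeated runs, coding the hedge as its own category rewards that hedging as agreement, whereas excluding those items exposes the disagreement among the committed grades.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Invented data, for illustration only. Two repeated runs rate six items.
# On the first three items both runs hedge ("2 or 3"); on the last three
# they commit to grades but disagree. Category 5 stands in for the
# interpolated "2.5" label.
ratings = np.array([
    [5, 5],  # both runs hedged -> scored as perfect agreement under 2.5 coding
    [5, 5],
    [5, 5],
    [2, 3],  # genuine disagreement between committed grades
    [1, 2],
    [3, 4],
])

table, _ = aggregate_raters(ratings)
print("kappa, hedge coded as own category:", fleiss_kappa(table))  # ~0.27

# Excluding the hedged items, as in option 1 above, leaves only the
# committed (and here, conflicting) grades, and kappa drops sharply.
table, _ = aggregate_raters(ratings[3:])
print("kappa, hedged items excluded:", fleiss_kappa(table))  # ~-0.38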
Beyond the statistical issue, the core error in this study is that the authors treat a clinical guideline’s recommendation strength (e.g., “Strong” or “Limited”) as a factually correct answer.
This is a fundamental flaw. The “strength” of a recommendation reflects the quality of the supporting evidence, not a simple true/false fact. An LLM could provide a medically sound and correct explanation for a therapy yet assign it a different evidence grade than the AAOS committee did. Under this study’s method, that correct medical response would be marked as an inconsistency and counted as an error.
In short, the authors are not measuring medical accuracy; they are measuring the LLM’s ability to guess the AAOS committee’s specific grading preferences. This invalidates their main “consistency” metric and undermines the entire premise of the study.