Page 4, under the section “Invalid data and corresponding processing measures”, and implicitly in the reliability results (Table 1, Supplementary Table 2).
The authors stated:
“In the calculation of Fleiss kappa, all invalid data in category A are considered to constitute an independent classification, and the invalid data in category B are treated as different classifications based on the values (if the rating is ‘2 or 3’, it is recorded as 2.5) generated by the LLMs.”
This approach is methodologically incorrect for calculating Fleiss’ kappa, a statistic designed for nominal categorical data in which each rater assigns each item to exactly one of a set of discrete categories.
By recording ambiguous responses as interpolated values (e.g., “2 or 3” recorded as 2.5), the authors introduce numerical values that fall between the predefined rating categories into a categorical agreement analysis.
Fleiss’ kappa requires that all possible categories be discrete, mutually exclusive, and fixed before the analysis. Introducing interpolated values (like 2.5) violates this principle and renders the kappa calculation mathematically and conceptually invalid.
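For reference, the standard definition of Fleiss’ kappa is computed entirely from the counts $n_{ij}$ of raters who assign item $i$ to category $j$, summed over a fixed set of $k$ discrete categories; a value such as 2.5 that lies between categories has no defined place in these sums:

\[
p_j = \frac{1}{Nn}\sum_{i=1}^{N} n_{ij}, \qquad
P_i = \frac{1}{n(n-1)}\Big(\sum_{j=1}^{k} n_{ij}^{2} - n\Big), \qquad
\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e},
\]

where $N$ is the number of items, $n$ the number of ratings per item, $\bar{P} = \frac{1}{N}\sum_{i} P_i$ is the mean observed agreement, and $\bar{P}_e = \sum_{j} p_j^{2}$ is the agreement expected by chance.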
The proper way to handle such ambiguous responses would be to either:
1. Exclude them from the reliability analysis, or
2. Define a set of valid discrete categories in advance (e.g., 1, 2, 3, 4, and “invalid”) and assign non-integer or ambiguous responses to the “invalid” category (a minimal sketch of this option follows below).
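As an illustration of option 2, the sketch below maps every raw response to one of five pre-specified categories (1–4 plus a single “invalid” bucket) and then computes Fleiss’ kappa on the resulting count table. It assumes the statsmodels library is available; the example ratings and the helper name to_category are hypothetical and are not taken from the authors’ code.

```python
# Minimal sketch of option 2 (not the authors' code): every response is mapped
# to one of a fixed set of discrete categories before Fleiss' kappa is computed.
# Assumes statsmodels is installed; ratings and helper names are hypothetical.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

VALID = {"1": 0, "2": 1, "3": 2, "4": 3}
INVALID = 4  # single catch-all category for "2 or 3", empty, or malformed output


def to_category(raw: str) -> int:
    """Map a raw LLM response to one of the five pre-specified category codes."""
    return VALID.get(raw.strip(), INVALID)


# Hypothetical raw outputs: rows are items, columns are repeated LLM runs (raters).
raw_ratings = [
    ["3", "3", "3"],
    ["2", "2 or 3", "2"],  # ambiguous response goes to "invalid", not to 2.5
    ["1", "1", "2"],
    ["4", "4", "4"],
]
coded = np.array([[to_category(r) for r in row] for row in raw_ratings])

# Build the items-by-categories count table that Fleiss' kappa operates on.
table, _ = aggregate_raters(coded, n_cat=5)
print(fleiss_kappa(table, method="fleiss"))
```

Under option 1, the ambiguous responses would instead simply be dropped from the rating matrix before the count table is built.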
This error undermines the validity of the reported reliability metrics (Fleiss’ kappa values in Table 1). For instance, the “almost perfect” reliability reported for gpt-3.5-ft-0 and gpt-3.5-API-0 (kappa ≈ 0.98) may be artificially inflated or distorted due to this improper handling of continuous values in a categorical agreement statistic.