ScienceGuardians


Prompt engineering in consistency and reliability with the evidence-based guideline for LLMs

Authors: Li Wang, Xi Chen, XiangWen Deng, Hao Wen, MingKe You, WeiZhi Liu, Qi Li, Jian Li
Journal: npj Digital Medicine
Publisher: Springer Science and Business Media LLC
Publish date: 2024-02-20
ISSN: 2398-6352 DOI: 10.1038/s41746-024-01029-4

The issue appears on page 4, under the section “Invalid data and corresponding processing measures”, and implicitly in the reliability results (Table 1, Supplementary Table 2).

The authors stated:

“In the calculation of Fleiss kappa, all invalid data in category A are considered to constitute an independent classification, and the invalid data in category B are treated as different classifications based on the values (if the rating is ‘2 or 3’, it is recorded as 2.5) generated by the LLMs.”
This approach is methodologically incorrect for calculating Fleiss’ kappa, which is a statistic designed for categorical (nominal or ordinal) data where raters assign items to discrete categories.

By recording non-integer responses as new categories (e.g., “2 or 3” recorded as 2.5), the authors introduce interpolated numerical values into a categorical agreement analysis.
Fleiss’ kappa requires a fixed set of discrete, mutually exclusive categories defined in advance. A post-hoc value such as 2.5 overlaps conceptually with the existing categories 2 and 3, so it is neither part of the intended rating scale nor mutually exclusive with it; treating it as a category makes the kappa calculation mathematically and conceptually invalid.

The proper way to handle such ambiguous responses would be to either:

1. Exclude them from the reliability analysis, or
2. Define a set of valid discrete categories in advance (e.g., 1, 2, 3, 4, and “invalid”) and assign non-integer responses to an “invalid” category.
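To make option 2 concrete, here is a minimal sketch (with made-up example data, not the paper's actual ratings) that maps every raw LLM response to one of five predefined discrete categories — 1, 2, 3, 4, or “invalid” — and then computes Fleiss’ kappa from scratch using the standard formula (per-item agreement P_i, mean agreement P̄, chance agreement P_e from overall category proportions):

```python
from collections import Counter

# Predefined, mutually exclusive categories (option 2 in the post above).
CATEGORIES = ["1", "2", "3", "4", "invalid"]

def normalize(response):
    """Map a raw LLM response to a discrete category.
    Anything outside the 1-4 scale (e.g., '2 or 3') becomes 'invalid'."""
    return response if response in CATEGORIES[:4] else "invalid"

def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each rated by the same number
    of raters, with ratings drawn from the discrete CATEGORIES set."""
    n = len(ratings[0])   # raters per item
    N = len(ratings)      # number of items
    counts = [Counter(item) for item in ratings]
    # Per-item agreement P_i = (sum_j n_ij^2 - n) / (n * (n - 1))
    P = [(sum(c ** 2 for c in cnt.values()) - n) / (n * (n - 1))
         for cnt in counts]
    P_bar = sum(P) / N
    # Chance agreement P_e = sum_j p_j^2, from overall category proportions
    total = n * N
    p_j = [sum(cnt[cat] for cnt in counts) / total for cat in CATEGORIES]
    P_e = sum(p ** 2 for p in p_j)
    return (P_bar - P_e) / (1 - P_e)

# Hypothetical data: three repeated LLM runs ("raters") grading five items.
raw = [
    ["2", "2", "2 or 3"],          # ambiguous output -> 'invalid'
    ["1", "1", "1"],
    ["3", "3", "3"],
    ["4", "4", "4"],
    ["2", "invalid text", "2"],    # unparseable output -> 'invalid'
]
items = [[normalize(r) for r in row] for row in raw]
print(round(fleiss_kappa(items), 3))  # -> 0.663 for this toy data
```

Because ambiguous outputs land in a single predefined “invalid” category rather than being interpolated to 2.5, every assignment stays within a discrete, mutually exclusive category set, which is what the kappa formula assumes.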

This error undermines the validity of the reported reliability metrics (Fleiss’ kappa values in Table 1). For instance, the “almost perfect” reliability reported for gpt-3.5-ft-0 and gpt-3.5-API-0 (kappa ≈ 0.98) may be artificially inflated or distorted due to this improper handling of continuous values in a categorical agreement statistic.
