1. Prompt Engineering Bias
The authors simulate low-, medium-, and high-skill student prompts for ChatGPT, but these prompts were crafted by instructors rather than by actual students.
How do the authors justify the ecological validity of these prompts, given that real students might express intent, misunderstanding, or urgency in ways that instructors can’t authentically replicate? Could this have biased the AI-generated output toward more “instructor-like” phrasing and structure?
2. Sample Size Limitations
The study relies on a relatively small sample of 36 code submissions (9 from ChatGPT and 27 from students), limiting its statistical power.
Do the authors believe that the Kruskal–Wallis test is appropriate for such a limited dataset with high inter-assignment variability? Were any power calculations or confidence intervals reported to ensure robustness of findings?
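To make the robustness concern concrete, here is a minimal sketch of the kind of check the question asks for, run on hypothetical per-assignment scores rather than the paper's data (SciPy assumed; group sizes mirror the 9 + 27 split only for illustration):

```python
# Hypothetical robustness check: Kruskal-Wallis plus an effect size,
# using invented scores -- not the paper's actual data.
from scipy.stats import kruskal

chatgpt    = [78, 85, 90, 72, 88, 80, 75, 83, 91]                       # 9 AI submissions
students_a = [70, 82, 95, 68, 77, 89, 92, 74, 81, 86, 79, 93, 71]       # 13 student submissions
students_b = [65, 88, 90, 73, 84, 76, 91, 69, 82, 87, 75, 94, 78, 80]   # 14 student submissions

H, p = kruskal(chatgpt, students_a, students_b)
n = len(chatgpt) + len(students_a) + len(students_b)
epsilon_sq = H / (n - 1)  # epsilon-squared effect size for Kruskal-Wallis

print(f"H = {H:.2f}, p = {p:.3f}, epsilon^2 = {epsilon_sq:.3f}")
```

With n = 36 and high inter-assignment variability, even moderate effects can fail to reach significance, which is why reporting an effect size or confidence interval alongside the H statistic would strengthen the paper's claims.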
3. Assignment Complexity Bias
HW2 (the stop-sign task) is evaluated visually for graphical correctness, a domain in which ChatGPT currently underperforms.
Does this unfairly disadvantage ChatGPT compared to textual or logic-based tasks? Could the choice of assignments inadvertently amplify performance differences between students and AI?
4. Confounding Effect of Code Sanitization
Code submissions were stripped of notebook-specific elements and author identifiers.
Could the sanitization process have unintentionally erased subtle markers of authenticity, such as formatting habits, comment style, or IDE-specific syntax quirks that graders might normally use to detect AI authorship?
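For concreteness, here is a hypothetical sanitization pass of the kind the paper describes (the authors' actual procedure is not reproduced here); note that even this minimal cleaning removes comment-level cues a grader might rely on:

```python
# Hypothetical sanitizer: strips notebook-specific lines and
# author-identifying comments from a submission.
import re

def sanitize(source: str) -> str:
    """Remove notebook magics, shell escapes, and identity comments."""
    cleaned = []
    for line in source.splitlines():
        # Drop IPython/Jupyter magics and shell escapes (notebook-specific).
        if line.lstrip().startswith(("%", "!")):
            continue
        # Drop comments that look like author identifiers (name, ID, email).
        if re.match(r"\s*#\s*(author|name|student\s*id|e-?mail)\b", line, re.I):
            continue
        cleaned.append(line)
    # Collapse the blank runs left behind by the removals.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(cleaned))
```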
5. Reliability of AI Detectors
The OpenAI text classifier was used, despite known limitations and a low success rate on short code snippets.
Why did the study rely on this detector, and why weren’t alternative tools (e.g., GPTZero or stylometry-based code classifiers) included to cross-validate the results?
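For illustration, a stylometry-based alternative could be as simple as hand-crafted style features extracted with Python's tokenize module and fed to any standard classifier; the feature choices below are assumptions for the sketch, not the paper's method:

```python
# Minimal stylometric feature extractor for Python source code.
import io
import statistics
import tokenize

def style_features(source: str) -> dict:
    """Extract simple style features (identifiers, comments, blank lines)."""
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    names = [t.string for t in tokens if t.type == tokenize.NAME]      # keywords included for simplicity
    comments = [t.string for t in tokens if t.type == tokenize.COMMENT]
    lines = source.splitlines() or [""]
    return {
        "mean_identifier_len": statistics.mean(map(len, names)) if names else 0.0,
        "comment_density": len(comments) / len(lines),
        "blank_line_ratio": sum(1 for l in lines if not l.strip()) / len(lines),
        "snake_case_ratio": sum("_" in n for n in names) / len(names) if names else 0.0,
    }
```

Feature vectors like these could then be used to cross-validate the OpenAI classifier's labels with, e.g., a logistic-regression model.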
6. Grader Expectation Bias
Graders were told there were three AI submissions per assignment.
How might this prior knowledge have influenced their identification strategy (e.g., through motivated reasoning)? Would blind grading with no fixed AI quota yield different detection rates?
7. Overgeneralization of Results
The conclusion that “ChatGPT performs like a mid-level student” is based on only three assignments from one Python course.
Can this conclusion be generalized to other courses, other languages (e.g., Java), or to more diverse, open-ended programming challenges such as algorithm design or database queries?
8. Skill vs. Effort Confounding
The study assumes student skill levels are reflected solely by code quality.
Could effort level (time spent, number of code iterations, peer feedback) or motivational factors have played a larger role in distinguishing student work from AI-generated work, and how were these controlled?
9. Implicit Value Judgments on AI Use
The authors suggest that students who use ChatGPT lack understanding.
Should using ChatGPT necessarily imply academic dishonesty, or could it reflect modern tool-assisted learning methods that deserve reevaluation within course design?
10. Instructor-Driven Detection Practices
The authors recommend imposing stylistic constraints (e.g., a prescribed comment style or variable-naming convention) to detect AI use.
Could this push students into superficial compliance rather than deeper learning, and does it risk penalizing those who genuinely struggle but adhere to these “telltale” AI styles?
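To make the concern concrete, here is a hypothetical automated check for the kind of constraint being recommended; the header format and naming rule are assumptions for illustration, not the authors' exact proposal:

```python
# Hypothetical style-constraint checker: required comment header plus
# snake_case function names.
import ast
import re

REQUIRED_HEADER = re.compile(r"^#\s*HW\d+\s*-\s*\w+", re.M)  # e.g. "# HW2 - draw_stop_sign"

def passes_style_constraints(source: str) -> bool:
    """Return True if the submission satisfies the mandated surface style."""
    if not REQUIRED_HEADER.search(source):
        return False
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef) and not re.fullmatch(r"[a-z_][a-z0-9_]*", node.name):
            return False
    return True
```

A check like this verifies only surface form: both a struggling student and a lightly edited ChatGPT submission can satisfy it, which is precisely the risk the question raises.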