The model’s sentiment F1-score is only 0.72. F1 is not the same as accuracy, but a score that low implies a substantial share of tweet classifications (positive/negative/neutral), plausibly close to 30%, are wrong. Given the razor-thin margins reported (e.g., psychiatrist: 36.10% negative vs. 22.77% positive), this error rate undermines any claim about which perception dominates. You can’t confidently say negative sentiment “dominates” when your tool misclassifies more than one tweet in four.
– With an F1 of 0.72, how can you be certain the negative majority for psychiatrist isn’t just classification noise? A 13-point gap can be manufactured entirely by asymmetric misclassification when roughly 30% of labels are wrong, as the sketch below illustrates.
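To make this concrete, here is a minimal simulation sketch. All numbers are hypothetical, chosen only to be roughly consistent with an overall F1 near 0.72; the point is that a classifier with asymmetric errors can report a 36%-vs-23% negative/positive split even when the true distribution is a dead tie:

    import numpy as np

    # Hypothetical true sentiment distribution: positive and negative tied.
    true_dist = np.array([0.30, 0.30, 0.40])  # [positive, negative, neutral]

    # Hypothetical confusion matrix P(predicted | true); rows sum to 1.
    # Per-class recalls of 0.70/0.80/0.70 are plausible alongside an F1 of ~0.72.
    confusion = np.array([
        [0.70, 0.15, 0.15],  # true positive -> predicted pos/neg/neu
        [0.05, 0.80, 0.15],  # true negative
        [0.05, 0.25, 0.70],  # true neutral, often misread as negative
    ])

    observed = true_dist @ confusion
    print(dict(zip(["positive", "negative", "neutral"], observed.round(3))))
    # -> {'positive': 0.245, 'negative': 0.385, 'neutral': 0.37}

That is a 14-point observed “negative majority” generated from a true tie, which is exactly the failure mode the paper’s conclusion needs to rule out.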
– Who labeled the 1500 training tweets, and what was their inter-rater reliability? If a single person did the labeling (likely, given the author list), the ground truth is just one person’s opinion, not a valid gold standard.
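The standard remedy is cheap to run: have at least two annotators independently label an overlapping subset and report Cohen’s kappa. A minimal sketch, with placeholder labels standing in for real annotations:

    from sklearn.metrics import cohen_kappa_score

    # Hypothetical labels from two independent annotators on the same 10 tweets.
    annotator_a = ["pos", "neg", "neu", "neg", "pos", "neu", "neg", "pos", "neu", "neg"]
    annotator_b = ["pos", "neg", "neg", "neg", "neu", "neu", "neg", "pos", "neu", "pos"]

    kappa = cohen_kappa_score(annotator_a, annotator_b)
    # A common rule of thumb treats kappa below ~0.6 as weak agreement,
    # i.e., labels too unreliable to serve as a gold standard.
    print(f"Cohen's kappa: {kappa:.2f}")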
– You excluded non-classifiable tweets. What percentage of the total was discarded, and could that exclusion bias the results? For example, sarcastic or ambiguous tweets about psychiatrists might be harder to classify and therefore systematically excluded.
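The exclusion rate is easy to report per search term, which would show whether discards fall disproportionately on one profession. A sketch under assumed column names (the paper does not describe its pipeline at this level):

    import pandas as pd

    # Hypothetical frame: one row per collected tweet, with the search term
    # that retrieved it and whether the classifier could assign a label.
    tweets = pd.DataFrame({
        "search_term":  ["psychiatrist", "psychiatrist", "psychologist",
                         "therapist", "psychiatrist"],
        "classifiable": [False, True, True, True, False],
    })

    # Share of tweets discarded, overall and per profession.
    print("overall excluded:", 1 - tweets["classifiable"].mean())
    print((1 - tweets.groupby("search_term")["classifiable"].mean())
          .rename("excluded_rate"))

If the excluded share for psychiatrist-related tweets is markedly higher than for the other professions, the reported sentiment split is conditioned on a biased subsample and should be flagged as such.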