<aside>
This post explains why 91.7% agreement can still leave room for doubt when labels are highly imbalanced. You’ll see (1) what raw agreement measures, (2) how Cohen’s kappa adjusts for chance agreement, and (3) what to report so reliability claims stay interpretable under skewed base rates.
</aside>
Amir’s Tenacious-Bench reliability check reports:
That sounds excellent—until you look at label prevalence:
When one label dominates, two passes can agree frequently even if they’re not very discriminative—because “correct” is the default answer.
Think of agreement as having two components:

- agreement you would expect by chance alone, given how skewed the label base rates are, and
- agreement beyond chance, which is the part Cohen’s kappa isolates.
Here’s the logic chain:
```mermaid
flowchart TD
    A["Labels are highly imbalanced<br>(most are 'correct')"] --> B["Expected chance agreement becomes high"]
    B --> C["Raw agreement looks strong"]
    B --> D["Kappa has less 'headroom' to increase"]
    C --> E["Headline can overstate reliability"]
    D --> F["Kappa can look only moderate<br>even with few disagreements"]
```
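To make the chain concrete, here’s a minimal Python sketch. The counts are made up for illustration (they are not Amir’s actual Tenacious-Bench numbers); the point is that when “correct” dominates both passes, observed agreement comes out high, expected chance agreement is already high, and kappa lands only in the moderate range.

```python
def cohens_kappa(table):
    """Observed agreement, expected chance agreement, and Cohen's kappa
    from a square confusion table: table[i][j] = count of items labeled
    i by pass 1 and j by pass 2."""
    n = sum(sum(row) for row in table)
    p_o = sum(table[i][i] for i in range(len(table))) / n          # observed agreement
    row_marginals = [sum(row) / n for row in table]                  # pass-1 base rates
    col_marginals = [sum(col) / n for col in zip(*table)]           # pass-2 base rates
    p_e = sum(r * c for r, c in zip(row_marginals, col_marginals))  # chance agreement
    return p_o, p_e, (p_o - p_e) / (1 - p_e)

# Hypothetical two-pass table over 100 items; "correct" dominates both passes.
#                 pass 2: correct   incorrect
table = [
    [88, 4],   # pass 1: correct
    [4,  4],   # pass 1: incorrect
]

p_o, p_e, kappa = cohens_kappa(table)
print(f"observed agreement p_o = {p_o:.3f}")   # 0.920 -- looks strong
print(f"chance agreement   p_e = {p_e:.3f}")   # 0.853 -- already high from the skew
print(f"Cohen's kappa          = {kappa:.3f}") # 0.457 -- only moderate headroom left
```

Same disagreements, two very different-sounding numbers: that gap is exactly what the flowchart describes.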
Raw agreement answers:
How often did the two passes give the same label?
It’s operationally meaningful. If only 7 out of 83 labels flipped, the rubric isn’t wildly unstable.
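Concretely, raw agreement is just matches divided by total items. A small sketch using the 7-of-83 figure above, which lands in the same ballpark as the headline number:

```python
# Raw agreement: the fraction of items where the two passes gave the same label.
total = 83        # labeled items (from the figure above)
flipped = 7       # items where the second pass disagreed
raw_agreement = (total - flipped) / total
print(f"raw agreement = {raw_agreement:.3f}")  # ~0.916: high, but silent on how hard agreement was
```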
But raw agreement doesn’t ask whether agreement was easy to obtain. With a dominant label, agreement can be high by default.