<aside>
This post explains why 91.7% agreement can still leave room for doubt when labels are highly imbalanced. You’ll see (1) what raw agreement measures, (2) how Cohen’s kappa adjusts for chance agreement, and (3) what to report so reliability claims stay interpretable under skewed base rates.
</aside>
Amir’s Tenacious-Bench reliability check reports:
That sounds excellent—until you look at label prevalence:
When one label dominates, two passes can agree frequently even if they’re not very discriminative—because “correct” is the default answer.
Think of agreement as having two components:

- agreement you would expect by chance alone, given how skewed the label base rates are, and
- agreement beyond chance, which is the part Cohen’s kappa isolates.
Here’s the logic chain:
```mermaid
flowchart TD
    A["Labels are highly imbalanced<br>(most are 'correct')"] --> B["Expected chance agreement becomes high"]
    B --> C["Raw agreement looks strong"]
    B --> D["Kappa has less 'headroom' to increase"]
    C --> E["Headline can overstate reliability"]
    D --> F["Kappa can look only moderate<br>even with few disagreements"]
```
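To make the chain concrete, here’s a minimal Python sketch. The counts are made up for illustration (they are not Amir’s actual Tenacious-Bench numbers); the point is that when “correct” dominates both passes, observed agreement comes out high, expected chance agreement is already high, and kappa lands only in the moderate range.

```python
def cohens_kappa(table):
    """Observed agreement, expected chance agreement, and Cohen's kappa
    from a square confusion table: table[i][j] = count of items labeled
    i by pass 1 and j by pass 2."""
    n = sum(sum(row) for row in table)
    p_o = sum(table[i][i] for i in range(len(table))) / n          # observed agreement
    row_marginals = [sum(row) / n for row in table]                  # pass-1 base rates
    col_marginals = [sum(col) / n for col in zip(*table)]           # pass-2 base rates
    p_e = sum(r * c for r, c in zip(row_marginals, col_marginals))  # chance agreement
    return p_o, p_e, (p_o - p_e) / (1 - p_e)

# Hypothetical two-pass table over 100 items; "correct" dominates both passes.
#                 pass 2: correct   incorrect
table = [
    [88, 4],   # pass 1: correct
    [4,  4],   # pass 1: incorrect
]

p_o, p_e, kappa = cohens_kappa(table)
print(f"observed agreement p_o = {p_o:.3f}")   # 0.920 -- looks strong
print(f"chance agreement   p_e = {p_e:.3f}")   # 0.853 -- already high from the skew
print(f"Cohen's kappa          = {kappa:.3f}") # 0.457 -- only moderate headroom left
```

Same disagreements, two very different-sounding numbers: that gap is exactly what the flowchart describes.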
Raw agreement answers:
How often did the two passes give the same label?
It’s operationally meaningful. If only 7 out of 83 labels flipped, the rubric isn’t wildly unstable.
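Concretely, raw agreement is just matches divided by total items. A small sketch using the 7-of-83 figure above, which lands in the same ballpark as the headline number:

```python
# Raw agreement: the fraction of items where the two passes gave the same label.
total = 83        # labeled items (from the figure above)
flipped = 7       # items where the second pass disagreed
raw_agreement = (total - flipped) / total
print(f"raw agreement = {raw_agreement:.3f}")  # ~0.916: high, but silent on how hard agreement was
```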
But raw agreement doesn’t ask whether agreement was easy to obtain. With a dominant label, agreement can be high by default.