Validation

Where KICK agrees with the world, where it doesn't

We score every KICK Rating version against external signals — Brownlow top-10, All-Australian squads, AFLCA Champion Player winners. No rating system ships before we measure it.

What this page is

Any rating system can claim to be good. We score ours, publish the results, and commit to not shipping a new version unless the scores improve on binding pass criteria. This page is the permanent record.

Every validation run takes five internal metrics (computed from our own data) and three external metrics (the Brownlow, All-Australian and AFLCA award signals). The code is in validate_kick.py in our repo and runs in under a minute.

v1.1 results (current)

External signals

Brownlow top-10 overlap: 4.36/10 average across 2000–2025 seasons.
All-Australian squad overlap: 9.09/22 (42.7%) average across 1991–2025. KICK top-22 and AA 22 agree on fewer than half their selections — the strongest signal that our rating needs position-aware work.
AFLCA Champion Player winner in KICK top-10: 82.6% hit rate (19/23 years since 2003). The four misses: Barry Hall (2005), Robbie Gray (2014), Dan Hannebery (2015), Zak Butters (2023). Two of the four are forwards — consistent with our internal position-gap finding.
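To illustrate what these external scores measure, here is a minimal Python sketch. It is not the code in validate_kick.py: the function names, the data layout and the toy seasons (placeholder player IDs, top-3 rather than top-10) are assumptions made for the example.

```python
from statistics import mean

def top_n_overlap(kick_ranked, award_names, n=10):
    """How many of the KICK top-n also appear in the award list."""
    return len(set(kick_ranked[:n]) & set(award_names))

def winner_hit_rate(kick_by_year, winner_by_year, n=10):
    """Share of years in which the award winner sits inside the KICK top-n."""
    hits = sum(1 for year, winner in winner_by_year.items()
               if winner in kick_by_year[year][:n])
    return hits / len(winner_by_year)

# Toy data with placeholder IDs (hypothetical, not real seasons).
kick_by_year = {
    2001: ["p1", "p2", "p3", "p4"],
    2002: ["p5", "p6", "p7", "p8"],
}
brownlow_by_year = {
    2001: ["p2", "p3", "p9"],
    2002: ["p5", "p0", "p7"],
}
winner_by_year = {2001: "p3", 2002: "p9"}

avg_overlap = mean(
    top_n_overlap(kick_by_year[y], brownlow_by_year[y], n=3)
    for y in kick_by_year
)
hit_rate = winner_hit_rate(kick_by_year, winner_by_year, n=3)
print(avg_overlap, hit_rate)

# Arithmetic behind the published AFLCA figure: 19 hits in 23 years.
published_aflca_pct = round(19 / 23 * 100, 1)
print(published_aflca_pct)
```

The same overlap function covers both the Brownlow top-10 and the All-Australian 22; only n and the award list change.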

Internal signals

Position-coherence gaps (mean career KICK across players with 50+ games):
- Midfielder: 59.5 baseline
- Defender: 36.8 — 22.6 pts below mid
- Ruck: 39.3 — 20.2 pts below mid
- Forward: 27.5 — 31.9 pts below mid (the biggest visible gap)
Era gap (1890s → 2020s top-100 mean): 45.8 points. Expected, given pre-2000 data sparsity. v1.6 looked at era rebaseline (offline) and held — both naive and percentile-matched approaches produced indefensible top-20 lists; the structural ceiling extends to era as well. Pre-2000 readers are routed to decade leaderboards and positional all-time leaderboards where era fairness already exists.
Single-game volatility: forwards account for 30% of the top-100 single-game scores, which is within the healthy range; goals are not over-weighted in aggregate.
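The position-coherence gap is a simple grouped mean. A hedged sketch follows, assuming each player record carries a position label, a games count and a career rating; those field names and the sample values are hypothetical, and the real computation may differ in detail.

```python
from collections import defaultdict
from statistics import mean

def position_gaps(players, baseline="Midfielder", min_games=50):
    """Mean career rating per position, as points below the baseline position."""
    ratings = defaultdict(list)
    for p in players:
        if p["games"] >= min_games:  # the 50+ games floor from the text
            ratings[p["position"]].append(p["career_kick"])
    means = {pos: mean(vals) for pos, vals in ratings.items()}
    return {pos: round(means[baseline] - m, 1)
            for pos, m in means.items() if pos != baseline}

# Hypothetical players, not real KICK values.
players = [
    {"position": "Midfielder", "games": 210, "career_kick": 60.0},
    {"position": "Midfielder", "games": 95, "career_kick": 58.0},
    {"position": "Defender", "games": 150, "career_kick": 37.0},
    {"position": "Forward", "games": 120, "career_kick": 30.0},
    {"position": "Forward", "games": 40, "career_kick": 90.0},  # under the floor, ignored
]
gaps = position_gaps(players)
print(gaps)
```

A positive number means the position sits that many points below the midfielder baseline, which is how the figures in the list above are expressed.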

v1.2 — what we tried, why we held

v1.2's target was to narrow the three position gaps without regressing Brownlow alignment. Binding pass criteria: D ≤ 19.0, R ≤ 16.0, F ≤ 27.0, Brownlow ≥ 4.1 (no more than −0.3 from v1.1).

We ran a 48-configuration weight sweep across five weights. Result: top configs narrowed the defender gap to ~18.9, but the forward gap barely moved (best 31.4 vs target 27.0). Brownlow alignment dropped 0.4–0.6 points across every top config. No config met all four criteria simultaneously.

Decision: held. Full sweep in data/validation/sweep/results.md.
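The hold decision is mechanical once the criteria are fixed in advance. The sketch below encodes the four v1.2 criteria from the text and reports which ones a candidate fails. The metric key names are assumptions. The candidate uses the defender and forward figures quoted above; its ruck value is a stand-in (v1.1's 20.2, since the sweep's ruck result is not reported here), and its Brownlow value is v1.1's 4.36 minus the smallest reported drop of 0.4.

```python
import operator

# v1.2 binding pass criteria as published above: (comparison, bound).
V12_CRITERIA = {
    "defender_gap": (operator.le, 19.0),
    "ruck_gap": (operator.le, 16.0),
    "forward_gap": (operator.le, 27.0),
    "brownlow_top10": (operator.ge, 4.1),
}

def failed_criteria(metrics, criteria=V12_CRITERIA):
    """Return the names of every criterion the metrics do not satisfy."""
    return [name for name, (compare, bound) in criteria.items()
            if not compare(metrics[name], bound)]

def ships(metrics, criteria=V12_CRITERIA):
    """A version ships only if no criterion fails."""
    return not failed_criteria(metrics, criteria)

# Illustrative candidate: defender and forward gaps from the text,
# ruck gap and Brownlow are stand-in values (see lead-in).
candidate = {
    "defender_gap": 18.9,
    "ruck_gap": 20.2,
    "forward_gap": 31.4,
    "brownlow_top10": 3.96,
}
candidate_failures = failed_criteria(candidate)
print(candidate_failures, ships(candidate))
```

Because every criterion must hold simultaneously, a config that beats the defender target still fails the gate if the forward gap or Brownlow alignment misses.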

v1.3 — same verdict, deeper explanation

v1.3 moved past weight tuning to three new mechanisms, with an expanded six-criterion bar (added AA squad overlap ≥ 50% and AFLCA winner hit rate ≥ 87%).

Role-aware multipliers. Auto-classify each player's position from their stat profile; apply position-specific multipliers (defender: one_percenters ×2.5; forward: marks_inside_50 ×2.5; ruck: hit_outs ×1.4).
Scoring-involvement composite. Forward-specific raw-score addition combining goals + goal_assists + 0.5×marks_inside_50.
Era normaliser. Separate divisors for <1965 / 1965–1999 / 2000+ to address the 45.8-point era gap.
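A rough sketch of the three mechanisms, using the multipliers and era bands listed above. How the real pipeline classifies roles, where in the scoring the multipliers are applied, and what the era divisors are is not stated here, so the function names and the sample stat lines are assumptions.

```python
# Position-specific multipliers as listed above.
ROLE_MULTIPLIERS = {
    "defender": {"one_percenters": 2.5},
    "forward": {"marks_inside_50": 2.5},
    "ruck": {"hit_outs": 1.4},
}

def apply_role_multipliers(stats, role):
    """Scale the stats a role is credited for; leave everything else unchanged."""
    multipliers = ROLE_MULTIPLIERS.get(role, {})
    return {name: value * multipliers.get(name, 1.0) for name, value in stats.items()}

def scoring_involvement(stats):
    """Forward composite: goals + goal_assists + 0.5 x marks_inside_50."""
    return (stats.get("goals", 0)
            + stats.get("goal_assists", 0)
            + 0.5 * stats.get("marks_inside_50", 0))

def era_band(year):
    """Pick the era band; the divisor attached to each band is not published here."""
    if year < 1965:
        return "<1965"
    if year < 2000:
        return "1965-1999"
    return "2000+"

# Hypothetical single-game stat lines.
forward_game = {"goals": 3, "goal_assists": 1, "marks_inside_50": 4}
defender_game = {"one_percenters": 8, "marks_inside_50": 1}

forward_scaled = apply_role_multipliers(forward_game, "forward")
defender_scaled = apply_role_multipliers(defender_game, "defender")
forward_composite = scoring_involvement(forward_game)
print(forward_scaled, defender_scaled, forward_composite)
```

Each mechanism lifts a non-midfield role relative to midfielders, which is what sets up the trade-off against award alignment that the sweep results below describe.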

We ran a 32-configuration mechanism sweep. Zero configs cleared all six criteria.

Pattern across all 32:

Position gaps responded strongly to role_aware + era_on (best config: D12.5, R10.5, F12.5 — crushing all three gap targets).
Every such config tanked Brownlow and AFLCA alignment — Brownlow dropped as low as 2.67, AFLCA to 56.5%.
AA squad overlap peaked at 45.23%. The 50% target was unreachable with available mechanisms.
AFLCA came within 0.04 percentage points of the 87% target at rank 8 (role_mild configs), but paid for it in Brownlow alignment and the defender gap.

The tension is structural. Every mechanism that lifts defenders/rucks/forwards relative to midfielders pushes mids out of year-by-year top-N rankings, which drags Brownlow and AFLCA down (both signals reward midfielders). You can't close the position gap AND hold top-10 alignment with the same lever.

Decision: held. Same process as v1.2 — criteria published in advance, criteria not met, no ship.

v1.4 — held; structural ceiling confirmed

v1.2 taught us weights aren't enough. v1.3 taught us mechanisms layered on box-score data aren't enough either. v1.4's hypothesis was that the ceiling is the data — specifically, that AFL Tables lacks the fields that distinguish quality of contribution from quantity, and that with hit-out-to-advantage, intercept marks, score involvements and contested-mark-in-traffic plumbed in, the position-vs-Brownlow tension would relax.

Data acquisition succeeded beyond expectation. AFL.com.au's match centre exposes a JSON feed gated only by an anonymous token that the public site fetches itself. Seven advanced fields populated for every AFL Premiership match back to 2012 (not the worst-case 2018+ we'd planned for): intercept marks, score involvements, hit-outs-to-advantage, intercepts, metres gained, pressure acts, and f50 ground-ball gets. 2,879 matches scraped at ~0.35s throttle, reconciled to AFL Tables match IDs at 99.76% via fuzzy name matching. 18.2% of all 678,721 career player-games carry these fields; the other 81.8% pre-date 2012 and retain v1.1 scoring.
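For readers unfamiliar with fuzzy name reconciliation, the sketch below shows the general idea using Python's standard difflib. It is a stand-in technique, not our reconciliation code: the normalisation rules, the 0.75 cutoff and the sample spellings are all assumptions chosen for illustration.

```python
from difflib import get_close_matches

def normalise(name):
    """Lower-case, drop dots and hyphens, collapse whitespace."""
    return " ".join(name.lower().replace(".", " ").replace("-", " ").split())

def reconcile(feed_names, table_names, cutoff=0.75):
    """Map each feed name to its closest table name, or record it as unmatched."""
    lookup = {normalise(n): n for n in table_names}
    matched, unmatched = {}, []
    for name in feed_names:
        hit = get_close_matches(normalise(name), list(lookup), n=1, cutoff=cutoff)
        if hit:
            matched[name] = lookup[hit[0]]
        else:
            unmatched.append(name)
    return matched, unmatched

# Illustrative spelling variants only; not taken from either data source.
table_names = ["Zac Butters", "Marcus Bontempelli"]
feed_names = ["Zak Butters", "M. Bontempelli", "Unknown Player"]

matched, unmatched = reconcile(feed_names, table_names)
match_rate = len(matched) / len(feed_names)
print(matched, unmatched, round(match_rate, 4))
```

In practice a name match would be scoped to a single match and club, so that two players with similar names on different teams cannot be confused; anything left unmatched is what keeps a reconciliation rate below 100%.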

The sweep failed cleanly. It covered 972 configurations: five new-field weights at three levels each (3⁵ = 243 weight combos), crossed with four v1.3 mechanism settings (243 × 4 = 972). Zero of 972 cleared all six binding pass criteria.

Position gaps responded strongly: the best config closed the defender gap from 22.17 to 11.92 (a 46% reduction), and the forward gap fell to 21.34–22.88 (below the 24 target). Intercept_marks at weight 5.0 alone moved the defender gap more than any v1.3 mechanism combination.
Brownlow regressed everywhere. Every top-50 config dropped Brownlow below the 3.811 floor. The only config that held Brownlow at 4.111 was the v1.1 baseline — which fails all three gap criteria.
AA overlap hit a hard ceiling at 45.85%. No config in any combination reached 50%. AA selection is itself highly correlated with Brownlow vote distribution.
AFLCA bifurcated. Configs weighting hit-outs-to-advantage crashed AFLCA to 65–74%. Configs without it held v1.1's 82.61%. The 87% target was unreachable.
Score involvements were nearly null. SI at any weight changed results by ≤ 0.5 points on any metric — v1.1 already captures most forward creation through goals + goal_assists + marks_inside_50.
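The size of the v1.4 grid follows directly from its design, and can be checked by enumerating it. In this sketch only three of the weighted fields are named in the text; the other two field names, the weight levels (other than 5.0) and the four mechanism labels are placeholders.

```python
from itertools import product

# Three fields named in the findings above, plus two placeholders.
WEIGHT_FIELDS = [
    "intercept_marks",
    "score_involvements",
    "hit_outs_to_advantage",
    "placeholder_field_4",
    "placeholder_field_5",
]
LEVELS = (0.0, 2.5, 5.0)  # hypothetical levels; only 5.0 is quoted in the text
MECHANISM_SETTINGS = ("setting_a", "setting_b", "setting_c", "setting_d")  # placeholders

def build_grid():
    """Every weight combination crossed with every mechanism setting."""
    grid = []
    for weights in product(LEVELS, repeat=len(WEIGHT_FIELDS)):
        for mechanism in MECHANISM_SETTINGS:
            grid.append({
                "weights": dict(zip(WEIGHT_FIELDS, weights)),
                "mechanism": mechanism,
            })
    return grid

grid = build_grid()
weight_combos = {tuple(cfg["weights"].values()) for cfg in grid}
print(len(weight_combos), len(grid))

# Arithmetic behind the quoted defender-gap reduction.
defender_reduction_pct = round((22.17 - 11.92) / 22.17 * 100)
print(defender_reduction_pct)
```

Each configuration in such a grid would then be scored and passed through the six-criterion gate; in the real sweep none of the 972 passed.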

The interpretation: the ceiling is structural to the single-number target, not to box-score poverty. The binding criteria require the rating to simultaneously (a) rank positions proportional to their on-field impact, and (b) agree with awards that strongly over-index midfielders. Those are incompatible when midfielders genuinely win most Brownlows (they do) AND when the position-gap target says defenders deserve higher KICK scores than a ratio consistent with those awards (it does).

Three honest attempts, three clean fails against the same criteria, targeting three different things (weights, mechanisms, data). Decision: held. v1.1 stays live as the headline metric.

What we shipped instead: Positional KICK (KICK-M / KICK-D / KICK-F / KICK-R) as a parallel rating, calibrated within position. The defender problem is solved at the surface (KICK-D ranks defenders against defenders) without breaking Brownlow alignment on the headline. v1.6's pre-2000 era rebaseline was investigated offline 2026-04-24 and also held — the same structural ceiling extends to era. v1.7 (game-level WOWY) and v1.9 (career-greatness composite) are the candidate next moves; they target different gaps that don't conflict with award alignment.

Full sweep results: data/validation/sweep_v14/results.md in the repo, plus the long-form findings in V1.4_FINDINGS.md.

v1.7 — game-level WOWY, also held

A different angle entirely. Where v1.2–v1.4 were trying to fix overall KICK from the inside, v1.7 attempted a parallel metric — WOWY (With Or Without You) — that measures something KICK can't see: the average margin a player's team posts in games they appear in vs games they miss, within their career span. The hypothesis: irreplaceable defensive leaders (Steven May archetype) and glue-guys whose stats look ordinary but whose teams collapse without them should surface here, even when KICK's volume-based weights miss them.
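In its simplest form the metric is a difference of two means. The sketch below is a minimal illustration, not compute_wowy.py: it assumes a flat list of the team's games with a year, a margin and a lineup, assumes the player stays at one club, and uses made-up data.

```python
from statistics import mean

def wowy(team_games, player, first_year, last_year):
    """Average team margin with the player minus average margin without,
    restricted to games inside the player's career span."""
    span = [g for g in team_games if first_year <= g["year"] <= last_year]
    with_player = [g["margin"] for g in span if player in g["lineup"]]
    without_player = [g["margin"] for g in span if player not in g["lineup"]]
    if not with_player or not without_player:
        return None  # undefined when one of the two samples is empty
    return mean(with_player) - mean(without_player)

# Hypothetical games for one team; "X" is a placeholder player.
team_games = [
    {"year": 2019, "margin": 10, "lineup": {"X", "A", "B"}},
    {"year": 2020, "margin": 20, "lineup": {"X", "A", "C"}},
    {"year": 2020, "margin": -15, "lineup": {"A", "B", "C"}},
    {"year": 2021, "margin": -5, "lineup": {"A", "B", "C"}},
    {"year": 2023, "margin": 50, "lineup": {"A", "B", "C"}},  # after X's career, ignored
]
score = wowy(team_games, "X", first_year=2019, last_year=2021)
print(score)
```

A player who never misses a game inside their career span has no "without" sample at all, which is why eligibility floors are needed before any ranking can be produced.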

All compute used data already on disk — lineups in data/parsed/matches/*.json plus career-span info from KICK ratings. No new scrape required.

Two compute attempts both failed face validity. WOWY V1.0 used a 100-game career floor, a 20-missed-game floor, and per-season averaging (intended to mitigate team-strength drift). The top 5 was Lee Spurr, Isaac Heeney, Herb Henderson, Clyde Laidlaw and Luke Ryan, dominated by short-career players in dynasty teams. 0 of 10 face-validity target stars made the top 20. Bontempelli, Petracca and Daicos were ineligible (too healthy: each missed under 20 games). Andrews ranked #2131 of 2,332 with a WOWY of −41.

WOWY V1.1 raised the career floor to 200 games, lowered the missed-game floor to 12, and dropped per-season averaging (it made noise worse, not better). The top 20 was more defensible (Cunnington, Voss, Garry Lyon, Robert Harvey, Burgoyne, Dustin Fletcher, David Mundy, Bontempelli, May), but only 2 of 10 face-validity targets made it. Sicily, McGovern and Daicos were all under the 200-game floor; Andrews, Cripps, Oliver and Gawn were all eligible but ranked deep.

A 9-combination parameter sweep then tested every reasonable floor pair from 100/10 up to 250/12 (career floor / missed-game floor). The best result was 175/12 with 3 of 10 hits (McGovern, May, Bontempelli). The gate required 6; no combination reached 4. Lower floors brought back short-career noise; higher floors excluded healthy stars who never miss enough games. No parameter combination within this metric design produces a defensible top 20.
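The floor trade-off can be seen in a few lines. This is a hedged sketch of a floor sweep with a face-validity gate: the player records are invented, the field names are assumptions, and the toy uses a top-2 with a gate of 2 where the real check used a top 20 with a gate of 6.

```python
def eligible(players, career_floor, missed_floor):
    """Keep players with enough career games and enough missed games."""
    return [p for p in players
            if p["games"] >= career_floor and p["missed"] >= missed_floor]

def face_validity_hits(ranked_names, targets, top_n):
    """How many pre-named target players land in the top-n."""
    return len(set(ranked_names[:top_n]) & set(targets))

def sweep(players, targets, floor_pairs, gate, top_n):
    """Score every (career floor, missed floor) pair against the gate."""
    results = {}
    for career_floor, missed_floor in floor_pairs:
        pool = eligible(players, career_floor, missed_floor)
        ranked = [p["name"] for p in
                  sorted(pool, key=lambda p: p["wowy"], reverse=True)]
        hits = face_validity_hits(ranked, targets, top_n)
        results[(career_floor, missed_floor)] = (hits, hits >= gate)
    return results

# Invented players illustrating the two failure modes described above.
players = [
    {"name": "star_healthy", "games": 250, "missed": 8, "wowy": 30},
    {"name": "star_young", "games": 120, "missed": 15, "wowy": 25},
    {"name": "short_career", "games": 105, "missed": 30, "wowy": 40},
    {"name": "veteran", "games": 260, "missed": 20, "wowy": 10},
]
targets = {"star_healthy", "star_young"}

results = sweep(players, targets, floor_pairs=[(100, 10), (250, 12)], gate=2, top_n=2)
print(results)
```

With the low floors the short-career player tops the list; with the high floors both stars drop out of the pool, one for too few career games and one for too few missed games. Neither setting clears the gate.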

The structural ceiling. Three compute attempts hit the same wall. Game-level WOWY is fundamentally noisier in AFL than in sports where it's been validated:

23 games per season (vs NBA's 82) — small “without” samples for healthy stars.
22-player rosters with depth + interchange — single-player absences are partially absorbed in ways game-level data can't see.
Coaches actively rebalance tactics around an absence — the team self-corrects within the game.
Co-absences cluster (injury periods, rest weeks, end-of-career declines correlate with team weakness) — the “without” sample is rarely a clean comparison.
Career trajectory + team trajectory often correlate — Andrews' career improved as Brisbane improved; his “without” cohort clusters in pre-improvement years. The metric reads team-strength drift as “player has no impact” when the reality is “player and team peaked together.”

NBA WOWY (and its descendants like RAPM) work because they have minute-precise on/off-court tracking, 30+ missed games per star per season, and clean replacement causality. AFL public data offers none of those. Champion Data has the rotation feeds; we don't.

Decision: held permanently. Same discipline as v1.2 / v1.3 / v1.4 / v1.6 — criteria published, criteria not met, no ship. The compute pipeline (compute_wowy.py), cached match index, validation harness, and parameter sweep all stay on disk for any future revisit. The relevant revisit conditions: Champion Data rotation feeds become accessible at reasonable cost, OR per-quarter scoring chains become public from AFL.com.au, OR a Bayesian / expected-margin restructure proves to mitigate the team-context confound (none on the horizon). Long-form writeup: V1.7_PLAN.md in the repo.

What did ship from the v1.7 work: the Career-best games block on every player profile — the top 10 highest-KICK individual matches a player ever had, expandable to top 20. It was in scope for the v1.7 launch, doesn't depend on WOWY, and is independently good. See any active player profile (e.g. Marcus Bontempelli).

Can you help?

Four holds confirm the single-number ceiling on volume-stat box scores and the noise ceiling on game-level WOWY. Plenty of work is left at the surface layer: better visualisations, per-position content, era-aware framing, validation against other public ratings (Wheelo, HPN, Squiggle). If you've got data, an angle, or a critique, email hello@kicker.au.

The validation code and full reports are in the data/validation/ directory of our repo. In the interest of full transparency, this is the work we show alongside the rating.