Validation

Where KICK agrees with the world, where it doesn't

We score every KICK Rating version against external signals — Brownlow top-10, All-Australian squads, AFLCA Champion Player winners. No rating system ships before we measure it.

What this page is

Any rating system can claim to be good. We score ours, publish the results, and commit to not shipping a new version unless the scores improve on binding pass criteria. This page is the permanent record.

Five internal metrics (computed from our own data) and three external metrics (external authoritative signals) feed into every validation run. The code is in validate_kick.py in our repo and runs in under a minute.

v1.1 results (current)

External signals

Internal signals

v1.2 — what we tried, why we held

v1.2's target was to narrow the three position gaps without regressing Brownlow alignment. Binding pass criteria: D ≤ 19.0, R ≤ 16.0, F ≤ 27.0, Brownlow ≥ 4.1 (no more than −0.3 from v1.1).

We ran a 48-configuration weight sweep across five weights. Result: top configs narrowed the defender gap to ~18.9, but the forward gap barely moved (best 31.4 vs target 27.0). Brownlow alignment dropped 0.4–0.6 points across every top config. No config met all four criteria simultaneously.

Decision: held. Full sweep in data/validation/sweep/results.md.

v1.3 — same verdict, deeper explanation

v1.3 moved past weight tuning to three new mechanisms, with an expanded six-criterion bar (added AA squad overlap ≥ 50% and AFLCA winner hit rate ≥ 87%).

A 32-configuration mechanism sweep. Zero configs cleared all six criteria.

Pattern across all 32:

The tension is structural. Every mechanism that lifts defenders/rucks/forwards relative to midfielders pushes mids out of year-by-year top-N rankings, which drags Brownlow and AFLCA down (both signals reward midfielders). You can't close the position gap AND hold top-10 alignment with the same lever.

Decision: held. Same process as v1.2 — criteria published in advance, criteria not met, no ship.

v1.4 — held; structural ceiling confirmed

v1.2 taught us weights aren't enough. v1.3 taught us mechanisms layered on box-score data aren't enough either. v1.4's hypothesis was that the ceiling is the data — specifically, that AFL Tables lacks the fields that distinguish quality of contribution from quantity, and that with hit-out-to-advantage, intercept marks, score involvements and contested-mark-in-traffic plumbed in, the position-vs-Brownlow tension would relax.

Data acquisition succeeded beyond expectation. AFL.com.au's match centre exposes a JSON feed gated only by an anonymous token that the public site fetches itself. Seven advanced fields populated for every AFL Premiership match back to 2012 (not the worst-case 2018+ we'd planned for): intercept marks, score involvements, hit-outs-to-advantage, intercepts, metres gained, pressure acts, and f50 ground-ball gets. 2,879 matches scraped at ~0.35s throttle, reconciled to AFL Tables match IDs at 99.76% via fuzzy name matching. 18.2% of all 678,721 career player-games carry these fields; the other 81.8% pre-date 2012 and retain v1.1 scoring.

The sweep failed cleanly. 972 configurations — five new-field weights at three levels each (35 = 243 weight combos) crossed with four v1.3 mechanism toggles. Zero of 972 cleared all six binding pass criteria.

The interpretation: the ceiling is structural to the single-number target, not to box-score poverty. The binding criteria require the rating to simultaneously (a) rank positions proportional to their on-field impact, and (b) agree with awards that strongly over-index midfielders. Those are incompatible when midfielders genuinely win most Brownlows (they do) AND when the position-gap target says defenders deserve higher KICK scores than a ratio consistent with those awards (it does).

Three honest attempts, three clean fails against the same criteria, targeting three different things (weights, mechanisms, data). Decision: held. v1.1 stays live as the headline metric.

What we shipped instead: Positional KICK (KICK-M / KICK-D / KICK-F / KICK-R) as a parallel rating, calibrated within position. The defender problem is solved at the surface (KICK-D ranks defenders against defenders) without breaking Brownlow alignment on the headline. v1.6's pre-2000 era rebaseline was investigated offline 2026-04-24 and also held — the same structural ceiling extends to era. v1.7 (game-level WOWY) and v1.9 (career-greatness composite) are the candidate next moves; they target different gaps that don't conflict with award alignment.

Full sweep results: data/validation/sweep_v14/results.md in the repo, plus the long-form findings in V1.4_FINDINGS.md.

v1.7 — game-level WOWY, also held

A different angle entirely. Where v1.2–v1.4 were trying to fix overall KICK from the inside, v1.7 attempted a parallel metric — WOWY (With Or Without You) — that measures something KICK can't see: the average margin a player's team posts in games they appear in vs games they miss, within their career span. The hypothesis: irreplaceable defensive leaders (Steven May archetype) and glue-guys whose stats look ordinary but whose teams collapse without them should surface here, even when KICK's volume-based weights miss them.

All compute used data already on disk — lineups in data/parsed/matches/*.json plus career-span info from KICK ratings. No new scrape required.

Two compute attempts, both failed face validity. V1.0 used 100-game career floor with 20-missed-game floor and per-season averaging (intended to mitigate team-strength drift). Top 5 was Lee Spurr, Isaac Heeney, Herb Henderson, Clyde Laidlaw, Luke Ryan — dominated by short-career players in dynasty teams. 0 of 10 face-validity target stars made the top 20. Bontempelli, Petracca, Daicos were ineligible (too healthy, missed under 20 games). Andrews ranked #2131 of 2,332 with WOWY −41.

V1.1 re-tuned with raised career floor (200 games), lowered missed floor (12 games), and dropped per-season averaging (it made noise worse, not better). Top 20 was more defensible (Cunnington, Voss, Garry Lyon, Robert Harvey, Burgoyne, Dustin Fletcher, David Mundy, Bontempelli, May) — but only 2 of 10 face-validity targets in top 20. Sicily, McGovern, Daicos all under the 200-game floor; Andrews, Cripps, Oliver, Gawn all eligible but ranked deep.

A 9-combination parameter sweep then tested every reasonable floor-pair from 100/10 up to 250/12. Best result was 175/12 with 3 of 10 hits (McGovern, May, Bontempelli). The gate required 6. No combination cleared 4. Lower floors brought back short-career noise; higher floors excluded healthy stars who never miss enough games. There's no parameter combination in the metric design that produces a defensible top-20.

The structural ceiling. Three compute attempts hit the same wall. Game-level WOWY is fundamentally noisier in AFL than in sports where it's been validated:

NBA WOWY (and its descendants like RAPM) work because they have minute-precise on/off-court tracking, 30+ missed games per star per season, and clean replacement causality. AFL public data offers none of those. Champion Data has the rotation feeds; we don't.

Decision: held permanently. Same discipline as v1.2 / v1.3 / v1.4 / v1.6 — criteria published, criteria not met, no ship. The compute pipeline (compute_wowy.py), cached match index, validation harness, and parameter sweep all stay on disk for any future revisit. The relevant revisit conditions: Champion Data rotation feeds become accessible at reasonable cost, OR per-quarter scoring chains become public from AFL.com.au, OR a Bayesian / expected-margin restructure proves to mitigate the team-context confound (none on the horizon). Long-form writeup: V1.7_PLAN.md in the repo.

What did ship from the v1.7 work: the Career-best games block on every player profile — the top 10 highest-KICK individual matches a player ever had, expandable to top 20. It was in scope for the v1.7 launch, doesn't depend on WOWY, and is independently good. See any active player profile (e.g. Marcus Bontempelli).

Can you help?

Four holds confirms the single-number ceiling on volume-stat box scores AND the noise ceiling on game-level WOWY. Plenty of work left at the surface layer — better visualisations, per-position content, era-aware framing, validation against other public ratings (Wheelo, HPN, Squiggle). If you've got data, an angle, or a critique, hello@kicker.au.

The validation code and full reports are in the data/validation/ directory of our repo — full transparency, this is the work we show alongside the rating.