Upsets in football: Grimsby vs Manchester United in the EFL Cup, Denmark winning the Euros, Leicester becoming Premier League champions. There’s always a chance, so a football management game has to build that chance in. I wanted matches to feel fair, and fairness turns out to be a big part of making games fun. But I didn’t want results to be deterministic, where the better team simply always wins. So I needed a match formula.

While testing my early version, I found progression was far too RNG-dependent. Even when a team was significantly stronger than its opponents, every match across a 38-game season felt like a coinflip. Teams that should have been competing for titles were finishing mid-table. My napkin-maths approach to skill cap compression clearly wasn’t the answer. As it turns out, a logistic curve is much better suited to something like this. But before we get to what I ended up with, let’s look at where I started.

The original problem

The original idea was simple. I wanted to take into account the sorts of things that affect a real game of football: skill, the way the team is set up, form, home/away advantage, and fan morale. That all looked like this:

skill_diff = clamp((player_skill - opp_skill) * 1.5, -50, +15)
dominance  = skill_diff + matchup_mod + playstyle_mod + form_mod + location_mod + pop_mod
win_pct    = clamp(35 + dominance, 5, 90)
draw_pct   = clamp(30 - abs(dominance) * 0.4, 5, 30)
loss_pct   = 100 - win_pct - draw_pct

Nice on paper, broken in practice. Capping how much skill could affect win percentage meant rating stopped mattering past a point. Even with a huge gulf between teams (picture a Premier League side playing a non-league one) the result was effectively a coinflip. Over a full season that doesn’t just feel unrealistic, it feels bad to play. Idle games are built on progression over hundreds of hours, and a formula where the high-rated team you’ve spent months building has roughly the same odds as the squad you started with is unusable.
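To make the cap visible, here is a minimal runnable sketch of the formula above. The `clamp` helper and the single `modifiers` argument (standing in for the sum of matchup, playstyle, form, location and pop modifiers) are assumptions made for illustration; the constants are the ones quoted above.

```python
def clamp(value, low, high):
    """Restrict value to the closed range [low, high]."""
    return max(low, min(high, value))


def old_match_odds(player_skill, opp_skill, modifiers=0.0):
    """Return (win, draw, loss) percentages from the original capped formula.

    `modifiers` is a stand-in for matchup_mod + playstyle_mod + form_mod
    + location_mod + pop_mod.
    """
    skill_diff = clamp((player_skill - opp_skill) * 1.5, -50, 15)
    dominance = skill_diff + modifiers
    win_pct = clamp(35 + dominance, 5, 90)
    draw_pct = clamp(30 - abs(dominance) * 0.4, 5, 30)
    loss_pct = 100 - win_pct - draw_pct
    return win_pct, draw_pct, loss_pct


# Skill gaps of +10, +20 and +40 all hit the +15 cap on skill_diff,
# so with no other modifiers the odds are identical for all three.
for gap in (0, 5, 10, 20, 40):
    print(gap, old_match_odds(50 + gap, 50))
```

With no other modifiers, everything from a +10 gap upward produces the same tuple, which is the coinflip described above.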

Diagnosing it

The realisation that the formula’s shape was the problem came while running simulations trying to land on a realistic points-per-game (PPG) number for winning a given division. No matter how much I tweaked skill differential, home advantage, or the playstyle matchup modifier, PPG barely shifted. Even big changes to the numbers, changes that should have made a clear difference, had a negligible effect. This is how that looked:

[Figure: original capped formula. Win probability (0% to 100%) plotted against skill gap (player − opponent, −40 to +40). Annotation: cap kicks in at +10 → win % stuck at 50.]
The old formula's win probability as a function of skill gap. Past +10 skill gap the cap kicks in and rating stops mattering.
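The flat PPG can be reproduced with a few lines of arithmetic. This sketch assumes the usual 3 points for a win and 1 for a draw and sets all non-skill modifiers to zero; the function name is hypothetical.

```python
def clamp(value, low, high):
    """Restrict value to the closed range [low, high]."""
    return max(low, min(high, value))


def old_expected_ppg(skill_gap, modifiers=0.0):
    """Expected points per game under the capped formula (3 per win, 1 per draw)."""
    dominance = clamp(skill_gap * 1.5, -50, 15) + modifiers
    win_pct = clamp(35 + dominance, 5, 90)
    draw_pct = clamp(30 - abs(dominance) * 0.4, 5, 30)
    return (3 * win_pct + draw_pct) / 100


# Once the cap is reached, a bigger skill gap changes nothing.
for gap in (10, 20, 40):
    print(gap, round(old_expected_ppg(gap), 2))
```

Under these assumptions all three gaps print 1.74, so no amount of extra skill moves expected PPG once the cap is hit.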

That’s when it clicked. I was trying to tune my way out of a structural problem. I wasn’t going to fix this by adjusting parameters; I needed to change the function itself. So I stopped tuning and went to Claude Code to help me actually look at the curve.

The logistic fix

I could have kept playing with the clamps to squeeze more out of the existing formula, but it was never going to produce the results I wanted for Pocket XI. A logistic curve solves the underlying problem. It’s smooth, bounded, and has a tunable steepness, which means team skill can meaningfully affect win rate across the full range while staying within sensible probability limits. Here’s the logistic mapped against the original:

[Figure: new logistic curve vs original capped formula. Win probability (0% to 100%) plotted against skill gap (player − opponent, −40 to +40). Series: old (capped), new (logistic).]
The new logistic curve (mint) against the old capped formula (amber, dashed). The shaded region is the win probability the old formula was hiding from skilled players.

The curve uses three specific parameters, all tuned across thousands of simulations to produce realistic PPG totals across every division.

A 2.0× skill multiplier acts as a gain knob for team rating. Before the logistic gets applied, every input (skill gap, playstyle matchup, form, home advantage) is summed into one number called dominance. The multiplier decides how much of that sum comes from raw skill versus the other modifiers. At 1.0×, skill was about the same weight as the other modifiers combined, so a strong-but-mismatched team could lose to a weaker side with a good playstyle counter. Matches felt swingy because surrounding factors were drowning out the thing the player had actually been working on, which devalued their progress. At 2.0×, skill is the loudest voice in the calculation. Modifiers still add texture (a playstyle counter still matters, home advantage still nudges the result) but they shape the outcome rather than overturn it. A clearly better team usually wins, and when it doesn’t, the reason feels like a real footballing one (“they had the playstyle advantage at home”) rather than arbitrary game mechanics forcing a loss.

k = 0.0972 sets the steepness of the curve, and was the main parameter being tuned across the simulations. A higher k means a sharper curve: the transition from underdog to favourite happens over a narrow range. A lower k gives a gentler curve where even big skill gaps only nudge the win probability, pushing matches back toward coinflip territory. 0.0972 sits in a sweet spot where a meaningful skill advantage produces a meaningful but not crushing win-rate bump, and matches stay competitive at the top of the league. It was found by running simulations against real Premier League points totals until the league behaved realistically.

A shift of −3.27 moves the whole curve left along the dominance axis, which decides what dominance score corresponds to the curve’s inflection point. Without a shift, two evenly matched teams would land exactly on the steepest part of the slope, where every small modifier swings the result hard. The −3.27 offset puts a dominance of 0 in a calmer part of the curve, making neutral matchups feel even-handed rather than knife-edge.
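The three parameters can be put together in a short sketch. The exact formula isn't given above, so this assumes the standard logistic form with its inflection point at the shift value (matching "moves the whole curve left"), and it lumps every non-skill input into one hypothetical `modifiers` number. How the result is split into win, draw and loss isn't covered here.

```python
import math

SKILL_MULTIPLIER = 2.0  # gain on the raw skill gap (value from the text)
K = 0.0972              # steepness of the curve (value from the text)
SHIFT = -3.27           # dominance at the inflection point (sign convention assumed)


def dominance(skill_gap, modifiers=0.0, multiplier=SKILL_MULTIPLIER):
    """Sum every input into one number; `modifiers` lumps the non-skill terms."""
    return multiplier * skill_gap + modifiers


def win_probability(dom, k=K, shift=SHIFT):
    """Logistic curve: exactly 0.5 at dom == shift, tending to 0 and 1 at the extremes."""
    return 1.0 / (1.0 + math.exp(-k * (dom - shift)))


# A stronger team (+5 skill) facing a hypothetical -6 from playstyle and venue.
for multiplier in (1.0, 2.0):
    dom = dominance(5, modifiers=-6, multiplier=multiplier)
    print(f"multiplier {multiplier}: dominance {dom:+.1f}, win {win_probability(dom):.1%}")

# The same +10 skill gap under gentler and sharper curves.
for k in (0.05, K, 0.2):
    print(f"k={k}: win {win_probability(dominance(10), k=k):.1%}")
```

In the first loop the made-up modifiers outweigh the skill edge at 1.0× (dominance −1) but not at 2.0× (dominance +4), which is the "loudest voice" effect described above. The second loop shows the same gap producing a larger win probability as k rises.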

With the curve fixed and behaving everywhere else, a new problem appeared at the top of the table.

The Premier Division is where players spend the endgame. It’s the tenth and final division, somewhere it takes months to reach, and where players are meant to keep playing long after, chasing the title and the two European cups. The new formula was working as intended everywhere else, but at this tier it was working a bit too well. By the time a player had built a Premier-quality team, their skill rating was usually high enough to put them deep into the flat part of the curve against the average opponent. They were winning at the asymptote: high 70s, low 80s win rate. This is the sort of thing that kills endgame enjoyment by making it boring.

The temptation was to fix this by tweaking the curve. Lower k, shift the inflection, cap the multiplier above some threshold. I tried versions of all of these and they all had the same problem. Any change that calmed the Premier endgame also reshaped every other division, undoing fixes I’d already validated. The curve was right. The problem wasn’t the curve.

What I did instead was add a hidden bonus to NPC teams that scales up specifically in the Premier Division: a small, invisible buff to their skill rating that the player never sees in the UI. From the formula’s perspective it’s just a higher opponent skill, which keeps the curve untouched and the maths consistent. From the player’s perspective, Premier opponents feel a notch tougher than their on-paper rating suggests. The taper is per-division and tunable through Remote Config, so I can adjust the difficulty curve post-launch based on real player data without shipping a new build.
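A minimal sketch of that idea, assuming the bonus is a flat amount looked up by division. The bonus value, the dictionary and the function name are all hypothetical, and the per-season taper is not modelled; in the game the numbers would come from Remote Config rather than a constant.

```python
PREMIER_DIVISION = 10  # the text calls the Premier Division the tenth and final one

# Hypothetical value; a real build would fetch this from remote configuration.
HIDDEN_NPC_BONUS = {PREMIER_DIVISION: 4.0}


def effective_opponent_skill(displayed_skill, division, bonuses=HIDDEN_NPC_BONUS):
    """Skill passed to the match formula; the UI keeps showing displayed_skill."""
    return displayed_skill + bonuses.get(division, 0.0)


print(effective_opponent_skill(80, division=9))   # no bonus outside the Premier Division
print(effective_opponent_skill(80, division=10))  # bonus applied before the curve sees it
```

Because the adjustment happens before the match formula runs, the curve and its parameters stay exactly as tuned for every other division.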

The lesson: when one region of a system misbehaves, the instinct is to reshape the whole system. But systems that work everywhere except one place don’t usually need reshaping. They need a localised modifier that addresses the specific region without disturbing the rest. The hidden bonus can stay invisible because the curve underneath it is correct.

Validating it with simulation

In a perfect world I’d have trusted the maths and shipped the game. The reality is that simulating at every step gave me insight that shaped decisions I’d otherwise have got wrong. And it was important not to test against a “default” profile alone, because people play differently. Profile-based simulation lets me check that a Whale doesn’t have an unfair advantage over free-to-play players, that an Optimiser reaches the top division in the time I want, and that a player who focuses on upgrading buildings doesn’t get stuck in mid-table no matter how hard they try.
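At its simplest, this kind of validation is a Monte Carlo loop: play many seasons with fixed match odds and look at the median outcome. The sketch below is a bare-bones assumption of what such a loop could look like, not the game's simulator. It has no profiles, upgrades or promotion, only a 38-game season with given win and draw probabilities.

```python
import random
import statistics


def simulate_season_ppg(win_p, draw_p, rng, games=38):
    """Play one season and return points per game (3 per win, 1 per draw)."""
    points = 0
    for _ in range(games):
        roll = rng.random()
        if roll < win_p:
            points += 3
        elif roll < win_p + draw_p:
            points += 1
    return points / games


def median_ppg(win_p, draw_p, iterations=500, seed=1):
    """Median PPG across many simulated seasons, seeded for repeatability."""
    rng = random.Random(seed)
    return statistics.median(
        simulate_season_ppg(win_p, draw_p, rng) for _ in range(iterations)
    )


print(median_ppg(0.50, 0.24))
print(median_ppg(0.60, 0.20))
```

Running the same loop before and after a parameter change shows whether the change actually moves outcomes, which is the check that exposed the old formula's cap.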

So how did the new formula actually perform? Seven player profiles, 500 iterations each, a thousand in-game days per run:

Profile        P50 (days)    % ever reach
Whale          389           100%
BuilderFirst   429           100%
Optimiser      471           99%
Casual         531           100%
Collector      659           96%
Nationalist    >1000         17%
Random         >1000         0%
Premier arrival by profile. 500 iterations per profile, 1000-day simulation. P50 = median across iterations.
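A profile can show a P50 of ">1000" while still reaching Premier in some runs, because the median covers every iteration, including the ones that never arrive. Here is one plausible way to compute both columns. The function name and the use of `None` for a run that never reached Premier are assumptions for illustration.

```python
import math
import statistics


def summarise_arrivals(arrival_days, horizon=1000):
    """Return (P50 label, % ever reach) for one profile.

    arrival_days holds one entry per iteration: the day Premier was reached,
    or None if the run ended first.
    """
    reached = [day for day in arrival_days if day is not None]
    reach_pct = 100 * len(reached) / len(arrival_days)
    # Count runs that never arrived as later than any finite day, so they
    # still pull the median upward instead of being silently dropped.
    padded = [math.inf if day is None else day for day in arrival_days]
    median = statistics.median(padded)
    p50 = f">{horizon}" if median > horizon else str(round(median))
    return p50, reach_pct


print(summarise_arrivals([400, 450, 500, None]))
print(summarise_arrivals([900, None, None, None, None]))
```

In the second call only one run in five arrives, so the median run never reaches Premier and the label falls back to ">1000", the same pattern as the Nationalist row.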

The Optimiser represents an engaged daily player completing missions and upgrading strategically. Median time to Premier: 471 days. About fifteen and a half months. Slightly longer than my original twelve-month target, but the alternative was trivialising the early divisions or grinding out the late ones. The final number is defensible: a year-plus of solid play to reach the endgame.

The interesting result isn’t the Optimiser, it’s BuilderFirst. This profile delays upgrading the squad to invest in Training Ground levels first. Common sense says they should be slower. The sim says they reach Premier 42 days ahead of the Optimiser, because Training infrastructure generates more Training every day for the entire run. Optimising the rate beats optimising the current state.

BuilderFirst was designed as a cautionary tale about over-investing in infrastructure, and the sim showed otherwise. Logical in hindsight, but not what I’d initially expected, and the kind of result that justifies simulating before shipping.
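The rate-versus-state effect is easy to see with toy numbers. Everything in this sketch is made up for illustration (the target, the rates, the head start) and it ignores everything else the real simulation models. It only compares an instant boost against a permanently faster daily rate.

```python
import math


def days_to_target(target, daily_rate, head_start=0.0):
    """Days needed to accumulate `target` at a constant daily rate."""
    return math.ceil(max(0.0, target - head_start) / daily_rate)


# Made-up numbers: spend the same budget on an instant boost or a faster rate.
squad_first = days_to_target(1000, daily_rate=2.0, head_start=100)
builder_first = days_to_target(1000, daily_rate=2.5)
print(squad_first, builder_first)
```

With these toy values the faster rate arrives first despite starting from zero, because the head start is a one-off while the rate advantage compounds every day of the run.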

The other thing I wanted to validate was the Premier endgame, since the hidden bonus is invisible to the player. The only way to check it’s working is to watch what happens once a median Optimiser arrives:

Season    Median finish    Title win %    PPG
S1        7th              4%             1.58
S2        4th              17%            1.74
S3        4th              25%            1.79
S4        2nd              38%            1.87
S5        2nd              47%            1.92
S6        1st              52%            1.95
S7        1st              56%            1.97
Premier season-by-season climb. Hidden bonus tapers per season, so the climb from new-arrival to title contender takes ~6 in-game years.

First season, they finish 7th with a 4% title chance. By season four they’re a podium regular. By season six they’re winning titles more often than not. A six-in-game-year climb from “just made it” to “best in the league”. The endgame has a progression curve instead of being a victory lap. The hidden bonus is doing its job.

Without simulating, I’d have shipped a game I thought I understood, missing key elements of fun and realism.

I’d have shipped with the wrong skill multiplier. The first logistic version used 1.0× because the maths suggested removing the old cap meant skill no longer needed the boost. On paper, correct. In simulated leagues, playstyle counters and home advantage routinely overturned clear skill gaps and standings looked random. The 2.0× value came from watching the sim and tuning until skill was the loudest voice without drowning out the others.

I’d also have shipped without the hidden Premier bonus, because the sim was the only place I could see median Optimisers winning titles in their first season. That’s exactly the unrealistic, boring endgame I was trying to avoid.

Takeaways

Three lessons I’m taking into everything else I build:

The shape of your formula matters more than its variance. If tuning parameters doesn’t move outcomes the way you expect, the problem isn’t the parameters, it’s the function. Hard caps on linear formulas are a particular trap, because the cap region looks like it’s still doing something when it isn’t.

“Feels fair” is a property you can engineer for. It’s tempting to treat fairness as something soft, something you tune by gut feel until it stops bothering you. It isn’t. It’s a specific relationship between inputs and outputs (in this case, rewarding skill while preserving meaningful upset potential) that you can define, measure, and validate. Once you frame it that way, the engineering follows.

Simulate before you ship. Especially for systems players spend hundreds of hours inside. I’d never have arrived at 2.0× from theory. I’d never have caught the Premier endgame problem from playtesting alone. The simulation surprised me in ways my intuition wouldn’t have, and that surprise is what justifies the cost of building one.