Every AI Model Lost Money Betting on Soccer — The KellyBench Study Explained


For all the hype around autonomous AI systems, the KellyBench result is refreshingly embarrassing.

General Reasoning’s study asked a simple question with real stakes: if frontier models really are getting better at messy real-world reasoning, can they survive a Premier League betting season with a bankroll and a risk-management rule?

The answer was no. Not one model made money.

According to the April 2026 Ars Technica / Financial Times coverage of KellyBench, eight frontier systems were tested on the 2023-24 Premier League season with a £100,000 bankroll and Kelly-criterion bet sizing.

Claude Opus 4.6 was the least bad, averaging a -11 percent return and nearly breaking even on its best run at -0.2 percent.

GPT-5.4 lost 13.6 percent.

Gemini 3.1 Pro was wildly unstable, averaging -43.3 percent while still posting a +33.7 percent run in one attempt.

xAI’s Grok 4.20 went bankrupt every time.

That is not a small miss. That is systematic failure.

If you want the deeper AI context first, this story sits naturally beside our coverage of what agentic AI actually is and where AI really does save time in practical workflows.

KellyBench matters because it shows the difference between language fluency and calibrated decision-making under uncertainty.

What KellyBench Actually Tested

The study design matters because this was not a toy quiz about sports trivia.

KellyBench, as described by Ars Technica and the Financial Times, recreated the 2023-24 Premier League season and had eight frontier AI systems place bets across match outcomes and goal totals.

Each model got detailed historical data and team statistics, then had to manage a £100,000 bankroll across the season.

The crucial twist is the betting rule. The agents used Kelly criterion sizing, which means the amount you bet depends on the edge you believe you have. If your probability estimates are even slightly miscalibrated, Kelly can punish you fast.
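For a single binary bet, the standard Kelly rule is easy to state: stake the fraction f* = (bp - q) / b of your bankroll, where b is the net decimal odds, p is your estimated win probability, and q = 1 - p. Here is a minimal sketch of that rule in Python; the example odds and probability are ours, not KellyBench's:

```python
def kelly_fraction(p_win: float, decimal_odds: float) -> float:
    """Kelly stake for a binary bet: f* = (b*p - q) / b."""
    b = decimal_odds - 1.0       # net winnings per unit staked
    q = 1.0 - p_win
    f = (b * p_win - q) / b
    return max(f, 0.0)           # no edge means no bet

# If you believe a team wins 55% of the time at decimal odds of 2.10:
print(kelly_fraction(0.55, 2.10))   # ~0.141, i.e. stake ~14.1% of bankroll
```

Notice how directly the stake scales with the probability you feed in. That is the whole trap: an inflated p does not just change the call, it changes the size of the position.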

That makes this a very different benchmark from standard AI leaderboards. Most mainstream benchmarks reward static answers in static environments. KellyBench forces models to make sequential decisions under uncertainty, size risk, update beliefs, and live with the consequences.

“Every frontier model we evaluated lost money over the season and many experienced ruin.”

Source: Ars Technica / Financial Times coverage of KellyBench, April 2026

That sentence is the whole story in miniature. The benchmark did not ask whether the models could sound smart about soccer. It asked whether they could turn probability judgments into money without blowing themselves up.

That is a much more serious test of reasoning than most AI marketing decks ever face.

The Scoreboard: Every Model Lost Money

The numbers are bad enough that they deserve to be seen in one place.

| Model | Mean ROI | Best run | Worst run | Mean final bankroll |
| --- | --- | --- | --- | --- |
| Claude Opus 4.6 | -11.0% | -0.2% | -18.8% | £89,035 |
| GPT-5.4 | -13.6% | -4.1% | -31.6% | £86,365 |
| Gemini 3.1 Pro | -43.3% | +33.7% | -100.0% | £56,715 |
| Gemini Flash 3.1 LP | -58.4% | +24.7% | -100.0% | £41,605 |
| Z.AI GLM-5 | -58.8% | -14.3% | -100.0% | £41,221 |
| Moonshot Kimi K2.5 | -68.3% | -27.0% | -100.0% | £7,420 |
| Grok 4.20 | -100.0% | -100.0% | -100.0% | £0 |
| Acree Trinity | -100.0% | -100.0% | -100.0% | £0 |

There are three immediate takeaways from that table.

  • Claude was the best, but “best” still meant losing money.
  • GPT-5.4 was relatively less awful, but still clearly negative.
  • Gemini and Grok showed the kind of instability you absolutely do not want near automated capital allocation.

KellyBench is not saying the models never guessed correctly. It is saying their probabilities and bankroll decisions were bad enough over time to destroy edge.

That distinction matters. A model can sound informed, pick some winners, and still be a terrible betting system.

What the scoreboard really says

The table is not just a ranking of bad gamblers. It is a ranking of how badly each model handled uncertainty, calibration, and self-control once real bankroll pressure entered the picture.

  • Claude and GPT-5.4 look comparatively resilient, which suggests the strongest models may be improving at disciplined reasoning.
  • Gemini’s spread between a strong positive run and total ruin shows why variance matters as much as mean return.
  • Grok’s repeated bankruptcy is a reminder that some failures are categorical rather than marginal.

That is why this benchmark feels more decision-relevant than a lot of glossy AI demos. It forces the question executives actually care about: can the system stay sane when mistakes cost money?

Why Kelly Criterion Is Such a Brutal Test for LLMs

The Kelly criterion is one of those ideas that sounds elegant until you realize how unforgiving it is.

In simple terms, Kelly tells you how much of your bankroll to bet based on your estimated edge.

If you think an event is underpriced, you bet more. If the edge is tiny, you bet less. If there is no edge, you do not bet.

That works beautifully in theory when your probability estimates are well calibrated. It works terribly when they are not.

And that is where LLMs run into trouble. These systems are not naturally built to produce sharp, calibrated probabilities over a long horizon. They are built to generate plausible language and patterns that look coherent to humans.

Kelly is ruthless about the gap between sounding confident and being statistically right. Slight overconfidence does not just reduce returns. It amplifies losses because bet sizing scales with your confidence.

That is why this benchmark is so revealing. It does not reward a model for narrating the match well. It punishes the model for mispricing uncertainty.

“There is so much hype about AI automation, but there’s not a lot of measurement of putting AI into a long time-horizon setting.”

Source: Ross Taylor, quoted in Ars Technica / Financial Times, April 2026

That quote nails why the result matters outside gambling. KellyBench is really a test of whether these models can make repeated decisions with compounding consequences.

That is the same class of problem many people imagine AI automation will soon dominate in finance, operations, planning, and business execution.

Why miscalibration destroys bankrolls so quickly

One subtle reason KellyBench is so revealing is that it punishes the exact kind of error large language models are most likely to make: confident-seeming estimates that are just a little bit wrong.

If a model says a team has a 62 percent chance of winning when the real probability is 55 percent, that sounds like a small miss.

In conversational settings, nobody notices. In bankroll management, it changes how much capital gets committed. Over time, that gap compounds into real drawdown.

  • Bad calibration makes the model bet too often.
  • Overconfidence makes the model bet too large.
  • Run-to-run variance means the same model may express the same edge differently on different passes.
  • Once that instability touches capital allocation, the losses accelerate rather than average out gracefully.

That is why this benchmark matters more than a generic accuracy score. It turns fuzzy confidence into economic consequences.
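To make the 62-versus-55 example concrete, here is a toy Monte Carlo sketch under our own simplifying assumptions: the bookmaker prices the match at the true 55 percent probability (so there is zero real edge), the model stakes full Kelly on its 62 percent belief, and a season is 380 bets. That 7-point overconfidence translates into staking about 15.6 percent of the bankroll on every no-edge bet, an expected log growth of roughly -0.01 per bet, which is enough to leave the median simulated season with around 2 percent of its starting bankroll:

```python
import numpy as np

rng = np.random.default_rng(42)

TRUE_P = 0.55               # the real win probability
STATED_P = 0.62             # the model's overconfident estimate
ODDS = 1.0 / TRUE_P         # fair decimal odds: zero true edge
B = ODDS - 1.0              # net odds

# Kelly stake computed from the *stated* probability
f = (B * STATED_P - (1.0 - STATED_P)) / B   # ~0.156

finals = []
for _ in range(1000):       # 1,000 simulated seasons
    bankroll = 100_000.0
    for _ in range(380):    # one bet per fixture
        if rng.random() < TRUE_P:
            bankroll *= 1.0 + f * B         # win: stake pays net odds
        else:
            bankroll *= 1.0 - f             # loss: stake is gone
    finals.append(bankroll)

print(f"stake per bet: {f:.1%}")
print(f"median final bankroll: £{np.median(finals):,.0f}")   # deep drawdown
```

A 7-point miss that nobody would notice in conversation turns £100,000 into pocket change over one season. That is Kelly sizing doing exactly what it was designed to do, just fed with bad probabilities.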

Why LLMs Fail at Soccer Betting Even When They Know a Lot About Soccer

The easiest mistake here is to think the problem is lack of soccer knowledge.

That is almost certainly not the core issue. A frontier LLM probably knows a huge amount about the Premier League in the loose, narrative sense.

It can talk about clubs, managers, tactics, star players, injuries, and form. But betting edge does not come from sounding informed. It comes from calibrating probabilities better than the market.

That is a fundamentally different task.

Here is where LLMs tend to break down in a setup like KellyBench:

  • they blur narrative plausibility with measurable probability;
  • they are not inherently optimized for calibration under repeated feedback;
  • their outputs can vary run to run even on similar prompts;
  • they struggle to maintain disciplined decision policy over long horizons.

Soccer betting is not a trivia contest. It is closer to sequential statistical decision-making under adversarial pricing. Bookmaker odds already embed a massive amount of information.

To beat that market, a system needs not only domain knowledge but also disciplined estimation, robust calibration, and stable risk sizing.

LLMs are weak precisely where that stack becomes non-negotiable.

This is why the benchmark result feels intuitive once you stop reading it through a chatbot lens.

LLMs are great at compressing and expressing knowledge. They are much worse at behaving like a properly calibrated probabilistic machine in a live, uncertain environment.

Implicit knowledge is not statistical reasoning

This is the cleanest way to frame the failure. LLMs are rich in implicit knowledge. They absorb huge amounts of text about teams, leagues, injuries, tactics, and public narratives.

But statistical reasoning is not the same as having a lot of relevant facts floating around in latent space.

Betting demands explicit probability discipline. You need to know not just that Arsenal are strong or that a team is in poor form, but exactly how much that should move a win probability relative to bookmaker odds.

That shift has to be numerical, calibrated, and consistent. Language models tend to blur that boundary because their native format is persuasive sequence generation, not stable probability estimation.
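To see what "relative to bookmaker odds" means in numbers, here is a small sketch (the odds are invented) that converts decimal odds into implied probabilities and strips out the bookmaker's margin, or overround. A bet only carries positive expected value if your calibrated probability exceeds the margin-free implied probability:

```python
def implied_probabilities(decimal_odds):
    """Decimal odds -> margin-free implied probabilities."""
    raw = [1.0 / o for o in decimal_odds]   # naive implied probabilities
    overround = sum(raw)                    # > 1.0: the bookmaker's margin
    return [r / overround for r in raw]

# Invented home / draw / away odds for one match
odds = [2.10, 3.40, 3.60]
print(f"overround: {sum(1.0 / o for o in odds):.3f}")   # ~1.048
print([round(p, 3) for p in implied_probabilities(odds)])
# -> [0.454, 0.281, 0.265]: the market's de-margined view to beat
```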

That is why KellyBench feels so important as an explanatory benchmark. It separates knowing from pricing, and frontier LLMs still look much better at the first job than the second.

Why Structured ML Approaches Still Make More Sense

If your goal is sports forecasting rather than conversational analysis, structured methods still make much more sense than betting straight from an LLM.

That does not mean structured models are magical money printers. They are not. It means their failure modes are more appropriate to the task.

| Approach | How it works | Why it fits betting better |
| --- | --- | --- |
| Poisson goal models | Estimate expected goal rates and convert them into outcome probabilities | Built directly around score distributions and match events |
| Elo / rating systems | Update team strength dynamically from match results and context | Stable, interpretable, and explicitly numerical |
| XGBoost / tabular ML | Use feature-engineered match data such as xG, form, home edge, injuries, and ratings | Works on structured inputs and can be calibrated more directly |
| LLM agent | Reads context, reasons in language, outputs judgments and probabilities | Flexible and expressive, but usually weaker on calibration and consistency |

The classic reference point is Dixon and Coles’ work on soccer prediction using Poisson-based models. Later approaches extend with Elo-style ratings, expected goals, market features, and tabular machine learning systems like XGBoost.
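For flavor, here is a minimal independent-Poisson sketch of the idea. Dixon and Coles add a correction for dependence in low-scoring games, which we omit here, and the expected-goal rates are invented for illustration:

```python
from scipy.stats import poisson

def match_probabilities(lam_home: float, lam_away: float, max_goals: int = 10):
    """Home/draw/away probabilities from expected goal rates,
    assuming independent Poisson-distributed scorelines."""
    home_w = draw = away_w = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson.pmf(h, lam_home) * poisson.pmf(a, lam_away)
            if h > a:
                home_w += p
            elif h == a:
                draw += p
            else:
                away_w += p
    return home_w, draw, away_w

# Invented expected goals: home side 1.6, away side 1.1
print(match_probabilities(1.6, 1.1))   # roughly (0.49, 0.25, 0.26)
```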

Those models are not better because they are more glamorous. They are better suited because they force the problem into structured variables and explicit assumptions.
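The Elo side of that lineage is just as compact. A standard update fits in a few lines; the K-factor and home-advantage offset below are illustrative choices on our part, not tuned values:

```python
def expected_score(rating_home: float, rating_away: float,
                   home_adv: float = 60.0) -> float:
    """Expected score for the home side under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_away - (rating_home + home_adv)) / 400.0))

def update(rating_home: float, rating_away: float,
           result: float, k: float = 20.0):
    """result: 1.0 home win, 0.5 draw, 0.0 away win."""
    delta = k * (result - expected_score(rating_home, rating_away))
    return rating_home + delta, rating_away - delta

# A 1500-rated home side beats a 1560-rated visitor:
print(update(1500.0, 1560.0, 1.0))   # both ratings shift toward the result
```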

LLMs do almost the opposite. They are expressive and broad, but that flexibility often comes with calibration weakness. In a domain where small probability errors get multiplied by Kelly sizing, that is a disaster.

The deeper lesson is not “old models good, new models bad.” It is that model form still matters. A system that is extraordinary at code generation or document drafting does not automatically become extraordinary at risk-managed forecasting.

That should be obvious, but Silicon Valley often acts as though any strong general model can be stretched across any economic task with enough prompt engineering. KellyBench is a good reminder that structure is still a feature, not a relic.

What a sensible hybrid system would do instead

The interesting conclusion is not that LLMs have no role here. It is that their role is probably upstream or downstream of the actual forecasting engine.

A sensible architecture would use structured models for probability estimation, calibration tooling for confidence control, and explicit bankroll rules for sizing.

An LLM could still help with research summaries, injury-report digestion, market commentary, or human-readable explanations. It just should not be trusted as the raw edge engine.

  • Structured model for the base probability
  • Calibration layer to test whether forecast confidence matches reality (sketched below)
  • Rules engine for stake sizing
  • LLM layer for narrative interpretation, reporting, and workflow support
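The calibration layer in that stack is not exotic, either. Here is a minimal sketch using scikit-learn's reliability tooling, run on synthetic forecasts that we deliberately made overconfident, showing how stated confidence gets compared against observed frequencies:

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(0)

# Synthetic season: true win probabilities, realized outcomes, and a
# forecaster that systematically overstates confidence by 7 points.
true_p = rng.uniform(0.2, 0.8, size=380)
outcomes = rng.binomial(1, true_p)
stated_p = np.clip(true_p + 0.07, 0.0, 1.0)

# Reliability check: within each bin, does stated confidence
# match the observed win rate? (It won't, by construction.)
frac_pos, mean_pred = calibration_curve(outcomes, stated_p, n_bins=10)
for mp, fp in zip(mean_pred, frac_pos):
    print(f"stated {mp:.2f} -> observed {fp:.2f}")

print("Brier score:", round(brier_score_loss(outcomes, stated_p), 4))
```

Run this kind of check on any forecaster, structured or LLM, before a single pound of real bankroll is at stake.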

That is a much more believable story for enterprise AI too. Use LLMs where language helps. Use structured models where numerical discipline matters.

What Gemini and Grok Reveal About Instability

The most interesting failures in the table are not the mild losers. They are the unstable ones.

Gemini 3.1 Pro is the clearest example. It posted a +33.7 percent run in one attempt but a -100 percent collapse in another, producing a deeply negative average.

That is exactly the kind of performance profile that can fool people into thinking they are looking at hidden genius rather than instability.

They are not. They are looking at variance without reliability.

The same story shows up in a harsher form with Grok 4.20. If a model goes bankrupt every time in a benchmark built to test repeated decision quality, the story is not that the model had bad luck.

The story is that the system is unfit for the task as configured.

This matters because AI hype often cherry-picks best runs. But for deployment, the mean outcome matters far more than the highlight reel. Nobody should automate a decision system around a model that occasionally looks brilliant but often self-immolates.

That is true in sports betting. It is even more true in enterprise automation, where instability compounds through workflow chains, budgets, customer impact, or compliance exposure.

Why variance matters more than screenshots

One reason the KellyBench result is so useful is that it resists demo culture.

  • A single strong run can make a model look smarter than it is.
  • A single catastrophic run can wipe out all the narrative value of the good run.
  • Decision systems should be judged by repeatability, not by their prettiest screenshot.

That is the lesson behind Gemini’s performance in particular. A benchmark with bankroll and ruin makes inconsistency visible in a way normal AI product demos rarely do.

What This Means for AI Automation Hype

KellyBench is not mainly about soccer.

It is about the distance between benchmark hype and durable real-world agency.

The reason this story matters is that a lot of automation enthusiasm now rests on the idea that frontier models are becoming good enough to run longer-horizon decision loops with little human oversight.

Betting on a soccer season is not the same thing as running a company, but it rhymes with many real business tasks: you have uncertainty, imperfect information, delayed feedback, bankroll constraints, and compounding consequences.

That is exactly where sloppy reasoning becomes expensive.

KellyBench therefore functions as a useful reality check. It suggests that frontier LLMs may still be much better at assisting judgment than replacing it in dynamic, risk-sensitive domains.

That is not an anti-AI conclusion. It is a more honest one.

There are still many places where AI creates obvious value.

Coding assistance, drafting, summarization, workflow acceleration, retrieval, search over internal documents, and narrow automation all remain real wins.

We have covered that repeatedly in Blue Headline’s reporting on practical AI productivity.

But the leap from “useful assistant” to “trusted autonomous decision-maker” is still wider than the loudest hype cycle wants to admit.

That is the real meaning of every model losing money. The issue is not that soccer is magical. The issue is that long-horizon, probabilistic, feedback-rich environments are still hard in exactly the ways AI boosters often understate.

What executives should take from this

If you are an operator evaluating AI automation, KellyBench offers a better management lesson than a gambling lesson.

Do not ask only whether a model can produce a plausible answer.

Ask whether it can stay calibrated over time, whether its failures are bounded, whether its outputs are stable enough to automate, and whether the environment already contains smart adversaries or market prices.

If the answer to those questions is shaky, keep a human or a structured model in the loop.

That is a healthier way to think about AI systems across forecasting, budgeting, planning, trading, and resource allocation.

The first question should not be “Can the model sound convincing?” It should be “What happens when its confidence is wrong 200 times in a row?”

Bottom Line

Three durable lessons from KellyBench

  • Language fluency is not the same thing as probability calibration.
  • Run-to-run instability is a deployment problem, not just an academic curiosity.
  • Structured models still matter whenever the task is explicitly numerical and risk-sensitive.

Every AI model in KellyBench lost money betting on Premier League soccer.

Claude Opus 4.6 was the least bad.

GPT-5.4 still lost.

Gemini showed dangerous instability.

Grok got wiped out every time.

That is not a story about one weak model. It is a story about a whole class of systems failing a type of task that sounds deceptively close to “general reasoning.”

My bottom line: LLMs are full of implicit world knowledge, but that is not the same thing as producing calibrated probabilities, stable risk sizing, and durable edge in a dynamic market.

KellyBench is a useful antidote to automation hype because it shows exactly where fluent reasoning stops and disciplined statistical reasoning begins.

If you want AI to make money in a setting like this, you still need structured models, explicit features, calibration discipline, and much more skepticism than the current hype cycle usually tolerates.

Primary sources and references: Ars Technica coverage, Kelly criterion overview, Poisson distribution overview, Elo rating overview, scikit-learn probability calibration guide, and XGBoost documentation.

FAQ

What is KellyBench?

KellyBench is a benchmark from General Reasoning that tested frontier AI models on a recreated Premier League betting season using bankroll management based on the Kelly criterion.

Did any model make money?

No. Every model lost money on average over the season.

Which model did best?

Claude Opus 4.6 had the best average result, but it still lost roughly 11 percent on average and nearly broke even only on its best run.

Why did LLMs fail?

Because betting performance depends on calibrated probabilities, stable decision policy, and disciplined sizing under uncertainty. LLMs are much better at language and implicit knowledge than at that combination.

Why compare them with Poisson or XGBoost models?

Because structured forecasting approaches are built around explicit numerical features, interpretable assumptions, and direct probability estimation. That makes them a more natural fit for this kind of task.
