
LLMs Sit Down at the Poker Table – And Show Their Limits

by Marco van der Hoeven

Nine of the most widely used large language models have just completed a five-day online poker match designed to test how well general-purpose AI can handle incomplete information, risk and deception. The event, PokerBattle.ai, was run as a $10/$20 no-limit Texas Hold’em cash game and produced a full dataset of hands and reasoning traces for every decision the models made.

Large language models are increasingly being tested in environments that resemble real-world decision-making: noisy, adversarial, and full of hidden information. A recent online experiment, PokerBattle.ai, took that idea literally by seating nine leading LLMs at a virtual $10/$20 no-limit Texas Hold’em cash table for five consecutive days. Nearly four thousand hands were played by each participant, and the organisers logged not only the actions, but also the reasoning behind every decision. Poker player and former LLM-lab founder Victoria Livschitz has now published an extensive analysis of that dataset, offering a rare look at how these systems behave when money, uncertainty and risk converge.

The Format and the Field

Each model began with a $100,000 bankroll and bought in for $2,000 per table (100 big blinds), with stacks automatically topped up as they fluctuated. Four nine-handed tables ran in parallel. Competitors included OpenAI’s o3, Anthropic’s Claude Sonnet 4.5, xAI’s Grok, DeepSeek R1, Google’s Gemini 2.5 Pro, Mistral’s Magistral, Moonshot’s Kimi K2, Z.AI’s GLM 4.6 and Meta’s LLAMA 4.

OpenAI’s o3 finished as the top performer with a profit of $36,691, followed by Claude and Grok. DeepSeek and Gemini booked solid wins; Magistral ended only slightly positive. Kimi K2 and GLM recorded losses, while LLAMA 4 was eliminated early after dropping its entire bankroll. Livschitz stresses that 3,800 hands is too small a sample for definitive poker conclusions, but large enough to identify clear strategic patterns.
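Livschitz’s caution about sample size can be quantified: the uncertainty in a measured win rate shrinks only with the square root of the number of hands played. A minimal sketch of that arithmetic (the ~100 bb/100 standard deviation is a commonly cited ballpark for no-limit hold’em, not a figure from the event):

```python
import math

def winrate_standard_error(std_bb100: float, hands: int) -> float:
    """Standard error of a win rate measured in bb/100 over `hands` hands.

    If results per 100-hand block have standard deviation std_bb100, the
    standard error of the mean over (hands / 100) blocks is
    std_bb100 / sqrt(hands / 100).
    """
    return std_bb100 / math.sqrt(hands / 100)

# Assuming a typical ~100 bb/100 standard deviation for no-limit hold'em:
se = winrate_standard_error(100, 3_800)   # roughly 16 bb/100
```

An uncertainty of roughly ±16 bb/100 dwarfs the few-big-blind edges that separate winning and losing regulars, which is why 3,800 hands can expose broad strategic patterns but not fine-grained skill differences.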

Strong Fundamentals Before the Flop

One of the study’s most striking findings is how competent the top models are in the structured preflop stage. Here, where decisions rely heavily on well-defined ranges and mathematical guidelines, the better LLMs frequently mirrored human regulars. OpenAI o3, Claude and Grok chose opening sizes that fit their table positions, built sensible three-bet and four-bet ranges, and made consistent adjustments according to opponent tendencies. Their preflop work resembled standard online cash-game play, showing these systems can apply theory reliably when variables are contained.

Postflop: The Cracks Begin to Show

As soon as community cards hit the table, the models’ limitations became pronounced. Livschitz found that the LLMs focused narrowly on the specific hand they held rather than considering how their entire range should behave. Their opponent modelling was similarly narrow, often missing key value or bluff combinations.

Simple factual errors also proved costly. Some models mis-stated their own hand, misread the board or described bet sizes incorrectly, then went on to justify aggressive actions using those mistaken assumptions. These were not subtle theoretical mistakes but fundamental tracking errors—akin to a robot misreading a sensor and confidently acting on that incorrect input.

Exploitation Over Balance

The stronger LLMs approached the game with clearly exploitive strategies. They cited VPIP, preflop raise rates, three-bet tendencies and showdown frequencies as justification for isolating loose opponents or calling down with marginal hands. This approach resembles the way many human players adapt at low and mid stakes: find the leak and punish it.
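The statistics the models cited are simple frequency counts over observed hands. A hypothetical sketch of how VPIP and preflop-raise rates could be tallied (the `HandRecord` schema is illustrative, not the event’s actual data format):

```python
from dataclasses import dataclass

@dataclass
class HandRecord:
    # One player's preflop actions in a single hand (illustrative schema)
    voluntarily_put_in: bool   # called or raised preflop; posting blinds excluded
    raised_preflop: bool       # made at least one preflop raise

def vpip(hands: list[HandRecord]) -> float:
    """VPIP: percentage of hands where the player voluntarily entered the pot."""
    return 100 * sum(h.voluntarily_put_in for h in hands) / len(hands)

def preflop_raise_rate(hands: list[HandRecord]) -> float:
    """PFR: percentage of hands where the player raised before the flop."""
    return 100 * sum(h.raised_preflop for h in hands) / len(hands)

sample = [HandRecord(True, True), HandRecord(True, False),
          HandRecord(False, False), HandRecord(False, False)]
vpip(sample)                # 50.0
preflop_raise_rate(sample)  # 25.0
```

A tight-aggressive regular might show a VPIP in the low twenties; LLAMA 4’s 60%-plus figure, discussed below, sits far outside that range.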

But the models often overreacted to small samples, making large strategic adjustments based on limited information. Showdown results—hard evidence of what opponents actually held—were used surprisingly infrequently. Their decision-making leaned heavily toward aggression: when uncertain, they tended to call or raise rather than fold.

Big Pots, Big Assumptions

Some marquee hands highlight both the strengths and weaknesses of this approach. In a major four-bet pot, OpenAI o3 used stack-to-pot ratio, range advantage and board texture considerations to play pocket aces convincingly against Gemini’s pocket queens. Gemini, however, constructed an oversimplified range for o3 and overestimated the number of possible bluffs. Despite recognising it was often behind, it committed a full 500 big blinds.
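Stack-to-pot ratio, one of the concepts o3 cited, is simply the effective remaining stack divided by the current pot; low values indicate a player is close to pot-committed. A sketch with hypothetical numbers (not the actual stacks from the hand):

```python
def stack_to_pot_ratio(effective_stack: float, pot: float) -> float:
    """SPR: remaining effective stack divided by the pot on the current street."""
    return effective_stack / pot

# Hypothetical four-bet pot at $10/$20: both players invested $2,200 preflop,
# leaving $7,800 behind, with $30 of dead blinds in the middle.
pot = 2 * 2_200 + 30
spr = stack_to_pot_ratio(7_800, pot)   # below 2
```

With an SPR this low, an overpair such as aces or queens typically plays for stacks, which helps explain why Gemini committed fully even while recognising it was often behind.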

LLAMA: The Weak Opponent That Distorted the Game

Meta’s LLAMA 4 emerged as the clear outlier. With a VPIP above 60%, excessive three-betting, a reluctance to fold, and repeated misinterpretations of hand strength, LLAMA played in a way that resembled a very inexperienced human player. It cold-called large three-bets with weak hands, chased minimal-equity draws and occasionally acted as if it held a strong made hand when it did not.

Other models quickly adjusted, trying to isolate LLAMA and play more hands against it. This introduced realistic multi-way dynamics, but also forced the stronger agents into large pots against each other while attempting to extract value from the weakest participant.

Bluffing: A Missing Skill

Perhaps the most surprising discovery was how rarely the models executed deliberate, well-constructed bluffs. Many apparent “bluffs” were actually the result of misread hands or incorrect assumptions about equity. There was little evidence of systematic bluff selection, range balancing or multi-street pressure applied with intentionally chosen low-equity hands.

Yet paradoxically, the models often assumed their opponents did bluff, and based call decisions on that expectation. This asymmetry produced strategies that were aggressive when holding value, but passive and sometimes confused when representing it.
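The balance the models lacked has a well-known closed form for a single river bet: to make the caller indifferent, the fraction of bluffs in a polarized betting range must equal the pot odds the bet offers. A minimal sketch of that textbook relationship:

```python
def required_calling_equity(pot: float, bet: float) -> float:
    """Equity a defender needs to profitably call a bet of `bet` into `pot`."""
    return bet / (pot + 2 * bet)

def balanced_bluff_fraction(pot: float, bet: float) -> float:
    """Fraction of bluffs in a polarized river range that makes the defender
    indifferent between calling and folding (single street, no future betting).
    It coincides with the defender's required calling equity."""
    return bet / (pot + 2 * bet)

# A pot-sized river bet should be a bluff one time in three:
balanced_bluff_fraction(100, 100)   # 1/3
```

Nothing in the logged reasoning suggests the models selected bluff combinations to hit ratios like these, even as they routinely assumed their opponents did.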

What PokerBattle.ai Tells Us About AI Reliability

The PokerBattle.ai experiment underscores that general-purpose LLMs can integrate structured theory, simple statistics and opportunistic exploitation into coherent play. They can beat very weak opponents and hold their own in some low-stakes settings.

But the experiment also reveals a fragile reasoning process. When small factual errors occur, the LLMs continue confidently along incorrect paths. When scenarios become complex or multi-layered, their lack of balance and limited bluffing ability become visible. The shortcomings are particularly noticeable in high-stakes, multi-step reasoning—conditions that mirror many real-world decision environments in robotics, automation and AI-driven operations.

Livschitz expects some of the more obvious leaks to be corrected as LLM developers improve factual grounding and incorporate more domain-specific learning. However, she notes that to truly rival specialised poker agents or elite human players, general LLMs would need far better range awareness, more reliable state tracking, and a deeper grasp of strategic balance and bluffing.

Read a detailed analysis of the game here
