“Get up and hustle, don’t feel sorry for yourself
Get that generational wealth
Can’t play with me, I play chess by myself”
I have always been fascinated by chess. I play it regularly. Chess is uniquely interesting because it lets strategy enthusiasts practice tactical and strategic thinking. Nothing like a good chess game after reading The Art of War by Sun Tzu or On War by von Clausewitz. It is fun and it is good practice. Don’t get me wrong, though: it doesn’t replace reality. Magnus Carlsen is no Napoleon Bonaparte.
Still, I wanted to see how LLMs would fare in a chess game, since I have been tinkering a lot with generative AI technologies in the past few years.
I found the idea of organizing a chess competition between LLMs interesting for two reasons:
- First, since general-purpose LLMs are not trained specifically to excel at chess, I thought it would be nice to use chess to test their emergent tactical and strategic planning capabilities. Of course, LLMs have certainly memorized some of the chess games available on the internet during their pretraining phase. But what is noteworthy here is the absence of an objective function designed to make general-purpose LLMs good at chess. This opens the possibility of testing the emergent thinking and planning capabilities of LLMs (at least when it comes to chess).
- Second, I thought a chess competition would also be a nice way to rate LLMs and thus create a new kind of leaderboard, one ranking the tactical and strategic planning abilities of LLMs.
Chess without the humans!
The recipe
In order to achieve my vision, I simply needed access to a lot of different LLMs. For that purpose, I chose to use Nebius AI Studio. It gives me access to at least 17 open-source SOTA models, and I can use them for free since I got $100 of free credits when I created my account.
At the end of this article, I will share the full code necessary to replicate my chess evaluation/competition. You will need access to Nebius AI Studio, so create your account here: Nebius AI Studio.
To interact with the LLMs, I simply use the Python OpenAI client. This has one advantage: I can even add models from OpenAI to the mix to make the final leaderboard more interesting, since switching providers is just a matter of changing the API key and base URL. It also means I don’t have to do everything through OpenAI’s API (I tried, and it was way too costly compared to Nebius AI Studio).
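Here is a minimal sketch of what that looks like. The Nebius base URL and the model identifier below are assumptions drawn from the Nebius documentation, so double-check them against your own account:

```python
import os

from openai import OpenAI

# Point the standard OpenAI client at Nebius AI Studio by switching the
# API key and the base URL (assumed endpoint; verify in your Nebius account).
client = OpenAI(
    base_url="https://api.studio.nebius.ai/v1/",
    api_key=os.environ["NEBIUS_API_KEY"],
)

response = client.chat.completions.create(
    # Example open-source model identifier; any model listed in your
    # Nebius AI Studio account works the same way.
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[
        {"role": "system", "content": "You are a strong chess player. Reply with a single legal move in UCI notation."},
        {"role": "user", "content": "Position (FEN): rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1\nYour move:"},
    ],
    temperature=0.0,
)
print(response.choices[0].message.content)
```

To query an OpenAI model instead, keep the exact same code, drop the base_url argument, and use an OpenAI API key.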
Playing chess, not checkers.
This came late in my experimentation, but I also decided to add the Stockfish engine to the mix. Stockfish is a highly advanced, open-source chess engine, widely considered to be among the strongest chess engines in existence.
At its core, Stockfish is a combination of traditional chess engine techniques (an alpha-beta pruning search) and neural network technology (the NNUE evaluation). This combination is the secret sauce that makes Stockfish so impressive, surpassing engines that rely solely on traditional or neural-network-based evaluations.
I decided to add Stockfish as the ultimate adversary of the LLMs, as a way to check whether a neural network not trained specifically for chess could beat, or at least be competitive against, a specialized engine built to excel at it. I obviously had an intuition about the likely result, but I wanted to gather experimental evidence.
After a few experiments, I ended up using Stockfish as the sole benchmark judge for my tactical and strategic planning leaderboard. Meaning, instead of playing against each other, the LLMs each played several games against Stockfish.
Why? First of all, because I noticed that no LLM, even the most capable ones, was able to beat or draw against Stockfish (not very surprising). Secondly, I also noticed that games between LLMs tended to end in a draw, not because the two LLMs were evenly matched, but because LLMs in general are bad at endgames. This is probably due to the autoregressive nature of LLMs: the further the game goes, the less precise any LLM becomes at generating the best next move. Not to mention that games between LLMs can last quite a long time because of the latency of API calls. The code is still flexible enough to let you run a competition between LLMs if you absolutely want to.
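For reference, here is a minimal sketch of how one LLM-vs-Stockfish game can be driven with the python-chess library. The ask_llm_for_move helper is hypothetical (it would wrap the chat-completion call shown earlier and return a move in UCI notation), and the Stockfish path and time limit are assumptions to adapt to your setup:

```python
import random

import chess
import chess.engine


def play_one_game(ask_llm_for_move, stockfish_path="/usr/local/bin/stockfish"):
    """Play a single game: the LLM has the white pieces, Stockfish the black ones."""
    board = chess.Board()
    with chess.engine.SimpleEngine.popen_uci(stockfish_path) as engine:
        while not board.is_game_over():
            if board.turn == chess.WHITE:
                # Ask the LLM for a move; fall back to a random legal move
                # if the reply cannot be parsed or is illegal.
                reply = ask_llm_for_move(board.fen())
                try:
                    move = chess.Move.from_uci(reply.strip())
                    if move not in board.legal_moves:
                        raise ValueError(f"illegal move: {reply}")
                except ValueError:
                    move = random.choice(list(board.legal_moves))
            else:
                # Stockfish answers with a short, fixed time budget per move.
                move = engine.play(board, chess.engine.Limit(time=0.1)).move
            board.push(move)
    return board.result()  # e.g. "1-0", "0-1" or "1/2-1/2"
```

The random fallback is just one possible way of handling an unparseable reply; another option is to re-prompt the model until it produces a legal move.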
Just remember that for the purpose of this benchmark, I decided to use Stockfish as the only reference. Also, instead of relying solely on the outcome of the game (which is always a loss against the Stockfish engine anyway), I decided to use measures of the quality of the moves produced during the game; a sketch of how they can be computed follows the list. These measures include:
- The cumulative centipawn loss: It sums, over the whole game, how much each move deviates from the optimal or best possible move. The lower the better.
- The blunder count: This metric counts the number of moves that result in a significant drop in position value, typically defined as a centipawn loss of 100 or more. This metric helps identify severe mistakes during the game.
- Inaccuracy count: This metric counts the number of moves that result in moderate but notable positional losses, usually between 20 and 100 centipawns.
- Top-N matching moves: The count of moves made by the model that match one of the top-N moves suggested by the engine. This metric helps assess how well the LLM can mimic high-level or optimal play.
- Elo rating: A numerical measure of the playing strength of the model, adjusted based on game performance and outcomes. In this case, the Elo rating is calculated for each game by assuming that the model and the Stockfish engine both start the game at a 1500 Elo rating. The idea is to evaluate the average rating loss of each model across several games against the Stockfish engine.
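Here is a minimal sketch of how these metrics can be computed with python-chess and Stockfish. The thresholds follow the definitions above (a loss of 100 centipawns or more counts as a blunder, between 20 and 100 as an inaccuracy); the analysis depth, the top-N value, the K-factor, and the function names are assumptions for illustration:

```python
import chess
import chess.engine


def score_llm_moves(game_moves, llm_color=chess.WHITE,
                    stockfish_path="/usr/local/bin/stockfish",
                    depth=12, top_n=3):
    """Replay a finished game and score only the LLM's moves against Stockfish."""
    board = chess.Board()
    metrics = {"cumulative_cp_loss": 0, "blunders": 0,
               "inaccuracies": 0, "top_n_matches": 0}
    with chess.engine.SimpleEngine.popen_uci(stockfish_path) as engine:
        for move in game_moves:
            if board.turn != llm_color:
                board.push(move)
                continue
            # Evaluation of the position before the move, with the engine's
            # top-N candidate moves.
            infos = engine.analyse(board, chess.engine.Limit(depth=depth),
                                   multipv=top_n)
            best_cp = infos[0]["score"].pov(llm_color).score(mate_score=10000)
            if move in [info["pv"][0] for info in infos if "pv" in info]:
                metrics["top_n_matches"] += 1
            board.push(move)
            # Evaluation after the move, still from the LLM's point of view.
            info = engine.analyse(board, chess.engine.Limit(depth=depth))
            after_cp = info["score"].pov(llm_color).score(mate_score=10000)
            cp_loss = max(0, best_cp - after_cp)
            metrics["cumulative_cp_loss"] += cp_loss
            if cp_loss >= 100:
                metrics["blunders"] += 1
            elif cp_loss >= 20:
                metrics["inaccuracies"] += 1
    return metrics


def elo_after_game(model_rating, opponent_rating, score, k=32):
    """Standard Elo update; score is 1 for a win, 0.5 for a draw, 0 for a loss."""
    expected = 1 / (1 + 10 ** ((opponent_rating - model_rating) / 400))
    return model_rating + k * (score - expected)
```

With both sides starting at 1500, elo_after_game(1500, 1500, 0) gives 1484, i.e. a 16-point loss for a single defeat at K=32; how these updates are aggregated across (or within) games determines the final relative ratings reported below.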
After running the experiment, I was able to draw some surprising and interesting observations and conclusions.
Elo ratings
Average relative Elo rating
The average Elo ratings for the various chess-playing models, excluding Stockfish, range from approximately 1248 to 1354. The Elo rating is a measure of a model’s playing strength, with higher values indicating stronger performance. The lowest average Elo rating is observed for Phi-3-mini-4k, at about 1248, while the highest is for Llama-3.1-70B, at around 1354. Models such as Llama-3.1-70B and Nemotron-70B are positioned at the upper end of the range, indicating relatively stronger play. Models like Phi-3-mini-4k and GPT-4o show lower Elo ratings, suggesting weaker relative performance in chess.
Average relative Elo rating excluding Stockfish
The narrow spread in Elo ratings (from 1248 to 1354) shows that while there are differences, they are not large enough to indicate significant superiority among the models. None of the general-purpose or semi-specialized models come close to challenging Stockfish’s prowess. This suggests that while some models can play decently, their overall performance remains clustered in a moderate range, highlighting the limitations of non-specialized models.
Blunders Analysis
Blunders per model per game
Average blunder counts vary across the models, with DeepSeek-Coder-V2 and GPT-4o showing the highest averages, and models such as Llama-3.1-70B and Mixtral-8x7B showing fewer, which makes them more reliable.
The variation in blunders among models highlights differences in tactical soundness. Models with fewer blunders, such as Llama-3.1-70B, tend to perform better in practice, suggesting that lower blunder rates contribute to higher Elo ratings. Models that commit more blunders are less reliable in competitive scenarios.
Cumulative Centipawn Loss Analysis
Average cumulative centipawn loss
Cumulative centipawn loss ranges from lower values (indicating better precision) around 9000–11000 for models like Llama-3.1-70B to higher values exceeding 15000 for models like Phi-3-mini-4k.
Llama-3.1-70B, Nemotron-70B, and Mixtral-8x22B exhibit lower cumulative centipawn losses, suggesting more precise gameplay.
Phi-3-mini-4k and DeepSeek-Coder-V2 have higher centipawn loss averages, indicating more frequent and significant deviations from optimal play.
Lower cumulative centipawn loss correlates with stronger, more consistent gameplay. The data shows that models like Llama-3.1-70B play closer to optimal, whereas models with higher centipawn loss, such as Phi-3-mini-4k, make more suboptimal moves. While no model comes anywhere near Stockfish’s essentially zero centipawn loss, the spread indicates that certain general-purpose models are slightly better than others at staying close to optimal moves.
Overall, the analysis underscores that while general-purpose models (e.g., GPT-4o, Llama-3.1-70B) can play chess at an acceptable level, systems built specifically for chess (e.g., Stockfish) exhibit vastly superior performance.
Although the Elo and centipawn loss ranges across the models are not extreme, the absence of major outliers (other than Stockfish) points to broadly similar performance capabilities. This is consistent with the fact that these models have not been optimized for chess.
AGI, checkmate?
Caveat: Instead of chasing AGI, should we focus on building more specialized AIs and then combining their respective strengths? If so, we still need a model robust enough to orchestrate these specialized systems.
Perhaps the path should lean more toward specialization and less toward generalization in the pursuit of AGI.
Only time will tell.
But in the meantime, check out the code of my experiments on GitHub: fork it, improve it, do whatever you want with it. Feedback appreciated, of course.