Prisoner's Dilemma - An Extension


In the past few days I have been going down some pretty deep rabbit holes into decision theory, basically trying to understand how I can make rational decisions on purpose instead of completely relying on intuition and just calling something instinct after the fact.
I do not mean that in some motivational self-help way where you optimize your morning routine. I mean literally asking what the actual structure of a decision is, what information I am actually using when making a choice, what incentives are hidden inside the environment, and how those choices ripple out across other people and institutions that are also making decisions back at me in real time. Once I started thinking about it that way, it stopped feeling like a question about finding one clever move and started feeling like a fundamental question about the architecture of thinking itself.
That is basically a step toward metacognition, which is just a fancy word I use to describe thinking about my own thinking. I find that it is not just having a thought that matters, but being able to watch how I reached it, inspect the underlying assumptions, notice when I am optimizing for the wrong thing entirely, and deliberately update the process instead of just blindly accepting the conclusion.
Once I started seeing it that way, I kept ending up in the exact same place. I realized that if I really want to reason about decisions properly, I cannot just think about one isolated person maximizing some clean objective in a vacuum. Real decisions happen around other agents who have their own goals. Other people react, systems react, and incentives constantly interact with each other. My best move totally depends on what someone else is likely to do, and their best move heavily depends on what they think I am likely to do. That is exactly where this whole rabbit hole bent hard into game theory.
What is game theory, actually?
The version of game theory most people hear about is usually kind of flattened out and simplified. It sounds like a bag of famous puzzles that people talk about at dinner parties. But mathematical game theory is really about modeling strategic interaction between multiple independent actors. You have a set of players, a set of available actions or strategies, and some payoff function that specifically maps everyone's choices to outcomes.
If player \( i \) chooses strategy \( s_i \) while everyone else chooses \( s_{-i} \), you can write their payoff as \( u_i(s_i, s_{-i}) \). That one little expression is doing a lot of work. It says my outcome is not just a function of what I do but rather firmly depends on what the rest of the world does too.
That is why it feels so incredibly connected to metacognition. If I want to make a truly rational decision, I need a model not just of the outside system but of my own objective, my own blind spots, and the fact that other entities are also taking the time to model me back.
In the simplest terms, pure rational choice is often framed as finding \( \text{argmax}_s u(s) \), choosing the action that maximizes your own utility. Game theory says that is incomplete in multi-agent environments. The real problem is closer to \( \text{argmax}_{s_i} u_i(s_i, s_{-i}) \), where the optimum depends on what everyone else plays. Suddenly you need beliefs, expectations, and a strategy, not just a raw individual preference.
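To make that concrete, here is a toy sketch in Python. The game and its payoff numbers are invented purely for illustration; the point is only that the best response is a function of the opponent's move, so there is no single best action in isolation.

```python
# Toy coordination game (payoffs are illustrative, not from any real model).
# u[(mine, theirs)] is my payoff when I play `mine` and they play `theirs`.
u = {
    ("A", "A"): 2, ("A", "B"): 0,
    ("B", "A"): 0, ("B", "B"): 2,
}

def best_response(theirs):
    """argmax over my strategies, holding the opponent's choice fixed."""
    return max("AB", key=lambda mine: u[(mine, theirs)])
```

Here the best response flips with the opponent's choice, which is exactly why beliefs about the other player become part of the decision itself.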
That tension is what heavily pulled me into this space. Rational individual behavior can absolutely produce collectively stupid outcomes. Everybody can be making the correct local move to protect themselves and still end up somewhere significantly worse than the cooperative alternative. That is not some weird edge case but rather the central drama of the whole subject, and the cleanest little model I found for that drama was the Prisoner's Dilemma.
The Prisoner's Dilemma
The setup is famous mostly because it is so brutally compact. Two suspects get separated by the police. Each has two options: cooperate with the other by staying quiet, or defect by betraying them. Neither knows what the other is going to do, and each outcome depends on the pair of decisions rather than on either choice in isolation.
If both cooperate, both do pretty well and serve a very minor sentence. If one defects while the other cooperates, the defector gets the absolute best personal outcome going free and the loyal cooperator gets completely wrecked with the maximum sentence. If both decide to defect out of self interest, they both do worse than they would have if they had just trusted each other initially.


Once you write the numbers down, the trap becomes obvious. The standard illustrative payoffs are 5 for defecting against a cooperator, 3 for mutual cooperation, 1 for mutual defection, and 0 for cooperating alone. The temptation to defect is higher than the reward for mutual cooperation, and the punishment for mutual defection is still better than being the only one who cooperates and gets played. So if you only care about protecting yourself in the immediate moment, defection keeps looking like the only safe option.
That is the part that bothered me when I first sat down with it. No matter what the other person does, you can justify defecting to yourself. If they cooperate, you get more by betraying them. If they defect, you still protect yourself by defecting too. Defection strictly dominates cooperation. Both players can reason their way there rationally, and both end up with a worse result than the one they could have reached together.
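That dominance argument can be checked mechanically. A minimal Python sketch, using the standard illustrative payoffs (temptation 5, mutual cooperation 3, mutual defection 1, sucker 0):

```python
# payoff[(mine, theirs)] -> (my score, their score), standard PD values.
payoff = {
    ("C", "C"): (3, 3), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}

def strictly_dominates(a, b):
    """True if action a beats action b against every possible opponent move."""
    return all(payoff[(a, o)][0] > payoff[(b, o)][0] for o in "CD")

# Defection strictly dominates cooperation, and yet mutual defection (1, 1)
# leaves both players worse off than mutual cooperation (3, 3).
```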
When you play it again and again
The simple one-shot version is definitely interesting, but honestly it is not the version that truly grabbed my attention. It is too clean and too closed off from reality. In a one-off interaction, defection dominates and that is basically the entire end of the story. But I started wondering what actually happens when you play the exact same person two hundred times in a row.
Everything completely changes when repetition enters the picture. Memory suddenly shows up, reputation becomes extremely relevant, and pattern recognition starts playing a huge role. The future actually starts mattering. If I know that I can punish you later for defecting right now, or reward you later for giving me cooperation now, then the logic of the game is no longer just grabbing whatever you want immediately.
This is the expanded version of the game that Robert Axelrod studied in his famous computer tournaments in the early 1980s. He invited people from a range of academic fields to submit programmed strategies for repeated Prisoner's Dilemma play, then had them all face each other in a round-robin tournament.
The winner, both times he ran it, was the simplest strategy in the whole lineup, known as Tit For Tat. It cooperates first, and then it simply mirrors whatever the opponent did in the prior round. That is the entire thing. No elaborate theory, no fancy mathematical machinery, just one clean rule of mirroring.
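The whole strategy fits in a couple of lines. A minimal sketch in Python, together with a helper that plays two strategies against each other:

```python
def tit_for_tat(my_hist, opp_hist):
    """Cooperate on the first move, then mirror the opponent's last move."""
    return "C" if not opp_hist else opp_hist[-1]

def play(strat_a, strat_b, rounds):
    """Run an iterated match; each strategy sees (own history, opponent history)."""
    hist_a, hist_b = [], []
    for _ in range(rounds):
        # Moves are chosen simultaneously, then both histories are updated.
        a = strat_a(hist_a, hist_b)
        b = strat_b(hist_b, hist_a)
        hist_a.append(a)
        hist_b.append(b)
    return hist_a, hist_b
```

Two Tit For Tat players lock into mutual cooperation from the very first round and never leave it.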
Why I built my own tournament
After reading enough papers, I eventually understood the underlying game theory pretty well. I totally got why Tit For Tat works effectively in most populations. I understood why the Gradual strategy can often improve on it in some complex environments where apologies are needed. I completely got why pure unpunishing cooperation gets exploited terribly and why pure defection kind of inherently poisons its own surrounding environment.
But understanding it in the abstract simply was not enough for me anymore. I genuinely wanted to feel the physical tradeoffs in written code. I wanted to actively watch digital agents win and lose for very specific identifiable reasons. And honestly, I primarily wanted to see if I could build something fundamentally better myself.
So I decided to build the exact tournament field I wanted to see test my hypotheses. I gathered a bunch of classic benchmark strategies from the academic literature, and then carefully programmed three of my own entirely custom agents, Eleanor, Nadia, and Mara. Each one was me deeply exploring a slightly different structural answer to the exact same question.
The true question hiding under all of this work was whether I could personally make something more adaptive and robust than the timeless classics without losing the elegant simplicity that makes those classics so incredibly resilient in the first place.
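My actual harness has more bookkeeping (seeds, per-matchup logs, multiple repeats), but the skeleton is just a scored round-robin. Here is a minimal sketch with three stock entrants. Notably, in a field this tiny the pure exploiter tops the table, which is itself a lesson: Tit For Tat's strength depends on who else is in the population.

```python
from itertools import combinations

# Standard PD payoffs: PAYOFF[(mine, theirs)] -> (my score, their score).
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def match(strat_a, strat_b, rounds=200):
    """Iterated match between two strategies; returns both total scores."""
    ha, hb, sa, sb = [], [], 0, 0
    for _ in range(rounds):
        a, b = strat_a(ha, hb), strat_b(hb, ha)
        pa, pb = PAYOFF[(a, b)]
        sa, sb = sa + pa, sb + pb
        ha.append(a)
        hb.append(b)
    return sa, sb

def tournament(entrants, rounds=200):
    """Round-robin over every distinct pair; returns total score per entrant."""
    totals = {name: 0 for name in entrants}
    for (na, fa), (nb, fb) in combinations(entrants.items(), 2):
        sa, sb = match(fa, fb, rounds)
        totals[na] += sa
        totals[nb] += sb
    return totals

entrants = {
    "AllC": lambda mine, theirs: "C",
    "AllD": lambda mine, theirs: "D",
    "TitForTat": lambda mine, theirs: "C" if not theirs else theirs[-1],
}
```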
Eleanor, the Conservative Hybrid
Eleanor was my highly structured conservative build. My primary motivation here was that strategies like Gradual are fantastic at punishing and forgiving, but they are incredibly naive because they will try to forgive an opponent infinitely. I wanted to design an agent that was forgiving but possessed a hard limit on its patience so it would stop donating points to bad-faith actors.
She starts very cooperatively, and uses a gradual punishment mechanism similar to the classic Gradual strategy when betrayed. However, I made her far colder in a very specific way. I programmed her so that if the opponent looks genuinely defect-heavy after the initial eight exploratory rounds, she definitively stops pretending that relationship repair is still on the table and hard-locks into permanent defection.
I can define Eleanor's logic by writing her next move as a function of the history. Let \( H_i^k \) denote player \( i \)'s move in round \( k \), and let \( N \) be the total number of rounds. Eleanor's decision function \( E(n) \) is:\[ E(n) = \begin{cases} D & \text{if } n = N \\[5pt] D & \text{if } n > 8 \text{ and } \sum_{k=1}^{n-1} \mathbb{I}(H_{opp}^k = D) > 1 + \sum_{k=1}^{n-1} \mathbb{I}(H_{opp}^k = C) \\[5pt] \text{Gradual}(n) & \text{if } H_{opp}^{n-1} = D \\[5pt] C & \text{otherwise} \end{cases} \]
Eleanor is essentially just me telling the machine to be fair and be patient, but to never keep foolishly negotiating with someone who has already proven their overwhelmingly hostile pattern. To wrap things up efficiently, she also ruthlessly takes the final-round free points by defecting at the end, because there is absolutely no future left to preserve with the opponent.
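A condensed Python sketch of that decision function. The Gradual-style punishment branch is collapsed here into simple retaliation for brevity, so this is an approximation of the real agent, not its full logic:

```python
class Eleanor:
    """Approximate sketch of the decision rule E(n): endgame defection,
    a hard lock against defect-heavy opponents, retaliation, else cooperate."""

    def __init__(self, total_rounds):
        self.N = total_rounds

    def move(self, opp_hist):
        n = len(opp_hist) + 1              # current round number
        if n == self.N:
            return "D"                     # last round: no future left to protect
        if n > 8 and opp_hist.count("D") > 1 + opp_hist.count("C"):
            return "D"                     # hard lock: opponent has proven hostile
        if opp_hist and opp_hist[-1] == "D":
            return "D"                     # stand-in for the Gradual punishment branch
        return "C"
```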






Nadia, the Control System
Nadia was the overly engineered, systematic version of my attempts to solve the problem. My inspiration for her was control theory. I truly did not want her to be just one single named stock strategy with a small parameter tweak. Instead, I genuinely wanted something that behaved more like an automated industrial PID controller.
She continuously tracks a set of internal state variables, calculating her own metrics like trust, pressure, volatility, repair, and grace. I then designed her to combine those internal states with external match features into a single cooperation signal. If the signal clears the threshold, she cooperates. If not, she defects.
Nadia calculates a linear decision boundary. Let \( \mathbf{x}_n \) be the vector of extracted features and internal states at round \( n \), such as the opponent's historical cooperation rate, the recent betrayal frequency, and accumulated trust.\[ S_n = \mathbf{w}^T \mathbf{x}_n + b \]\[ N(n) = \begin{cases} C & \text{if } S_n \ge \tau \\ D & \text{otherwise} \end{cases} \]
That structural complexity also means Nadia has the standard downside of all intricate systems: more places for small compounding errors to creep in. I found the weights through local search, endless tournament sweeps, and manually tweaking the coefficients too many times.
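Structurally, the decision rule looks something like the sketch below. The feature names and every numeric weight here are placeholders I am inventing for illustration; the real values came out of those tournament sweeps.

```python
# Illustrative weights and threshold (placeholders, not the tuned values).
W = {"coop_rate": 2.0, "recent_betrayals": -1.5, "trust": 1.0}
B, TAU = 0.0, 0.5

def features(opp_hist, trust):
    """Extract the feature vector x_n from the opponent's history."""
    n = max(len(opp_hist), 1)
    return {
        "coop_rate": opp_hist.count("C") / n,
        "recent_betrayals": opp_hist[-5:].count("D") / 5,
        "trust": trust,
    }

def nadia(opp_hist, trust=0.5):
    """Cooperate when the linear signal w.x + b clears the threshold tau."""
    x = features(opp_hist, trust)
    signal = sum(W[k] * x[k] for k in W) + B
    return "C" if signal >= TAU else "D"
```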






Mara, the LLM native
Mara was the wildcard of the group and probably the most fun one to make. She was my LLM-native experiment in game theory play. The central question was simple: what happens if I stop hand-designing the entire rigid policy and let a large language model sketch the core controller logic from a descriptive prompt?
Mara looks at very similar fundamental signals to Nadia, but I allowed the shape of her weighting and logic to come out of the model's textual reasoning rather than utilizing the exact same kind of direct, hyper-optimized manual tuning that I used before. Let the model's generated features be \( \phi(H) \), predicting a composite score:\[ M(n) = f_{LLM}(\phi(H^{n-1})) \]
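The shape of the agent is sketched below, with the model-generated controller reduced to a trivial stand-in. The real \( f_{LLM} \) logic came out of the model's prompted reasoning and is not reproduced here; everything in this scorer is a placeholder.

```python
def phi(opp_hist):
    """Feature extraction over the history H^{n-1}."""
    n = max(len(opp_hist), 1)
    return {
        "coop_rate": opp_hist.count("C") / n,
        "last_was_defect": bool(opp_hist) and opp_hist[-1] == "D",
    }

def f_llm(feats):
    # Stand-in scorer: the actual controller logic was sketched by the LLM
    # from a prompt, not hand-written like this toy rule.
    return feats["coop_rate"] - (0.5 if feats["last_was_defect"] else 0.0)

def mara(opp_hist, threshold=0.4):
    if not opp_hist:
        return "C"                     # open cooperatively
    return "C" if f_llm(phi(opp_hist)) >= threshold else "D"
```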
Mara is intensely interesting because she really is not just a joke agent. She truly can compete against the field. She just does not currently have the exact same rock-solid consistency as Eleanor, or even Nadia when she is having a good run.
The most accurate way that I can personally describe the outcome is that prompt-native cleverness surprisingly gets you extremely far in these complex environments, but the absolute finishing bit of top-tier performance still inevitably comes from much tighter calibration than the unprompted model's initial intuition naturally provides you.






What the tournament showed
The ultimate headline from the data is quite simple. Eleanor won. She did not just win once by random accident, and she did not win through one very weird specific bracket sequence. She won relentlessly across all the validation seed runs in a deeply structured way that felt incredibly real to me. What made that so intensely satisfying was that her final performance perfectly matched the original intuition behind her programmatic design.
Nadia came in third overall, which is still strong, and stronger than a large portion of the established classic entrants. But her performance also exposed exactly the tradeoff I was worried about from the start. When the situation is clean, all of that control logic becomes overhead.
Mara landed perfectly around the very middle of the pack. She proved competitive, genuinely interesting to analyze, and not remotely embarrassing at all given her genesis, but still she was clearly below the top tier of hand-engineered agents. That result actually told me something deeply useful about LLMs in these spaces.


Perhaps the most fascinating part of the final leaderboard was not even the exact order of the winners but rather the general shape of it. The top tier still surprisingly looks a lot like Axelrod's original world. All Defect does awfully. All Cooperate gets terribly exploited. The winning region is still firmly this very specific middle ground where you start very kind, retaliate appropriately when necessary, forgive readily when recovery currently looks real, and stop completely when it definitively does not.






What the experiment taught me
Simplicity is significantly more powerful than I ever wanted to admit to myself. Watching Eleanor effectively beat Nadia was an exceptionally good reminder that doing substantially more things is generally not the equivalent of doing better things. Every extra moving part I added gave systemic noise somewhere convenient to hide.
Overall resilience matters profoundly more than just looking temporarily brilliant in one isolated matchup. Some of the strategies could spike a lot higher than Eleanor did in very specific pairwise pairings, but that did not actually matter much in the context of a full ecosystem tournament if those identical strategies inherently possessed incredibly dumb failure modes elsewhere.
Repeated multi-agent games systematically reward consistency far more than raw cleverness. This is the single biggest thing I am taking away from the whole project. Winning strategies are rarely the ones that chase the flashiest move in any one round; they are the ones that apply a sound rule reliably across hundreds of rounds.
I repeatedly find myself coming eagerly back to this specific project precisely because the Prisoner's Dilemma really is not actually about criminal prisoners in any capacity. It is heavily about every possible situation where offering trust feels incredibly risky, executing betrayal looks wildly tempting, and crucially the social interaction between parties is not actually over after just one single move.
The profound lesson, for me at least, is absolutely not that you should always trust everyone blindly, and it is definitely not that you should always suspiciously betray everyone first to secure an edge. The actual truth is something far closer to maintaining reciprocity with a firm spine. You should start entirely cooperative. You should retaliate immediately when you are forced to. You should forgive when the empirical evidence clearly supports it. And you must stop engaging unconditionally when the structure of the interaction has finally proven itself to be fundamentally bad.
At the end of the day, I realized that nearly all of the most important situations in life are fundamentally repeated games at their core. What ultimately matters the most is not the one singular clever move you can loudly brag about after the fact, but rather the internal steady rule that you continuously keep following reliably when absolutely nobody resets the intricate board for you.