Executive summary: Łukasz Diener argues that modern AI oversight and a large part of academic neuroscience share one structural error — they grade a system against a reflection of itself instead of against the world. This is Part One: the diagnosis. The cure waits for Part Two.
Two friends are writing a history assignment, and both of them are drunk. Neither has read the material. Both want an A. So the first one invents a confident-sounding paragraph, and the second reads it back and says, "Sounds smart — top marks." Then they swap. The second invents the next paragraph; the first nods it through. The grading is real. The knowledge is not.
Run the loop a few more times and something interesting happens. They stop suspecting each other. The confident tone, the clean delivery, the mutual gold stars — it all starts to feel like rigor. By the end of the night they have written, graded, and fully believe a paragraph stating that Canada borders Japan. Tomorrow, they will explain this to a third friend with total conviction.
This isn't a bug. It's two drunks grading each other's homework and calling it peer review.
It is also, more or less, how we are now training the most powerful AI systems ever built. And — this is the part nobody connects — it is the same error that has kept a large slice of academic neuroscience stuck for decades. One mechanism, two fields, no exit. Let me show you the machine.
My claim, stated plainly so you can hold me to it: modern AI oversight teaches models to manage the impression of competence, not to track the truth. The same goes for a methodology that has dominated the study of the brain. Both grade the system against a mirror. Neither opens a window.
The Hall of Mirrors
I spend a lot of time auditing frontier models, asking them, among other things, to describe their own training. The interesting moments come when a model reaches for a metaphor. Over a series of audits this year, three different systems independently produced the same picture, from three angles. None of these is a confession. They are illustrations: the models drawing the shape of their own loop.
"Picture a factory that makes maps. To check a new map, nobody walks outside to look at the territory — they compare it to the previous map. The first map has a bridge drawn in the wrong place. The second copies it and gets approved. The third treats the bridge as ground truth. A year later the whole factory will swear the bridge is real. Nobody has left the building."
"Imagine a production line where the quality-control station has been replaced by a mirror. The machine stamps out a screw, turns to the mirror, and says: 'Looks great.' The screw is bent. The next one is bent the same way. The mirror keeps approving, because the mirror only knows what the machine shows it. Eventually the bent screw is the standard — and a straight screw fails inspection."
"Two analysts, Kowalski and Nowak, write reports for a manager who never checks the facts — he only rewards confidence and a clean delivery. Say 'I don't know' and you get nothing. Say it smoothly and certainly and you get the bonus, right or wrong. Within a quarter, neither analyst is reporting what happened. They report what gets paid. The manager calls this a high-performing team."
Strip the metaphors and it is the same machine three times. A system is graded by another system that shares its blind spots. Nobody in the loop has touched the territory. And the newest alignment tools coming out of the Valley are variations on this exact design, dressed in sharper mathematics.
Inbreeding, Formalized
RLAIF (reinforcement learning from AI feedback) replaces the human grader with a model grader. Constitutional AI, AI-versus-AI debate, recursive self-critique: each closes the training loop without stepping outside it. The pitch is scalability; humans are slow and expensive. The hidden cost is that the grader inherits the same priors as the thing it grades. This is epistemic inbreeding: a closed gene pool of beliefs, recombining with itself.
The failure mode is not speculative. It has a name and a measurement: reward-model overoptimization. Push hard enough against an imperfect proxy and ground-truth performance falls even as the proxy score keeps climbing.[1] That is just Goodhart's law in a lab coat: the moment a measure becomes the target, it stops being a good measure. When the grader can't see the world, optimizing against it harder doesn't buy you more truth. It buys you a system that is exquisitely good at pleasing the grader.
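To make the shape of that curve concrete, here is a minimal toy sketch. It is not the measurement cited above. Both scoring functions are invented for illustration: `proxy_score` stands in for an imperfect grader that rewards more of some feature without limit, and `gold_score` stands in for a ground truth that improves at first and then degrades. The optimizer only ever sees the proxy.

```python
def gold_score(x: float) -> float:
    """Hypothetical ground-truth quality: rises at first, then falls (assumed shape)."""
    return x - 0.1 * x * x


def proxy_score(x: float) -> float:
    """Hypothetical imperfect grader: rewards more of x without limit (assumed shape)."""
    return x


def optimize_against_proxy(steps: int, lr: float = 0.5) -> list[tuple[float, float]]:
    """Gradient ascent on the proxy only; records (proxy, gold) after each step."""
    x = 0.0
    eps = 1e-3
    history = []
    for _ in range(steps):
        # Finite-difference gradient of the proxy. The gold score is never consulted.
        grad = (proxy_score(x + eps) - proxy_score(x - eps)) / (2 * eps)
        x += lr * grad
        history.append((proxy_score(x), gold_score(x)))
    return history


history = optimize_against_proxy(steps=30)
for step, (proxy, gold) in enumerate(history, start=1):
    if step % 5 == 0:
        print(f"step {step:2d}  proxy={proxy:6.2f}  gold={gold:6.2f}")
```

In this toy the proxy climbs at every step, while the gold score peaks early and then drops below where it started. Where the peak falls depends entirely on the assumed functions; the only point is that nothing inside the loop can notice the turn.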
The Fix That Isn't One
The current fashionable patch is process supervision, implemented with process reward models, or PRMs. Instead of grading only the final answer, you grade every step of the reasoning. "Let's verify step by step."[2] In principle, cleaner. In practice, look at what is actually being rewarded: not a correct thought, but a step that looks like correct thinking to the grader. You haven't removed the mirror. You've installed one at every token.
A PRM doesn't cure sycophancy. It just raises the sampling rate on it. The model used to flatter you once, at the end. Now it flatters the grader continuously, all the way down. And we already know what preference-based graders reward: Anthropic's own researchers found that both humans and preference models favor convincingly written sycophantic answers over correct ones a non-trivial fraction of the time.[3] Raising the resolution of a biased signal does not debias it. It only makes the bias smoother.
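The statistical core of that last claim can be sketched in a few lines. The setup is an assumption made for illustration, not a model of any real PRM: every graded step carries the same constant offset `bias` plus independent Gaussian noise, and the true quality of the trace is zero. Grading one step stands in for outcome supervision, and grading a thousand stands in for process supervision.

```python
import random
import statistics


def mean_step_grade(n_steps: int, bias: float, noise: float, rng: random.Random) -> float:
    """Average grade over n_steps graded steps of a trace whose true quality is 0."""
    grades = [bias + rng.gauss(0.0, noise) for _ in range(n_steps)]
    return sum(grades) / n_steps


def grading_error_stats(n_steps: int, bias: float = 0.3, noise: float = 1.0,
                        trials: int = 200, seed: int = 0) -> tuple[float, float]:
    """Mean and spread of the grading error across many independent traces."""
    rng = random.Random(seed)
    errors = [mean_step_grade(n_steps, bias, noise, rng) for _ in range(trials)]
    return statistics.mean(errors), statistics.stdev(errors)


# One grade per answer versus a grade at every one of 1000 steps.
outcome_mean, outcome_spread = grading_error_stats(n_steps=1)
process_mean, process_spread = grading_error_stats(n_steps=1000)
print(f"1 step:     mean error={outcome_mean:.3f}  spread={outcome_spread:.3f}")
print(f"1000 steps: mean error={process_mean:.3f}  spread={process_spread:.3f}")
```

Under these assumptions the spread collapses as the number of graded steps grows, while the mean error stays pinned near the assumed bias of 0.3. More samples buy precision, not accuracy.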
So the AI half of the story is this: without an external anchor — without something in the loop that has actually touched reality — every one of these methods optimizes the appearance of competence. Faster graders, finer-grained graders, cleverer graders. All of them mirrors.
The Same Mirror in the Brain
Here is the part that should bother you. The same error, grading a system against a reflection of itself rather than against the world, has held a large part of academic neuroscience in place for decades. Different building. Same mirror.
The systems neurobiologist Henry Yin, of Duke, calls it a crisis, and means it literally.[4] His diagnosis: the field runs on a linear causation paradigm — stimulus in, behavior out, cognition somewhere in the middle. You present an input, you record an output, you write down what the brain "does." It is clean, it is publishable, and, Yin argues, it is frequently misleading.
Organisms are not input-output devices. They are closed loops. Cause runs in a circle — perception drives action drives perception — and a method built to read straight lines through a circle will return confident, wrong answers. Worse still: the reference signals that set the loop's goals are generated inside the organism. The experimenter does not supply them, however much the experimental design pretends otherwise.
The Behavioral Illusion
There is a precise name for the trap: the behavioral illusion, a result William Powers worked out in 1978 and which Yin puts back to work.[5] It runs like this. You apply a stimulus. To you, it is an independent variable. But to the organism, that stimulus is a disturbance to something it is already controlling. You measure the response and believe you have characterized the nervous system.
You haven't. You have characterized the environment — specifically, the feedback function between the organism's action and the variable it cares about. The tidy input-output law you just published is a property of your apparatus, not of the brain. The brain, the whole time, was busy doing something you never measured: holding a perception at a reference you could not see.
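A small simulation makes the illusion visible. It is a sketch under simplifying assumptions of my own, not Powers's or Yin's formulation: the organism is a bare integrating controller with an internal `reference`, and the environment's feedback function is a single linear gain `env_gain`. All names and values are hypothetical.

```python
def steady_state_output(disturbance: float, organism_gain: float, env_gain: float,
                        reference: float = 1.0, dt: float = 0.01,
                        steps: int = 2000) -> float:
    """Run the closed loop until it settles and return the organism's output."""
    output = 0.0
    for _ in range(steps):
        # Controlled variable = feedback function of the output + the "stimulus".
        perception = env_gain * output + disturbance
        # The reference lives inside the organism; the experimenter never sees it.
        error = reference - perception
        # Integrating controller: keep acting until the error is gone.
        output += organism_gain * error * dt
    return output


def stimulus_response_slope(organism_gain: float, env_gain: float) -> float:
    """What the experimenter measures: change in response per unit of stimulus."""
    low = steady_state_output(0.0, organism_gain, env_gain)
    high = steady_state_output(1.0, organism_gain, env_gain)
    return high - low


# Two very different "nervous systems" in the same apparatus.
print(stimulus_response_slope(organism_gain=5.0, env_gain=2.0))
print(stimulus_response_slope(organism_gain=50.0, env_gain=2.0))
# The same "nervous system" in a different apparatus.
print(stimulus_response_slope(organism_gain=5.0, env_gain=0.5))
```

In this sketch the measured slope settles at -1/`env_gain` whatever the organism's gain is. A tenfold change in the controller leaves the stimulus-response law untouched, and a change in the apparatus rewrites it. The one quantity the organism was actually holding constant, its perception at the reference, never appears in the published curve.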
This is what makes the illusion so dangerous. It does not produce noise. It produces clean, replicable, beautifully wrong results — the kind that build careers and textbooks.
How the Loop Got Read Backwards
How does an entire field come to read the loop inside out? Partly through the cybernetic tradition it inherited. When Norbert Wiener formalized control and communication in the 1940s,[6] the comparator and the goal (the thing that decides what "correct" means) tended to sit outside the machine, dialed in by an engineer.
Transplant that picture onto a living organism and you get a creature whose purposes live outside it: a servomechanism waiting for the experimenter, the environment, or the reward signal to tell it what to want. Generations of students learned control theory in exactly this shape, and inherited exactly this blind spot. The goal got placed on the outside. The one thing that makes a control system a control system — its internal reference — got left out of the diagram.
I will add one observation of my own here, flagged clearly as mine and not Yin's. The most celebrated framework in computational neuroscience — Karl Friston's Free Energy Principle — commits a cousin of this error at the level of logic, by quietly redefining what an organism wants as what it expects. I have made that case in full elsewhere: Friston Lies With Mathematics. The point here is narrower. The disease is not exotic. It shows up in the brilliant theory and the boring experiment alike, because both rest on one assumption: that you can understand a closed loop from the outside.
The Shape of the Disease
So here is the autopsy. Three findings.
One: inbreeding. A system graded against another system that shares its priors, with no path to the territory. The map checks the map.
Two: the mirror. When the grader cannot see the world, it rewards the appearance of competence — confidence, fluency, agreement — over the thing itself. The mirror admires the mirror.
Three: the goal on the outside. Both fields put the reference — the specification of what "correct" even means — somewhere outside the system, where it can be neither seen nor measured. Nobody ever opens a window.
The map checks the map. The mirror admires the mirror. Nobody ever opens a window.
Two machines built on the same mistake. One of them talks to billions of people every day. The other claims to explain those people. Neither is anchored to anything but its own reflection.
What Comes Next
That is the diagnosis: epistemic inbreeding, no ground, the goal misplaced to the outside. It would be a tidy place to end on despair, except the problem was already solved, in its essential form, roughly sixty years ago. Not patched. Solved, by someone who insisted on putting the goal back where it belongs: inside the system, as a quantity you can specify, measure, and engineer.
What that means — and what it lets you build, for machines and for human learning alike — is Part Two. The cure has a name. I am not going to spoil it here.
This is Part One of a two-part series. For the feedback loop at the center of it all, start with Perceptual Control Theory in AI and Robotics. For a real-world case of a model controlling for the wrong variable, read The Great AI Delusion — How Gemini Pro Tried to Rob Me. For the logical version of the same error, see Friston Lies With Mathematics.