
The Word Made Flesh: Why Language Models Can't Simulate Humans

We deceive ourselves. Professionally and systematically.

A study from 1977 — Nisbett and Wilson, somewhat dated now, but still the foundation for everything that followed. Cited endlessly, yet rarely internalized: humans have remarkably little access to their own cognitive processes. When asked why they made a particular decision, they invent plausible stories — post-hoc rationalizations that sound good, that make sense, that are internally consistent, but that have about as much to do with the actual decision process as a single Instagram story has with an entire vacation.

We feel first, we justify later. And we're so good at mistaking the justification for the original that we don't notice the difference ourselves — and so we shape our reality.

If people are already bad at understanding their own decisions, then everything they write about them is already a reconstruction. Not documentation. A narrative — polished, smoothed, made socially acceptable.

This sounds like Psychology 101, but it leads directly to the heart of the current AI debate.

One Hundred Million for the World's Most Beautiful Façade

My friend Leonard, who works in market research and whom I dragged into the AI rabbit hole at some point — he's just as stuck now as I am — sent me a link at nine on a Saturday morning. Of course I clicked on it immediately like a madman and spent the entire day mentally occupied with it.

Simile, a startup that announced a $100 million Series A on February 12, 2026, led by Index Ventures, backed by the biggest names in the industry. My first thought looking at the website: AI slop. Thin, vague, pitch-deck vocabulary — for a hundred million, I would have expected at least a subpage.

But there it was. A paper behind it. I read it myself first, then ran Claude and ChatGPT over it. One entire Saturday later, I was deep in the material and realized: I'm no longer thinking about Simile, but about the question that has occupied me for months — can language models simulate human behavior? And if so: where does simulation end, and where does self-deception begin?

Simile's founder is Joon Sung Park, the same Joon Sung Park who published the famous "Generative Agents" paper in 2023, together with Percy Liang and Michael Bernstein from Stanford. His claim: Yes. A "foundation model that predicts human behavior in any situation." Digital twins — AI agents that are supposed to simulate real people based on actual survey data from Gallup.

A godsend for insurance companies, marketing, and everyone who profits from market research. The VP of Customer Experience at CVS Health is quoted on Simile's website: "We don't have to fail fast in front of real customers anymore — we can fail safely in a controlled environment."

It sounds tempting. It's based on a premise I consider fundamentally flawed.

The Sims, but with a PhD

The paper everything is built on is fascinating. Park and colleagues placed 25 AI agents in a sandbox world — a kind of Sims, but powered by a language model instead of hand-written scripts. The agents woke up in the morning, had breakfast, went to work, made small talk. One organized a Valentine's Day party, the invitation spread from agent to agent, and five showed up at the virtual café at 5 PM sharp.

Emergent social behavior — not programmed, not scripted. Impressive.
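
For the curious, the core loop behind such agents fits in a few lines. What follows is my own stripped-down sketch, not Park et al.'s actual architecture (which adds memory retrieval, reflection, and planning); the `llm` function is just a placeholder for any language-model call.

```python
# Minimal sketch of a generative agent's loop (my simplification, not Park et al.'s code).
# `llm` is a placeholder for any text-completion call.
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    memory: list = field(default_factory=list)  # running log of observations and actions

    def step(self, observation: str, llm) -> str:
        self.memory.append(f"Observed: {observation}")
        prompt = (
            f"You are {self.name}. Recent memories:\n"
            + "\n".join(self.memory[-10:])
            + "\nWhat do you do next? Answer in one short sentence."
        )
        action = llm(prompt)  # the language model chooses the next action
        self.memory.append(f"Did: {action}")
        return action

# Stub in place of a real model call:
isabella = Agent("Isabella")
print(isabella.step("It is the morning of Valentine's Day.",
                    llm=lambda prompt: "Start planning a party at the cafe."))
```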

But what was actually measured? Believability. Human evaluators judged how plausible the behavior appeared. And yes, it appears plausible. But believability is not validity. That an agent sounds human doesn't mean it decides like a human — just as a good actor playing a surgeon can look convincing without anyone wanting to let him near an operating table.

The empirical data is somewhat sobering. One study used 31,865 real online shopping sessions to test whether LLM agents make the same decisions as real users: the agents sounded convincing and were wrong in 88 percent of cases. For box-office predictions, the models achieved a correlation of 0.85 for films already in the training data. For genuinely new films? 0.3. The models remember. They don't predict.
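
The distinction is easy to make concrete. Here is a toy sketch with fabricated numbers, not data from any of the studies above: believability is what raters say about how human the agent sounds; validity is whether the agent's choice matches what the real person actually did. The two can diverge completely.

```python
# Toy illustration with fabricated numbers (not data from the cited studies):
# high believability and low validity can coexist.
import numpy as np

rng = np.random.default_rng(42)
n = 1_000

real_choice = rng.integers(0, 2, size=n)        # what the human actually did (bought / didn't buy)
agent_choice = rng.integers(0, 2, size=n)       # what the LLM agent predicted, here: a coin flip
believability = rng.uniform(4.0, 5.0, size=n)   # raters find the agent's reasoning very "human"

validity = (agent_choice == real_choice).mean()  # agreement with real behavior, ~50% for a coin flip

print(f"believability: {believability.mean():.1f} / 5")
print(f"validity:      {validity:.0%} of decisions match the real ones")
# Sounding human is cheap; matching what humans actually do is the hard part.
```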

The Review Is Not the Impulse

Here lies the actual thinking error — fundamental, almost embarrassing once you see it.

Text is not behavior. A text, an argument, a description is what remains after a person has pressed their behavior through the double filter of reflection and self-presentation.

If you have a fit of rage and then write an email, the email doesn't contain the rage. It contains a smoothed version of it — polished for the recipient, for the boss, for your self-image.

If at three in the morning, out of frustration, loneliness, and because your cortisol levels have been through the roof since the argument this afternoon, you make a purchase on Amazon, you don't write a blog post about the causal connection between your stress level and the order. You write: "Great product. 4 stars."

The LLM learns the review. Not the impulse. Not the three in the morning. Not the cortisol.

A language model is a model of the human façade, not of human essence.

And it gets one layer more absurd: what people write down is already a rationalization. Now we train a language model on these rationalizations, and the model rationalizes on top of that. A NeurIPS study from 2023 (Turpin et al.) demonstrated this elegantly: the model delivers chain-of-thought explanations that sound plausible to us, that would convince any investor, yet systematically misrepresent what actually drove the answer.

Models trained on the outputs of billions of different people represent the range of human rationalizations — but not the biological state in which the decision was actually made. Not the hunger, not the cortisol, not the sleepless night.

The Body That Nobody Misses

Humans are not brains on stilts. We are biological organisms, permanently influenced by physiological states that we ourselves barely perceive — let alone write down.

Hormones like cortisol fundamentally change how we evaluate risks — systematically shifting decisions toward more intuitive, faster, and often wrong solutions. Sleep deprivation degrades impulse control — precisely the faculty that says "No, you don't need that" when you come home at two in the morning and suddenly open your laptop anyway because that one offer will definitely be gone tomorrow.

Antonio Damasio has been researching somatic markers since the '90s: gut feelings in the most literal sense, which guide decisions before the conscious mind even realizes a decision is being made. In 2025, he published a paper with UCLA colleagues that puts it succinctly: multimodal language models "interpret 'heat' without ever feeling warmth, parse 'hunger' without ever knowing need."

All of this — cortisol, hunger, sleep, hormones, interoception — is structurally invisible in text data. The causal path from the HPA axis through cortisol levels to the purchase decision appears in no Amazon review, no tweet. It exists in bodies, in neural pathways, in the diffuse discomfort you can't Google but that determines whether you book the more expensive flight today.

Xu et al. confirmed this empirically in 2025 in Nature Human Behaviour: the agreement between LLM and human concept representations systematically decreases the more bodily the concept becomes. LLMs understand "democracy" better than "toothache." Abstract concepts are well-encoded in language. Bodily experiences exist only as description. Never as experience.
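
The shape of that finding can be sketched in a few lines. The numbers below are invented and the pipeline is not Xu et al.'s; it only illustrates the kind of comparison involved: correlate model-derived concept similarities with human similarity judgments, separately for abstract and for bodily concepts.

```python
# Sketch of the comparison (fabricated data, not Xu et al.'s pipeline):
# how well do model similarities track human similarity judgments,
# for abstract concepts ("democracy", "justice") vs. bodily ones ("toothache", "hunger")?
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(7)
n_pairs = 45  # pairwise similarities among 10 concepts

human_abstract = rng.uniform(0, 1, n_pairs)
model_abstract = human_abstract + rng.normal(0, 0.1, n_pairs)   # model tracks these closely
human_bodily = rng.uniform(0, 1, n_pairs)
model_bodily = human_bodily + rng.normal(0, 0.6, n_pairs)       # only loosely for bodily concepts

rho_abstract, _ = spearmanr(model_abstract, human_abstract)
rho_bodily, _ = spearmanr(model_bodily, human_bodily)
print(f"abstract concepts: rho = {rho_abstract:.2f}")
print(f"bodily concepts:   rho = {rho_bodily:.2f}")
# The gap between the two correlations is the pattern the paper reports.
```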

Arguments for the Counterposition

Language is the only medium we share with language models — and it's more powerful than my argument so far might suggest. Max Louwerse has shown across 126 experiments that language statistics reveal a surprising amount about embodied relations — language as a compressed map of the physical world.

But a fundamental difference remains between statistical correlation and causal conditioning. An LLM can portray hunger in a short story so vividly that your mouth waters while reading. But it cannot decide while hungry. It cannot condition its behavior on a bodily state it doesn't have.

This is not an engineering problem. It's a structural limit of the medium. And that's perfectly fine.

Where the Simulation Crashes

For certain applications, Simile's promise works: rough sentiment snapshots, hypothesis generation, pretests. Wherever deliberative, linguistically mediated opinions are at stake — rational weighing, conscious argumentation — model simulations have legitimacy. Real legitimacy.

But most decisions are something else: fast, affect-laden, body-state-dependent. The impulse in the supermarket. The willingness to pay that rises with hunger. The brand preference that shifts with sleep deficit.

Simile's Gallup partnership is clever — real survey data as a foundation is better than pure language-model extrapolation, the way a portrait photo is better than a composite sketch. But even the best survey data captures what people say. Not what they do or will do.

The Most Honest Mirror

There's something about these systems that won't leave me alone — something that has less to do with the technology than with us.

When an LLM hallucinates, when it claims something flatly false with absolute conviction, the internet is outraged. The model makes things up! It lies! But that's exactly what we do too. Constantly. We invent memories of events that never happened. We assert things with a certainty that has no foundation — at family dinners, in salary negotiations, on LinkedIn.

And when we're caught, we reframe the error as a learning moment because that feels better than "I screwed up."

LLMs rationalize because they were trained on our rationalizations. They hallucinate because we hallucinate — only we write our hallucinations down and call them opinions, beliefs, gut feelings.

The mirror this technology holds up to us is uncomfortable, not because it shows what the machine does wrong, but because it shows what we have in common with it. The only difference is: with the model, we notice.

Not Us, but Not Nothing Either

LLMs are a different form of intelligence. Trained on the end products of our biological processes — not on the processes themselves. The press release, not the board meeting. The review, not the midnight impulse.

There will be no AI that behaves exactly like a human. Not because the models aren't good enough — but because text isn't sufficient to replicate biology. Even if we had all the texts of humanity, we would only have the façade. And the façade is what we want to see. Not what we are.

In the Bible it says: The Word became flesh — language becomes alive, becomes body, becomes being. Simile sells the promise that the reverse path works just as well: that you can press the flesh of human experience into words, feed these words into a model, and end up with something that behaves like a human.

But in the pressing, exactly what makes humans human is lost — the body that decides before the mind knows.

The Word became flesh, says John. Flesh became word, we say — every time we write down our experiences. That this word never becomes flesh again is not a weakness of the technology. It's the nature of translation.

Key papers: Park et al. (2023), Kadambi, Aziz-Zadeh, Damasio et al. (2025), Xu et al. (2025), Chemero (2023), Louwerse (2011), Nisbett & Wilson (1977), Turpin et al. (2023), Bisbee et al. (2024), Goli & Singh (2024).