A long-novel coordinate system, used in reverse
Epistemic status. This is an essay, not a research paper. I am an undergraduate engineering student in mainland China, not a literary scholar or an alignment researcher. The argument runs from a coordinate system I built for reading long novels, applied in reverse to long-form text produced by current large language models. The empirical evidence I lean on is a mix of my own reading and a small set of published findings I cite where I rely on them; the rest is reasoning from observation. I take long-form fiction as a stress test for the current paradigm, on the view that long novels exercise more dimensions of cognition than most other text-generation tasks. I would be glad to be wrong about specific points. The kinds of feedback most useful to me are listed at the end.
Disclosure. The argument, structure, and judgments in this essay are my own. AI was used as a drafting, translation, and editing tool, and I remain responsible for the argument and the final wording.
Why I’m posting this on the EA Forum. This essay is not primarily an AI safety research paper. It uses long-form fiction as a stress test for current LLMs, and argues that some failures in AI-written long novels point toward broader limitations around grounding, nested mental-state tracking, meta-level narrative intent, conviction, and evaluation. I’m posting it here because I think these limitations may matter for how we think about LLM capability evaluation, alignment-relevant model behavior, and the limits of scale-only improvement.
I · The Long Novel as an Object
A long novel does not move us because it is long.
A common assumption is that a long novel is a short story scaled up: more content, more characters, a longer time span. The problem with this view is that it treats length as a merely quantitative difference. What actually happens inside a good long novel typically cannot happen inside a short story, not because there isn't room for it, but because a long novel is structurally a different kind of thing.
After finishing a long novel, I tend to trace back through it and ask where it moved me, and where it failed. After enough books, those traces keep landing in the same handful of places. The places behave largely independently of each other: strength in one does not reliably compensate for failure in another, and when one of them gives way, the book tends to collapse at that point, however strong it is elsewhere.
I eventually settled on eight layers, ordered from the surface of the reading experience to its depths.
The first is the textual surface. A sentence has its own pace: short sentences push you forward, long sentences slow you down and let an atmosphere settle. It has its own temperature, colloquial or written, archaic or austere or vulgar; a writer commits to a ratio and holds it stable. It has its own grain: down to the touch and smell of things, or only an outline. Its rhetoric and rhythm form a particular cadence of repetition, parallelism, pauses, and silences. This layer registers the moment you start reading, before you understand the plot or know any characters. It acts directly on the feel of reading.
Close behind it is narrative density: how much new material is packed into a chapter, how feelings rise and fall, how many irreversible things actually happen as opposed to mere conversation and reflection. Then scene texture: whether space is visualizable, whether actions are intelligible, whether the world is touchable. Together these three sub-elements form the first layer. A long novel has to hold up at the textual surface first; otherwise it cannot enter the reader's attention at all.
The second is the narration layer — how the story is being told. This layer is about who is telling, not about what happened.
This includes the position of the viewpoint: omniscient, limited, multiple shifting viewpoints, unreliable narration. The control of focus: how much the reader knows, whether more or less than the protagonist, and when they are allowed to know it. The distance of the lens: close to the inner life, or distant narration, and whether the change in distance carries meaning. Below that is temporal structure: linear progression, recollection and flashback, circular or recurring forms, fragmented assembly. The ratio between narrating time and story time, where a single minute can take ten chapters and a decade can be dispatched in one sentence. The choice of node: why begin at this moment rather than earlier or later. Below that is information and suspense — the type of suspense (an unknown truth, an unknown consequence, an unknown relationship, an unknown rule), the modes of misdirection (a character's misunderstanding, an omission in the telling, an obscuring surface, a skewed value system), and the reader's position (being led along, reasoning shoulder-to-shoulder with the protagonist, or watching from above).
This layer behaves largely independently of the first. A book can have superb language but a confused viewpoint, or plain language with extremely precise control of viewpoint. These are not the same capacity.
The third is the structural layer — what allows a long novel to take shape.
This layer is the mark of a long novel being a long novel. A short story does not have this layer because it does not need it. A long novel has to move through phases (of growth, of geography, of social class, of cognition), each phase having its own source of freshness — new rules, new adversaries, new relationships, new types of problem — with the phases connected by leveling up, escape, change of station, loss, the discovery of truth, the collapse of a value system. Conflict is not just confrontation; it includes institutions, ethics, cognition, identity, language, intimate relationships. The same problem deforms across the book, from personal into social, from external into internal. The ending is sometimes not the resolution of a problem but its redefinition. Repetition is unavoidable in a long novel; what matters is whether the repetition carries variation: a higher cost, a shifted feeling, a different angle of telling, a deeper revelation of the world. When the reader tires, the work has to find a way to let them breathe — humor, ordinary life, a side branch, an aesthetic transition.
This layer has nothing to do with the first two. A book can have good language and good viewpoint and still feel structurally loose. The reader will feel that it doesn't quite hold together as a book, that it's a sequence of chapters set next to each other.
The fourth is the character layer — the character as a living contradiction.
The character is treated not as a functional slot but as a way of being a person. Four core fields make up this layer. Desire: what they actually want, which often isn't what they say they want. Fear: what they are most afraid to lose. Blind spot: the part of themselves they will never admit to. And threshold: the conditions under which they will cross a line, break, betray, or give themselves up. These four together determine whether a character is alive. Beyond that is growth and change — change in capacity, in relationships, in cognition, in moral position, in identity, and whether change comes at the price of loss, trauma, and bearing weight. There is also the force of what does not change: some characters draw their charge precisely from refusing to change, and the work has to earn that refusal. Further in is relationship and tension. The key relationships are not about who likes whom, but about how each alters the fate and the values of the other. The basic tensions are dependence, rivalry, envy, redemption, indebtedness, possession, shame, awe. And there is the condition for a true ensemble: whether each person has an independent world, rather than orbiting the protagonist.
This layer behaves largely independently of the first three. A book can have a vast world and complex structure and still have characters who are paper-thin, characters who exist only to serve plot, with no independent inner structure. Readers can tell, even when they can't articulate what's wrong.
The fifth is the world-and-lived-reality layer — a worldview is not a setting sheet but a way of life.
This layer is about how people in the world live, not what the rule table says. How ordinary people get through their days (work, trade, education, religion, entertainment, marriage). How society is stratified (class, lineage, talent, capital, violence, knowledge). How value is recognized (honor, merit, credentials, status, divine sanction, wealth). Beneath this is power and violence: who is permitted legitimate violence, who gets to define justice and crime, what holds order in place (fear, faith, interest, institution, technology), whether resistance is possible, how resisters organize, how they are stigmatized or sanctified. At the deepest level is the system of myth and meaning, what in this world claims to explain everything: god, science, fate, blood, history. The protagonist's growth is, at bottom, the learning of which system of meaning to use, or which to overthrow.
Worldview and the character layer are two different things. A book can have deep characters and real relationships and still have an empty world, a world that is only a stage for the characters, with no logic that runs on its own.
The sixth is the theme-and-conviction layer — what the work is trying to convince you of.
The theme is not in what a character says aloud, not in the author's preface, not in a chapter title. The theme lies in the work's structural verdict on behavior: what behavior is rewarded, what is punished, what is forgiven. This is what I'll call the work's ethical algorithm — the structural pattern of which behaviors the work rewards, punishes, or forgives. Beneath this is the view of human nature: whether people are malleable or fated; whether good and evil are environmental or innate; whether love or power is more real; how freedom and security trade against each other; whether the work allows complexity, letting a person be both noble and contemptible at once. Further down is the unconscious of an era and a culture: what collective anxiety the work is responding to (anxiety about class, fear of losing control, the collapse of order, the question of identity), what compensatory fantasy it provides (becoming powerful, being recognized, taking revenge, turning the tables, saving the world).
This layer behaves largely independently of the character layer. A book can have full characters and still hold no judgment of its own. It merely describes some people, without letting its structure render any verdict on them. The reader closes the book feeling that nothing has been at stake, unsure what the author actually believed.
The seventh is the genre-and-tradition layer — this book in relation to others of its kind.
This layer is somewhat unusual. It describes the book's relation to other books, rather than what is inside it. Whether it delivers on the genre promises a reader recognizes (this is detective fiction, will the suspense resolve; this is cultivation fiction, will the protagonist grow more powerful). Where it changes the rules of the game (worldview, narration, aesthetics, values). Which mythic, religious, historical, or pop-cultural motifs it draws on. How it trains the reader in how to read it.
This layer functions as a meta-mapping over the previous six. At every one of those layers you can ask whether this book is new or old relative to its genre. But in practice it has its own failure modes (a book can be passable across the first six layers and still have its relation to genre badly misaligned, in revolt against a tradition it does not actually understand), so it deserves to stand on its own.
The eighth is the aesthetic-and-memory layer — what a long novel finally leaves behind.
This layer covers the longest timescale of reader experience. Years after finishing a long novel, the plot will fade, but what remains, remains: the system of images (recurring objects, colors, sounds, symbols accumulating meaning over time); the intensity of the visual (passages that are inherently imaginable and inherently transmissible); the emotional peaks (which need not be a battle and may be a parting, a forgiveness, a collapse, an awakening); the resonance (whether the ending leaves a closed feeling or an open one, relief or disquiet).
There is a structural correspondence between the eighth layer and the first. The first layer is rhetoric as a generative mechanism, acting in the moment of reading. The eighth is rhetoric as residue, still present years after the book is done. They are the same thing existing in two different forms across two different timescales, not a repetition. A book can be dazzling while being read and leave nothing afterward; another can read as ordinary and then, years later, have one of its images return unexpectedly, and only then does the reader realize the book has stayed with them all this time.
These eight layers come from long reading, not from any literary theorist. Their warrant doesn't come from a source; it comes from the fact that each layer can be observed and discussed on its own. Strength at one layer does not cover failure at another, and failure at one layer is not compensated by another.
I built this coordinate system for reading novels. The rest of this essay points it the other way: at the long-form text current large language models produce, and what happens at each of these eight layers.
II · AI Writing Long Novels
Layer One · The Textual Surface
When reading long-form text produced by AI, one of the first things you may run into is a strange kind of stability.
The sentences are fluent. The grammar is clean. The words are usually not wrong. After a while, though, a harder-to-name feeling appears: the sentences seem to come from the same person, no matter which character is supposed to be speaking, or what style the passage is supposed to be using. By “same person,” I do not mean a distinctive voice. The voice is closer to a featureless median. Dialectal grain, rough speech, archaic hardness, cold distance: these rarely survive for long. It feels like what remains after many possible prose styles have been averaged together and flattened.
Some empirical work points in an adjacent direction. In creative-writing experiments, generative AI can improve individual story ratings while reducing the collective diversity of the outputs. [1] This does not prove that AI prose will sound the same, and it says more about story production than sentence-level style. It is useful here because it points to a broader pressure toward convergence, and what I am describing is the prose-level version of that pressure. The pattern is unlikely to be only the flaw of one undertrained model; it may be built into the current paradigm.
Long novels make this harder to ignore. In a short story, a mediocre prose texture can still be covered by plot or theme, because the reader has not spent enough time inside it to become fatigued. In a long novel, the reader has to remain inside that prose for hundreds of thousands of words. Sustained exposure to the median can become a kind of abrasion. The reader may not be able to say exactly what is wrong, but their attention gradually drifts away from the content. What disappears, in my experience, is the urge to reread a sentence, copy down a rhythm, or remember the shape of a passage.
The problem with narrative density is related. My observation is that AI-generated chapters often have too much informational density and too little event density. They contain many explanations, setting details, and conversations, but few irreversible changes. At the end of a chapter, you often find that it has been turning in place. The characters have talked, remembered, and reflected, but the situation inside the scene has not changed. The reader needs the feeling that there is no going back now. In my experience, AI often fails to give it.
Scene texture is where the weakness at this layer becomes most visible. In my reading, AI-written scenes are often spatially vague. Where the characters are standing, how far apart they are, how they move through the room: these are often barely specified, or specified inconsistently. The model often treats the scene as an abstract stage for dialogue, somewhere vaguely “there,” rather than a place with concrete terrain and objects. When the reader reaches a stretch of dialogue, they often cannot assemble a map in their head. The effect rarely feels experimental. It feels more as if the scene was never first imagined as a physical space.
The failure at the first layer is basic, but it is not yet the deepest failure. Readers can accept plain prose. They can tolerate vague scenes, as long as the later layers hold. The diagnosis at the first layer is only the entrance; the larger problems lie deeper.
Layer Two · The Narration Layer
The next discomfort in AI-generated long-form text is a strange absence of narration: the story is being told, yet the sense of someone telling it is weak.
The text has been produced, but what produces it does not consistently behave like a narrator with viewpoint, distance, and stance. It behaves more like a generative mechanism. The difference is not abstract. It shows up in the experience of reading.
A real narrator controls what the reader knows: when a fact is exposed, when the reader is allowed to know more than the protagonist, and when everyone is kept in the dark until the last possible turn. Suspense, irony, and dramatic tension all depend on this control.
In my reading, current AI systems often struggle to sustain this kind of control across a long passage. They can state that the protagonist does not know X. The harder task is keeping X genuinely unavailable to the reader across the passage. The model often exposes relevant information as it generates. My suspicion is that local coherence is easier to maintain when the relevant information is available on the surface. The result is familiar: a passage meant to create suspense often gives away the answer halfway through. It reads as if withholding the answer has made the local continuation harder.
Unreliable narration is even more difficult. In the relevant sense, an unreliable narrator requires the writer to maintain two representations at once: what actually happened, and what the narrator believes happened. The gap between them has to remain stable. At least in default generation settings, current AI systems do not appear to maintain this kind of double representation robustly across long passages. The narration and the underlying “truth” tend to blur into one another. In an AI-written attempt at unreliable narration, the narrator may suddenly know something they should not know, or the narrative bias may fade out and later return without being motivated by the story.
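To make the idea of a double representation concrete, here is a minimal sketch of the kind of external bookkeeping a generation pipeline might keep alongside the prose. The fact names, the crude substring leak check, and the assumption that such a check helps at all are illustrative assumptions of mine, not a working method.

```python
# A minimal sketch of the "double representation" unreliable narration needs:
# one record of what actually happened, one record of what the narrator
# believes, and a list of facts the reader is not yet allowed to see.
# The leak check is deliberately crude (substring matching); it only
# illustrates the bookkeeping, not a real solution.

from dataclasses import dataclass, field


@dataclass
class NarrationState:
    ground_truth: dict[str, str]          # what actually happened
    narrator_belief: dict[str, str]       # what the narrator thinks happened
    hidden_from_reader: set[str] = field(default_factory=set)

    def leaked_facts(self, draft_passage: str) -> list[str]:
        """Return hidden facts whose content appears verbatim in the draft."""
        return [
            key for key in self.hidden_from_reader
            if self.ground_truth[key].lower() in draft_passage.lower()
        ]


state = NarrationState(
    ground_truth={"letter": "Anna burned the letter herself"},
    narrator_belief={"letter": "the letter was lost in the move"},
    hidden_from_reader={"letter"},
)

draft = "She told him, calmly, that Anna burned the letter herself."
print(state.leaked_facts(draft))   # ['letter'] -> the draft gives the truth away
```

The point of the sketch is only that the truth, the narrator's belief, and what the reader has been shown are three separate objects. In ordinary generation, nothing corresponding to them appears to be maintained apart from the text itself.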
Temporal structure presents a related difficulty. Linear progression is manageable. Recollection, flashback, circular structure, and fragmented assembly require narrating time and story time to remain separate. AI often handles this separation unreliably, especially across longer passages. It tends to lay events out in the order in which they occurred, perhaps because chronological order gives the model an easier path to local coherence. Once the timeline is broken apart, the text often loses track of which temporal layer it is in. A passage that should remain memory turns into present action, or something that happened long ago is treated as if it were new.
The issue at this layer seems larger than a collection of technical mistakes. It concerns the weakness, or absence, of a narrator as a meta-level structure independent of the text being output. At the base-model level, autoregressive language modeling trains the system to predict the next token from prior context. In ordinary generation, it is not clear that the model holds a separate account of what it is doing as a narrator: what it is doing to whom, what it is hiding, and why the story begins here rather than elsewhere. The effect is that narration, as a separate layer of control, tends to collapse into text generation. The model produces prose, but the question of how that prose is being narrated does not reliably remain available as a separate object of judgment.
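For readers who want the mechanism stated rather than paraphrased, the base training objective can be written in one line. This is the standard autoregressive factorization from the textbook description of language modeling, not a claim about any specific system:

```latex
p_\theta(x_1, \dots, x_T) \;=\; \prod_{t=1}^{T} p_\theta\!\left(x_t \mid x_1, \dots, x_{t-1}\right)
```

Training maximizes the likelihood of each token given its prefix. Nothing in this objective refers to what a narrator is withholding, from whom, or why; if that kind of control appears in the output, it has to be carried implicitly by the context rather than by anything the objective itself asks for.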
Layer Three · The Structural Layer
The problem at the third layer is often one of the easier failures to notice in AI-generated long-form fiction, but its root may go deeper than it first appears.
When AI-generated long-form text is allowed to continue for long enough, one common failure mode is that its phases become weak or unclear. Events keep happening, one after another, without reliably forming stages. By the time readers reach 50,000, 100,000, or 200,000 words, they may have no clear sense of where they are inside the book. It feels like an accumulation of events, rather than an evolution through phases.
One concrete sign, in my reading, is that AI-generated “long novels” often lack a convincing middle-stage escalation, a directional turn in the second act, or a key moment where the problem is redefined. What they often have instead is a sequence of relatively homogeneous events. The surface may rise and fall, but the core is the same problem repeated with minor variations.
There is a related technical bottleneck in long-form generation. In LongWriter’s controlled experiments, a model’s effective generation length was bounded by the output lengths it had seen during supervised fine-tuning. [2] In LongWriter’s account, many long-context models struggle to generate beyond roughly 2,000 words even when their context windows are much larger; the authors attribute this mainly to the scarcity of long-output examples in SFT data. [2:1]
The meaning of this finding is heavier than it first seems. The literary implication I draw is that ordinary training and alignment data may give the model only limited direct exposure to “a long novel as a whole” as something to produce. It has seen fragments of long novels: chapters, scenes, conversations. What it sees less directly is the structural decision itself: the kind of decision by which a writer chooses, in chapter five, to let a personal conflict widen into a social problem. That decision exists in the writer’s mind and across the writer’s many drafts. The finished book shows the result, while the decision process remains mostly invisible.
The training signal therefore appears to underrepresent long-form structural decision-making as a decision process.
If that is right, then the deformation of conflict, the connection between phases, and variation within repetition, which are among the core mechanisms that make a long novel a long novel, risk being learned mostly as surface statistics. The model may learn the look of a plausible turn, a plausible chapter ending, and a plausible narrative rhythm. What it may not reliably learn is why the conflict should turn in this chapter rather than another, why this connection rather than that one, or why this repetition should carry this variation.
The concrete result is that AI-generated long novels often lack a reliable way to let the reader breathe. In a good long novel, the writer may insert ordinary life after a peak of tension to let the reader rest, shift after a tense subplot into a seemingly unrelated side branch to loosen the rhythm, or use humor or an aesthetic transition after a chapter heavy with meaning to let the reader re-enter the work. These breathing moments have a logic. They come from the writer’s judgment of the reader’s fatigue curve. Without explicit planning, AI often fails to do this. It often remains continuously high-density or continuously low-density, without a stable sense that “the reader should rest here.”
The failure at the third layer brings a more concrete effect with it. Once readers sense that an AI-generated long novel has no convincing phase transition, it may stop feeling like a long novel. It becomes an indefinitely extended middle. More words alone do not add up to the feeling of a book.
Layer Four · The Character Layer
The fourth layer is where the failure feels deepest in AI-generated long fiction, and where adjacent research is most relevant to the concern. [3]
When reading AI-generated characters, I often get a very specific feeling: they function correctly inside the plot, but they do not quite feel like people.
Their actions are usually reasonable. Their reactions fit the personalities assigned to them, and their dialogue fits the roles they have been given. The problem, in my reading, is that they often lack a convincing inner blind spot.
A character often begins to feel alive when there is some part of themselves they cannot see. They want A, but their real fear is B, and B is something they will never admit to themselves. All their actions are bent by this unacknowledged part. They make choices they cannot explain even to themselves. At one moment they suddenly break; at another, they show something they had never revealed before. These are not merely settings that can be written down in advance. They are the inner structure of a character as a living contradiction.
In my reading, AI-written characters often lack this layer. They have explicit character descriptions: “brave but reckless,” “intelligent but arrogant.” What they often lack is a blind spot toward themselves. Their reactions are appropriate responses to the present situation, without the thickness of a real person who can say, “I don’t know why I did that.”
There are concrete measurements adjacent to this problem. Research on Theory of Mind gives a mixed picture. Large models can perform well on some false-belief and perspective-taking tasks. The evidence is less reassuring when the task requires higher-order or nested mental-state tracking, especially when personality, preference, and psychological motive matter: cases such as “Character A, because of his blind spot, is misreading Character B’s real motive, while Character B does not realize that A is misreading him.” [3:1]
For many good novels, this kind of nested mental state is not an ornament. It is part of the basic material. Dostoevsky gives the obvious example. A single line of dialogue can carry several things at once: the character’s real intention, their explanation of that intention to themselves, what they want the other person to believe, what the other person actually hears, and what the reader reads in the gaps among them. When these layers collapse into one another, the character loses much of its life.
Another failure at the character layer is the problem of supporting characters.
Because the protagonist is the focus of narration, AI can often maintain a more consistent persona for them. Supporting characters tend to be harder. In a long novel with eight or ten characters who each have their own background, every supporting character needs to remain alive on their own timeline. They keep living even when they are not on the page. They bring those off-page experiences with them the next time they appear.
In ordinary generation, AI often fails to do this. Its supporting characters often begin to feel like functional slots, summoned when the protagonist needs them. When supporting character A appears in chapter 3, he is A1. When he appears again in chapter 47, he is A2. The relation between A1 and A2 often does not feel like the same person moving along his own timeline from chapter 3 to chapter 47. It feels closer to two separate samplings from the plausible range of “what A would be like,” with both samples landing somewhere that still resembles A.
On the surface, both problems are called “character consistency.” But they are different kinds of objects. The first is a person with an internal timeline. The second is a repeatedly sampled persona. After fifty chapters, the reader can feel the difference: the book begins to feel as if very few people inside it have lives of their own, characters who merely appear and disappear around the protagonist.
The deeper concern is that scale alone may not solve this kind of failure, at least not in a straightforward way. A larger model may do better on many cases. The question is whether scale alone produces the stable multi-character interiority a long novel requires.
Layer Five · The World-and-Lived-Reality Layer
The fifth layer tests whether AI can render a world as a way of life.
When reading the “world” inside an AI-generated long novel, one often runs into a particular kind of emptiness. There may be a setting: a cultivation world, a cyberpunk future, an invented dynasty. There may be rules: cultivation methods divided into ranks, society divided into tiers. Yet the question of how these settings are actually lived often remains thin.
How do ordinary people get through their days in this world? How do they work for a living? How are children educated? How is a transaction completed? How does a religion appear in daily life? How does an institution of power actually operate beyond the organizational chart: on a particular morning, when it makes an ordinary clerk lower his head and accept a document?
In my reading, these things are often thin or missing in AI-generated text. It can write, “In this dynasty, power was monopolized by such-and-such institution.” That is an abstract description. What it often fails to render convincingly is power as the micro-texture of daily life: a common person meeting a minor officer in the street, and that officer’s mood determining the next week of the person’s life; a shop trying to open, and the several relationships it must pass through to avoid trouble; a child who wants to study, and the calculations the family has to make before it can afford the tuition.
There is a related research vocabulary here: symbol grounding, language grounding, and embodied cognition. [4][5] A human writer’s understanding of the world is at least partly rooted in embodied experience, personal memory, and emotional processing. When a writer describes an institution, they often draw on lived knowledge of how institutions press on a person. For text-only models, the relation to the world is mediated largely through language rather than direct participation in the world. [5:1] It can handle the words “tyranny” and “faith,” but may still fail to render how a system of rule appears on one particular body on one particular morning.
This problem is amplified in the long novel because the reader has to live inside the world for a long time. The reader spends hundreds of thousands of words there. They notice every inconsistency, every empty place, every grand description that fails to line up with daily detail. A short story can pass a world through a concentrated image. A long novel has less room to rely on that shortcut. In a long novel, the world has to feel inhabitable.
In my reading, AI-generated worlds often do not feel inhabitable. They feel like stages being summoned again and again. When the protagonist needs to enter a city, the city appears; when the protagonist leaves, it disappears. The people in the city rarely seem to have mornings of their own. They seem to exist only when the protagonist sees them. Readers can feel this kind of world, even if they cannot always name what is wrong. They simply feel that there is little life in the book.
The deepest part of the world-and-lived-reality layer is the system of meaning: what this world uses to explain everything. God, science, fate, bloodline, history. In a good long novel, the protagonist’s growth is often, at bottom, the process of learning to use one system of meaning, or learning to overthrow one. In ordinary generation, AI often fails to reach this layer. It can state that the protagonist “questions the nature of the world.” The harder task is showing that questioning as cognitive labor embedded in daily life. It can write the slogan of a rebel, but often struggles to write the language of a rebel doubting himself alone at night.
Layer Six · The Theme-and-Conviction Layer
The sixth layer is the sharpest cut in this essay.
A character’s dialogue, the author’s preface, and a chapter title can all point toward theme. But the theme of a good long novel lies in the work’s structural verdict on behavior: this character does X, and the work lets them receive Y; that character does P, and the work lets them receive Q. After reading the whole book, the reader infers, from this map of rewards and punishments, what the author judges to be right, what is worth wanting, and what is true.
This is what I will call the work’s ethical algorithm: the structural pattern by which the work rewards, punishes, or forgives behavior.
An ethical algorithm is difficult to fake for very long. Readers often feel whether an author seems to believe the judgments they have written. They can feel how much the author has paid for those judgments, whether the author has really thought about whether the behavior being rewarded should be rewarded, and whether the author has borne the consequences of the behavior being punished. A novel written from conviction and a novel that merely imitates the surface of conviction often begin to feel different within a chapter or two.
The concern at this layer is deeper because it may involve the training objective as well as the model’s capability.
After pretraining, many assistant models go through post-training procedures that use human preference judgments, including RLHF (reinforcement learning from human feedback). Work on sycophancy makes one risk of this preference pressure visible: models can learn to mirror or flatter users in ways that are not the same as independent judgment. [6]
This mechanism is useful for many tasks. It can help the model become more polite, more accurate, and more aligned with user expectations. At the theme layer of fiction, however, I suspect the same preference pressure can work against stable conviction. [6:1]
The pressure is not hard to see. A stable judgment is likely to alienate some readers. In many alignment settings, models are rewarded for outputs that are safe, acceptable, and broadly preferred. The risk is that, in value-laden contexts, RLHF-trained models learn to sound broadly acceptable rather than strongly committed. [6:2]
This is not an accidental direction of pressure. In preference-trained assistant models, human feedback can reward outputs that are acceptable or agreeable to raters, even when that pressure cuts against stable conviction. [6:3]
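For concreteness, here is a common simplified form of the RLHF objective as it appears in the public literature. The exact recipe differs across systems, and this sketch is mine rather than anything the cited work commits to:

```latex
\max_{\pi_\theta} \;\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\big[\, r_\phi(x, y) \,\big]
\;-\; \beta \, \mathrm{KL}\!\left( \pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right)
```

Here the reward term is a model fit to human preference comparisons, the KL term keeps the tuned model close to the pre-RLHF reference model, and beta controls the trade-off. On this formulation, the only channel through which values enter is the reward model, which is fit to aggregate rater preference; nothing in the objective rewards holding a judgment that most raters would not endorse.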
The theme layer of fiction often requires the work to accept that some readers will not be persuaded. In The Brothers Karamazov, Dostoevsky was not primarily trying to be “sympathetic to all positions.” He was arguing for a particular judgment. He was willing to write from that judgment even though some readers would reject it. When Proust wrote about time and memory, he was not primarily offering a “balanced perspective.” He was advancing a particular view of life. He was willing to write from that view even if it frustrated readers who wanted action.
In ordinary generation, AI often does something different. At the theme layer, its fiction has a specific failure mode: the mimicry of conviction. It imitates what a work with judgment looks like. It can write profound-sounding sentences, dialogue that seems to take a position, and passages that appear to be thinking about human nature. But these are forms. The center often feels empty. The reader finishes with a particular feeling: the book showed them many things, but never made them feel what the author actually believed.
My observation is that AI writing often shows related failure modes: redemption arcs that resolve too cleanly, villains who exist only to be defeated, emotional revelations that arrive neatly around the three-quarter mark. This is my own diagnostic phrasing, but it sits near empirical findings that generative AI can improve individual story ratings while reducing collective diversity. [1:1] These do not look, to me, like accidental failures of one model. Their possible connection to RLHF remains speculative; the stronger adjacent evidence is that preference-trained assistant models can learn to sound broadly acceptable or sycophantic in value-laden contexts. [6:4] They point toward the same suspicion: at the theme layer, the model often seems drawn toward templates that are broadly acceptable, low-risk, and close to what “a good story” is supposed to look like.
The force of this cut is that the theme-layer failure may not be a problem of scale alone. One possible root lies in the training objective. If a training pipeline rewards broad acceptability more than stable stance, then failure on a task that requires stable stance begins to look like a consequence of the objective, not merely an implementation bug.
For a model to hold real judgment at the theme layer, better RLHF may not be enough. The missing ingredient may be a training paradigm that can support stable conviction rather than only fit annotator preferences. This may be closer to a paradigm problem than a narrow engineering problem.
Layer Seven · The Genre-and-Tradition Layer
The position of the seventh layer is slightly ironic.
This may be the layer AI handles most comfortably in the long novel. Genre contains many repeated patterns. Detective fiction has one set of expectations, romance another, cultivation fiction another. Models trained on large text corpora are good at picking up such patterns, so they can often fulfill genre promises at the surface level. The opening of a standard transmigration cultivation webnovel, for example, can be written by AI in a way that basically meets the median reader’s expectations: a weak protagonist at the beginning, humiliation by the sect, an encounter with fortune, a breakthrough, the first small goal achieved. Every beat lands roughly where it should.
The failure mode at this layer is ironic for the same reason: in ordinary generation, AI often remains close to the genre median.
It can fulfill genre promises, sometimes very well. What is rarer is the kind of genre rebellion that feels deliberate rather than accidental. In a human writer, rebellion usually involves a judgment like: “I see this convention, and I choose not to follow it.” That is a meta-level judgment. For a model to do this reliably, it would need some stable representation of the genre default and some pressure to generate against it. Without explicit planning or state tracking, current AI systems do not appear to maintain this kind of second-order control over their own output in a stable way.
There is adjacent evidence for this pressure. Work on generative AI and creative writing finds that AI assistance can raise individual story ratings while reducing the collective diversity of the stories produced. [1:2] The finding suggests a pull toward the statistical center of the genre, with limited deviation from familiar templates; for literary purposes, too much fidelity to the template is not necessarily a virtue.
Works that seriously reshape a genre usually do something more difficult. Borges bent detective fiction toward epistemological speculation. Jin Yong pushed wuxia toward historical tragedy. Ursula K. Le Guin used fantasy as a vehicle for political philosophy. In each case, the work first understood the genre well enough to move against it. In ordinary generation, AI often reproduces genre mastery at the surface. What is less clear is whether it can produce rebellion that is reliable, deliberate, and principled. This may be closer to a pressure in the current paradigm than a local weakness of one model.
This gives the genre layer a particular success-failure structure. AI-written “genre fiction” can pass many surface tests of genre. In the default generation setting, it still tends to feel closer to median production than genre innovation. It may be good at producing the kind of daily serial update a web-fiction platform needs. It is less clear that default generation can produce the next work that changes the genre itself.
Layer Eight · The Aesthetic-and-Memory Layer
The last layer of the second section may also be the deepest one.
Years after finishing a long novel, the plot may fade, but what remains, remains. These things that remain are not evenly distributed. They are a few specific moments: a recurring image that completes its full transformation of meaning at a particular point; an image in which visual force, emotion, theme, and character all condense at once; an ending that keeps resonating in the reader after the story is over.
In many good long novels, these moments can be pointed to. They are the concrete places where the reader can say, “This is why the book did not disappear from me.”
In my reading, AI-generated long fiction rarely produces this kind of moment convincingly.
The reason, I suspect, is partly structural. Such a moment usually depends on several dimensions becoming dense at once: visual intensity, emotional peak, thematic counterpoint, the fulfillment of an image, and the convergence of a character’s timeline. It is not the maximum value of any one dimension. It is several dimensions reaching saturation in the same passage at the same time.
This kind of simultaneous saturation usually depends on whole-book intention. The writer may know that this passage is the third appearance of an image and should complete its reversal; that it fulfills what has accumulated in a character from childhood to this point; that, at the theme layer, it is the concrete manifestation of the book’s central judgment. This is a meta-level decision: at a particular moment, the writer deliberately draws several lines together.
At the base-model level, autoregressive language models generate token by token, conditioned on prior context. Without an explicit planning layer, the model does not appear to maintain a stable meta-level intent such as “this is where all the lines should converge.” The “climax chapter” it writes is a fit to what climax chapters statistically look like. It may have the expected tension, the expected turn, and the expected rhythm of exclamatory or short sentences. What often seems missing is the felt convergence of several independent lines. It often feels like the form of climax has been copied.
Readers can feel the difference. A strong iconic scene can return to the reader years later, in an unrelated moment. They see something on the subway and suddenly remember a passage from that book. An AI-generated “climax” often passes as soon as it is read. In my reading, it leaves too few hooks for memory to return to.
By this point, the failure at the eighth layer is already visible. At this layer, the deepest problem may lie as much on the evaluation side as on the generation side.
The evaluation problem is just as severe: we do not yet have a reliable way to evaluate literary aesthetics at this scale.
In practice, two common routes for evaluating generated literary output are LLM-as-judge, where a model scores another model’s output, and human preference ratings. Research on creative writing evaluation suggests a related problem: LLM judges do not reliably track expert literary judgment, and may be drawn toward surface features of the prose. [7] Human preference ratings run into another constraint: the quality of a long novel is hard to judge without reading the whole thing, and that does not scale well industrially. At industrial scale, it is hard to imagine many annotators reading a 500,000-word AI-generated novel in full just to score it.
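To make the first route concrete, here is a minimal sketch of what an LLM-as-judge loop looks like. The rubric, the 1-to-10 scale, the naive score parsing, and the `call_judge_model` placeholder are all illustrative assumptions of mine, not a description of any published protocol:

```python
# Minimal sketch of an LLM-as-judge evaluation loop. `call_judge_model` stands
# in for whatever model API one uses; the rubric and the score parsing are
# illustrative assumptions only.

from typing import Callable

RUBRIC = (
    "Rate the following novel excerpt from 1 (weak) to 10 (strong) on: "
    "prose texture, scene concreteness, and character interiority. "
    "Reply with a single integer."
)

def judge_excerpt(excerpt: str, call_judge_model: Callable[[str], str]) -> int:
    """Score one excerpt with a judge model; returns an integer clamped to 1-10."""
    reply = call_judge_model(f"{RUBRIC}\n\n---\n{excerpt}")
    digits = "".join(ch for ch in reply if ch.isdigit())
    return max(1, min(10, int(digits))) if digits else 1

def judge_book(chapters: list[str], call_judge_model: Callable[[str], str]) -> float:
    """Average chapter-level scores -- note that nothing here sees the whole book."""
    scores = [judge_excerpt(ch, call_judge_model) for ch in chapters]
    return sum(scores) / len(scores)

fake_judge = lambda prompt: "6"   # stand-in for a real model call
print(judge_book(["Chapter one text...", "Chapter two text..."], fake_judge))  # 6.0
```

The limitation the sketch makes visible is the one this layer cares about: the judge sees excerpts and surface features, and whole-book properties such as resonance or the completion of an image never enter the loop.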
The implication is uncomfortable.
The evaluative loop may not reliably show the paradigm where it is failing.
The failure mode at the eighth layer does not currently seem reliably detectable by automatic metrics. The first seven layers still offer more evaluative handles, at least to some degree. The medianness of prose has at least some proxy measures in stylistic statistics; structural collapse has proxies in consistency metrics; nested mental-state tracking has ToM benchmarks. None of these is perfect, but they give the evaluator handles. [3:2] But questions such as “Does this book have real resonance?”, “Is this passage an iconic scene?”, and “Does this ending close or open the work?” resist ordinary benchmarking, because the questions themselves live at the meta-level.
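To make the contrast concrete, the sketch below computes two crude surface statistics of the kind I mean by proxy measures: sentence-length variance and type-token ratio. Treating these as proxies for medianness is my own assumption; the point is that nothing comparably mechanical exists for resonance or iconic scenes.

```python
# Two crude surface statistics sometimes used as stylistic proxies:
# sentence-length variance (flat rhythm -> low variance) and type-token
# ratio (lexical variety). They only illustrate what a first-layer proxy
# looks like; no analogous number exists for the eighth layer.

import re
import statistics

def sentence_length_variance(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    return statistics.pvariance(lengths) if len(lengths) > 1 else 0.0

def type_token_ratio(text: str) -> float:
    tokens = re.findall(r"[\w']+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0

sample = "He waited. The rain kept on, patient and gray, and still he waited."
print(sentence_length_variance(sample), round(type_token_ratio(sample), 2))
```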
At the eighth layer, the paradigm risks a kind of double blindness. It is unreliable at generating strong aesthetic moments, and also unreliable at evaluating whether those moments have been generated.
This closes the eight-layer pass through AI-generated long fiction. The shape of the difficulty is now easier to see.
III · From AI Writing Long Novels Back to Large Language Models Themselves
The second section moved through the hierarchy of reader experience, from the surface feel of prose to the deepest form of resonance. Once the eight failure modes are laid out, a pattern becomes visible: these failures do not look like eight independent defects. They seem to share common roots.
The eight layers describe the reader’s experience of the text. The capability lines answer a different question: where the generative system seems limited underneath that experience. The two hierarchies do not map one-to-one. A failure in one of the eight layers may come from two or three different capability deficits at once; one capability deficit may show up across several different layers.
To draw a capability-level diagnosis from the eight observations, the failures have to be remapped by their capability roots. This remapping produces six capability lines. These six lines are the shortfalls I infer from using the long novel as a testbed for the current large-language-model paradigm.
The six lines below follow the logic of the argument: from input-side world understanding, to output-side language generation, to meta-level judgment and evaluation.
Capability Line One · World Modeling and Causal Simulation
The first capability line concerns world modeling: whether the model can keep an internally causal world in view while describing it.
This line is exposed mainly at two of the eight layers: scene texture in the first layer, where space must be visualizable, action intelligible, and the world touchable; and the fifth, the world-and-lived-reality layer, where ordinary people have to live, power has to operate, and systems of meaning have to pass through daily life. The two are separate in the eight-layer scheme, but they point to the same capability: understanding the world as a causal system.
When a human writer writes a scene well, there is usually an implicit simulation behind the prose. How large is the room? Where is that person standing? Does the thickness of the wall determine whether someone outside can hear the conversation inside? If the teacup on the table is pushed, which way does it roll? The writer is doing more than listing details. They are treating the space as if it exists first, then writing from what that imagined space makes available.
The same is true when a human writer writes an institution. They know how the institution appears in the attitude of one official on a particular morning, how a rule bends when someone actually enforces it, and what concrete tactics an ordinary person might use when facing it. They are doing more than stating, “the feature of this institution is X.” They are showing how the institution appears, at one moment, on one person.
In ordinary generation, current large language models often fail to give this impression. They may contain partial world-like representations, but ordinary generation does not reliably behave as if a stable world simulation is being maintained independently of the prose. For text-only models, much of what looks like world understanding is mediated through textual patterning: the usual textual shapes of a tense conversation, an authority figure, a threat, a room, a rule. [5:2] These statistical patterns can produce text that looks as if it came from world understanding. What is often missing is the feeling that a world has been run before the prose arrives.
This becomes most exposed in the long novel because a long novel requires the world to keep existing. In a short story, the world may appear only once. A well-assembled pattern can still pass as worldbuilding. In a long novel, the same city, the same institution, and the same group of people appear again and again. Each appearance has to be consistent with the previous ones, and each has to obey an internal causal logic. When the underlying world model is unstable, repetition exposes the weakness: this description fails to line up with the last one; this character’s reaction under this institution does not match the logic the institution had previously shown.
Adjacent research helps name part of this problem: symbol grounding, language grounding, and the distinction between form and meaning. These literatures are not about fiction specifically, but they help name the gap between textual plausibility and grounded world understanding. [4:1][5:3] This problem is especially serious in fiction generation, because fiction requires more than factual correctness. It also requires causal consistency. The facts of a fictional world can be almost anything, but once they are established, their causal unfolding has to cohere.
My judgment is that scale and data can improve this line, but may not remove the problem entirely. More training data gives the model more world-statistical patterns, allowing its output to look more like world understanding in more situations. But I am skeptical that second-hand linguistic description alone is enough to produce the kind of independent causal simulation a long novel needs. The risk is that this line eventually reaches the limit of statistical fitting without becoming the kind of world model a long novel requires.
Capability Line Two · Tracking Nested Mental States Across Multiple Characters
The second capability line concerns people: whether a large language model can reliably track, at the same time, multiple characters’ mental states, intentions, beliefs, and misunderstandings, while also maintaining the relations among them.
This line is exposed mainly through the fourth layer, the character layer. It also appears in part through the second layer, as unreliable narration, and through the sixth, as the complex judgment the theme-and-conviction layer requires.
Reading a novel often asks the reader to perform a complex Theory of Mind operation. They track what Character A knows and does not know; what A believes B knows; what B actually knows; how A’s misunderstanding of B will affect A’s next action; and which parts of all this the narrator has exposed to the reader. This is one of the core mechanisms by which many good novels are read.
Human writers often have to manage a version of the same operation while writing. They know each character’s intention, blind spot, and knowledge state. They know how the differences among those states drive the plot. They know what structural consequence will follow if, at a particular moment, one character remains in the dark while another suddenly realizes something.
There is concrete research evidence adjacent to this concern. Benchmarks such as OpenToM test mental-state tracking in longer narrative situations involving explicit personality traits, intentions, preferences, and psychological as well as physical states. [3:3] The picture is mixed: models can perform well on some ToM tasks, while harder benchmarks probe whether this ability remains reliable in longer, more psychological, or more nested cases. [3:4]
This makes the second capability line different from the first. My suspicion is that statistical fitting alone may not easily approximate this away. A larger model may do better on many cases. The question is whether parameter count alone gives the system stable multi-character mental-state tracking across a long novel.
This line is amplified in the long novel because a long novel may contain eight or ten characters with their own backgrounds at the same time. Each character has to remain alive on their own timeline. They continue living when they are not on the page. They bring those off-page experiences into their next appearance. Their reactions should grow out of their whole timeline, rather than being summoned at the moment they are needed.
In unscaffolded generation, large language models often fail to do this reliably. Supporting characters often begin to feel like functional slots, summoned when the protagonist needs them. This already appeared in the fourth layer of the second section; recasting it at the capability level makes the point sharper. External scaffolding, such as explicit state storage for each character, may compensate locally (a minimal sketch follows below), but the harder question is whether the base model itself can maintain those inner representations without such support. The failure points to a deeper difficulty: maintaining the inner representations of multiple people at once, over the length of a novel.
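As a minimal sketch of what explicit state storage for each character could mean, the example below reuses the four fields from the character layer in Part I (desire, fear, blind spot, threshold) and adds an off-page event log. The class, its fields, and the reappearance brief are illustrative assumptions about scaffolding, not a description of how any existing system works.

```python
# Minimal sketch of per-character external state: the four character-layer
# fields plus an off-page event log. The scaffold only stores and retrieves
# state; whether a base model could maintain such state internally, without
# this support, is the open question in the text above.

from dataclasses import dataclass, field


@dataclass
class CharacterState:
    name: str
    desire: str          # what they actually want
    fear: str            # what they are most afraid to lose
    blind_spot: str      # what they will never admit about themselves
    threshold: str       # the condition under which they would cross a line
    last_seen_chapter: int = 0
    offpage_events: list[str] = field(default_factory=list)

    def reappearance_brief(self, chapter: int) -> str:
        """Summary handed to the generator when the character re-enters the page."""
        gap = chapter - self.last_seen_chapter
        lived = "; ".join(self.offpage_events) or "nothing recorded"
        return (
            f"{self.name} returns after {gap} chapters. Off the page: {lived}. "
            f"Still wants: {self.desire}. Still cannot admit: {self.blind_spot}."
        )


marta = CharacterState("Marta", desire="to be owed nothing", fear="being forgotten",
                       blind_spot="that she keeps the debts alive herself",
                       threshold="being pitied in public", last_seen_chapter=3)
marta.offpage_events.append("sold the shop to cover her brother's fine")
print(marta.reappearance_brief(chapter=47))
```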
Capability Line Three · Meta-Level Narrative Intent
The third capability line concerns meta-level narrative intent: whether the model can keep track of what it is doing with the text while generating it.
This line spans the widest range across the eight layers. It appears in the narration layer as viewpoint control: who is telling, to whom, and what is being hidden. It appears in the structural layer as structural decision-making: where the conflict changes form, where the work lets the reader breathe. It appears in the aesthetic-and-memory layer as the iconic scene: the meta-level intent that draws multiple lines together. These three layers look very different, but they point to the same capability.
When human writers write, they often do more than produce text. They are making decisions about the text: should this passage move quickly or slowly; should this information be exposed now or buried until later; should this chapter make the reader tense or let them rest; should this ending draw all the lines together or leave them open. These decisions are not explicitly written into the text, but they determine how the text comes to be generated.
This is what I mean by a second-order layer. The writer generates text while also monitoring what the text is doing, then adjusts the next choice in light of that monitoring.
In the default generation setting, current large language models do not reliably behave as if this second-order layer is being maintained. At the base-model level, autoregressive language models generate token by token, conditioned on prior context. Without explicit scaffolding, the model does not reliably seem to maintain an explicit controlling intent such as “I am writing an unreliable narrator,” “this chapter should lower the rhythm,” or “this is where this line and that line should converge.” It is not clear that the base model maintains these intentions as explicit objects of control during generation.
Adjacent research does not directly answer the literary question, but it does show why the question is difficult: whether language models should be understood as representing intentions, beliefs, and desires remains contested. The literary version of that concern is that authorial cognition does not reliably appear as a separate layer of control during generation. The model can produce text that looks narrated. What does not reliably appear is narration as a separate object of control.
These devices are hard to sustain without some form of meta-level intent: unreliable narration, complex temporal structure, deliberate information control, the recovery of foreshadowing across chapters, and the multidimensional convergence of an iconic scene. In large-model output, they may appear from time to time, but often as statistical coincidences: the model happens to sample, in a particular passage, a form that looks like unreliable narration. The passage may look like narration, but it does not reliably read as an act of narration. It reads more like a textual form being reproduced.
This capability line differs from the previous two. It is not only about what the model understands or how well it executes. It concerns whether a stable meta-level of control is present during generation. At the base-model level, autoregressive transformers generate token by token conditioned on prior context. Any narrative-judgment layer has to be induced, prompted, scaffolded, or built around that base process rather than appearing as an explicit module in the architecture. At minimum, it is not obvious that scale alone gives the model this kind of stable meta-level control.
Capability Line Four · Resisting the Statistical Median in Language
The fourth capability line concerns language itself: whether a large language model can maintain a distinctive voice over a long span, without repeatedly drifting back toward a fluent but featureless median.
This line mainly corresponds to the first layer, the feel of prose at the textual surface, and to part of the eighth layer, where visual intensity remains as rhetorical residue.
The first layer of the second section already described the “strange stability” of AI-generated prose: the sentences often seem to have been written by the same person, with little dialectal grain, roughness of speech, archaic hardness, or cold distance. At the capability level, I would describe this strange stability as the statistical median acting as an attractor.
Empirical work on AI-assisted creative writing offers one useful warning. Generative AI can improve the rated quality of individual stories while reducing the collective diversity of what people produce. [1:3] That is not the same as proving output-style convergence in long novels. The literary claim I want to make is weaker: the repeated pull toward median prose may reflect a shared pressure of the current paradigm, rather than the quirk of one undertrained model. Large models learn statistical patterns from enormous amounts of human text, absorbing many ways of writing that look locally “correct.” Post-training can add pressure toward outputs that human raters prefer. My suspicion is that these pressures create an attractor in language generation: a center that is fluent, broadly acceptable, and stylistically close to the median.
To produce non-median language in a stable way, whether a dialect’s hardness, the syntax of a historical period, or a distinctive rhetorical rhythm, the model has to work against its own statistical gravity. In my reading, this is difficult to sustain with zero-shot or few-shot prompting alone, especially over long outputs. Fine-tuning can do some of it, though narrow specialization can also introduce trade-offs in generalization.
This line is similar to the first line, world modeling. It can be improved in part through more data and more precise training. A model can be trained on a specific linguistic style. A small number of examples can be used for style transfer. Prompting techniques can push the model away from the median. These are local compensations. The default pull toward the median can still remain.
What the long novel asks for is language that remains stable, distinctive, and resistant to the median across hundreds of thousands of words. Short-term deviation from the median and long-term maintenance of that deviation are different things. Short-term deviation can be achieved with prompting. Long-term maintenance is harder. At each step of generation, there is pressure to drift back toward the center. This is one reason an AI-generated long novel, even with a strong style prompt, may begin after tens of thousands of words to feel looser, more uniform, and drawn back toward a “fluent but featureless” median.
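If I wanted to check this drift on a long generated draft rather than just assert it, one crude approach would be to track a couple of shallow per-chunk statistics and watch whether they flatten over the span. The sketch below uses mean sentence length and type-token ratio as stand-ins; these are hypothetical proxies of my own choosing, not validated measures of literary style.

```python
import re

def chunk_style_stats(text, chunk_size=2000):
    """Split a long text into word chunks and report two crude style proxies
    per chunk. If the values converge toward near-identical numbers chapter
    after chapter, that is at least consistent with the drift toward median
    prose described above (it does not prove it)."""
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_size])
              for i in range(0, len(words), chunk_size)]
    stats = []
    for chunk in chunks:
        sentences = [s for s in re.split(r"[.!?]+", chunk) if s.strip()]
        lengths = [len(s.split()) for s in sentences]
        mean_sentence_len = sum(lengths) / len(lengths) if lengths else 0.0
        tokens = chunk.lower().split()
        type_token_ratio = len(set(tokens)) / len(tokens) if tokens else 0.0
        stats.append({"mean_sentence_len": mean_sentence_len,
                      "type_token_ratio": type_token_ratio})
    return stats
```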
Capability Line Five · The Capacity to Hold Conviction
The fifth capability line touches the paradigm itself: whether a model can produce fiction that reads as if it is written from stable conviction.
This line corresponds to the sixth layer, the theme-and-conviction layer. It also corresponds in part to the seventh layer, because genre rebellion requires a judgment such as “I have decided not to do it this way.”
The sixth layer of the second section already explained the mechanism of RLHF. At the capability level, the point becomes more systematic.
In common RLHF-style post-training, the model is optimized toward outputs that human raters prefer. [6:5] In many tasks, rater preference and task quality pull in the same direction: human raters tend to prefer answers that are correct, helpful, and clear, and the model can learn to produce more of those. At the theme layer of fiction, the preference for broadly acceptable output can come into tension with the need for a stable stance. A stable stance is likely to alienate some readers or raters. Under this pressure, RLHF-style preference optimization may create incentives against strong or stable stances in some value-laden contexts. [6:6]
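For readers who want the mechanism spelled out: the standard published recipe trains a reward model on pairwise rater preferences and then tunes the policy to score well under that learned reward, usually with a penalty for drifting too far from a reference model. Below is a toy sketch of the pairwise loss, written only to show what signal the system actually receives; it is not drawn from any particular lab’s codebase.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry style pairwise loss used in standard reward-model
    training: the reward model is pushed to score the rater-preferred
    completion above the rejected one. Toy scalar version."""
    return -math.log(sigmoid(reward_chosen - reward_rejected))

# The policy is then optimized to maximize this learned reward. Nothing in
# the signal distinguishes "broadly acceptable to most raters" from "written
# from a stance some raters will dislike"; under this loss, the second can
# simply lose preference pairs.
```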
This line differs from the first four because the failure may not be only a missing capability. It may also be a pressure introduced by the objective: the system is optimized toward preferred outputs, not toward literary conviction.
The concrete difference is this. In the first four lines, the failure often looks like an inability to do something reliably. In the fifth, the concern is that the desired behavior may be partly selected against. The first four have imaginable improvement paths: better architectures, more data, more explicit meta-level structure. The fifth may require changing the training setup, not just improving the model’s raw capability. To make a model produce fiction that reads as if it holds conviction, better RLHF may not be enough. What may be needed is a training paradigm that can distinguish broad acceptability from literary conviction.
This makes it especially unclear whether scale alone can solve the fifth capability line. Scale may improve many adjacent capabilities, but it does not obviously resolve the tension between broad preference satisfaction and stable stance. A larger model may be easier to steer in some ways and harder in others. The relevant question is not size alone, but how strongly post-training has optimized it toward broadly preferred outputs.
The sixth layer of the second section used Dostoevsky and Proust as examples. Part of their force, at least in this argument, comes from their capacity to write from conviction: from their willingness to alienate some readers in order to say something they believed to be true. An RLHF-trained assistant model, by contrast, is under pressure to avoid outputs that strongly alienate raters or users: a central pressure of RLHF-style post-training is toward outputs that are helpful, safe, and broadly acceptable. [6:7]
Capability Line Six · The Broken Evaluation Loop
The last capability line is meta-level. Its question is how we know whether the generated work is good in the literary sense.
This line mainly corresponds to the second half of the eighth layer, the blind spot in aesthetic evaluation. But its effects pass through all five previous capability lines. If evaluation itself is weak, then the failures in the previous lines become hard to detect automatically, hard to incorporate into the training signal, and hard to correct through an improvement loop.
In practice, two common routes for evaluating generated literary output are LLM-as-judge and human preference ratings. Both become fragile when the target is long-form literary quality.
The LLM-as-judge route has its own failure mode. Research on creative writing evaluation points to it: LLM judges do not reliably track expert literary judgment, and may be drawn toward surface features or familiar model-like text. [7:1] This creates a self-evaluation risk: the judge may favor outputs that resemble the kinds of text the model family already finds familiar or preferable. It may reward familiar model-like prose, including the kind of median prose this essay is trying to diagnose.
The human-rating route has a different failure mode. The quality of a long novel is hard to judge without reading the whole thing, and a long novel may be hundreds of thousands of words. Very few annotators are likely to read a full AI-generated long novel just to score it. Human evaluation of long-form fiction is therefore likely to fall back on fragments. But fragment-level evaluation risks underweighting capacities that only appear at long scale: the convergence of several lines in an iconic scene, the recovery of foreshadowing thirty chapters later, the complete arc of a character’s timeline.
When these two evaluation weaknesses are combined, the failure modes of large models in fiction generation become harder to correct through an engineering feedback loop. In domains such as code generation, mathematical reasoning, and factual question answering, evaluation often has clearer proxies: unit tests, verifiable answers, benchmark scores, or execution feedback. In long-form fiction, the loop is much harder to close.
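To make the contrast concrete, here is the kind of check that closes the loop in code generation and has no counterpart for a novel. The example is a toy of my own; the generated function simply stands in for model output.

```python
def test_generated_sort():
    """In code generation the evaluation loop can close automatically: a unit
    test turns "is this output good?" into a binary, machine-checkable signal
    that can feed straight back into filtering or training."""
    generated_sort = lambda xs: sorted(xs)  # stand-in for model-written code
    assert generated_sort([3, 1, 2]) == [1, 2, 3]
    # There is no analogous assert for "this chapter recovers foreshadowing
    # planted thirty chapters earlier" or "this novel reads as if written
    # from conviction"; that is the half of the loop that stays open.
```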
The deeper problem is that this capability line makes the first five diagnoses difficult to verify automatically. If someone argues that AI’s failure at the theme layer can be solved by better fine-tuning, then verifying that claim requires evaluating whether the fine-tuned model really produces fiction that reads as if it holds conviction at the theme layer. But we do not currently have a reliable, scalable method for evaluating that kind of theme-layer success. LLM judges may overrate outputs that fit familiar model-like preferences, even when those outputs have not solved the underlying literary problem. Human evaluation can reach the object, but it does not scale easily to the length of the object.
So the sixth capability line is more than one shortfall among others. It amplifies the others at the meta-level.
At this point, the six capability lines are complete. The eight-layer observations have been remapped onto the mechanistic level. Each line has its own failure mode, and each gives a different reason to be cautious about expecting scale alone to solve it.
IV · The Map Among the Six Capability Lines
The six capability lines are not a flat list. There is structure among them.
One useful way to arrange the six lines is by where they seem to sit in the generation process. On that view, three levels appear.
Input-side understanding: Capability Line One, world modeling, and Capability Line Two, tracking nested mental states across multiple characters. These two lines shape the model’s grasp of what it is trying to write: a world with internal causality, and a group of people with internal mental states.
Output-side execution: Capability Line Three, meta-level narrative intent, and Capability Line Four, resisting the statistical median in language. These two lines shape whether the model, while writing, can stably maintain something like the position of a narrating subject, and whether it can produce language with distinctive features.
Judgment-related capacity: Capability Line Five, the capacity to hold conviction, and Capability Line Six, the broken evaluation loop. These two lines concern the question of “what counts as good”: the first is judgment on the generation side, the author holding a stance; the second is judgment on the evaluation side, the system’s ability to recognize quality.
These three levels are not isolated. There are causal chains among them.
Failure on the input side tends to propagate to the output side. A generator without an internal model of the world may still have strong language ability, but the world it writes will tend to feel like a stage assembled from statistical patterns. A generator that does not reliably track nested mental states across multiple people may still have polished narrative technique, but its characters will tend to feel like repeatedly sampled personas.
Failure on the output side can propagate to the judgment layer. If meta-level narrative intent is unstable, conviction at the theme layer becomes harder to sustain, because holding conviction itself requires some meta-level structure like “I am trying to say a specific thing in this passage.” A generator that repeatedly converges toward the statistical median will have difficulty leaving real resonance at the aesthetic layer, because resonance often comes from language that resists the median.
Failure at the judgment layer feeds back and amplifies all the earlier failures. This is what it means for Capability Line Six, the broken evaluation loop, to act as a meta-level amplifier. When the system cannot reliably recognize failures on the first five lines, those failures are harder to convert into training signal, and harder to improve through that loop. If they are not improved through that loop, the shortcomings on the first five lines will persist and may even be amplified by new training.
One distinction in this map matters: which shortfalls look partly responsive to scale, data, and engineering, and which may require something more.
Likely to improve partly with scale and data: Capability Line One, world modeling, and Capability Line Four, resisting the statistical median in language. Both have a statistical-fitting component. More training data, larger models, and more precise fine-tuning can make the model’s outputs on these two lines look closer to the outputs of good writers. But both may have ceilings. For world modeling, the risk is that statistical fitting hits a limit before the model becomes the kind of causal simulator a long novel requires. For language, the risk is the pull of the model’s default attractor: short-term deviation is possible, but long-term maintenance tends to drift back.
Likely to require architectural help: Capability Line Two, tracking nested mental states across multiple characters, and Capability Line Three, meta-level narrative intent. These two lines may point beyond statistical fitting alone, toward the need for more stable architectural support. The difficulty lies in parallel tracking of multiple nested mental states and in maintaining meta-level intent during generation. Solving them may require more than larger models: explicit meta-level structures, external scaffolding, or designs that allow more stable parallel tracking of multiple agents. That would not be simple scaling. It would be a more explicit design change.
Likely to require training-paradigm changes: Capability Line Five, the capacity to hold conviction. This may be the deepest line. Its failure may not be only a missing capability; it may also reflect pressure from the training objective in the opposite direction. Addressing it may require a training paradigm that distinguishes broad rater preference from literary conviction. [6:8] At least from the outside, this does not seem central in the public story of current post-training practice.
Meta-level feedback-loop weakness: Capability Line Six, the broken evaluation loop. This one is the most unusual. It is also the precondition for the other solution paths. When literary aesthetics are not evaluated reliably, improvements on the first five lines are harder to verify, harder to stabilize, and harder to incorporate into training. [7:2]
Once this distinction is laid out, a counterintuitive map appears.
The most visible mainstream path (larger models, more data, longer context) seems most directly aimed at Lines One and Four. Even there, the gains may have ceilings. The two architectural lines, Two and Three, may require different designs. The training-paradigm line, Five, may require a different objective. The meta-level line, Six, is the precondition for all of them.
This suggests caution about expecting a scale-only path to produce a fundamental breakthrough in novel generation. It may improve the lines most responsive to statistical fitting. The other lines seem to require different kinds of progress. Those other lines are precisely where the long novel is most deeply a long novel: the inner nesting of characters, Two; the meta-level intent of narration, Three; the stable judgment of theme, Five; and the system’s reliable evaluation of its own output, Six.
By this point, the shape of the map is visible. But one thing remains unsaid: why these six lines?
When placed side by side, the six lines begin to look like the same problem surfacing from different angles. What that problem is, and whether it has a common source, is the next question.
V · The Asymmetry at the Starting Point
Where does a human writer begin when writing a long novel?
They begin from something that already exists in their life: people they have actually seen, relationships that have actually broken, shame and redemption they have actually felt, institutions they have actually observed, times and places they have actually lived through. This material has not already been organized into language. It is messy, causally entangled, alive. It is the person themselves.
For the writer, writing is not the construction of a fictional world from nothing. It is closer to cutting a slice out of something that already exists. They know what a certain kind of person is like because they have seen one. They know how a relationship collapses because they have lived through one. They know how an institution operates in daily life because they have lived inside one. When they write a character’s blind spot, they know what a blind spot feels like. They have had blind spots of their own; they have seen other people destroyed by theirs. When they write an ethical dilemma, they know what a dilemma is because they have faced something like one.
Their work runs mainly in another direction: selection. From everything they have lived, they choose what enters the novel, what angle of vision to use, what to keep, what to discard, and what language can convert lived material into text. This is a work of dimensional reduction: turning a high-dimensional, living, disorderly experiential ground into a linear text.
They do not have to construct the ground. The ground is already there. The human writer works outward, from a high-dimensional, living ground into linear text.
Where does a large language model begin?
It begins from finished texts in which human writers have already done this work. It has read many, many books written by humans. Those books are the products of human writers compressing what they have lived into language. They are products of an experiential ground, not the ground itself.
What the model learns from those products is the statistical pattern of the products: what kinds of sentences often appear together, how certain kinds of plots usually unfold, how human writers tend to describe certain kinds of characters. What it learns is the form of finished work that has already been compressed.
In the sense used in this essay, it has no experiential ground of its own. [4:2][5:4]
It has not lived a life, met a person, or stared blankly on a morning in a particular city. It has not been hurt, forgiven, or indebted to a specific person. For a text-trained model, its access to human experience is largely mediated through language already organized by human beings. [5:5] Its access to the world is mediated through those texts.
This means that when it is asked to “write a novel,” it is not compressing an experiential ground into text. It learns patterns from products that other people have already compressed, then recombines those patterns to produce something that looks like the product of such compression.
These are two different operations.
The human writer works outward: from a high-dimensional, living ground into linear text.
The large language model works differently: it learns statistical patterns from already sliced, already compressed textual products, then produces another textual product.
From the surface, the difference may not be visible. Both outputs are text. Both can be read as “novels.” Structurally, though, they may not be the same kind of operation.
The first operation depends on the existence of a ground. The second does not.
The first output may grow from a real experiential ground, and may therefore contain an inner life that even the writer cannot fully predict. The second output is closer to a statistical variation on existing outputs, however complex the recombination becomes.
This is the asymmetry at the starting point. The difference, in this framing, is closer to kind than degree. The issue is not simply that writers have “more” ground and large language models have “less” ground. Writers have the kind of thing I am calling an experiential ground. Current large language models do not seem to have that kind of thing. Their current mode of operation does not seem to require it.
Looking back at the six capability lines from the previous sections:
The world-modeling failure may come partly from the lack of lived grounding in a real world. The model has seen descriptions of worlds written by others, and from descriptions alone it does not obviously recover causal structure in the way a lived system can.
Nested mental states may be hard to track in parallel partly because the model does not encounter “another person” as a lived social object. It has seen descriptions of character psychology, but it does not have the first-hand social instinct that this person and that person are two independently living beings.
One reason meta-level narrative intent may fail to stabilize is that, in human writing, it is tied to an author with stance, purpose, and second-order awareness. A current large language model is not that kind of authorial subject. Its closest analogue to an authorial pressure is the training objective, and that objective is not a stance in the literary sense.
Language may keep converging toward the median partly because the model has no voice grounded in a life of its own. Its default voice often feels like a statistical average of the voices it has read. Its “style” is suspended without a ground.
Conviction may be hard to sustain because holding conviction, in the literary sense, requires a subject with a stance, willing to pay the cost of that stance. A large language model is not such a subject. Its training objective is under pressure to avoid alienating users or raters. [6:9]
The evaluation loop may remain weak because literary aesthetic judgment is usually grounded in another reader’s lived experience. When one language model evaluates another, the risk is that one insufficiently grounded system is evaluating another. The loop risks locking itself inside the median.
The six capability lines no longer look like six independent shortfalls. They begin to look like one problem, the absence of an experiential ground, surfacing from different angles.
What does this mean?
I do not want to give a final answer in this essay. But several judgments seem clear.
This suggests caution about expecting scale alone to solve the problem. More parameters, more data, and longer context do not obviously supply an experiential ground. Groundedness, as I am using the term, is closer to an ontological category than to something that statistical approximation can be expected to provide.
The RLHF path looks even less directly suited to this problem. Preference-trained assistant models are optimized toward outputs human raters prefer. [6:10] In this context, that may embed the model more deeply in the condition of simulating groundedness without having it. It may improve the appearance of conviction more than the holding of it.
If a model is to have something like an experiential ground, the relevant path may involve first-hand interaction with the world, not merely a larger language model. One possible path would involve embodied, continuous, goal-directed engagement with the world, rather than second-hand description alone. [4:3][5:6] Such a system would have to acquire information directly from the causal structure of the world itself. This belongs to a different category of research, not simply an extension of the current LLM path.
Under the current paradigm, and absent such a ground, “AI writing fiction” is better understood mainly as statistical recombination of the products of human creation, rather than creation in the same structural sense. These two things may look similar in output, but structurally they are different. However fluent the output becomes, and however closely it resembles “what a novel is supposed to look like,” it remains closer to derivative literature-like output than to the kind of literature this essay has been trying to describe.
One last thing.
This essay began from a coordinate system for reading long novels, moved through eight layers of observation, extracted six capability lines, and traced them back to a common root. I chose the long novel as the entry point not because it exhausts the failures of the AI paradigm, but because it exposes them especially well. It stretches all dimensions at once and amplifies every shortfall until the reader is forced to feel it.
If this observation is right, the implication is larger than a claim about long novels alone. It suggests a possible grounding-related limitation in the current large-language-model paradigm, one that may also matter outside fiction. Other domains may simply be less sensitive to it, as in code generation, or may allow better statistical fitting to conceal it, as in short-form text generation. The long novel is a particularly clear place where this limitation becomes visible, but it may not be the only place.
Whether this limitation belongs only to the current stage of the paradigm, or to AI as such, remains an open question. It depends on whether “AI with an experiential ground” is possible, what such a thing would look like if it were possible, and whether humans would actually want that kind of AI. Those questions fall outside the scope of this essay.
What I can say is only this: the long novel makes one boundary of the current paradigm unusually visible.
On the kind of feedback I am hoping for
Five questions I would most like readers to push back on:
- The eight-layer scheme: do these eight things really behave largely independently in your reading, or have I cut the joints in the wrong places? In particular, the split between the third (structural) and the eighth (aesthetic-and-memory) layer, and the split between the sixth (theme-and-conviction) and the seventh (genre-and-tradition) layer, are the cuts I am least sure about.
- The six capability lines: are these the right reductions of the eight observations, and have I missed a capability line that long fiction exposes? If you would draw the lines differently, where?
- The asymmetry-at-the-starting-point argument: I treat “experiential ground” as a structural input that current text-trained models do not have, and I argue this is closer to a kind difference than a degree difference. The most direct counterargument I am aware of is that sufficiently rich text-mediated representations could approximate it. I would like to hear that case made strongly, especially with reference to multimodal and embodied training.
- The RLHF / theme-layer claim: I argue that RLHF-style post-training adds pressure against stable conviction in value-laden contexts. This is the place where I am furthest from primary research, and the place where I most want empirical correction. If the connection between rater-preference optimization and theme-layer flattening has been studied directly, I would be glad to be pointed at it.
- The evaluation-loop argument: I claim that long-form literary aesthetic quality currently lacks a reliable evaluation method, and that this weakness amplifies the other five capability lines. If there are evaluation methods I have missed that work well at the long-novel scale, that would matter for the rest of the argument.
I am less interested in feedback at the level of “do you agree with the conclusion.” I am more interested in feedback at the level of where the argument is wrong, where it is underspecified, and where adjacent research would change which parts of it.
References
1. Anil R. Doshi and Oliver P. Hauser, “Generative AI enhances individual creativity but reduces the collective diversity of novel content,” Science Advances 10(28), 2024. https://www.science.org/doi/10.1126/sciadv.adn5290
2. Yushi Bai et al., “LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs,” arXiv:2408.07055. https://arxiv.org/abs/2408.07055
3. Hainiu Xu et al., “OpenToM: A Comprehensive Benchmark for Evaluating Theory-of-Mind Reasoning Capabilities of Large Language Models,” ACL 2024. https://aclanthology.org/2024.acl-long.466/
4. Stevan Harnad, “The Symbol Grounding Problem,” Physica D: Nonlinear Phenomena 42(1–3), 1990. https://doi.org/10.1016/0167-2789(90)90087-6
5. Yonatan Bisk et al., “Experience Grounds Language,” EMNLP 2020. https://aclanthology.org/2020.emnlp-main.703/
6. Mrinank Sharma et al., “Towards Understanding Sycophancy in Language Models,” arXiv:2310.13548. https://arxiv.org/abs/2310.13548
7. Tuhin Chakrabarty et al., “Art or Artifice? Large Language Models and the False Promise of Creativity,” arXiv:2309.14556. https://arxiv.org/abs/2309.14556
