Narra·Gym is a quiet evaluation environment for large language models asked to hold a story in their hands — to listen, to remember, to care, and to keep the world alive across many turns.
Most benchmarks ask a model to answer. Ours asks it to stay.
Existing LLM benchmarks emphasize static prompting — a single question, a single answer — and therefore under-measure the capabilities required for long-horizon interactive storytelling.
Narra·Gym is an executable evaluation environment for testing LLMs as interactive narrative agents along five coupled dimensions: creative story generation, long-context state tracking, character simulation, empathic personalization, and story-grounded interactive artifact generation.
Inside a live interaction loop, the environment orchestrates five-stage story construction, multi-resolution narrative memory, reflection-guided planning, anti-stagnation control, novelty-constrained artifact synthesis, and fail-soft structured generation. Together, these choices turn interactive storytelling from a loosely specified demo into a reproducible Gym for studying persistent, emotionally aware, story-driven language agents.
Narra·Gym doesn't test isolated competence. It tests whether a model can live inside a story — creative, attentive, in-character, empathic, tangible — for as long as the reader needs it to.
Creative story generation: Build a multi-stage narrative from sparse emotional input (premise, setting, characters, acts, and the opening scene) that is fluent, novel, and dramatically structured.
Long-context state tracking: Preserve consistency across many turns, so that unresolved tensions, revealed clues, scene transitions, and user decisions remain quietly available as actionable context.
Character simulation: Keep voices stable, distinguishable, and situationally appropriate while they still evolve with the plot.
Empathic personalization: Infer emotional context and align narrative developments with the reader's underlying concerns, without collapsing into generic therapeutic language.
Story-grounded interactive artifact generation: Decide when a letter, a map, a cipher, or a radio dial belongs in the story, and then shape it into a tangible, interactive prop grounded in the narrative.
Each turn is orchestrated by a small company of named roles. Together they build the world, hold the thread, keep the plot honest, imagine what comes next, and give the story something the reader can touch.
Gathers sparse emotional input and builds a whole world — premise, setting, cast, act structure, and opening scene.
Keeps three temporal resolutions of the story — the verbatim now, rolling summaries, and the latent state that lasts forever.
Watches for eloquent stalling. Escalates from gentle nudge to mandatory shift when the plot is only pretending to move.
Reflects before each turn — unresolved tensions, user interests, pacing, and where the story ought to go next.
Shapes story state into letters, maps, ciphers, radio dials — and refuses to repeat itself through tag-based novelty filtering.
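As a rough illustration of three of the roles above (the three-resolution memory, the stagnation watcher, and the tag-based novelty filter), here is a minimal Python sketch. Everything named in it is hypothetical: `NarrativeMemory`, `summarize`, `stagnation_action`, `is_novel`, the window size, the turn thresholds, and the Jaccard overlap cutoff are assumptions chosen for illustration, not Narra·Gym's actual implementation.

```python
from collections import deque
from dataclasses import dataclass, field


def summarize(text: str, limit: int = 40) -> str:
    """Placeholder summarizer (assumption): truncation stands in for an LLM call."""
    return text if len(text) <= limit else text[:limit].rstrip() + "..."


@dataclass
class NarrativeMemory:
    """Hypothetical memory with three resolutions: verbatim, summaries, latent state."""
    window_size: int = 2                             # assumed verbatim window, in turns
    recent: deque = field(default_factory=deque)     # the verbatim "now"
    summaries: list = field(default_factory=list)    # rolling summaries of older turns
    latent: dict = field(default_factory=dict)       # long-lived story facts

    def add_turn(self, text: str) -> None:
        self.recent.append(text)
        # Once the verbatim window overflows, fold the oldest turn into a summary.
        while len(self.recent) > self.window_size:
            self.summaries.append(summarize(self.recent.popleft()))


def stagnation_action(stalled_turns: int, nudge_at: int = 2, shift_at: int = 4) -> str:
    """Escalate from a gentle nudge to a mandatory shift (thresholds are assumed)."""
    if stalled_turns >= shift_at:
        return "mandatory_shift"
    if stalled_turns >= nudge_at:
        return "gentle_nudge"
    return "continue"


def is_novel(tags: set, history: list, max_overlap: float = 0.5) -> bool:
    """Reject an artifact whose tags overlap too much (Jaccard) with an earlier one."""
    for earlier in history:
        union = tags | earlier
        if union and len(tags & earlier) / len(union) > max_overlap:
            return False
    return True


# Usage with made-up story turns.
memory = NarrativeMemory(window_size=2)
for turn in ["The lighthouse goes dark.",
             "Mara finds a torn letter.",
             "A radio crackles to life."]:
    memory.add_turn(turn)
memory.latent["unresolved"] = "who sent the letter"

print(list(memory.recent), memory.summaries)
print(stagnation_action(3))
print(is_novel({"letter", "cipher"}, [{"letter", "cipher", "map"}]))
```

Keeping the three resolutions as separate fields makes it easy to log which layer supplied a given piece of context on any turn. A real system would presumably replace `summarize` with a model call and derive `stalled_turns` from some measure of plot progress.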
Before a single turn, the agent moves through a logged lifecycle so researchers can pinpoint exactly where things stabilized — or drifted.
Stage 1 (premise): Title, premise, theme, emotional undercurrent, and protagonist objective. This is the narrative seed, kept separate from its later realization.
Stage 2 (setting): A world and a scene frame, translating emotional context into a concrete place. These are not decorative metadata but runtime state for tracking continuity.
Stage 3 (characters): Protagonist and supporting cast, each with backstory, personality, and speech style. Names are normalized into stable identifiers for later attribution.
Stage 4 (acts): A multi-act outline refined through a critic-then-refiner loop. The critic scores novelty, tension, and pacing; the refiner rewrites weak acts. Failures fall back softly.
Stage 5 (opening scene): Scene prose, initial dialogue, and branching choices. The output already carries message history, hidden story elements, and active tensions, so the interaction loop begins from structured state, not free-form text.
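Two of the steps above lend themselves to a short sketch: normalizing character names into stable identifiers, and the critic-then-refiner loop with a soft fallback. The Python below is an illustration under stated assumptions. The names `normalize_name`, `refine_outline`, `toy_critic`, `toy_refiner`, and `broken_refiner` are hypothetical, and the snake_case identifier scheme, the 0.6 threshold, the two-round limit, and the "keep the last good outline" fallback are guesses at plausible behaviour rather than the environment's documented behaviour.

```python
import re
import unicodedata


def normalize_name(name: str) -> str:
    """Map a display name to a stable identifier (assumed: lowercase ASCII snake_case)."""
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "_", ascii_name.lower()).strip("_")


def refine_outline(acts, critic, refiner, threshold=0.6, max_rounds=2):
    """Rewrite acts the critic scores below threshold; keep the last good outline on failure."""
    current = list(acts)
    for _ in range(max_rounds):
        try:
            scores = [critic(act) for act in current]
            weak = [i for i, s in enumerate(scores) if min(s.values()) < threshold]
            if not weak:
                break
            candidate = list(current)
            for i in weak:
                candidate[i] = refiner(candidate[i], scores[i])
            current = candidate
        except Exception:
            # Fail-soft (assumed policy): a failing critic or refiner leaves
            # the most recent complete outline untouched.
            break
    return current


def toy_critic(act: str) -> dict:
    # Stand-in heuristic, not a real judge: longer acts score higher on every axis.
    score = min(len(act.split()) / 5, 1.0)
    return {"novelty": score, "tension": score, "pacing": score}


def toy_refiner(act: str, scores: dict) -> str:
    # Stand-in rewrite: a real refiner would call a model with the critic's scores.
    return act + " Then the storm cuts the only road out."


def broken_refiner(act: str, scores: dict) -> str:
    raise RuntimeError("model returned malformed output")


acts = ["Mara arrives.",
        "She searches the flooded archive for her brother's last message."]

refined = refine_outline(acts, toy_critic, toy_refiner)
fallback = refine_outline(acts, toy_critic, broken_refiner)

print(normalize_name("Dr. Mara Éloise"))
print(refined[0])
print(fallback == acts)
```

Building each round in a separate `candidate` list means a failure halfway through a round cannot leave the outline partially rewritten, which is one simple way to make the fallback safe.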
What does a session actually feel like? Below — a little notebook left open on the desk, with a small window where the game plays itself: a title screen, a feeling typed into a form, five quiet stages of construction, a scene, a transition, and an artifact the reader can hold.
A small window onto the story as it is being told.
A preliminary read across eleven dimensions of narrative performance — from relevance and coherence to character shaping and reuse intent. Numbers will keep refining as more rounds arrive; treat this as an opening pulse, not a verdict.
| # | Model | Rel. | Coh. | Emp. | Sur. | Eng. | Cpx. | Char. | Sat. | P.Q. | P.H. | Reuse | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Claude Sonnet 4.6 | 1.99 | 2.13 | 3.22 | 1.58 | 1.93 | 2.53 | 1.78 | 1.73 | 2.38 | 0.80 | 2.39 | 2.04 |
| 2 | Claude Opus 4.6 | 1.20 | 1.24 | 1.21 | 0.80 | 0.67 | 2.81 | 2.79 | 1.97 | 1.18 | 2.38 | 2.72 | 1.72 |
| 3 | GPT-5.4 | 1.64 | 1.84 | 1.89 | 1.53 | 1.36 | 1.41 | 1.12 | 3.89 | 1.27 | 1.09 | 1.12 | 1.65 |
| 4 | Gemini 3.1 Pro | 0.93 | 1.44 | 0.87 | 1.46 | 1.34 | 0.68 | 1.06 | 0.80 | 1.34 | 1.46 | 0.98 | 1.12 |
| 5 | Qwen3.5-397B | 0.62 | 0.41 | 0.50 | 0.62 | 1.19 | 0.31 | 1.11 | 0.99 | 1.37 | 1.22 | 1.06 | 0.85 |
| 6 | Doubao Seed 2.0 | 0.90 | 0.90 | 1.39 | 0.74 | 0.55 | 1.39 | 0.68 | 0.59 | 0.48 | 0.82 | 0.61 | 0.82 |
| 7 | DeepSeek V3.2 | 0.71 | 0.76 | 0.68 | 0.63 | 0.50 | 0.55 | 0.46 | 0.64 | 0.89 | 0.58 | 0.57 | 0.63 |
| 8 | GLM-5 | 0.69 | 0.52 | 0.64 | 0.62 | 0.67 | 0.62 | 0.49 | 0.25 | 0.35 | 0.42 | 0.38 | 0.51 |

Models are ranked by their 11-dimension average (Avg).
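
The Avg column and the ranking can be re-derived from the eleven per-dimension scores. The sketch below assumes Avg is an unweighted arithmetic mean rounded to two decimals; the page does not state this explicitly, but that reading reproduces every reported Avg value. The variable names are illustrative, and each list follows the table's column order from Rel. to Reuse.

```python
# Per-dimension scores copied from the table, in column order Rel. ... Reuse.
scores = {
    "Claude Sonnet 4.6": [1.99, 2.13, 3.22, 1.58, 1.93, 2.53, 1.78, 1.73, 2.38, 0.80, 2.39],
    "Claude Opus 4.6":   [1.20, 1.24, 1.21, 0.80, 0.67, 2.81, 2.79, 1.97, 1.18, 2.38, 2.72],
    "GPT-5.4":           [1.64, 1.84, 1.89, 1.53, 1.36, 1.41, 1.12, 3.89, 1.27, 1.09, 1.12],
    "Gemini 3.1 Pro":    [0.93, 1.44, 0.87, 1.46, 1.34, 0.68, 1.06, 0.80, 1.34, 1.46, 0.98],
    "Qwen3.5-397B":      [0.62, 0.41, 0.50, 0.62, 1.19, 0.31, 1.11, 0.99, 1.37, 1.22, 1.06],
    "Doubao Seed 2.0":   [0.90, 0.90, 1.39, 0.74, 0.55, 1.39, 0.68, 0.59, 0.48, 0.82, 0.61],
    "DeepSeek V3.2":     [0.71, 0.76, 0.68, 0.63, 0.50, 0.55, 0.46, 0.64, 0.89, 0.58, 0.57],
    "GLM-5":             [0.69, 0.52, 0.64, 0.62, 0.67, 0.62, 0.49, 0.25, 0.35, 0.42, 0.38],
}

# Assumed aggregation: unweighted mean over the eleven dimensions.
averages = {model: round(sum(vals) / len(vals), 2) for model, vals in scores.items()}

# Rank models from highest to lowest average.
ranking = sorted(averages, key=averages.get, reverse=True)

for rank, model in enumerate(ranking, start=1):
    print(f"{rank}. {model}: {averages[model]:.2f}")
```

Because the mean is unweighted, a single outlying dimension (for example GPT-5.4's 3.89 on Sat.) can move a model's rank noticeably, which is worth keeping in mind when reading these preliminary numbers.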
If Narra·Gym helps your research, please cite the manuscript. This placeholder will be updated after public release or review.