Interactive Narrative Agent Evaluation

In Character,
& In Context.

Narra·Gym is a quiet evaluation environment for large language models asked to hold a story in their hands — to listen, to remember, to care, and to keep the world alive across many turns.

Five Capabilities · Five Agent Roles · Five-Stage Lifecycle
i. Prelude

A story is not a single prompt.
It is a quiet conversation.

most benchmarks ask
a model to answer.
ours asks it to stay.

Existing LLM benchmarks emphasize static prompting — a single question, a single answer — and therefore under-measure the capabilities required for long-horizon interactive storytelling.

Narra·Gym is an executable evaluation environment for testing LLMs as interactive narrative agents along five coupled dimensions: creative story generation, long-context state tracking, character simulation, empathic personalization, and story-grounded interactive artifact generation.

Inside a live interaction loop, the environment orchestrates five-stage story construction, multi-resolution narrative memory, reflection-guided planning, anti-stagnation control, novelty-constrained artifact synthesis, and fail-soft structured generation. Together, these choices turn interactive storytelling from a loosely specified demo into a reproducible Gym for studying persistent, emotionally aware, story-driven language agents.

ii. Five Quiet Capabilities

What we ask a model to hold, when we ask it to tell a story.

Narra·Gym doesn't test isolated competence. It tests whether a model can live inside a story — creative, attentive, in-character, empathic, tangible — for as long as the reader needs it to.

i. Creative

Creative Story Generation

From sparse emotional input, build a multi-stage narrative (premise, setting, characters, acts, and the opening scene) that is fluent, novel, and dramatically structured.

ii. Memory

Long-Context Management

Preserve consistency across many turns — unresolved tensions, revealed clues, scene transitions, and user decisions remain quietly available as actionable context.

iii. Voice

Character Simulation

Keep voices stable, distinguishable, and situationally appropriate while they still evolve with the plot.

iv. Empathy

User Empathy

Infer emotional context and align narrative developments with the reader's underlying concerns, without collapsing into generic therapeutic language.

v. Artifact

Interactive Artifact Generation

Decide when a letter, a map, a cipher, a radio dial belongs in the story — and then shape it into a tangible, interactive prop grounded in the narrative.

iii. The Ensemble

Five agents, each with a quiet duty.

Each turn is orchestrated by a small company of named roles. Together they build the world, hold the thread, keep the plot honest, imagine what comes next, and give the story something the reader can touch.

Act · I
Narrative Architect

Gathers sparse emotional input and builds a whole world — premise, setting, cast, act structure, and opening scene.

Act · II
Memory Agent

Keeps three temporal resolutions of the story — the verbatim now, rolling summaries, and the latent state that lasts forever.
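One way to picture the three resolutions, as a minimal sketch. The class name, the window size, and the truncation that stands in for summarization are all illustrative assumptions, not Narra·Gym's implementation; a real system would presumably summarize with a model.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class NarrativeMemory:
    """Hypothetical three-resolution memory: verbatim turns, summaries, latent state."""
    window: int = 4  # assumed number of turns kept verbatim
    recent: deque = field(default_factory=deque)
    summaries: list = field(default_factory=list)
    latent: dict = field(default_factory=dict)

    def add_turn(self, text: str) -> None:
        self.recent.append(text)
        if len(self.recent) > self.window:
            # The oldest turn leaves the verbatim window and is folded into a summary.
            evicted = self.recent.popleft()
            self.summaries.append(self._summarize(evicted))

    def remember(self, key: str, value: str) -> None:
        # Latent state persists for the whole session and is never evicted.
        self.latent[key] = value

    @staticmethod
    def _summarize(text: str) -> str:
        # Placeholder: truncate instead of calling a summarization model.
        return text[:40]
```

With `window=2`, adding three turns leaves the last two verbatim and moves the first into `summaries`, while anything passed to `remember` stays available throughout.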

Act · III
Pacing Agent

Watches for eloquent stalling. Escalates from gentle nudge to mandatory shift when the plot is only pretending to move.
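The escalation described here could be sketched as a simple threshold rule. The function names, the directive labels, and the thresholds of 2 and 4 stalled turns are assumptions made for illustration; how Narra·Gym actually detects stalling is not specified on this page.

```python
def count_stalled(progress_flags: list) -> int:
    """Count trailing turns in which no plot state changed (False = no progress)."""
    stalled = 0
    for moved in reversed(progress_flags):
        if moved:
            break
        stalled += 1
    return stalled


def pacing_directive(stalled_turns: int, nudge_after: int = 2, force_after: int = 4) -> str:
    """Map a run of stalled turns to a directive, escalating as the run grows."""
    if stalled_turns >= force_after:
        return "mandatory_shift"
    if stalled_turns >= nudge_after:
        return "gentle_nudge"
    return "none"
```

Here `progress_flags` is a hypothetical per-turn record of whether any tracked story state changed; the directive would then be passed to whatever plans the next turn.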

Act · IV
Planning Agent

Reflects before each turn — unresolved tensions, user interests, pacing, and where the story ought to go next.

Act · V
Artifact Agent

Shapes story state into letters, maps, ciphers, radio dials — and refuses to repeat itself through tag-based novelty filtering.

iv. How a story begins

A five-stage path from feeling to opening scene.

Before a single turn, the agent moves through a logged lifecycle so researchers can pinpoint exactly where things stabilized — or drifted.

i

Story Foundation

Title, premise, theme, emotional undercurrent, and protagonist objective — the narrative seed, separate from its later realization.

— the quiet ache that the story will answer
ii

Setting Construction

A world and a scene frame, translating emotional context into a concrete place — not decorative metadata, but runtime state for tracking continuity.

— a room the feeling can walk into
iii

Character Construction

Protagonist and supporting cast, each with backstory, personality, and speech style. Names are normalized into stable identifiers for later attribution.

— someone the reader will recognize, even in the dark
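Normalizing names into stable identifiers might look like the sketch below. The slug scheme (lowercase ASCII, underscores, an `unnamed` fallback) is an assumption chosen for illustration; the page does not say which scheme Narra·Gym uses.

```python
import re
import unicodedata


def character_id(name: str) -> str:
    """Turn a display name into a lowercase ASCII identifier (assumed scheme)."""
    # Decompose accented letters, then drop anything outside ASCII.
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    # Collapse every run of non-alphanumerics into a single underscore.
    slug = re.sub(r"[^a-z0-9]+", "_", ascii_name.lower()).strip("_")
    return slug or "unnamed"
```

Under this scheme "Dr. Élise Moreau" and "dr elise moreau" both map to `dr_elise_moreau`, so dialogue can be attributed to the same character however the name is spelled in a given turn.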
iv

Act Structure

A multi-act outline refined through a critic-then-refiner loop. The critic scores novelty, tension, and pacing; the refiner rewrites weak acts. Failures fall back softly.

— the architecture under the melody
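A critic-then-refiner loop with a soft fallback can be sketched as follows. The `critic` and `refiner` callables stand in for model calls, and the threshold, the round limit, and the choice to keep the drafted act on failure are assumptions rather than the documented behavior.

```python
from typing import Callable, Dict, List

Scores = Dict[str, float]


def refine_acts(
    acts: List[str],
    critic: Callable[[str], Scores],
    refiner: Callable[[str, Scores], str],
    threshold: float = 0.5,
    max_rounds: int = 2,
) -> List[str]:
    """Rewrite acts whose weakest score is below threshold; keep the draft on failure."""
    result = list(acts)
    for _ in range(max_rounds):
        changed = False
        for i, act in enumerate(result):
            try:
                scores = critic(act)
                if min(scores.values()) < threshold:
                    result[i] = refiner(act, scores)
                    changed = True
            except Exception:
                # Fail soft: a broken critic or refiner leaves this act as drafted.
                continue
        if not changed:
            break
    return result
```

The scores dictionary is assumed to carry the three dimensions named above (novelty, tension, pacing); an act is rewritten when any one of them falls short.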
v

Opening Scene

Scene prose, initial dialogue, and branching choices. The output already carries message history, hidden story elements, and active tensions — the interaction loop begins from structured state, not free-form text.

— and then the first breath
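The structured state the opening scene hands to the loop could be held in a small container like this one. The class and field names are hypothetical, chosen to mirror the elements listed above; the actual schema is not given on this page.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OpeningState:
    """Hypothetical container for what the opening scene passes to the interaction loop."""
    scene_prose: str
    dialogue: List[str] = field(default_factory=list)
    choices: List[str] = field(default_factory=list)
    message_history: List[dict] = field(default_factory=list)
    hidden_elements: List[str] = field(default_factory=list)
    active_tensions: List[str] = field(default_factory=list)

    def is_ready(self) -> bool:
        # The loop can start only if there is prose to show and a choice to make.
        return bool(self.scene_prose.strip()) and len(self.choices) > 0
```

Keeping hidden elements and tensions as explicit fields, rather than burying them in prose, is what would let later turns read them as state.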
v. A Glimpse

A window into the story itself.

What does a session actually feel like? Below — a little notebook left open on the desk, with a small window where the game plays itself: a title screen, a feeling typed into a form, five quiet stages of construction, a scene, a transition, and an artifact the reader can hold.

chapter · a session in session

From the notebook.

a small window
onto the story
as it is being told.

a self-playing loop · six quiet moments
& Try them

Ten small things the reader can hold.

At key moments the Artifact Agent shapes story state into a self-contained prop — letters, photographs, signal cards, cassettes, telegrams, maps, pocket watches, ciphers, matchbooks, music boxes — tagged by format, style, and interaction, then checked against recent history so it never repeats itself.
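The novelty check could be sketched as a tag-overlap test against the most recent artifacts. The three tag fields follow the sentence above; the lookback of 3 and the rule that at most one tag may be shared are assumptions for illustration only.

```python
from typing import List, NamedTuple


class ArtifactTags(NamedTuple):
    format: str       # e.g. "letter", "map"
    style: str        # e.g. "handwritten"
    interaction: str  # e.g. "unfold", "decode"


def is_novel(candidate: ArtifactTags, history: List[ArtifactTags],
             lookback: int = 3, max_shared: int = 1) -> bool:
    """Reject a candidate that shares too many tags with any recent artifact."""
    recent = history[-lookback:] if lookback > 0 else []
    for past in recent:
        # Compare tags position by position: format, style, interaction.
        shared = sum(a == b for a, b in zip(candidate, past))
        if shared > max_shared:
            return False
    return True
```

After a handwritten letter that unfolds, a second handwritten letter would be rejected even with a different interaction, while an inked map that also unfolds would pass.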

vi. The Leaderboard

First signals.

A preliminary read across eleven dimensions of narrative performance — from relevance and coherence to character shaping and reuse intent. Numbers will keep refining as more rounds arrive; treat this as an opening pulse, not a verdict.

Narra·Gym Leaderboard

v0.1 · preliminary
#  Model              Rel.  Coh.  Emp.  Sur.  Eng.  Cpx.  Char.  Sat.  P.Q.  P.H.  Reuse  Avg
1  Claude Sonnet 4.6  1.99  2.13  3.22  1.58  1.93  2.53  1.78   1.73  2.38  0.80  2.39   2.04
2  Claude Opus 4.6    1.20  1.24  1.21  0.80  0.67  2.81  2.79   1.97  1.18  2.38  2.72   1.72
3  GPT-5.4            1.64  1.84  1.89  1.53  1.36  1.41  1.12   3.89  1.27  1.09  1.12   1.65
4  Gemini 3.1 Pro     0.93  1.44  0.87  1.46  1.34  0.68  1.06   0.80  1.34  1.46  0.98   1.12
5  Qwen3.5-397B       0.62  0.41  0.50  0.62  1.19  0.31  1.11   0.99  1.37  1.22  1.06   0.85
6  Doubao Seed 2.0    0.90  0.90  1.39  0.74  0.55  1.39  0.68   0.59  0.48  0.82  0.61   0.82
7  DeepSeek V3.2      0.71  0.76  0.68  0.63  0.50  0.55  0.46   0.64  0.89  0.58  0.57   0.63
8  GLM-5              0.69  0.52  0.64  0.62  0.67  0.62  0.49   0.25  0.35  0.42  0.38   0.51
Ranked by the 11-dimension average.
Last updated · 2026/05/07 · a first reading
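The Avg column appears to be an unweighted mean of the eleven dimension scores, rounded to two decimals. That reading is an inference from the table's own arithmetic rather than a documented formula; the sketch below checks it against the top row.

```python
def dimension_average(scores: list) -> float:
    """Unweighted mean of the per-dimension scores, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)


# Eleven dimension scores for the top row of the v0.1 table.
top_row = [1.99, 2.13, 3.22, 1.58, 1.93, 2.53, 1.78, 1.73, 2.38, 0.80, 2.39]
print(dimension_average(top_row))  # 2.04
```

The scores sum to 22.46, and 22.46 / 11 ≈ 2.04, matching the listed average.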
vii. Cite

If you share our quiet interest.

If Narra·Gym helps your research, please cite the manuscript. This placeholder will be updated after public release or review.

% Citation placeholder · update after public release
@misc{narragym2026,
  title  = {In Character, In Context: NARRA-Gym as an Evaluation Environment for Interactive Narrative Agents},
  author = {Anonymous},
  year   = {2026},
  note   = {Manuscript in preparation}
}