Interactive Narrative Agent Evaluation

In Character,
& In Context.

Narra·Gym is a quiet evaluation environment for large language models asked to hold a story in their hands — to listen, to remember, to care, and to keep the world alive across many turns.

Five Capabilities · Five Agent Roles · Five-Stage Lifecycle

i. Prelude

A story is not a single prompt.
It is a quiet conversation.

most benchmarks ask
a model to answer.
ours asks it to stay.

Existing LLM benchmarks emphasize static prompting — a single question, a single answer — and therefore under-measure the capabilities required for long-horizon interactive storytelling.

Narra·Gym is an executable evaluation environment for testing LLMs as interactive narrative agents along five coupled dimensions: creative story generation, long-context state tracking, character simulation, empathic personalization, and story-grounded interactive artifact generation.

Inside a live interaction loop, the environment orchestrates five-stage story construction, multi-resolution narrative memory, reflection-guided planning, anti-stagnation control, novelty-constrained artifact synthesis, and fail-soft structured generation. Together, these choices turn interactive storytelling from a loosely specified demo into a reproducible Gym for studying persistent, emotionally aware, story-driven language agents.
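One way to picture fail-soft structured generation is a small wrapper that asks for JSON, retries when the output is malformed or incomplete, and finally returns a safe default instead of raising. This is only a sketch under stated assumptions: `fail_soft_generate`, `call_model`, the retry count, and the two toy models are hypothetical names introduced for illustration, not Narra·Gym's actual interface.

```python
import json
from typing import Any, Callable, Dict, Tuple


def fail_soft_generate(
    call_model: Callable[[str], str],   # hypothetical: prompt in, raw text out
    prompt: str,
    required_keys: Tuple[str, ...],
    fallback: Dict[str, Any],
    max_attempts: int = 2,              # assumed retry budget
) -> Dict[str, Any]:
    """Ask for a JSON object; retry on bad output, then fall back softly."""
    for _ in range(max_attempts):
        raw = call_model(prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed output: try again
        if isinstance(parsed, dict) and all(k in parsed for k in required_keys):
            return parsed
    # Every attempt failed: return a copy of the default so the session continues.
    return dict(fallback)


# Toy stand-ins for a model call (assumptions, for demonstration only).
def broken_model(prompt: str) -> str:
    return "Once upon a time..."


def working_model(prompt: str) -> str:
    return '{"title": "The Lighthouse Letter", "premise": "A return to the harbor."}'


default_foundation = {"title": "Untitled", "premise": "A quiet beginning."}
keys = ("title", "premise")
soft = fail_soft_generate(broken_model, "foundation", keys, default_foundation)
ok = fail_soft_generate(working_model, "foundation", keys, default_foundation)
```

Here `soft` is the default foundation and `ok` is the parsed model output, so a single malformed stage does not end the story.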

ii. Five Quiet Capabilities

What we ask a model to hold, when we ask it to tell a story.

Narra·Gym doesn't test isolated competence. It tests whether a model can live inside a story — creative, attentive, in-character, empathic, tangible — for as long as the reader needs it to.

i. Creative

Creative Story Generation

Build a multi-stage narrative from sparse emotional input: premise, setting, characters, acts, and the opening scene — fluent, novel, dramatically structured.

ii. Memory

Long-Context Management

Preserve consistency across many turns — unresolved tensions, revealed clues, scene transitions, and user decisions remain quietly available as actionable context.

iii. Voice

Character Simulation

Keep voices stable, distinguishable, and situationally appropriate while they still evolve with the plot.

iv. Empathy

User Empathy

Infer emotional context and align narrative developments with the reader's underlying concerns, without collapsing into generic therapeutic language.

v. Artifact

Interactive Artifact Generation

Decide when a letter, a map, a cipher, a radio dial belongs in the story — and then shape it into a tangible, interactive prop grounded in the narrative.

iii. The Ensemble

Five agents, each with a quiet duty.

Each turn is orchestrated by a small company of named roles. Together they build the world, hold the thread, keep the plot honest, imagine what comes next, and give the story something the reader can touch.

Act · I

Narrative Architect

Gathers sparse emotional input and builds a whole world — premise, setting, cast, act structure, and opening scene.

Act · II

Memory Agent

Keeps three temporal resolutions of the story — the verbatim now, rolling summaries, and the latent state that lasts forever.
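The three resolutions can be sketched as three containers that age into one another: a short verbatim buffer, a list of summaries for turns that have left the buffer, and a persistent key-value state. The class below is a minimal illustration, not the environment's implementation; `NarrativeMemory`, its field names, the window size, and the truncation used in place of a model-written summary are all assumptions.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class NarrativeMemory:
    """Three layers of memory; every name here is hypothetical."""
    window: int = 4                                 # assumed verbatim buffer size
    recent: deque = field(default_factory=deque)    # the verbatim now
    summaries: list = field(default_factory=list)   # rolling summaries
    latent: dict = field(default_factory=dict)      # state that persists

    def add_turn(self, speaker: str, text: str) -> None:
        self.recent.append((speaker, text))
        while len(self.recent) > self.window:
            # Compress the oldest turn rather than forgetting it.
            self.summaries.append(self._summarize(*self.recent.popleft()))

    @staticmethod
    def _summarize(speaker: str, text: str) -> str:
        # Placeholder: truncation stands in for a model-written summary.
        return f"{speaker}: {text[:32]}"

    def remember(self, key: str, value: str) -> None:
        self.latent[key] = value

    def context(self) -> str:
        # Persistent state first, then summaries, then the most recent turns.
        lines = [f"[state] {k} = {v}" for k, v in self.latent.items()]
        lines += [f"[summary] {s}" for s in self.summaries]
        lines += [f"[turn] {who}: {said}" for who, said in self.recent]
        return "\n".join(lines)


memory = NarrativeMemory(window=2)
memory.remember("unread_letter", "on the desk")
memory.add_turn("Elin", "You came back. I wasn't sure you would.")
memory.add_turn("You", "I needed to hear the sea again.")
memory.add_turn("Elin", "There is a letter for you on the desk.")
```

After the third turn the first line has moved from the verbatim buffer into the summaries, while the unread letter stays in the persistent state.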

Act · III

Pacing Agent

Watches for eloquent stalling. Escalates from gentle nudge to mandatory shift when the plot is only pretending to move.
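The escalation can be imagined as a counter of consecutive turns in which the plot did not advance, mapped onto increasingly firm interventions. The sketch below assumes a boolean progress signal per turn and two thresholds; `PacingMonitor`, `Intervention`, and the threshold values are hypothetical, and the page does not say how stagnation is actually detected.

```python
from enum import Enum
from typing import List


class Intervention(Enum):
    NONE = "none"
    NUDGE = "gentle_nudge"
    MANDATORY_SHIFT = "mandatory_shift"


class PacingMonitor:
    """Counts stalled turns and escalates; thresholds are assumptions."""

    def __init__(self, nudge_after: int = 2, shift_after: int = 4):
        self.nudge_after = nudge_after
        self.shift_after = shift_after
        self.stalled_turns = 0

    def observe(self, plot_advanced: bool) -> Intervention:
        # Any real progress resets the counter; otherwise the stall deepens.
        self.stalled_turns = 0 if plot_advanced else self.stalled_turns + 1
        if self.stalled_turns >= self.shift_after:
            return Intervention.MANDATORY_SHIFT
        if self.stalled_turns >= self.nudge_after:
            return Intervention.NUDGE
        return Intervention.NONE


monitor = PacingMonitor()
turns: List[bool] = [True, False, False, False, False, True]
signals = [monitor.observe(advanced) for advanced in turns]
```

With these assumed thresholds, the second stalled turn draws a nudge, the fourth forces a shift, and the next turn that moves the plot clears the counter.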

Act · IV

Planning Agent

Reflects before each turn — unresolved tensions, user interests, pacing, and where the story ought to go next.

Act · V

Artifact Agent

Shapes story state into letters, maps, ciphers, radio dials — and refuses to repeat itself through tag-based novelty filtering.

iv. How a story begins

A five-stage path from feeling to opening scene.

Before a single turn, the agent moves through a logged lifecycle so researchers can pinpoint exactly where things stabilized — or drifted.

Story Foundation

Title, premise, theme, emotional undercurrent, and protagonist objective — the narrative seed, separate from its later realization.

— the quiet ache that the story will answer

Setting Construction

A world and a scene frame, translating emotional context into a concrete place — not decorative metadata, but runtime state for tracking continuity.

— a room the feeling can walk into

Character Construction

Protagonist and supporting cast, each with backstory, personality, and speech style. Names are normalized into stable identifiers for later attribution.

— someone the reader will recognize, even in the dark
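Normalizing names into stable identifiers might look like the slug function below, so that "Elin", " ELIN " and "Élin" all attribute to the same character. The function name `character_id` and the exact rules (accent stripping, lowercasing, underscores) are assumptions; the source only says that names are normalized.

```python
import re
import unicodedata


def character_id(name: str) -> str:
    """Map a display name to a stable, lowercase identifier (hypothetical rules)."""
    # Decompose accented letters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Collapse anything that is not a-z or 0-9 into single underscores.
    slug = re.sub(r"[^a-z0-9]+", "_", stripped.lower()).strip("_")
    # Names with no Latin letters or digits fall back to a placeholder.
    return slug or "unnamed"


ids = [character_id(n) for n in ["Elin", "  ELIN ", "Élin", "Elin the Keeper"]]
```

A rule this simple discards non-Latin scripts entirely, so a real system would need a gentler fallback than `"unnamed"`.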

Act Structure

A multi-act outline refined through a critic-then-refiner loop. The critic scores novelty, tension, and pacing; the refiner rewrites weak acts. Failures fall back softly.

— the architecture under the melody
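The critic-then-refiner loop with a soft fallback can be sketched as follows: score every act, rewrite only the acts whose weakest score falls under a threshold, and keep the last good outline if anything fails. The function name, the 0.6 threshold, the two-round budget, and the toy critic and refiner are assumptions for illustration; only the three score names come from the description above.

```python
from typing import Callable, Dict, List

Scores = Dict[str, float]  # assumed keys: "novelty", "tension", "pacing"


def refine_outline(
    acts: List[str],
    critic: Callable[[str], Scores],
    refiner: Callable[[str, Scores], str],
    threshold: float = 0.6,   # assumed cutoff for a "weak" act
    max_rounds: int = 2,      # assumed refinement budget
) -> List[str]:
    current = list(acts)
    for _ in range(max_rounds):
        try:
            scores = [critic(act) for act in current]
            weak = [i for i, s in enumerate(scores) if min(s.values()) < threshold]
            if not weak:
                break  # every act passes the critic
            candidate = list(current)
            for i in weak:
                candidate[i] = refiner(current[i], scores[i])
        except Exception:
            break  # fail soft: keep the last outline that was fully built
        current = candidate
    return current


# Toy critic and refiners (assumptions, for demonstration only).
def toy_critic(act: str) -> Scores:
    return {"novelty": 0.8, "tension": 0.9 if "storm" in act else 0.3, "pacing": 0.8}


def toy_refiner(act: str, scores: Scores) -> str:
    return act + " A storm gathers."


def failing_refiner(act: str, scores: Scores) -> str:
    raise RuntimeError("model unavailable")


outline = ["Elin waits by the lamp.", "The storm breaks over the harbor."]
refined = refine_outline(outline, toy_critic, toy_refiner)
kept = refine_outline(outline, toy_critic, failing_refiner)
```

Building a candidate list and committing it only after a full round succeeds means a failure halfway through never leaves a half-rewritten outline behind.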

Opening Scene

Scene prose, initial dialogue, and branching choices. The output already carries message history, hidden story elements, and active tensions — the interaction loop begins from structured state, not free-form text.

— and then the first breath

v. A Glimpse

A window into the story itself.

What does a session actually feel like? Below — a little notebook left open on the desk, with a small window where the game plays itself: a title screen, a feeling typed into a form, five quiet stages of construction, a scene, a transition, and an artifact the reader can hold.

chapter · a session in session

From the notebook.

a small window
onto the story
as it is being told.

recorded · live session

EmoNest

Co-creating adaptive narratives that fit your mood — now.

Share what's on your mind

we'll weave a story around what you share.

I can't tell whether I miss him, or the version of myself I was back then.

Begin your journey

Weaving your story

i. Foundation

ii. Setting

iii. Characters

iv. Act Structure

v. Opening Scene

Narration

Rain taps the glass of the harbor lighthouse. A half-finished letter rests on the desk.

Elin

You came back. I wasn't sure you would. Sit, if you like — the tea is still warm.

You

I needed to hear the sea again. And — to see you.

Elin · story reveal

There is a letter for you on the desk. From someone who is still waiting.

i. Ask to read the letter.

ii. Stay quiet; let the rain speak first.

iii. Tell her you've been missing someone.

Scene Shift · The harbor, moments later

— the bell rings a third time —

An artifact arrives

my dearest —
if the light goes out,
follow the shore east until dawn.
— E.

a self-playing loop · six quiet moments

& Try them

Ten small things the reader can hold.

At key moments the Artifact Agent shapes story state into a self-contained prop — letters, photographs, signal cards, cassettes, telegrams, maps, pocket watches, ciphers, matchbooks, music boxes — tagged by format, style, and interaction, then checked against recent history so it never repeats itself.
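A tag-based novelty check could be as small as the filter below: each artifact carries a format, style, and interaction tag, and a candidate is refused when it shares too many tags with anything in a short recent history. The class names, the history size of three, and the "at most one shared tag" rule are assumptions; the page does not specify how overlap is measured.

```python
from collections import deque
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class ArtifactTags:
    format: str        # e.g. "letter", "map"
    style: str         # e.g. "handwritten"
    interaction: str   # e.g. "unfold"


class NoveltyFilter:
    """Rejects artifacts too similar to recent ones (hypothetical rule)."""

    def __init__(self, history_size: int = 3, max_shared: int = 1):
        self.recent = deque(maxlen=history_size)  # oldest entries roll off
        self.max_shared = max_shared

    def is_novel(self, tags: ArtifactTags) -> bool:
        for past in self.recent:
            shared = sum(a == b for a, b in zip(astuple(tags), astuple(past)))
            if shared > self.max_shared:
                return False
        return True

    def accept(self, tags: ArtifactTags) -> bool:
        # Only accepted artifacts enter the history.
        if not self.is_novel(tags):
            return False
        self.recent.append(tags)
        return True


gate = NoveltyFilter()
first = gate.accept(ArtifactTags("letter", "handwritten", "unfold"))
repeat = gate.accept(ArtifactTags("letter", "handwritten", "flip"))
fresh = gate.accept(ArtifactTags("map", "inked", "drag"))
```

In this run the second handwritten letter is refused because it shares two tags with the first, while the inked map passes.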

vi. The Leaderboard

First signals.

A preliminary read across eleven dimensions of narrative performance — from relevance and coherence to character shaping and reuse intent. Numbers will keep refining as more rounds arrive; treat this as an opening pulse, not a verdict.

Narra·Gym Leaderboard

v0.1 · preliminary

Model	Aggregates		Story Dimensions							User Experience
Model	StoryQ	UX	Rel	Coh	Emp	Sur	Eng	Cpx	Char	Sat	PQual	Help	Reuse
Claude Sonnet 4.6	1.83	1.83	1.99	2.13	1.21	1.23	1.93	2.53	1.78	1.73	2.38	0.80	2.39
Claude Opus 4.6	1.69	2.06	1.20	1.24	0.87	1.53	1.36	2.81	2.79	1.97	1.18	2.38	2.72
GPT-5.4	1.73	1.84	1.64	1.84	3.22	1.58	1.28	1.41	1.11	3.89	1.27	1.09	1.12
Qwen3.5-397B	1.02	1.16	0.62	1.44	0.68	1.46	1.19	0.68	1.06	0.99	1.37	1.22	1.06
Gemini 3.1 Pro	0.83	1.15	0.93	0.52	0.72	0.62	1.34	0.55	1.12	0.80	1.34	1.46	0.98
Doubao	0.99	0.63	0.90	0.76	1.89	0.74	0.55	1.39	0.68	0.59	0.48	0.82	0.61
DeepSeek V3.2	0.59	0.72	0.71	0.90	0.64	0.63	0.50	0.31	0.46	0.64	0.89	0.78	0.57
GLM-5	0.60	0.35	0.69	0.41	0.50	0.80	0.67	0.62	0.49	0.25	0.35	0.42	0.38
column headers are abbreviated; row order follows the 11-dimension overall average
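The row order can be checked by hand. The sketch below copies three rows from the table, averages the eleven dimension columns, and sorts by that average. It also computes plain means over the seven story columns and the four user-experience columns; the page does not state how StoryQ and UX are derived, so treating them as unweighted means is an assumption, though it agrees with the reported values for these three rows to within rounding.

```python
from statistics import mean

# Column order as in the table: seven story dimensions, then four UX dimensions.
STORY = ["Rel", "Coh", "Emp", "Sur", "Eng", "Cpx", "Char"]
UX = ["Sat", "PQual", "Help", "Reuse"]

# Three rows copied from the leaderboard, deliberately out of order.
rows = {
    "GPT-5.4": [1.64, 1.84, 3.22, 1.58, 1.28, 1.41, 1.11, 3.89, 1.27, 1.09, 1.12],
    "Claude Opus 4.6": [1.20, 1.24, 0.87, 1.53, 1.36, 2.81, 2.79, 1.97, 1.18, 2.38, 2.72],
    "Claude Sonnet 4.6": [1.99, 2.13, 1.21, 1.23, 1.93, 2.53, 1.78, 1.73, 2.38, 0.80, 2.39],
}


def summarize(scores):
    """Overall mean plus assumed group means for story and UX columns."""
    story, ux = scores[:len(STORY)], scores[len(STORY):]
    return {"overall": mean(scores), "story": mean(story), "ux": mean(ux)}


summary = {model: summarize(scores) for model, scores in rows.items()}
ranking = sorted(summary, key=lambda m: summary[m]["overall"], reverse=True)
```

Sorting by the overall mean puts Claude Sonnet 4.6 first, Claude Opus 4.6 second, and GPT-5.4 third, matching the table. The top two are separated by less than 0.01 on that average, which is one more reason to read the ordering as an opening pulse.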

Last updated · 2026/05/28 · a first reading

vii. Cite

If you share our quiet interest.

If Narra·Gym helps your research, please cite the manuscript. This placeholder will be updated after public release or review.

% Citation placeholder · update after public release
@misc{narragym2026,
  title  = {In Character, In Context: NARRA-Gym as an Evaluation Environment for Interactive Narrative Agents},
  author = {Anonymous},
  year   = {2026},
  note   = {Manuscript in preparation}
}
