Step 3 · Test Until You're Happy

The Shakedown

A four-layer plan for testing the Backroom before you trust it with a real book.

For Angie · 2026-05-23

· Why we test in layers

Your Notion process is blunt about this step: "This is where most people quit, and it's the most important step. Do not move on until the outputs make you genuinely happy."

The trap is judging the whole machine on one big run. If a 30,000-word run comes back wrong, you don't know which of the forty-one moving parts went wrong, and you've spent real money to learn that. So we do the opposite: four widening layers, smallest and cheapest first. Each layer either earns your confidence or shows you exactly where the seam is. The first two layers cost essentially nothing.

Two facts about your machine right now shape the plan. It's currently in live mode (no .mock-mode flag on disk), and pen-names/ is empty, so no real book has ever run through it. That means your first real run does two new things at once: it stands up a pen name and writes a book. We keep that first run tiny on purpose.
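If you'd like to confirm those two facts yourself before starting, here is a minimal Python sketch. It assumes the .mock-mode flag sits directly in the project root next to pen-names/; that location is a guess, and machine_state is a made-up helper name, not part of the Backroom.

```python
from pathlib import Path


def machine_state(root: Path) -> dict:
    """Report the current mode and which pen names exist on disk."""
    # Assumption: the flag file lives directly in the project root.
    mock_flag = root / ".mock-mode"
    pen_names_dir = root / "pen-names"

    pen_names = []
    if pen_names_dir.is_dir():
        # Count only sub-folders; stray files are ignored.
        pen_names = sorted(p.name for p in pen_names_dir.iterdir() if p.is_dir())

    return {
        "mode": "mock" if mock_flag.exists() else "live",
        "pen_names": pen_names,
    }
```

Run it against your project folder, for example `machine_state(Path("."))`. An empty pen_names list together with mode "live" matches the state described above.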

What "the machine is ready" actually means: The smoke test you already passed proved the plumbing connects. It ran in mock mode, so every model call returned placeholder text. You have not yet seen whether the prose is something you'd want to read. That is the entire job of Step 3.

· The four layers at a glance

Layer | What it tests | What you're judging | Cost
0 | Plumbing | Do files land in the right place? Do the five human moments surface where they should? Does the Boss freeze on a halt? | $0 · mock
1 | Pen-name setup | Does the machine correctly inherit your voice files before it ever tries to write? | Tiny · live
2 | One shakedown scene | The real test. Is the prose something you'd write? Do two characters sound like different people? | Low · live
3 | A short complete work | Drafting in order, the Consiglieres, the severity ladder, the full Cleanup Crew, all judged as a whole reading experience. | Moderate · live

Each layer ends with a happy-gate: a plain question you answer honestly before advancing. If the answer is no, we stop and fix that layer. You never carry a problem forward into a more expensive layer.

0. Layer 0: Plumbing (mock mode, $0)

Layer 0

Watch the wiring with your own eyes

$0 · mock

You already ran an 11-phase smoke test during the build. The point of repeating it now is that you watch it โ€” not to test the code, but to learn the machine's rhythm before any money is on the line.

Do this
  1. Turn mock mode back ON. The flag file controls everything.
    Turn on mock mode. Then confirm the banner says mock mode is ON.
  2. Run one full book end-to-end with the throwaway pen name (spec below). Placeholder prose stands in for everything; every paragraph is stamped [MOCK].
    Start a new book in mock mode using the test pen name. Walk me through every phase. Pause at each of the five human moments and tell me plainly what you're showing me and why.
What to watch for
  • The five human moments actually surface: Intake review, Gate 1 (Foundation), Gate 2 (Blueprint), Drift = discovery (may not fire on a short mock run, which is fine), and Cleanup ratification.
  • Files land where the tree says they should: a book folder under pen-names/test-pen/standalones/, planning artifacts, a manuscript/ folder, a runs/ evidence trail.
  • The Boss never asks "shall I continue?" at Note / Word / Sit-down level. If it pings you for permission on small stuff, that's a bug to feed back; the whole design is that it doesn't.
  • Freeze-on-halt works: if you can trigger a "Do it over," the machine should stop cleanly and leave nothing half-changed.
If anything surfaces in the wrong order, lands in the wrong folder, or asks you a question it shouldn't, screenshot it and paste it back. Mock mode is exactly where you want to find wiring bugs, because they cost nothing here.
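The "files land where the tree says" check can also be done mechanically. The Python sketch below lists each book folder under pen-names/test-pen/standalones/ and reports which expected sub-folders are missing. It assumes manuscript/ and runs/ both sit inside the book folder, which this plan does not actually confirm, so adjust EXPECTED_SUBFOLDERS to match your real tree. check_book_tree is a hypothetical helper name.

```python
from pathlib import Path

# Assumption: both folders live inside the book folder itself.
EXPECTED_SUBFOLDERS = ("manuscript", "runs")


def check_book_tree(root: Path, pen_name: str = "test-pen") -> dict:
    """Map each book folder name to the expected sub-folders it lacks."""
    standalones = root / "pen-names" / pen_name / "standalones"
    report = {}
    if not standalones.is_dir():
        return report  # no book folder has landed yet
    for book in sorted(p for p in standalones.iterdir() if p.is_dir()):
        report[book.name] = [
            name for name in EXPECTED_SUBFOLDERS if not (book / name).is_dir()
        ]
    return report
```

An empty dictionary means no book folder was created at all. A book name mapped to an empty list means everything this sketch looks for is in place.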
Happy-gate 0: Did all five human moments surface in the right order, did files land where you expected, and did the Boss stay quiet on the small stuff? If yes → turn mock mode OFF and go to Layer 1.

1. Layer 1: Pen-name setup (live, tiny cost)

Layer 1

Stand up the pen name; prove voice inheritance

Tiny · live

Before the machine writes a single real sentence, it has to know whose voice it's writing in. This layer sets that up and confirms the machine reads it.

Do this
  1. Turn mock mode OFF (delete the flag). Confirm the banner now says Live mode.
  2. Create the throwaway pen name. Because this is a sandbox, you don't need polished voice files, but the test is more meaningful if you drop in a voice bible, even a rough one, so you can see the machine honor it.
    Create a new throwaway pen name called "test-pen". I'll give you a short voice bible to drop in. Don't run the style-builder from scratch; use what I provide.
  3. Ask the machine to read its own setup back to you.
    Read the test-pen voice files back to me in your own words. What voice do you think you're writing in, and where did you get that?
What to watch for
  • The machine's summary of the voice should sound like your file, not a generic "warm, engaging prose" boilerplate. If it's generic, it didn't really read the file โ€” feed that back.
  • The pen-name folder should contain the seven inherited files (pen-name-profile, voice-bible, style-guide, voice-anchor, crutch-list, prose-rules, audience-profile).
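A quick way to confirm the seven inherited files are present is sketched below in Python. It matches on the file name without its extension, because this plan does not say which extension the files use. It also assumes the files sit directly in the pen-name folder and not in a sub-folder. missing_voice_files is a hypothetical helper name.

```python
from pathlib import Path

# The seven inherited files named in this plan, without extensions.
INHERITED_FILES = (
    "pen-name-profile",
    "voice-bible",
    "style-guide",
    "voice-anchor",
    "crutch-list",
    "prose-rules",
    "audience-profile",
)


def missing_voice_files(pen_dir: Path) -> list:
    """Return the inherited files that are not in the pen-name folder."""
    present = set()
    if pen_dir.is_dir():
        # Path.stem drops the final extension: voice-bible.md -> voice-bible
        present = {p.stem for p in pen_dir.iterdir() if p.is_file()}
    return [name for name in INHERITED_FILES if name not in present]
```

An empty list from `missing_voice_files(Path("pen-names/test-pen"))` means all seven are there. It says nothing about whether the machine actually read them, which is what the read-back prompt is for.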
Happy-gate 1: Does the machine describe the voice back to you accurately, citing your file? If yes → Layer 2.

2. Layer 2: The shakedown scene (the real test)

Layer 2

One scene. Two people who want different things.

Low · live

Your machine already has a name for this: the "shakedown piece." The Boss is built to offer one representative scene as a pilot before committing to a full book. We use that, deliberately, with a two-hander tension scene: two characters, one room, conflicting wants.

Why this scene type, specifically

A two-hander is the single most revealing voice test there is. It exercises three systems at once: dialogue cards (do the characters sound distinct?), interiority (does the POV character's inner life ring true?), and voice inheritance (does it sound like your pen name?). It does so without the higher bar of a spice or action beat clouding your read. If two characters sound like the same person wearing two name tags, you'll know in a paragraph.

Do this
  1. With the sandbox arrangement filled (spec below), let the machine run the Setup to Gate 2, then ask for the shakedown.
    Don't draft the whole book. Run the Setup, and at the end offer me the shakedown piece: pick a mid-book two-hander scene where the two leads want opposite things. Draft just that one scene, live.
  2. Read it the way you'd read a submission. Out loud, even. Mark the lines that work and the lines that don't.
What you're actually judging
  • Do the two voices separate? Cover the dialogue tags. Can you still tell who's speaking?
  • Is the interiority earned, or on-the-nose? Your original pitch wanted the "afraid of horses because of the stable boy" depth: felt, not spelled out. Look for that.
  • Does it sound like your pen name or like generic competent AI prose?
  • The AI-tells. Em-dash overuse, "not X but Y" constructions, repeated descriptors. These get caught at Cleanup in a full run, but eyeball them here too.
  • Would you change it, or do you have to change it? Your stated goal: a manuscript you'd want to edit by choice, not rescue by necessity.
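For the AI-tells specifically, a rough count can back up your eyeball read. The Python sketch below is not the Cleanup Crew's method, just a crude stand-in: it counts em-dashes, matches "not ... but ..." within a single sentence, and treats any word of five or more letters that appears three or more times as a possible repeated descriptor. The thresholds and the name ai_tell_report are assumptions made for illustration.

```python
import re
from collections import Counter

# "not X but Y" inside one sentence, with a short gap between the two words.
NOT_X_BUT_Y = re.compile(r"\bnot\b[^.!?\n]{1,60}?\bbut\b", re.IGNORECASE)


def ai_tell_report(text: str, min_len: int = 5, min_repeats: int = 3) -> dict:
    """Count three common AI-tells in a scene. A rough screen, not a verdict."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    em_dashes = text.count("\u2014")  # the em-dash character

    # Crude proxy for repeated descriptors: longer words used again and again.
    counts = Counter(w for w in words if len(w) >= min_len)
    repeated = {w: n for w, n in counts.items() if n >= min_repeats}

    return {
        "em_dashes": em_dashes,
        "em_dashes_per_1000_words": round(1000 * em_dashes / max(len(words), 1), 1),
        "not_x_but_y": len(NOT_X_BUT_Y.findall(text)),
        "repeated": repeated,
    }
```

Paste the shakedown scene in as a string and read the numbers as a prompt for where to look. Character names and ordinary nouns will show up under "repeated" too, so your own read still decides.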
This is the layer where screenshots matter most. For every line that feels flat, paste it back with one note: "dialogue feels flat here," "this character would never say that," "too on-the-nose." The machine adjusts the dialogue-cards and prose-rules from your notes; that's the feedback loop that makes it sound like you.
Happy-gate 2 (the big one): Reading this scene, are you genuinely happy? Not "impressed it works." Happy, the way you'd be with your own draft. If no, we iterate here, cheaply, until yes. Do not advance on "good enough."

3. Layer 3: A short complete work (live)

Layer 3

The whole machine, judged as a reading experience

Moderate · live

Only after Layer 2 makes you happy. A short story or novelette: long enough to exercise drafting-in-order, the per-piece Consigliere audits, the Boss's severity ladder, and the full six-inspector Cleanup Crew; short enough to read in one sitting and judge whole.

Do this
  1. Set the sandbox arrangement's length_target low; a 6,000–10,000 word novelette is ideal. (The lester-dent framework is literally the 6,000-word pulp formula, if you want a structure that fits this length, though it's a stub, so it'll plan generically.)
  2. Let it run the full pipeline, live, end to end.
    Now draft the whole short work, live, start to finish. Run the full Cleanup Crew at the end and bring me the flags to ratify.
What to watch for
  • Drafting holds order and continuity: scene 4 remembers what happened in scene 2. The Books (world-state) should be doing this.
  • The severity ladder behaves: small stuff is settled silently (Sit-downs), and only real breaks halt you (Do-it-over / Kick-upstairs). Check the job-log.md afterward to see what the Boss settled without bothering you.
  • The Cleanup Crew earns its keep: pacing flags, repetition, broken promises, the engagement curve. Are the flags real, or noise?
  • You read the finished thing as a reader. Did it pull you? The Cold Read inspector simulates this; your real read is the ground truth.
Happy-gate 3: Did the finished work hold together, did the Boss bother you only when it should have, and would you publish something built this way (under a real pen name)? If yes → Step 3 is done. You're ready to think about Step 4, the GUI.

· Your test material: the sandbox pen name

So you're not inventing test material under pressure, here's a ready-to-use throwaway. Nothing real is on the line; the point is to learn the machine's behavior. Delete the whole test-pen/ folder when you're done.

The pen name

slug:               test-pen
display name:       Quill Sandoval (a throwaway, not a real brand)
primary genre:      Contemporary mystery / suspense (low world-building overhead, fast to judge)
voice in one line:  Dry, observational, close third. Short sentences when tense. The narrator notices small physical details and trusts the reader to do the math.

The sandbox book โ€” a filled arrangement

A standalone, deliberately small. The hook is built to force a two-hander so Layer 2 has a scene to draft.

working_title:        The Locked Room Upstairs
pen_name:             test-pen
series_name:          (blank; standalone)
book_number:          (blank)
genre:                contemporary mystery
structure_framework:  lester-dent          # the 6,000-word formula; stub, plans generically
length_target:        8000                  # novelette; read in one sitting

hook:                 A house-sitter finds the one door she was told never
                      to open already unlocked, and the owner's sister at
                      the top of the stairs insisting she opened it herself.
audience:             Cozy-adjacent mystery readers who want tension over gore.
comp_titles:          The Guest List; The Maid

payoff_1:  The reader doubts both women equally before the truth lands.
payoff_2:  The "locked room" reveal recolors the opening scene on reread.
payoff_3:  The house-sitter trusts her own memory by the last page.

pov:            third limited (house-sitter)
tense:          past
heat_level:     closed-door
content_warnings: none load-bearing

voice_reference: (blank; inherit test-pen voice)
disclosure:      "This book was written with author-directed AI tools."
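If you want to sanity-check the arrangement before a run, the Python sketch below reads the "key: value" layout shown above and confirms length_target falls in the 6,000 to 10,000 word novelette range. It assumes the file on disk looks like the listing here (trailing # comments, indented continuation lines for the hook), which may not match the machine's real format. parse_arrangement and length_in_novelette_range are hypothetical helper names.

```python
import re

# A field line starts at column 0 with a lowercase key and a colon.
KEY_LINE = re.compile(r"^([a-z_0-9]+):\s*(.*)$")


def parse_arrangement(text: str) -> dict:
    """Parse 'key: value  # comment' lines; indented lines continue a value."""
    fields = {}
    last_key = None
    for raw in text.splitlines():
        line = raw.split(" #", 1)[0].rstrip()  # drop any trailing comment
        if not line.strip():
            continue
        match = KEY_LINE.match(line)
        if match:
            last_key = match.group(1)
            fields[last_key] = match.group(2).strip()
        elif last_key and raw[:1].isspace():
            # Continuation of a wrapped value such as the hook.
            fields[last_key] += " " + line.strip()
    return fields


def length_in_novelette_range(fields: dict, low: int = 6000, high: int = 10000) -> bool:
    """True when length_target sits inside the suggested novelette range."""
    return low <= int(fields["length_target"]) <= high
```

With the sandbox arrangement above, length_target parses as "8000", which sits inside the range.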

The built-in two-hander

The premise puts two women in a house with one contested fact between them. That's your Layer 2 shakedown scene handed to you: the house-sitter and the sister, top of the stairs, each certain the other is lying. When the Boss offers the shakedown, point it here.

When you're ready, I can write this pen name and arrangement straight onto disk under pen-names/test-pen/ so it's sitting there for Layer 0. Just say the word.

· The screenshot habit

Your Notion process calls screenshots "your universal translator," and EAW's margin note says she did it even when she thought she knew. For this machine specifically, the highest-value screenshots are:

You don't need to diagnose anything. Screenshot, paste, one sentence of what felt off. That's the whole job.

· Looking ahead: Step 4, the GUI

You asked to start thinking about the GUI once testing is underway. Here's the shape of it, so it's in your head while you run Layers 0–3.

The good news: Step 3 is also GUI design research. Every time the machine pauses for you during testing, you're discovering a screen the app will need. The five human moments are, almost exactly, the five screens of the eventual interface.

Human moment (the machine) | Becomes this in the app (Step 4)
Intake review | A "New Book" form: the arrangement fields as friendly inputs, with the three on-ramps as a mode picker.
Gate 1 (Foundation) | A review screen for premise, structure, and characters, with Approve / Send back buttons.
Gate 2 (Blueprint) | A second review screen for world / outline / voice / conflict map.
Drift = discovery | A notification plus a small decision card: "the draft found something better; keep it?"
Cleanup ratification | The richest screen: a flag-by-flag Accept / Reject / Revise queue.

What the GUI does and doesn't change

It changes how you talk to the machine: buttons and windows instead of typed instructions. It does not change the machine underneath; the forty-one skills, the Boss, and the severity ladder all keep running exactly as they do now. The Agent SDK (linked in your Notion doc) is the bridge that lets a little app window keep using your Claude subscription to drive those same skills.

One firm recommendation

Do not start the GUI until Happy-gate 2 passes. The doc's Step 4 says "once the command-line version works." If you build a beautiful interface on top of prose you're not happy with, you've decorated the wrong machine. Get the prose right first; the buttons are the easy part, and they'll be obvious once you know which moments matter.

While you test, jot one note per pause: "at this moment I wished I could just click X." That note list becomes the Step 4 spec, the same way your original conversation became the build spec in Step 1.