A four-layer plan for testing the Backroom before you trust it with a real book.
Your Notion process is blunt about this step: "This is where most people quit, and it's the most important step. Do not move on until the outputs make you genuinely happy."
The trap is judging the whole machine on one big run. If a 30,000-word run comes back wrong, you don't know which of the forty-one moving parts went wrong, and you've spent real money to learn that. So we do the opposite: four widening layers, smallest and cheapest first. Each layer either earns your confidence or shows you exactly where the seam is. The first two layers cost essentially nothing.
Two facts about your machine right now shape the plan. It's currently in live mode (no .mock-mode flag on disk), and pen-names/ is empty: no real book has ever run through it. So your first real run does two new things at once: it stands up a pen name and writes a book. We keep that first run tiny on purpose.
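Both facts can be confirmed with a small read-only check before any money is spent. The sketch below is illustrative and not part of the machine: it assumes the `.mock-mode` flag and the `pen-names/` folder sit directly under a project root you pass in, and the function name `machine_state` is hypothetical.

```python
from pathlib import Path


def machine_state(root: str) -> dict:
    """Report mode and existing pen names for a project root (read-only).

    Assumption: `.mock-mode` and `pen-names/` live directly under `root`.
    """
    base = Path(root)
    mock = (base / ".mock-mode").exists()
    pen_dir = base / "pen-names"
    # Count only sub-folders; a missing folder is treated like an empty one.
    pens = sorted(p.name for p in pen_dir.iterdir() if p.is_dir()) if pen_dir.is_dir() else []
    return {"mode": "mock" if mock else "live", "pen_names": pens}
```

On a machine in the state described above, this would report live mode and an empty list of pen names.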
The smoke test you already passed proved the plumbing connects. It ran in mock mode, so every model call returned placeholder text. You have not yet seen whether the prose is something you'd want to read. That is the entire job of Step 3.
What "the machine is ready" actually means
| Layer | What it tests | What you're judging | Cost |
|---|---|---|---|
| 0 | Plumbing | Do files land in the right place? Do the five human moments surface where they should? Does the Boss freeze on a halt? | $0 · mock |
| 1 | Pen-name setup | Does the machine correctly inherit your voice files before it ever tries to write? | Tiny · live |
| 2 | One shakedown scene | The real test. Is the prose something you'd write? Do two characters sound like different people? | Low · live |
| 3 | A short complete work | Drafting in order, the Consiglieres, the severity ladder, the full Cleanup Crew, judged as a whole reading experience. | Moderate · live |
Each layer ends with a happy-gate: a plain question you answer honestly before advancing. If the answer is no, we stop and fix that layer. You never carry a problem forward into a more expensive layer.
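The gate rule amounts to a loop that stops at the first honest "no". The sketch below only illustrates that sequencing: the layer names come from the table above, while `run_layers` and the `happy` mapping are hypothetical stand-ins for you reading the output and deciding.

```python
LAYERS = ["Plumbing", "Pen-name setup", "One shakedown scene", "A short complete work"]


def run_layers(happy: dict) -> str:
    """Walk the layers cheapest-first and stop at the first gate that fails.

    `happy` maps a layer name to your honest yes/no answer at its happy-gate.
    A layer with no answer yet counts as not passed.
    """
    for number, name in enumerate(LAYERS):
        if not happy.get(name, False):
            return f"Stop and fix Layer {number}: {name}"
    return "All four gates passed"
```

For example, with only the first two gates answered yes, `run_layers({"Plumbing": True, "Pen-name setup": True})` returns `"Stop and fix Layer 2: One shakedown scene"`.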
You already ran an 11-phase smoke test during the build. The point of repeating it now is that you watch it: not to test the code, but to learn the machine's rhythm before any money is on the line.
[MOCK].
pen-names/test-pen/standalones/, planning artifacts, a manuscript/ folder, a runs/ evidence trail.
Before the machine writes a single real sentence, it has to know whose voice it's writing in. This layer sets that up and confirms the machine reads it.
pen-name-profile, voice-bible, style-guide, voice-anchor, crutch-list, prose-rules, audience-profile).
Your machine already has a name for this: the "shakedown piece." The Boss is built to offer one representative scene as a pilot before committing to a full book. We use that, deliberately, with a two-hander tension scene: two characters, one room, conflicting wants.
A two-hander is the single most revealing voice test there is. It exercises three systems at once: dialogue cards (do the characters sound distinct?), interiority (does the POV character's inner life ring true?), and voice inheritance (does it sound like your pen name?), all without the higher bar of a spice or action beat clouding your read. If two characters sound like the same person wearing two name tags, you'll know in a paragraph.
Only after Layer 2 makes you happy. A short story or novelette: long enough to exercise drafting-in-order, the per-piece Consigliere audits, the Boss's severity ladder, and the full six-inspector Cleanup Crew; short enough to read in one sitting and judge whole.
length_target low: a 6,000 to 10,000 word novelette is ideal. (The lester-dent framework is literally the 6,000-word pulp formula, if you want a structure that fits this length, though it's a stub, so it'll plan generically.)
job-log.md afterward to see what the Boss settled without bothering you.
So you're not inventing test material under pressure, here's a ready-to-use throwaway. Nothing real is on the line; the point is to learn the machine's behavior. Delete the whole test-pen/ folder when you're done.
| slug | test-pen |
|---|---|
| display name | Quill Sandoval (a throwaway, not a real brand) |
| primary genre | Contemporary mystery / suspense (low world-building overhead, fast to judge) |
| voice in one line | Dry, observational, close third. Short sentences when tense. The narrator notices small physical details and trusts the reader to do the math. |
A standalone, deliberately small. The hook is built to force a two-hander so Layer 2 has a scene to draft.
working_title: The Locked Room Upstairs
pen_name: test-pen
series_name: (blank; standalone)
book_number: (blank)
genre: contemporary mystery
structure_framework: lester-dent # the 6,000-word formula; stub, plans generically
length_target: 8000 # novelette; read in one sitting
hook: A house-sitter finds the one door she was told never
to open already unlocked, and the owner's sister at
the top of the stairs insisting she opened it herself.
audience: Cozy-adjacent mystery readers who want tension over gore.
comp_titles: The Guest List; The Maid
payoff_1: The reader doubts both women equally before the truth lands.
payoff_2: The "locked room" reveal recolors the opening scene on reread.
payoff_3: The house-sitter trusts her own memory by the last page.
pov: third limited (house-sitter)
tense: past
heat_level: closed-door
content_warnings: none load-bearing
voice_reference: (blank; inherit test-pen voice)
disclosure: "This book was written with author-directed AI tools."
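A few of the arrangement's constraints can be sanity-checked before the machine ever sees it. The sketch below assumes the arrangement has already been loaded into a plain Python dict (how the machine itself parses the file is not covered here), and `check_arrangement` is a hypothetical helper. It checks only what this plan states: the core fields are filled in, and the length sits in the 6,000 to 10,000 word novelette range.

```python
def check_arrangement(arr: dict, low: int = 6000, high: int = 10000) -> list:
    """Return a list of problems; an empty list means the basics look sane."""
    problems = []
    for field in ("working_title", "pen_name", "hook", "payoff_1", "payoff_2", "payoff_3"):
        if not str(arr.get(field, "")).strip():
            problems.append(f"{field} is blank")
    length = arr.get("length_target")
    if not isinstance(length, int) or not low <= length <= high:
        problems.append(f"length_target should be between {low} and {high} words")
    return problems


arrangement = {
    "working_title": "The Locked Room Upstairs",
    "pen_name": "test-pen",
    "length_target": 8000,
    # Hook abridged for the example.
    "hook": "A house-sitter finds the one door she was told never to open already unlocked.",
    "payoff_1": "The reader doubts both women equally before the truth lands.",
    "payoff_2": 'The "locked room" reveal recolors the opening scene on reread.',
    "payoff_3": "The house-sitter trusts her own memory by the last page.",
}
print(check_arrangement(arrangement))
```

Keeping the check to a returned list of problems, rather than raising on the first one, lets you see everything that needs fixing in a single pass.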
The premise puts two women in a house with one contested fact between them. That's your Layer 2 shakedown scene handed to you: the house-sitter and the sister, top of the stairs, each certain the other is lying. When the Boss offers the shakedown, point it here.
When you're ready, I can write this pen name and arrangement straight onto disk under pen-names/test-pen/ so it's sitting there for Layer 0. Just say the word.
Your Notion process calls screenshots "your universal translator," and EAW's margin note says she did it even when she thought she knew. For this machine specifically, the highest-value screenshots are:
You don't need to diagnose anything. Screenshot, paste, one sentence of what felt off. That's the whole job.
You asked to start thinking about the GUI once testing is underway. Here's the shape of it, so it's in your head while you run Layers 0 through 3.
The good news: Step 3 is also GUI design research. Every time the machine pauses for you during testing, you're discovering a screen the app will need. The five human moments are, almost exactly, the five screens of the eventual interface.
| Human moment (the machine) | Becomes this in the app (Step 4) |
|---|---|
| Intake review | A "New Book" form: the arrangement fields as friendly inputs, with the three on-ramps as a mode picker. |
| Gate 1: Foundation | A review screen (premise, structure, characters) with Approve / Send back buttons. |
| Gate 2: Blueprint | A second review screen for world / outline / voice / conflict map. |
| Drift = discovery | A notification + a small decision card: "the draft found something better. Keep it?" |
| Cleanup ratification | The richest screen: a flag-by-flag Accept / Reject / Revise queue. |
The GUI changes how you talk to the machine: buttons and windows instead of typed instructions. It does not change the machine underneath; the forty-one skills, the Boss, and the severity ladder all keep running exactly as they do now. The Agent SDK (linked in your Notion doc) is the bridge that lets a little app window keep using your Claude subscription to drive those same skills.
Do not start the GUI until Happy-gate 2 passes. The doc's Step 4 says "once the command-line version works." If you build a beautiful interface on top of prose you're not happy with, you've decorated the wrong machine. Get the prose right first; the buttons are the easy part, and they'll be obvious once you know which moments matter.
While you test, jot one note per pause: "at this moment I wished I could just click X." That note list becomes the Step 4 spec, the same way your original conversation became the build spec in Step 1.