Jun 9, 2026

Brand Voice Guidelines for AI: Write a Spec a Model Can Actually Follow

Most brand voice docs were written for humans who already knew the brand. Models don't. Here's the spec that closes the gap.

TL;DR

  • Brand voice guidelines for AI must be a spec, not a vibe deck — models read instructions, not intent.

  • Build around four load-bearing parts: archetype stack, tunable dials, lexicon, banned phrases.

  • Golden samples do 60-80% of the work; rules do the rest.

  • Wire the spec into the pipeline with a voice-lint gate before any human editor sees a draft.

  • Re-test on every model upgrade. Voice drift is real and silent.

1. Why "friendly but professional" fails as an AI instruction

If your brand voice document still says friendly but professional, approachable yet authoritative, you do not have brand voice guidelines for AI. You have a mood board.

A model reading that instruction has roughly 40,000 examples of what "friendly but professional" looks like across its training data. Almost none of them sound like you. The output regresses to the mean of B2B SaaS blog posts circa 2023, because that is where the gravity is.

Here is the gap in numbers. 89% of B2B marketers now use AI-powered tools for generating or optimizing marketing content, per the CMI/MarketingProfs 2026 B2B Content Marketing Benchmarks report (n=1,015, published December 2025). Meanwhile, the Lucidpress/Marq State of Brand Consistency study found consistent brand presentation associated with revenue increases of up to 33%, with 81% of companies reporting they regularly deal with off-brand content.

Net of that: nearly nine in ten B2B teams are pumping copy through models that cannot reliably hold their voice, while the cost of off-brand content sits directly on the revenue line. The fix is not better prompts. The fix is treating voice as an engineering spec a model can actually follow.

2. Anatomy of an AI-ready voice guide

A voice spec that survives contact with a model has four load-bearing parts and one optional one. Each part exists because models fail in a specific, predictable way without it.

Archetype stack. Three to eight reference personalities whose mechanics you want the model to internalize, with one-line descriptors of what each contributes. Not "we're like Apple." More like: Tim Cook for operational calm and numbers-with-context, Nolan's Batman for terse mission-driven closes, the MBB consultant for Pyramid-principle structure. Archetypes give the model a directional pull when a sentence is ambiguous.

Tunable dials. Five to nine numeric levers, each on a 1-10 scale, with definitions of what each end means. Common ones: technicality, swagger, empathy, urgency, playfulness, authority, velocity. Dials let one spec produce a cold email and a 2,000-word teardown without rewriting the system prompt.

Lexicon. Words and connective phrases you actively use, grouped by function. Operator connectives ("Look —", "Here's the thing", "Quick context"), domain primitives ("throughput", "decision rights", "surface area"), and stance words. The lexicon does more than a tone description ever will, because it gives the model concrete tokens to reach for.

Banned phrase list. Hard blocks. Cliché AI tells (synergy, unprecedented, empower as verb), agency fluff (premier, award-winning), sentence patterns you've outgrown (the not-X-but-Y contrast pair). This is the single highest-ROI section of the spec. A 60-item banned list cuts perceived AI-ness more than any positive instruction.

Optional: rhetorical devices. If your voice has signature moves — setup-snap-sit, antithesis couplets, triadic reveals — name them and show one example each. Most teams skip this and lose 10-15% of their distinctiveness.
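One way to keep those parts honest is to hold them in a typed structure instead of a document. The sketch below is a minimal illustration, not the format any particular tool expects: the class names, field names, and the `validate` helper are all hypothetical, and the size checks simply mirror the ranges given above.

```python
from dataclasses import dataclass, field


@dataclass
class Dial:
    """One tunable lever, with a description of each end of the 1-10 scale."""
    name: str
    low: str   # what a 1 sounds like
    high: str  # what a 10 sounds like


@dataclass
class VoiceSpec:
    """Hypothetical container for the four load-bearing parts plus the optional one."""
    archetypes: dict[str, str]     # reference personality -> what it contributes
    dials: list[Dial]
    lexicon: dict[str, list[str]]  # function (e.g. "connectives") -> phrases
    banned: list[str]
    devices: dict[str, str] = field(default_factory=dict)  # optional: device -> example

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the size ranges above are met."""
        problems = []
        if not 3 <= len(self.archetypes) <= 8:
            problems.append("archetype stack should hold 3 to 8 entries")
        if not 5 <= len(self.dials) <= 9:
            problems.append("spec should define 5 to 9 dials")
        if not self.banned:
            problems.append("banned phrase list is empty")
        return problems
```

A structure like this can be serialized into the system prompt and checked in CI, so a spec that quietly loses its banned list fails loudly instead of shipping.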

3. Golden samples: the few-shot anchors

Rules tell the model what not to do. Samples tell it what to do. In our client work, three to five well-chosen golden samples outperform 2,000 words of voice description by a wide margin.

The selection criteria matter more than the count. Each sample should be a paragraph (120-220 words) that hits three things at once: a recognizable rhetorical move, the right dial settings, and at least two lexicon items in natural use. Pull them from real published work that performed — not aspirational drafts.

Label each sample with its context, for example: cold email, 7-empathy 5-swagger, late-stage prospect. Models given unlabeled samples tend to treat them as a single average; labeled samples let the model pattern-match by situation. If you want to train AI on brand voice in any durable way, this is the lever: not fine-tuning, not a vector store, just three labeled paragraphs at the top of the system prompt.

A practical note from shipping content pipelines for the last 14 months: rotate samples quarterly. Static anchors cause output to homogenize toward the specific cadence of those three paragraphs, which makes everything sound like a remix of the same essay.
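The labeling convention above is easy to mechanize. The following is a rough sketch under stated assumptions: `GoldenSample`, `render_samples`, and `in_length_range` are hypothetical names, the bracketed header format is an illustration rather than a convention any model requires, and the 120-220 word bounds come from the selection criteria in this section.

```python
from dataclasses import dataclass


@dataclass
class GoldenSample:
    context: str   # e.g. "cold email"
    dials: str     # e.g. "7-empathy 5-swagger"
    audience: str  # e.g. "late-stage prospect"
    text: str


def in_length_range(sample: GoldenSample, low: int = 120, high: int = 220) -> bool:
    """Check the sample against the paragraph length suggested in this section."""
    return low <= len(sample.text.split()) <= high


def render_samples(samples: list[GoldenSample]) -> str:
    """Render labeled samples as a block for the top of a system prompt."""
    blocks = []
    for i, s in enumerate(samples, start=1):
        header = f"[Sample {i} | {s.context} | {s.dials} | {s.audience}]"
        blocks.append(f"{header}\n{s.text.strip()}")
    return "\n\n".join(blocks)
```

Keeping samples as data also makes the quarterly rotation a one-line change to a list rather than an edit to a long prompt string.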

4. Wiring the guide into the pipeline

A voice spec that lives in a Google Doc affects exactly zero pieces of content. The spec has to become code — or at least configuration that something downstream reads on every run.


Voice spec wired into content pipeline with drift-detection feedback loop

The voice-lint gate runs before any human editor sees the draft. Drift signals feed back into the spec.

The critical link in that chain is the voice-lint gate. It is an automated check (deterministic regex rules plus a small LLM judge) that runs before a human ever sees the draft. It catches banned phrases, em-dash overuse, opener templates that repeat, and dial drift (e.g., requested swagger=3 but output reads at 7). If you want AI brand voice consistency that holds across 200 pieces of content per quarter, the lint gate is what makes it possible. A human editor will catch the first 20. By piece 50, they're tired.

We wrote more about the gate architecture in AI content pipeline QA gates, and the editor-side workflow in how to edit AI-generated content. The short version: the gate kills 35-45% of drafts on first pass in our experience, which sounds painful and is actually the point. Drafts that fail the gate would have failed the editor anyway, 90 minutes later.
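For a sense of what the deterministic half of such a gate might look like, here is a minimal sketch. It covers banned phrases, em-dash density, and repeated paragraph openers only; the LLM judge for dial drift is left out. The function name, the sample banned list, and both thresholds are assumptions for illustration, not values from a production gate.

```python
import re
from collections import Counter

# Sample entries taken from the banned-list examples earlier in this article.
BANNED = ["synergy", "unprecedented", "premier", "award-winning"]
MAX_EM_DASHES_PER_1000_WORDS = 5  # hypothetical threshold
OPENER_WORDS = 3                  # hypothetical: compare the first three words of each paragraph


def lint(draft: str, banned: list[str] = BANNED) -> list[str]:
    """Return a list of failures; an empty list means the draft passes the gate."""
    failures = []
    words = len(draft.split())

    # Banned phrases: whole-word, case-insensitive matches.
    for phrase in banned:
        pattern = r"\b" + re.escape(phrase) + r"\b"
        hits = len(re.findall(pattern, draft, flags=re.IGNORECASE))
        if hits:
            failures.append(f"banned phrase '{phrase}' x{hits}")

    # Em-dash density, normalized per 1,000 words.
    dashes = draft.count("\u2014")
    if words and dashes * 1000 / words > MAX_EM_DASHES_PER_1000_WORDS:
        failures.append(f"em-dash overuse: {dashes} in {words} words")

    # Repeated openers: paragraphs that start with the same few words.
    openers = Counter(
        " ".join(p.split()[:OPENER_WORDS]).lower()
        for p in draft.split("\n\n")
        if p.strip()
    )
    for opener, count in openers.items():
        if count > 1:
            failures.append(f"repeated opener '{opener}' x{count}")

    return failures
```

Because every check here is a pure function of the text, the same draft always gets the same verdict, which is what lets the gate run unattended ahead of the editor.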

5. Measuring voice drift across models

Voice drift is the thing nobody warns you about until you've shipped under it for six months. Same spec, same prompt, different model — and the output reads 20% off. Sometimes the new model is better at instruction-following and exposes lazy spec writing. Sometimes it has different baseline cadences (more em-dashes, shorter paragraphs, a particular fondness for the word crucial).

In practice we track three drift signals:

First, banned-phrase hit rate per 1,000 words. If it climbs above roughly 2 after a model switch on a spec that previously ran clean, the new model has different defaults and the banned list needs an update.

Second, dial accuracy. Pull 20 outputs, have two human raters score them on the requested dials, compare to the target. We aim for ±1.5 on a 10-point scale. Anything wider means the dial definitions are too abstract and need concrete behavioral anchors.

Third, golden-sample similarity. Embed the output and the golden samples, measure cosine similarity. Watch the trend, not the absolute number — a 15% drop month-over-month means the spec is decaying.

These are not academic measurements. They are how you catch the moment a content marketing program starts sounding like everyone else's content marketing program.
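All three signals reduce to small calculations. The sketch below shows one plausible way to compute them; the function names are hypothetical, the human ratings are assumed to arrive as one dictionary per rater per output, and the embeddings are assumed to come from whatever embedding model you already use (none is called here).

```python
import math
import re


def banned_hits_per_1000_words(text: str, banned: list[str]) -> float:
    """Signal 1: banned-phrase matches, normalized per 1,000 words."""
    words = len(text.split())
    if words == 0:
        return 0.0
    hits = sum(
        len(re.findall(r"\b" + re.escape(p) + r"\b", text, flags=re.IGNORECASE))
        for p in banned
    )
    return hits * 1000 / words


def mean_dial_error(target: dict[str, float], ratings: list[dict[str, float]]) -> float:
    """Signal 2: mean absolute gap between requested dials and human scores."""
    gaps = [abs(r[name] - target[name]) for r in ratings for name in target]
    return sum(gaps) / len(gaps)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Signal 3: similarity between an output embedding and a golden-sample embedding."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Compared against the targets above, a hit rate over roughly 2 or a mean dial error over 1.5 would be the trigger to look at the spec; for similarity, store the value per run and compare against the previous month rather than a fixed cutoff.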

6. Keeping the guide alive

A voice spec is a living artifact. Treat it like one.

Version it in git. Tag every model upgrade in the changelog. When a new model rolls out (which has been roughly every 8-12 weeks across the major labs in 2025-2026), run the spec against a fixed 10-piece evaluation set before switching production traffic. Compare to the prior model's output on the same prompts. If anything material shifts, update the spec before you update the model in production, not after.
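The pre-switch comparison can be as simple as diffing aggregate metrics from the evaluation set. This is a hedged sketch: `material_shifts` is a hypothetical helper, the metric names are placeholders, and the 10% tolerance is an assumed stand-in for whatever "material" means for your brand.

```python
def material_shifts(
    prior: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.10,  # assumed definition of "material": more than 10% relative change
) -> dict[str, float]:
    """Return the metrics whose relative change between models exceeds the tolerance."""
    shifts = {}
    for name, before in prior.items():
        after = candidate[name]
        # Avoid dividing by zero when the prior model scored exactly 0.
        baseline = abs(before) if before else 1.0
        change = (after - before) / baseline
        if abs(change) > tolerance:
            shifts[name] = round(change, 3)
    return shifts
```

An empty result is the signal that production traffic can move; anything else points at the part of the spec to revise before the switch.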

Quarterly, do a banned-phrase refresh. The AI-tell vocabulary shifts as models update: words that were tells in 2024 (synergy, unprecedented) are now table stakes to block, and new tells emerge. Add three to five each quarter and retire ones that no longer appear in output.

Annually, revisit archetypes and dials. Brands evolve. If your audience moved upmarket, the swagger dial probably needs a different definition at 7 than it did 18 months ago.

7. Worked example: an 8-archetype, 7-dial spec in production

Here is the shape of the spec we run internally at Entropy, lightly redacted.

Archetypes (8): Jobs for simplicity and dramatic reveal; Tim Cook for operational calm; Belfort (ethics-stripped) for Straight Line Persuasion mechanics; Stark for sardonic confidence and the parenthetical aside; Nolan's Batman for terse mission-driven closes; MBB consultant for Pyramid-principle structure; best broker for Cialdini's seven; best marketer for Hormozi's Value Equation and Schwartz's awareness stages.

Dials (7), each 1-10: technicality, swagger, empathy, urgency, playfulness, authority, velocity. Each dial has a one-sentence definition at 1, 5, and 10. A blog post like this one runs T7-S5-E7-U2-P3-A7-V5. A cold email runs T4-S6-E8-U4-P4-A6-V8. Same spec, different dials, different output.
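Compact dial codes like the ones above are straightforward to parse into settings a pipeline can read. The sketch below assumes each letter maps to the dial with the same initial, in the order listed; that mapping and the `parse_dials` name are illustrative assumptions rather than a documented format.

```python
import re

# Assumed mapping from code letter to dial name, following the order in the text.
DIAL_KEYS = {
    "T": "technicality",
    "S": "swagger",
    "E": "empathy",
    "U": "urgency",
    "P": "playfulness",
    "A": "authority",
    "V": "velocity",
}


def parse_dials(code: str) -> dict[str, int]:
    """Turn a code such as 'T7-S5-E7-U2-P3-A7-V5' into named dial settings."""
    dials = {}
    for token in code.split("-"):
        match = re.fullmatch(r"([A-Z])(\d{1,2})", token)
        if not match or match.group(1) not in DIAL_KEYS:
            raise ValueError(f"unrecognized dial token: {token!r}")
        value = int(match.group(2))
        if not 1 <= value <= 10:
            raise ValueError(f"dial value out of range in {token!r}")
        dials[DIAL_KEYS[match.group(1)]] = value
    missing = set(DIAL_KEYS.values()) - dials.keys()
    if missing:
        raise ValueError(f"missing dials: {sorted(missing)}")
    return dials
```

Parsing strictly, and rejecting codes with missing or out-of-range dials, keeps a typo in a content brief from silently falling back to the model's default voice.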

Lexicon: roughly 60 words and phrases, grouped into operator vocabulary, operator connectives, and hype words that are allowed only when paired with mechanism in the same sentence.

Banned list: 140 entries across AI-tell vocabulary, agency fluff, hype openers, false-precision tells, and closing slop.

Golden samples: three labeled paragraphs covering founder thought leadership, sales cold email, and educational SEO content.

The whole spec is roughly 4,800 words. It loads into the system prompt of every content generation run. It is the reason content from this pipeline reads as Entropy and not as the average of the internet.

If you want a brand voice prompt template to start from, this is the skeleton: four sections plus golden samples, dials at the top, banned list at the bottom, archetypes and lexicon in the middle.

FAQ

What's the difference between a brand voice guide and brand voice guidelines for AI?

A traditional voice guide describes tone for humans who already absorb context from working at the company. Brand voice guidelines for AI specify mechanics a model can execute: archetype stacks, numeric dials, exact lexicon, banned phrases, and labeled golden samples. The first is a vibe document. The second is a configuration file that produces consistent output across thousands of runs.

How do I train AI on brand voice without fine-tuning?

Few-shot prompting with three to five labeled golden samples in the system prompt outperforms fine-tuning for most brand voice use cases, at roughly 1% of the cost. Fine-tuning makes sense once you've exhausted prompt engineering, have over 1,000 high-quality samples, and need latency or cost savings at high volume. Most teams never get there.

What does a usable brand voice prompt template include?

Four mandatory sections plus samples: an archetype stack (3-8 reference personalities with one-line descriptors), tunable dials (5-9 numeric levers on a 1-10 scale), a lexicon (40-80 words and connectives you actively use), and a banned phrase list (100-200 entries). Add three labeled golden samples at the top. Total length typically 3,000-6,000 words.

How do I measure AI brand voice consistency at scale?

Track three signals weekly: banned-phrase hit rate per 1,000 words (target under 2), dial accuracy via human rating against requested dial settings (target ±1.5 on a 10-point scale), and cosine similarity between output embeddings and golden-sample embeddings (watch the trend). If any signal degrades for two weeks running, the spec or the model needs an update.

How often should I update brand voice guidelines for AI?

Quarterly for the banned-phrase list, since AI-tell vocabulary shifts as models update. Per model upgrade (every 8-12 weeks across major labs in 2025-2026) for dial definitions and golden samples, tested against a fixed 10-piece evaluation set. Annually for archetypes and dial scales, which evolve with the brand itself.

Ship the spec this week

Pick one piece of content your team published in the last 30 days that you'd send to a prospect without flinching. Reverse-engineer the spec from it: name three archetypes it leans on, set seven dials, list 20 lexicon words it uses, and write down 30 phrases you'd ban on sight. That's your v0.1. Run it against next week's draft. Iterate from there.

When you're ready to wire the spec into a pipeline with the lint gate and drift monitoring, come talk to us.


Comparison of three approaches to brand voice in AI content

What separates a voice document from voice infrastructure

© All rights reserved
