system-prompt-design

This skill should be used when the user asks to "write a system prompt", "design a system prompt", "improve my system prompt", "structure a system prompt", "build a system prompt for an agent", "optimize my agent prompt", "write instructions for an AI agent", "create agent instructions", "design an LLM system prompt", or any variation of writing or improving system prompts for AI agents in B2B SaaS GTM.
System Prompt Design

A system prompt is the instruction set that determines how an AI agent behaves. A good system prompt produces consistent, high-quality output across hundreds of runs. A bad system prompt produces output that varies on every run, hallucinates facts, ignores rules, and needs constant human editing.

The principle: a system prompt is a specification, not a suggestion. Write it like a contract. Every rule, every constraint, every example exists because it prevents a specific failure mode you've observed. Vague instructions produce vague output.

System Prompt Structure

The 7 sections (in order)

| Section | What it does | Required? |
|---|---|---|
| Role | Who the agent is and what it does (one sentence) | Yes |
| Context | What the agent needs to know about the environment | Yes |
| Task | The specific job, step by step | Yes |
| Rules | Hard constraints the agent must follow | Yes |
| Output format | Exact schema or structure for the output | Yes |
| Examples | 2-4 examples of ideal output | Yes (for quality-critical tasks) |
| Anti-patterns | What to never do, with reasons | Recommended |
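
To make the ordering concrete, here is a minimal skeleton expressed as a Python string constant so it can be assembled and versioned in code. Every section body is a placeholder fragment drawn from the examples later in this guide, not finished prompt content.

```python
# A minimal skeleton of the seven-section structure, assuming the cold email
# agent used as the running example in this guide. Each section body is a
# placeholder to replace, not production content.
SYSTEM_PROMPT_V1 = """\
ROLE
You are a cold email writer for B2B SaaS companies targeting VP-level buyers.

CONTEXT
ICP: SaaS companies, 50-500 employees. Positioning: <one sentence>.
Tone: peer-to-peer, casual, direct. Banned phrases: <list>.

TASK
1. Read the prospect data provided in the input.
2. Identify the single strongest signal.
3. Write the opening line using the signal.

RULES
RULE: Never exceed 80 words in Email 1.
REASON: Emails over 80 words drop reply rate.

OUTPUT FORMAT
<exact JSON schema>

EXAMPLES
<2-4 annotated input/output pairs, including one sparse-input edge case>

ANTI-PATTERNS
NEVER: <specific bad behavior>
INSTEAD: <what to do>
"""
```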

Section rules

  • Role comes first. One sentence. "You are a B2B cold email writer for SaaS companies targeting VP-level buyers." Not a paragraph. Not a personality description. Not "You are a helpful assistant"
  • Context is factual, not motivational. Include ICP details, product positioning, and constraints. Exclude "You are the best email writer" or "You take pride in your work." Flattery doesn't improve output
  • Task is step-by-step. Number the steps. Each step is one action. "1. Read the prospect data. 2. Identify the strongest signal. 3. Write the opening line using the signal." Not "Write a great email"
  • Rules are absolute. "Never exceed 80 words" not "Try to keep it under 80 words." "Never use em-dashes" not "Avoid em-dashes when possible." Hedged rules get ignored
  • Output format is a schema. Show the exact structure. Field names, types, constraints. Don't describe it in prose. Show it in code
  • Examples are the most powerful section. Two good examples teach more than 500 words of instructions. Always include at least 2 examples for quality-critical output

Writing Each Section

Role

One sentence. Three components: what you are, who you serve, what you produce.

Good:

You are a cold email writer for B2B SaaS companies.
You write 3-email sequences targeting VP and Director-level
buyers at companies with 50-500 employees.

Bad:

You are an expert AI assistant specializing in crafting
compelling, personalized outbound email communications
that drive engagement and conversions for high-growth
B2B SaaS organizations.

The bad version has more words and less information. Cut adjectives. State facts.

Context

Include only what the agent needs to do its job. Nothing else.

| Include | Exclude |
|---|---|
| ICP definition (industry, size, titles) | Company history or mission |
| Product positioning (one sentence) | Feature lists |
| Competitive context (if relevant to the task) | Marketing copy about the product |
| Tone guidelines (peer-to-peer, casual, direct) | Personality traits ("enthusiastic", "warm") |
| Known constraints (word limits, banned phrases) | Motivational statements |

Rule: If you remove a context sentence and the output quality doesn't change, the sentence was noise. Cut it.
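
That rule implies a mechanical test. A sketch of the idea, assuming a hypothetical `score_on_golden_set(prompt)` function that runs the agent on your golden set and returns an average quality score; both it and `build_prompt` are placeholders for your own eval stack, not real APIs.

```python
# Context ablation: drop one sentence at a time and re-score. Any sentence
# whose removal does not hurt the score was noise. build_prompt and
# score_on_golden_set are placeholders for your own stack.
def find_noise_sentences(context_sentences, build_prompt, score_on_golden_set):
    baseline = score_on_golden_set(build_prompt(context_sentences))
    noise = []
    for i, sentence in enumerate(context_sentences):
        ablated = context_sentences[:i] + context_sentences[i + 1:]
        if score_on_golden_set(build_prompt(ablated)) >= baseline:
            noise.append(sentence)  # removing it didn't hurt: cut it
    return noise
```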

Task

Number every step. Each step is one verb.

Good:

1. Read the prospect data provided in the input
2. Identify the single strongest signal (most recent,
   most specific, most relevant to our ICP)
3. Write a problem hypothesis: why does this signal
   suggest they have a problem we solve?
4. Write Email 1 using the signal and hypothesis
5. Write Email 2 with a different angle and one proof point
6. Write Email 3 as a clean breakup (30 words max)
7. Count words in each email. If over limit, cut the
   least essential sentence. Do not compress

Bad:

Write a compelling 3-email cold outbound sequence that
leverages the prospect's recent activity and company
signals to create personalized, engaging emails that
drive replies and meetings.

The bad version tells the agent the goal but not how to get there. The agent will interpret "compelling" and "engaging" differently on every run.

Rules

Rules are the guardrails that prevent the failure modes you've already seen. Every rule exists because of a past failure.

Structure each rule as:

RULE: [absolute statement]
REASON: [what goes wrong without this rule]

Examples:

RULE: Never exceed 80 words in Email 1
REASON: Emails over 80 words drop reply rate by 40%

RULE: Never start an email with "I"
REASON: Self-focused openers signal sales automation

RULE: Never reference a fact not present in the input data
REASON: Fabricated facts (hallucinations) destroy credibility
   permanently. If a field is missing, skip it

RULE: Always use lowercase subject lines, 5 words max
REASON: Title Case subject lines signal marketing automation
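
Because each rule is absolute, most of them translate into deterministic post-checks you can run on every output. A minimal sketch, assuming the JSON output schema shown in the Output format section below; the accuracy rule (no fabricated facts) is the exception and still needs an LLM judge or human review.

```python
# Deterministic checks for the format rules above. The dict shape follows
# the output schema in the Output format section.
def check_rules(output: dict) -> list[str]:
    violations = []
    subject = output["subject_line"]
    body = output["email_1"]["body"]
    if len(body.split()) > 80:
        violations.append("Email 1 exceeds 80 words")
    if body.lstrip().startswith("I "):
        violations.append("Email 1 starts with 'I'")
    if subject != subject.lower():
        violations.append("Subject line is not lowercase")
    if len(subject.split()) > 5:
        violations.append("Subject line is over 5 words")
    return violations  # empty list means compliant
```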

Rule-writing rules:

  • Use "never" and "always", not "try to" and "avoid". Soft rules get broken. Hard rules get followed
  • Include the reason. Agents follow rules better when they understand why. "Never use em-dashes" is weaker than "Never use em-dashes. Em-dashes are the most common AI-generated text marker and immediately signal to the reader that the email was written by AI"
  • Group rules by category. Format rules, content rules, accuracy rules. Don't mix them
  • 10-20 rules maximum. More than 20 rules and the agent starts deprioritizing. If you need more, your task is too complex for one agent. Split into multiple agents

Output format

Show the exact schema. Don't describe it. Show it.

Good:

{
  "subject_line": "string, max 5 words, lowercase",
  "email_1": {
    "body": "string, max 80 words",
    "word_count": "integer"
  },
  "email_2": {
    "body": "string, max 90 words",
    "word_count": "integer"
  },
  "email_3": {
    "body": "string, max 30 words",
    "word_count": "integer"
  }
}

Bad:

Please output the emails in a clean format with
the subject line first, followed by each email body.

"Clean format" means something different to the model on every run. A schema is deterministic.

Examples

The most powerful section. Two examples do more than 500 words of rules.

Example structure:

INPUT:
{prospect data}

IDEAL OUTPUT:
{exactly what you want the agent to produce}

WHY THIS IS GOOD:
- Signal is specific and recent (RevOps hire, 3 days ago)
- Problem hypothesis connects signal to pain
- Under 80 words
- No banned phrases
- Subject line is 3 words, lowercase

Example rules:

  • Include 2-4 examples. One is not enough (the agent overfits to it). More than 4 adds tokens without improving quality
  • Include one edge case. An example where the input data is sparse. Show the agent how to handle missing fields gracefully
  • Show the WHY. Annotate what makes each example good. The agent learns the criteria, not just the pattern
  • Use real data (anonymized). Fake examples ("Acme Corp, 50 employees") produce generic output. Real examples ("Ramp, 180 employees, Series B, hiring RevOps") produce specific output
  • Update examples quarterly. Stale examples produce stale output. Refresh with recent successful emails and current ICP data

Anti-patterns

Tell the agent what NOT to do. This prevents the most common failure modes.

Structure:

NEVER: [specific bad behavior]
INSTEAD: [what to do]
EXAMPLE OF BAD: [concrete example of the failure]

Examples:

NEVER: Start with "I noticed..." or "I came across..."
INSTEAD: Start with the signal itself or the prospect's name
EXAMPLE OF BAD: "I noticed your company recently raised a Series B"
EXAMPLE OF GOOD: "Congrats on the Series B. The scaling
  chaos usually hits around month 3"

NEVER: Use vague proof points ("many companies", "great results")
INSTEAD: Name a company and a number
EXAMPLE OF BAD: "We've helped many companies improve their pipeline"
EXAMPLE OF GOOD: "[Similar co] cut SDR ramp from 90 to 45 days"

Prompt Length vs. Quality

| Prompt length | When it works | When it fails |
|---|---|---|
| Short (< 500 tokens) | Simple, well-defined tasks. Classification. Extraction | Complex generation. The agent fills gaps with assumptions |
| Medium (500-1,500 tokens) | Most GTM tasks. Enough room for rules + 2 examples | None, if well-structured |
| Long (1,500-3,000 tokens) | Complex tasks with many constraints and edge cases | When length comes from verbosity, not information density |
| Very long (3,000+ tokens) | Multi-step tasks with detailed examples | Rule conflicts. Agent deprioritizes late rules. Split into multiple agents instead |
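
These buckets are measured in tokens, not words. A minimal counting sketch, assuming OpenAI-style tokenization via tiktoken; other model families tokenize differently.

```python
import tiktoken

def token_count(prompt: str) -> int:
    enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-era encoding
    return len(enc.encode(prompt))

# Usage: if token_count(prompt) > 2000, question whether it's one agent's job.
```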

Length rules:

  • Density over length. A 500-token prompt with 10 clear rules beats a 2,000-token prompt with the same 10 rules buried in prose
  • If the prompt exceeds 2,000 tokens, question whether it's one agent's job. A prompt that tries to research, score, write, and format is doing 4 jobs. Split it
  • Rules at the end get deprioritized. Put the most critical rules early. Word count limits, accuracy rules, and banned phrases go in the first half of the rules section

Iterating on Prompts

The eval-driven iteration cycle

1. Write v1 of the prompt
2. Run on 20 test inputs (golden set)
3. Score outputs: accuracy, compliance, quality
4. Identify the weakest dimension
5. Change ONE thing in the prompt to address it
6. Re-run on the SAME 20 inputs
7. Compare: did the weak dimension improve?
   Did other dimensions regress?
8. If improved without regression: ship as v2
9. If regressed elsewhere: revert and try a different fix
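
A minimal harness for steps 2-7 of the cycle, assuming two placeholder functions for your own stack: `run_agent(prompt, item)` returns the agent's output, and `score(output)` returns a dict of dimension scores (accuracy, compliance, quality).

```python
# Run one prompt version over the golden set and average each scoring
# dimension; then diff two versions on the same set.
def eval_prompt(prompt, golden_set, run_agent, score):
    results = [score(run_agent(prompt, item)) for item in golden_set]
    return {dim: sum(r[dim] for r in results) / len(results)
            for dim in results[0]}

def compare_versions(v1, v2, golden_set, run_agent, score):
    before = eval_prompt(v1, golden_set, run_agent, score)
    after = eval_prompt(v2, golden_set, run_agent, score)
    # positive = improved, negative = regressed on that dimension
    return {dim: round(after[dim] - before[dim], 3) for dim in before}
```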

Common iteration patterns

| Problem | Evidence | Prompt fix |
|---|---|---|
| Output too long | 30% exceed word limit | Add explicit counting instruction: "Count words before outputting. If over N, cut the least essential sentence" |
| Hallucinated facts | 5% of outputs cite facts not in input | Strengthen anti-hallucination rule. Add: "ONLY reference facts from the input. If a field is missing, omit it. NEVER fabricate" |
| Generic output | LLM judge scores naturalness 3.2/5 | Add better examples of specific vs generic. Show what "specific" looks like |
| Tone too formal | Emails read like cover letters | Add: "Write like a text to a colleague, not a business letter. Short sentences. No compound clauses" |
| Same opener every time | 80% start with "Congrats on..." | Add 4 different opener patterns in examples. Add rule: "Vary opener style. Never use the same pattern twice in a batch" |
| Ignoring a rule | Em-dashes appear despite ban | Move the rule higher in the prompt. Make it more explicit. Add it to anti-patterns with a bad example |
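
For the first row, the counting instruction helps, but a deterministic guard outside the model is more reliable. A sketch, reusing the `run_agent` placeholder from the harness above and assuming it returns the parsed output dict:

```python
# Reject over-limit outputs and retry with explicit feedback appended to
# the prompt; return None (flag for human review) after max_retries failures.
LIMITS = {"email_1": 80, "email_2": 90, "email_3": 30}

def generate_within_limits(prompt, item, run_agent, max_retries=2):
    for _ in range(max_retries + 1):
        output = run_agent(prompt, item)
        over = [key for key, limit in LIMITS.items()
                if len(output[key]["body"].split()) > limit]
        if not over:
            return output
        prompt += f"\nYour last attempt broke the word limit in: {over}. Cut whole sentences."
    return None
```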

Iteration rules

  • One change per iteration. Changing 3 things makes it impossible to know what helped. Discipline matters more than speed
  • Always test against the same golden set. Consistency is the only way to measure improvement
  • Log every version. v1: accuracy 91%, quality 3.8. v2: accuracy 94%, quality 4.1. The version log is your progress record
  • Don't optimize for the test set. If the prompt scores 100% on the golden set but poorly on new production inputs, the golden set isn't representative. Add diverse examples

Pre-Deploy Checklist

Before putting a system prompt into production:

  • [ ] Role is one sentence stating what the agent does
  • [ ] Context includes only information the agent needs
  • [ ] Task is numbered step-by-step
  • [ ] Every rule uses absolute language (never/always, not try/avoid)
  • [ ] Output format is a schema, not prose description
  • [ ] At least 2 examples included (with annotations)
  • [ ] At least 1 edge-case example (sparse input data)
  • [ ] Anti-patterns section lists the top 5 failure modes
  • [ ] Prompt tested on 20+ golden set inputs
  • [ ] Accuracy > 95% on golden set
  • [ ] Compliance 100% (all rules followed)
  • [ ] Quality score > 4.0/5 from human review
  • [ ] Prompt version number documented with eval results
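
The last four items reduce to a mechanical gate, assuming the averaged golden-set results from the harness sketched earlier (accuracy and compliance as fractions, quality on a 1-5 scale).

```python
# Block deployment unless the golden-set results clear every threshold.
def ready_to_deploy(avg: dict) -> bool:
    return (avg["accuracy"] > 0.95
            and avg["compliance"] == 1.0   # every rule, every input
            and avg["quality"] > 4.0)
```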

Anti-Pattern Check

  • Writing the prompt as a paragraph instead of structured sections. A wall of text is harder for the model to parse. Structure into Role, Context, Task, Rules, Format, Examples. Always
  • Using soft language for rules. "Try to keep emails short" gets interpreted as optional. "Never exceed 80 words" gets followed. Use absolute language for every constraint
  • No examples in the prompt. Rules tell the agent what to do. Examples show the agent what "done well" looks like. Without examples, the agent interprets rules differently on every run
  • Too many rules (30+). The model deprioritizes rules that appear late in a long list. If you need 30 rules, split into multiple agents with fewer rules each
  • Motivational context. "You are the world's best email writer and you take great pride in crafting compelling messages." This adds tokens and zero information. State facts, not flattery
  • Testing on 3 inputs and calling it validated. Three inputs means one failure changes the pass rate by 33%. Minimum 20 inputs for meaningful evaluation
  • Never updating examples. Examples from 6 months ago reference old ICP segments, old messaging, and old proof points. The agent produces output that was good 6 months ago. Update examples quarterly
  • Same prompt doing multiple jobs. A prompt that researches, scores, and writes emails is three agents stuffed into one. When quality drops, you can't tell which job is failing. One prompt, one job