For B2B SaaS in 2026, the right robots.txt allows every AI retrieval and search crawler (OAI-SearchBot, ChatGPT-User, Claude-User, Claude-SearchBot, PerplexityBot, Perplexity-User, Googlebot, Bingbot, Applebot) and makes a deliberate call on the four training-only bots (GPTBot, ClaudeBot, Google-Extended, Applebot-Extended). The default-allow approach gives away training data with no citation upside. Default-block kills visibility in ChatGPT, Perplexity, and Gemini, the engines now driving 87.4% of AI referral traffic. This guide publishes the exact configuration we deploy on B2B SaaS sites, with all 14 active AI crawlers compared.
What's the recommended robots.txt for B2B SaaS in 2026?
Below is the exact robots.txt block we deploy on Growth Engineer client sites. It allows all citation-driving crawlers, blocks training-only crawlers (adjust to taste), and protects private routes from every AI bot.
```
# === AI RETRIEVAL & SEARCH (allow -- these drive citations) ===
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: Claude-User
User-agent: Claude-SearchBot
User-agent: PerplexityBot
User-agent: Perplexity-User
User-agent: Googlebot
User-agent: Bingbot
User-agent: Applebot
User-agent: GoogleOther
User-agent: DuckAssistBot
Disallow: /admin/
Disallow: /app/
Disallow: /api/
Disallow: /account/

# === AI TRAINING (choose your stance) ===
# Allow if you want long-term brand grounding inside the models.
# Disallow if you object to training on your IP.
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: CCBot
User-agent: Meta-ExternalAgent
User-agent: Amazonbot
Disallow: /admin/
Disallow: /app/
Disallow: /api/
Disallow: /account/

# === CONFIRMED SCRAPERS / NO BUSINESS UPSIDE ===
User-agent: Bytespider
Disallow: /

Sitemap: https://yourdomain.com/sitemap.xml
```
Two principles drive this config:
- Retrieval bots are non-negotiable allows. Every crawler ending in `-User` or `-SearchBot` exists to put your page into an AI answer with a citation. Blocking them is leaving pipeline on the table.
- Training bots are a values call, not a visibility call. OpenAI's documentation confirms that blocking GPTBot does not remove you from ChatGPT Search, and the same separation applies to ClaudeBot versus Anthropic's retrieval bots.
What's the difference between GPTBot, OAI-SearchBot, and ChatGPT-User?
OpenAI runs three independently controlled crawlers. GPTBot scrapes the web to build training datasets for future GPT models. OAI-SearchBot crawls and indexes pages so ChatGPT Search can return them as cited results. ChatGPT-User is triggered the moment a real user asks ChatGPT a question and the model needs to fetch your page in real time.
Only the last two drive citations. According to OpenAI's official crawler docs, each user-agent token is independent in robots.txt, meaning you can disallow GPTBot while allowing OAI-SearchBot and ChatGPT-User without contradiction.
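That per-token independence is easy to check locally with Python's standard-library robots.txt parser. A minimal sketch (the domain and paths are placeholders, and note that `urllib.robotparser` matches user-agent tokens by substring, which is looser than the prefix matching in RFC 9309):

```python
import urllib.robotparser

# A minimal robots.txt that blocks OpenAI's training crawler
# while leaving the two citation-driving crawlers unrestricted.
ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
User-agent: ChatGPT-User
Disallow: /admin/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

# GPTBot (training) is shut out entirely...
print(rp.can_fetch("GPTBot", "https://yourdomain.com/blog/post"))         # False
# ...while the retrieval bots still reach public pages,
# and still honor the private-route disallow.
print(rp.can_fetch("OAI-SearchBot", "https://yourdomain.com/blog/post"))  # True
print(rp.can_fetch("ChatGPT-User", "https://yourdomain.com/admin/"))      # False
```

Each group is evaluated on its own, so the GPTBot rule never bleeds into the search and user-fetch tokens.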
| Crawler | Trigger | Drives a citation? |
|---|---|---|
| GPTBot | Scheduled training crawl | No |
| OAI-SearchBot | Search index refresh | Yes |
| ChatGPT-User | Live user question | Yes (highest-intent) |
The practical takeaway: ChatGPT-User hits are the closest thing AI search has to a high-intent inbound lead. Treat them like Googlebot and never block them.
Should I block GPTBot for training but allow OAI-SearchBot for citations?
Yes, that combination is supported and increasingly common. According to Cloudflare Radar's Q1 2026 analysis, GPTBot is the most-blocked AI crawler on the open web, appearing in more Disallow rules than any other AI user-agent. Many of those same sites still allow OAI-SearchBot.
The logic:
- Blocking GPTBot opts you out of contributing fresh training data to future GPT models. It does not remove you from ChatGPT Search.
- Allowing OAI-SearchBot keeps you eligible as a cited source inside ChatGPT Search.
- Allowing ChatGPT-User keeps you reachable when a paying ChatGPT user asks a question your page answers.
The counter-argument: long-term brand grounding inside the model. If GPT-6 is trained without your content, ChatGPT cannot generate ungrounded mentions of your product (the kind that show up without a live web fetch). For most B2B SaaS, that long-term mindshare is worth more than the marginal IP risk, so we default to allowing GPTBot unless the company has a specific IP-protection mandate.
How does Anthropic's ClaudeBot differ from Claude-User and Claude-SearchBot?
Anthropic mirrors OpenAI's three-bot framework. Per Anthropic's official documentation, the three active user agents are:
- ClaudeBot -- scrapes content for training Claude models.
- Claude-User -- fetches a page in real time when a Claude user asks a question.
- Claude-SearchBot -- indexes content for Claude's search-grounded answers.
The legacy user agents `anthropic-ai` and `Claude-Web` are deprecated. Stripping them from your robots.txt is fine, but most teams leave them in for safety. All three current bots respect robots.txt and Crawl-delay independently; blocking ClaudeBot does not block Claude-User or Claude-SearchBot.
For B2B SaaS, allow all three. Claude is the LLM most B2B technical buyers run inside Cursor, Zed, and internal tools. Being grounded inside Claude's retrieval index is a quiet but compounding pipeline lever.
What is PerplexityBot vs Perplexity-User?
PerplexityBot indexes pages for Perplexity's answer engine. Perplexity-User fires when a live user submits a query and Perplexity fetches your page in real time. Both are documented in Perplexity's crawler docs and identifiable via the perplexity.ai URL embedded in the user-agent string.
For B2B SaaS, allow both. Perplexity captures roughly 7-15% of AI referral traffic in 2026 and visitors from Perplexity browse 13 pages on average per session, more than the 11.8 pages from Google referrals (per Superlines AI Search Statistics 2026).
One caveat on enforcement. Cloudflare published evidence in August 2024 that Perplexity uses undeclared user agents and rotating ASNs to fetch pages that disallow PerplexityBot. If you have a hard requirement to block Perplexity (paywalled IP, regulated content), pair the robots.txt rule with Cloudflare AI Crawl Control or a WAF rule that drops requests by ASN and verifies user agents against perplexity.ai.
Does blocking Google-Extended affect my ranking in Google AI Overviews?
Officially, no, but practically it depends on what "AI Overviews" means in your region. Google's documentation states that Google-Extended has zero effect on Google Search ranking. Google-Extended is an opt-out token for using your content to train Gemini models and to ground Gemini Apps and Vertex AI features.
Where it gets nuanced:
- Google Search ranking: unaffected by Google-Extended.
- AI Overviews on the SERP: Google has stated these use the same Search index, so blocking Google-Extended should not remove you. In practice, multiple 2026 publisher tests show reduced AI Overview citation rates after Google-Extended is blocked.
- Gemini standalone answers: blocking Google-Extended does remove you here.
- Vertex AI grounding: blocking Google-Extended does remove you here.
For B2B SaaS, allow Google-Extended. The IP-protection upside is small (Google has access to your content via Googlebot regardless), and the citation downside in Gemini and AI Overviews is real.
What about Applebot vs Applebot-Extended?
Applebot is Apple's primary crawler, used to power Siri, Spotlight, and Apple Intelligence retrieval. Applebot-Extended is an opt-out token that controls whether your content can be used to train Apple's foundation models. According to Apple's official documentation, Applebot-Extended makes no HTTP requests of its own; it is a directive applied to the data Applebot already collected.
Key behavior:
- Disallow `Applebot` -- removes you from Siri, Spotlight, and Apple Intelligence retrieval.
- Disallow `Applebot-Extended` -- keeps you in retrieval but opts your content out of foundation-model training.
- Allow both -- you appear in Apple Intelligence and contribute to model training.
Apple Intelligence is now embedded across iOS, macOS, and Safari's search bar, with default routing to Apple's models for on-device queries before falling back to ChatGPT. For B2B SaaS targeting iPad/Mac power users, allowing Applebot is non-negotiable. Applebot-Extended is a values call.
Which 14 AI crawlers actually matter in 2026?
Beyond OpenAI, Anthropic, Perplexity, Google, and Apple, four more crawlers should be on every B2B SaaS robots.txt: Bingbot (powers Bing Search and Microsoft Copilot grounding), GoogleOther (Google's generic R&D fetcher, same infrastructure as Googlebot), Meta-ExternalAgent (powers Meta AI grounding inside WhatsApp/Instagram), and Amazonbot (Alexa+ and Rufus retrieval).
See the full comparison table above for purpose, training behavior, and recommended action across all 14. The pattern across operators is consistent:
- One training crawler (GPTBot, ClaudeBot, Google-Extended, Applebot-Extended).
- One search-index crawler (OAI-SearchBot, Claude-SearchBot, PerplexityBot, Bingbot).
- One live user-fetch crawler (ChatGPT-User, Claude-User, Perplexity-User).
Google is the outlier: Googlebot serves all three roles simultaneously, which is why blocking Googlebot at the root is uniquely catastrophic. Treat Googlebot like Bingbot: never block, no exceptions.
How do I deploy and verify my robots.txt for AI crawlers?
Deploy in five steps:
- Place robots.txt at the root. It must live at `https://yourdomain.com/robots.txt`; crawlers do not look for it in subdirectories.
- Group rules by user-agent block. Each `User-agent:` line is followed by its `Disallow:` and `Allow:` rules. Multiple `User-agent` lines can share a block.
- Always include a Sitemap directive. This is how AI crawlers discover new content fast. Per the Princeton GEO study (2024), fresh content (<13 weeks old) drives ~50% of AI citations.
- Verify with Google's robots.txt tester and the Cloudflare AI Crawl Control dashboard, which shows live crawler hits by user-agent.
- Check live crawler logs. Filter your access logs for the user-agent strings in the comparison table. If you see ClaudeBot hitting `/admin/`, your block isn't working.
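The log check can be scripted. A minimal Python sketch, assuming combined-format access logs; the sample lines, bot list, and protected paths are illustrative, so adapt them to your own config:

```python
# Scan combined-format access log lines for AI crawler hits and
# flag any that reached a path robots.txt is supposed to protect.
AI_BOTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
           "Claude-User", "Claude-SearchBot", "PerplexityBot", "Bytespider"]
PROTECTED = ("/admin/", "/app/", "/api/", "/account/")

def audit(log_lines):
    """Return (bot, path) pairs where an AI crawler hit a protected path."""
    violations = []
    for line in log_lines:
        bot = next((b for b in AI_BOTS if b in line), None)
        if bot is None:
            continue
        # Crude combined-log parse: the first quoted field is 'METHOD path HTTP/x'.
        try:
            path = line.split('"')[1].split()[1]
        except IndexError:
            continue
        if path.startswith(PROTECTED):
            violations.append((bot, path))
    return violations

sample = [
    '1.2.3.4 - - [01/Feb/2026] "GET /blog/post HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '5.6.7.8 - - [01/Feb/2026] "GET /admin/users HTTP/1.1" 200 233 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
]
print(audit(sample))  # [('ClaudeBot', '/admin/users')]
```

Any non-empty result means a bot is ignoring (or never received) your disallow, which is the cue to add a WAF-level block.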
For enforcement beyond robots.txt (because 89.4% of AI crawler traffic is training-related and many smaller bots ignore the standard), layer Cloudflare AI Crawl Control or a WAF rule on top. Robots.txt is a signal, not a contract.
What are the most common robots.txt mistakes for AI crawlers?
Five mistakes we see on B2B SaaS audits, ranked by damage:
- Blocking ChatGPT-User. This often happens when teams paste a generic "block all AI bots" snippet from a 2023 blog post. ChatGPT-User is your highest-intent inbound; ChatGPT drives 87.4% of AI referral traffic.
- Blocking Googlebot to "opt out of AI Overviews." This kills your entire Google Search presence. Use Google-Extended (and the `nosnippet` meta tag for finer control) instead.
- Using deprecated user agents. `anthropic-ai`, `Claude-Web`, and `chatgpt-bot` are either not real or no longer monitored; Anthropic deprecated the first two.
- Forgetting GoogleOther. Many AI grounding pipelines (Vertex, Gemini Studio) use GoogleOther as the fetcher. Blocking it silently removes you from those products.
- Inconsistent Disallow paths. A common pattern: `Disallow: /api/` for Googlebot but not for GPTBot. The block-by-block structure of robots.txt requires repeating disallows in every user-agent group, or using a wildcard `User-agent: *` block as a baseline.
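The inconsistent-disallow mistake is easy to reproduce with Python's `urllib.robotparser` (hypothetical domain and paths):

```python
import urllib.robotparser

# The classic inconsistency: /api/ is disallowed for Googlebot only.
# With no wildcard group, every other bot falls through to "allow everything".
BAD = """\
User-agent: Googlebot
Disallow: /api/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(BAD.splitlines())
print(rp.can_fetch("Googlebot", "https://yourdomain.com/api/keys"))  # False
print(rp.can_fetch("GPTBot", "https://yourdomain.com/api/keys"))     # True  <- leak

# The fix: a wildcard baseline that every unmatched bot inherits.
GOOD = BAD + """
User-agent: *
Disallow: /api/
"""
rp2 = urllib.robotparser.RobotFileParser()
rp2.parse(GOOD.splitlines())
print(rp2.can_fetch("GPTBot", "https://yourdomain.com/api/keys"))    # False
```

Running this kind of check in CI against your real robots.txt catches the leak before a crawler does.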
| User-Agent | Operator | Purpose | Used for Training? | Drives Citations? | Recommended Action (B2B SaaS) |
|---|---|---|---|---|---|
| GPTBot | OpenAI | LLM training data collection | Yes | No | Allow (or block if you object to training) |
| OAI-SearchBot | OpenAI | Indexes pages for ChatGPT Search | No | Yes | Allow |
| ChatGPT-User | OpenAI | Fetches a page when a ChatGPT user asks about it | No | Yes (high-intent) | Allow |
| ClaudeBot | Anthropic | Training data for Claude models | Yes | No | Allow (or block if you object to training) |
| Claude-User | Anthropic | Real-time fetch when a Claude user asks | No | Yes (high-intent) | Allow |
| Claude-SearchBot | Anthropic | Indexes pages for Claude search | No | Yes | Allow |
| PerplexityBot | Perplexity | Indexes pages for Perplexity answers | No | Yes | Allow |
| Perplexity-User | Perplexity | Live fetch on user query | No | Yes (high-intent) | Allow |
| Googlebot | Google | Search + AI Overviews grounding | No | Yes | Allow (never block) |
| Google-Extended | Google | Opt-out token for Gemini training + Vertex grounding | Yes | Yes (Gemini) | Allow |
| GoogleOther | Google | Generic R&D fetcher across Google teams | Possibly | No | Allow |
| Bingbot | Microsoft | Bing Search + Copilot grounding | Mixed | Yes | Allow (never block) |
| Applebot | Apple | Siri, Spotlight, Apple Intelligence grounding | No | Yes | Allow |
| Applebot-Extended | Apple | Opt-out token for Apple Intelligence training | Yes | No | Allow (or block if you object to training) |