For B2B SaaS in 2026, the right robots.txt allows every AI retrieval and search crawler (OAI-SearchBot, ChatGPT-User, Claude-User, Claude-SearchBot, PerplexityBot, Perplexity-User, Googlebot, Bingbot, Applebot) and makes a deliberate call on the four training-only bots (GPTBot, ClaudeBot, Google-Extended, Applebot-Extended). The default-allow approach gives away training data with no citation upside. Default-block kills visibility in ChatGPT, Perplexity, and Gemini, the engines now driving 87.4% of AI referral traffic. This guide publishes the exact configuration we deploy on B2B SaaS sites, with all 14 active AI crawlers compared.
What's the recommended robots.txt for B2B SaaS in 2026?
Below is the exact robots.txt block we deploy on Growth Engineer client sites. It allows all citation-driving crawlers, blocks training-only crawlers (adjust to taste), and protects private routes from every AI bot.
```
# === AI RETRIEVAL & SEARCH (allow -- these drive citations) ===
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: Claude-User
User-agent: Claude-SearchBot
User-agent: PerplexityBot
User-agent: Perplexity-User
User-agent: Googlebot
User-agent: Bingbot
User-agent: Applebot
User-agent: GoogleOther
User-agent: DuckAssistBot
Disallow: /admin/
Disallow: /app/
Disallow: /api/
Disallow: /account/

# === AI TRAINING (choose your stance) ===
# Allow if you want long-term brand grounding inside the models.
# Disallow if you object to training on your IP.
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: CCBot
User-agent: Meta-ExternalAgent
User-agent: Amazonbot
Disallow: /admin/
Disallow: /app/
Disallow: /api/
Disallow: /account/

# === CONFIRMED SCRAPERS / NO BUSINESS UPSIDE ===
User-agent: Bytespider
Disallow: /

Sitemap: https://yourdomain.com/sitemap.xml
```
Two principles drive this config:
- Retrieval bots are non-negotiable allows. Every crawler ending in `-User` or `-SearchBot` exists to put your page into an AI answer with a citation. Blocking them is leaving pipeline on the table.
- Training bots are a values call, not a visibility call. OpenAI's documentation confirms that blocking GPTBot does not remove you from ChatGPT Search, and the same separation applies to ClaudeBot versus Anthropic's retrieval bots.
What's the difference between GPTBot, OAI-SearchBot, and ChatGPT-User?
OpenAI runs three independently controlled crawlers. GPTBot scrapes the web to build training datasets for future GPT models. OAI-SearchBot crawls and indexes pages so ChatGPT Search can return them as cited results. ChatGPT-User is triggered the moment a real user asks ChatGPT a question and the model needs to fetch your page in real time.
Only the last two drive citations. According to OpenAI's official crawler docs, each user-agent token is independent in robots.txt, meaning you can disallow GPTBot while allowing OAI-SearchBot and ChatGPT-User without contradiction.
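That per-token independence is easy to check locally with Python's standard-library robots.txt parser. A minimal sketch (the domain and paths are placeholders, and note that `urllib.robotparser` matches user-agent tokens by substring, which is looser than the prefix matching in RFC 9309):

```python
import urllib.robotparser

# A minimal robots.txt that blocks OpenAI's training crawler
# while leaving the two citation-driving crawlers unrestricted.
ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
User-agent: ChatGPT-User
Disallow: /admin/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

# GPTBot (training) is shut out entirely...
print(rp.can_fetch("GPTBot", "https://yourdomain.com/blog/post"))         # False
# ...while the retrieval bots still reach public pages,
# and still honor the private-route disallow.
print(rp.can_fetch("OAI-SearchBot", "https://yourdomain.com/blog/post"))  # True
print(rp.can_fetch("ChatGPT-User", "https://yourdomain.com/admin/"))      # False
```

Each group is evaluated on its own, so the GPTBot rule never bleeds into the search and user-fetch tokens.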
| Crawler | Trigger | Drives a citation? |
|---|---|---|
| GPTBot | Scheduled training crawl | No |
| OAI-SearchBot | Search index refresh | Yes |
| ChatGPT-User | Live user question | Yes (highest-intent) |
The practical takeaway: ChatGPT-User hits are the closest thing AI search has to a high-intent inbound lead. Treat them like Googlebot and never block them.
Should I block GPTBot for training but allow OAI-SearchBot for citations?
Yes, that combination is supported and increasingly common. According to Cloudflare Radar's Q1 2026 analysis, GPTBot is the most-blocked AI crawler on the open web, appearing in more Disallow rules than any other AI user-agent. Many of those same sites still allow OAI-SearchBot.
The logic:
- Blocking GPTBot opts you out of contributing fresh training data to future GPT models. It does not remove you from ChatGPT Search.
- Allowing OAI-SearchBot keeps you eligible as a cited source inside ChatGPT Search.
- Allowing ChatGPT-User keeps you reachable when a paying ChatGPT user asks a question your page answers.
The counter-argument: long-term brand grounding inside the model. If GPT-6 is trained without your content, ChatGPT cannot generate ungrounded mentions of your product (the kind that show up without a live web fetch). For most B2B SaaS, that long-term mindshare is worth more than the marginal IP risk, so we default to allowing GPTBot unless the company has a specific IP-protection mandate.
How does Anthropic's ClaudeBot differ from Claude-User and Claude-SearchBot?
Anthropic mirrors OpenAI's three-bot framework. Per Anthropic's official documentation, the three active user agents are:
- ClaudeBot -- scrapes content for training Claude models.
- Claude-User -- fetches a page in real time when a Claude user asks a question.
- Claude-SearchBot -- indexes content for Claude's search-grounded answers.
The legacy user agents `anthropic-ai` and `Claude-Web` are deprecated. Stripping them from your robots.txt is fine, but most teams leave them in for safety. All three current bots respect robots.txt and Crawl-delay independently; blocking ClaudeBot does not block Claude-User or Claude-SearchBot.
For B2B SaaS, allow all three. Claude is the LLM most B2B technical buyers run inside Cursor, Zed, and internal tools. Being grounded inside Claude's retrieval index is a quiet but compounding pipeline lever.
What is PerplexityBot vs Perplexity-User?
PerplexityBot indexes pages for Perplexity's answer engine. Perplexity-User fires when a live user submits a query and Perplexity fetches your page in real time. Both are documented in Perplexity's crawler docs and identifiable via the perplexity.ai URL embedded in the user-agent string.
For B2B SaaS, allow both. Perplexity captures roughly 7-15% of AI referral traffic in 2026 and visitors from Perplexity browse 13 pages on average per session, more than the 11.8 pages from Google referrals (per Superlines AI Search Statistics 2026).
One caveat on enforcement. Cloudflare published evidence in August 2024 that Perplexity uses undeclared user agents and rotating ASNs to fetch pages that disallow PerplexityBot. If you have a hard requirement to block Perplexity (paywalled IP, regulated content), pair the robots.txt rule with Cloudflare AI Crawl Control or a WAF rule that drops requests by ASN and verifies user agents against perplexity.ai.
Does blocking Google-Extended affect my ranking in Google AI Overviews?
Officially, no, but practically it depends on what "AI Overviews" means in your region. Google's documentation states that Google-Extended has zero effect on Google Search ranking. Google-Extended is an opt-out token for using your content to train Gemini models and to ground Gemini Apps and Vertex AI features.
Where it gets nuanced:
- Google Search ranking: unaffected by Google-Extended.
- AI Overviews on the SERP: Google has stated these use the same Search index, so blocking Google-Extended should not remove you. In practice, multiple 2026 publisher tests show reduced AI Overview citation rates after Google-Extended is blocked.
- Gemini standalone answers: blocking Google-Extended does remove you here.
- Vertex AI grounding: blocking Google-Extended does remove you here.
For B2B SaaS, allow Google-Extended. The IP-protection upside is small (Google has access to your content via Googlebot regardless), and the citation downside in Gemini and AI Overviews is real.
What about Applebot vs Applebot-Extended?
Applebot is Apple's primary crawler, used to power Siri, Spotlight, and Apple Intelligence retrieval. Applebot-Extended is an opt-out token that controls whether your content can be used to train Apple's foundation models. According to Apple's official documentation, Applebot-Extended makes no HTTP requests of its own; it is a directive applied to the data Applebot already collected.
Key behavior:
- Disallow `Applebot` -- removes you from Siri, Spotlight, and Apple Intelligence retrieval.
- Disallow `Applebot-Extended` -- keeps you in retrieval but opts your content out of foundation-model training.
- Allow both -- you appear in Apple Intelligence and contribute to model training.
Apple Intelligence is now embedded across iOS, macOS, and Safari's search bar, with default routing to Apple's models for on-device queries before falling back to ChatGPT. For B2B SaaS targeting iPad/Mac power users, allowing Applebot is non-negotiable. Applebot-Extended is a values call.
Which 14 AI crawlers actually matter in 2026?
Beyond OpenAI, Anthropic, Perplexity, Google, and Apple, four more crawlers should be on every B2B SaaS robots.txt: Bingbot (powers Bing Search and Microsoft Copilot grounding), GoogleOther (Google's generic R&D fetcher, same infrastructure as Googlebot), Meta-ExternalAgent (powers Meta AI grounding inside WhatsApp/Instagram), and Amazonbot (Alexa+ and Rufus retrieval).
See the full comparison table above for purpose, training behavior, and recommended action across all 14. The pattern across operators is consistent:
- One training crawler (GPTBot, ClaudeBot, Google-Extended, Applebot-Extended).
- One search-index crawler (OAI-SearchBot, Claude-SearchBot, PerplexityBot, Bingbot).
- One live user-fetch crawler (ChatGPT-User, Claude-User, Perplexity-User).
Google is the outlier: Googlebot serves all three roles simultaneously, which is why blocking Googlebot at the root is uniquely catastrophic. Treat Googlebot like Bingbot: never block, no exceptions.
How do I deploy and verify my robots.txt for AI crawlers?
Deploy in five steps:
- Place robots.txt at the root. It must live at `https://yourdomain.com/robots.txt`; crawlers do not look for it in subdirectories.
- Group rules by user-agent block. Each `User-agent:` line is followed by its `Disallow:` and `Allow:` rules. Multiple `User-agent` lines can share a block.
- Always include a Sitemap directive. This is how AI crawlers discover new content fast. Per the Princeton GEO study (2024), fresh content (<13 weeks old) drives ~50% of AI citations.
- Verify with Google's robots.txt tester and the Cloudflare AI Crawl Control dashboard, which shows live crawler hits by user-agent.
- Check live crawler logs. Filter your access logs for the user-agent strings in the comparison table. If you see ClaudeBot hitting `/admin/`, your block isn't working.
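The log check can be scripted. A minimal Python sketch, assuming combined-format access logs; the sample lines, bot list, and protected paths are illustrative, so adapt them to your own config:

```python
# Scan combined-format access log lines for AI crawler hits and
# flag any that reached a path robots.txt is supposed to protect.
AI_BOTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
           "Claude-User", "Claude-SearchBot", "PerplexityBot", "Bytespider"]
PROTECTED = ("/admin/", "/app/", "/api/", "/account/")

def audit(log_lines):
    """Return (bot, path) pairs where an AI crawler hit a protected path."""
    violations = []
    for line in log_lines:
        bot = next((b for b in AI_BOTS if b in line), None)
        if bot is None:
            continue
        # Crude combined-log parse: the first quoted field is 'METHOD path HTTP/x'.
        try:
            path = line.split('"')[1].split()[1]
        except IndexError:
            continue
        if path.startswith(PROTECTED):
            violations.append((bot, path))
    return violations

sample = [
    '1.2.3.4 - - [01/Feb/2026] "GET /blog/post HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '5.6.7.8 - - [01/Feb/2026] "GET /admin/users HTTP/1.1" 200 233 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
]
print(audit(sample))  # [('ClaudeBot', '/admin/users')]
```

Any non-empty result means a bot is ignoring (or never received) your disallow, which is the cue to add a WAF-level block.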
For enforcement beyond robots.txt (because 89.4% of AI crawler traffic is training-related and many smaller bots ignore the standard), layer Cloudflare AI Crawl Control or a WAF rule on top. Robots.txt is a signal, not a contract.
What are the most common robots.txt mistakes for AI crawlers?
Five mistakes we see on B2B SaaS audits, ranked by damage:
- Blocking ChatGPT-User. This often happens when teams paste a generic "block all AI bots" snippet from a 2023 blog post. ChatGPT-User is your highest-intent inbound; ChatGPT drives 87.4% of AI referral traffic.
- Blocking Googlebot to "opt out of AI Overviews." This kills your entire Google Search presence. Use Google-Extended (and the `nosnippet` meta tag for finer control) instead.
- Using deprecated user agents. `anthropic-ai`, `Claude-Web`, and `chatgpt-bot` are either not real or no longer monitored; Anthropic deprecated the first two.
- Forgetting GoogleOther. Many AI grounding pipelines (Vertex, Gemini Studio) use GoogleOther as the fetcher. Blocking it silently removes you from those products.
- Inconsistent Disallow paths. A common pattern: `Disallow: /api/` for Googlebot but not for GPTBot. The block-by-block structure of robots.txt requires repeating disallows in every user-agent group, or using a wildcard `User-agent: *` block as a baseline.
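The inconsistent-disallow mistake is easy to reproduce with Python's `urllib.robotparser` (hypothetical domain and paths):

```python
import urllib.robotparser

# The classic inconsistency: /api/ is disallowed for Googlebot only.
# With no wildcard group, every other bot falls through to "allow everything".
BAD = """\
User-agent: Googlebot
Disallow: /api/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(BAD.splitlines())
print(rp.can_fetch("Googlebot", "https://yourdomain.com/api/keys"))  # False
print(rp.can_fetch("GPTBot", "https://yourdomain.com/api/keys"))     # True  <- leak

# The fix: a wildcard baseline that every unmatched bot inherits.
GOOD = BAD + """
User-agent: *
Disallow: /api/
"""
rp2 = urllib.robotparser.RobotFileParser()
rp2.parse(GOOD.splitlines())
print(rp2.can_fetch("GPTBot", "https://yourdomain.com/api/keys"))    # False
```

Running this kind of check in CI against your real robots.txt catches the leak before a crawler does.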
| User-Agent | Operator | Purpose | Used for Training? | Drives Citations? | Recommended Action (B2B SaaS) |
|---|---|---|---|---|---|
| GPTBot | OpenAI | LLM training data collection | Yes | No | Allow (or block if you object to training) |
| OAI-SearchBot | OpenAI | Indexes pages for ChatGPT Search | No | Yes | Allow |
| ChatGPT-User | OpenAI | Fetches a page when a ChatGPT user asks about it | No | Yes (high-intent) | Allow |
| ClaudeBot | Anthropic | Training data for Claude models | Yes | No | Allow (or block if you object to training) |
| Claude-User | Anthropic | Real-time fetch when a Claude user asks | No | Yes (high-intent) | Allow |
| Claude-SearchBot | Anthropic | Indexes pages for Claude search | No | Yes | Allow |
| PerplexityBot | Perplexity | Indexes pages for Perplexity answers | No | Yes | Allow |
| Perplexity-User | Perplexity | Live fetch on user query | No | Yes (high-intent) | Allow |
| Googlebot | Google | Search + AI Overviews grounding | No | Yes | Allow (never block) |
| Google-Extended | Google | Opt-out token for Gemini training + Vertex grounding | Yes | Yes (Gemini) | Allow |
| GoogleOther | Google | Generic R&D fetcher across Google teams | Possibly | No | Allow |
| Bingbot | Microsoft | Bing Search + Copilot grounding | Mixed | Yes | Allow (never block) |
| Applebot | Apple | Siri, Spotlight, Apple Intelligence grounding | No | Yes | Allow |
| Applebot-Extended | Apple | Opt-out token for Apple Intelligence training | Yes | No | Allow (or block if you object to training) |