How to Block AI Crawlers with robots.txt
AI crawlers are visiting your website whether you've invited them or not. GPTBot, Amazonbot, ClaudeBot, CCBot, and dozens of others are downloading your content to train large language models, power AI search products, or feed AI assistants.
The primary tool for controlling this access is your robots.txt file, a plain text file at the root of your domain that tells crawlers which parts of your site they're allowed to visit. Most major AI crawler operators have publicly stated that their crawlers respect robots.txt directives.
But blocking AI crawlers is not a simple on/off decision. Different bots serve different purposes, and blocking the wrong ones can cut you off from AI-driven traffic that's increasingly valuable. This guide gives you the robots.txt rules you need — and a framework for deciding which rules to use.
The AI Crawlers You Need to Know About
Before writing rules, you need to understand what you're blocking. AI crawlers fall into three distinct categories, each with different implications for your site:
| Category | Bots | What They Do | Do They Send Traffic Back? |
|---|---|---|---|
| AI Training | GPTBot, CCBot, Google-Extended, Bytespider | Download content to train AI models | Generally no. Crawl-to-Referral Ratio (CRR) is typically near zero. |
| AI Search | OAI-SearchBot, PerplexityBot, Kagi | Crawl content to power AI search with source citations | Yes — they link back to sources, generating measurable referral traffic. |
| AI Assistants | ChatGPT-User, MistralAI-User | Fetch specific pages on behalf of users during conversations | Direct traffic — a user actively asked the AI to visit your page. |
This distinction matters. Blocking all AI crawlers indiscriminately means you're also blocking the ones that cite your content and send real visitors. Understanding the difference between training bots and search bots is the foundation of a smart blocking strategy.
Option 1: Block All AI Training Crawlers
If your goal is to prevent your content from being used to train AI models, add the following to your robots.txt file:
# Block all known AI training crawlers
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Diffbot
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: Omgilibot
Disallow: /
User-agent: Amazonbot
Disallow: /
This blocks the most common AI training crawlers. Each User-agent directive targets a specific bot by the name it uses in its HTTP requests.
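Before deploying rules like these, it can help to test them offline. The following is a minimal sketch using Python's standard `urllib.robotparser`; the `is_allowed` helper and the `example.com` URL are illustrative names, and the rule set is abbreviated to three of the bots listed above.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated copy of the Option 1 rules (three bots only, for brevity).
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Googlebot is not named in the rules, so it should remain allowed.
    for bot in ("GPTBot", "CCBot", "ClaudeBot", "Googlebot"):
        print(bot, is_allowed(ROBOTS_TXT, bot, "https://example.com/blog/post"))
```

This only shows how one parser reads the file. Each crawler uses its own parser, so treat the result as a sanity check rather than a guarantee.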
What this does: Prevents these bots from downloading any page on your site. Your content won't be included in future training runs for models that use these crawlers.
What this doesn't do: It doesn't remove content already ingested. If GPTBot crawled your site before you added these rules, that content may already be in OpenAI's training data. robots.txt is forward-looking — it controls future access, not past crawls.
What you lose: If you block GPTBot entirely, you also block OpenAI's ability to keep its models up to date with your content. ChatGPT may still reference older versions of your pages or stop referencing them altogether as models are retrained.
Option 2: Block Training Bots, Keep AI Search Bots
This is the approach most site owners should consider first. It blocks bots that consume your content purely for model training while keeping the door open for AI search products that cite sources and send traffic back.
# Block AI training crawlers — keep AI search crawlers
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Bytespider
Disallow: /
# Allow OAI-SearchBot (powers ChatGPT search with source links)
User-agent: OAI-SearchBot
Allow: /
# Allow ChatGPT-User (fetches pages when users ask)
User-agent: ChatGPT-User
Allow: /
# Allow PerplexityBot (AI search with citations)
User-agent: PerplexityBot
Allow: /
This strategy makes sense when you look at the data. AI search bots like OAI-SearchBot and PerplexityBot tend to have measurably higher Crawl-to-Referral Ratios (CRR), meaning they crawl your pages and the associated products actually send visitors back via source citations. Training-only bots, by contrast, typically have a CRR near zero.
The distinction between GPTBot (training) and OAI-SearchBot (search) is particularly important for OpenAI's ecosystem. They're separate crawlers with separate user-agent strings, so you can block one and allow the other.
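One way to keep the training and search groups from drifting apart is to generate the file from two lists. This is a rough sketch, not a required workflow; `build_robots_txt` and the two list names are hypothetical, and the bot names are the ones used in the rules above.

```python
def build_robots_txt(blocked: list[str], allowed: list[str]) -> str:
    """Render one robots.txt group per bot, separated by blank lines."""
    groups = [f"User-agent: {bot}\nDisallow: /" for bot in blocked]
    groups += [f"User-agent: {bot}\nAllow: /" for bot in allowed]
    return "\n\n".join(groups) + "\n"


# Bots that only collect training data.
TRAINING_BOTS = ["GPTBot", "CCBot", "Google-Extended", "Bytespider"]
# Bots that power AI search or fetch pages on a user's behalf.
SEARCH_AND_ASSISTANT_BOTS = ["OAI-SearchBot", "ChatGPT-User", "PerplexityBot"]

if __name__ == "__main__":
    print(build_robots_txt(TRAINING_BOTS, SEARCH_AND_ASSISTANT_BOTS), end="")
```

Moving a bot between categories then becomes a one-line change to a list instead of a hand edit of the file.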
Option 3: Block AI Crawlers from Specific Sections
Rather than blocking bots sitewide, you can restrict access to specific directories — protecting premium content while leaving public content accessible:
# Block AI crawlers from sensitive sections only
# Paths without a Disallow rule stay crawlable by default.
# The Allow lines for GPTBot only make that explicit, and they sit in the
# same group because some parsers read only the first group for a bot.
User-agent: GPTBot
Disallow: /premium/
Disallow: /members/
Disallow: /api/
Allow: /blog/
Allow: /docs/
User-agent: CCBot
Disallow: /premium/
Disallow: /members/
Disallow: /api/
User-agent: ClaudeBot
Disallow: /premium/
Disallow: /members/
Disallow: /api/
This approach works well when you have both free and premium content. Let AI crawlers index your blog posts and documentation, which benefit from AI distribution, while keeping paywalled or sensitive sections off limits.
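Path-based rules are easy to get subtly wrong, so a quick offline check can help. The sketch below uses Python's standard `urllib.robotparser` to list which sample paths the GPTBot rules block; `blocked_paths` and the sample paths are hypothetical. Note that Python's parser applies the first matching rule in file order, while many crawlers pick the most specific match, so the example deliberately avoids overlapping Allow and Disallow paths.

```python
from urllib.robotparser import RobotFileParser

# The GPTBot group from the section-level rules above.
SECTION_RULES = """\
User-agent: GPTBot
Disallow: /premium/
Disallow: /members/
Disallow: /api/
"""


def blocked_paths(robots_txt: str, user_agent: str, paths: list[str]) -> list[str]:
    """Return the subset of paths that user_agent may not fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [path for path in paths if not parser.can_fetch(user_agent, path)]


if __name__ == "__main__":
    sample = ["/blog/post", "/premium/report", "/api/v1/users", "/docs/"]
    print(blocked_paths(SECTION_RULES, "GPTBot", sample))
```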
Complete List: AI Crawler User-Agent Strings
Here's a comprehensive reference of the AI-related user-agent strings you can target in robots.txt, along with what each crawler does:
| User-Agent String | Operator | Purpose |
|---|---|---|
| GPTBot | OpenAI | Model training and improvement |
| OAI-SearchBot | OpenAI | Real-time search with source links |
| ChatGPT-User | OpenAI | Fetching pages during user conversations |
| CCBot | Common Crawl | Open dataset used by many AI labs for training |
| Google-Extended | Google | Gemini model training (separate from Googlebot) |
| ClaudeBot | Anthropic | Model training for Claude |
| Bytespider | ByteDance | Model training for TikTok/Douyin AI products |
| Diffbot | Diffbot | AI-powered web data extraction |
| PerplexityBot | Perplexity | AI search engine with source citations |
| Amazonbot | Amazon | Alexa answers and Amazon AI services |
| FacebookBot | Meta | AI training for Meta's LLMs |
| Applebot-Extended | Apple | AI training for Apple Intelligence features |
For detailed profiles of each of these crawlers — including verification methods, IP ranges, and crawl behavior — see the bot catalogue.
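The table can also be turned into a small lookup for log analysis. This is a minimal sketch covering a subset of the rows; `AI_CRAWLERS`, `classify`, and the category labels are hypothetical names, and the sample header is illustrative rather than an exact string any crawler is guaranteed to send. Google-Extended is left out because, as far as we know, it is a robots.txt control token and does not show up as a user agent in requests.

```python
# token -> (operator, purpose category); a subset of the table above.
AI_CRAWLERS = {
    "GPTBot": ("OpenAI", "training"),
    "OAI-SearchBot": ("OpenAI", "search"),
    "ChatGPT-User": ("OpenAI", "assistant"),
    "CCBot": ("Common Crawl", "training"),
    "ClaudeBot": ("Anthropic", "training"),
    "Bytespider": ("ByteDance", "training"),
    "PerplexityBot": ("Perplexity", "search"),
}


def classify(user_agent_header: str):
    """Return (token, operator, purpose) for the first known token found, else None."""
    lowered = user_agent_header.lower()
    for token, (operator, purpose) in AI_CRAWLERS.items():
        if token.lower() in lowered:
            return token, operator, purpose
    return None


if __name__ == "__main__":
    print(classify("Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"))
```

A match here only means the request claims to be that crawler; it says nothing about whether the claim is genuine.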
How to Verify Your robots.txt Is Working
Adding rules to robots.txt is only half the job. You need to confirm that your rules are correctly formatted and that bots are actually obeying them.
Step 1: Check your file is accessible
# Check your current robots.txt
curl -s https://yoursite.com/robots.txt
# Verify a specific bot is blocked
curl -s https://yoursite.com/robots.txt | grep -A1 "GPTBot"
Your robots.txt must be served at the exact path /robots.txt on your domain root. If it returns a 404, no crawler will see your rules.
Step 2: Validate syntax
Common mistakes that silently break robots.txt rules:
- Wrong user-agent name. User-agent: GPT-Bot won't match GPTBot. The name must match the token the crawler identifies itself with. Matching is generally case-insensitive, but spelling and hyphens matter.
- Missing blank lines between rules. Each user-agent block should be separated by a blank line for clarity. While some parsers are forgiving, consistent formatting prevents ambiguity.
- Conflicting rules. If you have both Allow: /blog/ and Disallow: / for the same user-agent, the more specific rule (Allow) takes precedence, but not all crawlers implement this identically.
- BOM or encoding issues. robots.txt should be UTF-8 without a byte order mark. Some CMS platforms add invisible characters that can break parsing.
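Two of these mistakes, a stray byte order mark and a misspelled user-agent name, are easy to catch automatically. The following is a rough sketch of such a check, not a full validator; `lint_robots`, `KNOWN_TOKENS`, and the 0.8 similarity cutoff are assumptions chosen for illustration.

```python
import difflib

# Tokens we expect to see; extend this list to match your own policy.
KNOWN_TOKENS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "CCBot",
                "Google-Extended", "ClaudeBot", "Bytespider", "PerplexityBot"]


def lint_robots(raw: bytes) -> list[str]:
    """Return human-readable problems found in the raw robots.txt bytes."""
    problems = []
    if raw.startswith(b"\xef\xbb\xbf"):
        problems.append("file starts with a UTF-8 byte order mark")
        raw = raw[3:]
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return problems + ["file is not valid UTF-8"]

    known_lower = {token.lower(): token for token in KNOWN_TOKENS}
    for number, line in enumerate(text.splitlines(), start=1):
        field, _, value = line.partition(":")
        if field.strip().lower() != "user-agent":
            continue
        name = value.split("#")[0].strip()  # drop any trailing comment
        if name == "*" or name.lower() in known_lower:
            continue
        # Flag names that are close to, but not exactly, a known token.
        close = difflib.get_close_matches(name.lower(), list(known_lower), n=1, cutoff=0.8)
        if close:
            problems.append(
                f"line {number}: '{name}' looks like a typo of '{known_lower[close[0]]}'"
            )
    return problems


if __name__ == "__main__":
    for problem in lint_robots(b"\xef\xbb\xbfUser-agent: GPT-Bot\nDisallow: /\n"):
        print(problem)
```

Names that are not close to any known token are left alone, since they may be legitimate bots missing from the list.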
Step 3: Monitor compliance
Here's the uncomfortable truth about robots.txt: it's a voluntary protocol. There is no technical enforcement mechanism. A well-behaved crawler will respect your directives. A scraper pretending to be GPTBot will ignore them entirely.
To know whether bots are actually obeying your rules, you need to monitor your traffic. If you've blocked GPTBot in robots.txt but still see requests from its user-agent, either:
- The bot hasn't re-crawled your robots.txt yet (crawlers check periodically, not instantly)
- The requests are from a fake bot spoofing GPTBot's user-agent string
Both scenarios require visibility into your actual traffic — which is where bot detection and ongoing monitoring become essential.
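As a starting point for that visibility, the sketch below scans access log lines for requests that a blocked bot should not have made. It assumes logs in the common "combined" format with the user agent as the last quoted field; `find_violations`, the regular expression, and the sample data are hypothetical and would need adjusting to your server's log layout.

```python
import re
from urllib.robotparser import RobotFileParser

# Combined Log Format: ... "METHOD path HTTP/x" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def find_violations(robots_txt: str, log_lines, bot_tokens):
    """Return (token, path) pairs for requests that robots_txt disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    violations = []
    for line in log_lines:
        match = LOG_LINE.search(line)
        if not match or match["path"] == "/robots.txt":
            continue  # fetching robots.txt itself is expected behavior
        for token in bot_tokens:
            claims_token = token.lower() in match["ua"].lower()
            if claims_token and not parser.can_fetch(token, match["path"]):
                violations.append((token, match["path"]))
    return violations


if __name__ == "__main__":
    rules = "User-agent: GPTBot\nDisallow: /\n"
    logs = [
        '203.0.113.7 - - [10/Mar/2025:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
        '203.0.113.7 - - [10/Mar/2025:10:00:05 +0000] "GET /blog/post HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    ]
    print(find_violations(rules, logs, ["GPTBot"]))
```

A hit in this list does not say which of the two scenarios applies. Telling a stale robots.txt cache apart from a spoofed user agent still requires checking timing and the source of the request.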
The Block-or-Allow Decision Framework
Robots.txt rules are easy to write. Deciding what to write is the harder question. Here's a practical framework:
1. Measure before you block
Before adding blocking rules, understand what's actually happening on your site. Which AI crawlers are visiting? How often? What pages are they consuming? Without this baseline, you're making decisions blind.
Check your server logs for AI bot user-agent strings, or use a bot monitoring tool that tracks crawl activity across all known bots.
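For a first baseline, a simple count of requests per AI bot is often enough. This minimal sketch assumes the user agent is the last quoted field on each log line, as in the common "combined" format; `count_ai_hits` and `AI_BOT_TOKENS` are hypothetical names.

```python
from collections import Counter

AI_BOT_TOKENS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "CCBot",
                 "ClaudeBot", "PerplexityBot", "Bytespider"]


def count_ai_hits(log_lines) -> Counter:
    """Count requests per AI bot token found in the last quoted field of each line."""
    counts = Counter()
    for line in log_lines:
        parts = line.rsplit('"', 2)
        if len(parts) < 3:
            continue  # no quoted user-agent field on this line
        user_agent = parts[1].lower()
        for token in AI_BOT_TOKENS:
            if token.lower() in user_agent:
                counts[token] += 1
                break
    return counts


if __name__ == "__main__":
    with open("access.log", encoding="utf-8", errors="replace") as handle:  # path is an assumption
        for bot, hits in count_ai_hits(handle).most_common():
            print(f"{bot}: {hits}")
```

Running this over a week or a month of logs gives a rough picture of which crawlers are active before any rules change.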
2. Separate training bots from search bots
As the table above shows, not all AI bots are the same. The ones that train models and the ones that power search with citations have very different value propositions. Blocking a training bot costs you nothing if that bot's operator doesn't send traffic back. Blocking an AI search bot might cost you a growing source of referral visits.
The Crawl-to-Referral Ratio (CRR) makes this concrete. If a bot has a CRR near zero, blocking it is low-risk. If it has a CRR of 30+, you're giving up real traffic by blocking it.
3. Consider your content type
- Publisher / media sites: Most aggressive about blocking training bots. Their content is their product, and AI training without compensation is a direct threat.
- SaaS / B2B companies: Often benefit from AI distribution. If ChatGPT recommends your product when someone asks for solutions in your category, that's free marketing.
- E-commerce: Product descriptions being used for AI training is less concerning than for editorial content. AI search that links to your product pages is directly valuable.
- Documentation / technical sites: Being cited by AI assistants builds authority and drives traffic from developers seeking your docs.
4. Review and adjust regularly
The AI crawler landscape changes constantly. New bots appear, operators launch new products, and the traffic patterns shift. A robots.txt policy you set six months ago may no longer reflect reality.
Monitor which bots are active on your site, track their CRR over time, and adjust your blocking rules when the data warrants it. This is not a set-and-forget decision.
What robots.txt Can't Do
It's important to be clear about the limits of robots.txt as a defense mechanism:
- It doesn't block access technically. robots.txt is an honor system. Any crawler can ignore it. If a malicious scraper visits your site claiming to be GPTBot, your Disallow rule won't stop it.
- It doesn't retroactively remove content. Data already crawled before you added blocking rules may already be in training datasets. Robots.txt only affects future crawls.
- It doesn't distinguish between page types intelligently. You can block by path, but you can't block "only product pages" or "only articles published after 2025" without those being reflected in your URL structure.
- It's publicly visible. Anyone can read your robots.txt. Some argue that publishing your blocking rules helps scrapers know exactly what you're protecting.
For sites that need stronger protection — actual blocking rather than polite requests — consider rate limiting, IP-based access controls, or a CDN-level bot management solution. But for the vast majority of sites, robots.txt combined with monitoring is the right starting point.
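To illustrate the difference between a request and an actual block, here is a minimal sketch of server-side enforcement as a WSGI middleware that returns 403 to listed user-agent tokens. `block_ai_crawlers` and `BLOCKED_TOKENS` are hypothetical names, and this is not a substitute for a CDN-level solution: it only stops crawlers that identify themselves honestly, since a user-agent string can be spoofed or changed.

```python
BLOCKED_TOKENS = ("GPTBot", "CCBot", "Bytespider")


def block_ai_crawlers(app, blocked_tokens=BLOCKED_TOKENS):
    """Wrap a WSGI app so requests from listed user-agent tokens get a 403."""
    lowered = [token.lower() for token in blocked_tokens]

    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in lowered):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Everything else passes through to the wrapped application.
        return app(environ, start_response)

    return middleware


def demo_app(environ, start_response):
    """Tiny placeholder application used only for this example."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]


application = block_ai_crawlers(demo_app)
```

Unlike a Disallow rule, this refuses the request regardless of whether the crawler reads robots.txt. Catching crawlers that hide their identity requires IP-based controls on top.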
The Bottom Line
Blocking AI crawlers with robots.txt is straightforward technically — a few lines in a text file. The real challenge is making the right blocking decisions, and that requires data.
Don't block blindly. Understand which crawlers are active on your site, categorize them by purpose, measure whether they send traffic back, and then write rules that reflect an informed strategy. Block the bots that take without giving back. Keep the ones that drive real referral traffic through source citations.
And once you've set your rules, monitor whether they're being respected — because the difference between a legitimate bot honoring your robots.txt and a fake bot ignoring it is something only active monitoring can reveal.
Can AI See It monitors 800+ bots on your site, tracks which AI crawlers respect your robots.txt rules, detects fake bots, and measures the Crawl-to-Referral Ratio for every crawler, so your block/allow decisions are based on data, not guesswork.