AI Training Crawlers
These bots crawl websites to collect training data for AI models. Many respect robots.txt — giving you control over whether your content is used for AI training.
AI training crawlers download web pages so their operators can use the content to train large language models and other AI systems. Understanding which bots are active on your site — and whether they respect your access rules — is the first step in controlling how your content is used.
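As a minimal sketch of how such access rules can be checked, the snippet below uses Python's standard `urllib.robotparser` to test which crawlers a robots.txt file permits. The robots.txt content, the `example.com` URL, and the helper names `check_access` and `crawl_delay_for` are hypothetical and only illustrate the idea; a crawler that ignores robots.txt is not constrained by any of this.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration: blocks GPTBot entirely,
# slows ClaudeBot and keeps it out of /private/, allows everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Crawl-delay: 10
Disallow: /private/

User-agent: *
Allow: /
"""


def _parser_for(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def check_access(robots_txt: str, bots: list, url: str) -> dict:
    """Return {bot name: True/False} for whether each bot may fetch url."""
    parser = _parser_for(robots_txt)
    return {bot: parser.can_fetch(bot, url) for bot in bots}


def crawl_delay_for(robots_txt: str, bot: str):
    """Return the Crawl-delay (seconds) that applies to bot, or None."""
    return _parser_for(robots_txt).crawl_delay(bot)


if __name__ == "__main__":
    bots = ["GPTBot", "ClaudeBot", "Amazonbot"]
    url = "https://example.com/articles/post.html"  # hypothetical URL
    for bot, allowed in check_access(ROBOTS_TXT, bots, url).items():
        print(f"{bot}: {'allowed' if allowed else 'blocked'}")
    print("ClaudeBot crawl delay:", crawl_delay_for(ROBOTS_TXT, "ClaudeBot"))
```

The same check can be run against a live file by calling `set_url()` and `read()` on the parser instead of `parse()`.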
26 AI Training Crawlers in our directory
- Amazon Kendra by Amazon
Amazon Kendra is an intelligent search service operated by Amazon, using natural language processing for accurate search results.
- Amazonbot by Amazon
Amazonbot is Amazon's web crawler, used to improve Amazon's products and services, including by training AI models.
- Anchor Browser by Anchor
The Anchor Browser is a web browser for AI agents, operated by Anchor, used for automating workflows and web interactions.
- AwarioSmartBot by Awario
AwarioSmartBot is a web crawler operated by Awario, used for collecting new and updated web data for Internet marketers.
- Big Sur AI by Big Sur AI
Big Sur AI Crawler, operated by Big Sur AI, crawls user websites for AI-infused experiences.
- Brandwatch by Brandwatch
Brandwatch's Magpie Crawler indexes web pages for social media monitoring and analysis.
- Ceramic TerraCotta by Ceramic
Ceramic TerraCotta is a web crawler operated by Ceramic, an AI training infrastructure company. It crawls websites to support Ceramic's AI model training optimization platform.
- ClaudeBot by Anthropic
ClaudeBot is Anthropic's web crawler that collects web content for training Claude AI models. It respects robots.txt directives and supports Crawl-delay.
- Cotoyogi by Research Organization of Information and Systems
Cotoyogi is a bot operated by the Research Organization of Information and Systems for AI training purposes.
- Echobot Bot by Echobox
Echobot is a web scraping bot operated by Echobox for AI training purposes, specifically for automating content distribution for digital publishers.
- Factset_spyderbot by Factset
Factset Spyderbot is a web scraping bot operated by Factset for delivering reliable financial data.
- Google NotebookLM by Google
Google NotebookLM is a bot operated by Google for AI training.
- Google-CloudVertexBot by Google
Google-CloudVertexBot is a crawler operated by Google that crawls sites at the request of their owners, so that the owners can use their own content for AI training.
- GoogleOther by Google
GoogleOther is a generic crawler operated by Google for fetching publicly accessible content from sites, used for internal research and development.
- GPTBot by OpenAI
GPTBot is used to crawl content that may be used in training OpenAI's generative AI foundation models.
- ICC Crawler by NICT
ICC Crawler is a web crawler operated by NICT, collecting web pages for AI training.
- LINER Bot by Liner Bot
LINER Bot is a web crawler operated by Liner Bot, used for AI training by collecting data from the internet.
- Meta-ExternalAgent by Meta
Meta-ExternalAgent is a bot operated by Meta that crawls the web to train AI models or to improve products by indexing content directly.
- netEstate Imprint Crawler by netEstate
The netEstate Imprint Crawler crawls websites for public contact information.
- Novellum AI Crawl by Novellum
Novellum AI Crawl is an MCP tool from Novellum.ai, which builds tooling for AI agents. Agents use the tool to crawl sites.
- PetalBot by Huawei
PetalBot is a search engine crawler operated by Huawei, used for indexing websites and providing content recommendations in the Petal Search engine, Huawei Assistant, and AI Search services.
- QualifiedBot by Qualified.com, Inc.
QualifiedBot is a crawler operated by Qualified.com, Inc. to power AI products by crawling customer websites.
- SemrushBot-OCOB by Semrush
SemrushBot-OCOB is a bot operated by Semrush for AI training, specifically for the Content Toolkit.
- SemrushBotSwa by Semrush
SemrushBotSwa is a bot operated by Semrush for AI training purposes, collecting data for the SEO Writing Assistant tool.
- ShapBot by Parallel
Not officially documented
- WARDBot by WEBSPARK
WARDBot is a monitoring bot operated by WEBSPARK that tracks URL status codes for users.
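To see which of the crawlers listed above are active on a site, their names can be matched against the User-Agent headers recorded in access logs. The sketch below assumes each crawler's name appears verbatim in its User-Agent string; the subset of names, the sample strings, and the helpers `identify_crawler` and `count_crawlers` are illustrative only. User-Agent headers can be spoofed, so a match suggests a crawler's identity but does not prove it.

```python
from collections import Counter
from typing import Iterable, Optional

# Illustrative subset of names from the directory. Assumption: each name
# appears as a substring of the crawler's User-Agent header.
KNOWN_CRAWLERS = [
    "GPTBot",
    "ClaudeBot",
    "Amazonbot",
    "Meta-ExternalAgent",
    "PetalBot",
    "GoogleOther",
]


def identify_crawler(user_agent: str) -> Optional[str]:
    """Return the first known crawler name found in user_agent, else None."""
    ua = user_agent.lower()
    for name in KNOWN_CRAWLERS:
        if name.lower() in ua:
            return name
    return None


def count_crawlers(user_agents: Iterable[str]) -> Counter:
    """Count requests per known crawler across a sequence of User-Agent strings."""
    counts = Counter()
    for ua in user_agents:
        name = identify_crawler(ua)
        if name is not None:
            counts[name] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical User-Agent values, as they might appear in an access log.
    sample = [
        "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)",
        "Mozilla/5.0 (compatible; ClaudeBot/1.0; +claudebot@anthropic.com)",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/126.0",
        "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)",
    ]
    for name, hits in count_crawlers(sample).most_common():
        print(f"{name}: {hits} request(s)")
```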