Can AI see it

Know what AI sees. Measure what it's worth.

AI Training Crawlers

These bots crawl websites to collect training data for AI models. Many respect robots.txt — giving you control over whether your content is used for AI training.

AI training crawlers download web pages so their operators can use the content to train large language models and other AI systems. Understanding which bots are active on your site — and whether they respect your access rules — is the first step in controlling how your content is used.
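
To make the robots.txt control mentioned above concrete, here is a minimal sketch that checks which crawlers a given set of rules would admit, using Python's standard `urllib.robotparser`. The policy in `ROBOTS_TXT`, the `is_allowed` helper, and the `example.com` URL are illustrative assumptions, not rules recommended by this directory. The check only shows what a compliant crawler should do. It does not enforce anything against bots that ignore robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: block two training crawlers, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Placeholder URL; substitute a page from your own site.
    for agent in ("GPTBot", "ClaudeBot", "Googlebot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, "https://example.com/article"))
```

Running the same check against your live robots.txt before deploying a change is a cheap way to confirm that a rule blocks the crawler you meant and nothing else.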

26 AI Training Crawlers in our directory

  • Amazon Kendra by Amazon

    Amazon Kendra is an intelligent search service operated by Amazon, using natural language processing for accurate search results.

  • Amazonbot by Amazon

    Amazonbot is Amazon's web crawler, used to improve Amazon's products and services, including by training AI models.

  • Anchor Browser by Anchor

    The Anchor Browser is a web browser for AI agents, operated by Anchor, used for automating workflows and web interactions.

  • AwarioSmartBot by Awario

    AwarioSmartBot is a web crawler operated by Awario, used for collecting new and updated web data for Internet marketers.

  • Big Sur AI by Big Sur AI

    Big Sur AI Crawler, operated by Big Sur AI, crawls user websites for AI-infused experiences.

  • Brandwatch by Brandwatch

    Brandwatch's Magpie Crawler indexes web pages for social media monitoring and analysis.

  • Ceramic TerraCotta by Ceramic

    Ceramic TerraCotta is a web crawler operated by Ceramic, an AI training infrastructure company. It crawls websites to support Ceramic's AI model training optimization platform.

  • ClaudeBot by Anthropic

    ClaudeBot is Anthropic's web crawler that collects web content for training Claude AI models. It respects robots.txt directives and supports Crawl-delay.

  • Cotoyogi by Research Organization of Information and Systems

    Cotoyogi is a bot operated by the Research Organization of Information and Systems for AI training purposes.

  • Echobot Bot by Echobox

    Echobot is a web scraping bot operated by Echobox for AI training purposes, specifically for automating content distribution for digital publishers.

  • Factset_spyderbot by Factset

    Factset Spyderbot is a web scraping bot operated by Factset for delivering reliable financial data.

  • Google NotebookLM by Google

    Google NotebookLM is a bot operated by Google for AI training.

  • Google-CloudVertexBot by Google

    Google-CloudVertexBot is a crawler operated by Google that crawls a site on behalf of its owner, for targeted AI training on that owner's own content.

  • GoogleOther by Google

    GoogleOther is a generic crawler operated by Google for fetching publicly accessible content from sites, used for internal research and development.

  • GPTBot by OpenAI

    GPTBot is used to crawl content that may be used in training OpenAI's generative AI foundation models.

  • ICC Crawler by NICT

    ICC Crawler is a web crawler operated by NICT, collecting web pages for AI training.

  • LINER Bot by Liner Bot

    LINER Bot is a web crawler operated by Liner Bot, used for AI training by collecting data from the internet.

  • Meta-ExternalAgent by Meta

    Meta-ExternalAgent is a bot operated by Meta that crawls the web to train AI models or to improve products by indexing content directly.

  • netEstate Imprint Crawler by netEstate

    The netEstate Imprint Crawler crawls websites for public contact information.

  • Novellum AI Crawl by Novellum

    Novellum AI Crawl is an MCP tool from Novellum.ai, which builds tooling for AI agents. Agents will use it to crawl sites.

  • PetalBot by Huawei

    PetalBot is a search engine crawler operated by Huawei, used for indexing websites and providing content recommendations in Petal Search engine, Huawei Assistant, and AI Search services.

  • QualifiedBot by Qualified.com, Inc.

    QualifiedBot is a crawler operated by Qualified.com, Inc. to power AI products by crawling customer websites.

  • SemrushBot-OCOB by Semrush

    SemrushBot-OCOB is a bot operated by Semrush for AI training, specifically for its Content Toolkit.

  • SemrushBotSwa by Semrush

    SemrushBotSwa is a bot operated by Semrush for AI training purposes, collecting data for the SEO Writing Assistant tool.

  • ShapBot by Parallel

    Not officially documented

  • WARDBot by WEBSPARK

    WARDBot is a monitoring bot operated by WEBSPARK that tracks URL status codes for users.
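
One way to see which of the crawlers listed above are active on a site is to count access-log lines whose user-agent field contains a crawler's name. The sketch below assumes that the directory names appear as substrings of the user-agent string, which should be confirmed against each operator's documentation. The `CRAWLER_TOKENS` tuple, the `count_crawler_hits` helper, and the sample log lines are hypothetical. User-agent strings can be spoofed, so a match indicates a claimed identity, not a verified one.

```python
from collections import Counter

# Names taken from the directory above; assumed (not verified here) to
# appear as substrings of each crawler's user-agent string.
CRAWLER_TOKENS = (
    "GPTBot",
    "ClaudeBot",
    "Amazonbot",
    "Meta-ExternalAgent",
    "PetalBot",
    "GoogleOther",
)


def count_crawler_hits(log_lines):
    """Count log lines per crawler token, matching case-insensitively."""
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token in CRAWLER_TOKENS:
            if token.lower() in lowered:
                hits[token] += 1
                break  # attribute each line to at most one crawler
    return hits


# Made-up log lines in a combined-log-like layout, for illustration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jan/2025:00:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.6 - - [01/Jan/2025:00:00:02 +0000] "GET /a HTTP/1.1" 200 128 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '203.0.113.5 - - [01/Jan/2025:00:00:03 +0000] "GET /b HTTP/1.1" 200 256 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [01/Jan/2025:00:00:04 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

if __name__ == "__main__":
    for name, count in count_crawler_hits(SAMPLE_LOG).most_common():
        print(name, count)
```

Substring matching is deliberately loose here so that version suffixes in the user-agent string do not cause misses.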
