Robots.txt Generator & Validator
Generate and validate robots.txt files. Block AI crawlers, configure search engine access, and test URL rules.
How to Use
- Start with a preset — "Allow All" for open access, "Block All" to prevent all crawling, "Block AI Crawlers" to specifically block AI training bots, or "Standard" for a common starting configuration.
- Customize your rules by adding Disallow paths (e.g., /admin, /api) and Allow exceptions.
- Add multiple user-agent groups if you want different rules for different bots.
- Enter your Sitemap URL to help crawlers discover your content.
- Copy the generated output and save it as robots.txt at your website root.
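For illustration, a generated file built from the steps above (with placeholder paths and sitemap URL) might look like this:

User-agent: *
Disallow: /admin
Disallow: /api
Allow: /api/docs

Sitemap: https://example.com/sitemap.xml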
What is robots.txt?
The robots.txt file is a simple text file that follows the Robots Exclusion Protocol (REP), first proposed in 1994 and now widely adopted by all major search engines and web crawlers. Placed at the root of a domain (e.g., https://example.com/robots.txt), it serves as a set of instructions that tell bots which parts of the site they may access.
The file uses a straightforward syntax: User-agent specifies which bot the rules apply to (use * for all bots), Disallow blocks a path, Allow permits a path (overriding a broader Disallow), Sitemap points to your XML sitemap, and Crawl-delay sets a wait time between requests.
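As a brief illustration of that syntax (the paths and URL are made up), a file using all five directives might read:

User-agent: *
Disallow: /private/
Allow: /private/press-kit/
Crawl-delay: 10

Sitemap: https://example.com/sitemap.xml

Here every bot is asked to skip /private/ except the press-kit subfolder, wait 10 seconds between requests (where the bot honors Crawl-delay), and read the sitemap at the given URL.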
Blocking AI Crawlers
With the rise of large language models, many website owners want to prevent AI companies from using their content for training data. Major AI crawlers include GPTBot (OpenAI), ChatGPT-User (OpenAI live browsing), CCBot (Common Crawl), Google-Extended (Google Gemini training), anthropic-ai and Claude-Web (Anthropic), Bytespider (ByteDance), and PerplexityBot (Perplexity AI).
To block these crawlers, add a User-agent directive for each one followed by Disallow: /. This tool's "Block AI Crawlers" preset generates all the necessary rules automatically. Note that blocking these crawlers only prevents future training — content already ingested before the block was added may still be in training datasets.
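For example, the first few groups of such a file might look like this (the preset repeats the same pattern for the remaining bots):

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /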
Common Mistakes
- Blocking CSS and JS files — search engines need to render your pages, so blocking stylesheets and scripts can hurt your rankings.
- Using robots.txt for security — it is publicly readable and does not prevent access; use authentication or server rules for sensitive content.
- Forgetting the leading slash — paths must start with / (e.g., Disallow: /admin, not Disallow: admin).
- Empty Disallow directive — Disallow: with no path means "allow everything," which is the opposite of what many expect (see the example after this list).
- Missing User-agent before rules — every Disallow or Allow must be under a User-agent directive.
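To illustrate the empty-Disallow pitfall, compare these two groups, which look similar but mean opposite things:

User-agent: *
Disallow:

The empty value matches nothing, so every URL is allowed.

User-agent: *
Disallow: /

The single slash matches every path, so every URL is blocked.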
Validator Mode
Switch to Validator mode to paste an existing robots.txt file and check it for syntax errors, missing directives, and common issues. The validator parses the file, shows a summary of all rule groups, and lets you test specific URLs against the rules to see whether they would be allowed or blocked for a given bot.
How Search Engines Interpret robots.txt
Each search engine has its own crawler (user-agent) and may interpret robots.txt rules slightly differently. Googlebot follows the Google robots.txt specification, which supports wildcards in paths — Disallow: /private* blocks any path starting with /private. Bingbot also supports wildcards and the $ end-of-URL anchor. Googlebot ignores the Crawl-delay directive entirely, while Bingbot honors it (Bing also offers crawl control settings in Bing Webmaster Tools).
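For example, a group using both features (with illustrative paths) might be:

User-agent: *
Disallow: /private*
Disallow: /*.pdf$

The first rule blocks any path beginning with /private; the second blocks any URL whose path ends in .pdf.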
The order of rules matters. When multiple rules apply to the same URL, most crawlers follow the most specific matching rule. A longer path match takes precedence over a shorter one. For example, if you have Disallow: /docs/ and Allow: /docs/public/, the /docs/public/ path will be allowed because the Allow rule is more specific.
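A minimal sketch of that longest-match logic, assuming plain prefix rules with no wildcards (this is not the implementation this tool actually uses), could look like this in Python:

def is_allowed(path, rules):
    # rules is a list of (directive, prefix) pairs, e.g. ("Disallow", "/docs/").
    best_len = -1
    allowed = True  # no matching rule means the URL is allowed
    for directive, prefix in rules:
        # Empty prefixes (e.g. a bare "Disallow:") match nothing and are skipped.
        if prefix and path.startswith(prefix) and len(prefix) > best_len:
            best_len = len(prefix)
            allowed = (directive == "Allow")
    # (Real crawlers may prefer Allow when rule lengths tie; this sketch keeps the first longest match.)
    return allowed

rules = [("Disallow", "/docs/"), ("Allow", "/docs/public/")]
print(is_allowed("/docs/internal/page", rules))  # False: only the Disallow rule matches
print(is_allowed("/docs/public/faq", rules))     # True: the longer Allow rule wins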
robots.txt vs Meta Robots Tag
The robots.txt file and the <meta name="robots"> HTML tag serve different purposes. robots.txt prevents crawlers from accessing a page entirely — the crawler never downloads the HTML. The meta robots tag is embedded within the page and instructs crawlers on whether to index the page, follow its links, or cache its content. Use robots.txt to block entire sections of your site from being crawled, and meta robots to control indexing for individual pages that crawlers are allowed to access.
A common mistake is using robots.txt to prevent a page from appearing in search results. While blocking the crawl does prevent the page content from being indexed, the URL itself may still appear in results if other pages link to it. To truly remove a page from search results, use <meta name="robots" content="noindex"> while allowing the crawl, so the search engine can see and obey the noindex directive.
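For example, a page that should stay crawlable but never appear in results would be left unblocked in robots.txt and carry the tag in its HTML head:

<head>
  <meta name="robots" content="noindex">
</head>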
Testing Your robots.txt
After generating your robots.txt file, use Google Search Console's robots.txt Tester to verify that your rules work as intended. Enter specific URLs and user agents to see whether they would be blocked or allowed. Bing Webmaster Tools offers a similar testing feature. Always test critical pages — your homepage, key landing pages, and your sitemap URL — to ensure they are not accidentally blocked. A misconfigured robots.txt can silently deindex your entire site.
Related Tools
Generate Nginx server configurations with the Nginx Config Generator. Create .gitignore files with the .gitignore Generator. Build Content Security Policy headers with the CSP Generator. Look up HTTP status codes in the HTTP Status Code Reference.
Frequently Asked Questions
- What is a robots.txt file?
- A robots.txt file is a plain text file placed at the root of your website (e.g., example.com/robots.txt) that tells web crawlers which pages they are allowed to access and which they are not. It follows the Robots Exclusion Protocol and is used by search engines, AI crawlers, and other bots to determine what content they can index or scrape.
- How do I block AI crawlers like GPTBot and CCBot?
- Add a User-agent directive for each AI crawler followed by Disallow: /. This tool's "Block AI Crawlers" preset generates rules for all major AI crawlers including GPTBot, ChatGPT-User, CCBot, Google-Extended, anthropic-ai, Claude-Web, Bytespider, and more.
- Does robots.txt actually prevent crawling?
- Robots.txt is an advisory protocol — well-behaved bots respect it, but malicious bots may ignore it. It does not provide security or access control. If you need to truly prevent access, use authentication, IP blocking, or server-side access controls.
- What is the difference between Allow and Disallow?
- Disallow tells crawlers not to access a specific path. Allow explicitly permits access to a path, typically used as an exception within a broader Disallow rule. For example, you might Disallow: /admin but Allow: /admin/public.
- What does Crawl-delay do?
- Crawl-delay tells a bot to wait a specified number of seconds between successive requests. Not all bots honor this — Googlebot ignores it, but Bingbot and YandexBot respect it. It is useful for reducing server load from aggressive crawlers.
- Should I include a Sitemap directive?
- Yes. The Sitemap directive points crawlers to your XML sitemap, helping them discover all your pages. This is especially important for new sites or sites with deep link structures.
- Is my data safe with this tool?
- Yes. This tool runs entirely in your browser. No data is sent to any server.
Use this tool from AI agents.
The CodeTidy MCP Server lets Claude, Cursor, and other AI agents use this tool and 46 others directly. One command: npx @codetidy/mcp