robots.txt & Crawl Control Guide for SEO

The robots.txt file is a plain-text file placed at the root of your website that tells search engine crawlers which parts of your site they can and cannot access. It's one of the oldest web standards — introduced in 1994 — and remains a foundational piece of technical SEO. Understanding how to use it correctly can protect private areas, preserve crawl budget, and prevent indexing problems. Misusing it can hide your entire site from search engines.

What robots.txt Does (and Doesn't Do)

robots.txt controls crawling, not indexing. This is the most misunderstood aspect of robots.txt and the source of most mistakes.

  • Crawling = a search engine bot visiting and downloading a page's content.
  • Indexing = a search engine adding a page to its search results database.

When you block a URL via robots.txt, you prevent the bot from reading the page. But if other websites link to that page, Google may still index the URL — it just won't have any content to show, resulting in a listing that says "No information is available for this page." To truly prevent indexing, use a noindex robots meta tag on the page itself.

Critical rule: If you want a page to be noindexed, do NOT block it in robots.txt. The bot must be able to crawl the page to see the noindex directive.
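
Concretely, the noindex directive lives on the page itself, either as an HTML meta tag or as an HTTP response header (the header form works for non-HTML files such as PDFs):

```
<!-- HTML form: inside the page's <head> -->
<meta name="robots" content="noindex">

<!-- HTTP header equivalent, useful for PDFs and other non-HTML files -->
X-Robots-Tag: noindex
```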

robots.txt Syntax

The file uses a simple directive-based syntax. Here are the core directives:

  • User-agent — specifies which bot the rules apply to. Example: User-agent: Googlebot
  • Disallow — blocks a path from being crawled. Example: Disallow: /admin/
  • Allow — overrides a Disallow for a specific sub-path. Example: Allow: /admin/public/
  • Sitemap — points to an XML sitemap. Example: Sitemap: https://example.com/sitemap.xml
  • Crawl-delay — seconds between requests (Bing respects it; Google ignores it). Example: Crawl-delay: 10

Wildcard patterns: Use * to match any sequence of characters and $ to match the end of a URL. For example, Disallow: /*.pdf$ blocks all PDF files across your entire site.
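
Under the hood, this pattern matching can be sketched in a few lines of Python — a toy matcher for illustration, not a full robots.txt parser:

```python
import re

def robots_pattern_matches(pattern: str, path: str) -> bool:
    """Check whether a robots.txt path pattern matches a URL path.

    Implements the two wildcard operators described above:
    '*' matches any sequence of characters, '$' anchors the end.
    Patterns are prefix matches unless anchored with '$'.
    """
    regex = ""
    for ch in pattern:
        if ch == "*":
            regex += ".*"
        elif ch == "$":
            regex += "$"
        else:
            regex += re.escape(ch)
    # re.match anchors at the start, giving prefix semantics
    return re.match(regex, path) is not None

# Disallow: /*.pdf$ blocks any PDF anywhere on the site
print(robots_pattern_matches("/*.pdf$", "/docs/report.pdf"))      # True
print(robots_pattern_matches("/*.pdf$", "/docs/report.pdf?x=1"))  # False
# Without a trailing slash, /admin also matches /admin/login
print(robots_pattern_matches("/admin", "/admin/login"))           # True
```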

User-Agent Targeting

You can create different rule sets for different bots. Common user agents include:

  • * — matches all crawlers (the wildcard default).
  • Googlebot — Google's primary web crawler.
  • Bingbot — Microsoft Bing's crawler.
  • Googlebot-Image — Google's image crawler (block to prevent image indexing).
  • GPTBot — OpenAI's crawler for AI training data.
  • CCBot — Common Crawl's bot, often blocked by sites that don't want AI training scraping.

When a bot checks robots.txt, it looks for a section matching its specific user agent first. If none exists, it falls back to the * rules. One bot cannot inherit rules from another bot's section — each section is independent.
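
Python's standard-library urllib.robotparser implements this lookup and can be used to sanity-check the fallback behavior; a small sketch with made-up rules:

```python
from urllib import robotparser

# Hypothetical rules: a specific section for Googlebot, a strict default
rules = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Googlebot matches its own section only: /private/ is blocked, the rest allowed
print(rp.can_fetch("Googlebot", "https://example.com/page"))       # True
print(rp.can_fetch("Googlebot", "https://example.com/private/x"))  # False
# Any other bot falls back to the * section, which blocks everything
print(rp.can_fetch("Bingbot", "https://example.com/page"))         # False
```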

Common robots.txt Examples

Allow everything (default if no robots.txt exists):

User-agent: *
Disallow:

Block a single directory:

User-agent: *
Disallow: /admin/
Disallow: /staging/

Block everything except one directory:

User-agent: *
Disallow: /
Allow: /public/

Block AI crawlers while allowing search engines:

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml

Crawl Budget Optimization

Crawl budget matters primarily for large sites (100,000+ pages). For smaller sites, Google generally crawls everything without issues. But for large sites, wasting crawl budget on low-value pages means important pages get crawled less frequently.

Pages to consider blocking:

  • Faceted navigation — filtering URLs like /shoes?color=red&size=10&sort=price can create millions of near-duplicate pages.
  • Internal search results — /search?q= pages generate infinite URL combinations.
  • Session/parameter URLs — URLs with session IDs, tracking parameters, or pagination beyond reasonable depth.
  • Print/PDF versions — duplicate content in alternative formats.
  • Staging/development paths — test environments that shouldn't be indexed.
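
Taken together, rules for the categories above might look like this (all paths and parameter names are illustrative; adapt them to your own URL structure):

```
User-agent: *
# Faceted navigation: block common filter/sort parameters
Disallow: /*?*sort=
Disallow: /*?*color=
# Internal search results
Disallow: /search
# Print and PDF duplicates
Disallow: /print/
Disallow: /*.pdf$
# Staging paths
Disallow: /staging/
```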

Blocking vs Noindexing: When to Use Which

  • Private admin area — robots.txt: yes; noindex: also yes. Belt-and-suspenders protection.
  • Thin content pages you want de-indexed — robots.txt: no; noindex: yes. The bot must see the noindex directive.
  • Infinite faceted navigation — robots.txt: yes; noindex: optional. Preserves crawl budget at scale.
  • Pages with sensitive information — robots.txt: no; noindex: no. Use authentication; robots.txt is public.
  • Duplicate content with a canonical tag — robots.txt: no; noindex: no. Let the bot crawl, read the canonical, and consolidate.
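
For non-HTML files such as PDFs, which cannot carry a meta tag, the noindex option above can be satisfied with an X-Robots-Tag response header instead. A sketch for Apache, assuming mod_headers is enabled (the file pattern is illustrative):

```apache
# Send a noindex header with every PDF response
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>
```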

Testing Your robots.txt

Always validate your robots.txt before deploying changes to production. A single syntax error can accidentally block your entire site.

  • Google Search Console — the robots.txt Tester (under Settings) lets you enter any URL from your site and shows exactly which rule allows or blocks it. This is the most reliable testing method because it uses Google's own parser.
  • Manual testing — visit yourdomain.com/robots.txt in a browser and visually inspect the output. Confirm you can see the file, it returns a 200 status code, and the content-type is text/plain.
  • Staging first — test changes on a staging environment before pushing to production. A robots.txt mistake on a live site can de-index pages within hours if Googlebot crawls during the window.
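
Part of that pre-deploy validation can be automated. A rough Python lint that flags misspelled directives — a sanity check only, not a substitute for Google's own tester:

```python
KNOWN_DIRECTIVES = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}

def lint_robots_txt(content: str) -> list[str]:
    """Flag lines that are neither blank, comments, nor known directives."""
    problems = []
    for lineno, line in enumerate(content.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comments are fine
        if ":" not in stripped:
            problems.append(f"line {lineno}: missing ':' separator")
            continue
        directive = stripped.split(":", 1)[0].strip().lower()
        if directive not in KNOWN_DIRECTIVES:
            problems.append(f"line {lineno}: unknown directive '{directive}'")
    return problems

print(lint_robots_txt("User-agent: *\nDisalow: /admin/"))
# → ["line 2: unknown directive 'disalow'"]
```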

robots.txt & International Sites

For sites with multiple subdomains or country-code paths, keep these considerations in mind:

  • Each subdomain requires its own robots.txt — blog.example.com/robots.txt and shop.example.com/robots.txt are completely independent files.
  • Subdirectory-based internationalization (/en/, /fr/, /de/) shares a single root robots.txt. Define rules per path prefix: Disallow: /fr/admin/.
  • Include sitemap references for all language versions to ensure comprehensive crawling of international content.
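
Putting those three points together, a root robots.txt for a subdirectory-based international site might look like this (locale paths and sitemap names are illustrative):

```
# Served at example.com/robots.txt; covers all subdirectory locales
User-agent: *
Disallow: /en/admin/
Disallow: /fr/admin/
Disallow: /de/admin/

Sitemap: https://example.com/sitemap-en.xml
Sitemap: https://example.com/sitemap-fr.xml
Sitemap: https://example.com/sitemap-de.xml
```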

Common Mistakes

  1. Blocking your entire site — a single Disallow: / under User-agent: * hides everything. This happens more often than you'd think, especially after launching from a staging environment.
  2. Blocking CSS and JavaScript — Google needs to render pages to understand them. Blocking CSS/JS files prevents rendering, which hurts both indexing and rankings.
  3. Using robots.txt for security — robots.txt is publicly readable. Anyone can visit /robots.txt to see exactly which paths you're "hiding." Never rely on it for security.
  4. Conflicting with noindex — blocking a page in robots.txt AND adding noindex to it means the bot can't reach the page to see the noindex tag. The page might still get indexed via external links.
  5. Forgetting the trailing slash — Disallow: /directory blocks both /directory and /directory-other/. Add a trailing slash (Disallow: /directory/) to be more precise.
  6. Not including a Sitemap directive — while not required, pointing to your sitemap in robots.txt ensures crawlers can find it even if it's not submitted in Search Console.


Frequently Asked Questions

Does blocking a URL in robots.txt prevent it from being indexed?

No. robots.txt only prevents crawling, not indexing. If other pages link to a URL that's blocked by robots.txt, Google may still index that URL based on external signals — it just won't be able to crawl the page content. The result is an indexed page with no snippet, showing "No information is available for this page" in search results. To prevent indexing, use a robots meta tag with noindex on the page itself.

Where does the robots.txt file need to be placed?

The robots.txt file must be placed in the root directory of your domain and accessible at yourdomain.com/robots.txt. It must be at the exact root level — not in a subdirectory. For subdomains, each subdomain needs its own robots.txt file (blog.example.com/robots.txt is separate from example.com/robots.txt). The file must be served with a 200 status code and a text/plain content type.

Can robots.txt block individual pages?

Yes, but with limitations. You can disallow specific paths like "Disallow: /private-page/" to block crawling of individual URLs. You can also use wildcards: "Disallow: /category/*/page/" blocks all paginated category pages. However, remember that blocking crawling doesn't prevent indexing. For sensitive pages, combine robots.txt blocking with a noindex meta tag and consider removing the page from sitemaps as well.

What is crawl budget, and when does it matter?

Crawl budget is the number of pages a search engine bot will crawl on your site within a given timeframe. It's determined by two factors: crawl rate limit (how fast the bot can crawl without overloading your server) and crawl demand (how interested Google is in your content). For small sites under 10,000 pages, crawl budget is rarely a concern. For large sites with millions of pages, optimizing robots.txt to block low-value pages helps ensure important pages get crawled frequently.

How do I test my robots.txt?

Use Google Search Console's robots.txt Tester tool (found under Settings > robots.txt) to check whether specific URLs are blocked or allowed by your robots.txt rules. You can also test by entering a URL and seeing which rule applies. For syntax validation, paste your robots.txt content into any robots.txt validator tool. Always test after making changes and before deploying to production.