The robots.txt file is a plain-text file placed at the root of your website that tells search engine crawlers which parts of your site they can and cannot access. It's one of the oldest web standards — introduced in 1994 — and remains a foundational piece of technical SEO. Understanding how to use it correctly can protect private areas, preserve crawl budget, and prevent indexing problems. Misusing it can hide your entire site from search engines.
What robots.txt Does (and Doesn't Do)
robots.txt controls crawling, not indexing. This is the most misunderstood aspect of robots.txt and the source of most mistakes.
- Crawling = a search engine bot visiting and downloading a page's content.
- Indexing = a search engine adding a page to its search results database.
When you block a URL via robots.txt, you prevent the bot from reading the page. But if other websites link to that page, Google may still index the URL — it just won't have any content to show, resulting in a listing that says "No information is available for this page." To truly prevent indexing, use a noindex robots meta tag on the page itself.
Critical rule: If you want a page to be noindexed, do NOT block it in robots.txt. The bot must be able to crawl the page to see the noindex directive.
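For reference, the noindex directive lives in the page's HTML head (or, for non-HTML files such as PDFs, in an X-Robots-Tag HTTP response header):

```
<meta name="robots" content="noindex">
```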
robots.txt Syntax
The file uses a simple directive-based syntax. Here are the core directives:
| Directive | Purpose | Example |
|---|---|---|
| User-agent | Specifies which bot the rules apply to | User-agent: Googlebot |
| Disallow | Blocks a path from being crawled | Disallow: /admin/ |
| Allow | Overrides a Disallow for a specific sub-path | Allow: /admin/public/ |
| Sitemap | Points to an XML sitemap | Sitemap: https://example.com/sitemap.xml |
| Crawl-delay | Seconds between requests (Bing respects it; Google ignores it) | Crawl-delay: 10 |
Wildcard patterns: Use * to match any sequence of characters and $ to match the end of a URL. For example, Disallow: /*.pdf$ blocks all PDF files across your entire site.
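Google's wildcard matching can be approximated with a short Python sketch (pattern_to_regex is a name we made up for illustration; this shows the matching rules, it is not Google's actual parser):

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into an anchored regex."""
    anchored = pattern.endswith("$")          # $ pins the match to the URL's end
    body = pattern[:-1] if anchored else pattern
    # Escape regex metacharacters, then restore * as "any character sequence".
    regex = re.escape(body).replace(r"\*", ".*")
    return re.compile("^" + regex + ("$" if anchored else ""))

rule = pattern_to_regex("/*.pdf$")
print(bool(rule.match("/files/report.pdf")))      # True: any .pdf URL matches
print(bool(rule.match("/files/report.pdf?v=2")))  # False: $ requires .pdf at the end
```

A pattern without $ behaves like Disallow's normal "starts with" rule, since the regex is only anchored at the front.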
User-Agent Targeting
You can create different rule sets for different bots. Common user agents include:
- * — matches all crawlers (the wildcard default).
- Googlebot — Google's primary web crawler.
- Bingbot — Microsoft Bing's crawler.
- Googlebot-Image — Google's image crawler (block to prevent image indexing).
- GPTBot — OpenAI's crawler for AI training data.
- CCBot — Common Crawl's bot, often blocked by sites that don't want AI training scraping.
When a bot checks robots.txt, it looks for a section matching its specific user agent first. If none exists, it falls back to the * rules. One bot cannot inherit rules from another bot's section — each section is independent.
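Python's standard-library urllib.robotparser follows the same matching-and-fallback logic and is convenient for quick local checks (note that it implements the original spec, without Google's * and $ wildcard extensions):

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /admin/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Googlebot matches its own section and ignores the * rules entirely.
print(rp.can_fetch("Googlebot", "/admin/page"))    # True
print(rp.can_fetch("Googlebot", "/private/page"))  # False
# Bingbot has no section of its own, so it falls back to *.
print(rp.can_fetch("Bingbot", "/admin/page"))      # False
```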
Common robots.txt Examples
Allow everything (default if no robots.txt exists):
```
User-agent: *
Disallow:
```
Block a single directory:
```
User-agent: *
Disallow: /admin/
Disallow: /staging/
```
Block everything except one directory:
```
User-agent: *
Disallow: /
Allow: /public/
```
Block AI crawlers while allowing search engines:
```
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
```
Crawl Budget Optimization
Crawl budget matters primarily for large sites (100,000+ pages). For smaller sites, Google generally crawls everything without issues. But for large sites, wasting crawl budget on low-value pages means important pages get crawled less frequently.
Pages to consider blocking:
- Faceted navigation — filtering URLs like /shoes?color=red&size=10&sort=price can create millions of near-duplicate pages.
- Internal search results — /search?q= pages generate infinite URL combinations.
- Session/parameter URLs — URLs with session IDs, tracking parameters, or pagination beyond a reasonable depth.
- Print/PDF versions — duplicate content in alternative formats.
- Staging/development paths — test environments that shouldn't be indexed.
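A sketch of what such blocking could look like; the parameter names and paths below are illustrative, not a recommendation to copy verbatim:

```
User-agent: *
# Internal search results
Disallow: /search
# Faceted navigation and session parameters (illustrative names)
Disallow: /*?*sort=
Disallow: /*?*sessionid=
# Print-friendly duplicates
Disallow: /print/
```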
Blocking vs Noindexing: When to Use Which
| Scenario | Use robots.txt? | Use noindex? | Why |
|---|---|---|---|
| Private admin area | Yes | Also yes | Belt-and-suspenders protection |
| Thin content pages you want de-indexed | No | Yes | Bot must see the noindex directive |
| Infinite faceted navigation | Yes | Optional | Preserves crawl budget at scale |
| Pages with sensitive information | No | No | Use authentication — robots.txt is public |
| Duplicate content with canonical tag | No | No | Let bot crawl, read canonical, and consolidate |
Testing Your robots.txt
Always validate your robots.txt before deploying changes to production. A single syntax error can accidentally block your entire site.
- Google Search Console — the robots.txt report (under Settings) shows which version of your file Google last fetched and flags any rules its parser can't read. This is the most reliable check because it uses Google's own parser.
- Manual testing — visit yourdomain.com/robots.txt in a browser and visually inspect the output. Confirm you can see the file, that it returns a 200 status code, and that the content type is text/plain.
- Staging first — test changes on a staging environment before pushing to production. A robots.txt mistake on a live site can de-index pages within hours if Googlebot crawls during the window.
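The staging check can be automated with a small regression test built on urllib.robotparser; check_not_blocked is our own helper, written so a deploy fails fast if a critical path is blocked:

```python
from urllib.robotparser import RobotFileParser

def check_not_blocked(robots_txt: str, must_crawl: list[str]) -> list[str]:
    """Return every must-crawl path that the file blocks for all bots (*)."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return [path for path in must_crawl if not rp.can_fetch("*", path)]

# A catastrophic file: a bare "Disallow: /" blocks the whole site.
broken = check_not_blocked("User-agent: *\nDisallow: /\n", ["/", "/products/"])
print(broken)  # ['/', '/products/']
```

Run this against the robots.txt you are about to deploy, with your homepage and key landing pages in the must-crawl list.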
robots.txt & International Sites
For sites with multiple subdomains or country-code paths, keep these considerations in mind:
- Each subdomain requires its own robots.txt — blog.example.com/robots.txt and shop.example.com/robots.txt are completely independent files.
- Subdirectory-based internationalization (/en/, /fr/, /de/) shares a single root robots.txt. Define rules per path prefix: Disallow: /fr/admin/.
- Include sitemap references for all language versions to ensure comprehensive crawling of international content.
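For example, a single root robots.txt for a subdirectory-based site might look like this (the paths and sitemap URLs are illustrative):

```
User-agent: *
Disallow: /en/admin/
Disallow: /fr/admin/
Disallow: /de/admin/

Sitemap: https://example.com/sitemap-en.xml
Sitemap: https://example.com/sitemap-fr.xml
Sitemap: https://example.com/sitemap-de.xml
```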
Common Mistakes
- Blocking your entire site — a single Disallow: / under User-agent: * hides everything. This happens more often than you'd think, especially after launching from a staging environment.
- Blocking CSS and JavaScript — Google needs to render pages to understand them. Blocking CSS/JS files prevents rendering, which hurts both indexing and rankings.
- Using robots.txt for security — robots.txt is publicly readable. Anyone can visit /robots.txt to see exactly which paths you're "hiding." Never rely on it for security.
- Conflicting with noindex — blocking a page in robots.txt AND adding noindex to it means the bot can't reach the page to see the noindex tag. The page might still get indexed via external links.
- Forgetting the trailing slash — Disallow: /directory is a prefix match, so it blocks /directory/, /directory.html, and /directory-other/ alike. Add a trailing slash (Disallow: /directory/) to block only that directory.
- Not including a Sitemap directive — while not required, pointing to your sitemap in robots.txt ensures crawlers can find it even if it's not submitted in Search Console.
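The trailing-slash behavior is easy to verify with urllib.robotparser (the blocked helper is our own):

```python
from urllib.robotparser import RobotFileParser

def blocked(robots_txt: str, path: str) -> bool:
    """True if the given path is disallowed for all bots (*)."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return not rp.can_fetch("*", path)

loose = "User-agent: *\nDisallow: /directory\n"
strict = "User-agent: *\nDisallow: /directory/\n"

print(blocked(loose, "/directory-other/page"))   # True: bare prefix catches it
print(blocked(strict, "/directory-other/page"))  # False: trailing slash is precise
print(blocked(strict, "/directory/page"))        # True: still blocks the directory
```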
Try It Yourself
Build a properly formatted robots.txt file with user-agent targeting, disallow rules, allow overrides, and sitemap references — ready to deploy.
robots.txt Generator →