
Robots.txt Configuration

Control how search engines crawl your site with robots.txt directives to optimize crawl budget and protect sensitive paths.

What is Robots.txt?

Robots.txt is a plain text file placed at the root of your website that instructs search engine crawlers (also called bots, spiders, or user-agents) which pages or files they can or cannot request from your site. It's the first file search engines look for when they visit your domain.

Located at https://yourdomain.com/robots.txt, this file acts as a gatekeeper—telling crawlers like Googlebot, Bingbot, and others which areas of your site are open for exploration and which should be left alone.

The Purpose of Robots.txt

The primary purposes of a robots.txt file are:

  1. Manage crawl budget — Prevent crawlers from wasting time on unimportant pages
  2. Protect sensitive paths — Keep private areas like admin panels out of search indexes
  3. Control server load — Reduce the number of requests bots make to your server
  4. Guide crawlers to important content — Direct attention to pages that matter most

Where to Place Robots.txt

The robots.txt file must be placed at the root of your domain. There can only be one robots.txt file per domain, and it must be accessible at the exact URL:

https://example.com/robots.txt

Important placement rules:

  • Must be at the domain root (not in a subdirectory)
  • Must be accessible via HTTP with a 200 status code
  • Only one robots.txt file per domain (subdomains can have their own)
  • File name must be exactly robots.txt (lowercase)

If your site uses subdomains (like blog.example.com or shop.example.com), each subdomain needs its own robots.txt file at its respective root.
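
Because every host serves its robots.txt at the same fixed path, the lookup URL can be derived mechanically. A minimal sketch in Python (the helper name robots_url is our own; urllib.parse is the standard library):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for whatever host serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace path/query/fragment with /robots.txt
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://blog.example.com/posts/hello?ref=x"))
# https://blog.example.com/robots.txt
```

Note how a subdomain automatically yields its own robots.txt URL, matching the rule that each subdomain needs its own file.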

Why Robots.txt Matters for SEO

While robots.txt doesn't directly improve rankings, it plays a crucial role in how search engines discover and crawl your content. Proper configuration can significantly impact your SEO performance.

Crawl Budget Optimization

Crawl budget is the number of pages Google allocates to crawl on your site within a given timeframe. For large sites (thousands of pages), crawl budget matters—you want Google spending its limited time on your most important content.

Without robots.txt guidance, crawlers may waste valuable crawl budget on:

  • Duplicate content pages (print versions, sort orders)
  • Auto-generated pages (calendar views, search results)
  • Admin and backend pages
  • Low-value content archives

By blocking these paths, you ensure crawlers focus on pages that actually drive traffic and conversions.

Indexation Control

While robots.txt doesn't directly prevent pages from being indexed (use noindex meta tags for that), it influences which pages search engines discover and crawl. Pages blocked by robots.txt won't be crawled, which typically means they won't be indexed.

Server Performance

Aggressive crawling can strain server resources. Robots.txt helps manage the pace at which bots access your site, protecting server performance for real users.

Preventing Sensitive Content Exposure

While robots.txt is not a security measure, it helps keep sensitive paths out of search results. However, never rely on robots.txt for security—anyone can view the file and see which paths you're hiding.

Directing Crawlers to Your Sitemap

Robots.txt is the standard place to declare your XML sitemap location, helping search engines discover all your important pages efficiently.

Allow and Disallow Directives

The Disallow and Allow directives are the core building blocks of robots.txt. They tell search engine crawlers which paths they can or cannot access.

User-agent Directive

Every set of directives starts with a User-agent line that specifies which crawler the rules apply to. You can target specific bots or use the wildcard * to apply rules to all crawlers.

Syntax: User-agent: [bot-name-or-*]

Disallow Directive

The Disallow directive specifies which paths are off-limits to the specified user-agent.

Syntax: Disallow: [path]

Examples:

  • Disallow: / — Blocks the entire site
  • Disallow: /admin/ — Blocks the /admin/ directory
  • Disallow: /private-page.html — Blocks a specific page
  • Disallow: (empty) — Allows everything (no restrictions)
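
Disallow rules can be checked locally before deployment with Python's standard-library urllib.robotparser (the sample rules and example.com URLs here are illustrative):

```python
from urllib import robotparser

# Sample rules, parsed from a string -- no HTTP fetch required
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /private-page.html
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://example.com/admin/users"))        # False
print(rp.can_fetch("*", "https://example.com/private-page.html"))  # False
print(rp.can_fetch("*", "https://example.com/about"))              # True
```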

Allow Directive

The Allow directive overrides Disallow for specific paths. This is useful when you've blocked a directory but want to allow access to specific files within it.

Syntax: Allow: [path]

Examples:

  • Allow: /assets/images/ — Explicitly allows a specific path
  • Allow: /public/file.html — Allows a specific file

Rule Priority and Order

When both Allow and Disallow apply to a URL, Google uses the more specific rule (the longer path). However, the order of rules also matters to some parsers—place more specific rules after general ones for clarity.

Text
# Block all crawlers from the entire site
User-agent: *
Disallow: /

# Block all crawlers from specific directories
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /temp/

# Allow a specific file in a blocked directory
User-agent: *
Disallow: /private/
Allow: /private/public-info.html

Basic Allow and Disallow directive examples

Text
# Block everything except images
User-agent: *
Disallow: /
Allow: /images/
Allow: /assets/*.jpg$
Allow: /assets/*.png$

# In this example:
# - All paths are blocked by default
# - /images/ directory is allowed
# - .jpg and .png files in /assets/ are allowed

Using Allow to override Disallow for specific paths
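
The override behaviour can be exercised locally with Python's urllib.robotparser. One caveat: the stdlib parser resolves rules in file order (first match wins), while Google prefers the most specific (longest) rule; listing the Allow line first, as below, makes both interpretations agree. The rule set is invented for illustration:

```python
from urllib import robotparser

rules = """\
User-agent: *
Allow: /private/public-info.html
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# The Allow rule carves a single file out of the blocked directory
print(rp.can_fetch("*", "https://example.com/private/public-info.html"))  # True
print(rp.can_fetch("*", "https://example.com/private/secret.html"))       # False
```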

Sitemap Reference in Robots.txt

The Sitemap directive tells search engines where to find your XML sitemap. While not required, including your sitemap location in robots.txt is a best practice that helps crawlers discover all your important pages.

Sitemap Directive Syntax

The sitemap directive is simple: just provide the full URL to your XML sitemap file.

Syntax: Sitemap: [full-url]

Benefits of Declaring Your Sitemap

  1. Discovery — Search engines automatically find your sitemap without manual submission
  2. Centralized information — All crawl guidance in one file
  3. Multiple sitemaps — You can declare multiple sitemaps for different content types
  4. Cross-subdomain sitemaps — Reference sitemaps on different domains

Where to Place Sitemap Directives

Sitemap directives can appear anywhere in your robots.txt file—typically at the beginning or end. Unlike User-agent groups, sitemap directives are independent and apply globally.

Multiple Sitemaps

Large sites often split sitemaps by content type (pages, images, videos, news). You can declare all of them in your robots.txt:

Text
# robots.txt with sitemap declaration
User-agent: *
Disallow: /admin/
Disallow: /search/

Sitemap: https://example.com/sitemap.xml

# Multiple sitemaps for different content types
Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://example.com/sitemap-images.xml
Sitemap: https://example.com/sitemap-news.xml

Robots.txt with single and multiple sitemap declarations

Text
# Sitemap on a different domain (useful for cross-subdomain)
User-agent: *
Disallow: /private/

Sitemap: https://sitemaps.example.com/main-sitemap.xml

Sitemap directive can reference sitemaps on external domains
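
Declared sitemaps are machine-readable, too. Python's urllib.robotparser (3.8+) exposes them via site_maps(); the rules below are illustrative:

```python
from urllib import robotparser

rules = """\
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://example.com/sitemap-images.xml
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# site_maps() returns every declared Sitemap URL (or None if there are none)
print(rp.site_maps())
# ['https://example.com/sitemap-pages.xml', 'https://example.com/sitemap-images.xml']
```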

Wildcards and Patterns

Robots.txt supports two wildcard characters that give you powerful pattern-matching capabilities: the asterisk * and the dollar sign $.

Asterisk Wildcard (*)

The * matches any sequence of characters (including zero characters). Use it to block or allow groups of similar URLs.

Common use cases:

  • Blocking URLs with specific query parameters
  • Blocking file types across all directories
  • Blocking auto-generated pages with patterns

Dollar Sign Wildcard ($)

The $ matches the end of a URL. Use it to target specific file extensions or exact URL endings.

Common use cases:

  • Blocking all PDF files
  • Blocking specific file types (.pdf, .xls)
  • Blocking pages with specific endings

Combining Wildcards

You can combine * and $ in a single rule for precise targeting.

Pattern matching examples:

  • /*.pdf$ — Any path ending in .pdf
  • /page*/ — Any path starting with "page"
  • /*?print= — Any URL containing "?print="
  • /*.js$ — Any JavaScript file at any path depth

Text
# Block all PDF files from being crawled
User-agent: *
Disallow: /*.pdf$

# Block all URLs with "print" parameter
User-agent: *
Disallow: /*?print=

# Block all files in any "tmp" directory at any level
User-agent: *
Disallow: /*/tmp/

# Block specific file types
User-agent: *
Disallow: /*.xls$
Disallow: /*.doc$
Disallow: /*.pdf$

# Block search results pages with query parameters
User-agent: *
Disallow: /search?
Disallow: /*?sort=
Disallow: /*?filter=

Wildcard patterns for blocking specific URL patterns

Text
# Allow images but block other assets
User-agent: *
Disallow: /assets/
Allow: /assets/*.jpg$
Allow: /assets/*.png$
Allow: /assets/*.gif$
Allow: /assets/*.webp$

# Block admin and test pages
User-agent: *
Disallow: /admin*
Disallow: /*test/
Disallow: /staging/

Combining wildcards with Allow and Disallow directives
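
Note that Python's urllib.robotparser does not implement these Google-style wildcards, so a quick way to experiment with them is a tiny matcher of your own. A sketch under that assumption (rule_matches is our own helper; real crawlers also normalize percent-encoding, which this ignores):

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Google-style robots.txt matching: * matches any run of characters,
    and a trailing $ anchors the pattern to the end of the URL."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    regex = "^" + regex + ("$" if anchored else "")
    return re.search(regex, path) is not None

print(rule_matches("/*.pdf$", "/docs/report.pdf"))      # True
print(rule_matches("/*.pdf$", "/docs/report.pdf?v=2"))  # False (query breaks the $ anchor)
print(rule_matches("/*?print=", "/article?print=1"))    # True
```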

Common Robots.txt Mistakes

Even experienced developers make mistakes with robots.txt. Understanding these common pitfalls helps you avoid SEO problems.

1. Blocking CSS and JavaScript

One of the most damaging mistakes is blocking CSS and JavaScript files. Modern search engines need to render pages to understand them fully—if you block these resources, Google can't see your page as users do.

Problem: Blocking /assets/ or /js/ directories prevents Google from rendering your pages correctly.

Solution: Allow CSS and JavaScript files explicitly.

2. Blocking Your Entire Site

A single Disallow: / directive blocks your entire site from being crawled. This is sometimes done accidentally during development or staging and forgotten when going live.

Impact: Your site disappears from search results entirely.

Prevention: Always check your robots.txt before going live. Use Google Search Console to test.
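
A pre-deploy check for this specific disaster can be scripted. A rough sketch in Python (blocks_entire_site is our own helper, not a full parser: it only looks for Disallow: / inside a User-agent: * group):

```python
def blocks_entire_site(robots_txt: str) -> bool:
    """Return True if the file contains the catastrophic 'Disallow: /'
    in a rule group that applies to all crawlers (User-agent: *)."""
    in_star_group = False
    seen_rule = False  # a rule line closes the current list of user-agents
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if seen_rule:            # a new group is starting
                in_star_group = False
                seen_rule = False
            in_star_group = in_star_group or value == "*"
        elif field in ("disallow", "allow"):
            seen_rule = True
            if field == "disallow" and value == "/" and in_star_group:
                return True
    return False

print(blocks_entire_site("User-agent: *\nDisallow: /"))        # True
print(blocks_entire_site("User-agent: *\nDisallow: /admin/"))  # False
```

Running a check like this in a CI pipeline catches the staging-file-left-in-production mistake before it reaches search engines.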

3. Wrong Path Format

Paths in robots.txt are case-sensitive and must match your actual URL structure exactly.

Common errors:

  • Wrong case: /Admin/ vs /admin/
  • Missing trailing slash: /admin vs /admin/
  • Not matching URL-encoded characters

4. Using Robots.txt for Security

Critical misconception: Robots.txt is NOT a security measure. It's merely a suggestion to well-behaved crawlers.

Problems with relying on robots.txt for security:

  • Malicious bots ignore robots.txt completely
  • Anyone can view your robots.txt and see "hidden" paths
  • Disallowed pages can still be indexed if linked from other sites

Solution: Use proper authentication, authorization, and noindex tags for sensitive content.

5. Conflicting Rules

Multiple rules can conflict, causing unexpected behavior. Always test your robots.txt with Google's testing tools.

6. Forgetting About Crawl-Delay

Some crawlers respect a Crawl-delay directive that slows their requests. While not part of the official standard, it can help manage server load—though Google doesn't support it.

Text
# MISTAKE: Blocking CSS/JS prevents proper rendering
User-agent: *
Disallow: /assets/    # This blocks CSS and JS!
Disallow: /js/        # Also blocks JavaScript

# CORRECT: Allow CSS and JS for search engines
User-agent: *
Disallow: /assets/
Allow: /assets/*.css$
Allow: /assets/*.js$
Allow: /assets/*.woff2$

# Or better: Only block truly private paths
User-agent: *
Disallow: /admin/
Disallow: /private/
# Don't block CSS/JS at all

Mistake: blocking CSS and JavaScript prevents Google from rendering pages

Text
# DANGEROUS: This blocks your entire site!
User-agent: *
Disallow: /

# This is useful for staging/development sites
# but catastrophic if left on a production site

# Always verify your robots.txt doesn't contain this
# unless you intentionally want to block everything

The most dangerous mistake: blocking your entire site from crawlers

User-Agent Targeting

Different search engines use different crawler bots, each with its own user-agent name. You can create specific rules for individual crawlers or use the wildcard * to apply rules to all bots.

The Wildcard User-agent (*)

Using User-agent: * applies rules to all crawlers that respect robots.txt. This is the most common approach.

Targeting Specific Bots

For more granular control, target specific crawlers by name:

  • Google: Googlebot
  • Google Images: Googlebot-Image
  • Bing: bingbot
  • Yahoo: Slurp
  • DuckDuckGo: DuckDuckBot
  • Baidu: Baiduspider
  • Yandex: Yandex

Rule Group Order

When you have multiple user-agent groups, crawlers look for their specific user-agent first. If none exists, they fall back to * rules.

Best practice: Place specific user-agent rules before generic * rules for clarity.
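
The fallback behaviour can be demonstrated with urllib.robotparser (rule set invented for illustration): a bot with its own group uses only that group, while other bots fall back to the * rules.

```python
from urllib import robotparser

rules = """\
User-agent: Googlebot
Disallow: /temp/

User-agent: *
Disallow: /admin/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Googlebot matched its own group, so the * rules do not apply to it
print(rp.can_fetch("Googlebot", "https://example.com/admin/"))  # True
print(rp.can_fetch("Googlebot", "https://example.com/temp/x"))  # False
# bingbot has no group of its own and falls back to the * rules
print(rp.can_fetch("bingbot", "https://example.com/admin/"))    # False
```

This is a common surprise: giving Googlebot its own group means it ignores the generic group entirely, so any shared restrictions must be repeated inside it.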

Text
# Rules for all crawlers (default)
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /search?

# Specific rules for Googlebot only
User-agent: Googlebot
Disallow: /temp/
Allow: /special-google-content/

# Specific rules for Bingbot only
User-agent: bingbot
Disallow: /heavy-pages/
Crawl-delay: 10

# Block an aggressive bot completely
User-agent: BadBot
Disallow: /

Targeting specific user-agents with custom rules for each crawler

Text
# Allow all bots to access everything
User-agent: *
Disallow:

# This empty Disallow means "disallow nothing"
# Equivalent to allowing access to everything

# A completely permissive robots.txt
User-agent: *
Allow: /
Disallow:

Permissive robots.txt that allows all crawlers full access

Security Considerations

A critical misconception about robots.txt is that it provides security. It does not. Understanding what robots.txt can and cannot do is essential for protecting your site.

Robots.txt is NOT a Security Measure

The robots.txt file is publicly viewable—anyone can access https://yourdomain.com/robots.txt and see exactly which paths you're trying to hide. Malicious bots and scrapers completely ignore robots.txt directives.

Never use robots.txt to hide:

  • Admin panel URLs
  • User data directories
  • API endpoints
  • Configuration files
  • Any truly sensitive information

Disallowed Pages Can Still Be Indexed

Blocking a page in robots.txt prevents crawling, but doesn't prevent indexing. Google can still index a blocked URL if:

  1. It's linked from another site
  2. It appears in your sitemap
  3. Google discovered it before you added the block

The noindex Alternative

For pages you want to keep out of search results, use noindex directives instead of (or in addition to) robots.txt:

Meta tag approach: <meta name="robots" content="noindex">

HTTP header approach: X-Robots-Tag: noindex

The noindex directive explicitly tells search engines not to include a page in their index—much more reliable than robots.txt for preventing indexing.
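
When auditing pages, the robots meta directives can be extracted with the standard-library html.parser (the RobotsMetaFinder class and the sample HTML are our own):

```python
from html.parser import HTMLParser

class RobotsMetaFinder(HTMLParser):
    """Collect the content of every <meta name="robots"> tag in a page."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and attr.get("name", "").lower() == "robots":
            self.directives.append(attr.get("content", ""))

page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
finder = RobotsMetaFinder()
finder.feed(page)
print(finder.directives)  # ['noindex, nofollow']
```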

When to Use Each Approach

  • Prevent crawling: robots.txt Disallow
  • Prevent indexing: noindex meta tag or header
  • Block from search results: noindex (not robots.txt)
  • Manage crawl budget: robots.txt
  • Hide sensitive content: proper authentication (not robots.txt)

Best Practice: Defense in Depth

For maximum protection:

  1. Use authentication for truly sensitive areas
  2. Use noindex to prevent indexing
  3. Use robots.txt to prevent crawling
  4. Never rely on robots.txt alone for security

HTML
<!-- Meta tag for noindex (in the <head> section) -->
<meta name="robots" content="noindex, nofollow">

<!-- This tells search engines:
     - noindex: Don't include this page in search results
     - nofollow: Don't follow links on this page -->

<!-- For specific search engines -->
<meta name="googlebot" content="noindex">

Using noindex meta tag to prevent indexing (more reliable than robots.txt)

Apache
# .htaccess - Add X-Robots-Tag header for non-HTML files
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>

# Block indexing of all PDF files
# This works for non-HTML content where you can't add meta tags

X-Robots-Tag HTTP header for preventing indexing of non-HTML files

Technical Specifications

Understanding the technical constraints of robots.txt ensures your implementation works correctly across all search engines.

File Location and Access

  • Location: Must be at the domain root (/robots.txt)
  • Protocol: Accessible via HTTP or HTTPS
  • Status code: Must return HTTP 200 (OK)
  • Content type: Should be served as text/plain

File Size Limits

Google limits the processing of robots.txt files:

  • Maximum file size: 500 KiB (512 KB)
  • Behavior: Google truncates processing at 500 KiB
  • Recommendation: Keep files well under this limit

For complex sites, use pattern matching and wildcard rules rather than listing every path individually.

Case Sensitivity

Both the file name and paths are case-sensitive:

  • File must be named exactly robots.txt (lowercase)
  • Paths must match your actual URL case exactly
  • /Admin/ and /admin/ are different paths

Character Encoding

  • Robots.txt should be encoded in UTF-8
  • Characters outside the ASCII range should be URL-encoded
  • Google recommends using Punycode for international domain names

Crawl-Delay Directive

The Crawl-delay directive is not part of the official standard but is supported by some crawlers (Bing, Yandex, Baidu). Google ignores this directive.

Syntax: Crawl-delay: [seconds]

Purpose: Requests the crawler to wait X seconds between requests.
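
Python's urllib.robotparser does parse this directive, which makes it easy to inspect (sample rules invented for illustration):

```python
from urllib import robotparser

rules = """\
User-agent: bingbot
Disallow: /admin/
Crawl-delay: 10
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.crawl_delay("bingbot"))    # 10
print(rp.crawl_delay("Googlebot"))  # None -- no rule group applies to it
```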

Comments in Robots.txt

Use # to add comments. Everything after # on a line is ignored.

Note: Comments can appear on their own line or at the end of a directive line; everything from # to the end of the line is ignored.

Text
# Robots.txt technical example
# Last updated: 2024-01-15
# This file must be at https://example.com/robots.txt

# All bots - block admin and private areas
User-agent: *
Disallow: /admin/      # Case-sensitive: must match actual path
Disallow: /Private/    # Different from /private/
Disallow: /temp/

# Bing-specific: request slower crawling
User-agent: bingbot
Disallow: /admin/
Crawl-delay: 10        # Bing supports this; Google ignores it

# Sitemap declaration
Sitemap: https://example.com/sitemap.xml

# File size recommendation: keep under 500KB
# This example is well under the limit

Technical robots.txt example with comments and proper formatting

Testing Your Robots.txt

Always test your robots.txt implementation before relying on it. Google provides free tools to validate and debug your configuration.

Google Search Console Robots.txt Report

Google Search Console includes a dedicated robots.txt report that shows:

  • Current robots.txt content
  • Any parsing errors
  • When Google last fetched it
  • HTTP status code

Access: Search Console → Settings → robots.txt Report

Google Robots Testing Tool

For quick testing, use Google's online robots.txt testing tool:

URL: support.google.com/webmasters/answer/6062598

Features:

  • Test if a URL is blocked or allowed
  • See which rule applies
  • Identify syntax errors
  • Test changes before publishing

Manual Testing Steps

  1. Check file accessibility: Visit yourdomain.com/robots.txt in a browser
  2. Verify HTTP status: Ensure it returns 200 OK
  3. Test with Google's tool: Validate rules work as expected
  4. Monitor Search Console: Check for errors after deployment
  5. Test specific URLs: Verify important pages aren't accidentally blocked

Common Testing Scenarios

Test if Google can access your CSS/JS:

  • Enter the CSS/JS file URL in the testing tool
  • Verify it shows "Allowed"

Test if admin pages are blocked:

  • Enter /admin/ URLs
  • Verify they show "Blocked"

Test sitemap discovery:

  • Ensure sitemap directive is present
  • Verify sitemap URL is correct and accessible
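
These scenario checks can also be automated with urllib.robotparser and asserted before each deploy (the rule set and URLs below are placeholders for your own):

```python
from urllib import robotparser

rules = """\
User-agent: *
Disallow: /admin/
Allow: /assets/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# CSS/JS must stay crawlable; admin pages must not be
assert rp.can_fetch("Googlebot", "https://example.com/assets/app.css")
assert not rp.can_fetch("Googlebot", "https://example.com/admin/")
print("robots.txt checks passed")
```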

SEO Checklist

  • Critical: Place robots.txt at the domain root (https://example.com/robots.txt)
  • Critical: Never block CSS or JavaScript files that affect page rendering
  • Critical: Test with the Google Search Console robots.txt testing tool before deploying
  • Important: Use noindex meta tags for pages you want excluded from search results
  • Important: Include a Sitemap directive to help search engines discover your pages
  • Important: Remember robots.txt is not a security measure—use proper authentication for sensitive content
  • Important: Keep file size under 500 KiB to ensure Google processes it completely
  • Recommended: Use wildcards (* and $) for pattern matching instead of listing every URL
  • Recommended: Reference Google's official robots.txt specifications for complex implementations
