May 26, 2026

What is a Robots.txt File and How Does it Work?

TL;DR
A robots.txt file sits in your website’s root directory and tells crawlers which pages they may access and which to skip, helping you protect your crawl budget, keep crawlers away from low-value pages, and control what different bots can see. In 2026, it’s also your main tool for managing AI crawlers: the smart approach is to block AI training crawlers (which harvest your content for model training with no traffic in return) while allowing AI search crawlers (which cite your content in ChatGPT and Perplexity answers and send referral traffic back to you).

robots.txt files are increasingly used across the web, with sites relying on them to allow or disallow crawlers from companies such as OpenAI. According to Search Engine Journal, 100% of the robots.txt files it analyzed blocked at least one crawler. In 2026, though, the more important question isn’t whether to block; it’s which crawlers to block and which to actively welcome. So, let’s answer the question, “What is a robots.txt file?” and get you up to speed on how to use one strategically.

Below, we explore robots.txt, its uses, and its capabilities when properly configured. Keep reading to learn how to create and maintain this file to support your SEO efforts and ensure your website performs efficiently throughout its life.

What Is a robots.txt File?

robots.txt is a plain text file placed in your website’s root directory (e.g., “www.example.com/robots.txt”). Well-behaved crawler bots, such as those from Google, request this file before crawling and follow its instructions, which set the tone for the rest of the crawl. Compliance is voluntary, so the file guides reputable crawlers rather than enforcing access.

The robots.txt file tells crawlers which pages on your website they may access. This helps you manage your “crawl budget,” so that a crawler spends its limited time on the pages you consider important rather than on low-value ones before moving on to another site.

What Is a robots.txt File Used For?

When a web crawler from a company like Google accesses your site, it will read the robots.txt file and use it as a set of guidelines for what it should look at on your website. You can use it to focus the efforts of the crawler on only specific areas to avoid wasting crawler resources on unnecessary pages such as:

  • Search results
  • Archives
  • Repeated gallery layouts
  • Development environments
  • Repeated pages for different languages
  • Login pages
  • Private or sensitive sections

Google provides extensive information on how its crawl budget works. However, it does not offer specific details on the number of pages or amount of data it will download. As such, you should be strict about what you show the crawler to make the most of any available crawling opportunity.
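As a rough illustration of the idea above, the following Python sketch turns a list of low-value paths into a robots.txt file and reads it back with the standard-library parser. The helper name build_robots_txt, the paths, and the example.com URLs are all placeholders invented for this example, not part of any tool.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(disallowed_paths, sitemap_url=None):
    """Return robots.txt text with one catch-all group (hypothetical helper)."""
    lines = ["User-agent: *"]
    for path in disallowed_paths:
        lines.append(f"Disallow: {path}")
    if sitemap_url:
        # The Sitemap line is not tied to any group, so it goes after a blank line.
        lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


# Placeholder paths; substitute the low-value sections of your own site.
LOW_VALUE_PATHS = ["/search/", "/archive/", "/staging/", "/login/"]

robots_txt = build_robots_txt(LOW_VALUE_PATHS, "https://www.example.com/sitemap.xml")
print(robots_txt)

# Read the text back with the standard-library parser to confirm the rules.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
print(parser.can_fetch("AnyBot", "https://www.example.com/search/?q=shoes"))  # False
print(parser.can_fetch("AnyBot", "https://www.example.com/services/"))        # True
```

Generating the file from a list keeps the blocked sections in one place, which can make it easier to review them as the site grows.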

Why Is It Important for SEO?

With Google still representing around 90% of the search engine market across all devices, you will want to communicate to it as clearly as possible where your highest-quality pages are. If Google Search Console consistently shows that some of your most important pages are not being crawled, a robots.txt file can help by steering the crawler away from low-value URLs so it reaches your high-value content sooner.

Similarly, you can use robots.txt to stop the crawler from fetching less SEO-friendly pages, such as dynamically generated URLs. Keep in mind that robots.txt controls crawling, not indexing: a blocked URL can still appear in results if other pages link to it, so use a noindex tag (on a page that is not blocked) or authentication for anything that must stay out of the index.

Because crawlers concentrate on your essential URLs, search engines are more likely to list the pages that offer real value to new visitors.

When Should You Use a robots.txt File?

While the file is universally beneficial, there are specific times when it is more helpful:

  • During site development to prevent Google from finding your site early
  • When you want to manage the crawl budget of a large site
  • If you have duplicate pages and want only one to be the primary one
  • When you launch a new section of your site and want to highlight it
  • When your site contains outdated content you wish to deprioritize

In each of these cases, a robots.txt file lets you steer crawlers toward the content you want seen and away from the content you don’t.

Managing AI Crawlers in Your robots.txt

Every major AI company now runs web crawlers that visit your site automatically. GPTBot (OpenAI), ClaudeBot (Anthropic), PerplexityBot (Perplexity), and Google-Extended are the most common, and the list is growing. Your robots.txt file is the primary mechanism for controlling what these crawlers can access — but the decision about whether to block or allow them is more nuanced than it might seem.

The key distinction is between training crawlers and search crawlers. Training crawlers — like GPTBot and ClaudeBot — harvest your content to train AI models, sending almost no traffic back to your site in return. Search crawlers — like OAI-SearchBot (used for ChatGPT search answers) and PerplexityBot — retrieve your content in real time to cite it in answers to users’ questions, and do send referral traffic back.

Blanket-blocking all AI crawlers can remove your site from AI search answers. Since AI search visits are growing fast and AI-referred visitors tend to arrive already informed and ready to act, that’s a real cost for most small businesses with public content to share. The smarter approach in 2026 is to block training crawlers (protecting your content from being absorbed into model datasets with no attribution) while allowing search and citation crawlers (keeping your content visible in ChatGPT, Perplexity, and similar tools). If, however, you have genuinely proprietary content, such as paywalled resources or original research you don’t want scraped, blocking more broadly makes sense. Truly sensitive material, such as client data, should sit behind a login rather than rely on robots.txt, because the file is publicly readable and only well-behaved crawlers honor it.

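Note: Google-Extended is the robots.txt token that controls whether your content is used for Google’s Gemini models. It is not a separate crawler with its own user-agent string; Google’s existing crawlers do the fetching and check this token. Blocking Google-Extended does not affect your standard Google Search rankings, so you can block it without blocking Googlebot.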
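To make the block-versus-allow split concrete, here is a minimal Python sketch that checks a small policy with the standard library’s urllib.robotparser. The policy text and the example.com URL are assumptions made for illustration, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Illustrative policy: block two training tokens, welcome one search crawler.
POLICY = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(POLICY.splitlines())


def is_allowed(bot, url="https://www.example.com/blog/post"):
    """Return True if the named crawler may fetch the (placeholder) URL."""
    return parser.can_fetch(bot, url)


for bot in ["GPTBot", "Google-Extended", "OAI-SearchBot", "Googlebot"]:
    print(f"{bot:16} {'allowed' if is_allowed(bot) else 'blocked'}")
```

In this sketch, GPTBot and Google-Extended come back blocked while OAI-SearchBot and Googlebot come back allowed, which mirrors the point that blocking Google-Extended leaves Googlebot untouched.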

Creating a robots.txt File

When you create a file yourself, start with a simple text file in a plain-text editor, naming it “robots.txt.”

After this, the absolute minimum information you should add would be:

User-agent: *

This line says that the rules below it apply to every crawler that does not have a more specific block of its own. You can then build on it with three main directives:

Disallow blocks access to a specific path. The following would prevent any crawler from accessing your admin area and everything inside it:

Disallow: /admin/

Allow grants access to a specific path even when a broader Disallow rule is in place. This lets you open up one subfolder inside an otherwise blocked section:

Allow: /admin/ruleset/

Sitemap tells crawlers where to find your XML sitemap, helping them navigate your site’s structure more efficiently:

Sitemap: https://www.example.com/sitemap.xml
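When an Allow and a Disallow rule both match a URL, RFC 9309 and Google’s documentation say the rule with the longest matching path wins, and Allow wins a tie. The Python sketch below is a simplified illustration of that precedence, not a full parser: the function name is_path_allowed is hypothetical, and it handles plain path prefixes only, without the * and $ wildcards.

```python
def is_path_allowed(rules, path):
    """Apply longest-match precedence to (directive, prefix) pairs.

    Simplified sketch: plain prefix matching only, no * or $ wildcards.
    """
    best_len = -1
    allowed = True  # no matching rule means the path may be crawled
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        # A longer prefix wins; on a tie, Allow wins over Disallow.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


# The two rules from the examples above.
RULES = [("Disallow", "/admin/"), ("Allow", "/admin/ruleset/")]

print(is_path_allowed(RULES, "/admin/users"))        # False
print(is_path_allowed(RULES, "/admin/ruleset/seo"))  # True
print(is_path_allowed(RULES, "/blog/"))              # True
```

Because the longer /admin/ruleset/ path outranks /admin/, the subfolder stays open even though its parent is blocked.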

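You can also write separate User-agent blocks to give different rules to different crawlers. For example, you might want to allow Googlebot to access everything while blocking a specific AI training bot from your entire site. A crawler follows only the block that most specifically matches its name and ignores the others, including the * block, so the order of the blocks does not matter, but any rule you want a named crawler to obey must be repeated inside its own block.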
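One consequence of per-crawler blocks is easy to miss: a crawler that has its own block does not inherit the rules in the * block. The short Python sketch below shows this with the standard library’s urllib.robotparser; the file contents, the bot name SomeOtherBot, and the example.com URL are made up for the demonstration.

```python
from urllib.robotparser import RobotFileParser

# Illustrative file: a catch-all group plus one named group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

User-agent: OAI-SearchBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

ADMIN_URL = "https://www.example.com/admin/settings"

# SomeOtherBot has no group of its own, so it falls back to the * group.
print(parser.can_fetch("SomeOtherBot", ADMIN_URL))   # False
# OAI-SearchBot matches its own group, which contains no /admin/ rule.
print(parser.can_fetch("OAI-SearchBot", ADMIN_URL))  # True
```

If the named crawler should also stay out of /admin/, the Disallow line has to appear in its own block as well.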

robots.txt example

A well-configured robots.txt file for a small business in 2026 balances protecting crawl budget, blocking AI training crawlers, and staying visible in AI search results:

User-agent: *
Disallow: /admin/
Disallow: /search/
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /admin/
Disallow: /search/
Allow: /

User-agent: ChatGPT-User
Disallow: /admin/
Disallow: /search/
Allow: /

User-agent: PerplexityBot
Disallow: /admin/
Disallow: /search/
Allow: /

User-agent: Claude-SearchBot
Disallow: /admin/
Disallow: /search/
Allow: /

Sitemap: https://www.example.com/sitemap.xml

The first block applies to every crawler that is not named in a later block; it keeps them out of the admin and search results pages while allowing access to everything else. The next four blocks shut out the main AI training crawlers, the ones that harvest content for model training with no attribution or traffic in return. The final four blocks explicitly allow the AI search crawlers, which cite your content in live answers on ChatGPT, Perplexity, and similar tools and do send referral traffic back to your site. Because a named crawler ignores the * block, these four blocks repeat the /admin/ and /search/ rules. The Sitemap line is independent of any block, so it sits on its own at the end.

Unlock the Power of robots.txt with Rose & Cactus

A well-made and optimized robots.txt file is essential for managing the crawlers that visit your website. It can steer them toward your important pages and help you get noticed. The best practices above can also help you control crawler behavior, keep crawlers out of private sections, and ensure you only present essential content to search engines. Just remember that robots.txt is a public, advisory file, so truly sensitive data still needs proper access controls.

Now that you can answer, “What is a robots.txt file?”, contact Rose & Cactus today if you want to know more or need someone to create one for you. We can help you streamline your robots.txt strategy and boost your SEO efforts.

FAQs — robots.txt Files

What is a robots.txt file?
A robots.txt file is a plain text file placed at the root of your website that tells web crawlers which pages or sections they are allowed or not allowed to access. It acts as a set of instructions for bots from Google, Bing, AI companies, and other sources, helping you control what gets crawled.

What is a robots.txt file used for?
It is used to manage your site’s crawl budget by directing search engine bots toward your most valuable pages and away from low-priority ones such as admin areas, duplicate pages, search results, and development environments. In 2026, it is also widely used to control which AI crawlers can access your content.

What does Disallow mean in robots.txt?
The Disallow directive in a robots.txt file tells a crawler not to access a specified path on your site. For example, Disallow: /admin/ tells any bot matching that user-agent rule not to crawl your admin pages. An Allow directive does the opposite, explicitly granting access even within a disallowed section.

When should you use a robots.txt file?
A robots.txt file is most useful during site development (to discourage Google from crawling an unfinished site), when managing the crawl budget on a large site, when you have duplicate or low-quality pages you want to deprioritize, and whenever you want explicit control over which crawlers, including AI bots, can access your content.

Should you block AI crawlers in your robots.txt?
It depends on what the crawler does. AI training crawlers like GPTBot and ClaudeBot harvest content for model training and return almost no traffic, so blocking them is reasonable for most sites. AI search crawlers like OAI-SearchBot and PerplexityBot retrieve content to cite in real-time answers and do drive referral traffic. Blocking these can remove your site from AI search results, which is rarely in a small business’s interest.

Does blocking AI crawlers affect your Google Search ranking?
No, not if done correctly. Googlebot handles standard search crawling, while Google-Extended is a separate robots.txt token that only controls whether your content is used for Gemini. Blocking Google-Extended does not affect your Google Search rankings. The only crawler you should never block if you want to rank on Google is Googlebot itself.
Laura Pulling

Laura is a content strategist, SEO consultant, and lover of quiz nights. She works with global clients to turn great ideas into well-ranked, high-converting content.

