
May 26, 2026
robots.txt files are increasingly used by websites across the Internet, with sites leveraging them to allow and disallow crawlers from sources such as OpenAI. 100% of the robots.txt files analyzed by SearchEngineJournal blocked at least one crawler — but in 2026, the more important question isn’t whether to block, it’s which crawlers to block and which to actively welcome. So, let’s answer the question, “What is a robots.txt file?” and get you up to speed on how to use one strategically.
Below, we explore robots.txt, its uses, and its capabilities when properly configured. Keep reading to learn how to create and maintain this file to support your SEO efforts and ensure your website performs efficiently throughout its life.
robots.txt is a standard text file placed in your website’s root directory (e.g., “www.example.com/robots.txt”). Well-behaved crawlers, such as those from Google, request this file before they crawl anything else and then follow its rules for the rest of their visit.
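Because the file always lives at the root of the host, its address can be worked out from any page URL on the same site. The snippet below is a minimal sketch of that idea in Python; the function name robots_txt_url and the example URL are illustrative assumptions, not part of any standard tool.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any port), replace the path,
    # and drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://www.example.com/blog/post?id=7"))
# prints: https://www.example.com/robots.txt
```

Each subdomain is a separate host, so a blog hosted at a different subdomain would need its own robots.txt file.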
The robots.txt file tells crawlers which parts of your website they may access. This helps manage your “crawl budget,” so that a crawler does not spend its visit on pages you do not consider important and then move on to another site before reaching the pages you do.
When a web crawler from a company like Google accesses your site, it will read the robots.txt file and use it as a set of guidelines for what it should look at on your website. You can use it to focus the efforts of the crawler on only specific areas to avoid wasting crawler resources on unnecessary pages such as:
Google provides extensive information on how its crawl budget works. However, it does not offer specific details on the number of pages or amount of data it will download. As such, you should be strict about what you show the crawler to make the most of any available crawling opportunity.
With Google still representing around 90% of the search engine market across all devices, you will want to communicate to it as clearly as possible where your highest-quality pages are. If your Google Search Console results consistently do not display some of your most important pages, a robots.txt file can help steer the crawler toward your high-value content so that it is not ignored.
Similarly, using a robots.txt file, you can keep the crawler away from less SEO-friendly pages, such as dynamically generated URLs, so that crawl activity goes to the pages you want to rank. Keep in mind that robots.txt controls crawling rather than indexing: a blocked URL can still appear in results if other sites link to it, so pages that must stay out of search entirely need a noindex tag instead.
Because crawlers then concentrate on your essential URLs, external sources such as search engines are more likely to list the pages that offer new visitors value.
While the file is universally beneficial, there are specific times when it is more helpful:
These make a robots.txt file useful for frontloading or pulling back content from visibility.
Every major AI company now runs web crawlers that visit your site automatically. GPTBot (OpenAI), ClaudeBot (Anthropic), PerplexityBot (Perplexity), and Google-Extended are the most common, and the list is growing. Your robots.txt file is the primary mechanism for controlling what these crawlers can access — but the decision about whether to block or allow them is more nuanced than it might seem.
The key distinction is between training crawlers and search crawlers. Training crawlers — like GPTBot and ClaudeBot — harvest your content to train AI models, sending almost no traffic back to your site in return. Search crawlers — like OAI-SearchBot (used for ChatGPT search answers) and PerplexityBot — retrieve your content in real time to cite it in answers to users’ questions, and do send referral traffic back.
Blanket-blocking all AI crawlers removes your site from AI search answers entirely. Since AI search visits are growing fast and AI-referred visitors tend to arrive already informed and ready to act, that’s a real cost for most small businesses with public content to share. The smarter approach in 2026 is to block training crawlers (protecting your content from being absorbed into model datasets with no attribution) while allowing search and citation crawlers (keeping your content visible in ChatGPT, Perplexity, and similar tools). If, however, you have genuinely proprietary content — paywalled resources, sensitive client data, or original research you don’t want scraped — blocking more broadly makes sense.
Note: Google-Extended controls whether your content is used for Google’s Gemini models, and blocking it does not affect your standard Google Search rankings. It is a separate robots.txt token from Googlebot rather than a separate crawler, so you can block one without affecting the other.
When you create a file yourself, start with a simple text file in a plain-text editor, naming it “robots.txt.”
After this, the absolute minimum information you should add would be:
User-agent: *
This line tells every crawler that the rules below apply to all bots. You can then build on it with three main directives:
Disallow blocks access to a specific path. The following would prevent any crawler from accessing your admin area and everything inside it:
Disallow: /admin/
Allow grants access to a specific path even when a broader Disallow rule is in place. This lets you open up one subfolder inside an otherwise blocked section:
Allow: /admin/ruleset/
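The way Disallow and Allow interact can be sketched in a few lines of Python. RFC 9309 and Google's documentation describe the most specific (longest) matching rule as the winner, with Allow winning a tie. The function below is a simplified illustration of that precedence under stated assumptions: is_allowed is a hypothetical helper, it handles plain path prefixes only, and it ignores the * and $ wildcards that real parsers support.

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Evaluate (directive, prefix) rules with longest-match precedence."""
    best_len = -1
    allowed = True  # a path that matches no rule may be crawled
    for directive, prefix in rules:
        # Skip empty rules and rules whose prefix does not match this path.
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        # The longer prefix wins; on a tie, Allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


rules = [("Disallow", "/admin/"), ("Allow", "/admin/ruleset/")]
print(is_allowed("/admin/users", rules))        # False
print(is_allowed("/admin/ruleset/seo", rules))  # True
print(is_allowed("/blog/", rules))              # True
```

Not every parser resolves overlapping rules this way, so it is worth testing important URLs with the tools of the search engines you care about.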
Sitemap tells crawlers where to find your XML sitemap, helping them navigate your site’s structure more efficiently:
Sitemap: https://www.example.com/sitemap.xml
You can also write separate User-agent blocks to give different rules to different crawlers. For example, you might want to allow Googlebot to access everything while blocking a specific AI training bot from your entire site. Each User-agent block runs until the next one begins, and a crawler follows only the block that names it most specifically, falling back to the * block when no block names it.
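To make the block selection concrete, here is a small Python sketch of how a crawler might decide which block applies to it. It is a simplification: select_group and the groups dictionary are hypothetical names for illustration, and matching is a case-insensitive comparison of the whole crawler token.

```python
def select_group(crawler: str, groups: dict[str, list[str]]) -> list[str]:
    """Return the rules for the block naming this crawler, else the '*' block."""
    wanted = crawler.lower()
    for token, rules in groups.items():
        if token != "*" and token.lower() == wanted:
            return rules
    # No block names this crawler, so it falls back to the catch-all block.
    return groups.get("*", [])


groups = {
    "*": ["Disallow: /admin/"],
    "Googlebot": ["Allow: /"],
    "GPTBot": ["Disallow: /"],
}
print(select_group("GPTBot", groups))        # ['Disallow: /']
print(select_group("googlebot", groups))     # ['Allow: /']
print(select_group("SomeOtherBot", groups))  # ['Disallow: /admin/']
```

Note that a crawler with its own block does not also pick up the rules in the * block, so any shared restrictions have to be repeated in each named block.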
A well-configured robots.txt file for a small business in 2026 balances protecting crawl budget, blocking AI training crawlers, and staying visible in AI search results:
User-agent: *
Disallow: /admin/
Disallow: /search/
Allow: /

Sitemap: https://www.example.com/sitemap.xml

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Claude-SearchBot
Allow: /
The first block applies to all crawlers and keeps them out of admin and search results pages while allowing access to everything else. The next four blocks shut out the main AI training crawlers, the ones that harvest content for model training with no attribution or traffic in return. The final four blocks explicitly allow the AI search crawlers, which cite your content in live answers on ChatGPT, Perplexity, and similar tools and do send referral traffic back to your site. Because a crawler follows only the block that names it, these four crawlers do not inherit the /admin/ and /search/ rules from the first block; repeat those Disallow lines in their blocks if you want them to apply.
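One way to sanity-check a file like the sample above before publishing it is to load it into Python's built-in urllib.robotparser module and ask what each crawler may fetch. This is a rough check under stated assumptions: SomeOtherBot is a made-up crawler name standing in for any bot without its own block, and Python's parser may resolve overlapping Allow and Disallow rules differently from a given search engine, so treat the result as a first pass rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /search/
Allow: /

Sitemap: https://www.example.com/sitemap.xml

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Claude-SearchBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse the text directly, no network request

base = "https://www.example.com"
checks = [
    ("GPTBot", "/blog/"),         # training crawler: blocked everywhere
    ("OAI-SearchBot", "/blog/"),  # search crawler: allowed
    ("OAI-SearchBot", "/admin/"), # its own block applies, not the * block
    ("SomeOtherBot", "/admin/"),  # falls back to the * block
    ("SomeOtherBot", "/blog/"),
]
for agent, path in checks:
    print(f"{agent:15} {path:8} {parser.can_fetch(agent, base + path)}")
```

With this file, the third check comes back as allowed, which shows that OAI-SearchBot is not covered by the /admin/ rule in the first block.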
A well-made and optimized robots.txt file is essential for managing crawlers that might visit your website. It can direct crawlers toward the important pages you publish and help you get noticed. The above best practices can also help you control crawler behavior, keep compliant crawlers out of private areas of your site, and ensure you only present essential content to search engines.
You should now be able to answer the question, “What is a robots.txt file?” If you want to know more or need someone to create this file for you, contact Rose & Cactus today. We can help you streamline your robots.txt strategy and boost your SEO efforts.