
Commit aa82756

feat: blockAiBots config (#166)

Authored Nov 24, 2024
1 parent 3240eec · commit aa82756

File tree

6 files changed: +131 -6 lines changed


Diff for: docs/content/1.getting-started/0.introduction.md (+1 -1)

```diff
@@ -13,7 +13,7 @@ The core feature of the module is:
 - Telling [crawlers](https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers) which paths they can and cannot access using a [robots.txt](https://developers.google.com/search/docs/crawling-indexing/robots/intro) file.
 - Telling [search engine crawlers](https://developers.google.com/search/docs/crawling-indexing/googlebot) what they can show in search results from your site using a `<meta name="robots" content="index">`{lang="html"} `X-Robots-Tag` HTTP header.
 
-New to robots or SEO? Check out the [Conquering Web Crawlers](/learn/controlling-crawlers) guide to learn more about why you might
+New to robots or SEO? Check out the [Controlling Web Crawlers](/learn/controlling-crawlers) guide to learn more about why you might
 need these features.
 
 :LearnLabel{label="Conquering Web Crawlers" to="/learn/controlling-crawlers" icon="i-ph-robot-duotone"}
```

Diff for: docs/content/1.getting-started/1.installation.md (+2 -2)

```diff
@@ -43,7 +43,7 @@ while this works out-of-the-box for most providers, it's good to verify this is
 - [Disable Page Indexing](/docs/robots/guides/disable-page-indexing) - You should consider excluding pages that are not useful to search engines, for example
 any routes which require authentication should be ignored.
 
-Make sure you understand the differences between robots.txt vs robots meta tag with the [Conquering Web Crawlers](/learn/conquering-crawlers) guide.
+Make sure you understand the differences between robots.txt vs robots meta tag with the [Controlling Web Crawlers](/learn/conquering-crawlers) guide.
 
 :LearnLabel{label="Conquering Web Crawlers" to="/learn/controlling-crawlers" icon="i-ph-robot-duotone"}
 
@@ -55,4 +55,4 @@ Documentation is provided for module integrations, check them out if you're usin
 - [Nuxt I18n](/docs/robots/guides/i18n) - Disallows are automatically expanded to your configured locales.
 - [Nuxt Content](/docs/robots/guides/content) - Configure robots from your markdown files.
 
-Otherwise, just learn more about [how the module works](/docs/robots/guides/how-it-works).
+Next check out the [robots.txt recipes](/docs/robots/guides/robot-recipes) guide for some inspiration.
```

Diff for: docs/content/2.guides/1.disable-page-indexing.md (+1 -1)

```diff
@@ -11,7 +11,7 @@ The best options to choose are either:
 - [Robots.txt](#robotstxt) - Great for blocking robots from accessing specific pages that haven't been indexed yet.
 - [useRobotsRule](#userobotsrule) - Controls the `<meta name="robots" content="...">` meta tag and `X-Robots-Tag` HTTP Header. Useful for dynamic pages where you may not know if it should be indexed at build time and when you need to remove pages from search results. For example, a user profile page that should only be indexed if the user has made their profile public.
 
-If you're still unsure about which option to choose, make sure you read the [Conquering Web Crawlers](/learn/conquering-crawlers) guide.
+If you're still unsure about which option to choose, make sure you read the [Controlling Web Crawlers](/learn/conquering-crawlers) guide.
 
 :LearnLabel{label="Conquering Web Crawlers" to="/learn/controlling-crawlers" icon="i-ph-robot-duotone"}
```
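
For the `useRobotsRule` composable referenced in this hunk, here is a minimal sketch of the dynamic profile-page case it describes. The `/api/profile/` endpoint and its `public` field are hypothetical stand-ins, and this assumes the composable accepts a robots rule string:

```vue [pages/profile/[id].vue]
<script setup lang="ts">
// Hypothetical endpoint: returns whether this profile has been made public.
const route = useRoute()
const { data: profile } = await useFetch(`/api/profile/${route.params.id}`)

// Sets the robots meta tag and X-Robots-Tag header for this route only.
// Index the page only when the user has made their profile public.
useRobotsRule(profile.value?.public ? 'index, follow' : 'noindex, nofollow')
</script>
```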

Diff for: docs/content/2.guides/1.robot-recipes.md (+93, new file)

---
title: 'Robots.txt Recipes'
description: 'Several recipes for configuring your robots.txt.'
---

## Introduction

As a minimum, the only recommended configuration for robots is to [disable indexing for non-production environments](/docs/robots/guides/disable-indexing).

Many sites will never need to configure their [`robots.txt`](https://nuxtseo.com/learn/controlling-crawlers/robots-txt) or [`robots` meta tag](https://nuxtseo.com/learn/controlling-crawlers/meta-tags) beyond this, as [controlling web crawlers](/learn/controlling-crawlers) is an advanced use case and topic.

However, if you're looking to get the best SEO and performance results, you may consider some of the recipes on this page for your site.

## Robots.txt recipes

### Blocking Bad Bots

If you're finding your site is getting hit with a lot of bots, you may consider enabling the `blockNonSeoBots` option.

```ts [nuxt.config.ts]
export default defineNuxtConfig({
  robots: {
    blockNonSeoBots: true
  }
})
```

This mostly blocks web scrapers; the full list is: `Nuclei`, `WikiDo`, `Riddler`, `PetalBot`, `Zoominfobot`, `Go-http-client`, `Node/simplecrawler`, `CazoodleBot`, `dotbot/1.0`, `Gigabot`, `Barkrowler`, `BLEXBot`, `magpie-crawler`.

### Blocking AI Crawlers

AI crawlers can be beneficial as they can help users find your site, but for some educational sites, or those not interested in being indexed by AI crawlers, you can block them using the `blockAiBots` option.

```ts [nuxt.config.ts]
export default defineNuxtConfig({
  robots: {
    blockAiBots: true
  }
})
```

This blocks the following AI crawlers: `GPTBot`, `ChatGPT-User`, `Claude-Web`, `anthropic-ai`, `Applebot-Extended`, `Bytespider`, `CCBot`, `cohere-ai`, `Diffbot`, `FacebookBot`, `Google-Extended`, `ImagesiftBot`, `PerplexityBot`, `OmigiliBot`, `Omigili`.

### Blocking Privileged Pages

If you have pages that require authentication or are only available to certain users, you should block these from being indexed.

```robots-txt [public/_robots.txt]
User-agent: *
Disallow: /admin
Disallow: /dashboard
```

See [Config using Robots.txt](/docs/robots/guides/robots-txt) for more information.

### Whitelisting Open Graph Tags

If you have certain pages that you don't want indexed but you still want their [Open Graph Tags](/learn/mastering-meta/open-graph) to be crawled, you can target the specific user agents.

```robots-txt [public/_robots.txt]
# Block search engines
User-agent: Googlebot
User-agent: Bingbot
Disallow: /user-profiles

# Allow social crawlers
User-agent: facebookexternalhit
User-agent: Twitterbot
Allow: /user-profiles
```

See [Config using Robots.txt](/docs/robots/guides/robots-txt) for more information.

### Blocking Search Results

You may consider blocking search results from being indexed, as they can be seen as duplicate content and can be a poor user experience.

```robots-txt [public/_robots.txt]
User-agent: *
# block search results
Disallow: /*?query=
# block pagination
Disallow: /*?page=
# block sorting
Disallow: /*?sort=
# block filtering
Disallow: /*?filter=
```
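
The introduction of this new recipes page treats disabling indexing for non-production environments as the baseline. As a rough sketch of how that is typically wired up, assuming the `indexable` option from nuxt-site-config (which this module builds on); `MY_DEPLOY_ENV` is a placeholder for however your host reports the current environment:

```ts [nuxt.config.ts]
// Sketch only: `indexable` is assumed from nuxt-site-config; when it resolves to
// false the site should be treated as non-indexable by the robots module.
export default defineNuxtConfig({
  site: {
    indexable: process.env.MY_DEPLOY_ENV === 'production',
  },
})
```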

Diff for: src/const.ts (+18)

```diff
@@ -18,3 +18,21 @@ export const NonHelpfulBots = [
   'BLEXBot',
   'magpie-crawler',
 ]
+
+export const AiBots = [
+  'GPTBot',
+  'ChatGPT-User',
+  'Claude-Web',
+  'anthropic-ai',
+  'Applebot-Extended',
+  'Bytespider',
+  'CCBot',
+  'cohere-ai',
+  'Diffbot',
+  'FacebookBot',
+  'Google-Extended',
+  'ImagesiftBot',
+  'PerplexityBot',
+  'OmigiliBot',
+  'Omigili',
+]
```
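
If you only want to block some of the agents in the `AiBots` list above, a manual group in your config is a possible alternative to the new flag. This sketch mirrors the group shape the module pushes in `module.ts` and assumes the `robots.groups` option accepts it directly:

```ts [nuxt.config.ts]
export default defineNuxtConfig({
  robots: {
    groups: [
      {
        // Subset of the AiBots constant; add or remove agents as needed.
        userAgent: ['GPTBot', 'CCBot', 'Bytespider'],
        comment: ['Block selected AI crawlers'],
        disallow: ['/'],
      },
    ],
  },
})
```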

Diff for: src/module.ts (+16 -2)

```diff
@@ -16,7 +16,7 @@ import { defu } from 'defu'
 import { installNuxtSiteConfig, updateSiteConfig } from 'nuxt-site-config/kit'
 import { relative } from 'pathe'
 import { readPackageJSON } from 'pkg-types'
-import { NonHelpfulBots } from './const'
+import { AiBots, NonHelpfulBots } from './const'
 import { setupDevToolsUI } from './devtools'
 import { resolveI18nConfig, splitPathForI18nLocales } from './i18n'
 import { extendTypes, isNuxtGenerate, resolveNitroPreset } from './kit'
@@ -118,6 +118,12 @@ export interface ModuleOptions {
    * @default false
    */
   blockNonSeoBots: boolean
+  /**
+   * Blocks AI crawlers.
+   *
+   * @default false
+   */
+  blockAiBots: boolean
   /**
    * Override the auto i18n configuration.
    */
@@ -264,7 +270,15 @@ export default defineNuxtModule<ModuleOptions>({
       // credits to yoast.com/robots.txt
       config.groups.push({
         userAgent: NonHelpfulBots,
-        comment: ['Block bots that don\'t benefit us.'],
+        comment: ['Block non helpful bots'],
+        disallow: ['/'],
+      })
+    }
+
+    if (config.blockAiBots) {
+      config.groups.push({
+        userAgent: AiBots,
+        comment: ['Block AI Crawlers'],
         disallow: ['/'],
       })
     }
```
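
With the group pushed above, enabling `blockAiBots: true` should emit a robots.txt group along these lines (the exact comment rendering and ordering depend on the generator):

```robots-txt
# Block AI Crawlers
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: Claude-Web
User-agent: anthropic-ai
User-agent: Applebot-Extended
User-agent: Bytespider
User-agent: CCBot
User-agent: cohere-ai
User-agent: Diffbot
User-agent: FacebookBot
User-agent: Google-Extended
User-agent: ImagesiftBot
User-agent: PerplexityBot
User-agent: OmigiliBot
User-agent: Omigili
Disallow: /
```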

0 commit comments
