
Commit aa82756

feat: blockAiBots config (#166)

Authored Nov 24, 2024
1 parent 3240eec · commit aa82756

File tree

6 files changed: +131 -6 lines changed


Diff for: docs/content/1.getting-started/0.introduction.md (+1 -1)

```diff
@@ -13,7 +13,7 @@ The core feature of the module is:
 - Telling [crawlers](https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers) which paths they can and cannot access using a [robots.txt](https://developers.google.com/search/docs/crawling-indexing/robots/intro) file.
 - Telling [search engine crawlers](https://developers.google.com/search/docs/crawling-indexing/googlebot) what they can show in search results from your site using a `<meta name="robots" content="index">`{lang="html"} `X-Robots-Tag` HTTP header.
 
-New to robots or SEO? Check out the [Conquering Web Crawlers](/learn/controlling-crawlers) guide to learn more about why you might
+New to robots or SEO? Check out the [Controlling Web Crawlers](/learn/controlling-crawlers) guide to learn more about why you might
 need these features.
 
 :LearnLabel{label="Conquering Web Crawlers" to="/learn/controlling-crawlers" icon="i-ph-robot-duotone"}
```

Diff for: docs/content/1.getting-started/1.installation.md (+2 -2)

```diff
@@ -43,7 +43,7 @@ while this works out-of-the-box for most providers, it's good to verify this is
 - [Disable Page Indexing](/docs/robots/guides/disable-page-indexing) - You should consider excluding pages that are not useful to search engines, for example
 any routes which require authentication should be ignored.
 
-Make sure you understand the differences between robots.txt vs robots meta tag with the [Conquering Web Crawlers](/learn/conquering-crawlers) guide.
+Make sure you understand the differences between robots.txt vs robots meta tag with the [Controlling Web Crawlers](/learn/conquering-crawlers) guide.
 
 :LearnLabel{label="Conquering Web Crawlers" to="/learn/controlling-crawlers" icon="i-ph-robot-duotone"}
 
@@ -55,4 +55,4 @@ Documentation is provided for module integrations, check them out if you're usin
 - [Nuxt I18n](/docs/robots/guides/i18n) - Disallows are automatically expanded to your configured locales.
 - [Nuxt Content](/docs/robots/guides/content) - Configure robots from your markdown files.
 
-Otherwise, just learn more about [how the module works](/docs/robots/guides/how-it-works).
+Next check out the [robots.txt recipes](/docs/robots/guides/robot-recipes) guide for some inspiration.
```

Diff for: docs/content/2.guides/1.disable-page-indexing.md (+1 -1)

```diff
@@ -11,7 +11,7 @@ The best options to choose are either:
 - [Robots.txt](#robotstxt) - Great for blocking robots from accessing specific pages that haven't been indexed yet.
 - [useRobotsRule](#userobotsrule) - Controls the `<meta name="robots" content="...">` meta tag and `X-Robots-Tag` HTTP Header. Useful for dynamic pages where you may not know if it should be indexed at build time and when you need to remove pages from search results. For example, a user profile page that should only be indexed if the user has made their profile public.
 
-If you're still unsure about which option to choose, make sure you read the [Conquering Web Crawlers](/learn/conquering-crawlers) guide.
+If you're still unsure about which option to choose, make sure you read the [Controlling Web Crawlers](/learn/conquering-crawlers) guide.
 
 :LearnLabel{label="Conquering Web Crawlers" to="/learn/controlling-crawlers" icon="i-ph-robot-duotone"}
```
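
For the `useRobotsRule` composable referenced in this hunk, here is a minimal sketch of the dynamic profile-page case it describes. The `/api/profile/` endpoint and its `public` field are hypothetical stand-ins, and this assumes the composable accepts a robots rule string:

```vue [pages/profile/[id].vue]
<script setup lang="ts">
// Hypothetical endpoint: returns whether this profile has been made public.
const route = useRoute()
const { data: profile } = await useFetch(`/api/profile/${route.params.id}`)

// Sets the robots meta tag and X-Robots-Tag header for this route only.
// Index the page only when the user has made their profile public.
useRobotsRule(profile.value?.public ? 'index, follow' : 'noindex, nofollow')
</script>
```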

Diff for: docs/content/2.guides/1.robot-recipes.md (+93, new file)

---
title: 'Robots.txt Recipes'
description: 'Several recipes for configuring your robots.txt.'
---

## Introduction

As a minimum, the only recommended configuration for robots is to [disable indexing for non-production environments](/docs/robots/guides/disable-indexing).

Many sites will never need to configure their [`robots.txt`](https://nuxtseo.com/learn/controlling-crawlers/robots-txt) or [`robots` meta tag](https://nuxtseo.com/learn/controlling-crawlers/meta-tags) beyond this, as [controlling web crawlers](/learn/controlling-crawlers) is an advanced use case and topic.

However, if you're looking to get the best SEO and performance results, you may consider some of the recipes on this page for your site.

## Robots.txt recipes

### Blocking Bad Bots

If you're finding your site is getting hit with a lot of bots, you may consider enabling the `blockNonSeoBots` option.

```ts [nuxt.config.ts]
export default defineNuxtConfig({
  robots: {
    blockNonSeoBots: true
  }
})
```

This mostly blocks web scrapers; the full list is: `Nuclei`, `WikiDo`, `Riddler`, `PetalBot`, `Zoominfobot`, `Go-http-client`, `Node/simplecrawler`, `CazoodleBot`, `dotbot/1.0`, `Gigabot`, `Barkrowler`, `BLEXBot`, `magpie-crawler`.

### Blocking AI Crawlers

AI crawlers can be beneficial as they can help users find your site, but for some educational sites, or those not interested in being indexed by AI crawlers, you can block them using the `blockAiBots` option.

```ts [nuxt.config.ts]
export default defineNuxtConfig({
  robots: {
    blockAiBots: true
  }
})
```

This blocks the following AI crawlers: `GPTBot`, `ChatGPT-User`, `Claude-Web`, `anthropic-ai`, `Applebot-Extended`, `Bytespider`, `CCBot`, `cohere-ai`, `Diffbot`, `FacebookBot`, `Google-Extended`, `ImagesiftBot`, `PerplexityBot`, `OmigiliBot`, `Omigili`.

### Blocking Privileged Pages

If you have pages that require authentication or are only available to certain users, you should block these from being indexed.

```robots-txt [public/_robots.txt]
User-agent: *
Disallow: /admin
Disallow: /dashboard
```

See [Config using Robots.txt](/docs/robots/guides/robots-txt) for more information.

### Whitelisting Open Graph Tags

If you have certain pages that you don't want indexed but you still want their [Open Graph Tags](/learn/mastering-meta/open-graph) to be crawled, you can target the specific user agents.

```robots-txt [public/_robots.txt]
# Block search engines
User-agent: Googlebot
User-agent: Bingbot
Disallow: /user-profiles

# Allow social crawlers
User-agent: facebookexternalhit
User-agent: Twitterbot
Allow: /user-profiles
```

See [Config using Robots.txt](/docs/robots/guides/robots-txt) for more information.

### Blocking Search Results

You may consider blocking search results from being indexed, as they can be seen as duplicate content and can be a poor user experience.

```robots-txt [public/_robots.txt]
User-agent: *
# block search results
Disallow: /*?query=
# block pagination
Disallow: /*?page=
# block sorting
Disallow: /*?sort=
# block filtering
Disallow: /*?filter=
```
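
The introduction of this new recipes page treats disabling indexing for non-production environments as the baseline. As a rough sketch of how that is typically wired up, assuming the `indexable` option from nuxt-site-config (which this module builds on); `MY_DEPLOY_ENV` is a placeholder for however your host reports the current environment:

```ts [nuxt.config.ts]
// Sketch only: `indexable` is assumed from nuxt-site-config; when it resolves to
// false the site should be treated as non-indexable by the robots module.
export default defineNuxtConfig({
  site: {
    indexable: process.env.MY_DEPLOY_ENV === 'production',
  },
})
```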

Diff for: src/const.ts (+18)

```diff
@@ -18,3 +18,21 @@ export const NonHelpfulBots = [
   'BLEXBot',
   'magpie-crawler',
 ]
+
+export const AiBots = [
+  'GPTBot',
+  'ChatGPT-User',
+  'Claude-Web',
+  'anthropic-ai',
+  'Applebot-Extended',
+  'Bytespider',
+  'CCBot',
+  'cohere-ai',
+  'Diffbot',
+  'FacebookBot',
+  'Google-Extended',
+  'ImagesiftBot',
+  'PerplexityBot',
+  'OmigiliBot',
+  'Omigili',
+]
```
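
If you only want to block some of the agents in the `AiBots` list above, a manual group in your config is a possible alternative to the new flag. This sketch mirrors the group shape the module pushes in `module.ts` and assumes the `robots.groups` option accepts it directly:

```ts [nuxt.config.ts]
export default defineNuxtConfig({
  robots: {
    groups: [
      {
        // Subset of the AiBots constant; add or remove agents as needed.
        userAgent: ['GPTBot', 'CCBot', 'Bytespider'],
        comment: ['Block selected AI crawlers'],
        disallow: ['/'],
      },
    ],
  },
})
```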

Diff for: src/module.ts (+16 -2)

```diff
@@ -16,7 +16,7 @@ import { defu } from 'defu'
 import { installNuxtSiteConfig, updateSiteConfig } from 'nuxt-site-config/kit'
 import { relative } from 'pathe'
 import { readPackageJSON } from 'pkg-types'
-import { NonHelpfulBots } from './const'
+import { AiBots, NonHelpfulBots } from './const'
 import { setupDevToolsUI } from './devtools'
 import { resolveI18nConfig, splitPathForI18nLocales } from './i18n'
 import { extendTypes, isNuxtGenerate, resolveNitroPreset } from './kit'
@@ -118,6 +118,12 @@ export interface ModuleOptions {
    * @default false
    */
   blockNonSeoBots: boolean
+  /**
+   * Blocks AI crawlers.
+   *
+   * @default false
+   */
+  blockAiBots: boolean
   /**
    * Override the auto i18n configuration.
    */
@@ -264,7 +270,15 @@ export default defineNuxtModule<ModuleOptions>({
       // credits to yoast.com/robots.txt
       config.groups.push({
         userAgent: NonHelpfulBots,
-        comment: ['Block bots that don\'t benefit us.'],
+        comment: ['Block non helpful bots'],
+        disallow: ['/'],
+      })
+    }
+
+    if (config.blockAiBots) {
+      config.groups.push({
+        userAgent: AiBots,
+        comment: ['Block AI Crawlers'],
         disallow: ['/'],
       })
     }
```
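
With the group pushed above, enabling `blockAiBots: true` should emit a robots.txt group along these lines (the exact comment rendering and ordering depend on the generator):

```robots-txt
# Block AI Crawlers
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: Claude-Web
User-agent: anthropic-ai
User-agent: Applebot-Extended
User-agent: Bytespider
User-agent: CCBot
User-agent: cohere-ai
User-agent: Diffbot
User-agent: FacebookBot
User-agent: Google-Extended
User-agent: ImagesiftBot
User-agent: PerplexityBot
User-agent: OmigiliBot
User-agent: Omigili
Disallow: /
```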

0 commit comments
