This content originally appeared on Ethan Marcotte’s website and was authored by Ethan Marcotte
I’ve been blocking various “artificial intelligence” (“AI”) bots on my site. Why, you ask? Well, I don’t like the idea of my work being hoovered up to train “AI” data models; I don’t like that these companies assume my content’s available to them by default, and that I have to opt out of their scraping; I really don’t want anything I write to support these platforms, which I find unethical, extractive, deeply amoral, and profoundly anti-human.
But! Sadly, this is the world we live in. So I’m opting out.
There are many excellent tutorials out there on how to do this. (For my money, Neil Clarke and Cory Dransfeldt wrote two of my favorites, getting into not just the how of blocking “AI” bots, but also the why.) Most of those tutorials involve instructions on how to edit your site’s robots.txt file, providing it with a list of user agents (basically, the “names” of the bots, crawlers, and browsers visiting your site) and blocking them from accessing your website.
But as I understand it, the one shortcoming of robots.txt is that it only works if visiting bots actually honor your robots.txt file. (Google has a good intro on this, if you’re interested.) That’s why I’ve opted to use my site’s .htaccess file to block these bots. As a friend put it recently, robots.txt is a bit like asking bots to not visit my site; with .htaccess, you’re not asking.
(If you’re wondering if robots that ignore robots.txt would perhaps lie about their user agent, you’re right to do so. Relying on user agent strings, which are basically swamps filled with lies, is incredibly fraught in any context, including this one. That’s why I consider this approach to be marginally better than robots.txt, not a perfect solution.)
Anyway. Here’s how I got things working.
First, I polled a few different sources to build a list of currently-known crawler names. Once I had them, I dropped them into a mod_rewrite rule in my .htaccess file:
<IfModule mod_rewrite.c>
RewriteEngine on
RewriteBase /
# block “AI” bots
RewriteCond %{HTTP_USER_AGENT} (AdsBot-Google|Amazonbot|anthropic-ai|Applebot|Applebot-Extended|AwarioRssBot|AwarioSmartBot|Bytespider|CCBot|ChatGPT|ChatGPT-User|Claude-Web|ClaudeBot|cohere-ai|DataForSeoBot|Diffbot|FacebookBot|Google-Extended|GPTBot|ImagesiftBot|magpie-crawler|omgili|Omgilibot|peer39_crawler|PerplexityBot|YouBot) [NC]
RewriteRule ^ - [F]
</IfModule>
If the crawler’s name appears anywhere in that gnarly-looking list of user agents, my site should block it.
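To make "appears anywhere" concrete, here is a rough Python sketch of the check the RewriteCond performs: an unanchored, case-insensitive regex search against the User-Agent header. When the condition matches, the [F] flag makes Apache answer with 403 Forbidden. The is_blocked helper and the sample user agent strings are hypothetical, added only for illustration, and Python's regex engine stands in for Apache's here.

```python
import re

# Crawler names copied from the RewriteCond in the rule above.
BOT_NAMES = [
    "AdsBot-Google", "Amazonbot", "anthropic-ai", "Applebot",
    "Applebot-Extended", "AwarioRssBot", "AwarioSmartBot", "Bytespider",
    "CCBot", "ChatGPT", "ChatGPT-User", "Claude-Web", "ClaudeBot",
    "cohere-ai", "DataForSeoBot", "Diffbot", "FacebookBot",
    "Google-Extended", "GPTBot", "ImagesiftBot", "magpie-crawler",
    "omgili", "Omgilibot", "peer39_crawler", "PerplexityBot", "YouBot",
]

# Joined with "|" and left unanchored, like the pattern in the rule.
# re.IGNORECASE plays the role of the [NC] flag.
BOT_PATTERN = re.compile("(" + "|".join(BOT_NAMES) + ")", re.IGNORECASE)


def is_blocked(user_agent: str) -> bool:
    """Hypothetical helper: True if the rule's condition would match."""
    return BOT_PATTERN.search(user_agent) is not None


# Illustrative user agent strings, not captured from real traffic.
print(is_blocked("Mozilla/5.0 (compatible; gptbot/1.0)"))  # True
print(is_blocked("Mozilla/5.0 (X11; Linux x86_64; rv:124.0) Gecko/20100101 Firefox/124.0"))  # False
```

Because the search is unanchored, a short name such as ChatGPT also covers longer ones like ChatGPT-User.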
If you’re interested, here’s how I automated this in my Jekyll setup:
While I could edit my .htaccess file every time a new terrible “AI” crawler hits the market, I thought it might be easier to use Jekyll’s data files to build the whole thing dynamically.
Here’s my current bots.yml file, which I’ve put in my site’s _data directory:
- AdsBot-Google
- Amazonbot
- anthropic-ai
- Applebot
- Applebot-Extended
- AwarioRssBot
- AwarioSmartBot
- Bytespider
- CCBot
- ChatGPT
- ChatGPT-User
- Claude-Web
- ClaudeBot
- cohere-ai
- DataForSeoBot
- Diffbot
- FacebookBot
- Google-Extended
- GPTBot
- ImagesiftBot
- magpie-crawler
- omgili
- Omgilibot
- peer39_crawler
- PerplexityBot
- YouBot
With that file set up, Jekyll can read its contents by accessing a site.data.bots variable. And in the template that generates my .htaccess file, I can use that variable to build my mod_rewrite rule:
<IfModule mod_rewrite.c>
RewriteEngine on
RewriteBase /
# block “AI” bots
RewriteCond %{HTTP_USER_AGENT} ({{ site.data.bots | sort_natural | join: "|" }}) [NC]
RewriteRule ^ - [F]
</IfModule>
Whenever I add a new name to my bots.yml file, my .htaccess file will update itself automatically. Neat!
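If it helps to see what the Liquid expression expands to, the sketch below approximates it in Python: sort_natural becomes a case-insensitive sort and join becomes a "|"-separated string. The build_rewrite_cond function is a hypothetical stand-in for the template line, not part of Jekyll, and the casefold-based sort is an assumption about how closely Python's ordering tracks Liquid's for names like these.

```python
def build_rewrite_cond(names: list[str]) -> str:
    """Hypothetical stand-in for the templated RewriteCond line."""
    # Approximates Liquid's sort_natural: order names ignoring case.
    ordered = sorted(names, key=str.casefold)
    # Approximates Liquid's join: "|" turns the list into a regex alternation.
    pattern = "|".join(ordered)
    return "RewriteCond %{HTTP_USER_AGENT} (" + pattern + ") [NC]"


# A short sample; the real list lives in _data/bots.yml.
sample = ["omgili", "Omgilibot", "anthropic-ai", "Applebot"]
print(build_rewrite_cond(sample))
# RewriteCond %{HTTP_USER_AGENT} (anthropic-ai|Applebot|omgili|Omgilibot) [NC]
```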
That’s how I’ve got things working, anyway. We’ll see how it works going forward.
And as you might imagine, I really wish I didn’t have to think about any of this.
Update: I’m grateful to Nick Doty for pointing out that Google-Extended doesn’t appear in a user agent string, which means it can only be blocked by robots.txt files. So for now, I’m duplicating the above logic in both my .htaccess file and in my robots.txt. I should’ve known it wouldn’t be quite that easy! Capitalism is working fine! Computers were a mistake!
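The post doesn't show the robots.txt side, so the following is only a guess at what duplicating the logic could look like: a Python sketch that turns the same list of names into a single robots.txt group that disallows the whole site. The build_robots_txt function and the choice of one shared group are assumptions, not the author's actual template.

```python
def build_robots_txt(names: list[str]) -> str:
    """Hypothetical helper: one robots.txt group covering every listed bot."""
    # One User-agent line per crawler name.
    lines = ["User-agent: " + name for name in names]
    # A single rule for the whole group: disallow every path.
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"


# A short sample; the real list would come from the same data file.
print(build_robots_txt(["Google-Extended", "GPTBot"]), end="")
# User-agent: Google-Extended
# User-agent: GPTBot
# Disallow: /
```

As the post says earlier, this only helps with crawlers that actually honor robots.txt, which is why it sits alongside the .htaccess rule instead of replacing it.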
Note: I apologize, but I can’t provide any technical support for anything outlined in this post.
With that said, all of the code blocks on this page are built dynamically, using the data file I use to build my .htaccess file. As long as I don’t break anything on this little site (which is, to be clear, quite likely!), the snippets above should stay current whenever I add a new bot to the list.
This has been “Blockin’ bots.” a post from Ethan’s journal.
Ethan Marcotte | Sciencx (2024-04-12T04:00:00+00:00) Blockin’ bots.. Retrieved from https://www.scien.cx/2024/04/12/blockin-bots/