Blockin’ bots.

Here’s how I’m blocking “artificial intelligence” bots, crawlers, and scrapers.


This content originally appeared on Ethan Marcotte’s website and was authored by Ethan Marcotte

I’ve been blocking various “artificial intelligence” (“AI”) bots on my site. Why, you ask? Well, I don’t like the idea of my work being hoovered up to train “AI” data models; I don’t like that these companies assume my content’s available to them by default, and that I have to opt out of their scraping; I really don’t want anything I write to support these platforms, which I find unethical, extractive, deeply amoral, and profoundly anti-human.

But! Sadly, this is the world we live in. So I’m opting out.

There are many excellent tutorials out there on how to do this. (For my money, Neil Clarke and Cory Dransfeldt wrote two of my favorites, getting into not just the how of blocking “AI” bots, but also the why.) Most of those tutorials involve instructions on how to edit your site’s robots.txt file, providing it with a list of user agents — basically, the “names” of the bots, crawlers, and browsers visiting your site — and blocking them from accessing your website.
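For reference, the robots.txt approach looks something like this — an abridged sketch with just two of the bots from my list (the tutorials above have complete, current versions):

```text
# robots.txt — each block asks one crawler to stay away from the whole site
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```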

But as I understand it, the one shortcoming of robots.txt is that it only works if visiting bots actually honor your robots.txt file. (Google has a good intro on this, if you’re interested.) That’s why I’ve opted to use my site’s .htaccess file to block these bots. As a friend put it recently, robots.txt is a bit like asking bots to not visit my site; with .htaccess, you’re not asking.

(If you’re wondering if robots that ignore robots.txt would perhaps lie about their user agent, you’re right to do so. Relying on user agent strings — which are basically swamps filled with lies — is incredibly fraught in any context, including this one. That’s why I consider this approach to be marginally better than robots.txt, not a perfect solution.)

Anyway. Here’s how I got things working.

First, I polled a few different sources to build a list of currently known crawler names. Once I had them, I dropped them into a mod_rewrite rule in my .htaccess file:

<IfModule mod_rewrite.c>
RewriteEngine on
RewriteBase /

# block “AI” bots
RewriteCond %{HTTP_USER_AGENT} (AdsBot-Google|Amazonbot|anthropic-ai|Applebot|Applebot-Extended|AwarioRssBot|AwarioSmartBot|Bytespider|CCBot|ChatGPT|ChatGPT-User|Claude-Web|ClaudeBot|cohere-ai|DataForSeoBot|Diffbot|FacebookBot|Google-Extended|GPTBot|ImagesiftBot|magpie-crawler|omgili|Omgilibot|peer39_crawler|PerplexityBot|YouBot) [NC]
RewriteRule ^ - [F]
</IfModule>

If the crawler’s name appears anywhere in that gnarly-looking list of user agents, my site should block it.
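If you want to sanity-check how that match behaves, here's a rough Python sketch — not part of my actual setup, just an approximation of what `RewriteCond … [NC]` does: an unanchored, case-insensitive search for the alternation anywhere in the `User-Agent` header. (The list is abridged here for brevity.)

```python
import re

# A few names from the full list in .htaccess, joined with "|"
# exactly as they are in the RewriteCond pattern
bots = ["AdsBot-Google", "Amazonbot", "GPTBot", "PerplexityBot"]
pattern = re.compile("(" + "|".join(bots) + ")", re.IGNORECASE)

def is_blocked(user_agent: str) -> bool:
    # Mirrors RewriteCond %{HTTP_USER_AGENT} (...) [NC]: the name can
    # appear anywhere in the string, and case doesn't matter
    return pattern.search(user_agent) is not None

print(is_blocked("Mozilla/5.0 (compatible; GPTBot/1.0)"))   # True
print(is_blocked("Mozilla/5.0 (Macintosh; Intel Mac OS X)"))  # False
```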

If you’re interested, here’s how I automated this in my Jekyll setup:

While I could edit my .htaccess file every time a new terrible “AI” crawler hits the market, I thought it might be easier to use Jekyll’s data files to build the whole thing dynamically.

Here’s my current bots.yml file, which I’ve put in my site’s _data directory:

- AdsBot-Google
- Amazonbot
- anthropic-ai
- Applebot
- Applebot-Extended
- AwarioRssBot
- AwarioSmartBot
- Bytespider
- CCBot
- ChatGPT
- ChatGPT-User
- Claude-Web
- ClaudeBot
- cohere-ai
- DataForSeoBot
- Diffbot
- FacebookBot
- Google-Extended
- GPTBot
- ImagesiftBot
- magpie-crawler
- omgili
- Omgilibot
- peer39_crawler
- PerplexityBot
- YouBot

With that file set up, Jekyll can read its contents by accessing a site.data.bots variable. And in the template that generates my .htaccess file, I can use that variable to build my mod_rewrite rule:


<IfModule mod_rewrite.c>
RewriteEngine on
RewriteBase /

# block “AI” bots
RewriteCond %{HTTP_USER_AGENT} ({{ site.data.bots | sort_natural | join: "|" }}) [NC]
RewriteRule ^ - [F]
</IfModule>

Whenever I add a new name to my bots.yml file, my .htaccess file will update itself automatically. Neat!

That’s how I’ve got things working, anyway. We’ll see how it holds up going forward.

And as you might imagine, I really wish I didn’t have to think about any of this.

Update: I’m grateful to Nick Doty for pointing out that Google-Extended doesn’t appear in a user agent string, which means it can only be blocked by robots.txt files. So for now, I’m duplicating the above logic in both my .htaccess file and in my robots.txt. I should’ve known it wouldn’t be quite that easy! Capitalism is working fine! Computers were a mistake!
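For the curious, the robots.txt side of that duplication looks roughly like this:

```text
# Google-Extended never appears in a User-Agent header,
# so it can only be blocked here, not in .htaccess
User-agent: Google-Extended
Disallow: /
```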


Note: I apologize, but I can’t provide any technical support for anything outlined in this post.

With that said, all of the code blocks on this page are built dynamically, using the data file I use to build my .htaccess file. As long as I don’t break anything on this little site — which is, to be clear, quite likely! — the snippets above should stay current whenever I add a new bot to the list.


This has been “Blockin’ bots.”, a post from Ethan’s journal.

Reply via email



Ethan Marcotte (2024-04-12). “Blockin’ bots.” Sciencx. https://www.scien.cx/2024/04/12/blockin-bots/
