Target User Agents and Reduce Spam via robots.txt



This content originally appeared on Perishable Press and was authored by Jeff Starr

[Image: Bad robots]

Your website’s robots.txt file probably contains some rules that tell compliant search engines and other bots which pages they may crawl and which are off-limits. In most of the robots.txt files that I’ve looked at, all of the Allow and Disallow rules are applied to all user agents. This is done with the wildcard operator, written as an asterisk *, like this:

User-agent: *

This site’s robots.txt file provides a typical example. All of the allow/disallow rules apply to any bot that visits, regardless of its reported user agent.
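If you want to see the wildcard behavior for yourself, Python’s standard-library robots.txt parser makes a quick sandbox. This is just an illustrative sketch, not anything from this site’s actual file: the rule, bot names, and paths below are made up for the demo. It shows that a * group answers for any user agent that has no group of its own:

from urllib import robotparser

# Hypothetical wildcard-only robots.txt, just for the demo
rules = [
    "User-agent: *",
    "Disallow: /wp-admin/",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Every bot falls under the * group, so both get the same answers
for bot in ("Googlebot", "SomeRandomBot"):
    print(bot, rp.can_fetch(bot, "/wp-admin/install.php"))  # False (disallowed path)
    print(bot, rp.can_fetch(bot, "/blog/some-post/"))       # True (everything else)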

Your robots.txt file provides a simple way to block TONS of bad bots, spiders and scrapers.

Targeting specific user agents

So the universal wildcard * is the default for most websites: simply apply all robots rules to all bots. That’s fine, very simple and easy. But if your site is getting hammered by tons of spam and other unwanted bot behavior and creepy crawling, you can immediately reduce the spam traffic by targeting specific user agents via robots.txt. Let’s look at an example:

User-agent: *
Allow: /

User-agent: 360Spider
Disallow: /

User-agent: A6-Indexer
Disallow: /

User-agent: Abonti
Disallow: /

User-agent: AdIdxBot
Disallow: /

User-agent: adscanner
Disallow: /

First and most important, we allow all user agents access to everything. That happens in the first two lines:

User-agent: *
Allow: /

So that means the default rule for any visiting bots is, in human-speak, “go ahead and crawl anywhere you want, no restrictions.” From there, the remaining robots rules target specific user agents: 360Spider, A6-Indexer, Abonti, and so forth. For each of those bots, we disallow access to everything with a Disallow rule:

Disallow: /

So if any of these other bots come along, and they happen to be compliant with robots.txt (many are, surprisingly), they will obey the disallow rule and not crawl anything at your site. In robots language, the slash / matches any/all URLs.

Together, the above set of robots rules effectively says to bots, “360Spider, A6-Indexer, Abonti, AdIdxBot, and adscanner are not allowed here, but all other bots can crawl whatever they’d like.”
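To sanity-check that logic, here is a quick sketch using Python’s built-in urllib.robotparser module (nothing specific to this article, just the standard library). It parses a trimmed-down copy of the example rules above and confirms that the named bots are shut out of every path, while everyone else is allowed:

from urllib import robotparser

# A trimmed-down copy of the example rules above
rules = """
User-agent: *
Allow: /

User-agent: 360Spider
Disallow: /

User-agent: A6-Indexer
Disallow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Disallow: / matches every URL for the listed bots
print(rp.can_fetch("360Spider", "/"))               # False
print(rp.can_fetch("360Spider", "/any/page.html"))  # False
print(rp.can_fetch("A6-Indexer", "/feed/"))         # False

# Anything else falls under the * group and is allowed
print(rp.can_fetch("Googlebot", "/any/page.html"))  # True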

Depending on the site, this user-agent technique can greatly reduce the amount of bandwidth-wasting, resource-draining spammy bot traffic. The more spammy bots you block via robots.txt, the less spammy traffic is gonna hit your site. So where can you get an awesome list of spammy user-agents? Glad you asked…

Tip: An informative look at how Google interprets the robots.txt specification. Lots of juicy details in that article.

Block 650+ spammy user-agents

To get you started crafting your own anti-spam robots rules, here is a collection of over 650 spammy user-agents. This is a ready-to-go, copy-&-paste set of user-agent rules for your site’s robots.txt file. By downloading, you agree to the Terms.

650+ user-agents for robots.txt (Version 1.0, TXT download)
Note: I can’t take credit for this set of rules. It was sent in by a reader some time ago. If you happen to know the source of these robots.txt user-agent rules, let me know so I can add a credit link to the article.

Terms / Disclaimer

This collection of user-agent rules for robots.txt is provided “as-is”, with the intention of helping people protect their sites against bad bots and other malicious activity. By downloading this code, you agree to accept full responsibility for its use. So use wisely, test thoroughly, and enjoy.

Also: Before applying this set of robots rules, make sure to read through the rules and remove any user-agent(s) that you don’t want to block. And to be extra safe, make sure to validate your robots.txt file using an online validator.
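If you’d rather do a quick check from the command line, in addition to an online validator, something like the following sketch works with nothing but the Python standard library. The URL, bot names, and paths are placeholders, so swap in your own site and whichever user agents you care about:

from urllib import robotparser

# Placeholder URL: point this at your own site's robots.txt
rp = robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()  # fetch and parse the live file

# Spot-check a few user agents against a few paths
for bot in ("Googlebot", "360Spider", "Abonti"):
    for path in ("/", "/blog/", "/wp-admin/"):
        print(f"{bot:12} {path:12} allowed={rp.can_fetch(bot, path)}")

Each “allowed=False” line means a compliant crawler reporting that user agent should skip that path.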

Other ways to block bad bots

If you’re asking whether robots.txt is the best way to block spammy user-agents, the answer is probably “no”. Much of my work with Apache/.htaccess (like nG Firewall) focuses on strong ways to block bad bots. If you want more powerful antispam/bad-bot protection, check out these tools from yours truly:

  • Blackhole for Bad Bots — Trap bad bots in a virtual black hole (free WordPress plugin)
  • Blackhole for Bad Bots — Trap bad bots in a virtual black hole (PHP/standalone script)
  • 7G Firewall — Super strong firewall (and bad-bot protection) for sites running on Apache or Nginx

So check ’em out if you want stronger protection against online threats. For casual admins and site owners, the robots.txt user-agent rules provide a simple, effective way to reduce spam.


