ryan Posted Friday at 07:24 PM
Many websites these days are the feeding ground for AI bots. Especially this site! In this post we look at a tool for taming all the hungry crawlers and bots…
https://processwire.com/blog/posts/throttling-ai-bot-traffic-in-processwire/
Robin S Posted Friday at 10:44 PM
Hi @ryan,
This new version of WireRequestBlocker has a breaking change relative to the previous version: it now requires PHP >= 8.0, due to its use of str_starts_with(). Because Pro modules are not upgradable via the PW admin, users don't see notices about requirements before upgrading (and the PHP 8 requirement isn't stated in getModuleInfo() in any case). Could you please highlight the PHP 8 requirement somehow, or change the code so it has the same requirements as previous versions of the module? Thanks.
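For anyone stuck on PHP 7.x in the meantime, a conditional fallback is only a few lines. This is the standard polyfill pattern, not code from WireRequestBlocker itself:

```php
<?php
// Fallback for str_starts_with(), which only exists natively in PHP 8.0+.
// Defining it conditionally keeps the same code working on PHP 7.x and 8.x.
if(!function_exists('str_starts_with')) {
	function str_starts_with($haystack, $needle) {
		// An empty needle always matches; otherwise compare the leading bytes
		return $needle === '' || strncmp($haystack, $needle, strlen($needle)) === 0;
	}
}
```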
gebeer Posted Saturday at 12:54 AM
Quote: "[...] especially from the latest breed of AI bots that have an endless appetite for collecting training data."
Hi @ryan, maybe I'm misreading this, but actually you would want bots to collect training data for PW, especially for the API reference part. This website does not publish content that is protected IP; it offers information that aims to attract developers and decision makers to use PW for their business. IMHO, blocking these bots is counterproductive. You are cutting yourself off from a growing number of developers who build projects with AI tools to boost their productivity. In the near future we probably will not be able to compete if we do not use these tools. The more accurate training data and context these AI assistants have for PW, the better they can perform and produce actually usable, production-ready code. I would give the current approach of blocking these bots a second thought.
Gideon So Posted Saturday at 01:01 AM
Hi @gebeer, while I understand your concern about blocking AI bots, what I get from Ryan's post is that he doesn't completely cut them off; he just wants to limit their visit rate because they come too often. I think that is OK, because I don't think the documentation changes every few seconds.
Gideon
ryan Posted Saturday at 01:25 AM (Author)
@gebeer Throttling is what enables us to allow the AI bots, rather than having to block them for taking over the site's resources. So long as the bots adhere to the rules established in the robots.txt, they'll never get throttled. But if they ignore the crawl delay, then those requests get throttled with a 429 error. We even include a Retry-After header telling them when they can try again. I used to have to block these bots outright in order to preserve the resources for you and me. Now they can crawl as much as they like, so long as they follow the speed limit. The throttle feature provides a way to enforce the speed limit.
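In plain PHP terms, the general mechanism looks roughly like this. This is a simplified sketch of the idea, not the module's actual code, and the wait time is just an assumed value for illustration:

```php
<?php
// Sketch: when a crawler exceeds its allowed request rate, answer with
// 429 Too Many Requests and tell it when it may come back.
$secondsUntilAllowed = 60; // hypothetical value; a real implementation computes this

http_response_code(429); // 429 Too Many Requests
header("Retry-After: $secondsUntilAllowed"); // when the bot may try again
header('Content-Type: text/plain');
echo "Too many requests. Please respect the crawl delay in /robots.txt and retry later.";
exit;
```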
gebeer Posted Saturday at 03:46 AM
Ryan, thank you for clarifying. This totally makes sense now :-)
BrendonKoz Posted 22 hours ago
This is awesome timing. Our hosting service only allots a set number of processes per customer, and due to bots we have been getting throttled, with web requests delayed or outright refused because too many were being handled at once. Our overall traffic, as reported by our host, is about 55% bot requests!
ryan Posted 21 hours ago (Author)
@BrendonKoz Great! Please let me know how it works for you. Any sense of which bots are causing the most trouble? The next thing I plan to build for WireRequestBlocker is a user agent counter/profiler, so that it's easier to identify problematic bots. That way you can throttle them specifically rather than throttling them as general traffic.
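To illustrate the idea only (this is a rough sketch, not the planned feature), even a simple tally of user agent strings from the server access log shows which crawlers hit a site most. The log path and "combined" log format are assumptions for this sketch:

```php
<?php
// Tally user agents from an Apache/Nginx "combined" access log, where the
// user agent is the last double-quoted field on each line.
$logFile = '/var/log/apache2/access.log'; // assumed path
$counts = [];

foreach(file($logFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
	if(preg_match('/"([^"]*)"\s*$/', $line, $matches)) {
		$agent = $matches[1];
		$counts[$agent] = ($counts[$agent] ?? 0) + 1;
	}
}

arsort($counts); // most frequent user agents first

foreach(array_slice($counts, 0, 20, true) as $agent => $count) {
	echo "$count\t$agent\n";
}
```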
BrendonKoz Posted 18 hours ago
I hope to give it a try tomorrow, but if I can't get to it, the first chance I'll have is next week. That said, I will definitely let you know! From a cursory search of recent logs, the following bots were problematic:
- Bingbot (Microsoft, USA)
- Bytespider (ByteDance, so TikTok, China)
- MJ12bot (Majestic, SEO tool, UK)
- AhrefsBot (Ahrefs, SEO tool, USA)
- PetalBot (Petal search engine, China)
- CensysInspect (internet vulnerability scanner, USA -- I think this is being abused and used as an attempted attack vector on our site, but they say it abides by crawl delay)
I honestly did not realize there was/is a crawl speed directive for robots.txt (that some bots follow); see the example below. I would've implemented that a long time ago. I do intend to implement ProCache at some point as well, but this will be a very nice intermediary.
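For anyone else who hadn't seen it, the directive looks like this in robots.txt. Crawl-delay is non-standard, so support varies by crawler, and the values here are only examples:

```
# Ask crawlers to wait roughly 10 seconds between requests
User-agent: *
Crawl-delay: 10

# A stricter limit can be set for a specific bot
User-agent: MJ12bot
Crawl-delay: 30
```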
ryan Posted 17 hours ago (Author)
@BrendonKoz I've got all those bots in our list as well, except for Bingbot. As far as I can tell, Bingbot follows the crawl delay, so it's one of the good ones.