ryan Posted Friday at 07:24 PM Many websites these days are a feeding ground for AI bots. Especially this site! In this post we look at a tool for taming all the hungry crawlers and bots… https://processwire.com/blog/posts/throttling-ai-bot-traffic-in-processwire/
Robin S Posted Friday at 10:44 PM Hi @ryan, This new version of WireRequestBlocker has a breaking change relative to the previous version: it now requires PHP >= 8, due to the use of str_starts_with(). Because pro modules are not upgradable via the PW admin, users don't see notices about requirements before upgrading (and the PHP 8 requirement isn't stated in getModuleInfo() in any case). Could you please highlight the PHP 8 requirement somehow, or change the code so it has the same requirements as previous versions of the module? Thanks.
gebeer Posted yesterday at 12:54 AM Quote: "[...] especially from the latest breed of AI bots that have an endless appetite for collecting training data." Hi @ryan, maybe I'm misreading this, but you would actually want bots to collect training data for PW, especially for the API reference part. This website does not publish content that is protected IP. It offers information that aims to attract developers and decision makers to use PW for their business. IMHO, blocking these bots is counterproductive. You are cutting yourself off from a growing number of developers who build projects with AI tools to boost their productivity. In the near future we probably will not be able to compete if we do not use these tools. The more accurate training data and context these AI assistants have for PW, the better they can perform and produce actually usable, production-ready code. I would give the current approach of blocking these bots a second thought.
Gideon So Posted yesterday at 01:01 AM Hi @gebeer, I understand your concern about blocking AI bots, but what I get from Ryan's post is that he doesn't completely cut off AI bots. The problem is that they come too often, so he just wants to limit their visit rate. I think that is OK because I don't think the documentation changes every few seconds. Gideon
ryan (Author) Posted yesterday at 01:25 AM @gebeer Throttling is what enables us to allow the AI bots, rather than having to block them for taking over the site's resources. So long as the bots adhere to the rules established in the robots.txt, they'll never get throttled. But if they ignore the crawl delay, then those requests get throttled with a 429 error. We even include a Retry-After header telling them when they can try again. I used to have to block these bots outright in order to preserve the resources for you and me. Now they can crawl as much as they like, so long as they follow the speed limit. The throttle feature provides a way to enforce the speed limit.
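For anyone who wants to picture the behavior described in the post above, here is a minimal Python sketch, assuming a single fixed crawl delay applied per client. It is not WireRequestBlocker's code (that module is PHP, and its internals are not shown in this thread). The class name `CrawlDelayThrottle`, the `check` method, and the string used to identify a client are all hypothetical names chosen for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class CrawlDelayThrottle:
    """Hypothetical helper: allow one request per client every `crawl_delay` seconds."""

    crawl_delay: float
    # Timestamp (in seconds) of the last allowed request for each client.
    _last_allowed: Dict[str, float] = field(default_factory=dict)

    def check(self, client_id: str, now: float) -> Tuple[int, Dict[str, str]]:
        """Return (status code, response headers) for a request arriving at `now`."""
        last = self._last_allowed.get(client_id)
        if last is not None and now - last < self.crawl_delay:
            # Too soon: respond 429 and say how many whole seconds to wait.
            wait = math.ceil(self.crawl_delay - (now - last))
            return 429, {"Retry-After": str(wait)}
        # Within the speed limit: record this request and let it through.
        self._last_allowed[client_id] = now
        return 200, {}


throttle = CrawlDelayThrottle(crawl_delay=10)
print(throttle.check("example-bot", now=0.0))   # (200, {})
print(throttle.check("example-bot", now=3.0))   # (429, {'Retry-After': '7'})
print(throttle.check("example-bot", now=10.0))  # (200, {})
```

A throttled request does not reset the timer in this sketch, so a bot that waits out the Retry-After value gets through on its next attempt. Whether the real module works the same way is an assumption.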
gebeer Posted yesterday at 03:46 AM 2 hours ago, ryan said: @gebeer Throttling is what enables us to allow the AI bots, rather than having to block them for taking over the site's resources. So long as the bots adhere to the rules established in the robots.txt, they'll never get throttled. But if they ignore the crawl delay, then those requests get throttled with a 429 error. We even include a Retry-After header telling them when they can try again. I used to have to block these bots outright in order to preserve the resources for you and me. Now they can crawl as much as they like, so long as they follow the speed limit. The throttle feature provides a way to enforce the speed limit. Ryan, thank you for clarifying. This totally makes sense now :-)