Posted

Hi @ryan,

This new version of WireRequestBlocker has a breaking change relative to the previous version in that it now requires PHP >= v8, due to the use of str_starts_with().

Because pro modules are not upgradable via the PW admin, users don't see notices about requirements before upgrading (and the PHP 8 requirement isn't stated in getModuleInfo() in any case). Could you please highlight the PHP 8 requirement somehow, or change the code so it has the same requirements as previous versions of the module?

Thanks.
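For readers unfamiliar with how a module states its requirements: as far as I know, ProcessWire reads a 'requires' entry from getModuleInfo() and accepts version conditions such as 'PHP>=8.0.0' there. The sketch below is only an illustration. The class name, title and version are hypothetical, and a real module would extend WireData and implement Module rather than being a plain class.

```php
<?php
// Hypothetical stand-in for a module class. A real ProcessWire module would be
// declared as: class ExampleModule extends WireData implements Module { ... }
class ExampleModule {

    /**
     * Module information, including the stated PHP requirement.
     *
     * @return array
     */
    public static function getModuleInfo() {
        return [
            'title' => 'Example Module', // hypothetical title
            'version' => 1,              // hypothetical version
            // Stating the requirement here lets the requirement be surfaced
            // at install time instead of failing later at runtime.
            'requires' => ['PHP>=8.0.0'],
        ];
    }
}
```

This only helps for installs and upgrades that go through the admin, which is why the point about pro modules above still stands.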

  • Like 1
Posted
Quote

[...]especially from the latest breed of AI bots that have an endless appetite for collecting training data.

Hi @ryan,

Maybe I'm misreading this, but you would actually want bots to collect training data for PW, especially for the API reference part. This website does not publish content that is protected IP. It offers information that aims to attract developers and decision makers to use PW for their business.

IMHO, blocking these bots is counterproductive. You are cutting yourself off from a growing number of developers who build projects with AI tools to boost their productivity. In the near future we probably will not be able to compete if we do not use these tools.

The more accurate training data and context these AI assistants have for PW, the better they can perform and produce actually usable, production-ready code.

I would give the current approach of blocking these bots a second thought.

 

  • Like 1
Posted

Hi @gebeer

I understand your concern about blocking AI bots, but what I get from Ryan's post is that he isn't cutting them off completely. The issue is that they visit too often, so he just wants to limit their visit rate. I think that's OK, because I don't think the documentation changes every few seconds.

Gideon

  • Like 4
Posted

@gebeer Throttling is what enables us to allow the AI bots, rather than having to block them for taking over the site's resources. So long as the bots adhere to the rules established in the robots.txt they'll never get throttled. But if they ignore the crawl delay, then those requests get throttled with a 429 error. We even include a Retry-After header telling them when they can try again. I used to have to block these bots outright in order to preserve the resources for you and me. Now they can crawl as much as they like, so long as they follow the speed limit. The throttle feature provides a way to enforce the speed limit.
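To make the "speed limit" idea concrete, here is a minimal sketch of the decision described above: compare the time since a client's last request against the crawl delay, and answer with a 429 status plus a Retry-After header when the client came back too soon. This is not WireRequestBlocker's actual code. The function names, the per-client timestamp and the array-shaped response are all assumptions made for illustration.

```php
<?php
/**
 * Hypothetical helper: how many seconds a client must still wait.
 *
 * @param int|null $lastRequestTime Unix time of the client's previous request, or null if none is known
 * @param int $now Current Unix time
 * @param int $crawlDelay Minimum number of seconds required between requests
 * @return int 0 when the request may proceed, otherwise the remaining wait in seconds
 */
function secondsUntilAllowed(?int $lastRequestTime, int $now, int $crawlDelay): int {
    if ($lastRequestTime === null) return 0; // first request seen from this client
    $elapsed = $now - $lastRequestTime;
    return $elapsed >= $crawlDelay ? 0 : $crawlDelay - $elapsed;
}

/**
 * Hypothetical helper: turn the remaining wait into a status code and headers.
 *
 * @param int $wait Remaining wait in seconds
 * @return array [int status, array headers]
 */
function throttleResponse(int $wait): array {
    if ($wait <= 0) return [200, []];
    // 429 Too Many Requests, with Retry-After given in seconds
    return [429, ['Retry-After' => (string) $wait]];
}
```

A bot that honors a 10 second crawl delay always gets a wait of 0 and is never throttled, while one that returns after 5 seconds would be told to retry in 5 more.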

  • Like 5
Posted
2 hours ago, ryan said:

@gebeer Throttling is what enables us to allow the AI bots, rather than having to block them for taking over the site's resources. So long as the bots adhere to the rules established in the robots.txt they'll never get throttled. But if they ignore the crawl delay, then those requests get throttled with a 429 error. We even include a Retry-After header telling them when they can try again. I used to have to block these bots outright in order to preserve the resources for you and me. Now they can crawl as much as they like, so long as they follow the speed limit. The throttle feature provides a way to enforce the speed limit.

Ryan, thank you for clarifying. This totally makes sense now :-)

  • Like 4
Posted

This is awesome timing. Our hosting service allots only a set number of processes per customer. Because of bots, we have been getting throttled, with web requests delayed or outright refused when too many were being handled at once. According to our host, about 55% of our overall traffic is bot requests!

  • Like 1
Posted

@BrendonKoz Great! Please let me know how it works for you. Any sense of which bots are causing the most trouble? The next thing I plan to build for WireRequestBlocker is a user agent counter/profiler, so that it's easier to identify problematic bots. That way you can throttle them specifically rather than throttling general traffic.

  • Thanks 1
Posted
2 hours ago, ryan said:

@BrendonKoz Great! Please let me know how it works for you. Any sense of which bots are causing the most trouble? The next thing I plan to build for WireRequestBlocker is a user agent counter/profiler, so that it's easier to identify problematic bots. That way you can throttle them specifically rather than throttling general traffic.

I hope to give it a try tomorrow, but if I can't get to it, the first chance I'll have is next week. That said, I will definitely let you know!

From a cursory search with recent logs, the following bots were problematic:

  • Bingbot (Microsoft, USA)
  • Bytespider (ByteDance, so TikTok, China)
  • MJ12bot (Majestic, SEO Tool, UK)
  • AhrefsBot (Ahrefs, SEO Tool, USA)
  • PetalBot (Petal Search Engine, China)
  • CensysInspect (Internet Vulnerability Scanner, USA -- I think this is being abused and used as an attempted attack vector on our site, but they say it abides by crawl delay)

I honestly did not realize there was/is a crawl speed directive for robots.txt (that some bots follow). I would've implemented that a long time ago. I do intend to implement ProCache at some point as well but this will be a very nice intermediary.

  • Like 2
Posted

@BrendonKoz I've got all those bots in our list as well, except for Bingbot. As far as I can tell, Bingbot follows the crawl delay, so it is one of the good ones.

  • Like 2
Posted

As the module name has changed, is there a recommended way to upgrade from the prior module? The ProcessWireUpgrade module doesn't seem to notice that there's an update to WireRequestBlocker. I assume the two versions share the same folder name on the server, but if they have different database records, might custom settings fail to transfer?

Posted

@BrendonKoz It should just be a matter of replacing the module files with the new ones. Then do a modules refresh. Then go to the module config page to set up the throttling features. It should install the new ProcessRequestBlocker module automatically, which will appear on the Setup top nav menu.

  • Like 2
Posted

@Robin S I didn't intend for it to require PHP 8. I mistakenly thought str_contains() and str_starts_with() came in PHP 7.x. I've updated the download so that it replaces those function usages with strpos().
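For anyone making the same change in their own code, the usual PHP 7 compatible equivalents look roughly like the sketch below. The helper names are hypothetical and this is not the module's actual source. The empty needle check is there because strpos() with an empty needle emits a warning and returns false before PHP 8, whereas str_starts_with() and str_contains() return true for it.

```php
<?php
/**
 * Hypothetical PHP 7 compatible stand-in for str_starts_with().
 */
function startsWith(string $haystack, string $needle): bool {
    if ($needle === '') return true; // matches PHP 8 behavior, avoids the empty needle warning
    // strpos() returns 0 only when the needle begins at the first byte.
    // The strict === matters: strpos() returns false when there is no match,
    // and false == 0 would be true with a loose comparison.
    return strpos($haystack, $needle) === 0;
}

/**
 * Hypothetical PHP 7 compatible stand-in for str_contains().
 */
function contains(string $haystack, string $needle): bool {
    if ($needle === '') return true;
    return strpos($haystack, $needle) !== false;
}
```

For example, startsWith('/wp-login.php', '/wp-') is true, while startsWith('/blog/wp-login.php', '/wp-') is false because the match is not at position 0.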
