ProcessWire Request Blocker

Helps to prevent scanners and bots from consuming too much of the server’s resources by automatically blocking their IPs when specific patterns are detected.

Overview
Installation
- Updating your .htaccess file (optional but recommended)
- Finding where your server provides the user’s IP address
Configuration
Using request blocker in other PHP applications
Blocking requests from your .htaccess file
- Using your .htaccess file to block by partial URL
- Using your .htaccess file to block bots by user-agent

Overview

WireRequestBlocker watches ProcessWire web requests and blocks requests containing user-defined patterns in URLs or user agent strings. Each time a match occurs it counts as 1 strike. After a defined number of strikes, the IP address is blocked for a period of time, as configured with the module.

This module also supports whitelist rules, enabling you to disable the blocker for requests containing a particular cookie name, user agent, or IP address. The module displays a list of all watched and blocked IPs, enabling you to optionally remove any of them.

When a request matches the blocking rules, an HTTP 400 Bad Request error is returned. If additional requests match the blocking rules (reaching the max number of strikes) then all requests from that IP will be blocked with an HTTP 403 Forbidden error. The block will expire after a set amount of time.
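
The strike logic above can be sketched roughly as follows. This is a hypothetical illustration, not the module's actual code: the function name strikeOutcome and its return format are made up here, and it assumes that the request which reaches the strike limit is itself answered with a 403.

```php
<?php
// Hypothetical sketch of the strike logic; NOT the module's actual code.
// Assumption: the request that reaches the limit already receives a 403.

/**
 * @param int  $strikes    Strikes already recorded for this IP
 * @param int  $maxStrikes Strikes required before the IP is blocked
 * @param bool $matches    Whether this request matches a blocking rule
 * @return array [HTTP status code, updated strike count]
 */
function strikeOutcome(int $strikes, int $maxStrikes, bool $matches): array {
    if ($strikes >= $maxStrikes) {
        return [403, $strikes]; // IP is already blocked: every request is refused
    }
    if (!$matches) {
        return [200, $strikes]; // nothing suspicious, the request proceeds
    }
    $strikes++; // each matching request counts as 1 strike
    // Reaching the limit turns the strike into a block
    return [$strikes >= $maxStrikes ? 403 : 400, $strikes];
}
```

With a limit of 2 strikes, the first matching request from an IP gets a 400, and from the second match onward every request from that IP gets a 403 until the block expires.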

If installed as directed, once an IP is blocked, it is blocked directly by Apache, before PHP or ProcessWire loads. This ensures that a resource-hungry scanner or bot can't exhaust the server.

WARNING: You could very easily block legitimate requests with this module so please be careful and read all instructions and configuration settings.

Installation

1. Copy all files from the module into a new directory: /site/modules/WireRequestBlocker/
2. In your admin, navigate to Modules > Refresh.
3. On the Modules > Site tab, click “Install” for this module.
4. Update your .htaccess file as described in the section below.
5. Please read through the “Configuration” section.

Updating your .htaccess file (optional but recommended)

Updating your .htaccess file as described here enables WireRequestBlocker to block IP addresses before loading PHP or ProcessWire, which can reduce server resources needed to block an aggressive bot or scanner. Note however that this is optional, as WireRequestBlocker will still work without this, but it will have to load ProcessWire to do so.

Edit your .htaccess file and add the following directly above section #12 (which is right after section #11F):

# WireRequestBlocker
RewriteCond %{DOCUMENT_ROOT}/site/assets/.WireRequestBlocker/%{REMOTE_ADDR}.xt -f [OR]
RewriteCond %{DOCUMENT_ROOT}/site/assets/.WireRequestBlocker/%{REMOTE_ADDR}.xp -f
RewriteRule ^.*$ - [F,L]

If your /site/ directory has a different name, then update the /site/ above as needed.

The above will apply to the majority of servers. However, some servers are different. In my server environment (AWS load balancer or cluster) the IP address is actually passed through the HTTP X-FORWARDED-FOR header instead. If your server is the same, you’d replace the %{REMOTE_ADDR} above with %{HTTP:X-FORWARDED-FOR}, like this:

# WireRequestBlocker
RewriteCond %{DOCUMENT_ROOT}/site/assets/.WireRequestBlocker/%{HTTP:X-FORWARDED-FOR}.xt -f [OR]
RewriteCond %{DOCUMENT_ROOT}/site/assets/.WireRequestBlocker/%{HTTP:X-FORWARDED-FOR}.xp -f
RewriteRule ^.*$ - [F,L]

Finding where your server provides the user’s IP address

If you aren't sure how your server passes through the IP address, upload the following script to your server and then load it in your web browser:

<?php
// Dump all server variables as escaped, preformatted text
echo "<pre>" . htmlspecialchars(print_r($_SERVER, true)) . "</pre>";
// IMPORTANT: remove this script as soon as you are done!

In the output, if you see your IP address in REMOTE_ADDR then you should use that. If instead you see it in HTTP_X_FORWARDED_FOR or some other property, then you should use that.
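
If the full $_SERVER dump is too noisy, a narrower variation is to print only the entries that commonly carry the client IP. This is just a sketch: the function name candidateIpEntries is made up, and the list of keys is an assumption that you may need to extend for your environment.

```php
<?php
// Sketch: show only the $_SERVER entries likely to carry the client IP.
// The key list below is an assumption; add whatever your proxy or CDN sets.

function candidateIpEntries(array $server): array {
    $keys = ['REMOTE_ADDR', 'HTTP_X_FORWARDED_FOR', 'HTTP_CLIENT_IP', 'HTTP_X_REAL_IP'];
    $found = [];
    foreach ($keys as $key) {
        if (!empty($server[$key])) {
            $found[$key] = $server[$key];
        }
    }
    return $found;
}

if (!headers_sent()) {
    header('Content-Type: text/plain'); // plain text, so nothing needs HTML escaping
}
print_r(candidateIpEntries($_SERVER));
// IMPORTANT: remove this script as soon as you are done!
```

Whichever entry shows your own public IP address is the one to select in the module settings and to use in the .htaccess rules.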

Configuration

WireRequestBlocker comes with a few default configuration settings, but these are just examples. You may want to come up with your own. We'll walk through each of the settings below so that you can decide what you'd like to change.

General settings

How long to block an IP address

This setting describes how long a block should last, in seconds. It also determines how long a non-blocked IP should be watched. A value between 3600 (1 hour) and 86400 (1 day) is recommended, though in some cases you may want to use as much as 1 week (604800).

How many bad requests to block an IP?

This is the number of strikes that are allowed before an IP address is blocked. If you want to block an IP the first time it strikes, then enter 1. In my case, I like to have at least 2 strikes before I block an IP. A value of 2-3 is ideal.

Where to get IP address from

Most servers will provide the user’s IP address in the "Remote Addr" property. But in some environments, such as load balancers or clusters, the IP address may be provided in a different property, such as the "Forwarded IP". If you aren't sure what property your server provides the user’s IP address in then see the section above "Finding where your server provides the user’s IP address".

Log to Setup > Logs > wire-request-blocker?

Optionally enable a verbose log file for monitoring activity.

Blocking rules

URL strings to trigger a strike

This is a newline-separated list of strings (or groups of strings) to match in the URL or query string. When one is present in a request, the requesting IP address will either be watched or blocked, depending on how many strikes are required to block. The strings that you provide here should be text that never appears legitimately in your site’s URLs. For instance, since you are running ProcessWire, chances are that URLs like /wp-json/ aren't valid, so they might be a good thing to match.

Vulnerability scanners like to find leftover .sql files sitting on the server, so they are a good string to match. Strings like database.sql and .sql.gz, etc. are usually a good indicator of a vulnerability scanner bot. Other types of worthwhile strings to block include those containing partial SQL statements or JavaScript. Note that these are NOT case sensitive.

URL strings that block the IP immediately

These are the same as the previous setting except that they assign the maximum number of strikes to any matches, resulting in an immediate block rather than a strike.

Blocking groups

The "URL strings…" and "User-agent strings…" settings also support groups of blocking rules. These are specified in the format A=B|C|D where A is text to match before attempting to match B or C or D. For instance, with a group, we might say that if the URL contains the text /wp then we'd like to check if it contains wp-json, wp-login, wp-admin, etc., and here's how we could specify that rule in a group:

/wp=wp-json|wp-admin|wp-login|wp-content|wp-includes

The benefit here is that WireRequestBlocker doesn't have to attempt matching anything in the group unless the /wp matches first. It's a way of reducing overhead, but can also be handy for grouping related blocking rules.

Here's another example that first checks if the text .sql appears in the URL, and if so, then block it if it contains any of the items that follow (pipe | separated):

.sql=.sql.gz|.sql.tar|backup.sql|dump.sql|db.sql|database.sql
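
To make the evaluation order concrete, here is a rough sketch of how a group in the A=B|C|D format could be checked against a URL. It is not the module's actual code: the function name matchesGroup is made up, and the case-insensitive comparison reflects the URL settings only.

```php
<?php
// Hypothetical sketch of evaluating an "A=B|C|D" blocking group against a URL.
// NOT the module's actual code; matching is case-insensitive, as for URL rules.

function matchesGroup(string $rule, string $url): bool {
    // Split once on "=": left side is the test text, right side the pipe list
    [$test, $list] = array_pad(explode('=', $rule, 2), 2, '');
    // Cheap pre-check: skip the whole group unless the test text is present
    if ($test === '' || stripos($url, $test) === false) {
        return false;
    }
    foreach (explode('|', $list) as $needle) {
        if ($needle !== '' && stripos($url, $needle) !== false) {
            return true; // one of the grouped items appears in the URL
        }
    }
    return false;
}
```

For a request that doesn't contain /wp at all, only a single string search is performed, which is where the reduced overhead comes from.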

Regular expression blocking groups

A blocking group can also be specified as a PCRE regular expression using the format test=#regex#. The test part is a case-insensitive text/string that must be present in the value before attempting the regex. This ensures that the overhead of matching a regular expression is not attempted unless we know there's a reasonable chance it'll match. The #regex# part is a regular expression that uses # as the open and closing delimiters. Regex modifiers may optionally be appended after the closing #. Let's use the same two examples used above, but expressed as a regular expression instead:

/wp=#wp-(json|admin|login|content|includes)#

.sql=#(\.sql\.(tar|gz)|(backup|dump|db|database)\.sql)#i

Regular expressions are case sensitive by default, so in the 2nd example (.sql) we appended the PCRE i modifier to the regex to make it case insensitive. Please note that regular expression blocking groups are only available in WireRequestBlocker v3 or newer.
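
The two-step check for a regular expression group might look something like the sketch below. Again, this is a hypothetical illustration rather than the module's own code, and the function name matchesRegexGroup is made up.

```php
<?php
// Hypothetical sketch of evaluating a "test=#regex#" blocking group.
// NOT the module's actual code.

function matchesRegexGroup(string $rule, string $value): bool {
    // Split once on "=": left side is the test text, right side the regex
    [$test, $regex] = array_pad(explode('=', $rule, 2), 2, '');
    // Case-insensitive pre-check, so the regex only runs when it might match
    if ($test === '' || $regex === '' || stripos($value, $test) === false) {
        return false;
    }
    // The regex keeps its own case rules (case-sensitive unless "i" is appended)
    return preg_match($regex, $value) === 1;
}

$wpRule  = '/wp=#wp-(json|admin|login|content|includes)#';
$sqlRule = '.sql=#(\.sql\.(tar|gz)|(backup|dump|db|database)\.sql)#i';

var_dump(matchesRegexGroup($wpRule, '/wp-admin/'));         // matches
var_dump(matchesRegexGroup($wpRule, '/WP-LOGIN.php'));      // test passes, regex is case-sensitive
var_dump(matchesRegexGroup($sqlRule, '/files/BACKUP.SQL')); // "i" modifier makes this match
```

Note how the second call passes the case-insensitive test but fails the regex, since that pattern has no i modifier.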

User agent strings to trigger a strike

This is just like the setting above except that it attempts to match text in the user-agent string instead. This is useful for blocking bots that identify themselves but may not follow your robots.txt rules. This is a good way to filter out the SEO bots like AhrefsBot, SemrushBot, etc. Please note that this setting IS case sensitive. You can also use the blocking groups, as described in the section above.

User agent strings that block the IP immediately

These are the same as above except that they assign the maximum number of strikes, causing the IP to be blocked immediately, rather than just getting a strike.

IP addresses to block permanently

If a particular IP address is causing issues with your site, you may just want to block it permanently. In this setting you can paste in one or more IP addresses (1 per line) and they will be permanently blocked. You can remove the block on this same screen when/if needed.

IP addresses to block temporarily

This essentially adds 3 strikes (or whatever your defined strikes number is) to an IP address, blocking it immediately. The block will expire in the time configured with this module.

Email address to notify when new IP is blocked

If you want to get a notification email when an IP address is blocked, enter your email address here. This is worthwhile just to make sure things are working as you intend, and also to make sure you aren't accidentally blocking requests that aren't intended.

Whitelist rules

Cookie names to whitelist

If a cookie having a name specified here is present, then the request will not be blocked, even if it matches a blocking rule. A good use case is logged-in ProcessWire users. If you want to disable blocking for logged-in users, then you would enter ProcessWire's challenge cookie names, which are only present when a user is logged in. By default these cookie names are wire_challenge for HTTP requests or wires_challenge for HTTPS requests. However, the name may be different if you've set a different session name with the $config->sessionName or $config->sessionNameSecure setting in your /site/config.php file.
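
Conceptually, the cookie whitelist is just a check for the presence of a cookie name, regardless of its value. A minimal sketch, assuming the default cookie names mentioned above (the function name hasWhitelistedCookie is made up and this is not the module's actual code):

```php
<?php
// Hypothetical sketch of the cookie whitelist check; NOT the module's code.

function hasWhitelistedCookie(array $cookies, array $whitelist): bool {
    foreach ($whitelist as $name) {
        // Only the cookie's presence matters here, not its value
        if (array_key_exists($name, $cookies)) {
            return true;
        }
    }
    return false;
}

// Default ProcessWire challenge cookie names (HTTP and HTTPS)
$whitelist = ['wire_challenge', 'wires_challenge'];
var_dump(hasWhitelistedCookie($_COOKIE, $whitelist));
```

Keep in mind that a check like this only looks at the cookie name, so it is a convenience for legitimate users rather than a security measure.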

User agent strings to whitelist

While the user agent can be manipulated by anybody, my experience so far is that if the user agent says "Googlebot" then 4 out of 5 times, it is. Likewise for other search engine crawlers. Still, a fake Googlebot 1 out of 5 times is significant. But you don't want to risk blocking the real Googlebot if it happens to follow someone's link to a URL that is blocked. So it's a good idea to whitelist specific search engine user agents, just to play it safe. Please note this setting IS case sensitive.

If you are using a CDN then you'll also want to whitelist the CDN's user agent strings. For instance, mine is "Amazon CloudFront". The reason for this is that the CDN can duplicate user requests to the server, including for URLs that might contain something your blocking rules match. For this reason, you want to be sure you don't block your CDN, and this setting is a handy way to whitelist your CDN's user agent.

IP addresses to whitelist

If you are behind a load balancer you might want to whitelist the load balancer's IP address. Consider any other similar situations where it might be useful to whitelist an IP address.

Current and pending blocked IPs

This shows a summary of all IP addresses that are being watched or are currently blocked. For each IP address it tells you the number of strikes, the reason why it matched, when it was found, when it expires, and what the URL was when it matched. From here you can also check the box next to any IP address to remove it from the watch list or block list.

What to do if your IP address gets blocked

If while testing this module you accidentally block your own IP address, you lose access to the ProcessWire admin and thus can't go in and remove the block from your IP. Not to worry, there's another way to unblock it:

SSH or FTP to your web server and navigate to this directory: /site/assets/.WireRequestBlocker/

(Note that .WireRequestBlocker begins with a period, which makes it non web accessible).

In this directory is a list of all watched and blocked IP addresses as filenames.

If the IP filename ends with .ip then it is a watched IP address (i.e. has 1 or more strikes, but not yet blocked).
If the IP filename ends with .xt then it means the IP address is temporarily blocked.
And if it ends with .xp then it means the IP address is permanently blocked.

Chances are your IP is temporarily blocked. So if your IP address is 111.222.333.444 then you'll want to delete the file:

/site/assets/.WireRequestBlocker/111.222.333.444.xt

Once you delete that file, your IP address will no longer be blocked.
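
If you have shell access with PHP available, a small script can summarize that directory for you. The sketch below only relies on the file extensions described above; the function names are made up, and the directory path is an assumption you would adjust to your installation.

```php
<?php
// Sketch: summarize the files in the .WireRequestBlocker directory by extension.
// Function names are hypothetical; only the .ip/.xt/.xp extensions come from
// the description above.

function describeBlockFile(string $filename): ?array {
    $labels = [
        'ip' => 'watched',
        'xt' => 'temporarily blocked',
        'xp' => 'permanently blocked',
    ];
    $ext = pathinfo($filename, PATHINFO_EXTENSION);
    if (!isset($labels[$ext])) {
        return null; // not a file this sketch knows about
    }
    // The filename minus its extension is the IP address
    return ['ip' => pathinfo($filename, PATHINFO_FILENAME), 'status' => $labels[$ext]];
}

function listBlockFiles(string $dir): array {
    $rows = [];
    $files = is_dir($dir) ? scandir($dir) : [];
    foreach ($files ?: [] as $file) {
        $row = describeBlockFile($file);
        if ($row !== null) {
            $rows[] = $row;
        }
    }
    return $rows;
}

// Assumed location; adjust if your /site/ directory has a different name
foreach (listBlockFiles(__DIR__ . '/site/assets/.WireRequestBlocker') as $row) {
    echo $row['ip'] . ': ' . $row['status'] . "\n";
}
```

Deleting the matching .xt file, as described above, remains the way to lift a temporary block.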

Using request blocker in other PHP applications

While this module is configured in ProcessWire, it can be used to block matching requests from any PHP application, simply by including the RequestBlocker.php file from your other PHP application. Since the configuration is saved in a dedicated .json file, ProcessWire itself is not needed in order for RequestBlocker to do its work with your configured rules.

Let's say that ProcessWire was running off the root directory of your domain, and WordPress was running off the /blog/ URL of the domain. Edit the WordPress /blog/index.php file and add this at the top, after the opening <?php tag (and before any WordPress-specific code):

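// Resolve the path relative to this file rather than the current working directory
require_once(__DIR__ . '/../site/modules/WireRequestBlocker/RequestBlocker.php');
// Apply the blocking rules configured in ProcessWire before WordPress loads
$blocker = new \ProcessWire\RequestBlocker();
$blocker->execute();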

If a request matches one of your blocking rules, it will be blocked, before WordPress even gets the chance to load. But if the URL doesn't match a blocking rule, then WordPress will execute normally. WordPress is only used here as an example, as it could be any other PHP application as well.

Please note that if using the "Email address to notify when new IP is blocked" setting, it will not apply when using RequestBlocker from other PHP applications. This is because RequestBlocker uses ProcessWire’s WireMail interface to send email, which is not available when ProcessWire is not booted.

Blocking requests from your .htaccess file

Using your .htaccess file to block by partial URL

In some cases you may find that certain blocked URLs are hit numerous times throughout the day, but the IP addresses that hit them only do so 1-3 times, and don't hit anything else. This seems to be the case especially with some WordPress vulnerability checkers that check URLs like these below:

/wp-login.php
/wp-json/wp/v2/users/1
/wp-json/oembed/1.0/embed?url=yourdomain.com
/wp-content/some-hacked-file.php
/wp-content/plugins/some-hacked-plugin/vulnerable-file.php
…and so on…

So long as those URLs return 403s or 404s, the IPs hitting these URLs only hit your site once, twice, or maybe three times, before moving on to another site. Blocking their IPs isn't that useful because they likely won't ever be returning. Though you may find hundreds of different IPs hitting these URLs daily.

Ideally we'd prefer that those bots don't consume any resources at all from our sites, so for cases like this we might just prefer to block them directly from the .htaccess file. After section #11F in the .htaccess file, I like to add this, which basically blocks all those WordPress URLs (with a 404) before RequestBlocker or ProcessWire has to spend any time even looking at them:

RewriteCond %{REQUEST_URI} /wp-(content|json|admin|includes|login)
RewriteRule .* - [L,R=404]

The WordPress URLs are just a common example here. There may be other URLs that you find RequestBlocker storing a lot of blocked IPs for, and so using your .htaccess file is one way to stop them even earlier.

If you happen to be running WordPress alongside ProcessWire then of course you should NOT add the rules in this example to your .htaccess file.
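
Before adding a rule like this, it can be worth checking which of your own URLs the pattern would catch. The following is a rough sketch that runs the same pattern through PHP's preg_match; the function name and the sample paths are illustrative only.

```php
<?php
// Sketch: try sample paths against the pattern used in the RewriteCond above.
// The function name and sample paths are illustrative only.

function wouldBeBlocked(string $path): bool {
    // Case-sensitive, like a RewriteCond without the [NC] flag
    return preg_match('#/wp-(content|json|admin|includes|login)#', $path) === 1;
}

$samples = ['/wp-login.php', '/wp-json/wp/v2/users/1', '/about/', '/blog/wp-notes/'];
foreach ($samples as $path) {
    printf("%-26s %s\n", $path, wouldBeBlocked($path) ? '404' : 'passes through');
}
```

Swap in paths from your own site to confirm that none of your legitimate URLs would be caught by the rule.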

Using your .htaccess file to block bots by user-agent

Your .htaccess file is also a worthwhile place to block certain bots that might be ignoring your robots.txt crawl-delay or bots that just want to consume your resources without providing any benefit to your site in return. Admittedly the best way to find these is by examining your Apache access_log files directly. In our case, we found a few bots that were consuming resources and didn't have any clear benefit to our site or our users. These can be blocked from your WireRequestBlocker settings, or you can block them even earlier in the request by adding the following to the very top of your .htaccess file:

# BOT BLOCKERS
SetEnvIfNoCase User-Agent .*adscanner.* bad_bot
SetEnvIfNoCase User-Agent .*ahrefsbot.* bad_bot
SetEnvIfNoCase User-Agent .*dataforseobot.* bad_bot
SetEnvIfNoCase User-Agent .*dotbot.* bad_bot
SetEnvIfNoCase User-Agent .*dotnetdotcom.* bad_bot
SetEnvIfNoCase User-Agent .*exabot.* bad_bot
SetEnvIfNoCase User-Agent .*gigabot.* bad_bot
SetEnvIfNoCase User-Agent .*hypefactors.* bad_bot
SetEnvIfNoCase User-Agent .*mauibot.* bad_bot
SetEnvIfNoCase User-Agent .*mj12bot.* bad_bot
SetEnvIfNoCase User-Agent .*petalbot.* bad_bot
SetEnvIfNoCase User-Agent .*rogerbot.* bad_bot
SetEnvIfNoCase User-Agent .*semrush.* bad_bot
SetEnvIfNoCase User-Agent .*seokicks.* bad_bot
SetEnvIfNoCase User-Agent .*seoscanners.* bad_bot
SetEnvIfNoCase User-Agent .*serpstatbot.* bad_bot
SetEnvIfNoCase User-Agent .*sitebot.* bad_bot
SetEnvIfNoCase User-Agent .*trendiction.* bad_bot
<Limit GET POST HEAD>
  Order Allow,Deny
  Allow from all
  Deny from env=bad_bot
</Limit>

It's best not to have TOO many of these as Apache still has to do the work of matching the text in the user agent strings. But I routinely scan our access_log files to see what bots are too aggressive and then add them to my block list, and remove those that are no longer appearing in the logs.

If you google "bad bots by user agent" you should come across compiled lists of them. But don't go and add them all to your WireRequestBlocker user-agent blocks, or your .htaccess file, as the overhead of scanning for all those bots would likely make it not worthwhile. But blocking just the ones that are targeting your site is definitely worthwhile.

Please see the included README.md file for terms/conditions and additional version-specific details.