
Optimizing 404s in ProcessWire


wbmnfktr

To make it really short:

There is a project of mine that gets roughly 5,000 to close to x,000 requests each day to common standard folders (WP, phpMyAdmin, backups, gallery scripts, and everything else that has appeared on a 0day list in the last 10 years), on top of the regular site requests/traffic.

I will, and need to, bypass these 404s similar to what @ryan describes in this tutorial. I disabled 404 monitoring in Jumplinks a while back because the database grew by the minute and wasn't "maintainable" anymore, especially with the automatic backups running each day.

https://processwire.com/blog/posts/optimizing-404s-in-processwire/
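
For context: the approach in that post works inside ProcessWire itself, once the request has already reached the 404 handling. The same "answer junk as early and cheaply as possible" idea can also be sketched at the Apache level; the paths below are examples only, not what I actually plan to run:

# Example only: answer obvious probe URLs with a bare 404 before PHP ever boots
RedirectMatch 404 "(?i)^/(wp-login\.php|wp-admin|xmlrpc\.php|phpmyadmin)"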

Does anyone have additional tips regarding this guide from Ryan?
Something to consider?
Something that needs special attention?
Something that didn't or doesn't work anymore?
Or should I block all these requests in the CDN right away somehow?

Feedback welcome.

I will start this little project next weekend, so... no rush with feedback.
And if you ask... I actually prefer adding those block rules to the CDN somehow.


11 hours ago, bernhard said:

Did you enable all the noise blocking rules in htaccess?

Actually, I didn't even think about those until now. Some parts of it could work out pretty well, like the request URL rules.
That's a good tip!

8 hours ago, horst said:

Maybe the 7G-Firewall htaccess rules do help?

Someone told me about it a while back, and if I remember correctly it doesn't work that well (if at all) with ProCache because PHP is involved.
Still... will look into it again.


31 minutes ago, wbmnfktr said:

Someone told me about it a while back, and if I remember correctly it doesn't work that well (if at all) with ProCache because PHP is involved.
Still... will look into it again.

As far as I can see, it uses no PHP, only .htaccess rules for Apache.

And there are others around who use it, like @FireWire and @nbcommunication. Maybe we can ask them about their experiences?

On 1/29/2022 at 12:04 AM, FireWire said:

7G Firewall rules are added to the PW .htaccess file to block a ton of bots and malicious automated page visits. Highly recommended.

https://processwire.com/talk/topic/24230-7g-firewall-tweaks-required/ 
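
If you've never looked inside it: 7G is essentially a long stack of mod_rewrite conditions matched against the query string, request URI, user agent and so on, each ending in a Forbidden response. Roughly this shape (an illustrative fragment only, not the actual 7G rules):

<IfModule mod_rewrite.c>
  RewriteEngine On
  # Illustrative only: the real 7G patterns are far more extensive
  RewriteCond %{QUERY_STRING} (base64_encode|eval\(|GLOBALS|_REQUEST) [NC]
  RewriteRule .* - [F,L]
  RewriteCond %{HTTP_USER_AGENT} (acunetix|masscan|sqlmap) [NC]
  RewriteRule .* - [F,L]
</IfModule>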


10 hours ago, wbmnfktr said:

Actually, I didn't even think about those until now. Some parts of it could work out pretty well, like the request URL rules.
That's a good tip!

Someone told me about it a while back, and if I remember correctly it doesn't work that well (if at all) with ProCache because PHP is involved.
Still... will look into it again.

So @horst is correct about ProCache. ProCache renders HTML files to disk and uses some pretty clever directives in the .htaccess file to detect whether a URL has a corresponding page on disk. If one exists, that HTML file is returned and the PHP interpreter never boots. We're currently running 7G and ProCache on our company's site. Unless you are configuring specific caching directives for HTML files in your .htaccess, you shouldn't see any problems.
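
To illustrate the general idea (this is a generic sketch of the static-cache pattern, not ProCache's actual directives, and the cache path below is made up):

# Generic "serve the cached HTML file if it exists" pattern (NOT ProCache's real rules)
<IfModule mod_rewrite.c>
  RewriteEngine On
  RewriteCond %{REQUEST_METHOD} GET
  RewriteCond %{QUERY_STRING} ^$
  # Hypothetical cache location; ProCache uses its own path and conditions
  RewriteCond %{DOCUMENT_ROOT}/site/assets/cache/static%{REQUEST_URI}/index.html -f
  RewriteRule ^ /site/assets/cache/static%{REQUEST_URI}/index.html [L]
</IfModule>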

If you plan to use 7G or any bot blocking, it should be at the very top of your .htaccess so that its directives are parsed before the rest, which are there to serve legitimate traffic. The sooner bots and malicious requests are deflected, the less impact they have on your server.
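
So the rough layout of the file ends up something like this (just a sketch of the section order, not literal rules; where exactly ProCache puts its own section is handled by the module):

# Rough section order in .htaccess, top to bottom:
# 1. 7G Firewall / bad-bot and noise blocking
# 2. Your own custom RedirectMatch and user-agent blocks
# 3. ProCache directives (if installed)
# 4. Default ProcessWire rules (security blocks, RewriteRule to index.php)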

While I've used 7G on many websites in production with no problems, it's important to test, as noted by @nbcommunication in the comment above. He mentioned URLs containing "null". I've never seen any issues pop up with that, since "null" isn't too common in English URLs, but I still keep the directive unchanged because it's there to detect malicious GET strings and similar undesirable things.

7G blocks a good deal of WP-related requests; however, I wrote some additional directives to address requests that weren't caught. I'll share some of the extra customizations I've made that further filtered out traffic based on our 404s and web crawlers we don't care for. If you're willing to get your hands dirty and write some custom .htaccess directives, you can dial it in even more.


Back with more! Prepare for incoming wall of text...

I mentioned adding custom directives to our .htaccess file and wanted to share some more detail on that, as well as some other tips. I was reviewing our 404s as a matter of maintenance, so to speak, to make sure we had redirects in place where necessary. While doing that, I found a lot (a lot) of hits that were bogus: clearly bots, and even web crawlers for engines we have no interest in being listed on. In just 48 hours we had 700 total 404s, and I imagine on some websites that number could be even higher.

By analyzing that log and writing custom directives, I was able to take the 700 404s logged by ProcessWire down to 200 that are "legitimate", in that they are traffic that should be redirected to a proper destination page. I'm sharing my additional directives here as an example. Again, ANY bot/security directives should be at the very top of your .htaccess file. As always, test, test, test, and modify for your use case.

# Declare this at the top of your .htaccess file and remove or comment out all other instances of this directive elsewhere
RewriteEngine On

# Block known bad URLs
# Directories including sub-directories
RedirectMatch 404 "\/(wp-includes|wp-admin|wp-content|wordpress|wp|xxxss|cms|ALFA_DATA|functionRouter|rss|feed|feeds|TKVNP|QXXLZ|data\/admin)"

# Top level directories only - There are no assets served from these directories in root, only from /site/assets & /site/templates
RedirectMatch 404 "^/(js|scripts|css|styles|img|images|e|video|media|shwtv|assets|files|123|tvshowbiz)\/"

# Explicit file matching
RedirectMatch 404 "(1index|s_e|s_ne|media-admin|xmlrpc|trafficbot|FileZilla|app-ads|beence|defau1t|legion|system_log|olux|doc)\.(php|xml|life|txt)$"

# Additional filetypes & extensions
RedirectMatch 404 "(\.bak|inc\.)"

# Additional User Agent blocking not present in 7G Firewall
<IfModule mod_rewrite.c>
  # Chinese crawlers that cause significant traffic to bad URLs
  RewriteCond %{HTTP_USER_AGENT} Mb2345Browser|LieBaoFast|zh-CN|MicroMessenger|zh_CN|Kinza|Datanyze|serpstatbot|spaziodati|OPPO\sA33|AspiegelBot|aspiegel|PetalBot [NC]

  RewriteRule .* - [F,L]
</IfModule>

Details on this additional config:

  • It blocks some WP requests that get past 7G
  • My added directives respond with a 404, which tells the bot the resource flat out doesn't exist, rather than a 403 Forbidden, which could indicate it may exist. I read somewhere that a 404 is more likely to get cached as a URL not worth revisiting (wish I could remember the source; it's not a major issue).
  • Blocks a lot of very specific URLs/files we were seeing
  • Blocks Chinese search engine bots, because we don't operate in China. These amounted to a lot of traffic.
  • Blocks common dev file patterns like .bak and .inc.* which aren't protected by default. Obviously you want to eliminate .bak files altogether in production, but this is an added safety fallback.
  • I have not seen this cause any issues in the admin. Also consider whether these directives could cause problems if your site's URLs are in another language.
  • Customize by reviewing your logs

Additional measures
7G and the directives I created provide a healthy amount of prevention against malicious traffic. Another resource I use is a Bad Bot gist that blocks numerous crawlers that add traffic to your site but may or may not generate 400- or 500-level HTTP statuses. This expands on 7G's basic list.
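
If you haven't used that gist before, the mechanism is simple: each unwanted crawler is flagged with SetEnvIfNoCase, and anything carrying that flag is denied. Roughly like this (the bot names below are made up; the real list is much longer):

<IfModule mod_setenvif.c>
  # Flag unwanted crawlers by user agent (example names only)
  SetEnvIfNoCase User-Agent "ExampleScraperBot" bad_bot
  SetEnvIfNoCase User-Agent "AnotherJunkCrawler" bad_bot
</IfModule>

# Apache 2.4+: deny anything flagged above
<IfModule mod_authz_core.c>
  <RequireAll>
    Require all granted
    Require not env bad_bot
  </RequireAll>
</IfModule>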

Bad Bot recommendations:

  • Comment out: SetEnvIfNoCase User-Agent "^AdsBot-Google.*" bad_bot Reason: there's not really a good reason to block a specific Google bot.
  • If you make cURL requests to your server, comment this line out: SetEnvIfNoCase User-Agent "^Curl.*" bad_bot Reason: it will block all cURL requests to your server, including those made by your own code. Be sure you don't need cURL available if you leave this active. It's included in the list to stop some types of website scrapers. If you want to leave it active and still need cURL, consider changing your user agent.
  • Comment out: SetEnvIfNoCase User-Agent "^Mediapartners-Google.*" bad_bot Reason: again, it's not necessary to block Google's bots, and it might even be a bad idea for SEO or exposure (only they know, right?).

Testing
There's no such thing as too much testing. These directives are powerful and, while well written, may have edge cases (like the 'null' example mentioned previously). There's no replacement for manual testing; specifically, it would be a good idea to test any marketing UTMs or URLs with GET strings you may have out there, just in case. For automated testing I use broken-link-checker, which can be run from the terminal or as a JS module. I prefer this method to using some random site-scanning service. It detects both 404s and 403s by requesting every link on your page and checking the response, which is useful for making sure your existing URLs haven't been affected by your .htaccess directives.

broken-link-checker recommendations:

  • Consider rate limiting your requests using the --requests flag, which sets the number of concurrent requests. If you don't, you could run into rate limits that your managed hosting company, your CDN, or you yourself (if you're like me) have built into the server. This terminal app runs fast, so if you have a lot of links or pages those requests can stack up quickly.
  • Consider using the -e flag, at least initially while testing your directives. It excludes external URLs, which helps your test complete faster and prevents false positives from broken external links (which you can handle separately).
  • Consider using the -g flag, which switches the requests to GET, which is what browsers do.
  • Shortcut: just copy and paste my command: blc https://www.yoursite.com -roegv --requests 5

If you have access to your Apache access log via a bash/terminal session, you may want to watch that file for new 404/403 entries for a little while. You can do this by navigating to the directory containing your access log and running the following command (switch out the name of your log as needed): tail -f apache.access.log | grep "404 " You may also want to check for 403s by changing the HTTP status in that command.

"This seems excessive"
I think this is good for every site, and once you get it dialed in to your needs it can be replicated to other sites. There's no downside to increasing the security and performance of your hosting server. Consider that any undesirable traffic you block frees up resources for good traffic and, of course, reduces your attack surface. If you need to think about scalability, this becomes even more important. The company I work for is looking to expand into 2 additional regions and I'd prefer my server to be ready for it! In high-traffic circumstances, blocking this traffic may save you from needing to "throw money at the problem" by upgrading server specs when your server starts running slow. Outside of that, it's just cool knowing that you have a deeper understanding of how this works and that you've expanded your developer expertise further.

This isn't meant to be an exhaustive guide, but I hope I've helped some people pick up some extra knowledge and saved everyone a few hours on Google looking this up. If I've missed anything or presented inaccurate/incomplete information, please let me know and I will update this comment to make it better.


Sorry for the delay... I was outside (as in outdoors, enjoying the offline time) and took a while off from the desk.

There is so much great feedback and yes... maybe I was wrong about the 7G firewall solution. Still can't find what I was thinking about.

Nonetheless... the task was postponed a bit, so I can dig a little deeper into all your comments and maybe give this a try.

Thank you so much @horst and @FireWire!


18 hours ago, wbmnfktr said:

Sorry for the delay... I was outside (as in outdoors, enjoying the offline time) and took a while off from the desk.

What the hell are "outdoors" and "offline time"? I've never heard of this.

