
Module: Process Link Checker


teppo

This is a beta release, so some extra caution is recommended. So far the module has been successfully tested on at least ProcessWire 2.7.2 and 3.0.18, but at least in theory it should work for 2.4/2.5 versions of ProcessWire too.
 
GitHub repo: https://github.com/teppokoivula/ProcessLinkChecker (see README.md for more techy details, settings etc.)
 
What you see is ...
 
This is a module that adds back-end tools for tracking down broken links and unnecessary redirects. That's pretty much all there is to these views right now; I'm still contemplating whether it should also provide a link text section (for SEO purposes etc.) and/or other features.
 
The magic behind the scenes
 
The admin tool (Process module) is about half of Link Checker; the other half is a PHP class called Link Crawler. This is a tool for collecting links from a ProcessWire site, analysing them and storing the outcome to custom database tables.
 
Link Crawler is intended to be triggered via a cron task, but there's also a GUI tool for running the checker. This is a slow process and can result in issues, but for smaller sites and debugging purposes the GUI method works just fine. Just be patient; the data will be there once you wait long enough :)
 
Now what?
 
For the time being I'd appreciate any comments about the way this is heading and/or whether it's useful to you at all. What would you add to make it more useful for your own use cases? I'm going to continue working on this for sure (it's been a really fun project), but wouldn't mind being pushed in the right direction early on.
 
This module is already in active use on two relatively big sites I manage. Lately I haven't had any issues with the module, but please consider this a beta release nevertheless; it hasn't been widely tested, and that alone is a reason to avoid calling it "stable" quite yet.

Screenshots

Dashboard:

link-checker-dashboard.png

List of broken links:

link-checker-broken-links.png

List of redirects:

link-checker-redirects.png

Check now tool/tab:

link-checker-check-now.png

Edited by teppo
Updated module description, status and screenshots.

Teppo, this looks fantastic, nice work! While I haven't yet been able to test it out here, I will be soon, as I have a regular need for a tool like this. It's also one of those things that comes up with clients a lot: "how do I keep track of when a link no longer works?". I've been using Google Webmaster Tools for 404 discovery in the past, but it's often hard to separate the noise from the goods there, and it's not particularly client friendly either. Regarding the cron side of this, I immediately thought of IftRunner (which itself is triggered by cron) and how this might work great as a PageAction with IftRunner. PageActions can also be executed by ListerPro and presumably other tools in the future as well.


Thanks, Ryan. Let me know how it handles once you do test it, would be interesting to know. My tests so far have been very limited in scope, so I'm fully expecting a pile of issues (and most likely a few things I've completely missed).. though of course the opposite would be cool too :)

You've given me something new to consider there, will definitely take IftRunner and PageAction part into consideration.


  • 1 year later...

I installed it on a 2.6.10 dev version. The installation was successful, but when I try to check the links I get the following messages:

2015-07-31 17:30:34	admin	    START: id!=2, has_parent!=2
2015-07-31 17:30:34	admin	    BATCH: 1/2 (pages 1-52/52)
2015-07-31 17:30:34	admin	        FOUND Page: /
2015-07-31 17:30:35	admin	            CHECKED URL: http://www.juergen-kern.at/site/templates/favicon.ico (200)

Warning: PDOStatement::execute(): MySQL server has gone away in /home/.sites/24/site1275/web/site/modules/ProcessLinkChecker/LinkCrawler.php on line 405

Warning: PDOStatement::execute(): Error reading result set's header in /home/.sites/24/site1275/web/site/modules/ProcessLinkChecker/LinkCrawler.php on line 405

Fatal error: Exception: SQLSTATE[HY000]: General error: 2006 MySQL server has gone away (in /home/.sites/24/site1275/web/wire/core/Modules.php line 2264)

#0 /home/.sites/24/site1275/web/wire/core/Modules.php(2264): PDOStatement->execute()
#1 /home/.sites/24/site1275/web/wire/core/Modules.php(2523): Modules->getModuleConfigData(Object(ProcessPageSearch))
#2 /home/.sites/24/site1275/web/wire/core/Modules.php(446): Modules->setModuleConfigData(Object(ProcessPageSearch))
#3 /home/.sites/24/site1275/web/wire/core/Modules.php(1032): Modules->initModule(Object(ProcessPageSearch), false)
#4 /home/.sites/24/site1275/web/wire/core/Modules.php(939): Modules->getModule('ProcessPageSear...')
#5 /home/.sites/24/site1275/web/wire/modules/AdminTheme/AdminThemeReno/default.php(25): Modules->get('ProcessPageSear...')
#6 /home/.sites/24/site1275/web/wire/core/admin.php(148): require('/home/.sites/24...')
#7 /home/.sites/24/site1275/web/wire/modules/AdminTheme/AdminThemeReno/controller.php(13): require('/home/.sites/24...')
#8 /home/.sites/ in /home/.sites/24/site1275/web/index.php on line 254


This error message was shown because site is in debug mode ($config->debug = true; in /site/config.php). Error has been logged. 

  • 11 months later...

Looks like I've missed some messages here. I'm currently using this on a couple of sites with no issues: ProcessWire 2.7.2 and 3.0.18, on two separate servers. It would be interesting to hear if the aforementioned issues still exist.


@SteveB: shouldn't require any tricks, but to be honest I've never used such a setup myself, so it's probably a mistake on my side. I'll take a closer look at that ASAP :)

@Juergen: README includes instructions for setting up a cron job. The gist of it is that you should make a cron job that runs the module's own Init.php file periodically.

To be honest I'm not entirely sure what you mean by a cron job module or ready.php in this context – but please let me know what I'm missing!


@Juergen

Setting up a cron job is really easy, but you have to understand some basics first.

The cron job part has nothing to do with ProcessWire. Cron is a separate program running on your server which is able to run commands at a certain time. It is either configured in your hosting admin panel (easiest, ask your hosting provider) or you can set it up yourself through the command line. You can follow this example if you're running a Linux-based server.

You need to understand that you can execute a PHP file from the command line. Teppo has provided us with such a script that will activate the link checker. The file is "/ProcessLinkChecker/Init.php". This is the one the cron job needs to run. If you are unsure what the correct path is, you can ask your hosting provider, or log in to the shell, navigate to the "ProcessLinkChecker" folder and type "pwd". That will give you the current path. It will be something like:

/srv/username/apps/appname/public/site/modules/ProcessLinkChecker/

Combine the path with your new knowledge from the tutorial and you can set it up.

p.s. If you are on Windows you need to create a "Task" in "Windows Task Scheduler".

p.s. 2 You don't have to wait for the cron job to fire to find out whether it works, since you can test the script by running:

/usr/bin/php /path/to/site/modules/ProcessLinkChecker/Init.php >/dev/null 2>&1

p.s. 3 This whole timing stuff can be pretty confusing, so use a tool like crontab.guru.
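
As a rough sketch of what the finished cron entry might look like: the snippet below assembles a crontab line that runs Init.php every night at 03:00 and prints it, ready to paste into `crontab -e`. The schedule and both paths are assumptions for illustration, not values taken from the module's README.

```shell
# Assumed locations; adjust both to match your server.
PHP_BIN="/usr/bin/php"
INIT_PHP="/path/to/site/modules/ProcessLinkChecker/Init.php"

# The five time fields are: minute, hour, day of month, month, day of week.
# "0 3 * * *" therefore means "every day at 03:00" (an assumed schedule).
CRON_LINE="0 3 * * * $PHP_BIN $INIT_PHP >/dev/null 2>&1"

# Print the line so it can be pasted into the editor opened by `crontab -e`.
echo "$CRON_LINE"
```

Since crawling is slow, a weekly run may be enough for many sites; `0 3 * * 0` in the time fields would run it on Sundays at 03:00 instead.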

p.s. 4 After proofreading this post it seems pretty hard, but believe me, after a few times you can set it up in a few minutes.


Great module thanks teppo!!

I can't edit the crontab via SSH; I'm only able to add cron jobs via the admin panel, and there I can only provide a URL, not a path. So without changing .htaccess I can't just run domain.com/site/modules/ProcessLinkChecker/Init.php, right?

But I have already set up cron jobs, so I thought about copying the contents of Init.php into an existing cron script, which should trigger it:

$linkCrawlerPath = $config->paths->siteModules . 'ProcessLinkChecker/LinkCrawler.php';
if (file_exists($linkCrawlerPath)) {
	require $linkCrawlerPath;
	$crawler = new \LinkCrawler();
	$crawler->start();
}

But then I'm getting these:

Notice: Undefined variable: wire in site/modules/ProcessLinkChecker/LinkCrawler.php on line 144
Fatal error: Call to undefined function wire() in site/modules/ProcessLinkChecker/LinkCrawler.php on line 144

Oh, and I'm running 3.0.25, which is why the backslash is there.

Any ideas, or alternative approaches? I also tried the simpler route of including Init.php in my cron script, with the same result.

EDIT: Same error (at least the top one, "Undefined variable: wire") when running from the ProcessLinkChecker admin page.


@Can: The issue you mentioned should be fixed in the latest version of LinkCrawler.php, though please let me know if it still persists. The problem was that LinkCrawler didn't have access to $wire from the global scope, but since PROCESSWIRE was already defined, it wasn't attempting to instantiate ProcessWire either.

I'm no longer entirely sure that current behaviour makes sense in this case (perhaps I should rather allow the user to pass an instance of ProcessWire to LinkCrawler when instantiating it) but at least this seems to fix the issue at hand :)


After removing the content of site/init.php and site/ready.php for now (they were throwing errors about redeclared functions), I'm getting this:

throw new Exception("Unrecognized render method");

I'm invoking LinkCrawler like this from an external script which bootstraps ProcessWire (so not from a template file; maybe that's the problem?):

if ($modules->get('ProcessLinkChecker')) {
	require $config->paths->siteModules . 'ProcessLinkChecker/LinkCrawler.php';
	$crawler = new \LinkCrawler();
	$crawler->start();
}

 


@Can: Sorry for the delay. So far I haven't been able to reproduce the issue you're seeing, which is making it quite difficult to debug. This is one of those cases where it would be tremendously useful to be able to check which values LinkCrawler gets from the Process module, what $this->config contains, what that "unrecognized" render method really is, and so on :) 

Not calling the module from a template file isn't a problem, but I'm a bit confused why it would throw the "unrecognized render method" error. Could you check what the module config page of ProcessLinkChecker lists as the render method?

This error should only happen if the render_method config setting contains something weird or if it's undefined. At this point I can only assume that either LinkCrawler doesn't have access to the ProcessLinkChecker module (it tries to get its config from there) or those config variables are somehow mishandled.

Just checking, but is the above snippet the only code in that file? I assume it's bootstrapping the same ProcessWire installation that has ProcessLinkChecker installed, right?


Hey @Can,

I just ran into some small things myself while installing and configuring this module. Since I don't have shell access to the server (yet), I created a workaround. I've created a template and page called "cronjob" so I could trigger the script from a URL (www.domainname.com/cronjobs/?key=123).

In the template.php I do a simple check on a GET variable (key) to keep other people from triggering it. From there I include:

// Skip access since the guest user is loading the script
// Perhaps you might want to look into the permission check stuff since you're bootstrapping ProcessWire
$options = array('noPermissionCheck' => true);

// Load the Module to get the className
$linkCheckerModule = $this->modules->getModule("ProcessLinkChecker", $options);

// Include Teppo's LinkCrawler
require $config->paths->siteModules . $linkCheckerModule->className() . '/LinkCrawler.php';

// Start crawling
$crawler = new LinkCrawler();
$crawler->start();

// Stop ProcessWire from executing
$this->halt();

This seems to work fine for me. I've got a lot of data.
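
To test such a page by hand, or to call it from a scheduler that accepts commands rather than URLs, a request along these lines might do. This is only a sketch: the host and key are the placeholder values from the example URL above, while the `https` scheme, the one-hour timeout and the `trigger_crawl` helper name are assumptions made for illustration.

```shell
# Placeholder values from the example URL; replace with your own page and key.
BASE_URL="https://www.domainname.com/cronjobs/"
KEY="123"
CRAWL_URL="${BASE_URL}?key=${KEY}"

# Hypothetical helper that requests the page and discards the response body.
# -f makes curl exit non-zero on HTTP errors, -sS hides the progress meter
# but still prints error messages, and --max-time caps the request at an
# assumed one hour because crawling can be slow.
trigger_crawl() {
    curl -fsS --max-time 3600 "$CRAWL_URL" >/dev/null
}

# Show which URL would be requested; call trigger_crawl to actually run it.
echo "Would request: $CRAWL_URL"
```

Keep in mind that a key passed in the query string is likely to show up in server access logs, so treat it as a light deterrent rather than real authentication.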

I still get some notices like "Array to string conversion in /site/modules/ProcessLinkChecker/LinkCrawler.php on line 335". I'll look into them tomorrow.


You mean $this->config in LinkCrawler.php? I'd say it looks quite good. I put a var_dump($this->config) on line 151 (right after $this->config has been populated), and I'm getting this in the error message after clicking "Check now" on /processwire/setup/link-checker/:


object(stdClass)#340 (16) {
  ["skipped_links"]=> array(0) { }
  ["cache_max_age"]=> string(5) "1 DAY"
  ["selector"]=> string(33) "status<8192, id!=2, has_parent!=2"
  ["http_host"]=> NULL
  ["log_level"]=> int(1)
  ["log_rotate"]=> int(0)
  ["log_on_screen"]=> bool(false)
  ["batch_size"]=> int(100)
  ["sleep_between_batches"]=> int(1)
  ["max_recursion_depth"]=> int(3)
  ["sleep_between_requests"]=> int(1)
  ["sleep_between_pages"]=> int(0)
  ["link_regex"]=> string(38) "/(?:href|src)=([\'"])([^#].*?)\g{-2}/i"
  ["skipped_links_regex"]=> NULL
  ["http_request_method"]=> string(11) "get_headers"
  ["http_user_agent"]=> string(120) "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
}

 

On 22.7.2016 at 3:39 PM, teppo said:

Could you check what the module config page of ProcessLinkChecker lists as the render method?

I don't know what you mean. Here at /processwire/module/edit?name=ProcessLinkChecker, right? I don't know what I should be looking for.

On 22.7.2016 at 3:39 PM, teppo said:

Just checking, but is the above snippet the only code in that file? I assume it's bootstrapping the same ProcessWire installation that has ProcessLinkChecker installed, right?

No, but for the tests I call exit; right afterwards, and those lines are the very first lines right after opening PHP and bootstrapping PW.

Thanks for your workaround, @arjen. I think I'll give it a try soon :)
