teppo Posted June 29, 2014 (edited)

This is a beta release, so some extra caution is recommended. So far the module has been successfully tested on at least ProcessWire 2.7.2 and 3.0.18, but at least in theory it should work for 2.4/2.5 versions of ProcessWire too.

GitHub repo: https://github.com/teppokoivula/ProcessLinkChecker (see README.md for more techy details, settings etc.)

What you see is ...

This is a module that adds back-end tools for tracking down broken links and unnecessary redirects. That's pretty much all there is to these views right now; I'm still contemplating whether it should also provide a link text section (for SEO purposes etc.) and/or other features.

The magic behind the scenes

The admin tool (Process module) is about half of Link Checker; the other half is a PHP class called Link Crawler. This is a tool for collecting links from a ProcessWire site, analysing them and storing the outcome in custom database tables.

Link Crawler is intended to be triggered via a cron task, but there's also a GUI tool for running the checker. This is a slow process and can result in issues, but for smaller sites and debugging purposes the GUI method works just fine. Just be patient; the data will be there once you wait long enough.

Now what?

For the time being I'd appreciate any comments about the way this is heading and/or whether it's useful to you at all. What would you add to make it more useful for your own use cases? I'm going to continue working on this for sure (it's been a really fun project), but wouldn't mind being pushed in the right direction early on.

This module is already in active use on two relatively big sites I manage. Lately I haven't had any issues with the module, but please consider this a beta release nevertheless; it hasn't been widely tested, and that alone is a reason to avoid calling it "stable" quite yet.

Screenshots (images not included): Dashboard; List of broken links; List of redirects; Check now tool/tab.

Edited July 9, 2016 by teppo: Updated module description, status and screenshots.
ryan Posted June 29, 2014
Teppo this looks fantastic, nice work! While I haven't yet been able to test it out here I will be soon, as I have a regular need for a tool like this. It's also one of those things that come up with clients a lot: "how do I keep track of when a link no longer works?". I've been using Google Webmaster tools for 404 discovery in the past, but it's often hard to separate the noise from the goods there, and it's not particularly client friendly either. Regarding the cron side of this, I immediately thought of IftRunner (which itself is triggered by cron) and how this might work great as a PageAction with IftRunner. PageActions can also be executed by ListerPro and presumably other tools in the future as well.
teppo (Author) Posted June 29, 2014
Thanks, Ryan. Let me know how it handles once you do test it, would be interesting to know. My tests so far have been very limited in scope, so I'm fully expecting a pile of issues (and most likely a few things I've completely missed), though of course the opposite would be cool too. You've given me something new to consider there; I will definitely take the IftRunner and PageAction part into consideration.
renobird Posted June 30, 2014
Teppo, this looks awesome!
fmgujju Posted June 30, 2014
I use the Google Chrome extension https://github.com/ocodia/Check-My-Links/, which does the same thing you are doing. It's good to have it inside the PW admin panel.
Peter Knight Posted July 31, 2015
Is anyone using this on 2.6.8?
Juergen Posted July 31, 2015
I have installed it on a 2.6.10 dev version. The installation process was successful, but when I try to check the links I get the following messages:

    2015-07-31 17:30:34 admin START: id!=2, has_parent!=2
    2015-07-31 17:30:34 admin BATCH: 1/2 (pages 1-52/52)
    2015-07-31 17:30:34 admin FOUND Page: /
    2015-07-31 17:30:35 admin CHECKED URL: http://www.juergen-kern.at/site/templates/favicon.ico (200)

    Warning: PDOStatement::execute(): MySQL server has gone away in /home/.sites/24/site1275/web/site/modules/ProcessLinkChecker/LinkCrawler.php on line 405
    Warning: PDOStatement::execute(): Error reading result set's header in /home/.sites/24/site1275/web/site/modules/ProcessLinkChecker/LinkCrawler.php on line 405
    Fatal error: Exception: SQLSTATE[HY000]: General error: 2006 MySQL server has gone away (in /home/.sites/24/site1275/web/wire/core/Modules.php line 2264)
    #0 /home/.sites/24/site1275/web/wire/core/Modules.php(2264): PDOStatement->execute()
    #1 /home/.sites/24/site1275/web/wire/core/Modules.php(2523): Modules->getModuleConfigData(Object(ProcessPageSearch))
    #2 /home/.sites/24/site1275/web/wire/core/Modules.php(446): Modules->setModuleConfigData(Object(ProcessPageSearch))
    #3 /home/.sites/24/site1275/web/wire/core/Modules.php(1032): Modules->initModule(Object(ProcessPageSearch), false)
    #4 /home/.sites/24/site1275/web/wire/core/Modules.php(939): Modules->getModule('ProcessPageSear...')
    #5 /home/.sites/24/site1275/web/wire/modules/AdminTheme/AdminThemeReno/default.php(25): Modules->get('ProcessPageSear...')
    #6 /home/.sites/24/site1275/web/wire/core/admin.php(148): require('/home/.sites/24...')
    #7 /home/.sites/24/site1275/web/wire/modules/AdminTheme/AdminThemeReno/controller.php(13): require('/home/.sites/24...')
    #8 /home/.sites/ in /home/.sites/24/site1275/web/index.php on line 254

    Error: Exception: SQLSTATE[HY000]: General error: 2006 MySQL server has gone away (in /home/.sites/24/site1275/web/wire/core/Modules.php line 2264)
    #0 /home/.sites/24/site1275/web/wire/core/Modules.php(2264): PDOStatement->execute()
    #1 /home/.sites/24/site1275/web/wire/core/Modules.php(2523): Modules->getModuleConfigData(Object(ProcessPageSearch))
    #2 /home/.sites/24/site1275/web/wire/core/Modules.php(446): Modules->setModuleConfigData(Object(ProcessPageSearch))
    #3 /home/.sites/24/site1275/web/wire/core/Modules.php(1032): Modules->initModule(Object(ProcessPageSearch), false)
    #4 /home/.sites/24/site1275/web/wire/core/Modules.php(939): Modules->getModule('ProcessPageSear...')
    #5 /home/.sites/24/site1275/web/wire/modules/AdminTheme/AdminThemeReno/default.php(25): Modules->get('ProcessPageSear...')
    #6 /home/.sites/24/site1275/web/wire/core/admin.php(148): require('/home/.sites/24...')
    #7 /home/.sites/24/site1275/web/wire/modules/AdminTheme/AdminThemeReno/controller.php(13): require('/home/.sites/24...')
    #8 /home/.sites/

This error message was shown because site is in debug mode ($config->debug = true; in /site/config.php). Error has been logged.
Peter Knight Posted July 31, 2015
Hangs for me too on a test server. Great Module though. Love the ability to run the check directly in the Admin.
teppo (Author) Posted July 8, 2016
Looks like I've missed some messages here. I'm currently using this on a couple of sites with no issues: ProcessWire 2.7.2 and 3.0.18, on two separate servers. It would be interesting to hear if the aforementioned issues still exist.
Juergen Posted July 9, 2016
Hello teppo, I have re-installed this module on a 3.25 dev version today and it works. I don't get any error messages.
SteveB Posted July 9, 2016
Is there a trick to making it crawl when PW is installed in a subdirectory?
Juergen Posted July 15, 2016
Can someone give me example code showing how to initialize this module with a cron job? Do I need to create a cron job module or can I use ready.php? Thanks for your hints!
teppo (Author) Posted July 15, 2016
@SteveB: shouldn't require any tricks, but to be honest I've never used such a setup myself, so it's probably a mistake on my side. I'll take a closer look at that ASAP.

@Juergen: README includes instructions for setting up a cron job. The gist of it is that you should make a cron job that runs the module's own Init.php file periodically. To be honest I'm not entirely sure what you mean by a cron job module or ready.php in this context, but please let me know what I'm missing!
arjen Posted July 16, 2016
@Juergen Setting up a cronjob is really easy, but you have to understand some basics first. The cronjob part has nothing to do with ProcessWire. It is a separate program running on your server which is able to run commands at a certain time. It is either configured in your hosting admin panel (easiest, ask your hosting provider) or you can set it up yourself through the command line. You can follow this example if you're running a Linux based server.

You need to understand that you can execute a PHP file from the command line. Teppo has provided us with such a script that will activate the link checker. The file is "/ProcessLinkChecker/Init.php". This is the one the cronjob needs to run. If you are unsure what the correct path is, you can ask your hosting provider, or log in to the shell, navigate to the "ProcessLinkChecker" folder and type "pwd". That will give you the current path. It will be something like:

    /srv/username/apps/appname/public/site/modules/ProcessLinkChecker/

Combine the path with your new knowledge from the tutorial and you can set it up.

p.s. If you are on Windows you need to create a "Task" in "Windows Task Scheduler".

p.s. 2 You don't have to wait for the cronjob to find out whether it works, since you can test the script by running:

    /usr/bin/php /path/to/site/modules/ProcessLinkChecker/Init.php >/dev/null 2>&1

p.s. 3 This whole timing stuff can be pretty confusing, so use a tool like crontab.guru.

p.s. 4 After proofreading this post it now seems pretty hard, but believe me, after a few times you can set it up in a few minutes.
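A minimal sketch of the cron setup described in the post above. It only prints a crontab line for review and installs nothing. The PHP binary path, the module path and the "daily at 02:30" schedule are placeholder assumptions, not values from the module's README; substitute the path that "pwd" reports on your own server.

```shell
#!/bin/sh
# Sketch only: all three values below are placeholder assumptions.
PHP_BIN="/usr/bin/php"
INIT_SCRIPT="/path/to/site/modules/ProcessLinkChecker/Init.php"
SCHEDULE="30 2 * * *"   # minute hour day-of-month month day-of-week

# Build the crontab line; output is discarded so cron does not mail it.
CRON_LINE="$SCHEDULE $PHP_BIN $INIT_SCRIPT >/dev/null 2>&1"

# Print the line for review. To install it, append it to the current crontab:
#   (crontab -l 2>/dev/null; printf '%s\n' "$CRON_LINE") | crontab -
printf '%s\n' "$CRON_LINE"
```

While testing, it may help to drop the ">/dev/null 2>&1" part so that any PHP errors stay visible.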
Ableson Posted July 17, 2016
If you're looking for a simpler solution, you might consider one of the cron services which will load a specific URL at a given time. For example, I've used this service: https://www.easycron.com .
Can Posted July 18, 2016
Great module, thanks teppo!!

I can't edit the crontab via SSH; I'm only able to add cron jobs via the admin panel, and there I can only provide a URL, not a path. So without changing .htaccess I can't just run domain.com/site/modules/ProcessLinkChecker/Init.php, right?

But I have already set up cron jobs, so I thought about copying the contents of Init.php into an existing cron script, which should trigger it:

    $linkCrawlerPath = $config->paths->siteModules . 'ProcessLinkChecker/LinkCrawler.php';
    if (file_exists($linkCrawlerPath)) {
        require $linkCrawlerPath;
        $crawler = new \LinkCrawler();
        $crawler->start();
    }

But then I'm getting these:

    Notice: Undefined variable: wire in site/modules/ProcessLinkChecker/LinkCrawler.php on line 144
    Fatal error: Call to undefined function wire() in site/modules/ProcessLinkChecker/LinkCrawler.php on line 144

I'm running 3.0.25, hence the backslash. Any ideas? Or alternative approaches? I also tried the easier route of including Init.php in my cron script, with the same result.

EDIT: Same error (at least the top one, "undefined variable wire") when running from the ProcessLinkChecker admin page.
teppo (Author) Posted July 19, 2016
@Can: Thanks, I'll take a closer look at this ASAP.
teppo (Author) Posted July 19, 2016
@Can: The issue you mentioned should be fixed in the latest version of LinkCrawler.php, though please let me know if it still persists. The problem was that LinkCrawler didn't have access to $wire from the global scope, but since PROCESSWIRE was already defined, it wasn't attempting to instantiate ProcessWire either. I'm no longer entirely sure that the current behaviour makes sense in this case (perhaps I should rather allow the user to pass an instance of ProcessWire to LinkCrawler when instantiating it), but at least this seems to fix the issue at hand.
Can Posted July 19, 2016
After removing the content from site/init.php and site/ready.php for now (they were throwing errors about redeclared functions), I'm getting this:

    throw new Exception("Unrecognized render method");

I'm invoking LinkCrawler like this from an external script which bootstraps ProcessWire (so not within a template file, maybe that's the problem?):

    if ($modules->get('ProcessLinkChecker')) {
        require $config->paths->siteModules . 'ProcessLinkChecker/LinkCrawler.php';
        $crawler = new \LinkCrawler();
        $crawler->start();
    }
teppo (Author) Posted July 22, 2016
@Can: Sorry for the delay. So far I haven't been able to reproduce the issue you're seeing, which is making it quite difficult to debug. This is one of those cases where it would be tremendously useful to be able to check which values LinkCrawler gets from the Process module, what $this->config contains, what that "unrecognized" render method really is, and so on.

Not calling the module from a template file isn't a problem, but I'm a bit confused why it would throw the "unrecognized render method" error. Could you check what the module config page of ProcessLinkChecker lists as the render method? This error should only happen if the render_method config setting contains something weird or if it's undefined.

At this point I can only assume that either LinkCrawler doesn't have access to the ProcessLinkChecker module (it tries to get its config from there) or those config variables are somehow mishandled. Just checking, but is the above snippet the only code in that file? I assume it's bootstrapping the same ProcessWire installation that has ProcessLinkChecker installed, right?
arjen Posted July 27, 2016
Hey @Can, I just ran into some small things myself installing and configuring this module. Since I don't have shell access to the server (yet) I created a workaround. I've created a template and page called "cronjob" so I could trigger the script from an url (www.domainname.com/cronjobs/?key=123). In the template.php I do a simple check on a get variable (key) to prevent people from accessing it on purpose. From there I include:

    // Skip access since the guest user is loading the script
    // Perhaps you might want to look into the permission check stuff since you're bootstrapping ProcessWire
    $options = array('noPermissionCheck' => true);

    // Load the Module to get the className
    $linkCheckerModule = $this->modules->getModule("ProcessLinkChecker", $options);

    // Include Teppo's LinkCrawler
    require $config->paths->siteModules . $linkCheckerModule->className() . '/LinkCrawler.php';

    // Start crawling
    $crawler = new LinkCrawler();
    $crawler->start();

    // Stop ProcessWire from executing
    $this->halt();

This seems to work fine for me. I've got a lot of data. I still get some notices like "Array to string conversion in */site/modules/ProcessLinkChecker/LinkCrawler.php on line 335*". I'll look into them tomorrow.
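For anyone combining the URL workaround above with a cron job of their own (rather than a hosted cron service), a rough sketch of the request side could look like this. The host, path and key are the placeholders from the example URL in the post; the https scheme, the one-hour timeout and the RUN switch are assumptions added for illustration. By default the script only prints the curl command it would run.

```shell
#!/bin/sh
# Sketch only: host, path and key are placeholders; the https scheme is assumed.
BASE_URL="https://www.domainname.com/cronjobs/"
KEY="123"

# Assemble the trigger URL. The key is not URL-encoded here, so keep it
# to letters and digits.
build_trigger_url() {
    printf '%s?key=%s' "$1" "$2"
}

TRIGGER_URL=$(build_trigger_url "$BASE_URL" "$KEY")

# Dry run by default; set RUN=1 to actually send the request.
# -f: fail on HTTP errors, -sS: quiet but still report errors,
# --max-time: give up after an hour (arbitrary, crawls are slow).
if [ "${RUN:-0}" = "1" ]; then
    curl -fsS --max-time 3600 "$TRIGGER_URL" >/dev/null
else
    printf 'curl -fsS --max-time 3600 "%s"\n' "$TRIGGER_URL"
fi
```

Since the key travels in the query string, it will end up in web server access logs, so treat it as a light deterrent rather than real access control.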
Can Posted August 2, 2016
You mean $this->config in LinkCrawler.php? I would say it looks quite good. I put a var_dump($this->config) on line 151 (right after $this->config has been populated) and I'm getting this in the error message after clicking on "check now" on /processwire/setup/link-checker/:

    object(stdClass)#340 (16) {
      ["skipped_links"]=> array(0) { }
      ["cache_max_age"]=> string(5) "1 DAY"
      ["selector"]=> string(33) "status<8192, id!=2, has_parent!=2"
      ["http_host"]=> NULL
      ["log_level"]=> int(1)
      ["log_rotate"]=> int(0)
      ["log_on_screen"]=> bool(false)
      ["batch_size"]=> int(100)
      ["sleep_between_batches"]=> int(1)
      ["max_recursion_depth"]=> int(3)
      ["sleep_between_requests"]=> int(1)
      ["sleep_between_pages"]=> int(0)
      ["link_regex"]=> string(38) "/(?:href|src)=([\'"])([^#].*?)\g{-2}/i"
      ["skipped_links_regex"]=> NULL
      ["http_request_method"]=> string(11) "get_headers"
      ["http_user_agent"]=> string(120) "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
    }

On 22.7.2016 at 3:39 PM, teppo said: Could you check what the module config page of ProcessLinkChecker lists as the render method?

I don't know what you mean. Here, at /processwire/module/edit?name=ProcessLinkChecker, right? I don't know what I have to look for.

On 22.7.2016 at 3:39 PM, teppo said: Just checking, but is the above snippet the only code in that file? I assume it's bootstrapping the same ProcessWire installation that has ProcessLinkChecker installed, right?

No, but for the tests I exit; right afterwards, and those lines are the very first lines right after opening PHP and bootstrapping PW.

Thanks for your workaround @arjen, I think I'll give it a try soon.
lpa Posted June 28
I tried this module and it crawls the links. The status of links is updated and the database contains the links. But the GUI does not show the lists of links and the menu tabs or "broken". PW version 3.0.234 and PHP 8.1. Is this module still maintained, or are there better alternatives?

EDIT: The problem was this. The console shows this jQuery error:

    JqueryCore.js?v=1.12.4:1 Uncaught Error: Syntax error, unrecognized expression: a[href^=#link-checker]:not(:first):not(a[href^=#link-checker-check-now])
        at Sizzle.error (JqueryCore.js?v=1.12.4:1:18926)
        at Sizzle.tokenize (JqueryCore.js?v=1.12.4:1:28664)
        at Sizzle.select (JqueryCore.js?v=1.12.4:1:34917)
        at Function.Sizzle (JqueryCore.js?v=1.12.4:1:11015)
        at a.find (jquery-migrate-quiet…in.js?sblspu:2:3686)
        at jQuery.fn.init.find (JqueryCore.js?v=1.12.4:1:38739)
        at a.fn.find (jquery-migrate-quiet…in.js?sblspu:2:8931)
        at jQuery.fn.init (JqueryCore.js?v=1.12.4:1:40240)
        at new a.fn.init (jquery-migrate-quiet…in.js?sblspu:2:3137)
        at jQuery (JqueryCore.js?v=1.12.4:1:663)

This was fixed by putting quotes around the anchors on line 121 in ProcessLinkChecker.js:

    $('a[href^="#link-checker"]:not(:first):not(a[href^="#link-checker-check-now"])').on('click', function() {

In addition to the above, there are these warnings:

    Warning: Undefined variable $wire in www/site/assets/cache/FileCompiler/site/modules/ProcessLinkChecker/LinkCrawler.php on line 136
    Warning: Undefined variable $wire in www/site/assets/cache/FileCompiler/site/modules/ProcessLinkChecker/LinkCrawler.php on line 143
    Warning: Array to string conversion in www/site/assets/cache/FileCompiler/site/modules/ProcessLinkChecker/LinkCrawler.php on line 335
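The quoting fix described above could also be applied mechanically. The following is only a sketch: the sed pattern is an assumption that covers just the href^=#... form seen in the error message, and the function name is made up for illustration. It reads from standard input and writes to standard output, so nothing is modified in place.

```shell
#!/bin/sh
# Sketch only: wraps unquoted "#..." values of href^= attribute selectors
# in double quotes. Other selector forms are left untouched.
quote_hash_selectors() {
    sed -E 's/\[href\^=(#[A-Za-z0-9_-]+)\]/[href^="\1"]/g'
}

# Demo on the selector from the error message.
printf '%s\n' 'a[href^=#link-checker]:not(:first):not(a[href^=#link-checker-check-now])' \
    | quote_hash_selectors
```

To preview the effect on the real file without changing it, something like "quote_hash_selectors < ProcessLinkChecker.js | diff ProcessLinkChecker.js -" should show only the affected lines.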
teppo (Author) Posted June 28
8 hours ago, lpa said: I tried this module and it crawls the links. The status of links is updated and the database contains the links. But the GUI does not show the lists of links and the menu tabs or "broken". PW version 3.0.234 and PHP 8.1. Is this module still maintained, or are there better alternatives?

I've not touched this module in years. I'm not surprised that the GUI is a little wonky. It was never tested on the UIKit admin theme, let alone recent jQuery versions. That being said, thanks for testing and identifying the issue; it looks like an easy fix.

As for alternatives, there are of course third party tools (at least for public content), and there's also Verify Links from Robin. I can't say for sure how much time and effort I'll be able to put into this module, so I'd suggest checking Robin's module out.