teppo

Module: Process Link Checker


This is a beta release, so some extra caution is recommended. So far the module has been successfully tested on ProcessWire 2.7.2 and 3.0.18, and in theory it should work on the 2.4/2.5 versions of ProcessWire too.
 
GitHub repo: https://github.com/teppokoivula/ProcessLinkChecker (see README.md for more techy details, settings etc.)
 
What you see is ...
 
This is a module that adds back-end tools for tracking down broken links and unnecessary redirects. That's pretty much all there is to these views right now; I'm still contemplating whether it should also provide a link text section (for SEO purposes etc.) and/or other features.
 
The magic behind the scenes
 
The admin tool (Process module) is about half of Link Checker; the other half is a PHP class called Link Crawler. This is a tool for collecting links from a ProcessWire site, analysing them and storing the results in custom database tables.
 
Link Crawler is intended to be triggered via a cron task, but there's also a GUI tool for running the checker. This is a slow process and can result in issues, but for smaller sites and debugging purposes the GUI method works just fine. Just be patient; the data will be there once you wait long enough :)
 
Now what?
 
For the time being I'd appreciate any comments about the way this is heading and/or whether it's useful to you at all. What would you add to make it more useful for your own use cases? I'm going to continue working on this for sure (it's been a really fun project), but wouldn't mind being pushed in the right direction early on.
 
This module is already in active use on two relatively big sites I manage. Lately I haven't had any issues with the module, but please consider this a beta release nevertheless; it hasn't been widely tested, and that alone is a reason to avoid calling it "stable" quite yet.

Screenshots

Dashboard:

link-checker-dashboard.png

List of broken links:

link-checker-broken-links.png

List of redirects:

link-checker-redirects.png

Check now tool/tab:

link-checker-check-now.png

Edited by teppo
Updated module description, status and screenshots.

Teppo, this looks fantastic, nice work! While I haven't yet been able to test it out here, I will be soon, as I have a regular need for a tool like this. It's also one of those things that comes up with clients a lot: "how do I keep track of when a link no longer works?" I've been using Google Webmaster Tools for 404 discovery in the past, but it's often hard to separate the noise from the goods there, and it's not particularly client-friendly either.

Regarding the cron side of this, I immediately thought of IftRunner (which itself is triggered by cron) and how this might work great as a PageAction with IftRunner. PageActions can also be executed by ListerPro and presumably other tools in the future as well.


Thanks, Ryan. Let me know how it performs once you do test it; it would be interesting to know. My tests so far have been very limited in scope, so I'm fully expecting a pile of issues (and most likely a few things I've completely missed), though of course the opposite would be cool too :)

You've given me something new to consider there; I will definitely take the IftRunner and PageAction part into consideration.


I have installed it on a 2.6.10 dev version. The installation process was successful, but when I try to check the links I get the following messages:

2015-07-31 17:30:34	admin	    START: id!=2, has_parent!=2
2015-07-31 17:30:34	admin	    BATCH: 1/2 (pages 1-52/52)
2015-07-31 17:30:34	admin	        FOUND Page: /
2015-07-31 17:30:35	admin	            CHECKED URL: http://www.juergen-kern.at/site/templates/favicon.ico (200)

Warning: PDOStatement::execute(): MySQL server has gone away in /home/.sites/24/site1275/web/site/modules/ProcessLinkChecker/LinkCrawler.php on line 405

Warning: PDOStatement::execute(): Error reading result set's header in /home/.sites/24/site1275/web/site/modules/ProcessLinkChecker/LinkCrawler.php on line 405

Fatal error: Exception: SQLSTATE[HY000]: General error: 2006 MySQL server has gone away (in /home/.sites/24/site1275/web/wire/core/Modules.php line 2264)

#0 /home/.sites/24/site1275/web/wire/core/Modules.php(2264): PDOStatement->execute()
#1 /home/.sites/24/site1275/web/wire/core/Modules.php(2523): Modules->getModuleConfigData(Object(ProcessPageSearch))
#2 /home/.sites/24/site1275/web/wire/core/Modules.php(446): Modules->setModuleConfigData(Object(ProcessPageSearch))
#3 /home/.sites/24/site1275/web/wire/core/Modules.php(1032): Modules->initModule(Object(ProcessPageSearch), false)
#4 /home/.sites/24/site1275/web/wire/core/Modules.php(939): Modules->getModule('ProcessPageSear...')
#5 /home/.sites/24/site1275/web/wire/modules/AdminTheme/AdminThemeReno/default.php(25): Modules->get('ProcessPageSear...')
#6 /home/.sites/24/site1275/web/wire/core/admin.php(148): require('/home/.sites/24...')
#7 /home/.sites/24/site1275/web/wire/modules/AdminTheme/AdminThemeReno/controller.php(13): require('/home/.sites/24...')
#8 /home/.sites/ in /home/.sites/24/site1275/web/index.php on line 254

This error message was shown because site is in debug mode ($config->debug = true; in /site/config.php). Error has been logged. 


Hangs for me too on a test server. Great Module though. Love the ability to run the check directly in the Admin.


Looks like I've missed some messages here. I'm currently using this on a couple of sites with no issues: ProcessWire 2.7.2 and 3.0.18, on two separate servers. It would be interesting to hear whether the aforementioned issues still exist.


Hello teppo,

I have re-installed this module on a 3.25 dev version today and it works. I don't get any error messages :)


Can someone give me some example code showing how to initialize this module with a cron job? Do I need to create a cron job module, or can I use ready.php?

Thanks for your hints!


@SteveB: shouldn't require any tricks, but to be honest I've never used such a setup myself, so it's probably a mistake on my side. I'll take a closer look at that ASAP :)

@Juergen: README includes instructions for setting up a cron job. The gist of it is that you should make a cron job that runs the module's own Init.php file periodically.

To be honest I'm not entirely sure what you mean by a cron job module or ready.php in this context – but please let me know what I'm missing!


@Juergen

Setting up a cron job is really easy, but you have to understand some basics first.

The cron job part has nothing to do with ProcessWire. Cron is a separate program running on your server that can run commands at a certain time. It is either configured in your hosting admin panel (easiest, ask your hosting provider) or you can set it up yourself through the command line. You can follow this example if you're running a Linux-based server.

You need to understand that you can execute a PHP file from the command line. Teppo has provided us with such a script that will activate the link checker. The file is "/ProcessLinkChecker/Init.php". This is the one the cron job needs to run. If you are unsure what the correct path is, you can ask your hosting provider, or log in to the shell, navigate to the "ProcessLinkChecker" folder and type "pwd". That will give you the current path. It will be something like:

/srv/username/apps/appname/public/site/modules/ProcessLinkChecker/

Combine the path with your new knowledge from the tutorial and you can set it up.

p.s. If you are on Windows you need to create a "Task" in "Windows Task Scheduler".

p.s. 2 You don't have to wait for the scheduled time to find out whether it works, since you can test the script by running:

/usr/bin/php /path/to/site/modules/ProcessLinkChecker/Init.php >/dev/null 2>&1

p.s. 3 This whole timing stuff can be pretty confusing, so use a tool like crontab.guru.

p.s. 4 After proofreading this post it now seems pretty hard, but believe me, after a few times you can set it up in a few minutes.
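
To make the pieces above a bit more concrete, here is a rough sketch of what a finished cron entry could look like. It assumes PHP lives at /usr/bin/php and reuses the placeholder module path from the test command; the schedule (03:00 every day) is an arbitrary choice for illustration, not something the module requires.

```shell
#!/bin/sh
# Hypothetical values: adjust both paths to match your own server.
PHP_BIN="/usr/bin/php"
INIT_PHP="/path/to/site/modules/ProcessLinkChecker/Init.php"

# Five schedule fields: minute 0, hour 3, any day of month, any month, any weekday.
# Output is discarded so cron does not send mail on every run.
CRON_LINE="0 3 * * * $PHP_BIN $INIT_PHP >/dev/null 2>&1"

# Print the entry so it can be reviewed and then pasted in via `crontab -e`.
echo "$CRON_LINE"
```

Printing the line first, instead of installing it directly, leaves room to double-check the path and the schedule before cron starts using them.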


Great module thanks teppo!!

I can't edit the crontab via SSH; I'm only able to add cron jobs via the admin panel, and there I can only provide a URL, not a path. So without changing .htaccess I can't just run domain.com/site/modules/ProcessLinkChecker/Init.php, right?

But I have already set up cron jobs, so I thought about copying the contents of Init.php into an existing cron script, which should trigger it:

$linkCrawlerPath = $config->paths->siteModules . 'ProcessLinkChecker/LinkCrawler.php';
if (file_exists($linkCrawlerPath)) {
	require $linkCrawlerPath;
	$crawler = new \LinkCrawler();
	$crawler->start();
}

But then I'm getting these:

Notice: Undefined variable: wire in site/modules/ProcessLinkChecker/LinkCrawler.php on line 144
Fatal error: Call to undefined function wire() in site/modules/ProcessLinkChecker/LinkCrawler.php on line 144

I'm running 3.0.25, which is why the backslash is there.

Any ideas? Or alternative paths? I also tried the easier route of including Init.php in my cron script, with the same result.

EDIT: Same error (at least the top one, "undefined variable wire") when running from the ProcessLinkChecker admin page.


@Can: The issue you mentioned should be fixed in the latest version of LinkCrawler.php, though please let me know if it still persists. The problem was that LinkCrawler didn't have access to $wire from the global scope, but since PROCESSWIRE was already defined, it wasn't attempting to instantiate ProcessWire either.

I'm no longer entirely sure that the current behaviour makes sense in this case (perhaps I should rather allow the user to pass an instance of ProcessWire to LinkCrawler when instantiating it), but at least this seems to fix the issue at hand :)


After removing the content from site/init.php and site/ready.php for now (they were throwing errors about redeclared functions), I'm now getting this:

throw new Exception("Unrecognized render method");

I'm invoking LinkCrawler like this from an external script that bootstraps ProcessWire (so not from a template file; maybe that's the problem?):

if ($modules->get('ProcessLinkChecker')) {
	require $config->paths->siteModules . 'ProcessLinkChecker/LinkCrawler.php';
	$crawler = new \LinkCrawler();
	$crawler->start();
}

 


@Can: Sorry for the delay. So far I haven't been able to reproduce the issue you're seeing, which is making it quite difficult to debug. This is one of those cases where it would be tremendously useful to be able to check which values LinkCrawler gets from the Process module, what $this->config contains, what that "unrecognized" render method really is, and so on :) 

Not calling the module from a template file isn't a problem, but I'm a bit confused why it would throw the "unrecognized render method" error. Could you check what the module config page of ProcessLinkChecker lists as the render method?

This error should only happen if the render_method config setting contains something weird or if it's undefined. At this point I can only assume that either LinkCrawler doesn't have access to the ProcessLinkChecker module (it tries to get its config from there) or those config variables are somehow mishandled.

Just checking, but is the above snippet the only code in that file? I assume it's bootstrapping the same ProcessWire installation that has ProcessLinkChecker installed, right?

Share this post


Link to post
Share on other sites

Hey @Can,

I just ran into a few small things myself while installing and configuring this module. Since I don't have shell access to the server (yet), I created a workaround: a template and page called "cronjob", so I can trigger the script from a URL (www.domainname.com/cronjobs/?key=123).

In the template file I do a simple check on a GET variable (key) to keep other people from triggering it. After that check I include:

// Skip access since the guest user is loading the script
// Perhaps you might want to look into the permission check stuff since you're bootstrapping ProcessWire
$options = array('noPermissionCheck' => true);

// Load the Module to get the className
$linkCheckerModule = $this->modules->getModule("ProcessLinkChecker", $options);

// Include Teppo's LinkCrawler
require $config->paths->siteModules . $linkCheckerModule->className() . '/LinkCrawler.php';

// Start crawling
$crawler = new LinkCrawler();
$crawler->start();

// Stop ProcessWire from executing
$this->halt();

This seems to work fine for me. I've got a lot of data.

I still get some notices like Array to string conversion in */site/modules/ProcessLinkChecker/LinkCrawler.php on line 335*. I'll look into them tomorrow.
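
If the host only offers URL-based cron jobs, the same page can also be requested from any other machine with curl. A minimal sketch, assuming the example URL and key from above, that the site is reachable over https, and a 600-second timeout picked arbitrarily; all of these values are placeholders, not settings of the module:

```shell
#!/bin/sh
# Hypothetical values based on the example URL above; replace with your own.
BASE_URL="https://www.domainname.com/cronjobs/"
KEY="123"
TRIGGER_URL="${BASE_URL}?key=${KEY}"

# -f makes curl fail on HTTP error codes, -sS hides progress but still shows errors,
# --max-time caps how long we wait for the crawl to finish (assumed: 600 seconds).
trigger_crawl() {
    curl -fsS --max-time 600 "$TRIGGER_URL" >/dev/null
}

# Show which URL would be requested; call trigger_crawl to actually send the request.
echo "$TRIGGER_URL"
```

Since a full crawl can take a while, the timeout probably needs tuning per site, and the key should be something much harder to guess than the one in the example.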


You mean $this->config in LinkCrawler.php? I'd say it looks quite good. I put a var_dump($this->config) on line 151 (right after $this->config has been populated), and I get this in the error message after clicking "Check now" on /processwire/setup/link-checker/:


object(stdClass)#340 (16) {
  ["skipped_links"]=> array(0) { }
  ["cache_max_age"]=> string(5) "1 DAY"
  ["selector"]=> string(33) "status<8192, id!=2, has_parent!=2"
  ["http_host"]=> NULL
  ["log_level"]=> int(1)
  ["log_rotate"]=> int(0)
  ["log_on_screen"]=> bool(false)
  ["batch_size"]=> int(100)
  ["sleep_between_batches"]=> int(1)
  ["max_recursion_depth"]=> int(3)
  ["sleep_between_requests"]=> int(1)
  ["sleep_between_pages"]=> int(0)
  ["link_regex"]=> string(38) "/(?:href|src)=([\'"])([^#].*?)\g{-2}/i"
  ["skipped_links_regex"]=> NULL
  ["http_request_method"]=> string(11) "get_headers"
  ["http_user_agent"]=> string(120) "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
}

 

On 22.7.2016 at 3:39 PM, teppo said:

Could you check what the module config page of ProcessLinkChecker lists as the render method?

I don't know what you mean. Here, at /processwire/module/edit?name=ProcessLinkChecker, right? I don't know what I have to look for.

On 22.7.2016 at 3:39 PM, teppo said:

Just checking, but is the above snippet the only code in that file? I assume it's bootstrapping the same ProcessWire installation that has ProcessLinkChecker installed, right?

No, but for the tests I call exit; right afterwards, and those lines are the very first lines right after opening PHP and bootstrapping PW.

Thanks for your workaround, @arjen. I think I'll give it a try soon :)
