maxf5

Web crawler for cache warmup


Hey guys,

I am using WireCache and the template cache. Is there already some kind of web crawler for ProcessWire that crawls your whole site to warm up the cache?
(In Shopware eCommerce, for example, you can warm up the cache with a crawler run via a cron job.)


It would be a nice feature :)

 

 


I found a function (below) that could be run from a cron job,

or this library, which could be used for a module :) http://phpcrawl.cuab.de

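function crawl_page($url, $depth = 5)
{
    // Remember visited URLs across recursive calls so each page is fetched only once.
    static $seen = array();
    if (isset($seen[$url]) || $depth === 0) {
        return;
    }

    $seen[$url] = true;

    // Requesting the page is what warms its cache; warnings about invalid HTML are suppressed.
    $dom = new DOMDocument('1.0');
    @$dom->loadHTMLFile($url);

    $anchors = $dom->getElementsByTagName('a');
    foreach ($anchors as $element) {
        $href = $element->getAttribute('href');
        // Skip empty links, in-page anchors and non-HTTP schemes such as mailto:
        if ($href === '' || preg_match('/^(#|mailto:|tel:|javascript:)/i', $href)) {
            continue;
        }
        if (0 !== strpos($href, 'http')) {
            // Not an absolute URL: treat it as a path relative to the site root.
            $path = '/' . ltrim($href, '/');
            if (extension_loaded('http')) {
                $href = http_build_url($url, array('path' => $path));
            } else {
                $parts = parse_url($url);
                $href = $parts['scheme'] . '://';
                if (isset($parts['user']) && isset($parts['pass'])) {
                    $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                }
                $href .= $parts['host'];
                if (isset($parts['port'])) {
                    $href .= ':' . $parts['port'];
                }
                $href .= $path;
            }
        }
        // Stay on the current host so the crawl does not wander onto external sites.
        if (parse_url($href, PHP_URL_HOST) !== parse_url($url, PHP_URL_HOST)) {
            continue;
        }
        crawl_page($href, $depth - 1);
    }
    echo "URL:",$url,PHP_EOL,"CONTENT:",PHP_EOL,$dom->saveHTML(),PHP_EOL,PHP_EOL;
}
// Start at the homepage (page ID 1); a depth of 2 fetches the homepage and the pages it links to.
crawl_page($pages->get(1)->httpUrl, 2);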


You can pass a function to $cache->get() and it will generate the cache for you:

$expiration = 3600;
$cache->get("cache_name", $expiration, function() use($page) {
	$markup = "<div>$page->title</div>";
	return $markup;
});

https://processwire.com/api/ref/cache/get/

https://github.com/processwire/processwire/blob/57b297fd1d828961b20ef29782012f75957d6886/wire/core/WireCache.php#L136

Edit: I read your post again and I think that is not what you want. To prevent the cache from being generated by a visitor, you can generate it in a hook, depending on when you want the cache to update.


How can I get a cache when it hasn't even been generated yet? That's the idea behind a web crawler/robot: it generates the cache for you by visiting all pages.

 


This is one of the scripts I use to quickly regenerate caches after I flush them. Obviously, it won't work for dynamically generated URLs (i.e. urlSegments).

<?php
use ProcessWire\ProcessWire;
require_once 'vendor/autoload.php';

$wire = new ProcessWire();

// get URLs for all publicly accessible pages
$urls = [];
foreach($wire->pages('id>0, check_access=1') as $p) $urls[] = $p->httpUrl;

header("Content-Type: text/plain");

// visit all URLs
foreach($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch);

    if(! curl_errno($ch)) {
        $info = curl_getinfo($ch);

        echo 'URL: ' . $url . "\n" .
             'Status: ' . $info['http_code'] . "\n";
    } else {
        echo 'ERROR: ' . $url . "\n";
    }

    // close the handle on success and on error
    curl_close($ch);

    // pause half a second between requests; sleep() only accepts whole seconds
    usleep(500000);
}

Save this in the same directory as index.php, for example as cache.php, then access it at mydomain.com/cache.php. It might take a while for anything to appear, because nothing is shown until the output buffer is flushed to the browser.
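
If the delay before the first output is a problem, one option is to flush after every line. This is only a sketch: report_line() is a hypothetical helper name, not part of the script above, and a web server or proxy that compresses or buffers responses may still hold the output back.

```php
<?php
// Hypothetical helper: print one line and try to send it to the client
// right away instead of waiting for the script to finish.
function report_line(string $line): void
{
    echo $line, "\n";

    // Flush and close any PHP output buffers that are currently active.
    // The @ and the return-value check avoid looping forever on a buffer
    // that cannot be removed.
    while (ob_get_level() > 0 && @ob_end_flush()) {
        // nothing to do here, the loop condition does the work
    }

    // Ask PHP to hand whatever it has written so far to the web server.
    flush();
}

report_line('Warming up cache ...');
```

The echo calls inside the loop could then be swapped for report_line(), so each URL shows up as soon as it has been requested.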

7 minutes ago, maxf5 said:

How can I get a cache when it hasn't even been generated yet?

If you pass it a function/closure, the get method generates the cache itself when no cache is found.

And with a hook (on page save or wherever it makes sense for your use case) you can generate/update the cache.
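
A rough sketch of the hook idea, meant for /site/ready.php. It assumes you want to refresh the cache whenever a page is saved. The helper functions and the "title-markup-" cache name prefix are made up for illustration; the markup and the 3600 second expiration are taken from the $cache->get() example earlier in the thread.

```php
<?php namespace ProcessWire;

// Sketch for /site/ready.php. The helper names and the cache name prefix
// below are hypothetical and only serve as an illustration.

/** Build the markup to cache (same markup as in the $cache->get() example). */
function titleMarkup(string $title): string {
    return "<div>$title</div>";
}

/** Hypothetical per-page cache name; the template has to use the same name. */
function titleCacheName(int $pageId): string {
    return "title-markup-$pageId";
}

// ready.php always provides $wire; this check only lets the file load
// outside of ProcessWire without an error.
if(isset($wire)) {
    $wire->addHookAfter('Pages::saved', function(HookEvent $event) {
        $page = $event->arguments(0);    // the page that was just saved
        $cache = $event->wire('cache');  // the WireCache instance
        // Overwrite the cached markup so the next visitor gets a cache hit.
        $cache->save(titleCacheName($page->id), titleMarkup($page->title), 3600);
    });
}
```

In the template you would still read the value with $cache->get() under the same name, so the closure you pass there only runs as a fallback when the cache has been flushed or has expired.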

