Pete

Module: XML Sitemap

I missed the XML sitemap generator that I used in a previous CMS so I built my own module to achieve the same functionality.

This module outputs an XML sitemap of your site that is readable by Google Webmaster Tools etc. I've generally found that using a sitemap in combination with Webmaster Tools reduces the time it takes for new sites and pages to be listed in search engines (since you're specifically telling the service that a new site/new pages exist), so I thought I may as well create a module for it.

The module ignores any hidden pages and their children, assuming that since you don't want these to be visible on the site then you don't want them to be found via search engines either.

It also adds a field called sitemap_ignore that you can add to your templates and exclude specific pages on a per-page basis. Again, this assumes that you wish to ignore that page's children as well.

The sitemap is accessible at yoursite.com/sitemap.xml - the module checks to see whether this URL has been called and outputs the sitemap, then does a hard exit before PW gets a chance to output a 404 page. If there's a more elegant way of doing this I'll happily change the code to suit.

Feedback and suggestions welcome :)

On a slightly different note, I wanted to call the file XMLSitemap originally so as to be clearer about what it does in the filename, but if you have a module that begins with more than one uppercase letter then a warning containing only the module name is displayed on the Modules page, so I changed it to Sitemap instead which is fine as the description still says what it does.

File can be downloaded via GitHub here: https://github.com/N.../zipball/master

Here's a list of services you can submit your sitemap to:

Google Webmaster Tools

Bing Webmaster Tools

There used to be a service for Yahoo called Site Explorer that was similar in functionality to the above two services, but it appears that this has now been replaced with Bing's offering. On the bright side it's one less service to sign up to :)

You can also submit to Ask using the following URL (replacing the relevant part with the full URL to your sitemap):

http://submissions.ask.com/ping?sitemap=http://www.the URL of your sitemap here.xml

Generally I find Google and Bing to be sufficient though, as I think the other search services trawl their content reasonably quickly and sometimes find out about new sites that way.

It is also recommended to add the following to your robots.txt file (create one if you don't have one):

Sitemap: http://www.your-domain.com/sitemap.xml

This allows search engines to find your sitemap.xml, even if you didn't submit it to that specific search engine.

Regarding this module, until now I've always used a sitemap.xml template. When I created that one I used the regular sitemap page as an example.

But I will certainly try this module sooner or later.  :)

/Jasper

That's actually how I began this module - as a separate template - but then I thought it would be quicker as a module and then there's no need for an actual page or any templates and it's all in one file.

Plus I really wanted to have that optional sitemap_ignore field that I'd used in the previous CMS I worked with.

I'm glad I did start creating it as a template and page though, else I'd have had more hassle working out why it wasn't outputting XML (I needed to specify that XML was being outputted by using a PHP header).

Pete, nice job with this module, it looks very well put together and provides a helpful function. I like the simplicity of it being a module and how easy that makes installation. But there are a couple potential issues with this approach I wanted to mention.

First is that it's taking the place of the 404 page and a 404 header is getting sent before the sitemap.xml output is displayed. This will probably be an issue for any search engines that hit it. You can avoid this by moving everything from your generateSitemap() function into your init() function. You don't need to use any hooks. Of course, you won't have a $page to check at the init() stage, but I don't think you need it–just get rid of the $page->template != 'admin' check... your strpos() below it is likely just as fast.

Next concern is scalability. Generating this sitemap could be a slow thing on a large site. Anything that's slow to generate can be a DDOS hole. Because it's in a module, you can't as easily cache the output with a template setting. However, you could use the MarkupCache module to perform your own caching, and that's something I'd probably recommend here.

Last thing is that I'd suggest renaming the module since it falls in the Markup group of modules. Perhaps the name MarkupSitemapXML?

Other than that, I think it looks great.

Thanks ryan - I was a bit worried about the 404 so this is a good solution and something I'll definitely bear in mind for future modules. I enjoy writing these modules as every one teaches me something new about how PW works behind the scenes.

I've implemented your suggestions and replaced the file in the first post with the modified version with the new name (if anyone has this installed already, simply uninstall the old module, delete the module file and install the new one).

I also implemented caching as you suggested and cached it for an hour (for most sites a day or more would probably be fine, but if you're writing articles or running a blog you want search engines to index your content as soon as possible, so that seemed like a good period to set it at, though I've no idea how often search engines actually check a sitemap).

I agree that there are quite a few potential issues on larger sites and not just because of the sitemap's size. Google doesn't always index everything on larger sites, and I was just reading about how someone tested this by telling Google that things like news posts and forum posts (individual posts, not threads) should only be indexed once as they're unlikely to ever change. This apparently frees up the search engine crawlers to check the new content rather than go over the same content each visit.

I can see a few places that this would be useful in PW - for example a news section would only need each article to be indexed once (if you ever went back and edited a page, I believe the last modified date would force a re-index) and various other similar examples.

There's a good list of tags for sitemaps here: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=183668 and the tag to use would be <changefreq> , however that could get a bit tiresome if you had to have it on every page. A potential solution for a future version of this module would be to have such a field and simply have all child pages inherit the parent value if they don't have one set themselves.
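To make that a bit more concrete, this is roughly what a single entry with the optional tags looks like under the sitemaps.org protocol. The URL and date are made-up placeholders, and it's only a sketch of the format rather than what the module currently outputs:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <!-- Placeholder URL for an archived news article -->
    <loc>http://www.your-domain.com/news/some-article/</loc>
    <!-- Last modified date, which a later edit would update -->
    <lastmod>2012-01-15</lastmod>
    <!-- Allowed values: always, hourly, daily, weekly, monthly, yearly, never -->
    <changefreq>never</changefreq>
  </url>
</urlset>
```

Bear in mind that search engines treat <changefreq> as a hint rather than a command.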

Another option as a possible alternative to caching would be to run the module once on the first time the page is called, then only re-generate the sitemap once a page is edited/created/deleted as those are the only three times a change to the sitemap would technically be required.

I'll leave it as it is for now though.

I forgot to ask - is there a better way of having a module run when an admin-only page is loaded other than this:

<?php
// Skip pages rendered with the admin template, i.e. only run on the front end
if ($page->template != 'admin') {
    // Do stuff
}
?>

It's no longer relevant for this module as it now runs before the page data is available, but when I had it in the first version I gave myself a pat on the back for figuring I could check for the admin template like this. I can see how such a check would be good for other modules though so just wondering if there's a better way for checking if we're in the admin.

You also could use a Sitemap Index (<sitemapindex>) which allows you to have multiple sitemaps per site.

This allows you to break your sitemap up into several smaller sitemaps. You could even have different cache times per branch, eg. 5 minutes for the news section and 1 day for the other pages.
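As a rough sketch of what that looks like under the sitemaps.org protocol (the file names and date here are hypothetical, just to show the shape), the index file simply points at each of the smaller sitemaps:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <!-- Hypothetical sitemap for the news branch, regenerated often -->
    <loc>http://www.your-domain.com/sitemap-news.xml</loc>
    <lastmod>2012-01-15</lastmod>
  </sitemap>
  <sitemap>
    <!-- Hypothetical sitemap for all other pages, cached for longer -->
    <loc>http://www.your-domain.com/sitemap-pages.xml</loc>
  </sitemap>
</sitemapindex>
```

You would then submit only the index file to the search engines, and each child sitemap uses the normal <urlset> format.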

/Jasper

Very good point - I'd completely forgotten about that!

One of the forum applications I use has this though I'm quite far behind on the versions. I believe they had one for forums, one for posts, one for calendars etc. It would have made more sense for them to split up posts by forum though as it gets a bit silly if it tries to put a million posts in one file (I don't have that problem fortunately ;)).

So thinking about this a bit more, it would make sense to have some template settings for this module too in a future version and specify cache times and re-index periods for all pages there. It could then create separate sitemaps on a per-template basis as you say. I think that would cover most eventualities there.

My only gripe would be that caching per-template wouldn't help if you were running a large blog/news site as one main template would apply to most of the content. Perhaps splitting it per year, or simply having it split the sitemap at X thousand pages if you have a huge amount of articles? I think per-year sitemaps for larger article sections would be best thinking about it. Presumably if a page from 2009 is modified and then ends up in the 2012 sitemap it's not a big issue - Google will be told in the 2012 sitemap to index it again and it's not like you're going to modify old articles particularly often so I wouldn't see a need in cleaning up the 2009 cached sitemap in this case.

It does look like you can easily have a master sitemap linking to sub-sitemaps though so that part is pretty straightforward at least (and it looks like 50,000 pages per sitemap is the maximum allowed): http://googlewebmastercentral.blogspot.com/2006/10/multiple-sitemaps-in-same-directory.html

Anyway, this is one for when I next get a few free hours ;)

Great update, thanks Pete!

What do you think about putting this on GitHub (whether now or in the future)?

For the cache time, you could always make it configurable with the module if you wanted to. Let me know if I can assist.

I also like your alternative for time based caching, though I'm guessing time based caching is adequate for most. But if you want to pursue that alternative, you'd use the CacheFile class directly and have a Pages::save hook clear the cache file (or maybe just Pages::added and Pages::deleted hooks).

I wasn't aware of the <sitemapindex>, this sounds like a good idea for the large sites. But also sounds like a much more complex module.

Cheers ryan

Forgot I had a Github account for a sec there. Added, updated the link in the first thread and added it to the growing modules spreadsheet on Google Docs ;)

Yup - I'll probably make the time configurable at some point in the future as it seems like an easy interim update before doing anything more complicated.

Nice job Pete. It's easier to follow on GitHub just because we can pull in updates automatically without downloading, unzipping, etc. So many great new modules lately. I need to make a lot of additions to the modules directory.

Mine is throwing an error:

Exception: Unable to create path: /MarkupSitemapXML/ (in /home/theseeke/public_html/wire/core/CacheFile.php line 62)
This error message was shown because you are logged in as a Superuser. Error has been logged.

From a quick read of the relevant modules, it's not apparent to me where it's trying to create that directory.

It sounds like your site/assets/cache directory might not be writeable (it needs to be in order to be able to cache the sitemap). Hopefully making that writeable (CHMOD 644 probably) will fix that.

The cache folder itself isn't solely related to this module, so I'm sure you would have come across that message sooner or later.

Hope that fixes it for you :)

No such luck - still didn't work. Also tried 755 and 777 on the off chance, and recursed it through that whole tree to be sure.

Incidentally, file uploads and caching of size()'d images are working fine.

Hmm... that's odd, I'll take a look at the module. Cheers for checking that though.

Looks like it tries to create a /MarkupSitemapXML/ folder in the root. The error message should show the full path to the cache folder.

I worked it out - you have to install the Cache module as well, which is present in the default PW installation, just not installed by default (I already had it switched on on my test account!).

That will make it work, but I do need to put some sort of check into my module to make it a bit easier (need to read the posts around here on module dependencies).

Still doesn't seem to work, same error msg, damned if I know why?

http://theseekerr.com/sitemap.xml - there it is, if staring at my error provides any sort of motivation.

Could you try uninstalling both modules and then reinstalling the cache one first and see if that helps please?

Pete, try adding an extra line to your install function:

wire('modules')->get('MarkupCache'); 

That will just ensure that it's installed ahead of time and should resolve the problem in this particular case.

However, I don't think there is a problem with your code, I think it's actually a bug with the MarkupCache module (or maybe Module installer) because the MarkupCache module is responsible for making sure its files go into the right place, and clearly it's not doing that. I will locate and fix the issue. I was able to reproduce it here by uninstalling the MarkupCache module and then letting it be installed at the time it's used. That seems to be the only time the issue occurs.

Found it--it was a bug. Sorry for the inconvenience guys. This is one of those obscure bugs that takes the right set of circumstances to turn up, so these can be difficult to track down. Thanks for finding it. I've just committed the fix to the dev branch, and it should be merged into the stable branch likely tomorrow. Here's the commit message:

[dev c3a8ffe] Dev: fix issue with $modules->get('uninstalled module') where the module's init() wasn't called on an "autoload" module when it was used immediately after it was installed.

I've tried uninstalling both the Sitemap module and the Cache module, then reinstalling the Cache module before the Sitemap module. I'm still not winning...

Sorry, I think it's the Markup Cache module and not Fieldtype - I forgot to specify which one (the Fieldtype cache module is on by default whereas MarkupCache isn't).

No dice. The module was already installed, and threw an error about a missing MarkupCache folder when I went to remove it. So I created the folder, and it uninstalled cleanly, deleting that folder. Then I reinstalled the two modules....still no go, and I notice that folder is still missing. I manually created it on the off chance that'd help but that didn't work either.

I'm on the verge of converting your code back to a template, this is getting silly...
