Very strange Pagination behaviour


ian

Hi all,

I've just noticed a strange issue with some paginated pages on my site, UKMoths (http://ukmoths.org.uk).

I have a series of pages showing thumbnails of moths by family, here: http://ukmoths.org.uk/thumbnails.  The opening page shows the families but as you drill down, it displays all the species within a certain family.  If there are more than 12 then the output is paginated using standard MarkupPagerNav functionality.

On some, however, I've noticed long strings of random characters between the base URL and the page number segment.  For example, the crambidae list has 140 species so has about 12 pages.  Page 1 is fine, showing /thumbnails/crambidae, but on pages beyond this, instead of the URLs being like /thumbnails/crambidae/page2, they are something like /thumbnails/crambidae/BVXAz1div6cNWKM3P5NDP7EoP4WA .... (cut for brevity) ... CSCHd6.9c6Nhh/page2.

I can't for the life of me figure out why this is happening.  It seems to be the case for both ProCache version pages and non-cached (when logged in).  If I look at the ProCache folder in the assets, the structure looks to be correct - i.e. a crambidae folder and then page2, page3 etc. folders.  I should point out that the pages render correctly, even with these odd URLs.

It doesn't happen across the board though - it just seems to be certain ones - the /thumbnails/elachistidae folder pagination is fine - yet they're all using the same template.

And the same site on my dev system is fine.

Confused!  ???  Anyone have any thoughts?

Thanks,

Ian.


 

@ian, I would strongly suggest that you provide more information (including the versions of PW and PHP, and any 3rd party modules installed).  It would greatly help if you can also provide a detailed explanation of how you believe things should be working, including any code segments, site structure, or custom templates that would aid anyone trying to provide assistance.

You stated that your development system is fine, so the first thing you could explain is the differences between your production and dev setups (if any)?

Any assistance that you can provide is greatly appreciated.

For historical purposes, I am providing the link to your original Showcase post about this website:

 


Does clearing the cache for ProCache help?

The reason I suspect ProCache is that...

http://ukmoths.org.uk/thumbnails/crambidae

...exhibits the weird links but when a get variable is appended that would bypass ProCache the links are normal.

http://ukmoths.org.uk/thumbnails/crambidae?foo=bar

 


@cstevensjr: Thanks, yes - I didn't provide much background information.  The live site uses PW 2.6.1 currently, with ProCache, FormBuilder, All In One Minify and Email Obfuscation (EMO). PHP version is 5.3.28.  My dev setup is MAMP Pro on Mac, running 5.5.10 and I've updated the dev site to PW 2.7.2 but haven't had chance to update the live site.  So yes, there are a few differences!

I recently installed Google Analytics but don't know if this behaviour coincided with that - I only noticed it when looking at the urls reported in GA.  I'll have to try some things over the weekend to narrow it down and remove a few variables.

@Robin S

Quote

Does clearing the cache for ProCache help?

Only briefly - the first couple of clicks to /page2 and page3 seemed OK after clearing the cache, then I moved to /page4 and the behaviour returned, even on the pages 2 and 3.  Once the urls are formatted this way, it's bypassing ProCache I presume because the rewrite doesn't work.

@tpr:

Quote

Just to make sure: is this also happening with JavaScript turned off? Eg. looking into the source code in browser.

Yes, the full string of characters in the URL is visible in the view source, and also in the cached file in the ProCache folder.

-------

Much appreciated all!

Ian.


Ian, first thing to check is how you are using URL segments. It looks like pages like http://ukmoths.org.uk/thumbnails/crambidae/ allow URL segments. If you don't need URL segments, disable them. If you do need them, then make sure you validate the URL segments that come in and throw a 404 if you get something invalid. For example:

if(strlen($input->urlSegmentStr)) {
  // one or more URL segments are specified
  if($input->urlSegmentStr == 'photos') {
    // display photos
  } else {
    // invalid
    throw new Wire404Exception();
  }
}

Second thing I would look at is the email obfuscation module you have installed. I notice the code it is adding (near the bottom of the source) appears to be using the same encoding that appears in those URLs, so I'm wondering if that might be adding it? Try uninstalling the email obfuscation module at least temporarily to see if it makes any difference. 


Thanks Ryan, I think you may well have hit the nail on the head with the Email Obfuscation module.  I found that you can disable this on a template by template basis. Having disabled it for the thumbnails template and cleared the ProCache cache again, things seem to be OK at the moment. Obviously I'll have to keep an eye on it.

The template is using URL segments - the page itself is /thumbnails/ and each moth family is a URL segment (crambidae etc.).  This then pulls out just the thumbnails for species belonging to that family.  The number of families is relatively static but can change.  I should probably set $config->maxUrlSegments to 1 and then throw a 404, as you say, for invalid values.

Cheers!,

Ian.

 


Hmm - looks like I spoke too soon - it's still happening :-(

I'm still trying to eliminate various things but haven't pinned it down yet.  I've now uninstalled the Email Obfuscation module but some of these odd links have reappeared since.  If I clear the ProCache (or just delete the specific subtree in the ProCache folder), I can revisit the pages and hence regenerate the cached versions.  These seem OK, but some time later when the cache has expired, I revisit and the odd links can be there again (they are in the cached versions too).  It all seems rather intermittent.

I did set $config->maxUrlSegments=2 for a while, which returned a 404 when an affected link was visited, but didn't prevent the oddity. It just frustrated my visitors!

For anyone who's interested, here's the link to the 'base' thumbnail page.  http://ukmoths.org.uk/thumbnails/ - any of the thumbnails with more than 12 species will link to paginated versions and could be affected.

Appreciate all your help,


Update:

Unfortunately I still haven't got to the bottom of this - the problem is still occurring.  Here's what I've tried:

  • Upgraded the live site to PW 2.7.2 to correlate better with my dev system.
  • Uninstalled the Email Obfuscation module and used my own routine
  • Set $config->maxUrlSegments to 2 (*see note below) and cleared the ProCache cache
  • Ensured validation of the urlSegment that pertains to the family against a list of known values (and cleared the ProCache cache)

The ProCache is set to expire at 24 hours, but at random times before this (after clearing the cache and revisiting the page) the first page (Page1) seems to be regenerated and, if so, seems to exhibit the problem.  I can't reproduce this myself, it just happens on the site, and if bypassing the ProCache (when logged in etc.) it seems OK.  Once Page1 is in the cache it breaks any subsequent page links (now that I've set $config->maxUrlSegments) unless you bypass it manually by typing in /page2 or /page3 in the address bar.

* I'm using just one urlSegment beyond the page itself, but when using maxUrlSegments I have to set this to 2 to work - is this the expected behaviour?

Here's a snippet of my code:

$metatitle = "British Moths | Thumbnail List by Family | UKMoths";
$title = "Thumbnails by Family";
$sanitizedfam = "";
if (strlen($input->urlSegment(1))) {
  $sanitizedfam = $sanitizer->text($input->urlSegment(1));
  if (!isValidFamily($sanitizedfam)) throw new Wire404Exception();
  $specieslist = $pages->find("template=species, fam={$sanitizedfam}, limit=12");
  if (!count($specieslist)) throw new Wire404Exception();
  $metatitle = "British Moths | Thumbnail List by Family | " . ucfirst($sanitizedfam);
  $title = "Families: " . ucfirst($sanitizedfam);
}

The isValidFamily() function is my new validator against known values and throws a 404 if not valid. 

If there's an urlSegment(1) then the code returns the species containing the family name, or if none then throws a 404.

Attached is a partial screenshot of the source of the current cached version of this page, which is (at the time of writing) exhibiting the problem: http://ukmoths.org.uk/thumbnails/gracillariidae/.

I do appreciate any thoughts or further suggestions!

Thanks,

Ian.

 

Screen Shot 2016-07-07 at 2.45.44 PM.png


Ian, I'm not sure what components of the URL correspond to each URL segment, but it looks to me like possibly the issue is urlSegment2 not urlSegment1. You've got lots of good URLs to test with there from that screenshot, so you'll want to make sure that when you access the page at a URL like that, you get a 404. That will solve the issue. Also, if you are only needing 1 level of URL segments, I would suggest doing your comparisons against $input->urlSegmentStr rather than $input->urlSegment1. For instance:

$fam = $input->urlSegmentStr; 
if(strlen($fam)) {
  if(!isValidFamily($fam)) throw new Wire404Exception();
  $specieslist = $pages->find("template=species, limit=12, fam=$fam");
  if(!count($specieslist)) throw new Wire404Exception();
  $fam = $sanitizer->entities($fam);
  $metatitle = "British Moths | Thumbnail List by Family | $fam";
  $title = "Families: $fam";
}

Using $input->urlSegmentStr is better because it is inclusive of all URL segments. So if there are extra URL segments packed on to the end (like the bogus ones we see after the family names) then this will catch it, without you having to check $input->urlSegment2 separately. 

For example, the $input->urlSegmentStr of the page accessed at /thumbnails/gracillariidae/ would simply be "gracillariidae". Whereas if accessed at /thumbnails/gracillariidae/some-bogus-junk, then the urlSegmentStr would be "gracillariidae/some-bogus-junk", which presumably your isValidFamily() function would be able to exclude.
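
To illustrate the point, here is a minimal sketch of a whitelist check applied to the whole segment string. The name isValidFamily() comes from Ian's post, but its body is not shown in this thread, so the implementation and the $knownFamilies list below are assumptions for illustration only.

```php
<?php
// Hypothetical whitelist validator; the real isValidFamily() is not shown in this thread.
// $knownFamilies is an illustrative list, not the site's actual data.
function isValidFamily(string $segmentStr, array $knownFamilies): bool {
    // A strict exact match fails for anything with extra segments attached,
    // so "gracillariidae/some-bogus-junk" is rejected without a separate check.
    return in_array($segmentStr, $knownFamilies, true);
}

$knownFamilies = ['crambidae', 'elachistidae', 'gracillariidae', 'noctuidae'];

var_dump(isValidFamily('gracillariidae', $knownFamilies));                 // bool(true)
var_dump(isValidFamily('gracillariidae/some-bogus-junk', $knownFamilies)); // bool(false)
```

Because the comparison is an exact match against the full string, there is no need to inspect the second segment on its own.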


Thanks Ryan, I will certainly try the $input->urlSegmentStr - that simplifies it somewhat; I didn't know about that.

However, those links are resulting in a 404 now since I added the $config->maxUrlSegments=2 setting.  What I can't figure out is how those links are getting into the cached pages if visiting such a URL results in a 404.

Would I be correct in thinking that the ProCache cached page is generated earlier in the request lifecycle than my urlSegment/validation checking?  I guess that's something I can test on my dev later, but is it possible something like this could happen?: 

  1. Page is not cached yet, or has recently been cleared/deleted by maintenance
  2. Some user visits the page and (perhaps inadvertently) puts bogus junk in the url (/thumbnails/gracillariidae/some-bogus-junk)
  3. ProCache detects the page isn't cached and caches it to disk, creating the pagination links containing some-bogus-junk
  4. My $input->urlSegmentStr and isValidFamily() routines run and detect an invalid URL
  5. User is shown the 404 page but the page has been cached with wrong pagination urls

I've no idea whether that's even remotely possible, but that's how it appears.

Note: I realise now that $config->maxUrlSegments also counts the page number segment, so $config->maxUrlSegments=2 is correct in my case.

 


Quote

Would I be correct in thinking that the ProCache cached page is generated earlier in the request lifecycle than my urlSegment/validation checking?  I guess that's something I can test on my dev later, but is it possible something like this could happen?: 

No, ProCache only saves the cache after all the code in your template has been executed, and right before ProcessWire shuts down. Perhaps the old cache files are still present and the cache just needs to be cleared? Or perhaps the issue isn't the caching at all, and we were just seeing a side effect in the cache (this is what I think is most likely the case). 

Also, your $config->maxUrlSegments setting really does not matter too much, so long as validation of the $input->urlSegmentStr is working properly.

It sounds like the bogus URL pages are throwing 404s, so that's good. So now what you need to look for is what's generating the bogus URLs in the first place. They are appearing in the code, so they are coming from somewhere. It looks like they aren't originating from URL segments at all; that's just the result, apparently not the source. So we need to look deeper.

Here's something interesting. If I view the page at http://ukmoths.org.uk/thumbnails/gracillariidae/ and hover over a pagination link at the bottom, like "4", I can see the bogus URL. Yet if I view (not inspect) the source code, the links are clean. What that points to is that something on the JavaScript side is manipulating the links. However, I can't confirm it because all the links are now clean and I can't get any more bogus links, almost like you found and fixed something while I was viewing it. :) But if you are still seeing the issue, try viewing the same page with JavaScript disabled. If you can confirm it, start zooming in on the different JS parts; perhaps the email obfuscation JS is still getting called somehow?


Success (or at least I think so) :).  Thanks Ryan, of course you were right all along.  Please ignore the nonsense above about page request lifecycle.  On reading it back it's obvious that's not possible.

Here's where I believe I was going wrong:

  • On the first page (Page 1 of the paginated list) there is no page number identifier, e.g. the url is like /thumbnails/noctuidae.
  • Subsequent pages are like /thumbnails/noctuidae/page2, /thumbnails/noctuidae/page3 etc. Nothing unusual here.
  • The setting $config->maxUrlSegments is set to 2. This includes the page number segment if it's there, but crucially, if it's not, it allows another URL segment on the first page (e.g. /thumbnails/noctuidae/somerandomstringofcharacters).
  • I was validating the $input->urlSegment1 for valid names, but wasn't checking $input->urlSegment2 at all.  Hence somerandomstringofcharacters was getting through as segment2 on the first page only, and finding its way into the cached page. 
  • Thus, any page links on the first cached page were stuffed with these characters.  My validation routine only worked on pages other than the first.

Using $input->urlSegmentStr as Ryan suggests and validating against this solves the problem.  
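
The gap described in the list above can be reproduced with a small stand-alone model. This is only a sketch of the behaviour as I understand it, not ProcessWire's own routing code; the function names oldCheck() and newCheck() and the family list are made up for the illustration.

```php
<?php
// Simplified model of the behaviour described above; this is NOT ProcessWire's
// routing code, and the family list is illustrative.
$families = ['noctuidae', 'crambidae'];

// Old approach: cap the total number of segments and validate only the first one.
function oldCheck(string $path, array $families, int $maxSegments = 2): bool {
    $segments = explode('/', trim($path, '/'));
    if (count($segments) > $maxSegments) return false;
    return in_array($segments[0], $families, true);
}

// New approach: strip an optional trailing pageN, then validate everything that is left.
function newCheck(string $path, array $families): bool {
    $rest = preg_replace('#/page\d+$#', '', trim($path, '/'));
    return in_array($rest, $families, true);
}

$paths = [
    'noctuidae',
    'noctuidae/page2',
    'noctuidae/somerandomstring',
    'noctuidae/somerandomstring/page2',
];
foreach ($paths as $path) {
    printf(
        "%-34s old=%s new=%s\n",
        $path,
        var_export(oldCheck($path, $families), true),
        var_export(newCheck($path, $families), true)
    );
}
```

In this model only the third path is treated differently: the old check accepts it because it has two segments and a valid first one, while the check on the full string rejects it. The fourth path fails both checks, which matches the 404s seen on the junk pagination links.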

Thanks Ryan and others for your patience!


15 minutes ago, ryan said:

However, I can't confirm it because all the links are now clean, can't get any more bogus links, almost like you found and fixed something while I was viewing it. :) 

Yes, I think so - and you posted while I was writing my reply!


Great, glad you got it figured out! You must have literally fixed it between the time I reloaded the page, and clicked "view source", since one tab had the issue and the other didn't. :)


  • 2 years later...

I know this is an old post, however I seem to be seeing a similar issue which can be viewed here: https://www.edmplus.co/uk-ltd/edm-consumables/

If I'm logged in the pagination works fine and starts at 1, however, if you view as a guest then the pagination breaks. I've tried disabling caching for the template but to no avail. I'm not quite sure what the root of the problem is. Could anyone point me in the right direction?

I'm running ProcessWire 3.0.128.


Hi @alexmercenary,

I don't think it's the same problem as I had - that was down to me not validating my UrlSegments. However I have had problems where the wrong "start" value is supplied to the selector, generally by being set by some other part of the code (some other selector with limit elsewhere in the code). I'm not sure that's your issue either, but it may be related.

It seems that on your very first page of results, the "start" is somehow being set at a different value, and then this is being cached, generating the wrong set of results and corresponding pagination.  If you bypass the cache (https://www.edmplus.co/uk-ltd/edm-consumables/?test=1) or explicitly specify the first page number (https://www.edmplus.co/uk-ltd/edm-consumables/Page1) then it works as expected. 

I'd look for something that could be inadvertently setting the start value or the $input->pageNum() when these aren't otherwise defined for the selector in question.
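
One way to rule that out might be to derive the start offset from the page number yourself, so that nothing else in the code can change it implicitly. The helper name startForPage() is hypothetical and the limit of 12 is just the value from my own site; this is a sketch, not a confirmed fix for your case.

```php
<?php
// Hypothetical helper: derive the selector's start offset from a 1-based
// page number instead of relying on an implicitly set value.
function startForPage(int $pageNum, int $limit): int {
    // Guard against 0 or negative page numbers by treating them as page 1.
    return (max(1, $pageNum) - 1) * $limit;
}

$limit = 12; // illustrative page size
$selectorPart = sprintf('limit=%d, start=%d', $limit, startForPage(3, $limit));
echo $selectorPart, "\n"; // limit=12, start=24
```

The resulting string would be appended to whatever selector the template already uses, with the page number taken from the current request rather than hard-coded.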

Not sure if that helps, but maybe it's a start...  Others more experienced may chip in!


Thanks for the response @iank

It's definitely a bit bizarre. I've not had this issue before. I've tried explicitly specifying the start index but then it just gets stuck on that start position. If I clear the cache then refresh the page, it loads the first page as expected. As soon as I hit another page, page 1 stops working, although, as you say, it works if you append page1.


Hmm, that is weird!

It appears as though the first call to a subsequent page number other than the base (1) is overwriting the cached page for the base, though I don't understand how that can happen. 

I presume you're using ProCache? If so, maybe check the versions of the cached files themselves: 

  1. When you clear the cache and call the base (first) page and it's rendered correctly, does it save the correct cached index.html in the appropriate ProCache folder in assets?
  2. What happens when you then visit one of the higher page numbers,  say /page5?  Does this 'root' index.html file immediately get overwritten?  Is there an index.html in the page5 folder in ProCache assets? 
  3. How do these two compare? 

You can also access the cached files directly in the browser since you'll know your own unique ProCache folder structure. This would eliminate any (unlikely) rewrite rule problems.
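
If comparing the files by eye is awkward, a short script could do it. This is only a sketch: the helper name sameCachedContent() is made up, and the demo uses temporary stand-in files because the real ProCache folder layout is specific to each site.

```php
<?php
// Hypothetical helper: report whether two cached HTML files have identical content.
// The paths passed in are placeholders; the real ProCache directory is site-specific.
function sameCachedContent(string $fileA, string $fileB): bool {
    if (!is_file($fileA) || !is_file($fileB)) return false;
    return hash_file('sha256', $fileA) === hash_file('sha256', $fileB);
}

// Demo with temporary stand-in files rather than real cache files.
$base  = tempnam(sys_get_temp_dir(), 'pc');
$page5 = tempnam(sys_get_temp_dir(), 'pc');
file_put_contents($base, '<html>page 1 results</html>');
file_put_contents($page5, '<html>page 5 results</html>');

var_dump(sameCachedContent($base, $page5)); // bool(false)

unlink($base);
unlink($page5);
```

If the base index.html ever hashes the same as the one for a higher page number, that would suggest the base file is being overwritten with another page's output.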


It's randomly started working... haha. Bizarre.

In answer to your questions, though: it is more or less exactly as you've stated. It was generating the index.html cache for page 1, then overwriting it. No child directories for the paged results. I'm going to keep an eye on this one.


I tell a lie, @iank, it's playing up again. I'll try to spend some more time debugging later, and if I manage to resolve it (it could be something so basic that I'm overlooking it) I'll come back to you.

Thank you for taking the time to offer your advice on the matter. It's very much appreciated. It's weird how even when I've disabled all caching for those templates it's still playing up. If I disable ProCache the issue is no more. I wonder if the rewrites in the htaccess aren't working for some reason.

Either way, I shall report back.
