Jump to content
ryan

PW 3.0.172 – Find faster and more efficiently

Recommended Posts

@ryan - this sounds like another awesome update! Admittedly I haven't played with this new version yet, but my initial thought after reading the blog post is whether we could have a:

$pages->find("$selector, field=all");

as a complement to:

$pages->find("$selector, field=none");

For many pages/templates, I always need to access all fields on every load and this would make it much simpler than listing out all the fields separately.

Does that sound like a useful addition?

  • Like 3
  • Thanks 1

Share this post


Link to post
Share on other sites

Hi @ryan and thank you for such great and in my case in-time update.

A couple of questions: 

Do these codes do the same thing or they differ internally?

$items = $pages->find("title|body*=$q, field=title|summary"); 

and 

$items = $pages->find("title|body*=$q, [
    'loadOptions' => [
      'joinFields'     => [
        'title',
        'summary'
    ]
  ]
]"); 

----------------------------------

I remember that there were some issues with autojoining of multi-value fields with 'loadOptions' option in find method (like page reference or options fields etc). Are there any changes? 

----------------------------------

I have found that if I specify a field that does not exist "findRaw" method starts to join default page properties:

chrome_KIZEdaXrO6.thumb.png.c5b3128ae0d31f2af7c413e71f4a6213.png

And without 

chrome_mPl655KtLS.png.2086df24f9365ecc19be4e2c63c8dbd6.png

----------------------------------

What is the right way to handle the situation when we need to autojoin some fields only for the current page. For example, SEO fields in most cases used only on the page object of the current page, but if I set these fields autojoin then than they will be autojoined on every find call, so then we have to specify that we don't want to join these fields in 'joinField' options or via newly added options that also is not very handy. 

I was trying to set autojoin flag for these field in init.php and remove it in ready.php like
 

// init.php

fields('seo_title')->addFlag(Field::flagAutojoin);
fields('seo_description')->addFlag(Field::flagAutojoin);
fields('seo_og_title')->addFlag(Field::flagAutojoin);
fields('seo_og_description')->addFlag(Field::flagAutojoin);

//ready.php

fields('seo_title')->removeFlag(Field::flagAutojoin);
fields('seo_description')->removeFlag(Field::flagAutojoin);
fields('seo_og_title')->removeFlag(Field::flagAutojoin);
fields('seo_og_description')->removeFlag(Field::flagAutojoin);

But in this case, there is no way to check does current template have these fields as on the "init" state template and page variables are not available, so all templates have to have these fields or we will get an error. Maybe there is something already that can help? 

  • Like 1

Share this post


Link to post
Share on other sites
Quote

this sounds like another awesome update! Admittedly I haven't played with this new version yet, but my initial thought after reading the blog post is whether we could have a: $pages->find("$selector, field=all");

@adrian

Sounds like a good idea to me. I think it would be possible if your selector specifies which template(s) are in use, so it knows what fields it can autojoin, rather than attempting to autojoin all fields in the system. I can look into it further. Though should also mention that not all fields support autojoin, so it wouldn't technically be possible to autojoin "all", though it could accomplish it by using autojoin to load what it can and then manually load the rest, so it could still deliver pages to you with all of the fields populated and no further loading necessary.

Quote

For many pages/templates, I always need to access all fields on every load and this would make it much simpler than listing out all the fields separately.

Not sure if this applies to your case or not, but wanted to mention that if doing this for just the current/page being rendered, then it wouldn't be worthwhile. The efficiency benefits in autojoin come into play when lots of pages are being loaded. 

Quote

Do these codes do the same thing or they differ internally?

@Zeka

It's a similar option in terms of what it does, but also quite different. The new "field=..." option (or findJoin method) uses an entirely different page finding method that performs the find and load in the same query. (This is possible since you are specifically telling it what fields to join). Whereas the options.loadOptions.joinFields you mentioned is just an option to the pages.getById() method that occurs after the find() has already taken place, and it's really more intended as an internal option rather than a public API one. The newly added feature can do the same thing more efficiently, and easily enough that I thought it belonged in the public API. One other difference to mention is that the options.loadOptions.joinFields adds to the default autojoin fields for the loaded pages rather than overrides it. So it doesn't give you the level of control that the new option does unless you don't have any fields with the autojoin flag in your system. 

Quote

I remember that there were some issues with autojoining of multi-value fields with 'loadOptions' option in find method (like page reference or options fields etc). Are there any changes? 

No changes here. There's only so much that can be joined in one query per field, so autojoin is not always possible, especially on multi-value (FieldtypeMulti) fields with lots of values. It should be possible on most Page reference and options fields though, so long as they don't have huge amounts of selections. If there's need for it though, I may be able to have fields (where applicable) store a cache of data (like FieldtypeCombo does) that can be autojoined. 

Quote

I have found that if I specify a field that does not exist "findRaw" method starts to join default page properties:

Since you specified a field that does not exist, I think you are seeing the "no fields specified: load all" behavior here. Probably it would be better if it threw an Exception or just returned a blank result set. 

Quote

What is the right way to handle the situation when we need to autojoin some fields only for the current page. For example, SEO fields in most cases used only on the page object of the current page, but if I set these fields autojoin then than they will be autojoined on every find call, so then we have to specify that we don't want to join these fields in 'joinField' options or via newly added options that also is not very handy. 

Autojoin isn't very useful for the current page, like the one you are rendering. There isn't likely going to be a worthwhile difference in overhead between autojoin fields and load-on-demand fields when rendering the current page being viewed. The difference becomes worthwhile when working with lots of pages that all have the same field. For instance, think of the page-list in the admin—having the title field autojoin is useful there. For most people, they probably don't ever even need to think about autojoin. But for those of us interested in optimization, it does offer some worthwhile improvements in the right situations. 

I wouldn't bother with trying to autojoin your SEO fields since you would only use those rendering the page being viewed. This kind of micro-optimization can take up your time without delivering benefits you'll ever feel. Though if you wanted to load them all in one chunk, a Combo field would be a great way to do that. 

 

  • Like 3

Share this post


Link to post
Share on other sites
1 hour ago, ryan said:

I think it would be possible if your selector specifies which template(s) are in use, so it knows what fields it can autojoin, rather than attempting to autojoin all fields in the system.

Yes, I assumed it would check the template of each page to be returned and only autoload those relevant fields. 

2 hours ago, ryan said:

Not sure if this applies to your case or not, but wanted to mention that if doing this for just the current/page being rendered, then it wouldn't be worthwhile. The efficiency benefits in autojoin come into play when lots of pages are being loaded. 

Right, but there wouldn't be any downside either, would there? It's a good point of course. I suppose the autojoin would be more useful for example on a blog parent page (showing featured image, title, author, tags, date, summary etc) for the latest 10 posts, rather than an individual blog page showing all those fields plus the full body content.

  • Like 1

Share this post


Link to post
Share on other sites
Quote

Yes, I assumed it would check the template of each page to be returned and only autoload those relevant fields. 

In that case it would have to use the existing find() method, which does that: performs a find() to find the page IDs and template IDs, breaks them into different groups by template, and then performs query(ies) to then load the pages with the appropriate autojoin fields for the templates. 

The new find method used by findJoin() and "field=..." option is a lot simpler so it does all of it in 1 query, which is possible because you are specifying the fields to autojoin (rather than the template, so it doesn't need to figure that part out).

Quote

Right, but there wouldn't be any downside either, would there? It's a good point of course. I suppose the autojoin would be more useful for example on a blog parent page (showing featured image, title, author, tags, date, summary etc) for the latest 10 posts, rather than an individual blog page showing all those fields plus the full body content.

I don't yet know if there is a downside, as I've never tried to autojoin-all before. I'm going to look into it though. But I think there are potential downsides, as 1 big query with a lot of joins can sometimes be slower and harder for MySQL to cache than a lot of simple queries. MySQL also has limits on how many joins it can perform in a query and how much data can be concatenated from a multi-row table, so it's likely that autojoin of all fields might be less reliable in some cases. I guess I see autojoin as a good optimization for some (or many) cases where it's found to be reliable and increase performance, but not all cases, and not something to use by default.

  • Like 2

Share this post


Link to post
Share on other sites

This is great, but I've encountered a possible bug, unless I'm doing something wrong.

As a test I used it to create a simple site map using a recursive function -- besides the page title, I also wanted a 'headline' field, so it seemed a good candidate for an autojoin test.

When I ran it, however, there were some unexpected results (missing child pages and missing page names in urls) with the autojoin feature.

Here is the plaintext output from my tests (correct version first, then the incorrect version):

Notes:

  • C = hasChildren / numChildren (same result)
  • $p = current Page object being traversed
  • In both cases, the sort order is default (manual sort in the backend).
  • All pages use the same template (so definitely contain the fields being autojoined).
  • In all cases, both the 'title' and 'headline' fields were present in the output, so I omitted them below.
Spoiler

CORRECT RESULT:  $p->children()

#ID     C    Parent    URL
------------------------------------------------------------
1027    0         1    /about/
1018    2         1    /code/
1022    2      1018    /code/code-section-a/
1023    0      1022    /code/code-section-a/project-a/
1024    0      1022    /code/code-section-a/project-b/
1060    0      1018    /code/code-section-b/
1017    4         1    /rpgs/
1019    4      1017    /rpgs/rpg-section-a/
1062    0      1019    /rpgs/rpg-section-a/rpg-page-a/
1063    0      1019    /rpgs/rpg-section-a/rpg-page-b/
1064    0      1019    /rpgs/rpg-section-a/rpg-page-c/
1065    0      1019    /rpgs/rpg-section-a/rpg-page-d/
1020    0      1017    /rpgs/rpg-section-b/
1021    1      1017    /rpgs/rpg-section-c/
1039    0      1021    /rpgs/rpg-section-c/rpg-page-a/
1025    3      1017    /rpgs/rpg-section-d/
1031    0      1025    /rpgs/rpg-section-d/rpg-page-a/
1032    0      1025    /rpgs/rpg-section-d/rpg-page-b/
1033    0      1025    /rpgs/rpg-section-d/rpg-page-c/
1026    1         1    /gallery/
1030    1      1026    /gallery/gallery-section-a/
1028    2      1030    /gallery/gallery-section-a/album-a/
1029    0      1028    /gallery/gallery-section-a/album-a/photo-a/
1038    0      1028    /gallery/gallery-section-a/album-a/photo-b/


INCORRECT RESULT: $p->children('join=title|headline')

#ID     C    Parent    URL
------------------------------------------------------------
1027    0         1    //                        <-- incorrect url (missing name)
1018    2         1    /code/
1022    2      1018    /code/code-section-a/
1023    0      1022    /code/code-section-a/project-a/
1024    0      1022    /code/code-section-a/project-b/
1060    0      1018    /code/code-section-b/
1017    4         1    /rpgs/
1019    4      1017    /rpgs/rpg-section-a/
1062    0      1019    /rpgs/rpg-section-a//     <-- incorrect url (missing name)
1063    0      1019    /rpgs/rpg-section-a//     <-- incorrect url (missing name)
1064    0      1019    /rpgs/rpg-section-a//     <-- incorrect url (missing name)
1065    0      1019    /rpgs/rpg-section-a//     <-- incorrect url (missing name)
1020    0      1017    /rpgs/rpg-section-b/
1021    1      1017    /rpgs/rpg-section-c/
1039    0      1021    /rpgs/rpg-section-c//     <-- incorrect url (missing name)
1025    3      1017    /rpgs/rpg-section-d/
1031    0      1025    /rpgs/rpg-section-d/rpg-page-a/
1032    0      1025    /rpgs/rpg-section-d/rpg-page-b/
1033    0      1025    /rpgs/rpg-section-d/rpg-page-c/
1026    1         1    /gallery/
1030    0      1026    /gallery//                <-- incorrect url (missing name),
                                                     and shows 0 children instead of 1

This is the recursive function I'm using:

// Called with:  siteMap($pages->get(1))
function siteMap(Page $p)  {
    $str = '';

    $children = $p->children('join=title|headline'); // with autojoin
    // $children = $p->children(); // without autojoin

    foreach ($children as $c) {
        $str .= $c->id . "    " . $c->hasChildren() . "    " . $p->id . "    " . $c->url . "\n";

        if ($c->hasChildren()) {
            $str .= siteMap($c);
        }
    }

    return $str;
}

The following methods all produce the same incorrect results:

$p->children('join=title|headline');
wire('pages')->find('parent='.$p.', join=title|headline');
wire('pages')->findJoin('parent='.$p, 'title,headline');
wire('pages')->findJoin('parent='.$p, ['title','headline']);

Furthermore, it made no difference if I used "field" or "join" as the keyword in the selector.

  • ProcessWire 3.0.172 (DEV)
  • PHP 7.2.0
  • MySQL 5.7.19
Edited by LMD
Forgot a point.
  • Like 2

Share this post


Link to post
Share on other sites

@LMD Thanks, it looks like a last minute optimization screwup on my part, changing a $page->set() to a $page->setForced(), when there was bug in $page->setForced() preventing it from populating $page->settings when it should. I have pushed a fix for it on the dev branch. Please let me know if you continue to see any errors. 

  • Like 2

Share this post


Link to post
Share on other sites

Hi,

Many thanks Ryan for this update. It's great to have something like this in the core!

I have a question though: is the autojoin working with PageImages? I quickly tried on my setup but my PageImages field acts as if it was empty.

Here is the dump when using

$object = $pages->get("parent=$objects, $sort, field=gallery");
bd($object);

1096429880_Capturedecran2021-02-07a16_46_49.thumb.png.347e0c1841a63aadfbe2452bc99fc95c.png

 

 

Edited by monollonom
Second dump is useless (bd($object->gallery->first())

Share this post


Link to post
Share on other sites
3 hours ago, ryan said:

@LMD Thanks, it looks like a last minute optimization screwup on my part, changing a $page->set() to a $page->setForced(), when there was bug in $page->setForced() preventing it from populating $page->settings when it should. I have pushed a fix for it on the dev branch. Please let me know if you continue to see any errors. 

Thanks Ryan, the update fixed the problem.

Share this post


Link to post
Share on other sites

Great additions - many of these are things that we have setup our own solutions when hitting performance issues on large scale (10 000+ pages with lots of fields) so it is great to see core supporting this kind of things. Thank you Ryan!

Quick note on findRaw - it looks like it doesn't support parent? It would be very powerful to be using parent just like any normal field based page relation.

  • Like 2

Share this post


Link to post
Share on other sites
Quote

Since you specified a field that does not exist, I think you are seeing the "no fields specified: load all" behavior here. Probably it would be better if it threw an Exception or just returned a blank result set. 

Just my 2 cents: it should throw an exception here - this current logic is not good in my opinion.

  • Like 3

Share this post


Link to post
Share on other sites

@ryan Thank you for the continued effort you put into PW and the quick solution to LMD’s problem, which bit me too for a second 😄!

I just noticed there still seems to be an issue with the join/field selector. If used with pagination it appears to add the limit/pagesize to the total number of results, which distorts pagination. One too many pages are shown, the last of them being empty.

For example, I have this selector:

 template=show, datum>=2019-01-01, datum<2020-01-01, has_parent=1234, sort=-datum, limit=50

With the join fields getTotal() gives me 102, without only 52 (one show per week, makes sense). Interestingly, on the second pagination page, getTotal() will say 54 (?!) and the superfluous 3rd page disappears.

  • Like 1

Share this post


Link to post
Share on other sites
On 2/6/2021 at 8:19 AM, ryan said:
Quote

I remember that there were some issues with autojoining of multi-value fields with 'loadOptions' option in find method (like page reference or options fields etc). Are there any changes? 

No changes here. There's only so much that can be joined in one query per field, so autojoin is not always possible, especially on multi-value (FieldtypeMulti) fields with lots of values. It should be possible on most Page reference and options fields though, so long as they don't have huge amounts of selections. If there's need for it though, I may be able to have fields (where applicable) store a cache of data (like FieldtypeCombo does) that can be autojoined. 

I recall that in the past auto joining multi-value page fields would fail quietly and simply result in bad data being returned (like a single page instead of all pages). Can this be changed so that it throws an exception (or at least skips trying to auto join it) if you try to auto join a field that can't be auto joined? That would help avoid nasty issues for programmers who aren't familiar with the internal workings here.

  • Like 1

Share this post


Link to post
Share on other sites

@ryan I'm trying out the new join feature within a regular page selector. I believe you mentioned that it is possible to join subfields of page fields. I'm wondering if you can explain that in a bit more detail. Does this still all happen within a single SQL query?

If I want to join the id and the title of a page in a page reference field, which of the following is the proper way to do this?

  1. $pages->find("template=foo, join=pageField|pageField.title");
  2. $pages->find("template=foo, join=pageField.title"); // Can I join pageField.title without also joining the pageField? If so, would I still have access to the page id?
  3. $pages->find("template=foo, join=pageField.id|pageField.title") // Is joining pageField.id the same thing as joining pageField?

I could answer these questions for myself if I knew of a good way of checking which data has already been loaded in the resulting page. What is the best way to double check what field data is already loaded on the page objects after performing a find (in order to make sure the join worked properly)?

Edit: I just realized I wasn't thinking clearly when I wrote this. Being able to join subfields of page reference fields would require loading those pages as their own separate page objects, which would always include the page ID, name, modified, and created dates. I am still wondering whether doing either join=pageField or join=pageField.title would actually cause those pages to be preloaded as part of the find, or whether it would only lazy load them when first requested, per the normal PW behavior.

Edited by thetuningspoon
Brain fart
  • Like 1

Share this post


Link to post
Share on other sites

@ryan - what do you think about having a referencesRaw() page method. If you're outputting a long list of pages referenced to another page, often all you need are the title and url field values. As an example, I am outputting a list of publications for a staff member. I believe this would really speed things up.

Thanks for considering.

  • Like 3

Share this post


Link to post
Share on other sites

@ryan - I just discovered that findRaw() won't return the "url" field. I suppose that is to be expected given that it's not a DB stored field, but rather calculated at runtime, but it does still somewhat limit the functionality of findRaw in LOTS of useful situations. Do you think it would make sense to support this, or would it just slows things down too much?

The other thing is I am wondering if you'd consider an option to return a standard PHP object rather than an array? It would make it so much easier to switch between find() and findRaw()

Thanks.

  • Like 3

Share this post


Link to post
Share on other sites

I've done something similar on my extension module for PageHitCounter where only page IDs of page hits are stored. You need to enable PagePath module and then you can join page paths to page ids easily and efficiently. The page paths are kept in sync by the module, cached in the DB and therefore do not need to be "calculated" at runtime on every item.

  • Like 1

Share this post


Link to post
Share on other sites

Hi @bernhard - thanks for chiming in, but I'm afraid I don't really understand what you are responding to - is it in response to my comment just above, or to @thetuningspoon or ?

That said, I am curious what your extension to PageHitCounter actually does - I'm a bit confused about your description as well.

Sorry, if I am being a bit slow this morning 🙂

Share this post


Link to post
Share on other sites

Hi @adrian Sorry for being too briefly 😄 

This is what I'm talking about:

tFC3ju1.png

My module hooks into the pagehitcounter's method that logs requests to pages. I had to find a way to get the page path that relates to a given page id efficiently and easily. Otherwise I'd just have a list of page ids shown to the user (which is not really helpful) or the other option would have been to store page paths to the DB or calculate them on runtime (also not ideal).

The solution I came up with is to install the PagePath module. That module creates a list of all pages that live on the system and keeps that list in sync with pages (eg when changing a pagename etc). You get a table in the DB that holds the page ID and the corresponding page path.

If you don't have that module installed, there is no way (as far as I know) to get that information from the database (well, I guess that's the reason why PagePaths module exists so I'm quite confident with that assumption 😄 ). RockFinder has the same limitation and I've taken another approach there, see the addPath() method: https://github.com/baumrock/rockfinder3#addpath

Does that clarify things? 🙂 

  • Like 1

Share this post


Link to post
Share on other sites

Hi @bernhard - yeah, thanks. I never really looked at the PagePaths module except for really early on in my PW days - it didn't seem like it solved a problem I had. It's description says: "Enables page paths/urls to be queryable by selectors" but we already query a page by it's path, eg: 

image.png.d32b4f1f7c145c24023c39a9ebf4a1af.png

Then I installed PagePaths and it didn't seem to significantly speed up the query, but maybe I need to test on a site with a lot of pages.

I understand that for my request for returning the URL via findRaw() that the pages_path table should make this possible (and very performant), but is there something it can actually do in terms of selectors that I am not understanding?

 

Share this post


Link to post
Share on other sites
39 minutes ago, adrian said:

but is there something it can actually do in terms of selectors that I am not understanding?

I can't answer that question - I've only (ab?)used it for listing id/path relationship of pages 🙂 

Share this post


Link to post
Share on other sites

@adrian @bernhard I've added support for getting 'url' and 'path' from $pages->findRaw() in the latest commit, but it requires the PagePaths module be installed. Now just need multi-language URL support for that module.

  • Like 5

Share this post


Link to post
Share on other sites

Awesome @ryan - thank you!

Can you please explain:

1) How PagePaths actually improves / changes searching by path in selectors

2) Is there any significant overhead having it installed, ie why isn't it installed by default?

Thanks!

  • Like 1

Share this post


Link to post
Share on other sites
On 3/6/2021 at 5:57 PM, adrian said:

The other thing is I am wondering if you'd consider an option to return a standard PHP object rather than an array? It would make it so much easier to switch between find() and findRaw()

@ryan - also, do you have any thoughts about this request - something you think makes sense, or are you not willing to consider it? Thanks.

  • Like 1

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

  • Recently Browsing   0 members

    No registered users viewing this page.

×
×
  • Create New...