bernhard Posted December 14, 2017
I'm still looking for the best solution for my datatables module and I think there is no way around using direct SQL queries... That makes the setup more complicated, but it saves a lot of other trouble like proper caching, cache maintenance etc. Unfortunately such SQL queries can get quite complex - unless I'm missing some PW magic that makes it easier? This is what I have so far to query all pages of type "basic-page":

select
    id,
    title.data as title,
    headline.data as headline,
    body.data as body,
    (select group_concat(data separator ',') from field_test_page where pages_id = p.id) as test_page,
    test_repeater.data as test_repeater
from pages as p
left join field_title as title on title.pages_id = p.id
left join field_headline as headline on headline.pages_id = p.id
left join field_body as body on body.pages_id = p.id
left join field_test_repeater as test_repeater on test_repeater.pages_id = p.id
where p.templates_id = 29

The repeater is a little tricky, for example. It gets even worse when you have to handle multilanguage fields or when you want references to other pages' field values...

edit: maybe better like this?

select
    id,
    (select data from field_title where pages_id = p.id) as title,
    (select data from field_field1 where pages_id = p.id) as field1,
    (select data11041 from field_field1 where pages_id = p.id) as field1_de,
    parent_id,
    (select data from field_title where pages_id = p.parent_id) as parent_title
from pages as p
where p.templates_id = 44

Thanks for any hints!
1 Link to comment Share on other sites More sharing options...
dragan Posted December 15, 2017
Not sure about repeaters, but to find all pages that use the template basic-page, you could do this:

SET @tid = (SELECT id FROM templates WHERE name = 'basic-page');
SELECT p.id, p.name
FROM pages AS p
WHERE p.templates_id = @tid;

or this (includes page titles):

SET @tid = (SELECT id FROM templates WHERE name = 'basic-page');
SELECT p.id, p.name, ft.data AS title
FROM pages AS p
JOIN field_title AS ft ON ft.pages_id = p.id
WHERE p.templates_id = @tid;
bernhard Posted December 15, 2017 Author Share Posted December 15, 2017 thanks for your suggestion dragan! i modified your example a little bit. it's also clear and easy to extend with other fields... but still i find this one easier to read and maintain (having one line for each field vs. in your example (and my previous tries) it's 3 lines for each field): select id, (select data from field_title where pages_id = p.id) as title, (select data from field_field1 where pages_id = p.id) as field1, (select data11041 from field_field1 where pages_id = p.id) as field1_de, parent_id, (select data from field_title where pages_id = p.parent_id) as parent_title, (select group_concat(data separator ',') from field_test_page where pages_id = p.id) as test_page from pages as p where p.templates_id = 44 i did some tests and it seems that both have the same performance... 65ms for a query over all 10.000 rows 1 hour ago, dragan said: Not sure about repeaters See "test_page" in the example above - also not that hard to get. Only problem could be special characters and quotes - not sure about that... but i think in that situations it would be the best to add a field that gets populated via a save hook then you could populate whatever value you want easily via PHP. Link to comment Share on other sites More sharing options...
kongondo Posted December 15, 2017
16 hours ago, bernhard said:
I think there is no way around using direct sql queries..
Why is this the case?
bernhard Posted December 15, 2017 (Author)
Because it's a LOT more performant. Constructing my data string via pages->find() and a foreach takes 16 seconds for 10,000 rows and 5 columns (and needs some extra seconds for every column added), whereas querying the database directly takes only some milliseconds. My other idea was to cache the table rows on the dedicated templates, but that leads to problems when you have a "parent" or "category" column, because then it would take several seconds to recreate the table's cache whenever the name of the category changes (updating up to thousands of rows' cache). It also leads to a lot of redundant data. All of those problems are solved when the DB is queried directly.
kongondo Posted December 15, 2017 (edited)
56 minutes ago, bernhard said:
because it's a LOT more performant. constructing my datastring via pages->find() and a foreach takes 16 seconds for 10.000 rows and 5 columns (needing some extra seconds for every column added) whereas querying the database directly needs only some milliseconds.
OK, I see. Maybe you've already discussed this elsewhere; if that's the case, apologies. Why would you need to query 10,000 rows at the same time? I suppose because you want to cache the data to make subsequent fetching faster. However, wouldn't pagination work better in this case (if you were to use a $pages->find() with limit)? You could even combine that with incremental caching, both server- and client-side, if you wanted. If you don't want pagination, there is also virtual pagination or infinite scrolling. I am just thinking out loud here. I haven't tested your datatables module yet. If I could avoid writing SQL queries myself, I'd go for it.
Edit: I see you've already discussed this in your other module's thread. So, we better continue there.
Edited December 15, 2017 by kongondo
bernhard Posted December 15, 2017 (Author)
Thanks kongondo, all valid points. The reason I want to have all data loaded at once is that it makes the development of my module a lot easier and has some huge benefits over loading data paginated via AJAX. When I have all the data available on the client, I can use datatables' API to filter, sort, query, draw charts etc.; that would be very hard to achieve using server-side techniques. I was also trying to avoid direct SQL queries - that's why I took other approaches for my first two versions of the module. But it turns out that all of those approaches have some really big drawbacks. Actually, building the queries in SQL manually is not as difficult as I thought (the genius API and the easy page->find() operations were among the main reasons I fell in love with ProcessWire). And I have some nice ideas how to make it even more comfortable and easy. I'm quite sure it will be really easy to use for everybody.
bernhard Posted December 17, 2017 (Author)
After seeing that @adrian also used joins in this example (https://processwire.com/talk/topic/15524-preview-rockdatatables/?do=findComment&comment=158104) I did a quick google, and it seems that there is a slight performance benefit in using joins over subqueries. When writing queries manually (without any helper functions like in adrian's example) I think it is better to use subqueries, as they are a lot easier to write, read and maintain, and they are less likely to return wrong results imho. See also https://stackoverflow.com/questions/2577174/join-vs-sub-query

I also did a quick test on my setup:

t();
$rows = [];
$result = $this->db->query("select
    id,
    title.data as title,
    field1.data as field1
    from pages as p
    left join field_title as title on title.pages_id = p.id
    left join field_field1 as field1 on field1.pages_id = p.id
    where p.templates_id = 44");
while($row = $result->fetch_object()) { $rows[] = $row; }
d($rows);
d(t(), 'joins');

$rows = [];
$result = $this->db->query("select
    id,
    (select data from field_title where pages_id = p.id) as title,
    (select data from field_field1 where pages_id = p.id) as field1
    from pages as p
    where p.templates_id = 44");
while($row = $result->fetch_object()) { $rows[] = $row; }
d($rows);
d(t(), 'subqueries');

Most of the time the joins were a little faster, but not always - so I think it should not matter performance-wise which one you use.
adrian Posted December 17, 2017
Personally I have just always felt more comfortable with JOINs, so I totally understand where the OP in that SO thread is coming from. Don't forget about:

$rows = $result->fetchAll(PDO::FETCH_ASSOC);

That will save the time required to loop through all 10,000 results to populate an array. Also, you should probably be using $database (PDO) rather than $db (mysqli).
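For illustration, adrian's two suggestions combined might look something like this - a sketch, reusing the query and field names from bernhard's test above, and assuming a ProcessWire context where $database is the WireDatabasePDO instance:

```php
<?php
// Sketch: the same benchmark query via ProcessWire's PDO layer ($database)
// instead of mysqli ($db). Table/field names come from the examples above.
$sql = "SELECT
            p.id,
            title.data  AS title,
            field1.data AS field1
        FROM pages AS p
        LEFT JOIN field_title  AS title  ON title.pages_id  = p.id
        LEFT JOIN field_field1 AS field1 ON field1.pages_id = p.id
        WHERE p.templates_id = :tid";

$query = $database->prepare($sql);      // prepared statement, template id bound safely
$query->execute([':tid' => 44]);

// fetchAll() builds the whole result array in one call - no manual while() loop
$rows = $query->fetchAll(PDO::FETCH_ASSOC);
```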
bernhard Posted December 17, 2017 (Author)
On 17.12.2017 at 5:41 PM, adrian said:
That will save the time required to loop through all 10,000 results to populate an array
Thx, this was only for testing.
On 17.12.2017 at 5:41 PM, adrian said:
Also, you should probably be using $database (PDO), rather than $db (mysqli)
Thx, didn't know that! Writing an answer in the datatables thread - I'm working on your findArray hook.
bernhard Posted January 24, 2018 (Author)
Ok, today I hit a very strange problem regarding those mentioned SQL queries. I built a little process module that shows the result of my queries. The queries themselves are stored in PHP files, which makes it possible to create different VIEWs (just like DB views) and include them easily via PHP's include(). Simple example:

// file allpages.php
return "SELECT * FROM pages";

// file publishedpages.php
$allpages = include(__DIR__ . '/allpages.php');
return "SELECT * FROM ($allpages) AS allpages WHERE allpages.status = 1";

This seemed to work great, so I continued to create some more complex queries like the ones mentioned above:

SELECT
    effort.id AS id
    ,effort.status AS status
    ,(SELECT data FROM field_rockprojecteffort_date WHERE pages_id = effort.id) AS date
    ,(SELECT pages_id FROM field_rockproject_efforts WHERE data = effort.id) AS projectid
FROM pages AS effort
WHERE templates_id = 95

Template id 95 is for "rockprojectefforts", that is, items for all kinds of actions related to a project. So far, so good. Now the strange part: I want to create a combined listing of all projects and all efforts (project, firsteffortdate (= when the project started), lasteffortdate (= when it ends), etc.)...

<?php
$projects = include(__DIR__ . '/allprojects.php');
$efforts = include(__DIR__ . '/allefforts-slow.php');
return "SELECT
    *
    ,(SELECT `date` FROM ($efforts) AS `efforts` WHERE `projectid` = `projects`.`id` LIMIT 1) AS `fromdate`
    FROM ($projects) AS `projects`
    WHERE `projects`.`status` = 1
";

This works, but it is terribly slow (3.6 seconds)!
The included SLOW file looks like this:

<?php
// old non-performant query
return "SELECT
    `effort`.`id` AS `id`
    ,`effort`.`status` AS `status`

    /* data for the effort */
    ,(SELECT `data` FROM `field_title` WHERE `pages_id` = `effort`.`id`) AS `title`
    ,(SELECT `data` FROM `field_rockprojecteffort_date` WHERE `pages_id` = `effort`.`id`) AS `date`
    ,(SELECT `data` FROM `field_rockprojecteffort_price` WHERE `pages_id` = `effort`.`id`) AS `price`
    ,(SELECT `data` FROM `field_rockprojecteffort_vat` WHERE `pages_id` = `effort`.`id`) AS `vat`
    ,(SELECT `data` FROM `field_rockprojecteffort_amount` WHERE `pages_id` = `effort`.`id`) AS `amount`

    /* data for the corresponding project */
    ,(SELECT `pages_id` FROM `field_rockproject_efforts` WHERE `data` = `effort`.`id`) AS `projectid`
    ,(SELECT `data` FROM `field_title` WHERE `pages_id` = `projectid`) AS `projecttitle`
    ,(SELECT `data` FROM `field_rockproject_status` WHERE `pages_id` = `projectid`) AS `projectstatus`

    /* we also need the clientid to search for the first effort of a client */
    ,(SELECT `data` FROM `field_rockproject_client` WHERE `pages_id` = `projectid`) AS `clientid`
FROM `pages` AS `effort`
WHERE `templates_id` = 95
";

Whereas this file works just fine (0.03 seconds):

<?php
return "SELECT
    pages.id AS id
    ,pages.status AS status
    ,date.data AS date
    ,efforts.pages_id AS projectid
    ,title.data AS title
FROM pages
JOIN field_rockprojecteffort_date AS date ON date.pages_id = pages.id
JOIN field_rockproject_efforts AS efforts ON efforts.data = pages.id
JOIN field_title AS title ON title.pages_id = pages.id
WHERE templates_id = 95
";

Another strange thing is that both queries as single queries work just fine (one needs 0.07 seconds, the slower one 0.11 seconds - but I don't think that is the issue). I can live with the JOIN even though it is a little less readable and more complicated to set up, but I would really love to know what is going on and why this is the case.
BTW: on the second reload everything is even quicker, so I guess there is some caching in the background. I did not find any options for disabling the cache in the WireDatabasePDO class - could this be a MySQL-internal cache? Thanks for your time!
flydev Posted January 24, 2018 (edited)
48 minutes ago, bernhard said:
could this be a mysql internal cache?
Yes, it's called the MySQL Query Cache. To disable it for a connection:

SET SESSION query_cache_type = OFF;

More details here:
https://www.dbrnd.com/2015/08/mysql-query-cache/
https://www.dbrnd.com/2015/08/mysql-query-cache-configuration/
Edited January 24, 2018 by flydev
FrancisChung Posted January 24, 2018
@bernhard, have you looked at the query plan for both SQL statements using a profiling tool or SQL management tool? I had a quick look at your SQL, and for the type of select query you're doing (lots of fields from lots of tables), an inner join is (almost) always going to be faster than subqueries. There are instances where subqueries can be quicker than a join: if the subset of data returned by a subquery is much smaller than the table used in a join (1,000 rows vs. a million-row table, for example), that is where subqueries can really shine. I'm guessing your tables are not like that? If you can use EXISTS / NOT EXISTS with a subquery, it's probably a good candidate as well, for the aforementioned reasons. Normally I'm a big advocate of readability of code, but I draw the line at SQL, because very often this is where the bottlenecks for entire applications and workflows sit, and it's the one place where a poorly optimised query can really cause issues down the line in too many ways. More readable but less optimised code might only cost you a millisecond or a few seconds; poorly optimised queries can cost you hours or even days, depending on the dataset. Proper indenting, good (SQL) coding standards and concise & detailed commenting can go a long way to improve the readability of those yucky joins.
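For illustration, the EXISTS pattern mentioned above could look like this against the table names used earlier in this thread - a sketch only: the semi-join stops at the first matching row instead of materialising the subquery's full result:

```sql
-- Sketch: find all pages of template 95 that have at least one related
-- effort reference, using an EXISTS semi-join. Table names are the ones
-- appearing earlier in this thread.
SELECT p.id
FROM pages AS p
WHERE p.templates_id = 95
  AND EXISTS (
      SELECT 1
      FROM field_rockproject_efforts AS ef
      WHERE ef.data = p.id
  );
```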
bernhard Posted January 25, 2018 (Author)
18 hours ago, flydev said:
Yes, its called MySQL Query Cache
Thanks! Didn't know about that. This worked to turn it off:

SET GLOBAL query_cache_size = 0;

17 hours ago, FrancisChung said:
@bernhard, have you looked at the query plan for both SQL statements using a profiling tool or SQL Management tool?
No, never heard about such tools. My experience with SQL is quite limited, so I'd be very thankful for some helpful links in this regard. Thank you both for your helpful answers!
SamC Posted January 25, 2018
4 hours ago, bernhard said:
My experience with SQL is quite limited
Sitting here learning MySQL for the first time (along with PHP, i.e. getting some actual core skills...), typing into the monitor stuff like ALTER and ADD. Then reading this thread. Then realising that I've got a long road ahead. Gonna put the kettle on.
FrancisChung Posted January 25, 2018
I'm very well versed in MS SQL Server, not as much in MySQL, but the underlying principles are the same. Pretty succinct explanation here: https://www.sitepoint.com/using-explain-to-write-better-mysql-queries/ And to get in deeper: https://dev.mysql.com/doc/refman/5.5/en/using-explain.html It may also be beneficial to create indexes for some of the fields you're joining on (e.g. efforts.data, pages_id), but best to check the query plan with the EXPLAIN command as above, since your mileage may vary. Indexes with low cardinality (i.e. a low number of unique values) usually perform worse than not having any indexes at all.
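In practice that means prefixing the query with EXPLAIN and only adding an index where the plan shows a full scan - a sketch, using the join from the fast query earlier in this thread (the index name is made up, and PW may already index this column; EXPLAIN will tell you):

```sql
-- Inspect the plan: watch the "type" column (ALL = full table scan)
-- and the "rows" estimate for each table.
EXPLAIN
SELECT pages.id, efforts.pages_id AS projectid
FROM pages
JOIN field_rockproject_efforts AS efforts ON efforts.data = pages.id
WHERE templates_id = 95;

-- If the plan shows a scan on field_rockproject_efforts, an index on
-- the join column may help (index name here is made up):
CREATE INDEX idx_rockproject_efforts_data ON field_rockproject_efforts (data);
```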
adrian Posted January 25, 2018
@bernhard - not sure if this is useful or not, but it shows you the SQL that the PageFinder uses for a given selector.
SamC Posted January 26, 2018
@adrian that's awesome! Need to try this; it will be very helpful to see what queries are actually being run. Thanks. @bernhard I was using MySQL Workbench yesterday, maybe that would be of use to you? You can get visuals etc. I've found it really useful so far. https://dev.mysql.com/doc/workbench/en/wb-performance-explain.html
bernhard Posted January 26, 2018 (Author)
@adrian thanks, I was aware of that, but I don't think it is of big help in my case. @SamC thanks, looks interesting. I've always worked with phpMyAdmin (on most hostings) and HeidiSQL (for local dev with Laragon). The visual explain analyzer looks interesting though. It seems HeidiSQL should also have this feature, but it didn't work here on my first try... I'll play around with that in the future - for now I'm happy with my results, since I decreased the time for the query from a terrible 90s to 900ms (uncached). Thanks for all your input!
Pixrael Posted January 26, 2018
@bernhard Incredible results! Can you summarize which option you used finally?
bernhard Posted January 26, 2018 (Author)
@Pixrael the summary is basically: first do a really bad job with lots of foreach etc., then refactor everything to proper SQL statements that join the related data together in a fraction of the time. At least I'm on step two of "make it work, make it fast, make it pretty". Jokes aside: the background is that I have quite some performance issues with my CRM - there is an admin page that lists all projects and all their revenues. The revenues are themselves pages with additional information (like type of revenue (internal/external/etc.), date and so on). This way it is possible to do all kinds of great stuff with that data (fake data here). Listing all those related items is not so easy; that's why I used the regular PW API before, looping over all projects, summing up all related revenues etc. And of course that is terribly inefficient - but a lot easier than setting up SQL queries with joins, sum(), concat() etc. Since all the joins etc. are quite complex, I created a little ProcessModule that will be part of my datatables module (planned for this year) that lists the queries instantly, makes it possible to combine different views and - what sets it apart from regular SQL tools - can interpret PHP.
So you can for example create custom DB views with variable setups:

Spoiler:

<?php
/**
 * list that shows all revenues related to a project
 */
$s = 0;
$dt3 = $this->modules->get('RockDatatables3');
$year = 2018;
$query = '';

// create queries for all efforttypes
$field = $this->fields->get('rockprojecteffort_type');
foreach($field->type->getOptions($field)->each('value') as $eftypeid => $eftype) {
    if($eftype == 'invoiceitem') continue; // skip "rechnungsposten"
    $s++;

    // values
    $query .= ",(";
    $query .= "SELECT sum(net) FROM {$dt3->getViewQuery('allrevenues')} AS ef_subquery_$s
        WHERE ef_subquery_$s.projectid = sub_revenues.projectid
        AND DATE_FORMAT(ef_subquery_$s.date, '%Y') = $year
        AND type = $eftypeid
    ";
    $query .= ") AS {$eftype}_0";

    // values for each month
    for($i = 1; $i <= 12; $i++) {
        $s++;
        $monthstr = $year.$i;
        $query .= ",(";
        $query .= "SELECT sum(net) FROM {$dt3->getViewQuery('allrevenues')} AS ef_subquery_$s
            WHERE ef_subquery_$s.projectid = sub_revenues.projectid
            AND DATE_FORMAT(ef_subquery_$s.date, '%Y%c') = $monthstr
            AND type = $eftypeid
        ";
        $query .= ") AS {$eftype}_$i";
    }
}

// get revenues
$s++;
$revenues = "SELECT
    projectid
    ,sum(net) AS nettotal
    ,sum(gross) AS grosstotal
    $query
    FROM {$dt3->getViewQuery('allrevenues')} AS sub_revenues
    WHERE projectpwstatus = 1 /* only published projects */
    GROUP BY projectid
";

/**
 * main query
 */
return "SELECT
    *
    FROM {$dt3->getViewQuery('allprojects')} AS projects
    LEFT JOIN ($revenues) AS revenues ON revenues.projectid = projects.id
";

Resulting in this:

PS: Just realized that this topic went quite far from the initial title (best query for pages->find()), so I'll change it. @adrian's suggestion was of course spot on for the old thread title.
dragan Posted January 26, 2018
On 25.1.2018 at 2:38 PM, SamC said:
Then realising that I got a long road ahead
Don't want to divert this thread, but I first got my feet wet with more advanced SQL stuff reading https://philip.greenspun.com/sql/index.html He used AOLserver and Oracle, but the underlying principles and problems are similar across the entire DB world, I guess. What's more, the guy was involved with really huge companies and lists some real-life problems and solution approaches. It was kinda fun to read back in the day (the nerd kind of fun). tl;dr: just keep on learning - today there are a million times more learning opportunities than 30 years ago (free software, online tutorials/courses etc., cheap hosting, and faster internet than dialup...).
FrancisChung Posted January 26, 2018
@bernhard, you might also consider using views or stored procedures to encapsulate the logic in the DB rather than in code. I'm guessing that the user might be filtering the returned dataset, but that's mainly done on the client side, not the server/DB side?
bernhard Posted January 27, 2018 (Author)
16 hours ago, FrancisChung said:
@bernhard, you might also consider using Views or Stored Procedures to encapsulate the logic in the DB rather than code. I'm guessing that the user might be filtering on the dataset returned, but that's mainly done on the client side, not the server/db side?
Thanks, I've already used VIEWs for some of the operations, but the problem I had was that the definition of the view was done inside phpMyAdmin, and this was totally unmaintainable! See here: So maybe there are better tools for such tasks? Not only is the definition of the view a one-line string, I also cannot add any comments to it. My "dynamically created views" that I have now are far better to read and can also use PHP (variables and functions). The latter would not be possible using standard views, would it? I'm talking about something like this:

<?php
// demo-sql-query.php
$year = 2018;
$months = '';
for($i = 1; $i <= 12; $i++) {
    $months .= ",(SELECT month_$i FROM demo WHERE year = $year) AS month_$i";
}
$sql = "SELECT
    id
    $months
    FROM pages
    LEFT JOIN ...
    WHERE ...
";

// result
SELECT
    id
    ,(SELECT month_1 FROM demo WHERE year = 2018) AS month_1
    ...
    ,(SELECT month_12 FROM demo WHERE year = 2018) AS month_12
FROM ...

The problem with that technique is that it is always hard to debug, since you have to execute the command and cannot just click on the view in phpMyAdmin. That's why I created the process module to list those "views": I just select "demo-sql-query.php" and see the resulting table ("view"). But I'm totally open to better solutions and learning new things.
FrancisChung Posted January 27, 2018
@bernhard, have you tried using another tool, like MySQL Workbench for example? Perhaps it has inbuilt formatting that is better? Did you create the views inside phpMyAdmin? If you go down the view / stored procedure route, the normal workflow is to use DB scripts that create the views / stored procedures on the DB. So if you need to make a change to the logic, you would edit the view / stored procedure logic in the script and then run the script against the DB. These scripts can be generated using the generate-script/schema functionality of DB management tools like MySQL Workbench etc. They normally have the following structure (pseudo code):

IF <DB Object> EXISTS THEN
    DROP <DB Object>
CREATE <DB Object>
    <Business logic here>
GRANT <DB Object> permissions

Having these scripts will make automated deployment easier later on. Also, if you create your objects this way, it may solve your issue with the formatting, because you have a direct hand in formatting the SQL that's being used.
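In MySQL, that pseudo-code pattern could look like the following sketch. The view name and the grantee are made up for illustration; the view body is the fast JOIN query from earlier in this thread, and the GRANT assumes the default database is selected:

```sql
-- Sketch of a versionable DB script: drop-if-exists, recreate, grant.
DROP VIEW IF EXISTS all_efforts;

CREATE VIEW all_efforts AS
SELECT
    pages.id     AS id,
    pages.status AS status,
    d.data       AS date,
    ef.pages_id  AS projectid,
    t.data       AS title
FROM pages
JOIN field_rockprojecteffort_date AS d  ON d.pages_id = pages.id
JOIN field_rockproject_efforts    AS ef ON ef.data    = pages.id
JOIN field_title                  AS t  ON t.pages_id = pages.id
WHERE pages.templates_id = 95;

-- grant read access to the application user (user name is made up)
GRANT SELECT ON all_efforts TO 'pwapp'@'localhost';
```

Kept under version control, such a script can be re-run whenever the view logic changes, which is what makes the automated deployment mentioned above possible.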