
Creating thousands of pages via the API


David Beesley


Hi All,

On one of our PW sites, we are importing data from an external API into the ProcessWire database.

The import contains around 450 "Objects" and on average each of these objects has around 20 sub-objects.

This means I am creating around 9,500 pages during the initial full import (subsequent imports only make amendments and additions).

The problem here is that I seem to be hitting the PHP memory limit.

I have already upped it to 256MB, and don't really want to up it further.

 

I am nulling and unsetting everything I feasibly can, so the memory bloat now appears to be in the page-creation process itself.
I am nulling the object after each page is fully saved, and can see the destructor method running, so I cannot fathom why I still have this memory bloat.

The bloat appears to be around 120KB per page created, which means I'd currently need over 1GB dedicated to this process for it to complete.

Is there something I have missed?


No joy I'm afraid :(

memory usage BEFORE page creation: 19656360
memory usage AFTER page creation: 19735712
memory usage BEFORE page creation: 19735712
memory usage AFTER page creation: 19815680
memory usage BEFORE page creation: 20493800
memory usage AFTER page creation: 20652664
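
(For context: figures like these can be captured with memory_get_usage() immediately before and after each save. A minimal sketch, where $new_page stands in for the page object in the import loop:)

echo "memory usage BEFORE page creation: " . memory_get_usage() . "\n";
$new_page->save();  // placeholder for whatever save happens in the loop
echo "memory usage AFTER page creation: " . memory_get_usage() . "\n";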

 


Do you have $config->debug enabled? If yes, does setting it to false change memory consumption?

Also, you may try calling gc_collect_cycles() at the end of the loop after uncaching the page, to allow PHP to garbage collect unused objects with circular references.
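
Combined, those two ideas look roughly like this inside the import loop. This is only a sketch: $apiObjects and createPageFromItem() are placeholders, and $config->debug would normally be switched off in site/config.php rather than at runtime.

foreach($apiObjects as $item) {             // $apiObjects: placeholder for the imported API data
    $new_page = createPageFromItem($item);  // hypothetical helper that maps fields and saves the page
    $pages->uncache($new_page);             // drop just this page from PW's in-memory page cache
    gc_collect_cycles();                    // let PHP collect objects with circular references
}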

  • Like 5

@David Beesley

One approach I've used before when importing thousands of records that need to be made into pages is to stash the data from the external API into a work queue (php-resque or one of the WireQueue implementations) and have cron-initiated workers regularly batch-process a certain number of queue entries, creating pages for them, before exiting. I've done this for a client in the past and it works very well.
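
For illustration only (this is not WireQueue's actual API; the file name, batch size, template and parent are assumptions), the batch-worker idea can be sketched as a cron-run bootstrap script that processes a fixed number of queued items per run, so no single run has to hold all 9,500 pages in memory:

<?php namespace ProcessWire;
include("./index.php");                        // bootstrap ProcessWire (path is an assumption)

$queueFile = __DIR__ . '/import-queue.json';   // hypothetical work queue dumped from the external API
$batchSize = 200;                              // items to process per cron run

$queue = json_decode(file_get_contents($queueFile), true) ?: [];
$batch = array_splice($queue, 0, $batchSize);  // take one batch off the front of the queue

foreach($batch as $item) {
    $p = new Page();
    $p->template = 'import-item';              // hypothetical template
    $p->parent = wire('pages')->get('/data/'); // hypothetical parent page
    $p->title = $item['title'];
    $p->save();
    wire('pages')->uncacheAll();               // keep memory flat between iterations
}

file_put_contents($queueFile, json_encode($queue)); // write the remaining items back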

  • Like 5

Hey @netcarver

Thanks for bringing WireQueue to my attention

I'll have a look through the forum thread to see how this could be implemented in my case.

@BitPoet

I have tried turning debugging off, and I end each foreach loop thusly:

$page = null; // null out page object
unset($page); // remove pointer to page object (null) in memory

wire('pages')->uncacheAll(); // clear the page cache

gc_collect_cycles(); // force invoke the garbage collector

But there is no discernible change in memory usage

 

@kongondo

I'm unsure how I would implement your example.

My page creation process is 

$page = wire('pages')->get("parent={$parent_page->id}, this_id={$page['id']}");  // where this_id & $page['id'] are this entity's id in the API

if(!$page->id)
{
	// CREATE new page						
	$new_page = new Page();
	... // Some generic fields
			
	// save
	$new_page->save();

	$page = wire('pages')->get($new_page->id); // Set my page object to this new page
}

... // Set some more fields based on API data

// save
$page->save();

(nulling and unsetting omitted for clarity)

 

Thanks all for your help so far. This is a great community :)

  • Like 1

After seeing your code, the solution @kongondo pointed at doesn't really apply here, because you're not creating a page array but looping through the data that comes from the API.

Any reason why you're not using the same variable to create the page, add the fields, and save? You would spare a call to the database on each iteration:

$page = wire('pages')->get("parent={$parent_page->id}, this_id={$page['id']}");  // where this_id & $page['id'] are this entity's id in the API

if(!$page->id)
{
    // CREATE new page                        
    $page = new Page(); // <- removed the $new_page variable
    ... // Some generic fields
            
    // save
    $page->save();

    // not needed -> $page = wire('pages')->get($new_page->id);
}

... // Set some more fields based on API data

// save
$page->save();
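
One side note on the elided "Some generic fields" part: a brand-new Page needs at least a template and a parent before its first save. A sketch with placeholder names:

// minimal fields before the first save (template/field names are placeholders)
$page->template = 'api-object';
$page->parent = $parent_page;
$page->title = $apiData['title'];
$page->name = wire('sanitizer')->pageName($apiData['title'], true); // optional; PW derives one from the title otherwise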

 

  • Like 5

One thing I did on a past site was to get a database copy from the live site, restore it on a local machine, do all the import work locally, and then restore the result back to the live site. I had around 20,000 items I was importing and ran into timeout/memory issues. I was also pre-processing images, which was helpful as I was able to upload those as well. My current server sometimes eats up the RAM due to image variations for large images, but I currently run that site on a server with 15GB.

If this is a one-time deal you may be able to approach it this way; otherwise there might be some more server/PHP/Apache memory settings you can tweak to get usage down.
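
For example (a sketch; whether ini_set() is honoured depends on the hosting setup), the limit can be raised for the import script alone rather than globally:

// at the top of the import script only, so the rest of the site keeps the normal limit
ini_set('memory_limit', '1024M'); // may be ignored if the host disallows overriding it
set_time_limit(0);                // remove the execution time limit for this run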

  • Like 2

@diogo Thanks for the update. I'll be honest with you: like most things in this profession, I inherited it from a predecessor, and that bit was working so I left it alone.

I have done loads of micro-optimisations through this process, so may add yours to it.

I'm now also thinking that, because debugging is off and was the primary cause of the memory bloat (I hate the term memory leak), I may not need the multiple nulls, unsets and forced garbage-collector cycles.


  • 2 years later...

I just want to share my findings from today when I wanted to create one million pages quickly:

First, via Tracy Console:

ini_set('max_execution_time', 150);
$last = $pages->findOne("parent=/data,include=all,sort=-id");
$last = (int)$last->title ?: 0;
for($i=$last+1;$i<$last+1000;$i++) {
    // create page
    $p = new Page();
    $p->template = 'basic-page';
    $p->parent = '/data';
    $p->title = $i;
    $p->save();
    l("saved page $i");
    $pages->uncacheAll();
    gc_collect_cycles();
}
d('done');

It started at 23 pages per second:

[screenshot]

And at the end of the 150 seconds it was already down at 4 pages per second:

[screenshot]

Testing without logging:

[screenshots]

Memory usage also increased proportionally.

Finally, running a PHP script from the command line that bootstrapped PW worked well:

<?php namespace ProcessWire;
$time_pre = microtime(true);
include("../../../index.php"); // bootstrap ProcessWire
$last = $pages->findOne("parent=/data,sort=-id");
$i = (int)(string)$last->title ?: 0; // multilang quickfix convert to string

// foreach($pages->find('parent=/data') as $p) {
//     $p->delete();
//     echo "deleted $p\n";
// }
// return;

$i++;
$num = 0;
while($i <= 1000000) {
    // create page
    $p = new Page();
    $p->template = 'basic-page';
    $p->parent = '/data';
    $p->title = $i;
    $p->save();
    $pages->uncacheAll();
    gc_collect_cycles();
    $num++;
    echo "done: $num ($i)\n";
    $i++;
}

function convert($size)
{
    $unit=array('b','kb','mb','gb','tb','pb');
    return @round($size/pow(1024,($i=floor(log($size,1024)))),2).' '.$unit[$i];
}

$time_post = microtime(true);
$t = $time_post - $time_pre;
$m = memory_get_usage();
echo "created $num pages in " . round($t, 3) . "s, " . round($t/$num, 3) . "s per page, used " . convert($m) . " memory\n";

[screen recording]

Benefits:

  • quick
  • runs in the background and does not lock the site
  • direct feedback via echo without creating huge log files
  • cancel via ctrl+c any time
  • easy persistent saving of your scripts
  • Like 15
  • Thanks 1

without uncacheAll and gc_collect_cycles()

created 500 pages in 24.82s, 0.05s per page, used 20.88 mb memory

using only gc_collect_cycles()

created 500 pages in 24.074s, 0.048s per page, used 15.16 mb memory

using only $pages->uncacheAll()

created 500 pages in 24.366s, 0.049s per page, used 20.88 mb memory
created 2000 pages in 96.191s, 0.048s per page, used 25.07 mb memory

Seems that gc_collect_cycles() is the best option?! Not sure if I'm using it correctly, though: https://stackoverflow.com/questions/19715048/what-gc-collect-cycles-function-is-useful-for
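
One way to see whether gc_collect_cycles() actually does anything here (a sketch; gc_status() needs PHP 7.3+) is to look at the collector's own counters around the call:

$before = gc_status();         // array with 'runs', 'collected', 'threshold', 'roots'
$freed  = gc_collect_cycles(); // returns the number of cycles collected by this call
$after  = gc_status();
echo "collected $freed cycles, roots before/after: {$before['roots']} / {$after['roots']}\n";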


2 hours ago, bernhard said:

Memory usage also increased proportionally.

Just be aware that if you have Tracy calls inside your script (d(), etc), these also contribute to the total memory usage :-). In my test cases (some unrelated work), the amount used by Tracy can be significant.

  • Like 2

Thx, yes I thought that would be the case. But without any logging you never know whether the script is still running or not (or what the progress is). Console + echo is really nice for such situations where the Tracy console does not fit.


8 minutes ago, bernhard said:

But without any logging you never know whether the script is still running or not

True. In my case, I used $log to save the memory usage results to file to get an average. After that, I switched to Tracy but was well aware of how much memory my script was actually using vs any additions from Tracy :-).
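
For reference, a minimal sketch of that kind of logging with PW's $log API (the log name is arbitrary; entries end up under site/assets/logs/):

// inside the import loop, after each save
$log->save('import-memory', "after page {$p->id}: " . memory_get_usage() . " bytes");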

  • Like 2

How about deleting thousands of pages using the API?  Right now I'm using this hackish approach, but it's only when syncing my live db to my dev db, so even though it's not clean and may leave orphan records (which aren't causing any problems if they are even there), it's ok for my purposes:

GitHub request:
https://github.com/processwire/processwire-requests/issues/131


14 hours ago, dragan said:

How long did this 1 million page-creation script finally take?

I stopped at 200k; that's enough for my performance tests for now (RockTabulator).

 

9 hours ago, Jonathan Lahijani said:

How about deleting thousands of pages using the API?

created 1000 pages in 52.123s, 0.052s per page, used 15.16 mb memory
deleted 1000 pages in 5.759s, 0.006s per page, used 19.35 mb memory
Quote

script that deletes 15,000 pages takes about 30 minutes, which is painful.

On my setup it would take 6 * 15 = 90 seconds to delete 15k pages. Maybe you have some hooks running? That's always a pain and slows things down when you need to modify huge numbers of pages...

Code used:

<?php namespace ProcessWire;
$time_pre = microtime(true);
include("../../../index.php"); // bootstrap ProcessWire

function convert($size)
{
  $unit=array('b','kb','mb','gb','tb','pb');
  return @round($size/pow(1024,($i=floor(log($size,1024)))),2).' '.$unit[$i];
}

function showTime($time_pre, $num) {
  $time_post = microtime(true);
  $t = $time_post - $time_pre;
  $m = memory_get_usage();
  echo "processed $num pages in " . round($t, 3) . "s, " . round($t/$num, 3) . "s per page, used " . convert($m) . " memory\n";
}

$num = 0;
foreach($pages->find('parent=/data,id>257975') as $p) {
  $p->delete();
  echo "deleted $p\n";
  gc_collect_cycles();
  $num++;
}
showTime($time_pre, $num);
return;

 


If using InnoDB, use transactions to batch the page creation and deletion.

With Batch Child Editor and transactions, I am able to create 35.5k pages (with over a dozen fields) in 2 min 44 s on my local machine. So much faster than Bernhard's result.
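
A rough sketch of what that batching could look like with PW's database API (assuming the tables are InnoDB so transactions are supported; $items, template, parent and batch size are placeholders):

$useTx = $database->supportsTransaction();  // false on engines like MyISAM
if($useTx) $database->beginTransaction();
$n = 0;
foreach($items as $title) {
    $p = new Page();
    $p->template = 'basic-page';
    $p->parent = '/data/';
    $p->title = $title;
    $p->save();
    if($useTx && ++$n % 500 === 0) {        // commit every 500 pages, then start a new transaction
        $database->commit();
        $database->beginTransaction();
    }
}
if($useTx) $database->commit();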

  • Like 2

  • 3 weeks later...
18 hours ago, dragan said:

Wow, PW (backend) became absolutely unusable after I created some 14k pages. Absurdly long loading times.

Maybe you have some hooks that are causing this? Or a module that does some calculations? A default PW installation should not have any problems with several thousand pages...


Yeah, I was thinking the same.

tl;dr: there's probably something broken with my server, rather than PW.

In my several tests, I had to shut down Laragon (and Apache manually via Task Manager) and restart the server.

I also found an Apache error, same as reported here (installing a new Apache didn't help),

+ a PHP error:

#8 C:\laragon\www\pwdev\wire\core\ProcessWire.php(444): ProcessWire\Session->__construct(Obj in C:\laragon\www\pwdev\index.php on line 64
[13-Jul-2019 19:00:41 Europe/Zurich] PHP Fatal error:  Exception: SQLSTATE[70100]: <<Unknown error>>: 1317 Query execution was interrupted (in C:\laragon\www\pwdev\wire\core\WireDatabasePDO.php line 480)

I also switched to InnoDB in the meantime... so most likely there's something broken now with my LAMP setup.

But I'll disable every hook now and also some modules, and see how it goes. Does anyone know if it's safe to delete DB sessions manually in MySQL?

