flydev Posted December 6, 2019

Hello guys, I need to import about 300K text files (each file ~256 bytes) and turn them into pages on a website hosted on a shared host. I know I could use the PCNTL extension functions for this job, but they're not available (I asked support to enable them, but I expect the answer will be negative). Do you guys have/know of a work-around, script, or any other idea I'm not thinking of right now? Thanks!
wbmnfktr Posted December 6, 2019

Didn't @bernhard write something a few days/weeks back about how to efficiently create thousands of pages? Those weren't files, but at least performance wasn't the problem, if I remember correctly.
flydev Posted December 6, 2019 (edited)

Yeah thanks, I remember a post about that, but sorry for not explaining my main issue. The issue is not the performance of creating pages, but importing those 300k files without being able to use the PCNTL functions; because of that, I ran into memory limit and max execution time issues.

PS: I have no control over those two settings.

Edited December 6, 2019 by flydev
gebeer Posted December 6, 2019

I fork long-running processes to the background. You don't need PCNTL functions for this. In an import module which takes some minutes to run, I have a file "importworker.php":

```php
<?php namespace ProcessWire;

include(__DIR__ . "/../../../index.php"); // bootstrapping PW
error_reporting(2); // setting error reporting
// ini_set('max_execution_time', 300); // 300 seconds = 5 minutes

wire('log')->save('productimport', "starting import: " . date('Y-m-d H:i:s'));
$importModule = wire('modules')->get("ProcessImportProducts");
$importModule->importController('start');
wire('log')->save('productimport', "Import finished: " . date('Y-m-d H:i:s'));
```

Then there is a method that forks the heavy work into the background:

```php
public function startImportWorker() {
    $path = $this->config->paths->siteModules . "{$this->className}/";
    $command = "php {$path}importworker.php";
    $outputFile = "{$path}output.txt";
    // launch the worker detached from the request, capture its PID
    $pid = shell_exec(sprintf("%s > $outputFile 2>&1 & echo $!", $command));
    return;
}
```

All output of the importworker script is piped to output.txt, so I can see what happens while the process is running in the background. Some methods in my module echo stuff so I can see it in output.txt. For longer-running loops in my module, I also use the ini_set('max_execution_time', 300) method to prolong execution time, and I unset variables along the way to take care of memory issues.

With some ajaxy JS, I get the contents of output.txt and show them inside a div#status in my module, so the user knows that there is something going on:

```javascript
var ProcessImportProducts = {
    init: function() {
        $('#startimport').on('click', function(e) {
            e.preventDefault();
            $.get($(this).data('href'), function(data) {
                // console.log(data);
                ProcessImportProducts.pollResults(0);
            });
        });
    },
    pollResults: function(timestamp) {
        var statusUrl = '?getstatus=1';
        var statusText = $('#status');
        // var loader = $('.loader').clone();
        if(!timestamp) statusText.html('');
        $.ajax({
            type: 'GET',
            dataType: 'json',
            url: statusUrl,
            success: function(data) {
                // console.log(data);
                // if the file has changed, append data to statusText
                if(timestamp != data.timestamp) statusText.html(data.message).append('<div class="loader"></div>');
                // call the function again, this time with the timestamp we just got from the server
                var timeout = setTimeout(function() {
                    ProcessImportProducts.pollResults(data.timestamp);
                }, 1000);
                if(data.timestamp == 0) {
                    clearTimeout(timeout);
                    $('.loader').addClass('hide');
                }
                // scroll to bottom of status div
                statusText.scrollTop(statusText.prop("scrollHeight"));
            }
        });
    }
};

$(document).ready(function() {
    ProcessImportProducts.init();
});
```

EDIT: here's the part of my ___execute() function that returns the status data for the JS:

```php
if($this->config->ajax) {
    if($this->input->start == 1) {
        $this->startImportWorker();
        echo 1;
        return;
    }
    if($this->input->getstatus == 1) $this->returnStatus();
} else {
    // module output to screen
}
```

Here's a good read about running processes in the background: https://medium.com/async-php/multi-process-php-94a4e5a4be05

Hope that helps.
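The returnStatus() method is referenced above but not shown. A minimal sketch of what it might look like, assuming it simply reads output.txt and returns the {timestamp, message} JSON shape that pollResults() expects (this is not from the original post):

```php
// Hypothetical returnStatus() -- an assumption, not the original code.
// Uses the file's mtime as the change marker the JS polls against.
public function returnStatus() {
    $file = $this->config->paths->siteModules . "{$this->className}/output.txt";
    $timestamp = file_exists($file) ? filemtime($file) : 0;
    $message = file_exists($file) ? nl2br(file_get_contents($file)) : '';
    header('Content-Type: application/json');
    echo json_encode(array('timestamp' => $timestamp, 'message' => $message));
}
```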
dragan Posted December 6, 2019

The gist of that thread was:

- Create a standalone script and bootstrap PW.
- (temporarily at least) switch tables to InnoDB -> not sure about that
- use $pages->uncacheAll() + gc_collect_cycles() in each loop

https://processwire.com/talk/topic/14487-creating-thousands-of-pages-via-the-api/?do=findComment&comment=187826

A minimal loop along those lines is sketched below.
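For illustration only, a rough sketch of such a bootstrapped import script; the directory path, template name, parent page, and field names are all made-up assumptions:

```php
<?php namespace ProcessWire;

// Standalone script: bootstrap PW, then create one page per text file.
include(__DIR__ . "/index.php");

$parent = wire('pages')->get('/imports/');
$files = glob('/path/to/textfiles/*.txt'); // assumed source directory

foreach($files as $i => $file) {
    $p = new Page();
    $p->template = 'import-item'; // assumed template with a 'body' field
    $p->parent = $parent;
    $p->title = basename($file, '.txt');
    $p->body = file_get_contents($file);
    $p->save();
    // release memory every 100 pages, per the tips above
    if($i % 100 === 0) {
        wire('pages')->uncacheAll();
        gc_collect_cycles();
    }
}
```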
Beluga Posted December 6, 2019

The database will be the bottleneck. Use InnoDB and transactions. Find all items with the text "transaction" on this page: https://processwire.com/api/ref/wire-database-p-d-o/

https://processwire.com/blog/posts/using-innodb-with-processwire/

For an example of how they are used, find "transaction" in this: https://github.com/adrianbj/BatchChildEditor/blob/master/BatchChildEditor.module.php (BCE includes a copy of the supportsTransaction() function, so it works with older PW versions as well)
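A minimal sketch of the pattern those links describe, wrapping a batch of page saves in a single transaction; $batch, the template, and the parent path are assumptions for illustration:

```php
<?php namespace ProcessWire;

// Commit a batch of page saves as one InnoDB transaction.
$database = wire('database');
$useTransaction = $database->supportsTransaction();

if($useTransaction) $database->beginTransaction();
try {
    foreach($batch as $file) { // $batch: assumed array of source file paths
        $p = new Page();
        $p->template = 'import-item'; // assumed template
        $p->parent = wire('pages')->get('/imports/');
        $p->title = basename($file, '.txt');
        $p->save();
    }
    if($useTransaction) $database->commit();
} catch(\Exception $e) {
    if($useTransaction) $database->rollBack();
    throw $e;
}
```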
OLSA Posted December 20, 2019

Hi, here is what I used in my last project for about 10,000 pages; it is a very simple and basic script. It reads a CSV file line by line and creates pages, but to avoid execution time limits and to get some other options (e.g. a "pause" option with a later "continue", real-time monitoring, etc.) I use a very simple Ajax loop. Here is an attachment, and inside it is a "how-to" txt file.

unzip-and-place-content-inside-templates.zip

Please note that I used this for ~10,000 pages (in my case, processing time was ~1s/page); for more than that you can try some optimisations and test it, and there are a few places for that. Theoretically it could run for a few days, but is it worth it? Regards.
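The idea behind such an Ajax loop, sketched roughly (this is not the attachment's code; the file path, template, and parameter names are invented): each request processes one CSV line and tells the browser which line to request next, so no single request hits the time limit.

```php
<?php namespace ProcessWire;

// Rough sketch of the Ajax-loop idea, not the actual attachment code.
// The browser calls this endpoint repeatedly with an increasing ?line= offset.
$line = (int) wire('input')->get('line');
$rows = file('/path/to/data.csv', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

if(isset($rows[$line])) {
    $cols = str_getcsv($rows[$line]);
    $p = new Page();
    $p->template = 'import-item'; // assumed template
    $p->parent = wire('pages')->get('/imports/');
    $p->title = $cols[0];
    $p->save();
    echo json_encode(array('done' => false, 'next' => $line + 1));
} else {
    echo json_encode(array('done' => true)); // the JS loop stops here
}
```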