Importing 300,000 files into pages


flydev

Hello guys, 

I need to import about 300K text files (each file has ~256 bytes) and turn them into pages on a website hosted on a shared host.

I know I could use the PCNTL extension functions to do this job, but it's not available (I asked support to enable it, but I think the answer will be negative).

 

Do you guys have/know a workaround, a script, or any other idea I'm not thinking of right now?

 

Thanks!


Yeah thanks, I remember a post about that, but sorry for not explaining my main issue.

The issue is not the performance of creating pages, but importing those 300k files without being able to use the PCNTL functions; because of that, I ran into memory limit and max execution time issues.

 

PS: I have no control over these two settings.
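Without PCNTL, one common way around a fixed max_execution_time is to split the import into small, resumable batches: each HTTP request (or cron run) processes files until a self-imposed time budget is spent, persists an offset, and stops. A minimal sketch in plain PHP; the directory path, offset file, and the processFile() callback (where you would create the actual page) are all hypothetical names, not ProcessWire API:

```php
<?php
// Resumable batch import: process files until a time budget is spent,
// persist the offset, and let the next request/cron run continue.
function importBatch(string $dir, string $offsetFile, callable $processFile, int $budgetSeconds = 20): array
{
    $files = glob($dir . '/*.txt');
    sort($files); // stable order, so the saved offset stays meaningful
    $offset = is_file($offsetFile) ? (int) file_get_contents($offsetFile) : 0;
    $start = time();
    $done  = 0;
    $total = count($files);
    for ($i = $offset; $i < $total; $i++) {
        if (time() - $start >= $budgetSeconds) break; // budget spent, resume later
        $processFile($files[$i]); // e.g. create a page from the file contents
        $done++;
    }
    $offset += $done;
    file_put_contents($offsetFile, (string) $offset); // persist progress
    return ['processed' => $done, 'offset' => $offset, 'finished' => $offset >= $total];
}
```

Calling this repeatedly (cron, or an Ajax loop like the one posted later in this thread) until `finished` is true stays well under any per-request time limit.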

Edited by flydev
precision.

I fork long running processes to the background. You don't need PCNTL functions for this.

In an import module which takes some minutes to run, I have a file "importworker.php":

<?php namespace ProcessWire;
include(__DIR__ . "/../../../index.php"); // bootstrapping PW
error_reporting(E_WARNING); // 2 = E_WARNING: report warnings only
// ini_set('max_execution_time', 300); // 300 seconds = 5 minutes

wire('log')->save('productimport', "starting import: " . date('Y-m-d H:i:s'));
$importModule = wire('modules')->get("ProcessImportProducts");
$importModule->importController('start');
wire('log')->save('productimport', "Import finished: " . date('Y-m-d H:i:s'));

Then there is a method for forking the heavy work into the background:

	public function startImportWorker() {

		$path = $this->config->paths->siteModules . "{$this->className}/";
		$command = "php {$path}importworker.php";

		$outputFile = "{$path}output.txt";
		$pid = shell_exec(sprintf("%s > $outputFile 2>&1 & echo $!", $command));
		return;

	}

All output of the importworker script is piped to output.txt, so I can see what happens while the process is running in the background. Some methods in my module echo progress so I can follow it in output.txt.

Also, for longer-running loops in my module, I use ini_set('max_execution_time', 300) to prolong the execution time.

And I unset variables along the way to take care of memory issues.
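A quick illustration of why the unset() step matters: dropping big variables as soon as a batch is saved hands the memory back to PHP, so a long import loop stays flat instead of creeping toward memory_limit. (In ProcessWire specifically, calling $pages->uncacheAll() periodically helps too, since PW caches every page it loads; the sketch below is plain PHP.)

```php
<?php
// unset() frees large allocations back to PHP's memory manager,
// which keeps long-running import loops from exhausting memory_limit.
$before = memory_get_usage();
$blob   = str_repeat('x', 10 * 1024 * 1024); // ~10 MB, like one parsed batch
$during = memory_get_usage();
unset($blob);        // drop the reference as soon as the batch is saved
gc_collect_cycles(); // nudge the collector (only matters for cyclic refs)
$after = memory_get_usage();
printf("before: %d, during: %d, after: %d\n", $before, $during, $after);
```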

With some Ajax in JS, I fetch the contents of output.txt and show them inside a div#status in my module, so the user knows that something is going on:

var ProcessImportProducts = {


	init: function() {


		$('#startimport').on('click', function(e){
			e.preventDefault();
			$.get($(this).data('href'), function( data ) {
				// console.log(data);
				ProcessImportProducts.pollResults(0);
			});
		});

	},

	pollResults: function(timestamp) {

		var statusUrl = '?getstatus=1';
		var statusText = $('#status');
		// var loader = $('.loader').clone();
		if(!timestamp) statusText.html('');
	    $.ajax(
	        {
	            type: 'GET',
	            dataType: 'json',
	            url: statusUrl,
	            success: function(data){
	                // console.log(data);
	                // if file has changed append data to statusText
	                if(timestamp != data.timestamp ) statusText.html(data.message).append('<div class="loader"></div>');
	                // call the function again, this time with the timestamp we just got from server
					var timeout = setTimeout(function() {
						ProcessImportProducts.pollResults(data.timestamp);
					}, 1000);
	                if(data.timestamp == 0) {
	                	clearTimeout(timeout);
	                	$('.loader').addClass('hide');
	                }
	                // scroll to bottom of status div
	                statusText.scrollTop(statusText.prop("scrollHeight"));
	            }
	        }
	    );


	}

};


$(document).ready(function() {
	ProcessImportProducts.init();
}); 

EDIT: here's the part of my ___execute() function that returns the status for the JS:

		if($this->config->ajax) {
			 if($this->input->start == 1){
			 	$this->startImportWorker();
			 	echo 1;
			 	return;
			 }
			 if($this->input->getstatus == 1) $this->returnStatus();
		} else {
			// module output to screen
		}
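For completeness, the returnStatus() side that the polling JS talks to could look roughly like this. A hedged sketch, not the author's actual code: it assumes output.txt sits next to the module and that the JSON shape is {timestamp, message}, with timestamp 0 signalling "done" (which is what the clearTimeout() branch in the JS above checks for):

```php
<?php
// Hypothetical status builder: returns the worker's output.txt plus its
// mtime, so the JS poller can detect changes and completion.
function buildStatus(string $outputFile, string $doneMarker = 'Import finished'): array
{
    if (!is_file($outputFile)) {
        return ['timestamp' => 0, 'message' => 'No import running.'];
    }
    $raw = file_get_contents($outputFile);
    // timestamp 0 tells the poller to stop (see the clearTimeout() branch)
    $timestamp = (strpos($raw, $doneMarker) !== false) ? 0 : filemtime($outputFile);
    return ['timestamp' => $timestamp, 'message' => nl2br(htmlspecialchars($raw))];
}

// Inside the module's returnStatus() you would then emit it, e.g.:
// header('Content-Type: application/json');
// echo json_encode(buildStatus($this->config->paths->siteModules . "{$this->className}/output.txt"));
```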

Here's a good read about running processes in the background: https://medium.com/async-php/multi-process-php-94a4e5a4be05

Hope that helps.

 


The database will be the bottleneck. Use InnoDB and transactions. Search for "transaction" on this page: https://processwire.com/api/ref/wire-database-p-d-o/

https://processwire.com/blog/posts/using-innodb-with-processwire/

For an example of how they are used, find "transaction" in this: https://github.com/adrianbj/BatchChildEditor/blob/master/BatchChildEditor.module.php

(BCE includes a copy of the supportsTransaction function, so it works with older PW versions as well.)
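The pattern itself is plain PDO, and ProcessWire's $database is a PDO subclass, so beginTransaction()/commit() work the same way there. A sketch with SQLite standing in for MySQL/InnoDB (the table and function names are made up for illustration); the win is one commit per batch instead of one implicit commit per INSERT:

```php
<?php
// Batched inserts inside transactions: one commit per chunk instead of
// one implicit commit per INSERT, which is the big win on InnoDB.
function batchInsert(PDO $db, array $rows, int $batchSize = 500): int
{
    $stmt  = $db->prepare('INSERT INTO pages_import (title, body) VALUES (?, ?)');
    $count = 0;
    foreach (array_chunk($rows, $batchSize) as $chunk) {
        $db->beginTransaction();
        foreach ($chunk as $row) {
            $stmt->execute($row);
            $count++;
        }
        $db->commit();
    }
    return $count;
}
```

In ProcessWire you would instead wrap your $page->save() loop in $this->database->beginTransaction() / commit(), guarded by supportsTransaction() as in BCE, since MyISAM tables silently ignore transactions.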


  • 2 weeks later...

Hi,

here is what I used in my last project for about 10,000 pages; it is a very simple and basic script.
It reads a CSV file line by line and creates pages, but to avoid execution time limits and to get some other options (e.g. a "pause" option with a later "continue", real-time monitoring, etc.) I use a very simple Ajax loop.

[Animated GIF attachment: the importer's real-time progress display]

Here is the attachment; inside it is a "how-to" txt file.

unzip-and-place-content-inside-templates.zip

Please note that I use this for ~10,000 pages (in my case, processing time is ~1s/page); for more than that you can try some optimisations and test it. There are a few places for that. Theoretically it could run for a few days, but is it worth it?
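The Ajax-loop idea boils down to a stateless batch endpoint: each request reads the next N CSV lines starting from a byte offset and returns the new offset, so no single request ever approaches the time limit. A minimal sketch in plain PHP (function name and the page-creation spot are hypothetical, not taken from the attached script):

```php
<?php
// Read up to $limit CSV rows starting at byte $offset; return the rows,
// the new offset, and whether the file is exhausted. One Ajax request
// handles one batch, so max_execution_time is never in play.
function readCsvBatch(string $file, int $offset, int $limit): array
{
    $fh = fopen($file, 'r');
    fseek($fh, $offset);
    $rows = [];
    while (count($rows) < $limit && ($row = fgetcsv($fh)) !== false) {
        $rows[] = $row; // here you would create one page per row
    }
    $newOffset = ftell($fh);
    $done = feof($fh);
    fclose($fh);
    return ['rows' => $rows, 'offset' => $newOffset, 'done' => $done];
}
```

The JS side then just re-requests with the returned offset until `done` is true, updating the progress display between calls.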
Regards.

 
