
Leaderboard

Popular Content

Showing content with the highest reputation on 04/13/2021 in all areas

  1. Hi,

Thanks a lot for all the feedback. I did some additional tests based on all of the suggestions you gave me and the results are already amazing!

Figure 1 shows @ryan's suggestions tested independently:

1. I created the $template variable outside the loop.
2. I created the $parent variable outside the loop. The boost in performance is surprising! Defining the $parent outside the loop made a huge difference (before, I didn't assign the parent explicitly; it was already defined in the template, so the assignment was automatic).
4. I also tried this suggestion ($page->name = "protein" . $i;) and although it seems to boost performance a bit, I didn't include the plot because the results were not conclusive. Still, I will include this in my code.

Figure 2 is based on @horst's suggestion. I tested the impact of calling gc_collect_cycles() and $pages->uncacheAll() after every $database->commit(). I didn't do a test for $pages->uncache($page) because I thought $pages->uncacheAll() was basically the same. Maybe this is not true (?). Results don't show any well-defined boost in performance (I guess ryan's recent reply predicted this).

I still need to try @BitPoet's suggestion because I am sure it is something that will boost performance. I am now doing these tests on my personal computer; I will repeat them when running on the dedicated server. I would also like to try generators (first time I hear about them).

One last thing regarding the fields in the protein template and the data structure in general (the pseudo code I posted initially was just an example). Proteins are classified into groups. Each protein can belong to more than one group (max. 5). My original idea was to use repeaters because for each protein I have the following information repeated: GroupID [integer], start [integer], end [integer], sequence [text]. The idea is that from GroupID you can go to the particular group page (I have around 50k groups), but I don't necessarily need a page reference for this. The CSV is structured as follows. Note that some protein entries are repeated, which means that I shouldn't create a new page but add an entry to the repeater field.

Protein-name  groupID  start  end  sequence
A0A151DJ30    41       3      94   CPFES[...]VRQVEK
A0A151DJ30    55       119    140  PWSGD[...]NWPTYKD
A0A0L0D2B9    872      74     326  MPPRV[...]TTKWSKK
V8NIV9        919      547    648  SFKYL[...]LEAKEC
A0A1D2MNM4    927      13     109  GTRVW[...]IYTYCG
A0A1D2MNM4    999      119    437  PWSGDN[...]RQDTVT
A0A167EE16    1085     167    236  KTYLS[...]YELLTT
A0A0A0M635    1104     189    269  KADQE[...]INLVIV

Since I know repeaters also create additional overhead, I am doing all my benchmarks without them. I can always build the website without them. In the next days I will do some benchmarks including repeaters just to see how it goes.

Once again, thanks for all the replies!
    8 points
  2. Just to add, if the points from @ryan and @horst aren't enough (they should boost import times quite noticeably) you could try dropping the FULLTEXT keys on the relevant fields' tables before the import and recreating them afterwards (ALTER TABLE `field_fieldname` DROP KEY `data` / ALTER TABLE `field_fieldname` ADD FULLTEXT KEY `data` (`data`)). Finally, a big part of MySQL performance depends on server tuning. The default size for the InnoDB buffer pool (the part of RAM where MySQL holds data and indexes) is rather small at 128MB. If you have a dedicated database server, you can up that to 80% of physical memory to avoid unnecessary disk access.
    7 points
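A minimal sketch of how the FULLTEXT suggestion above could be scripted from the same import file, assuming the relevant field's table is field_sequence (a placeholder; use the table of whichever field actually carries the index) and using $database->exec() for the raw SQL:

// Drop the FULLTEXT key before the bulk import ("field_sequence" is a placeholder table name)
$database->exec("ALTER TABLE `field_sequence` DROP KEY `data`");

// ... run the import loop here ...

// Recreate the FULLTEXT key once, after all rows are inserted
$database->exec("ALTER TABLE `field_sequence` ADD FULLTEXT KEY `data` (`data`)");

// The InnoDB buffer pool is a server-level setting (my.cnf), not something set in PHP, e.g.:
// innodb_buffer_pool_size = 8G

The 8G value is only an example; the guideline from the post is up to 80% of physical memory on a dedicated database server.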
  3. 101aandd.com: a new site, mostly designed by the client and implemented in ProcessWire, concentrating on small, subtle animations.
    4 points
  4. @fedeb Glad that moving the $parent outside the loop helped there. The reason it helps is that after a $pages->save() comes the automatic $pages->uncacheAll(), so the auto-assigned parent from the template has to be re-loaded on every iteration. By keeping your own copy loaded and assigning it yourself, you avoid that extra overhead in this case.

Avoid getting repeaters involved. I wouldn't even experiment with it here. That will at minimum triple the number of pages (assuming every protein page could have a repeater). Repeaters would be just fine if you were working in thousands-of-pages territory, but in millions-of-pages territory it's not going to be worth even attempting. Using a ProFields Table field would be the best alternative if you needed it to be queryable data. If you didn't need it to be queryable data (groupID, start, end, sequence), I would leave them as they are, space-separated in a plain textarea field; they can easily be parsed out at runtime so you can access them as properties of the page. (If that suits your need, let me know and I'll get into how that can be done.) When working at large scale, it's also always good to consider custom building a Fieldtype module for the purpose (that's another topic, but we can get into it too).

For your groupID, if the same groupID is referenced by multiple proteins, and there is more information about each "group" (other than just an ID), then I think it would make sense for it to be a Page reference field.

What is the max number of groupID+start+end+sequence rows that a protein can have? If there is a natural limit and it's not large, then that would open up some new storage possibilities too.

Another optimization you can make in your loop:

$page->sort = $i;

This prevents it from having to detect and auto-assign a sort value based on the quantity of children the parent page has.

For the $page->name, if each page will have a unique "protein-name" then you might also consider using that rather than ("protein" . $i), as it will be more reflective of the page than a generic index number.
    4 points
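(Not part of the post above, just an illustration of the runtime-parsing idea it mentions.) A minimal sketch, assuming the group data is stored space-separated, one row per line, in a plain textarea field hypothetically named groups:

// e.g. in site/templates/protein.php: parse the textarea on demand
$rows = [];
foreach (explode("\n", $page->groups) as $line) {
    $line = trim($line);
    if ($line === '') continue;
    list($groupID, $start, $end, $sequence) = explode(' ', $line, 4);
    $rows[] = [
        'groupID'  => (int) $groupID,
        'start'    => (int) $start,
        'end'      => (int) $end,
        'sequence' => $sequence,
    ];
}
// $rows can now be used like properties of the page in the template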
  5. I think you could try generators here while you are playing with the CSV. For example:

<?php

function getRows($file) {
    $handle = fopen($file, 'rb');
    if ($handle === false) {
        throw new Exception('open file ' . $file . ' error');
    }
    while (feof($handle) === false) {
        yield fgetcsv($handle);
    }
    fclose($handle);
}

// Memory is allocated for only a single line of the CSV file at a time;
// the entire file does not need to be read into memory.
$generator = getRows('../data/20_mil_data.csv');

// foreach ($generator as $row) { print_r($row); }
while ($generator->valid()) {
    print_r($generator->current()); // $generator->current() is your $row
    // do the ProcessWire work here
    $generator->next();
}
// Note: a generator cannot be rewound once iteration has started.
// http://php.net/manual/en/class.generator.php

That's always my #1 choice while working with big datasets in PHP.
    4 points
  6. What about a #5: $pages->uncache($page) after the $database->commit(), to additionally free some memory?
    4 points
  7. @fedeb That's the largest quantity of pages I've heard of anyone creating in ProcessWire, by a pretty large margin. So you are in somewhat uncharted territory. But that's really cool you are doing that.

I would be curious how different the graph would be if you split it up into batches so that you aren't creating more than a certain quantity per execution/runtime. For instance, maybe you create 10k in one execution and another 10k in the next, etc., or something like that. Would the same slowdown still occur? If so, I would start to think it might be the database index and increased overhead in maintaining that index as the quantity increases. On the flip side, if restarting the process to create each set in batches solves the slowdown, then I would think it might be memory or resource related.

A couple things you can do to potentially (?) improve your page creation time:

1. At the top of your code (before the loop) put: $template = $templates->get('protein'); Then within the loop set: $page->template = $template;

2. I don't see a parent page assignment. How are you doing that? Double check that you aren't asking PW to load the parent page every time in the loop and instead handle it like with the template in #1 above.

3. What kind of fields are on your "protein" template? Depending on their type, there may be potential optimizations. Especially if any are Page references. Can you paste in a line or two from the CSV?

4. If you can assign a $page->name = "protein" . $i; rather than having PW auto-generate a name from the title, that will save some resources too.
    4 points
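A minimal sketch pulling the loop optimizations from the post above together with the explicit name and sort assignments mentioned elsewhere in the thread. The parent path and batch size are assumptions; treat this as an illustration rather than tested import code:

$template = $templates->get('protein');        // loaded once, outside the loop
$parent   = $pages->get('/proteins/');         // assumed parent path, adjust to your tree

$handle = fopen($file, 'r');
$database->beginTransaction();

for ($i = 0; $row = fgetcsv($handle, 0, ' '); ++$i) {
    $page = new Page();
    $page->template = $template;               // reuse the preloaded template object
    $page->parent   = $parent;                 // reuse the preloaded parent page
    $page->name     = 'protein' . $i;          // skip auto-generating the name from the title
    $page->sort     = $i;                      // avoid counting existing children on each save
    $page->title    = $row[0];
    $page->size     = $row[1];
    $page->len_avg  = $row[2];
    $page->len_std  = $row[3];
    $page->save();

    if (($i + 1) % 200 == 0) {                 // commit every 200 pages, as in the original script
        $database->commit();
        $database->beginTransaction();
    }
}
$database->commit();
fclose($handle);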
  8. Hi,

A bit of background: I am creating a website that lets you navigate through a protein database with 20 million proteins grouped into 50 thousand categories. The database is fixed in size, meaning there is no need to update/add information in the near future. Queries to the database are pretty standard.

The problem I am currently having is the time it takes to create the pages for the proteins (right now around a week). Pages are created by reading the data from a CSV file. Based on previous posts I found on this forum (link1, link2) I decided to use $database transactions to load the data from a PHP script (InnoDB engine required). This really boosts performance from 25 pages per second to 200 pages per second. The problem is that performance drops as a function of pages created (see attached image). Is this behavior expected? I tried using gc_collect_cycles() but haven't noticed any difference. Is there a way to avoid the degradation in performance? A stable 200 pages per second would be good enough for me.

Pseudo code:

$handle = fopen($file, "r");
$trans_size = 200; // commit to database every _ pages

try {
    $database->beginTransaction();
    for ($i = 0; $row = fgetcsv($handle, 0, " "); ++$i) {
        // fields from data
        $title   = $row[0];
        $size    = $row[1];
        $len_avg = $row[2];
        $len_std = $row[3];

        // create page
        $page = new Page();
        $page->template = "protein";
        $page->title    = $title;
        $page->size     = $size;
        $page->len_avg  = $len_avg;
        $page->len_std  = $len_std;
        $page->save();

        if (($i + 1) % $trans_size == 0) {
            $database->commit();
            // $pages->uncacheAll();
            // gc_collect_cycles();
            $database->beginTransaction();
        }
    }
    $database->commit();
} catch (\Exception $e) {
    $database->rollBack(); // undo the open transaction if something goes wrong
    throw $e;
}

I am quite new to ProcessWire so feel free to criticize me. Thanks in advance!
    3 points
  9. @cb2004 Got it. I'll put out an update to the ProcessWireUpgrade module this week. Support for identifying the latest version of Pro modules is a function of the modules directory rather than the upgrades module. I've been meaning to do this, so thanks for the reminder. I have gone ahead and updated it so that it can now identify the latest versions of all Pro modules. Though I can't add support for download+install upgrades of Pro modules, as they are access controlled so there can't be public download URLs for these. I also think that in general it's always better to install or upgrade modules directly on the file system, as that prevents permissions problems (for when apache is not running as your user account), and makes it easier to troubleshoot and resolve issues when installing or upgrading modules.
    3 points
  10. @adrian, would it be possible to get a refresh & clear link here? I'm developing Process modules quite extensively these days, and when adding a new nav item I always need to click "refresh" to trigger the module to pick up the changes, and then "clear session and cookies" to make the change visible in the menu. It would be great to get all of that in one click. We know how annoying lots of reloads get when we have to do them often... Thanks for considering!
    2 points
  11. Unless I'm forgetting something, the $pages->uncache($page); won't help here because $page is a newly created Page that wasn't loaded from the database. So it's not going to be cached either. Uncaching pages is potentially useful when iterating through large groups of existing pages. For instance, if you are rendering or exporting something large from the contents of existing pages, you might like to $pages->uncacheAll() after getting through a thousand of them to clear room for another paginated batch. Though nowadays we have $pages->findMany() and $pages->findRaw(), so there are fewer instances where you would even need to use uncache or uncacheAll, if ever.

ProcessWire actually does an uncacheAll() internally after saving a page already. This is necessary because changes to a page or additions/deletions to the page tree may affect other pages, and we don't want any potential for old cached data to appear in future $pages->find() or other operations. Just one example: if we called $parent->children() before a save and then called it again after the save, we'd want our new page to be in the children rather than having it return the previously cached value. There are a lot of similar cases, so the safest bet is for PW to uncache the results of future page get/find operations after a save as the default behavior. So that's the way it has always done it.

As far as I can tell from fedeb's example (and often with other import operations), it may be better to tell PW to skip this "uncacheAll-after-save" behavior. That's because imports often involve Page reference fields, and you don't want PW to have to reload referenced pages after every save. So you could potentially reduce overhead by telling it not to uncache after save, i.e. $pages->save($page, [ 'uncacheAll' => false ]); I'm not sure if fedeb's import involves loading of any other pages, whether for page reference fields or anything else, so it may not matter one way or the other here, but I wanted to mention it just in case.

I know about ProcessWire tuning, but not about MySQL server tuning. When dealing with 20 million rows, that seems like getting into the territory where optimizations to the DB configuration deserve a lot of focus, so I would bet that BitPoet's suggestions are going to make the most difference.
    2 points
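For context, a minimal sketch of how the save option mentioned above could look inside such an import loop (the option itself is from the post; the surrounding loop is only an illustration):

foreach ($rows as $row) {
    $page = new Page();
    $page->template = $template;   // preloaded outside the loop
    $page->parent   = $parent;     // preloaded outside the loop
    $page->title    = $row[0];
    // skip the automatic uncacheAll() after this save, so already-loaded pages
    // (the parent, any referenced pages) don't have to be re-loaded each iteration
    $pages->save($page, ['uncacheAll' => false]);
}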
  12. @adrian, I have an idea for a different solution to the problem. I'll report back once I've done some work on it.
    2 points
  13. But it will not help others who may have the same problem in the future.
    1 point
  14. Hello @eelkenet, I had that issue on my local MAMP server when this feature was introduced. I tried this config option, but for me it doesn't work at all. Is it mandatory to edit your .htaccess? Because strategies 1 and 2 didn't work for me, I use strategy 3 with my own config variable ($config->useWebP) and disable WebP on my local MAMP server. Regards, Andreas
    1 point
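A minimal sketch of how such a custom flag might be wired up, assuming the $config->useWebP name from the post and ProcessWire's standard $image->webp API in a template (the template code is an assumption, not from the post):

// site/config.php (set to false on the local MAMP install)
$config->useWebP = false;

// in a template file: choose the image URL based on the flag
$image = $page->images->first();
$src   = $config->useWebP ? $image->webp->url : $image->url;
echo "<img src='$src' alt='{$image->description}'>";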
  15. Version 0.0.2 now on GitHub https://github.com/MetaTunes/ProcessDbMigrate This version more fully allows for different page ids in source and target systems. A meta value (idMap) maintains the mapping. This allows the replacement of links in RTE fields provided the relevant pages are all in the migration. Also, all existing image variants are migrated. EDIT: Now 0.0.3 fixes install problem and adds upgrade via modules -> refresh.
    1 point
  16. Hey @teppo, just wanted to let you know that I ran into a problem with this module after installing the new HannaCode module. Here are the details: https://github.com/ryancramerdesign/ProcessHannaCode/issues/23 I know this is an older module and most people will probably use HannaCodeDialog now, but I thought I would let you know in case you had some sites still running this module. Thanks for all you do with ProcessWire!
    1 point
  17. Just tested now and unfortunately it's logging me out again. I tried moving things to __destruct and removing the session_write_close, but that just resulted in the timeout issue.
    1 point
  18. Hi! I just tried out your solution and it worked perfectly! Thanks a lot for the quick reply and the amazing explanation! Since I am creating temporary files on demand for download, I think I don't have a big security issue, nor do I need to track the files, right? Basically there is a link to a download.php which triggers the download of the file. There is no visible link to that file. I include the download.php code just to close the thread properly. It is probably not the best code, so if there is something I should change please let me know.

// create temp dir
$temp_dir = $files->tempDir('downloads');
$temp_dir->setRemove(false);
$temp_dir->removeExpiredDirs(dirname($temp_dir), $config->erase_tmpfiles); // remove dirs older than $config->erase_tmpfiles seconds

// create zip
$zip_file = $temp_dir . "test.zip";
$result_zip = $files->zip($zip_file, $data);

// download pop-up
if (headers_sent()) {
    echo 'HTTP header already sent';
} else {
    if (!is_file($zip_file)) {
        header($_SERVER['SERVER_PROTOCOL'].' 404 Not Found');
        echo 'File not found';
    } else if (!is_readable($zip_file)) {
        header($_SERVER['SERVER_PROTOCOL'].' 403 Forbidden');
        echo 'File not readable';
    } else {
        header($_SERVER['SERVER_PROTOCOL'].' 200 OK');
        header("Content-Type: application/zip");
        header("Content-Transfer-Encoding: Binary");
        header("Content-Length: ".filesize($zip_file));
        header("Content-Disposition: attachment; filename=\"".basename($zip_file)."\"");
        readfile($zip_file);
        exit;
    }
}
    1 point
  19. For simple JSON output, you can use the WireArray::explode() method together with json_encode() or wireEncodeJSON(): https://processwire.com/api/ref/wire-array/explode/

$myPages = $pages->find('template=basic-page');

// extract required fields into a plain array
$data = $myPages->explode(['title', 'created']);
echo wireEncodeJSON($data);
    1 point