Creating 20 million pages

fedeb · April 12, 2021

Hi,

A bit of background. I am creating a website which lets you navigate through a protein database with 20 million proteins grouped into 50 thousand categories.

The database is fixed in size, meaning no need to update/add information in the near future. Queries to the database are pretty standard.

The problem I am currently having is the time it takes to create the pages for the proteins (right now around a week). Pages are created reading the data from a csv file. Based on previous posts I found on this forum (link1, link2) I decided to use $database transactions to load the data from a php script (InnoDB engine required). This really boosts performance from 25 pages per second to 200 pages per second. Problem is performance drops as a function of pages created (see attached image).

Is this behavior expected? I tried using gc_collect_cycles() but haven't noticed any difference.

Is there a way to avoid the degradation in performance? A stable 200 pages per second would be good enough for me.

Pseudo code:


$handle = fopen($file, "r");
$trans_size = 200   // commit to database every _ pages

try {

    $database->beginTransaction();

    for ($i = 0; $row = fgetcsv($handle, 0, " "); ++$i) 
    {

        // fields from data
        $title                  = $row[0];
        $size                   = $row[1];
        $len_avg                = $row[2];
        $len_std                = $row[3];

        
        // create page
        $page = new Page();
        $page->template          = "protein";
        $page->title             = $title;
        $page->size              = $size;
        $page->len_avg           = $len_avg;
        $page->len_std           = $len_std;
        $page->save();
        
        
        if (($i+1)%$trans_size == 0)
        {
            $database->commit();
            // $pages->uncacheAll();
            // gc_collect_cycles();
            $database->beginTransaction();

        }
    }
    $database->commit();
}

I am quiet new to process wire so feel free to criticize me ?

Thanks in advance

ryan · April 12, 2021

@fedeb That's the largest quantity of pages I've heard of anyone creating in ProcessWire, by a pretty large margin. So you are in somewhat uncharted territory. But that's really cool you are doing that. I would be curious how different the graph would be if you split it up into batches so that you aren't creating more than a certain quantity per execution/runtime. For instance, maybe you create 10k in one execution and another 10k in the next, etc., or something like that. Would the same slowdown still occur? If so, I would start to think it might be the database index and increased overhead in maintaining that index as the quantity increases. On the flip side, if restarting the process to create each set in batches solves the slowdown, then I would think it might be memory or resource related.

A couple things you can do to potentially (?) improve your page creation time:

1. At the top of your code (before the loop) put: $template = $templates->get('protein'); Then within the loop set: $page->template = $template;

2. I don't see a parent page assignment. How are you doing that? Double check that you aren't asking PW to load the parent page every time in the loop and instead handle it like with the template in #1 above.

3. What kind of fields are on your "protein" template? Depending on their type, there may be potential optimizations. Especially if any are Page references. Can you paste in a line or two from the CSV?

4. If you can assign a $page->name = "protein" . $i; rather than having PW auto-generate a name from the title, that will save some resources too.

horst · April 13, 2021

Whats about a #5: $pages->uncache($page) after the $database->commit();
to additionally free some memory.

BitPoet · April 13, 2021

9 hours ago, fedeb said:

Is there a way to avoid the degradation in performance? A stable 200 pages per second would be good enough for me.

Just to add, if the points from @ryan and @horst aren't enough (they should boost import times quite noticeably) you could try dropping the FULLTEXT keys on the relevant fields' tables before the import and recreating them afterwards (ALTER TABLE `field_fieldname` DROP KEY `data` / ALTER TABLE `field_fieldname` ADD FULLTEXT KEY `data` (`data`)).

Finally, a big part of MySQL performance depends on server tuning. The default size for the InnoDB buffer pool (the part of RAM where MySQL holds data and indexes) is rather small at 128MB. If you have a dedicated database server, you can up that to 80% of physical memory to avoid unnecessary disk access.

ryan · April 13, 2021

Unless I'm forgetting something, the $pages->uncache($page); won't help here because $page is a newly created Page that wasn't loaded from the database. So it's not going to be cached either. Uncaching pages is potentially useful when iterating through large groups of existing pages. For instance, if you are rendering or exporting something large from the contents of existing pages, you might like to $pages->uncacheAll() after getting through a thousand of them to clear room for another paginated batch. Though nowadays we have $pages->findMany() and $pages->findRaw(), so there are fewer instances were you would even need to use uncache or uncacheAll, if ever.

ProcessWire actually does an uncacheAll() internally after saving a page already. This is necessary because changes to a page or additions/deletions to the page tree may affect other pages, and we don't want any potential for old cached data to appear in future $pages->find() or other operations. Just one example is if we called $parent->children() before a save, and then after the save called it again, we'd want our new page to be in the children rather than having it return the previously cached value. There are a lot of similar cases, so the safest bet is for PW to uncache the results of future page get/find operations after a save as the default behavior. So that's the way it's always done it.

As far as I can tell from fedeb's example (and often other with import operations), it may be better to tell PW to skip this "uncacheAll-after-save" behavior. That's because imports often involve Page reference fields, and you don't want PW to have to reload referenced pages after every save. So you could potentially reduce overhead by telling it not to uncache after save, i.e. $pages->save($page, [ 'uncacheAll' => false ]); I'm not sure if fedeb's import involves loading of any other pages, whether for page reference fields, or anything else. So it may not matter one way or the other here, but wanted to mention it just in case.

I know about ProcessWire tuning, but not about MySQL server tuning. When dealing with 20 million rows that seems like getting into the territory where optimizations to the DB configuration deserve a lot of focus, so I would bet that BitPoet's suggestions are going to make the most difference.

Hector Nguyen · April 13, 2021

I think you could try the generators here while you are playing with CSV. For .eg

<?php
function getRows($file) {
    $handle = fopen($file, 'rb');
    if ($handle === false) {
        throw new Exception('open file '.$file.' error');
    }
    while (feof($handle) === false) {
        yield fgetcsv($handle);
    }
    fclose($handle);
}
 
// allocate memory for only a single line in the csv file
// do not need the entire csv file is read into memory
$generator = getRows('../data/20_mil_data.csv');
 
// foreach ($it as $row) {print_r($row);}
while ($generator->valid()) {
    print_r($generator->current()); //$generator->current() is your $row
    // playing with ProcessWire here
    $generator->next();
}
$generator->rewind();
 
// http://php.net/manual/en/class.generator.php

That's always my #1 choice while working with big datasets in PHP.

ryan · April 13, 2021

@Hector Nguyen This is cool to see generators in action. Though as far as I know, PHP's fgetcsv() never loads the whole file in memory at the same time, regardless of which method is used to call it. I think it just loads one line at a time (?), but this reminds me that an optimization to fgetcsv() is to tell it what the longest possible line might be (as 2nd argument), so that it doesn't have to figure it out. Fedeb's example has 0 as the 2nd argument to fgetcsv(), which means "let PHP figure it out", so some overhead could be reduced here by giving it a number like 1024 or whatever the largest line length (in bytes) might be. There may be other benefits to using generators here though? I haven't experimented with them much yet so am curious.

fedeb · April 13, 2021

Hi,

Thanks a lot for all the feedback. I did some additional tests based on all of the suggestions you gave me and results are already amazing!!

Figure 1 shows @ryan suggestions tested independently:

1. I created the $template variable outside the loop.

2. I created the $parent variable outside the loop. The boost in performance is surprising! Defining the $parent outside the loop made a huge difference (before I didn't assigned the parent explicitly, it was already defined in the template thus the assignment was automatic)

4. I also tried this suggestion ($page->name = "protein" . $i;) and although it seems to boost a bit performance I didn't include the plot because results were not conclusive. Still I will include this in my code.

Figure 2 is based on @horst suggestion. I tested the impact of calling gc_collect_cycles() and $pages->uncacheAll() after every $database->commit(). I didn't do a test for $pages->uncache($page) because I thought $pages->uncacheAll() was basically the same. Maybe this is not true (?). Results don't show any well defined boost in performance (I guess ryan's recent reply predicted this).

I still need to try @BitPoet suggestion because I am sure this is something that will boost performance. I am now doing this tests on my personal computer. I will do this test when running on the dedicated server. I will also would like to try generators (first time a hear about them )

______________________________________________________________________________________________________________________________________________________________________________________________

One last thing regarding the fields in the protein template and the data structure in general (the pseudo code I posted initially was just as an example).

Proteins are classified into groups. Each protein can belong to more than one groups (max. 5). My original idea was to use repeaters because for each protein I have the following information repeated:

GroupID [integer], start [integer], end [integer], sequence [text]

The idea is that from GroupID you can go to the particular group page (I have around 50k groups) but I don't necessarily need a page reference for this.

The csv is structured as follow. Note that some protein entries are repeated which means that I shouldn't create a new page but add an entry to the repeater field.

Protein-name groupID start end sequence
A0A151DJ30 41 3 94 CPFES[...]]VRQVEK
A0A151DJ30 55 119 140 PWSGD[...]NWPTYKD
A0A0L0D2B9 872 74 326 MPPRV[...]TTKWSKK
V8NIV9 919 547 648 SFKYL[...]LEAKEC
A0A1D2MNM4 927 13 109 GTRVW[...]IYTYCG
A0A1D2MNM4 999 119 437 PWSGDN[...]]RQDTVT
A0A167EE16 1085 167 236 KTYLS[...]YELLTT
A0A0A0M635 1104 189 269 KADQE[...]INLVIV

Since I know repeaters also creates additional overhead I am doing all my benchmarks without them. I can always build the websites without them. In the next days I will do some benchmarks including repeaters just to see how it goes.

Once again, thanks for all the replies!

ryan · April 13, 2021

@fedeb Glad that moving the $parent outside the loop helped there. The reason it helps is because after a $pages->save() is the automatic $pages->uncacheAll(), so the auto-assigned parent from the template is having to be re-loaded on every iteration. By keeping your own copy loaded and assigning it yourself, you are able to avoid that extra overhead in this case.

Avoid getting repeaters involved. I wouldn't even experiment with it here. That will at minimum triple the number of pages (assuming every protein page could have a repeater). Repeaters would be just fine if you were working in the thousands-of-pages territory, but in the millions-of-pages territory, it's not going to be worth even attempting. Using a ProFields table field would be the best alternative if you needed it to be queryable data. If you didn't need it to be queryable data (groupID, start, end, sequence), I would leave them as they are, space-separated in a plain textarea field — they can easily be parsed out at runtime so you can access them as as properties of the page. (If that suits your need, let me know and I'll get into how that can be done). When working at large scale, it's also always good to consider custom building a Fieldtype module for the purpose too (that's another topic, but we can get into it too).

For your groupID, if the same groupID is referenced by multiple proteins, and there is more information about each "group" (other than just an ID) then I think it would make sense for it to be a Page reference field. What is the max number of groupID+start+end+sequence rows that a protein can have? If there is a natural limit and it's not large, then that would open up some new storage possibilities too.

Another optimization you can make in your loop:

$page->sort = $i;

This prevents it from having to detect and auto-assign a sort value based on the quantity of children the parent page has.

For the $page->name, if each page will have a unique "protein-name" then you might also consider using that rather than the ("protein" . $i), as it will be more reflective of the page than a generic index number.

Hector Nguyen · April 14, 2021

14 hours ago, ryan said:

@Hector Nguyen This is cool to see generators in action. Though as far as I know, PHP's fgetcsv() never loads the whole file in memory at the same time, regardless of which method is used to call it. I think it just loads one line at a time (?), but this reminds me that an optimization to fgetcsv() is to tell it what the longest possible line might be (as 2nd argument), so that it doesn't have to figure it out. Fedeb's example has 0 as the 2nd argument to fgetcsv(), which means "let PHP figure it out", so some overhead could be reduced here by giving it a number like 1024 or whatever the largest line length (in bytes) might be. There may be other benefits to using generators here though? I haven't experimented with them much yet so am curious.

@ryan I'm afraid not, fget() in PHP will return one string (a line) but fgetcsv() will return an indexed array. An array in PHP is not very friendly with big dataset because it's still saved into the memory!

The main difference is in memory usage. In the first case all the data are in memory when you return the array. Because of no matter what, if you "returns" values, that will be save in memory.

In the second case you get a lazy approach, and you can iterate the values without keeping all of them in memory. This would be a great benefit when memory is a constraint compared to the size of the sum of all your data.

I've managed to import 4GB CSV file on the Shared VPS 1GB mem in seconds, the bottleneck of "yield" method now is your MySQL.

bernhard · April 14, 2021

@fedeb Very interesting, thx for sharing! ?

@ryan Wouldn't it be great to abstract all that findings/knowledge into the files API? Importing data from CSV is a quite common need and thinking about all the pieces (like utf8 encoding, wrong delimiters etc) can be tedious and does not need to be ?

$file = "/my/large/file.csv";
$tpl = $templates->get('foo');
$parent = $pages->get(123);
$options = [
  'delimiter' => ',',
  'length' => 100,
  'firstLineArrayKeys' => true,
];
while($line = $files->readCSV($file, $options)) {
  $p = $this->wire(new Page());
  $p->template = $tpl;
  $p->parent = $parent;
  $p->title = $line['Foo Column'];
  $p->body = $line['Bar Column'];
  $p->save();
}

Support this request on github: https://github.com/processwire/processwire-requests/issues/400

fedeb · April 14, 2021

@ryan for the time being the data (groupID, start, end, sequence) are not supposed to be queryable. Ideally groupID should be, because I would like to display all proteins belonging to a groupID in the group page but I think I will use a workaround for this: I have a file for each group containing this information which I plan to parse when loading the group page. Individual files have at most 1000 lines (proteins). In this way I avoid querying 20+ million entries each time you try to access a particular group page.

As you suggested I will load each entry (groupID, start, end, sequence) as a text field and then use php explode method to parse it into an array at runtime.

The only doubt is probably on groupID:

Quote

For your groupID, if the same groupID is referenced by multiple proteins, and there is more information about each "group" (other than just an ID) then I think it would make sense for it to be a Page reference field. What is the max number of groupID+start+end+sequence rows that a protein can have? If there is a natural limit and it's not large, then that would open up some new storage possibilities too.

A single groupID can be referenced by multiple proteins and it does contain additional information displayed in their respective group page (I create the group pages separately). The natural limit is around 20 groups although normally this number is 2 or 3 groups per protein. With this setup is it worth using a Page reference field? What are the other storage possibilities?

In the future I think I will end up using ProFields or building a Fieldtype module. For this last approach I think I need to read a bit more about modules since I am new to processwire. This tutorial posted by bernhard is a good start.

@Hector Nguyen if you are not constraint by memory then loading the csv into memory all at once is the way to go, right?

Thanks for all the useful suggestions.

p.s. maybe I am diverging from the original thread. If you prefer I can open a new one.

ryan · April 15, 2021

@Hector Nguyen getcsv() does return an array, but it is an array representing and consuming the memory of just one line from a CSV file (i.e. the columns from 1 row). fgetcsv() is just a layer on top of the fgets() function, which reads one line at a time from a file, which is what makes it memory friendly. On the other hand, PHP functions like file() or file_get_contents() do read the entire file in memory, so they are not memory friendly, even if they are fast.

@fedeb I think the best route to take for your groupID+start+end+sequence would be a custom Fieldtype. This would give you all of the benefits of having a repeater, and without any of the overhead. Custom fieldtype may sound complicated, but it's not at all. I've developed a module that can very easily be adapted for this need. See FieldtypeEvents which was created as an example to build exactly this sort of thing from. If you are interested in that route and have any questions, I'm happy to walk you through it.

HMCB · May 2, 2021

As a designer with limited development experience who found PW a few years back, I can say this discussion has been very encouraging to read. PW has enabled me to make functional sites, but this thread opens my eyes to what’s possible. The last thing you want as someone not well versed in the intricacies of programming is leaving a client in the position of being hampered by the original decision in choosing a platform.

For instance, anyone with a non-biased eye can see the real pain die-hard proponents of Webflow vocalize on their community forum. It’s come to an impasse. Whether it’s the limitation of the number of custom fields, or of the number of related records one can have, or the speed of queries, or the responsiveness of the dev UI slowing to a crawl because even a small dataset can make building a site laborious. PW eliminates those pain points. The challenge is communicating its value to a wider audience. My forte is communications, strategy, and brand building. So I tend to look at brand voice as part of evaluating a product or service.

The great thing is that PW is architecturally sound. Thanks Ryan and everyone for refining it while keeping it true to its ideals.

bernhard · May 2, 2021

I've opened a request issue for my idea above: https://github.com/processwire/processwire-requests/issues/400

If you like that idea please show your interest by giving it a thumb up, thank you ?

HMCB · May 2, 2021

1 hour ago, bernhard said:

I've opened a request issue for my idea above: https://github.com/processwire/processwire-requests/issues/400

If you like that idea please show your interest by giving it a thumb up, thank you ?

If I could give it 1000 thumbs up, I would. I have an upcoming project that I’d love to import my data and not have to think about how best to do it. Although, it would be nice if related fields/records could also be dictated. In my case, I’d import the related records first into PW. Then I’d import the records/pages that need to connect to those related records.

gebeer · May 5, 2021

@fedeb what software did you use to create the performance graphs?

fedeb · May 5, 2021

Hi, I used matplotlib library from Python. Configuration is the default one so with only few lines you get the plots!

elabx · May 6, 2021

Very interesting thread! I've had a couple people pretty amazed too by the volume PW can handle but this case is really interesting!

Sign In

Creating 20 million pages

Recommended Posts

fedeb

ryan

horst

BitPoet

ryan

Hector Nguyen

ryan

fedeb

ryan

Hector Nguyen

bernhard

fedeb

ryan

HMCB

bernhard

HMCB

gebeer

fedeb

elabx

Create an account or sign in to comment

Create an account

Sign in

Recently Browsing 0 members

Browse

Activity

My Activity Streams

Support

Store

My Details