Harmster Posted October 25, 2012

Hey,

I tried to do something on my website (running on ProcessWire, of course) and got a memory error when I looped through an array with 3xxx records; when I commented that part out, the error went away. I asked a co-worker to have a look at this, and he asked me whether ProcessWire has some kind of DISTINCT selector?

Kind regards,
Harm.
nik Posted October 25, 2012

ProcessWire removes duplicates by grouping results by page id, so a find already returns only distinct pages. No need for a DISTINCT there. Those 3000 records shouldn't include any duplicates, assuming they come from a single $pages->find() call.

The solution would be to add more restrictive selectors to the find itself. If that's not possible, then pagination is the way to go: add start and limit selectors to get records 0-999 (start=0, limit=1000) and loop, increasing start by 1000 on every iteration, until you get back fewer than 1000 rows. This way you'll have fewer Page objects in memory at the same time. Something like 500-1000 rows at a time should be fine, but that depends on how heavy your data is. (Actually, only autojoin fields and the fields you're accessing count here.) You may need to call $pages->uncacheAll() after every iteration to flush the previous iteration's Page objects from memory.

No example at the moment, sorry, got to go for now.
Harmster Posted October 25, 2012

Sorry for my poor explanation, but this is not quite what I need. I want to foreach through those pages and get unique values not only for the title but also for the other fields on the page. The page contains:

- title
- name
- action
- value
- result
- users

etc. Now I only want to retrieve one unique value for each of those fields. Let's say I have 3 pages with the same result twice: I only want to get that result once.

I hope it's more clear now.

Kind regards,
Harm
nik Posted October 25, 2012

That's more or less what I thought you were trying to do - so I'm thinking it was me who wasn't clear enough. Because of memory limitations you can't foreach through the whole 3000-page result set in one go, just like you said in the first place. Instead you can loop through the very same 3000 pages, but in 500-page pieces.

In a hurry again. Not sure about the syntax and definitely not tested, but you'll get the idea:

$start = 0;
$limit = 500;
do {
    // replace "..." with your actual selector
    $results = $pages->find("..., start=$start, limit=$limit");
    foreach ($results as $resultPage) {
        // do your magic here; collect the matching results into another PageArray maybe?
    }
    // free some memory
    $pages->uncacheAll();
    // advance to the next set
    $start = $start + $limit;
} while (count($results) > $limit);

Hope this helps. There could be another variable to make sure the do-while doesn't go crazy, but I left that out for now.
ryan Posted October 26, 2012

For the other part of it, preventing the duplicates: as mentioned before, the pages you retrieve are already going to be unique. But like you said, you may have duplicate field values even among unique pages. Let's say you wanted to guarantee the uniqueness of a 'title' field, for example:

$uniqueResults = array();
foreach($results as $resultPage) {
    $uniqueResults[$resultPage->title] = $resultPage;
}

After that, there will be no pages with duplicate titles in your $uniqueResults array. You can take the same approach with any other field(s). Just make sure the field resolves to a string before using it as a key in your $uniqueResults array.
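(If uniqueness needs to hold across several fields at once, the same idea extends to a composite key; a minimal sketch, assuming fields like 'action' and 'result' from the template mentioned above, both resolving to strings:)

$uniqueResults = array();
foreach($results as $resultPage) {
    // build a composite key from the fields that together must be unique
    $key = $resultPage->action . '|' . $resultPage->result;
    $uniqueResults[$key] = $resultPage;
}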
Harmster Posted October 29, 2012

Okay, I think my lack of English and the info I've given here isn't really getting me to a solution, I am sorry. Here's the code:

if($_SERVER['REQUEST_METHOD'] == 'POST') {
    $selection = "";
    if(!empty($input->post->ip)) {
        $ip = $input->post->ip;
        $selection .= ", ip=$ip";
    }
    if(!empty($input->post->action)) {
        $action = $input->post->action;
        $selection .= ", action=$action";
    }
    if(!empty($input->post->user)) {
        $us = $input->post->user;
        $selection .= ", users=$us";
    }
    if(!empty($input->post->result)) {
        $result = $input->post->result;
        $selection .= ", result=$result";
    }
    if(!empty($input->post->value)) {
        $value = $input->post->value;
        $selection .= ", value=$value";
    }
    $log_files = $pages->find("template=logs". $selection .", start=$start, limit=$limit, sort=-datetime");
    $all_log_files = $pages->find("template=logs". $selection ."");
    if(count($log_files) == 0) {
        $mainContent .= "Er zijn geen resultaten gevonden"; // Dutch: "No results were found"
    }
} else {
    $log_files = $pages->find("template=logs, start=$start, limit=$limit, sort=-datetime");
    $actions = array();
    $ips = array();
    $values = array();
    $results = array();
    foreach($all_log_files as $log_file) {
        if(empty($actions["$log_file->action"])) {
            $actions[$log_file->action] = 1;
        }
        if(empty($ips["$log_file->ip"])) {
            $ips[$log_file->ip] = 1;
        }
        if(empty($values["$log_file->value"])) {
            $values[$log_file->value] = 1;
        }
        if(empty($results["$log_file->result"])) {
            $results[$log_file->result] = 1;
        }
    }

There are around ~3000-5000 pages. This gives me an error:

Fatal error: Allowed memory size of 33554432 bytes exhausted (tried to allocate 45 bytes) in xxx/xxx/xxx/wire/core/Page.php on line 311
ryan Posted October 31, 2012

Memory isn't an unlimited resource, so there is always a limit to how many pages you can keep in memory at once. You'd need to code this in a scalable manner, which means finding a way to do it that doesn't require loading thousands of pages into memory at once. I'm not sure I understand the code example well enough to suggest an alternative, but you can always go directly in with an SQL query ($db->query) if you need to do something that you can't accomplish with $pages->find().

One other thing I want to mention is that your $selection variable here is open to selector injection. Make sure you run any values you get through $sanitizer->selectorValue(), or a more specific $sanitizer function, before placing them in a selector string. For example:

// use selectorValue() when the value will be a freeform string
$ip = $sanitizer->selectorValue($input->post->ip);
$selection .= ", ip=$ip";

// sanitize with pageName() when you expect the value to be [-_.a-z0-9]
$action = $sanitizer->pageName($input->post->action);
$selection .= ", action=$action";

// typecast to an integer when you expect the value to be a number
$us = (int) $input->post->user;
$selection .= ", users=$us";
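(To sketch the $db->query route mentioned above: in this ProcessWire version $db wraps mysqli, and each field is stored in its own field_<name> table with a data column, so the distinct values of a single field could be pulled directly; the field name 'action' and its table field_action are assumptions to adapt to your own setup:)

$distinctActions = array();
$result = $db->query("SELECT DISTINCT data FROM field_action ORDER BY data");
while($row = $result->fetch_assoc()) {
    $distinctActions[] = $row['data']; // one entry per distinct stored value
}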
diogo Posted September 9, 2013

I had the need to iterate through a lot (and I mean A LOT) of pages by template, and was also having memory problems. With this amount of pages it was becoming really annoying to do it a few at a time, so I came up with a solution that doesn't need a WireArray. With this I could iterate over more than 50,000 pages with the same template on a website with more than 100,000 pages by bootstrapping PW from the command line. I just had to set a bigger time limit for PHP with set_time_limit(), and everything went fine and without interruptions.

$id = 0; // start below the first page id so the first get() matches
while (1) {
    $p = wire('pages')->get("template=my_template, id>$id"); // get a page with an id bigger than the previous one
    if(!$id = $p->id) break; // assign the current page's id to $id, or break the loop if no page was found
    // do stuff using $p as the current page
    wire('pages')->uncacheAll(); // clean up memory (with $p->uncache() it doesn't work. why?)
};

edit: actually, using the command line I don't think set_time_limit() is even needed.
dragan Posted September 15, 2013

pls disregard this altogether. I decided to do it "the PW way" and use sub-categories (sub-pages) instead. First of all, it looks cleaner in the backend, and I don't have to construct over-complicated queries anymore. The import process via CSV now takes 2 more steps, but that's OK, i.e.:

1. Create the sub-category "container" pages
2. Create the pages
3. Move the pages to their containers

With my initial idea the import process would have been a lot easier, but I like a clean content overview in the backend... and I guess my client does too.

Is there a way to do something like a GROUP BY / DISTINCT when I don't know the actual value? I have plenty of products in the same page hierarchy, and one field is product_type_group. Several pages can share the same value. I'd like to get just the first one and skip the others. I already have a dozen product groups within "products", but within these I'd prefer not to create another hierarchy level. I know how I'd do it without API methods, but I wanted to know whether I'm missing a special selector / API function.
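(For anyone landing here with the same question: one way to keep just the first page per distinct product_type_group value is the array-keyed approach ryan showed earlier; a minimal sketch with a hypothetical template=product selector, assuming product_type_group resolves to a string, and paginated as nik suggested if the product count is large:)

$firstPerGroup = array();
foreach($pages->find("template=product") as $product) {
    $group = (string) $product->product_type_group;
    // keep only the first product seen for each distinct group value
    if(!isset($firstPerGroup[$group])) $firstPerGroup[$group] = $product;
}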
Hari KT Posted April 11, 2014

Hey, I was also looking for this. The reason I'm searching is that I have a template with a field of type Page, and this Page field's value can be the same or different across pages. I want to get only the unique Page field values from the template.

Edit: I know I can get the results and traverse them. The reason I'm asking is that the selector I made has a limit of 5, and I don't know whether loading a PageArray of thousands will be a drawback and make comparing the results slow.

Thank you.
teppo Posted April 11, 2014

"I was also looking for this. The reason I'm searching is that I have a template with a field of type Page, and this Page field's value can be the same or different across pages. I want to get only the unique Page field values from the template."

If you need to handle a large quantity of pages, I'd probably rely on SQL. It sounds like a rather trivial task that way, though this, of course, depends on what you're actually after. If I'm reading your post correctly and it's just selected pages you're looking for:

SELECT GROUP_CONCAT(data SEPARATOR '|') data
FROM (SELECT DISTINCT data FROM field_myfield ORDER BY data LIMIT 5) f;

After that you've got a list of pages you can pass to $pages->find().. though I don't quite understand why you'd want to do this with the limit, so there's probably something I'm misinterpreting here. I hope you get the point anyway.

IMHO it's questionable whether selectors should even be able to handle every imaginable task. This, for example, seems like quite a rare need to me (and is already easily solved by either a loop or SQL). Selectors are good at finding pages in general, while finding distinct values, even when those values are later used for finding other pages, sounds like a job for something entirely different -- or some kind of combination of SQL and selectors.
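(A rough sketch of feeding that query's result back into a selector, assuming $db from the earlier posts and a Page field whose field_myfield table stores page ids in its data column:)

$sql = "SELECT GROUP_CONCAT(data SEPARATOR '|') AS ids FROM (SELECT DISTINCT data FROM field_myfield ORDER BY data LIMIT 5) f";
$result = $db->query($sql);
list($ids) = $result->fetch_row(); // e.g. "1021|1034|1040"
if($ids) $distinctPages = $pages->find("id=$ids"); // the pipe acts as OR in a selector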
Hari KT Posted April 12, 2014

Hi @teppo, it's really nice to know we can dig into SQL. I wasn't really aware of where this data is stored. I will check out your idea; it sounds good to me. Thank you.
gebeer Posted April 4, 2016

$start = 0;
$limit = 500;
do {
    // replace "..." with your actual selector
    $results = $pages->find("..., start=$start, limit=$limit");
    foreach ($results as $resultPage) {
        // do your magic here; collect the matching results into another PageArray maybe?
    }
    // free some memory
    $pages->uncacheAll();
    // advance to the next set
    $start = $start + $limit;
} while (count($results) > $limit);

Thank you nik, for that code. I am using it and it works quite well. The only thing that needs to be changed is the while condition, from

while (count($results) > $limit);

to

while (count($results) == $limit);

Otherwise the original condition is never true (count($results) can't exceed $limit), so the loop stops after the first batch instead of running through all of them.