nbcommunication Posted February 19, 2016 Share Posted February 19, 2016 Hello, I'm developing a 'directory' web app which allows users to add their own listings, and admins to review and approve listings. All user interaction is done on a custom front-end - not using the processwire admin. After a deadline date, all listings will be reviewed and exported as custom markup for creating a print edition. As part of that process the client needs to be able to identify potential duplicates. I created the script below, which does the following: Takes in a GET variable assigned to $l - a letter of the alphabet gets all the listings starting with $l finds other listings in that PageArray that have the same surname and a matching telephone number (tels is a table field) Returns an array of 'original' ids with duplicate ids within it if ($l) { $l = $sanitizer->selectorValue($l); $all_listings = $pages->get(1070)->children('include=all,surname^='.$l); $dups = array(); foreach ($all_listings as $a) { $existing = $allr->find('surname='.$sanitizer->selectorValue($a->surname).',tels.value='.$a->tels->implode('|','value').',tels.count>0,id!='.$a->id); if (count($existing)) { $c = 0; foreach ($existing as $e) { if (!count($dups[$e->id])) $dups[$a->id][$c++] = $e->id; } } } echo print_r($dups,true); } The script above does what it is meant to do. The problem is that there are nearly 5000 listings, and it is really slow - searching through 'A' takes about 15-20 seconds. I'd actually added in the letter filter as it was hitting the PHP timeout without it. I can't add a 'limit=' as in order to find the duplicates it needs to check all the listings. Are there better ways to do this? I had initially tried a direct MySQL query, but my knowledge of this is pretty basic, and gave up when it got to trying to match the 'tels'. Cheers (and thanks for the best system going!), Chris Link to comment Share on other sites More sharing options...
adrian Posted February 19, 2016 Share Posted February 19, 2016 Hi @nbcommunication and welcome to the forums. Try using ->count() instead of ->find(), eg: $existing = $allr->count('surname='.$sanitizer->selectorValue($a->surname).',tels.value='.$a->tels->implode('|','value').',tels.count>0,id!='.$a->id); But, a direct SQL query might be what you really need - it might be worth having another look at that approach. Link to comment Share on other sites More sharing options...
nbcommunication Posted February 19, 2016 Author Share Posted February 19, 2016 Hi adrian, Thanks for the reply - I'd hoped to be able to use count, but I need to return the ids of the duplicates too Link to comment Share on other sites More sharing options...
adrian Posted February 19, 2016 Share Posted February 19, 2016 Hi adrian, Thanks for the reply - I'd hoped to be able to use count, but I need to return the ids of the duplicates too Right - I thought about that after posting, but I wonder if it still might be better to do count first and if there is a result, then find it. if ($l) { $l = $sanitizer->selectorValue($l); $all_listings = $pages->get(1070)->children('include=all,surname^='.$l); $dups = array(); foreach ($all_listings as $a) { $selector = 'surname='.$sanitizer->selectorValue($a->surname).',tels.value='.$a->tels->implode('|','value').',tels.count>0,id!='.$a->id; $existingCount = $allr->count($selector); if ($existingCount > 0) { $existing = $allr->find($selector); $c = 0; foreach ($existing as $e) { if (!count($dups[$e->id])) $dups[$a->id][$c++] = $e->id; } } } echo print_r($dups,true); } Link to comment Share on other sites More sharing options...
kongondo Posted February 19, 2016 Share Posted February 19, 2016 (edited) Hi Chris, Welcome to the forums. In addition to what @adrian said, I am wondering why you can't check for and reject duplicates at the point of creation? If listings are pages and they all share a single parent (id #1070) and they are created by users, you should be able to reject listings when they are being created...Something like: $p = new Page(); $p->title = $sanitizer->text('Some Title'); $p->parent = $pages->get(1070); $p->template = $templates->get('listing'); $p->surname = $sanitizer->text($l);// or however you sanitize if($p->title) $p->name = $sanitizer->pageName($p->title);//sanitize and convert to a URL friendly page name //check if name and surname already taken within same parent if($p->parent->child("name={$p->name}, surname= $p->surname, include=all")->id) { // if the same name and surname exist, don't create the listing page $msg = 'That listing already exists'; } else { $p->save(); $msg = 'Successfully created listing'; } echo $msg; This way, you will never have duplicates. What about shared surnames? Edited February 19, 2016 by kongondo 3 Link to comment Share on other sites More sharing options...
nbcommunication Posted February 19, 2016 Author Share Posted February 19, 2016 Hi, There are a few 'duplication' measures in place. When a listing is submitted, the page name is logged if it finds a listing with the same name (which is a ->pageName() version of the title and address), and appends the datetime to the new listing when a listing is approved by the client admin, it archives any exact duplicates it finds (from a previous version of the directory already in place). The problem is that it needs to be able to identify potential duplicates at the time they are reviewing the listings... I'll look into the MySQL query stuff - thanks guys for your help so far! Cheers, Chris Link to comment Share on other sites More sharing options...
kongondo Posted February 19, 2016 Share Posted February 19, 2016 Obviously I don't get the full picture but a couple of things come to mind.. Use IDs to log duplicates? At review stage, use limit and pagination to get listings in batches (i.e. don't get all the 5K children!), ordered by surname or other criteria that will ensure duplicates appear on the same list then do whatever...? 1 Link to comment Share on other sites More sharing options...
nbcommunication Posted February 22, 2016 Author Share Posted February 22, 2016 Hey there, I've devised another (DB query) solution that works better. Basically it searches a field table for duplicate values using this function: function dbQuery($table, $column) { $o = array(); $q1 = wire('db')->query(" SELECT `$column`, COUNT(`$column`) as c FROM `field_$table` GROUP BY `$column` HAVING c > 1 "); while ($r1 = $q1->fetch_array()) { $v = addSlashes(trim($r1[$column])); $o[$v] = array(); $q2 = wire('db')->query(" SELECT `pages_id` FROM `field_$table` WHERE `$column` = '$v' "); while ($r2 = $q2->fetch_array()) $o[$v][] = $r2['pages_id']; } return $o; } It then cycles through page ids, filtering out any pages that haven't been modified since a certain date (to do with the review period / 2 year lifespan of listing data), and also sorting them into two different arrays (residential listings and commercial listings). It then pops out the duplicates. I'm using this to filter by the following fields: title - dbQuery('title', 'data') tels - dbQuery('tels', 'value') - (table field) email - dbQuery('email', 'data') mobiles - dbQuery('mobiles', 'value') - (table field) It seems to be working well and it is a lot faster than what I had previously! Cheers, Chris Link to comment Share on other sites More sharing options...
Recommended Posts
Create an account or sign in to comment
You need to be a member in order to leave a comment
Create an account
Sign up for a new account in our community. It's easy!
Register a new accountSign in
Already have an account? Sign in here.
Sign In Now