Identifying Potential Duplicates

nbcommunication · February 19, 2016

Hello,

I'm developing a 'directory' web app which allows users to add their own listings, and admins to review and approve listings. All user interaction is done on a custom front-end - not using the processwire admin. After a deadline date, all listings will be reviewed and exported as custom markup for creating a print edition.

As part of that process the client needs to be able to identify potential duplicates. I created the script below, which does the following:

Takes in a GET variable assigned to $l - a letter of the alphabet
gets all the listings starting with $l
finds other listings in that PageArray that have the same surname and a matching telephone number (tels is a table field)
Returns an array of 'original' ids with duplicate ids within it

if ($l) {

	$l = $sanitizer->selectorValue($l);

	$all_listings = $pages->get(1070)->children('include=all,surname^='.$l);
	$dups = array();

	foreach ($all_listings as $a) {

		$existing = $allr->find('surname='.$sanitizer->selectorValue($a->surname).',tels.value='.$a->tels->implode('|','value').',tels.count>0,id!='.$a->id);
		if (count($existing)) {
			$c = 0;
			foreach ($existing as $e) {
				if (!count($dups[$e->id]))
					$dups[$a->id][$c++] = $e->id;
			}
			
		}

	}

	echo print_r($dups,true);
}

The script above does what it is meant to do. The problem is that there are nearly 5000 listings, and it is really slow - searching through 'A' takes about 15-20 seconds. I'd actually added in the letter filter as it was hitting the PHP timeout without it. I can't add a 'limit=' as in order to find the duplicates it needs to check all the listings.

Are there better ways to do this? I had initially tried a direct MySQL query, but my knowledge of this is pretty basic, and gave up when it got to trying to match the 'tels'.

Cheers (and thanks for the best system going!),

Chris

adrian · February 19, 2016

Hi @nbcommunication and welcome to the forums.

Try using ->count() instead of ->find(), eg:

$existing = $allr->count('surname='.$sanitizer->selectorValue($a->surname).',tels.value='.$a->tels->implode('|','value').',tels.count>0,id!='.$a->id);

But, a direct SQL query might be what you really need - it might be worth having another look at that approach.

nbcommunication · February 19, 2016

Hi adrian,

Thanks for the reply - I'd hoped to be able to use count, but I need to return the ids of the duplicates too

adrian · February 19, 2016

Hi adrian,

Thanks for the reply - I'd hoped to be able to use count, but I need to return the ids of the duplicates too

Right - I thought about that after posting, but I wonder if it still might be better to do count first and if there is a result, then find it.

if ($l) {

    $l = $sanitizer->selectorValue($l);

    $all_listings = $pages->get(1070)->children('include=all,surname^='.$l);
    $dups = array();

    foreach ($all_listings as $a) {

        $selector = 'surname='.$sanitizer->selectorValue($a->surname).',tels.value='.$a->tels->implode('|','value').',tels.count>0,id!='.$a->id;

        $existingCount = $allr->count($selector);

        if ($existingCount > 0) {
            $existing = $allr->find($selector);
            $c = 0;
            foreach ($existing as $e) {
                if (!count($dups[$e->id]))
                    $dups[$a->id][$c++] = $e->id;
            }
            
        }

    }

    echo print_r($dups,true);
}

kongondo · February 19, 2016

Hi Chris,

Welcome to the forums.

In addition to what @adrian said, I am wondering why you can't check for and reject duplicates at the point of creation? If listings are pages and they all share a single parent (id #1070) and they are created by users, you should be able to reject listings when they are being created...Something like:

$p = new Page();
$p->title = $sanitizer->text('Some Title');
$p->parent = $pages->get(1070);
$p->template = $templates->get('listing');
$p->surname = $sanitizer->text($l);// or however you sanitize

if($p->title) $p->name = $sanitizer->pageName($p->title);//sanitize and convert to a URL friendly page name
						
//check if name and surname already taken within same parent
if($p->parent->child("name={$p->name}, surname= $p->surname, include=all")->id) {			
	// if the same name and surname exist, don't create the listing page
	$msg = 'That listing already exists';
	
}
else {
	$p->save();
	$msg = 'Successfully created listing';
}
		
echo $msg;

This way, you will never have duplicates.

What about shared surnames?

Edited February 19, 2016 by kongondo

nbcommunication · February 19, 2016

Hi,

There are a few 'duplication' measures in place.

When a listing is submitted, the page name is logged if it finds a listing with the same name (which is a ->pageName() version of the title and address), and appends the datetime to the new listing
when a listing is approved by the client admin, it archives any exact duplicates it finds (from a previous version of the directory already in place).

The problem is that it needs to be able to identify potential duplicates at the time they are reviewing the listings...

I'll look into the MySQL query stuff - thanks guys for your help so far!

Cheers,

Chris

kongondo · February 19, 2016

Obviously I don't get the full picture but a couple of things come to mind..

Use IDs to log duplicates?
At review stage, use limit and pagination to get listings in batches (i.e. don't get all the 5K children!), ordered by surname or other criteria that will ensure duplicates appear on the same list then do whatever...?

nbcommunication · February 22, 2016

Hey there,

I've devised another (DB query) solution that works better. Basically it searches a field table for duplicate values using this function:

function dbQuery($table, $column)
{

	$o = array();

	$q1 = wire('db')->query("
		SELECT `$column`,
		COUNT(`$column`) as c
		FROM `field_$table`
		GROUP BY `$column`
		HAVING c > 1
	");

	while ($r1 = $q1->fetch_array()) {

		$v = addSlashes(trim($r1[$column]));
		$o[$v] = array();

		$q2 = wire('db')->query("
			SELECT `pages_id`
			FROM `field_$table`
			WHERE `$column` = '$v'
		");

		while ($r2 = $q2->fetch_array()) $o[$v][] = $r2['pages_id'];

	}

	return $o;

}

It then cycles through page ids, filtering out any pages that haven't been modified since a certain date (to do with the review period / 2 year lifespan of listing data), and also sorting them into two different arrays (residential listings and commercial listings). It then pops out the duplicates.

I'm using this to filter by the following fields:

title - dbQuery('title', 'data')
tels - dbQuery('tels', 'value') - (table field)
email - dbQuery('email', 'data')
mobiles - dbQuery('mobiles', 'value') - (table field)

It seems to be working well and it is a lot faster than what I had previously!

Cheers,

Chris

Sign In

Identifying Potential Duplicates

Recommended Posts

nbcommunication

Link to comment

Share on other sites

adrian

Link to comment

Share on other sites

nbcommunication

Link to comment

Share on other sites

adrian

Link to comment

Share on other sites

kongondo

Link to comment

Share on other sites

nbcommunication

Link to comment

Share on other sites

kongondo

Link to comment

Share on other sites

nbcommunication

Link to comment

Share on other sites

Create an account or sign in to comment

Create an account

Sign in

Recently Browsing 0 members

Browse

Activity

My Activity Streams

Store

My Details

Support