Jump to content

Identifying Potential Duplicates


nbcommunication
 Share

Recommended Posts

Hello,

I'm developing a 'directory' web app which allows users to add their own listings, and admins to review and approve listings. All user interaction is done on a custom front-end - not using the processwire admin. After a deadline date, all listings will be reviewed and exported as custom markup for creating a print edition. 

As part of that process the client needs to be able to identify potential duplicates. I created the script below, which does the following:

  • Takes in a GET variable assigned to $l - a letter of the alphabet
  • gets all the listings starting with $l
  • finds other listings in that PageArray that have the same surname and a matching telephone number (tels is a table field)
  • Returns an array of 'original' ids with duplicate ids within it
if ($l) {

	$l = $sanitizer->selectorValue($l);

	$all_listings = $pages->get(1070)->children('include=all,surname^='.$l);
	$dups = array();

	foreach ($all_listings as $a) {

		$existing = $allr->find('surname='.$sanitizer->selectorValue($a->surname).',tels.value='.$a->tels->implode('|','value').',tels.count>0,id!='.$a->id);
		if (count($existing)) {
			$c = 0;
			foreach ($existing as $e) {
				if (!count($dups[$e->id]))
					$dups[$a->id][$c++] = $e->id;
			}
			
		}

	}

	echo print_r($dups,true);
}

The script above does what it is meant to do. The problem is that there are nearly 5000 listings, and it is really slow - searching through 'A' takes about 15-20 seconds. I'd actually added in the letter filter as it was hitting the PHP timeout without it. I can't add a 'limit=' as in order to find the duplicates it needs to check all the listings.

Are there better ways to do this? I had initially tried a direct MySQL query, but my knowledge of this is pretty basic, and gave up when it got to trying to match the 'tels'.

Cheers (and thanks for the best system going!),

Chris

Link to comment
Share on other sites

Hi @nbcommunication and welcome to the forums.

Try using ->count() instead of ->find(), eg:

$existing = $allr->count('surname='.$sanitizer->selectorValue($a->surname).',tels.value='.$a->tels->implode('|','value').',tels.count>0,id!='.$a->id);

But, a direct SQL query might be what you really need - it might be worth having another look at that approach.

Link to comment
Share on other sites

Hi adrian,

Thanks for the reply - I'd hoped to be able to use count, but I need to return the ids of the duplicates too :(

Right - I thought about that after posting, but I wonder if it still might be better to do count first and if there is a result, then find it.

if ($l) {

    $l = $sanitizer->selectorValue($l);

    $all_listings = $pages->get(1070)->children('include=all,surname^='.$l);
    $dups = array();

    foreach ($all_listings as $a) {

        $selector = 'surname='.$sanitizer->selectorValue($a->surname).',tels.value='.$a->tels->implode('|','value').',tels.count>0,id!='.$a->id;

        $existingCount = $allr->count($selector);

        if ($existingCount > 0) {
            $existing = $allr->find($selector);
            $c = 0;
            foreach ($existing as $e) {
                if (!count($dups[$e->id]))
                    $dups[$a->id][$c++] = $e->id;
            }
            
        }

    }

    echo print_r($dups,true);
}
Link to comment
Share on other sites

Hi Chris,

Welcome to the forums.

In addition to what @adrian said, I am wondering why you can't check for and reject duplicates at the point of creation? If listings are pages and they all share a single parent (id #1070) and they are created by users, you should be able to reject listings when they are being created...Something like:

$p = new Page();
$p->title = $sanitizer->text('Some Title');
$p->parent = $pages->get(1070);
$p->template = $templates->get('listing');
$p->surname = $sanitizer->text($l);// or however you sanitize

if($p->title) $p->name = $sanitizer->pageName($p->title);//sanitize and convert to a URL friendly page name
						
//check if name and surname already taken within same parent
if($p->parent->child("name={$p->name}, surname= $p->surname, include=all")->id) {			
	// if the same name and surname exist, don't create the listing page
	$msg = 'That listing already exists';
	
}
else {
	$p->save();
	$msg = 'Successfully created listing';
}
		
echo $msg;

This way, you will never have duplicates. 

What about shared surnames?

Edited by kongondo
  • Like 3
Link to comment
Share on other sites

Hi,

There are a few 'duplication' measures in place.

  • When a listing is submitted, the page name is logged if it finds a listing with the same name (which is a ->pageName() version of the title and address), and appends the datetime to the new listing
  • when a listing is approved by the client admin, it archives any exact duplicates it finds (from a previous version of the directory already in place).

The problem is that it needs to be able to identify potential duplicates at the time they are reviewing the listings...

I'll look into the MySQL query stuff - thanks guys for your help so far!

Cheers,

Chris

Link to comment
Share on other sites

Obviously I don't get the full picture but a couple of things come to mind..

  • Use IDs to log duplicates?
  • At review stage, use limit and pagination to get listings in batches (i.e. don't get all the 5K children!), ordered by surname or other criteria that will ensure duplicates appear on the same list then do whatever...?
  • Like 1
Link to comment
Share on other sites

Hey there,

I've devised another (DB query) solution that works better. Basically it searches a field table for duplicate values using this function:

function dbQuery($table, $column)
{

	$o = array();

	$q1 = wire('db')->query("
		SELECT `$column`,
		COUNT(`$column`) as c
		FROM `field_$table`
		GROUP BY `$column`
		HAVING c > 1
	");

	while ($r1 = $q1->fetch_array()) {

		$v = addSlashes(trim($r1[$column]));
		$o[$v] = array();

		$q2 = wire('db')->query("
			SELECT `pages_id`
			FROM `field_$table`
			WHERE `$column` = '$v'
		");

		while ($r2 = $q2->fetch_array()) $o[$v][] = $r2['pages_id'];

	}

	return $o;

}

It then cycles through page ids, filtering out any pages that haven't been modified since a certain date (to do with the review period / 2 year lifespan of listing data), and also sorting them into two different arrays (residential listings and commercial listings). It then pops out the duplicates.

I'm using this to filter by the following fields:

  • title - dbQuery('title', 'data')
  • tels - dbQuery('tels', 'value') - (table field)
  • email - dbQuery('email', 'data')
  • mobiles - dbQuery('mobiles', 'value') - (table field)

It seems to be working well and it is a lot faster than what I had previously!

Cheers,

Chris

Link to comment
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
 Share

  • Recently Browsing   0 members

    • No registered users viewing this page.
×
×
  • Create New...