
Selector to find pages with identical value in a field?


Webrocker

Hi,

I'm still testing my idea to port a 1999 CGI-script-based forum, which has been spaghetti-code "enhanced" over the last 15 years, over to ProcessWire (yeah, don't ask :-)). Currently I have ~4,000 users and ~190,000 "legacypost" pages imported via the API, and I have to say I'm pretty impressed by how easily this went. The legacy posts have fields like "rootpostid", "parentpostid" and "authorid", which are populated with the values from the old forum. Then I created a script that created new "post" pages and filled "pages" fields like "postauthor", "postroot" and "postparent" with data relating to my ProcessWire install, plus of course the post's real data like title, body etc.

This also went pretty well (I'm amazed how nicely the Lister behaves, even with about 400,000 pages under the belt now), but during this page creation a few thousand duplicate posts were created, most likely because I accidentally fired the import twice (I imported the original posts in batches). I can identify them by their "legacy_id" - there shouldn't be two "post" pages with the same "legacy_id".

Is there a "ProcessWire" way to find those, or is this a job for plain MySQL?

I have some other issues with the design of my "post" page data (by using page fields that query the whole bunch of "post" pages, I think I grind PHP to a halt), but I'll create a separate topic for that and keep this one to the "duplicates" question.

cheers,
Tom


Hi Webrocker,

A direct MySQL query is the way to go here. The solution from LostKobrakai works; in the context of ProcessWire the query looks like this:

SELECT pages_id, COUNT(data) AS count FROM field_legacy_id GROUP BY data HAVING count > 1

Now you can extract the page IDs from the result and delete those pages via the Pw API - for each of them another post with the same legacy_id still exists. Note that if a count is greater than 2 somewhere, you'd need to run the query (and the deletion) more than once.
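
For example, roughly like this from a template or bootstrap script (just an untested sketch; it assumes your field's table is really named field_legacy_id and uses the usual $database and $pages API variables):

$result = $database->query("
    SELECT pages_id, COUNT(data) AS count
    FROM field_legacy_id
    GROUP BY data
    HAVING count > 1
");

foreach ($result->fetchAll(PDO::FETCH_COLUMN) as $id) {
    $p = $pages->get((int) $id);
    if ($p->id) {
        // trash keeps them recoverable; once you're sure,
        // $pages->delete($p, true) removes them for good
        $pages->trash($p);
    }
}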


After identifying the page ids with the above, I'm now trying to actually get rid of them. The problem is that when using the API, the PHP script timeout gets triggered - there are now ~400,000 pages involved… So I'm trying to get this done directly with MySQL, but I ran into some performance issues there as well.

SELECT id FROM pages p, field_as_post_legacy_id lid
  WHERE p.id=lid.pages_id 
    AND p.parent_id=1035 
  GROUP BY lid.data HAVING COUNT(lid.data) > 1

This returns about 10,000 ids of duplicate pages that can be kicked out of the pages table.

If I simply put them into a subquery like this:

DELETE FROM pages
  WHERE id IN (
    SELECT id 
      FROM pages p, field_as_post_legacy_id lid 
      WHERE p.id=lid.pages_id 
        AND p.parent_id=1035 
      GROUP BY lid.data HAVING COUNT(lid.data) > 1
  );

it won't work, because MySQL doesn't allow the same table in the DELETE and in the SELECT FROM of the subquery.

As a workaround I duplicated the pages table, and now this

DELETE FROM pages 
WHERE id IN (
  SELECT id FROM pages_backup p, field_as_post_legacy_id lid 
    WHERE p.id=lid.pages_id 
      AND p.parent_id=1035 
    GROUP BY lid.data HAVING COUNT(lid.data) > 1
);

starts, but eventually this too will time out.

I also tried to put this into one query, but the "GROUP BY ... HAVING ..." seems to throw an error when used in the DELETE statement. :-/

DELETE p FROM pages p, field_as_post_legacy_id lid 
  WHERE p.id=lid.pages_id 
  AND p.parent_id=1035 
  GROUP BY lid.data HAVING COUNT(lid.data) > 1

As you can probably guess by now, this MySQL query crafting thing is not one of my greatest skills, so if anyone has some hints on how to get this to work without grinding the MySQL server to a halt ;-), I would be very thankful :-)

cheers
Tom


I'd still delete them with the ProcessWire API. This makes sure that the related data in the "field_xx" tables, as well as any files, also gets deleted properly.

To solve the timeout issues, I see two options:

  • Run the "cleanup script" via the command line by bootstrapping Pw; usually the timeout is no problem there, as Apache isn't involved.
  • Group your pages into chunks with the help of "start" and "limit" in your selector.

Or both together :)
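
For example (a rough, untested sketch only - the index.php path needs adjusting, and the table name and parent id are the ones from your queries above):

<?php
// cleanup.php - run from the shell with "php cleanup.php",
// so the web server's script timeout doesn't apply.
include __DIR__ . '/index.php'; // bootstrap ProcessWire (adjust the path)

$database = wire('database');
$pages    = wire('pages');

// Same duplicate query as above; each run picks one id per duplicate group.
$ids = $database->query("
    SELECT p.id
    FROM pages p
    JOIN field_as_post_legacy_id lid ON p.id = lid.pages_id
    WHERE p.parent_id = 1035
    GROUP BY lid.data
    HAVING COUNT(lid.data) > 1
")->fetchAll(PDO::FETCH_COLUMN);

// Work through the ids in chunks of 1000 so a single run stays manageable.
foreach (array_chunk($ids, 1000) as $chunk) {
    foreach ($chunk as $id) {
        $p = $pages->get((int) $id);
        if ($p->id) $pages->trash($p);
    }
    $pages->uncacheAll(); // free the page cache between chunks
}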

Cheers


Hi Wanze,

Thanks for confirming my own nebulous assumptions. At the moment the cleanup script is running in batches of 1000 and trashes (not deletes) the pages via the API. Good idea to call the script via the command line - why haven't I thought of that :-)

