Webrocker Posted August 15, 2015

Hi, I'm still testing my idea of porting a 1999 CGI-script-based forum, which has been spaghetti-code "enhanced" over the last 15 years, over to ProcessWire. (Yeah, don't ask.)

Currently I have ~4,000 users and ~190,000 "legacypost" pages imported via the API, and I have to say I'm pretty impressed by how easily that went. The legacy posts have fields like "rootpostid", "parentpostid" and "authorid", populated with the values from the old forum.

I then wrote a script that created new "post" pages and filled page fields like "postauthor", "postroot" and "postparent" with data relating to my ProcessWire install, plus of course the post's real data (title, body, etc.). This also went pretty well (I'm amazed how nicely the Lister behaves, even with about 400,000 pages under its belt), but during this page creation some thousand duplicate posts were created, most likely because I accidentally fired the import twice (I imported the original posts in batches).

I can identify the duplicates by their "legacy_id" - there shouldn't be two "post" pages with the same "legacy_id". Is there a "ProcessWire" way to find those, or is this a job for plain MySQL?

I have some other issues with the design of my "post" page data (the page fields that query the whole bunch of "post" pages seem to grind PHP to a halt), but I'll create a separate topic for that and keep this one to the "duplicate" question.

Cheers,
Tom
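(For illustration, the page-creation step described above might look roughly like this. This is a minimal sketch, not Webrocker's actual script: the "post" template, the field names and the parent page id are taken from this thread, while the $legacy array standing in for one row of the old forum data and the user lookup field are purely hypothetical.)

// minimal sketch of creating one "post" page via the ProcessWire API;
// $legacy is a hypothetical row from the old forum's data
$post = new Page();
$post->template = 'post';
$post->parent = wire('pages')->get(1035); // the posts parent page (id appears later in this thread)
$post->title = $legacy['title'];
$post->body = $legacy['body'];
$post->legacy_id = $legacy['id']; // used later to find duplicates
// page field pointing to an already imported user; the selector field name is hypothetical,
// it assumes the imported users also carry their old forum id
$post->postauthor = wire('users')->get("legacyuser_id={$legacy['authorid']}");
$post->save();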
LostKobrakai Posted August 15, 2015

http://stackoverflow.com/a/688551

This should give you all duplicates in field_legacy_id.
Wanze Posted August 15, 2015

Hi Webrocker,

A direct MySQL query is the way to go here. The solution from LostKobrakai works; in the context of ProcessWire the query looks like this:

SELECT pages_id, COUNT(data) AS count
FROM field_legacy_id
GROUP BY data
HAVING count > 1

Now you can extract the page IDs from the result and delete those pages via the Pw API, since for each of them a duplicate post exists. Note that if the count is greater than 2 somewhere, you'd need to execute this query more than once.
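(A rough sketch of that second step, feeding the query result into the API, might look like this. It assumes the code runs in a ProcessWire context where $database and $pages are available, and that the field really is called "legacy_id" so its table is field_legacy_id; as Wanze notes, you would re-run it until the query returns no rows.)

// minimal sketch: fetch one page id per duplicated legacy_id and delete it via the API
$stmt = $database->query("
    SELECT pages_id, COUNT(data) AS count
    FROM field_legacy_id
    GROUP BY data
    HAVING count > 1
");
foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
    $page = $pages->get((int) $row['pages_id']);
    if ($page->id) {
        $page->delete(); // or $page->trash() to keep the pages recoverable
    }
}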
Webrocker Posted August 15, 2015

Hi Wanze & LostKobrakai, thank you for the instant replies!
Webrocker Posted August 16, 2015

After identifying the page IDs with the above, I'm now trying to actually get rid of them. The problem is that when I use the API, the PHP script timeout is triggered - there are now ~400,000 pages involved. So I'm trying to get this done directly in MySQL, but I ran into some performance issues there as well.

SELECT id
FROM pages p, field_as_post_legacy_id lid
WHERE p.id = lid.pages_id
AND p.parent_id = 1035
GROUP BY lid.data
HAVING COUNT(lid.data) > 1

This returns about 10,000 IDs of duplicate pages that can be kicked out of the pages table. But if I simply put it into a subquery like this:

DELETE FROM pages
WHERE id IN (
    SELECT id
    FROM pages p, field_as_post_legacy_id lid
    WHERE p.id = lid.pages_id
    AND p.parent_id = 1035
    GROUP BY lid.data
    HAVING COUNT(lid.data) > 1
);

it won't work, because MySQL doesn't allow the same table in the DELETE and in the SELECT FROM statement. As a workaround I duplicated the pages table, and now this:

DELETE FROM pages
WHERE id IN (
    SELECT id
    FROM pages_backup p, field_as_post_legacy_id lid
    WHERE p.id = lid.pages_id
    AND p.parent_id = 1035
    GROUP BY lid.data
    HAVING COUNT(lid.data) > 1
);

starts, but eventually this too times out. I also tried to put everything into one query, but the GROUP BY ... HAVING ... seems to throw an error when used in a DELETE statement: :-/

DELETE p
FROM pages p, field_as_post_legacy_id lid
WHERE p.id = lid.pages_id
AND p.parent_id = 1035
GROUP BY lid.data
HAVING COUNT(lid.data) > 1

As you can probably guess by now, crafting MySQL queries is not one of my greatest skills, so if anyone has a hint on how to get this to work without grinding the MySQL server to a halt ;-), I'd be very thankful.

Cheers,
Tom
Wanze Posted August 16, 2015

I'd still delete them with the ProcessWire API. That makes sure the related data in the "field_xx" tables, as well as any files, gets deleted properly. To solve the timeout issue, I see two options (a rough sketch combining both follows below):

1. Run the cleanup script from the command line by bootstrapping Pw - timeouts are usually no problem there, as Apache isn't involved.
2. Group your pages into chunks with the help of "start" and "limit" in your selector.

Or both together.

Cheers
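(For illustration, a command-line cleanup script along the lines of this advice might look roughly as follows. This is a minimal sketch, not the script actually used in this thread: it assumes the file sits in the site root next to ProcessWire's index.php, uses the field/table and parent id mentioned above, and assumes the lowest page id in each duplicate group is the one to keep; because the query returns one id per group, the script is simply re-run until it finds nothing.)

// cleanup.php - run from the shell: php cleanup.php
include __DIR__ . '/index.php'; // bootstrap ProcessWire

$database = wire('database');
$pages = wire('pages');

// one page id per duplicated legacy_id: the newest duplicate of each group
$stmt = $database->query("
    SELECT MAX(lid.pages_id) AS pages_id
    FROM field_as_post_legacy_id lid
    JOIN pages p ON p.id = lid.pages_id AND p.parent_id = 1035
    GROUP BY lid.data
    HAVING COUNT(lid.data) > 1
");
$ids = $stmt->fetchAll(PDO::FETCH_COLUMN);

// work through the ids in chunks so memory stays manageable
foreach (array_chunk($ids, 1000) as $chunk) {
    foreach ($chunk as $id) {
        $page = $pages->get((int) $id);
        if ($page->id) $page->trash(); // trash rather than delete, so the pages stay recoverable
    }
    $pages->uncacheAll(); // free cached pages between chunks
    echo "processed " . count($chunk) . " pages\n";
}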
Webrocker Posted August 16, 2015

Hi Wanze, thanks for confirming my own nebulous assumptions. At the moment the cleanup script is running in batches of 1,000 and trashes (rather than deletes) the pages via the API. Good idea to call the script from the command line - why didn't I think of that?